ouR data generation
https://www.rdatagen.net/
Recent content on ouR data generationHugo -- gohugo.iokeith.goldfeld@nyumc.org (Keith Goldfeld)keith.goldfeld@nyumc.org (Keith Goldfeld)Mon, 16 Oct 2017 00:00:00 +0000Can we use B-splines to generate non-linear data?
https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/
Mon, 16 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/<p>I’m exploring the idea of adding a function or set of functions to the <code>simstudy</code> package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit to use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated. It would also make it easy to explore spline methods, because as with any other simulated data set, we would know the underlying data generating process.</p>
<div id="b-splines" class="section level3">
<h3>B-splines</h3>
<p>A B-spline is a linear combination of a set of basis functions that are determined by the number and location of specified knots or cut-points, as well as the (polynomial) degree of curvature. A degree of one implies a set of straight lines, degree of two implies a quadratic curve, three a cubic curve, etc. This <a href="https://cran.r-project.org/web/packages/crs/vignettes/spline_primer.pdf">nice quick intro</a> provides much more insight into issues B-splines than I can provide here. Or if you want even more detail, check out this <a href="http://www.springer.com/us/book/9780387953663">book</a>. It is a very rich topic.</p>
<p>Within a cut-point region, the sum of the basis functions always equals 1. This is easy to see by looking at a plot of basis functions, several of which are provided below. The definition and shape of the basis functions do not in any way depend on the data, only on the degree and cut-points. Of course, these functions can be added together in infinitely different ways using weights. If one is trying to fit a B-spline line to data, those weights can be estimated using regression models.</p>
</div>
<div id="splines-in-r" class="section level2">
<h2>Splines in R</h2>
<p>The <code>bs</code> function in the <code>splines</code> package, returns values from these basis functions based on the specification of knots and degree of curvature. I wrote a wrapper function that uses the <code>bs</code> function to generate the basis function, and then I do a linear transformation of these functions by multiplying the vector parameter <em>theta</em>, which is just a vector of coefficients. The linear combination at each value of <span class="math inline">\(x\)</span> (the support of the basis functions) generates a value (which I call <span class="math inline">\(y.spline\)</span>) on the desired curve. The wrapper returns a list of objects, including a data.table that includes <span class="math inline">\(x\)</span> and <span class="math inline">\(y.spline\)</span>, as well as the basis functions, and knots.</p>
<pre class="r"><code>library(splines)
library(data.table)
library(ggplot2)
library(broom)
genSpline <- function(x, knots, degree, theta) {
basis <- bs(x = x, knots = knots, degree = degree,
Boundary.knots = c(0,1), intercept = TRUE)
y.spline <- basis %*% theta
dt <- data.table(x, y.spline = as.vector(y.spline))
return(list(dt = dt, basis = basis, knots = knots))
}</code></pre>
<p>I’ve also written two functions that make it easy to print the basis function and the spline curve. This will enable us to look at a variety of splines.</p>
<pre class="r"><code>plot.basis <- function(basisdata) {
dtbasis <- as.data.table(basisdata$basis)
dtbasis[, x := seq(0, 1, length.out = .N)]
dtmelt <- melt(data = dtbasis, id = "x",
variable.name = "basis", variable.factor = TRUE)
ggplot(data=dtmelt, aes(x=x, y=value, group = basis)) +
geom_line(aes(color=basis), size = 1) +
theme(legend.position = "none") +
scale_x_continuous(limits = c(0, 1),
breaks = c(0, basisdata$knots, 1)) +
theme(panel.grid.minor = element_blank())
}</code></pre>
<pre class="r"><code>plot.spline <- function(basisdata, points = FALSE) {
p <- ggplot(data = basisdata$dt)
if (points) p <- p + geom_point(aes(x=x, y = y), color = "grey75")
p <- p +
geom_line(aes(x = x, y = y.spline), color = "red", size = 1) +
scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(0, 1), breaks = knots) +
theme(panel.grid.minor = element_blank())
return(p)
}</code></pre>
<div id="linear-spline-with-quartile-cut-points" class="section level3">
<h3>Linear spline with quartile cut-points</h3>
<p>Here is a simple linear spline that has four regions defined by three cut-points, and the slope of the line in each region varies. The first value of <em>theta</em> is essentially the intercept. When you look at the basis plot, you will see that any single region has two “active” basis functions (represented by two colors), the other functions are all 0 in that region. The slope of the line in each is determined by the relevant values of theta. It is probably just easier to take a look:</p>
<pre class="r"><code>x <- seq(0, 1, length.out = 1000)
knots <- c(0.25, 0.5, 0.75)
theta = c(0.6, 0.1, 0.3, 0.2, 0.9)
sdata <- genSpline(x, knots, 1, theta)</code></pre>
<p>For this example, I am printing out the basis function for the first few values of <span class="math inline">\(x\)</span>.</p>
<pre class="r"><code>round( head(cbind(x = sdata$dt$x, sdata$basis)), 4 )</code></pre>
<pre><code>## x 1 2 3 4 5
## [1,] 0.000 1.000 0.000 0 0 0
## [2,] 0.001 0.996 0.004 0 0 0
## [3,] 0.002 0.992 0.008 0 0 0
## [4,] 0.003 0.988 0.012 0 0 0
## [5,] 0.004 0.984 0.016 0 0 0
## [6,] 0.005 0.980 0.020 0 0 0</code></pre>
<pre class="r"><code>plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-5-2.png" width="672" /></p>
</div>
<div id="same-knots-cut-points-but-different-theta-coefficients" class="section level3">
<h3>Same knots (cut-points) but different theta (coefficients)</h3>
<p>If use the same knot and degree specification, but change the vector <span class="math inline">\(theta\)</span>, we change the slope of the lines in each of the four regions:</p>
<pre class="r"><code>theta = c(0.2, 0.3, 0.8, 0.2, 0.1)
sdata <- genSpline(x, knots, 1, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-6-2.png" width="672" /></p>
</div>
<div id="quadratic-spline-with-quartile-cut-points" class="section level3">
<h3>Quadratic spline with quartile cut-points</h3>
<p>The basis functions get a little more elaborate with a quadratic spline. With this added degree, we get an additional basis function in each region, so you should see 3 colors instead of 2. The resulting spline is parabolic in each region, but with a different shape, each of which is determined by <em>theta</em>.</p>
<pre class="r"><code>knots <- c(0.25, 0.5, 0.75)
theta = c(0.6, 0.1, 0.5, 0.2, 0.8, 0.3)
sdata <- genSpline(x, knots, 2, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-7-2.png" width="672" /></p>
</div>
<div id="quadratic-spline-with-two-cut-points-three-regions" class="section level3">
<h3>Quadratic spline with two cut-points (three regions)</h3>
<pre class="r"><code>knots <- c(0.333, 0.666)
theta = c(0.2, 0.4, 0.1, 0.9, 0.6)
sdata <- genSpline(x, knots, 2, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-8-2.png" width="672" /></p>
</div>
<div id="cubic-spline-with-two-cut-points-three-regions" class="section level3">
<h3>Cubic spline with two cut-points (three regions)</h3>
<p>And in this last example, we generate basis functions for a cubic spline the differs in three regions. The added curvature is apparent:</p>
<pre class="r"><code>knots <- c(0.333, 0.666)
theta = c(0.2, 0.6, 0.1, 0.9, 0.2, 0.8)
sdata <- genSpline(x, knots, 3, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-9-2.png" width="672" /></p>
</div>
<div id="generating-new-data-from-the-underlying-spline" class="section level3">
<h3>Generating new data from the underlying spline</h3>
<p>It is a simple step to generate data from the spline. Each value on the line is treated as the mean, and “observed” data can be generated by adding variation. In this case, I use the normal distribution, but there is no reason other distributions can’t be used. I’m generating data based on the the parameters in the previous example. And this time, the spline plot includes the randomly generated data:</p>
<pre class="r"><code>set.seed(5)
x <- runif(250)
sdata <- genSpline(x, knots, 3, theta)
sdata$dt[, y := rnorm(.N, y.spline, 0.1)]
plot.spline(sdata, points = TRUE)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Now that we have generated new data, why don’t we go ahead and fit a model to see if we can recover the coefficients specified in <em>theta</em>? We are interested in the relationship of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, but the relationship is not linear and changes across <span class="math inline">\(x\)</span>. To estimate a model, we regress the outcome data <span class="math inline">\(y\)</span> on the values of the basis function that correspond to each value of <span class="math inline">\(x\)</span>:</p>
<pre class="r"><code>dxbasis <- as.data.table(sdata$basis)
setnames(dxbasis, paste0("x", names(dxbasis)))
dxbasis[, y := sdata$dt$y]
round(dxbasis, 3)</code></pre>
<pre><code>## x1 x2 x3 x4 x5 x6 y
## 1: 0.063 0.557 0.343 0.036 0.000 0.000 0.443
## 2: 0.000 0.000 0.140 0.565 0.295 0.000 0.542
## 3: 0.000 0.000 0.003 0.079 0.495 0.424 0.634
## 4: 0.003 0.370 0.523 0.104 0.000 0.000 0.232
## 5: 0.322 0.553 0.120 0.005 0.000 0.000 0.269
## ---
## 246: 0.000 0.023 0.442 0.494 0.041 0.000 0.520
## 247: 0.613 0.356 0.031 0.001 0.000 0.000 0.440
## 248: 0.246 0.584 0.161 0.009 0.000 0.000 0.236
## 249: 0.000 0.000 0.014 0.207 0.597 0.182 0.505
## 250: 0.002 0.344 0.539 0.115 0.000 0.000 0.313</code></pre>
<pre class="r"><code># fit the model - explicitly exclude intercept since x1 is intercept
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 - 1, data = dxbasis)
cbind(tidy(lmfit)[,1:3], true = theta)</code></pre>
<pre><code>## term estimate std.error true
## 1 x1 0.16465186 0.03619581 0.2
## 2 x2 0.57855125 0.03996219 0.6
## 3 x3 0.09093425 0.04267027 0.1
## 4 x4 0.94938718 0.04395370 0.9
## 5 x5 0.13579559 0.03805510 0.2
## 6 x6 0.85867619 0.03346704 0.8</code></pre>
<p>Using the parameter estimates (estimated here using OLS), we can get predicted values and plot them to see how well we did:</p>
<pre class="r"><code># get the predicted values so we can plot
dxbasis[ , y.pred := predict(object = lmfit)]
dxbasis[ , x := x]
# blue line represents predicted values
plot.spline(sdata, points = TRUE) +
geom_line(data=dxbasis, aes(x=x, y=y.pred), color = "blue", size = 1 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>The model did quite a good job, because we happened to assume the correct underlying assumptions of the spline. However, let’s say we suspected that the data were generated by a quadratic spline. We need to get the basis function assuming the same cut-points for the knots but now using a degree equal to two. Since a reduction in curvature reduces the number of basis functions by one, the linear model changes slightly. (Note that this model is not quite nested in the previous (cubic) model, because the values of the basis functions are different.)</p>
<pre class="r"><code>xdata <- genSpline(x, knots, 2, theta = rep(1,5))
dxbasis <- as.data.table(xdata$basis)
setnames(dxbasis, paste0("x", names(dxbasis)))
dxbasis[, y := sdata$dt$y]
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 - 1, data = dxbasis)
dxbasis[ , y.pred := predict(object = lmfit)]
dxbasis[ , x := x]
plot.spline(sdata, points = TRUE) +
geom_line(data=dxbasis, aes(x=x, y=y.pred),
color = "forestgreen", size = 1 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p>If we compare the two models in terms of model fit, the cubic model only does slightly better in term of <span class="math inline">\(R^2\)</span>: 0.96 vs. 0.94. In this case, it probably wouldn’t be so obvious which model to use.</p>
</div>
</div>
A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions
https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/
Thu, 05 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/<p>I just updated <code>simstudy</code> to version 0.1.5 (available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>) so that it now includes several new distributions - <em>exponential</em>, <em>discrete uniform</em>, and <em>negative binomial</em>.</p>
<p>As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete (integer) distribution of outcomes of non-negative values that is often used to describe count outcomes. It is characterized by a mean (or rate) and its variance equals its mean.</p>
<div id="added-variation" class="section level3">
<h3>Added variation</h3>
<p>In many situations, when count data are modeled, it turns out that the variance of the data exceeds the mean (a situation called <em>over-dispersion</em>). In this case an alternative model is used that allows for the greater variance, which is based on the negative binomial distribution. It turns out that if the negative binomial distribution has mean <span class="math inline">\(\mu\)</span>, it has a variance of <span class="math inline">\(\mu + \theta \mu^2\)</span>, where <span class="math inline">\(\theta\)</span> is called a <em>dispersion</em> parameter. If <span class="math inline">\(\theta = 0\)</span>, we have the Poisson distribution, but otherwise the variance of a negative binomial random variable will exceed the variance of a Poisson random variable as long as they share the same mean, because <span class="math inline">\(\mu > 0\)</span> and <span class="math inline">\(\theta \ge 0\)</span>.</p>
<p>We can see this by generating data from each distribution with mean 15, and a dispersion parameter of 0.2 for the negative binomial. We expect a variance around 15 for the Poisson distribution, and 60 for the negative binomial distribution.</p>
<pre class="r"><code>library(simstudy)
library(ggplot2)
# for a less cluttered look
theme_no_minor <- function(color = "grey90") {
theme(panel.grid.minor = element_blank(),
panel.background = element_rect(fill="grey95")
)
}
options(digits = 2)
# define data
defC <- defCondition(condition = "dist == 0", formula = 15,
dist = "poisson", link = "identity")
defC <- defCondition(defC, condition = "dist == 1", formula = 15,
variance = 0.2, dist = "negBinomial",
link = "identity")
# generate data
set.seed(50)
dt <- genData(500)
dt <- trtAssign(dt, 2, grpName = "dist")
dt <- addCondition(defC, dt, "y")
genFactor(dt, "dist", c("Poisson", "Negative binomial"))
# compare distributions
dt[, .(mean = mean(y), var = var(y)), keyby = fdist]</code></pre>
<pre><code>## fdist mean var
## 1: Poisson 15 15
## 2: Negative binomial 15 54</code></pre>
<pre class="r"><code>ggplot(data = dt, aes(x = y, group = fdist)) +
geom_density(aes(fill=fdist), alpha = .4) +
scale_fill_manual(values = c("#808000", "#000080")) +
scale_x_continuous(limits = c(0,60),
breaks = seq(0, 60, by = 20)) +
theme_no_minor() +
theme(legend.title = element_blank(),
legend.position = c(0.80, 0.83))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="underestimating-standard-errors" class="section level3">
<h3>Underestimating standard errors</h3>
<p>In the context of a regression, misspecifying a model as Poisson rather than negative binomial, can lead to an underestimation of standard errors, even though the point estimates may be quite reasonable (or may not). The Poisson model will force the variance estimate to be equal to the mean at any particular point on the regression curve. The Poisson model will effectively ignore the true extent of the variation, which can lead to problems of interpretation. We might conclude that there is an association when in fact there is none.</p>
<p>In this simple simulation, we generate two predictors (<span class="math inline">\(x\)</span> and <span class="math inline">\(b\)</span>) and an outcome (<span class="math inline">\(y\)</span>). The outcome is a function of <span class="math inline">\(x\)</span> only:</p>
<pre class="r"><code>library(broom)
library(MASS)
# Generating data from negative binomial dist
def <- defData(varname = "x", formula = 0, variance = 1,
dist = "normal")
def <- defData(def, varname = "b", formula = 0, variance = 1,
dist = "normal")
def <- defData(def, varname = "y", formula = "0.9 + 0.6*x",
variance = 0.3, dist = "negBinomial", link = "log")
set.seed(35)
dt <- genData(500, def)
ggplot(data = dt, aes(x=x, y = y)) +
geom_jitter(width = .1) +
ggtitle("Outcome as function of 1st predictor") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<pre class="r"><code>ggplot(data = dt, aes(x=b, y = y)) +
geom_jitter(width = 0) +
ggtitle("Outcome as function of 2nd predictor") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-2-2.png" width="672" /></p>
<p>I fit two models using both predictors. The first assumes (incorrectly) a Poisson distribution, and the second assumes (correctly) a negative binomial distribution. We can see that although the point estimates are quite close, the standard error estimates for the predictors in the Poisson model are considerably greater (about 50% higher) than the negative binomial model. And if we were basing any conclusion on the p-value (which is not always the obvious way to do <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/abandon.pdf">things</a>), we might make the wrong call since the p-value for the slope of <span class="math inline">\(b\)</span> is estimated to be 0.029. Under the correct model model, the p-value is 0.29.</p>
<pre class="r"><code>glmfit <- glm(y ~ x + b, data = dt, family = poisson (link = "log") )
tidy(glmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.956 0.030 32.3 1.1e-228
## 2 x 0.516 0.024 21.9 1.9e-106
## 3 b -0.052 0.024 -2.2 2.9e-02</code></pre>
<pre class="r"><code>nbfit <- glm.nb(y ~ x + b, data = dt)
tidy(nbfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.954 0.039 24.2 1.1e-129
## 2 x 0.519 0.037 14.2 7.9e-46
## 3 b -0.037 0.036 -1.1 2.9e-01</code></pre>
<p>A plot of the fitted regression curve and confidence bands of <span class="math inline">\(b\)</span> estimated by each model reinforces the difference. The lighter shaded region is the wider confidence band of the negative binomial model, and the darker shaded region the based on the Poisson model.</p>
<pre class="r"><code>newb <- data.table(b=seq(-3,3,length = 100), x = 0)
poispred <- predict(glmfit, newdata = newb, se.fit = TRUE,
type = "response")
nbpred <-predict(nbfit, newdata = newb, se.fit = TRUE,
type = "response")
poisdf <- data.table(b = newb$b, y = poispred$fit,
lwr = poispred$fit - 1.96*poispred$se.fit,
upr = poispred$fit + 1.96*poispred$se.fit)
nbdf <- data.table(b = newb$b, y = nbpred$fit,
lwr = nbpred$fit - 1.96*nbpred$se.fit,
upr = nbpred$fit + 1.96*nbpred$se.fit)
ggplot(data = poisdf, aes(x=b, y = y)) +
geom_line() +
geom_ribbon(data=nbdf, aes(ymin = lwr, ymax=upr), alpha = .3,
fill = "red") +
geom_ribbon(aes(ymin = lwr, ymax=upr), alpha = .5,
fill = "red") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>And finally, if we take 500 samples of size 500, and estimate slope for <span class="math inline">\(b\)</span> each time and calculate the standard deviation of those estimates, it is quite close to the standard error estimate we saw in the model of the original simulated data set using the negative binomial assumption (0.036). And the mean of those estimates is quite close to zero, the true value.</p>
<pre class="r"><code>result <- data.table()
for (i in 1:500) {
dt <- genData(500, def)
glmfit <- glm(y ~ x + b, data = dt, family = poisson)
nbfit <- glm.nb(y ~ x + b, data = dt)
result <- rbind(result, data.table(bPois = coef(glmfit)["b"],
bNB = coef(nbfit)["b"])
)
}
result[,.(sd(bPois), sd(bNB))] # observed standard error</code></pre>
<pre><code>## V1 V2
## 1: 0.037 0.036</code></pre>
<pre class="r"><code>result[,.(mean(bPois), mean(bNB))] # observed mean</code></pre>
<pre><code>## V1 V2
## 1: 0.0025 0.0033</code></pre>
</div>
<div id="negative-binomial-as-mixture-of-poissons" class="section level3">
<h3>Negative binomial as mixture of Poissons</h3>
<p>An interesting relationship between the two distributions is that a negative binomial distribution can be generated from a mixture of individuals whose outcomes come from a Poisson distribution, but each individual has her own rate or mean. Furthermore, those rates must have a specific distribution - a Gamma. (For much more on this, you can take a look <a href="https://probabilityandstats.wordpress.com/tag/poisson-gamma-mixture/">here</a>.) Here is a little simulation:</p>
<pre class="r"><code>mu = 15
disp = 0.2
# Gamma distributed means
def <- defData(varname = "gmu", formula = mu, variance = disp,
dist = "gamma")
# generate data from each distribution
defC <- defCondition(condition = "nb == 0", formula = "gmu",
dist = "poisson")
defC <- defCondition(defC, condition = "nb == 1", formula = mu,
variance = disp, dist = "negBinomial")
dt <- genData(5000, def)
dt <- trtAssign(dt, 2, grpName = "nb")
genFactor(dt, "nb", labels = c("Poisson-Gamma", "Negative binomial"))
dt <- addCondition(defC, dt, "y")
# means and variances should be very close
dt[, .(Mean = mean(y), Var = var(y)), keyby = fnb]</code></pre>
<pre><code>## fnb Mean Var
## 1: Poisson-Gamma 15 62
## 2: Negative binomial 15 57</code></pre>
<pre class="r"><code># plot
ggplot(data = dt, aes(x = y, group = fnb)) +
geom_density(aes(fill=fnb), alpha = .4) +
scale_fill_manual(values = c("#808000", "#000080")) +
scale_x_continuous(limits = c(0,60),
breaks = seq(0, 60, by = 20)) +
theme_no_minor() +
theme(legend.title = element_blank(),
legend.position = c(0.80, 0.83))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>```</p>
</div>
CACE closed: EM opens up exclusion restriction (among other things)
https://www.rdatagen.net/post/em-estimation-of-cace/
Thu, 28 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/em-estimation-of-cace/<p>This is the third, and probably last, of a series of posts touching on the estimation of <a href="https://www.rdatagen.net/post/cace-explored/">complier average causal effects</a> (CACE) and <a href="https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/">latent variable modeling techniques</a> using an expectation-maximization (EM) algorithm . What follows is a simplistic way to implement an EM algorithm in <code>R</code> to do principal strata estimation of CACE.</p>
<div id="the-em-algorithm" class="section level3">
<h3>The EM algorithm</h3>
<p>In this approach, we assume that individuals fall into one of three possible groups - <em>never-takers</em>, <em>always-takers</em>, and <em>compliers</em> - but we cannot see who is who (except in a couple of cases). For each group, we are interested in estimating the unobserved potential outcomes <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span> using observed outcome measures of <span class="math inline">\(Y\)</span>. The EM algorithm does this in two steps. The <em>E-step</em> estimates the missing class membership for each individual, and the <em>M-step</em> provides maximum likelihood estimates of the group-specific potential outcomes and variation.</p>
<p>An estimate group membership was presented in this <a href="https://projecteuclid.org/euclid.aos/1034276631">Imbens & Rubin 1997 paper</a>. The probability that an individual is a member of a particular group is a function of how close the individual’s observed outcome is to the mean of the group and the overall probability of group membership:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/table.png" />
</div>
<p>where <span class="math inline">\(Z\)</span> is treatment assignment and <span class="math inline">\(M\)</span> is treatment received. In addition, <span class="math inline">\(g_{c0}^i = \phi\left( \frac{Y_{obs,i} - \mu_{c0}}{\sigma_{c0}} \right)/\sigma_{c0}\)</span>, where <span class="math inline">\(\phi(.)\)</span> is the standard normal density. (And the same goes for the other <span class="math inline">\(g^i\)</span>’s.) <span class="math inline">\(\pi_a\)</span>, <span class="math inline">\(\pi_n\)</span>, and <span class="math inline">\(\pi_c\)</span> are estimated in the prior stage (or with starting values). <span class="math inline">\(\mu_{c0}\)</span>, <span class="math inline">\(\mu_{c1}\)</span>, <span class="math inline">\(\sigma_{c0}\)</span>, <span class="math inline">\(\sigma_{c1}\)</span>, etc. are also estimated in the prior <em>M-step</em> or with starting values in the case of the first <em>E-step</em>. Note that because we <em>are</em> assuming monotonicity (no <em>deniers</em> - which is not a necessary assumption for the EM approach, but used here to simplify things a bit), the probability of group membership is 1 for those randomized to control but who receive treatment (<em>always-takers</em>) and for those randomized to intervention but refuse (<em>never-takers</em>).</p>
</div>
<div id="em-steps" class="section level3">
<h3>EM steps</h3>
<p>I’ve created a separate function for each step in the algorithm. The <em>E-step</em> follows the Imbens & Rubin specification just described. The <em>M-step</em> just calculates the weighted averages and variances of the outcomes within each <span class="math inline">\(Z\)</span>/<span class="math inline">\(M\)</span> pair, with the weights coming from the probabilities estimated in the <em>E-step</em>. (These are, in fact, maximum likelihood estimates of the means and variances.) There are a pair of functions to estimate the log likelihood after each iteration. We stop iterating once the log likelihood has reached a stable state. And finally, there is a function to initialize the 15 parameters.</p>
<p>One thing to highlight here is that a strong motivation for using the EM algorithm is that we do <em>not</em> need to assume the exclusion restriction. That is, it is possible that randomizing someone to the intervention may have an effect on the outcome even if there is no effect on whether or not the intervention is used. Or in other words, we are saying it is possible that randomization has an effect on <em>always-takers</em> and <em>never-takers</em>, an assumption we <em>cannot</em> make using an instrumental variable (IV) approach. I mention that here, because the <em>M-step</em> function as written here explicitly drops the exclusion restriction assumption. However, I will first illustrate the model estimates in a case where data are indeed based on that assumption; while my point is to show that the EM estimates are unbiased as are the IV estimates in this scenario, I may actually be introducing a small amount of bias into the EM estimate by not re-writing the function to create a single mean for <em>always-takers</em> and <em>never-takers</em>. But, for brevity’s sake, this seems adequate.</p>
<pre class="r"><code>estep <- function(params, y, z, m) {
piC <- 0
piN <- 0
piA <- 0
if (z == 0 & m == 0) {
gC0 <- dnorm((y - params$mC0)/params$sC0) / params$sC0
gN0 <- dnorm((y - params$mN0)/params$sN0) / params$sN0
piC <- params$pC * gC0 / ( params$pC * gC0 + params$pN * gN0)
piN <- 1- piC
}
if (z == 0 & m == 1) {
piA <- 1
}
if (z == 1 & m == 0) {
piN <- 1
}
if (z == 1 & m == 1) {
gC1 <- dnorm((y - params$mC1)/params$sC1) / params$sC1
gA1 <- dnorm((y - params$mA1)/params$sA1) / params$sA1
piC <- params$pC * gC1 / ( params$pC * gC1 + params$pA * gA1)
piA <- 1 - piC
}
return(list(piC = piC, piN = piN, piA = piA))
}
library(Weighted.Desc.Stat)
mstep <- function(params, dx) {
params$mN0 <- dx[z == 0 & m == 0, w.mean(y, piN)] # never-taker
params$sN0 <- dx[z == 0 & m == 0, sqrt(w.var(y, piN))] # never-taker
params$mN1 <- dx[z == 1 & m == 0, w.mean(y, piN)] # never-taker
params$sN1 <- dx[z == 1 & m == 0, sqrt(w.var(y, piN))] # never-taker
params$mA0 <- dx[z == 0 & m == 1, w.mean(y, piA)]# always-taker
params$sA0 <- dx[z == 0 & m == 1, sqrt(w.var(y, piA))] # always-taker
params$mA1 <- dx[z == 1 & m == 1, w.mean(y, piA)]# always-taker
params$sA1 <- dx[z == 1 & m == 1, sqrt(w.var(y, piA))] # always-taker
params$mC0 <- dx[z == 0 & m == 0, w.mean(y, piC)] # complier, z=0
params$sC0 <- dx[z == 0 & m == 0, sqrt(w.var(y, piC))] # complier, z=0
params$mC1 <- dx[z == 1 & m == 1, w.mean(y, piC)] # complier, z=1
params$sC1 <- dx[z == 1 & m == 1, sqrt(w.var(y, piC))] # complier, z=1
nC <- dx[, sum(piC)]
nN <- dx[, sum(piN)]
nA <- dx[, sum(piA)]
params$pC <- (nC / sum(nC, nN, nA))
params$pN <- (nN / sum(nC, nN, nA))
params$pA <- (nA / sum(nC, nN, nA))
return(params)
}
like.i <- function(params, y, z, m) {
if (z == 0 & m == 0) {
l <- params$pC * dnorm(x = y, mean = params$mC0, sd = params$sC0) +
params$pN * dnorm(x = y, mean = params$mN0, sd = params$sN0)
}
if (z == 0 & m == 1) {
l <- params$pA * dnorm(x = y, mean = params$mA0, sd = params$sA0)
}
if (z == 1 & m == 0) {
l <- params$pN * dnorm(x = y, mean = params$mN1, sd = params$sN1)
}
if (z == 1 & m == 1) {
l <- params$pC * dnorm(x = y, mean = params$mC1, sd = params$sC1) +
params$pA * dnorm(x = y, mean = params$mA1, sd = params$sA1)
}
return(l)
}
loglike <- function(dt, params){
dl <- dt[, .(l.i = like.i(params, y, z, m)), keyby = id]
return(dl[, sum(log(l.i))])
}
initparams <- function() {
params = list(pC = 1/3, pN = 1/3, pA = 1/3,
mC0 = rnorm(1,0,.1), sC0 = 0.2,
mC1 = rnorm(1,0,.1), sC1 = 0.2,
mN0 = rnorm(1,0,.1), sN0 = 0.2,
mN1 = rnorm(1,0,.1), sN1 = 0.2,
mA0 = rnorm(1,0,.1), sA0 = 0.2,
mA1 = rnorm(1,0,.1), sA1 = 0.2)
return(params)
}</code></pre>
</div>
<div id="data-defintions" class="section level3">
<h3>Data defintions</h3>
<p>These next set of statements define the data that will be generated. I define the distribution of group assignment as well as potential outcomes for the intervention and the outcome <span class="math inline">\(Y\)</span>. We also define how the observed data will be generated, which is a function of treatment randomization …</p>
<pre class="r"><code>library(simstudy)
### Define data distributions
# Status :
# 1 = A(lways taker)
# 2 = N(ever taker)
# 3 = C(omplier)
def <- defDataAdd(varname = "Status",
formula = "0.25; 0.40; 0.35", dist = "categorical")
# potential outcomes (PO) for intervention depends on group status
def <- defDataAdd(def, varname = "M0",
formula = "(Status == 1) * 1", dist = "nonrandom")
def <- defDataAdd(def, varname = "M1",
formula = "(Status != 2) * 1", dist = "nonrandom")
# observed intervention status based on randomization and PO
def <- defDataAdd(def, varname = "m",
formula = "(z==0) * M0 + (z==1) * M1",
dist = "nonrandom")
# potential outcome for Y (depends group status - A, N, or C)
# under assumption of exclusion restriction
defY0 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .25, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 3",
formula = 0.1, variance = .16, dist = "normal")
defY1 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .25, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 3",
formula = 0.9, variance = .49, dist = "normal")
# observed outcome function of actual treatment
defy <- defDataAdd(varname = "y",
formula = "(z == 0) * Y0 + (z == 1) * Y1",
dist = "nonrandom")</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I am generating multiple data sets and estimating the causal effects for each using the EM and IV approaches. This gives better picture of the bias and variation under the two different scenarios (exclusion restriction & no exclusion restriction) and different methods (EM & IV). To simplify the code a bit, I’ve written a function to consolidate the data generating process:</p>
<pre class="r"><code>createDT <- function(n, def, defY0, defY1, defy) {
dt <- genData(n)
dt <- trtAssign(dt, n=2, grpName = "z")
dt <- addColumns(def, dt)
genFactor(dt, "Status",
labels = c("Always-taker","Never-taker", "Complier"),
prefix = "A")
dt <- addCondition(defY0, dt, "Y0")
dt <- addCondition(defY1, dt, "Y1")
dt <- addColumns(defy, dt)
}
set.seed(16)
dt <- createDT(2500, def, defY0, defY1, defy)
options(digits = 3)
dt</code></pre>
<pre><code>## id Y1 Y0 z Status M0 M1 m AStatus y
## 1: 1 0.12143 -0.400007 0 2 0 0 0 Never-taker -0.4000
## 2: 2 0.13114 0.713202 1 1 1 1 1 Always-taker 0.1311
## 3: 3 0.73766 -0.212530 1 3 0 1 1 Complier 0.7377
## 4: 4 -0.07531 0.209330 1 1 1 1 1 Always-taker -0.0753
## 5: 5 -0.25214 -0.696207 0 2 0 0 0 Never-taker -0.6962
## ---
## 2496: 2496 -0.00882 0.206581 0 2 0 0 0 Never-taker 0.2066
## 2497: 2497 0.39226 0.749465 1 2 0 0 0 Never-taker 0.3923
## 2498: 2498 -0.81486 0.000605 1 2 0 0 0 Never-taker -0.8149
## 2499: 2499 0.10359 -0.417344 0 2 0 0 0 Never-taker -0.4173
## 2500: 2500 -0.68397 0.304398 1 2 0 0 0 Never-taker -0.6840</code></pre>
</div>
<div id="cace-estimation" class="section level3">
<h3>CACE estimation</h3>
<p>Finally, we are ready to put all of this together and estimate the CACE using the EM algorithm. After initializing the parameters (here we just use random values except for the probabilities of group membership, which we assume to be 1/3 to start), we loop through the E and M steps, checking the change in log likelihood each time. For this single data set, we provide a point estimate of the CACE using EM and IV. (We could provide an estimate of standard error using a bootstrap approach.) We see that both do a reasonable job, getting fairly close to the truth.</p>
<pre class="r"><code>params <- initparams()
prev.loglike <- -Inf
continue <- TRUE
while (continue) {
dtPIs <- dt[, estep(params, y, z, m), keyby = id]
dx <- dt[dtPIs]
params <- mstep(params, dx)
EM.CACE <- params$mC1 - params$mC0
current.loglike <- loglike(dt, params)
diff <- current.loglike - prev.loglike
prev.loglike <- current.loglike
if ( diff < 1.00e-07 ) continue = FALSE
}
library(ivpack)
ivmodel <- ivreg(formula = y ~ m | z, data = dt, x = TRUE)
data.table(truthC = dt[AStatus == "Complier", mean(Y1 - Y0)],
IV.CACE = coef(ivmodel)[2],
EM.CACE)</code></pre>
<pre><code>## truthC IV.CACE EM.CACE
## 1: 0.806 0.823 0.861</code></pre>
</div>
<div id="more-general-performance" class="section level3">
<h3>More general performance</h3>
<p>I am not providing the code here (it is just a slight modification of what has come before), but I want to show the results of generating 1000 data sets of 500 observations in each. The first plot assumes all data sets were generated using an exclusion restriction - just as we did with the single data set. The IV approach, as expected is unbiased (estimated bias 0.01), while the EM approach is slightly biased (-0.13). We can also see that the EM approach (standard deviation 0.30) has more variation than IV (standard deviation 0.15), while the actual sample CACE (calculated based on the actual group membership and potential outcomes) had a standard deviation of 0.05, which we can see from the narrow vertical band in the plot:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/Exclusion_restriction.png" />
</div>
<p>In the second set of simulations, I change the potential outcomes definition so that the exclusion restriction is no longer relevant.</p>
<pre class="r"><code>defY0 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .20, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 3",
formula = 0.1, variance = .16, dist = "normal")
defY1 <- defCondition(condition = "Status == 1",
formula = 0.7, variance = .25, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 2",
formula = 0.2, variance = .40, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 3",
formula = 0.9, variance = .49, dist = "normal")</code></pre>
<p>In this second case, the IV estimate is biased (0.53), while the EM estimated does quite well (-.03). (I suspect EM did worse in the first example above, because estimates were made without the assumption of the exclusion restriction, even though that was the case.) However, EM estimates still have more variation than IV: standard deviation 0.26 vs 0.17, consistent with the estimates under the exclusion restriction assumption. This variation arises from the fact that we don’t know what the true group membership is, and we need to estimate it. Here is what the estimates look like:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/No_exclusion_restriction.png" />
</div>
</div>
<div id="can-we-expand-on-this" class="section level3">
<h3>Can we expand on this?</h3>
<p>The whole point of this was to illustrate that there might be a way around some rather restrictive assumptions, which in some cases might not seem so reasonable. EM methods provide an alternative way to approach things - more of which you can see in the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">free online course</a> that inspired these last few posts. Unfortunately, there is no obvious way to tackle these problems in <code>R</code> using existing packages, and I am not suggesting that what I have done here is the best way to go about it. The course suggests using <code>Mplus</code>. While that is certainly a great software package, maybe it would be worthwhile to build an R package to implement these methods more completely in R? Or maybe someone has already done this, and I just haven’t come across it yet?</p>
</div>
A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm
https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/
Wed, 20 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/<p>I was just going to make a quick announcement to let folks know that I’ve updated the <code>simstudy</code> package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package <code>flexmix</code>, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.</p>
<div id="defining-data-conditionally" class="section level3">
<h3>Defining data conditionally</h3>
<p>Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But, what if I wanted to specify completely different families of distributions or very different regression curves based on different individual characteristics? With the previous version of simstudy, it was not really easy to do. Now, with the addition of two key functions, <code>defCondition</code> and <code>addCondition</code> the process is much improved. <code>defCondition</code> is analogous to the function <code>defData</code>, in that this new function provides an easy way to specify conditional definitions (as does <code>defReadCond</code>, which is analogous to <code>defRead</code>). <code>addCondition</code> is used to actually add the data column, just as <code>addColumns</code> adds columns.</p>
<p>It is probably easiest to see in action:</p>
<pre class="r"><code>library(simstudy)
# Define baseline data set
def <- defData(varname="x", dist="normal", formula=0, variance=9)
def <- defData(def, varname = "group", formula = "0.2;0.5;0.3",
dist = "categorical")
# Generate data
set.seed(111)
dt <- genData(1000, def)
# Convert group to factor - new function
dt <- genFactor(dt, "group", replace = TRUE)
dt</code></pre>
<p><code>defCondition</code> is the same as <code>defData</code>, except that instead of specifying a variable name, we need to specify a condition that is based on a pre-defined field:</p>
<pre class="r"><code>defC <- defCondition(condition = "fgroup == 1", formula = "5 + 2*x",
variance = 4, dist = "normal")
defC <- defCondition(defC, condition = "fgroup == 2", formula = 4,
variance = 3, dist="normal")
defC <- defCondition(defC, condition = "fgroup == 3", formula = "3 - 2*x",
variance = 2, dist="normal")
defC</code></pre>
<pre><code>## condition formula variance dist link
## 1: fgroup == 1 5 + 2*x 4 normal identity
## 2: fgroup == 2 4 3 normal identity
## 3: fgroup == 3 3 - 2*x 2 normal identity</code></pre>
<p>A subsequent call to <code>addCondition</code> generates a data table with the new variable, in this case <span class="math inline">\(y\)</span>:</p>
<pre class="r"><code>dt <- addCondition(defC, dt, "y")
dt</code></pre>
<pre><code>## id y x fgroup
## 1: 1 5.3036869 0.7056621 2
## 2: 2 2.1521853 -0.9922076 2
## 3: 3 4.7422359 -0.9348715 3
## 4: 4 16.1814232 -6.9070370 3
## 5: 5 4.3958893 -0.5126281 3
## ---
## 996: 996 -0.8115245 -2.7092396 1
## 997: 997 1.9946074 0.7126094 2
## 998: 998 11.8384871 2.3895135 1
## 999: 999 3.3569664 0.8123200 1
## 1000: 1000 3.4662074 -0.4653198 3</code></pre>
<p>In this example, I’ve partitioned the data into three subsets, each of which has a very different linear relationship between variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, and different variation. In this particular case, all relationships are linear with normally distributed noise, but this is absolutely not required.</p>
<p>Here is what the data look like:</p>
<pre class="r"><code>library(ggplot2)
mycolors <- c("#555bd4","#d4555b","#d4ce55")
ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
geom_point(aes(color = fgroup), size = 1, alpha = .4) +
geom_smooth(aes(color = fgroup), se = FALSE, method = "lm") +
scale_color_manual(name = "Cluster", values = mycolors) +
scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "grey96", color = "grey80"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-4-1.png" width="576" /></p>
</div>
<div id="latent-class-regression-models" class="section level3">
<h3>Latent class regression models</h3>
<p>Suppose we come across the same data set, but are not privy to the group classification, and we are still interested in the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. This is what the data set would look like - not as user-friendly:</p>
<pre class="r"><code>rawp <- ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
geom_point(color = "grey75", size = .5) +
scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "grey96", color = "grey80"))
rawp</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-5-1.png" width="504" /></p>
<p>We might see from the plot, or we might have some subject-matter knowledge that suggests there are are several sub-clusters within the data, each of which appears to have a different relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. (Obviously, we know this is the case, since we generated the data.) The question is, how can we estimate the regression lines if we don’t know the class membership? That is where the EM algorithm comes into play.</p>
</div>
<div id="the-em-algorithm-very-very-briefly" class="section level3">
<h3>The EM algorithm, very, very briefly</h3>
<p>The EM algorithm handles model parameter estimation in the context of incomplete or missing data. In the example I’ve been discussing here, the subgroups or cluster membership are the missing data. There is an extensive literature on EM methods (starting with <a href="http://www.jstor.org/stable/2984875">this article</a> by Dempster, Laird & Rubin), and I am barely even touching the surface, let alone scratching it.</p>
<p>The missing data (cluster memberships) are estimated in the <em>Expectation-</em> or <em>E-step</em>. These are replaced with their expected values as given by the posterior probabilities. The mixture model assumes that each observation is exactly from one cluster, but this information has not been observed. The unknown model parameters (intercept, slope, and variance) for each of the clusters is estimated in the <em>Maximization-</em> or <em>M-step</em>, which in this case assumes the data come from a linear process with normally distributed noise - both the linear coefficients and variation around the line are conditional on cluster membership. The process is iterative. First, the <em>E-step</em>, which is based on some starting model parameters at first and then updated with the most recent parameter estimates from the prior <em>M-step</em>. Second, the <em>M-step</em> is based on estimates of the maximum likelihood of all the data (including the ‘missing’ data estimated in the prior <em>E-step</em>). We iterate back and forth until the parameter estimates in the <em>M-step</em> reach a steady state, or the overal likelihood estimate becomes stable.</p>
<p>The strength or usefulness of the EM method is that the likelihood of the full data (both observed data - <span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s - and unobserved data - cluster probabilities) is much easier to write down and estimate than the likelihood of the observed data only (<span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s). Think of the first plot above with the structure given by the colors compared to the second plot in grey without the structure. The first seems so much more manageable than the second - if only we knew the underlying structure defined by the clusters. The EM algorithm builds the underlying structure so that the maximum likelihood estimation problem becomes much easier.</p>
<p>Here is a little more detail on what the EM algorithm is estimating in our application. (See <a href="https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf">this</a> for the much more detail.) First, we estimate the probability of membership in cluster <span class="math inline">\(j\)</span> for our linear regression model with three clusters:</p>
<p><span class="math display">\[P_i(j|x_i, y_i, \mathbf{\pi}, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma}) = p_{ij}= \frac{\pi_jf(y_i|x_i, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma})}{\sum_{k=1}^3 \pi_k f(y_i|x_i, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma})},\]</span> where <span class="math inline">\(\mathbf{\alpha_0}\)</span>, <span class="math inline">\(\mathbf{\alpha_1}\)</span>, and <span class="math inline">\(\mathbf{\sigma}\)</span> are the vectors of intercepts, slopes, and standard deviations for the three clusters. <span class="math inline">\(\pi\)</span> is the vector of probabilities that any individual is in the respective clusters, and each <span class="math inline">\(\pi_j\)</span> is estimated by averaging the <span class="math inline">\(p_{ij}\)</span>’s across all individuals. Finally, <span class="math inline">\(f(.|.)\)</span> is the density from the normal distribution <span class="math inline">\(N(\alpha_{j0} + \alpha_{j1}x, \sigma_j^2)\)</span>, with cluster-specific parameters.</p>
<p>Second, we maximize each of the three cluster-specific log-likelihoods, where each individual is weighted by its probability of cluster membership (which is <span class="math inline">\(P_i(j)\)</span>, estimated in the <em>E-step</em>). In particular, we are maximizing the cluster-specific likelihood with respect to the three unknown parameters <span class="math inline">\(\alpha_{j0}\)</span>, <span class="math inline">\(\alpha_{j1}\)</span>, and <span class="math inline">\(\sigma_j\)</span>:</p>
<p><span class="math display">\[\sum_{n=1}^N \hat{p}_{nk} \text{log} (f(y_n|x_n,\alpha_{j0},\alpha_{j1},\sigma_j)\]</span> In <code>R</code>, the <code>flexmix</code> package has implemented an EM algorithm to estimate latent class regression models. The package documentation provides a really nice, accessible <a href="https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf">description</a> of the two-step procedure, with much more detail than I have provided here. I encourage you to check it out.</p>
</div>
<div id="iterating-slowly-through-the-em-algorithm" class="section level3">
<h3>Iterating slowly through the EM algorithm</h3>
<p>Here is a slow-motion version of the EM estimation process. I show the parameter estimates (visually) at the early stages of estimation, checking in after every three steps. In addition, I highlight two individuals and show the estimated probabilities of cluster membership. At the beginning, there is little differentiation between the regression lines for each cluster. However, by the 10th iteration the parameter estimates for the regression lines are looking pretty similar to the original plot.</p>
<pre class="r"><code>library(flexmix)
selectIDs <- c(508, 775) # select two individuals
ps <- list()
count <- 0
p.ij <- data.table() # keep track of estimated probs
pi.j <- data.table() # keep track of average probs
for (i in seq(1,10, by=3)) {
count <- count + 1
set.seed(5)
# fit model up to "i" iterations - either 1, 4, 7, or 10
exMax <- flexmix(y ~ x,
data = dt, k = 3,
control = list(iter.max = i)
)
p.ij <- rbind(p.ij,
data.table(i, selectIDs, posterior(exMax)[selectIDs,]))
pi.j <- rbind(pi.j,
data.table(i, t(apply(posterior(exMax), 2, mean))))
dp <- as.data.table(t(parameters(exMax)))
setnames(dp, c("int","slope", "sigma"))
# flexmix rearranges columns/clusters
dp[, grp := c(3, 1, 2)]
setkey(dp, grp)
# create plot for each iteration
ps[[count]] <- rawp +
geom_abline(data = dp, aes(intercept = int, slope = slope,
color=factor(grp)), size = 1) +
geom_point(data = dt[id %in% selectIDs], color = "black") +
scale_color_manual(values = mycolors) +
ggtitle(paste("Iteration", i)) +
theme(legend.position = "none",
plot.title = element_text(size = 9))
}</code></pre>
<pre class="r"><code>library(gridExtra)
grid.arrange(ps[[1]], ps[[2]], ps[[3]], ps[[4]], nrow = 1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-7-1.png" width="864" style="display: block; margin: auto;" /></p>
<p>For the two individuals, we can see the probabilities converging to a level of certainty/uncertainty. The individual with ID #775 lies right on the regression line for cluster 3, far from the other lines, and the algorithm quickly assigns a probability of 100% to cluster 3 (its actual cluster). The cluster assignment is less certain for ID #508, which lies between the two regression lines for clusters 1 and 2.</p>
<pre class="r"><code># actual cluster membership
dt[id %in% selectIDs, .(id, fgroup)]</code></pre>
<pre><code>## id fgroup
## 1: 508 2
## 2: 775 3</code></pre>
<pre class="r"><code>setkey(p.ij, selectIDs, i)
p.ij[, .(selectIDs, i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))]</code></pre>
<pre><code>## selectIDs i C1 C2 C3
## 1: 508 1 0.32 0.36 0.32
## 2: 508 4 0.29 0.44 0.27
## 3: 508 7 0.25 0.65 0.10
## 4: 508 10 0.24 0.76 0.00
## 5: 775 1 0.35 0.28 0.37
## 6: 775 4 0.33 0.14 0.53
## 7: 775 7 0.11 0.01 0.88
## 8: 775 10 0.00 0.00 1.00</code></pre>
<p>In addition, we can see how the estimate of overall group membership (for all individuals) changes through the iterations. The algorithm starts by assigning equal probability to each cluster (1/3) and slowly moves towards the actual distribution used to generate the data (20%, 50%, and 30%).</p>
<pre class="r"><code>pi.j[, .(i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))]</code></pre>
<pre><code>## i C1 C2 C3
## 1: 1 0.33 0.34 0.33
## 2: 4 0.31 0.34 0.35
## 3: 7 0.25 0.39 0.36
## 4: 10 0.23 0.44 0.33</code></pre>
</div>
<div id="final-estimation-of-linear-models" class="section level3">
<h3>Final estimation of linear models</h3>
<p>The final estimation is shown below, and we can see that the parameters have largely converged to the values used to generate the data.</p>
<pre class="r"><code># Estimation until convergence
set.seed(5)
ex1 <- flexmix(y ~ x, data = dt, k = 3)
# paramter estimates
data.table(parameters(ex1))[, .(param = c("int", "slope", "sd"),
C1 = round(Comp.2, 2),
C2 = round(Comp.3, 2),
C3 = round(Comp.1, 2))]</code></pre>
<pre><code>## param C1 C2 C3
## 1: int 5.18 3.94 3.00
## 2: slope 1.97 -0.03 -1.99
## 3: sd 2.07 1.83 1.55</code></pre>
<pre class="r"><code># estimates of cluster probabilities
round(apply(posterior(ex1), 2, mean), 2)[c(2,3,1)]</code></pre>
<pre><code>## [1] 0.19 0.51 0.30</code></pre>
<pre class="r"><code># estimates of individual probabilities
data.table(posterior(exMax)[selectIDs,])[,.(selectIDs,
C1 = round(V2, 2),
C2 = round(V3, 2),
C3 = round(V1, 2))]</code></pre>
<pre><code>## selectIDs C1 C2 C3
## 1: 508 0.24 0.76 0
## 2: 775 0.00 0.00 1</code></pre>
</div>
<div id="how-do-we-know-the-relationship-is-linear" class="section level3">
<h3>How do we know the relationship is linear?</h3>
<p>In reality, there is no reason to assume that the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is simply linear. We might want to look at other possibilities, such as a quadratic relationship. So, we use flexmix to estimate an expanded model, and then we plot the fitted lines on the original data:</p>
<pre class="r"><code>ex2 <- flexmix(y ~ x + I(x^2), data = dt, k = 3)
dp <- as.data.table(t(parameters(ex2)))
setnames(dp, c("int","slope", "slope2", "sigma"))
dp[, grp := c(1,2,3)]
x <- c(seq(-10,10, by =.1))
dp1 <- data.table(grp = 1, x, dp[1, int + slope*x + slope2*(x^2)])
dp2 <- data.table(grp = 2, x, dp[2, int + slope*x + slope2*(x^2)])
dp3 <- data.table(grp = 3, x, dp[3, int + slope*x + slope2*(x^2)])
dp <- rbind(dp1, dp2, dp3)
rawp +
geom_line(data=dp, aes(x=x, y=V3, group = grp, color = factor(grp)),
size = 1) +
scale_color_manual(values = mycolors) +
theme(legend.position = "none")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p>And even though the parameter estimates appear to be reasonable, we would want to compare the simple linear model with the quadratic model, which we can use with something like the BIC. We see that the linear model is a better fit (lower BIC value) - not surprising since this is how we generated the data.</p>
<pre class="r"><code>summary(refit(ex2))</code></pre>
<pre><code>## $Comp.1
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.440736 0.309576 4.6539 3.257e-06 ***
## x -0.405118 0.048808 -8.3003 < 2.2e-16 ***
## I(x^2) -0.246075 0.012162 -20.2337 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.2
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.955542 0.289914 23.9918 < 2.2e-16 ***
## x 0.305995 0.049584 6.1712 6.777e-10 ***
## I(x^2) 0.263160 0.014150 18.5983 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.3
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.9061090 0.1489738 26.2201 < 2e-16 ***
## x -0.0681887 0.0277366 -2.4584 0.01395 *
## I(x^2) 0.0113305 0.0060884 1.8610 0.06274 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<pre class="r"><code># Comparison of the two models
BIC(ex1)</code></pre>
<pre><code>## [1] 5187.862</code></pre>
<pre class="r"><code>BIC(ex2)</code></pre>
<pre><code>## [1] 5316.034</code></pre>
</div>
Complier average causal effect? Exploring what we learn from an RCT with participants who don't do what they are told
https://www.rdatagen.net/post/cace-explored/
Tue, 12 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/cace-explored/<p>Inspired by a free online <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a> titled <em>Complier Average Causal Effects (CACE) Analysis</em> and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming. And I end by describing two simple methods to estimate the CACE, which we can compare to the truth (since this is a simulation); next time, I will describe a third way.</p>
<div id="non-compliance-in-randomized-trials" class="section level3">
<h3>Non-compliance in randomized trials</h3>
<p>Here’s the problem. In a randomized trial, investigators control the randomization process; they determine if an individual is assigned to the treatment group or control group (I am talking about randomized trials here, but many of these issues can apply in the context of observed or quasi-experimental settings, but require more data and assumptions). However, those investigators may not have as much control over the actual treatments that study participants receive. For example, an individual randomized to some type of behavioral intervention may opt not to take advantage of the intervention. Likewise, someone assigned to control may, under some circumstances, figure out a way to get services that are quite similar to the intervention. In all cases, the investigator is able to collect outcome data on all of these patients, regardless of whether or not they followed directions. (This is different from drop-out or loss-to-followup, where outcome data may be missing.)</p>
</div>
<div id="cace" class="section level3">
<h3>CACE</h3>
<p>Typically, studies analyze data based on treatment <em>assignment</em> rather than treatment <em>received</em>. This focus on assignment is called an intention-to-treat (ITT) analysis. In a policy environment, the ITT may make a lot of sense; we are answering this specific question: “What is the overall effect in the real world where the intervention is made available yet some people take advantage of it while others do not?” Alternatively, researchers may be interested in different question: “What is the causal effect of actually receiving the treatment?”</p>
<p>Now, to answer the second question, there are numerous subtle issues that you need to wrestle with (again, go take the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a>). But, long story short, we need to (1) identify the folks in the <em>intervention</em> group who actually do what they have been encouraged to do (receive the intervention) but only because they were encouraged, and not because they would have received the intervention anyways had they not been randomized, and compare their outcomes with (2) the folks in the control group who did not seek out the intervention on their own initiative but would have received the intervention had they been encouraged. These two groups are considered to be <em>compliers</em> - they would always do what they are told in the context of the study. And the effect of the intervention that is based on outcomes from this type of patient is called the <em>complier average causal effect</em> (CACE).</p>
<p>The biggest challenge in estimating the CACE is that we cannot actually identify if people are compliers or not. Some of those receiving the treatment in the intervention group are <em>compliers</em>, but the rest are <em>always-takers</em>. Some of those not receiving the treatment in the control arm are also <em>compliers</em>, but the others are <em>never-takers</em>. There are several methods available to overcome this challenge, two of which I will briefly mention here: method of moments and instrumental variables.</p>
</div>
<div id="using-potential-outcomes-to-define-cace" class="section level3">
<h3>Using potential outcomes to define CACE</h3>
<p>In an earlier <a href="https://www.rdatagen.net/post/be-careful/">post</a>, I briefly introduced the idea of potential outcomes. Since we are talking about causal relationships, they are useful here. If <span class="math inline">\(Z\)</span> is the randomization indicator, <span class="math inline">\(Z=1\)</span> for those randomized to the intervention, <span class="math inline">\(Z=0\)</span> for those in control. <span class="math inline">\(M\)</span> is the indicator of whether or not the individual received the intervention. Since <span class="math inline">\(M\)</span> is an outcome, we can imagine the potential outcomes <span class="math inline">\(M_{0i}\)</span> and <span class="math inline">\(M_{1i}\)</span>, or what the value of <span class="math inline">\(M_i\)</span> would be for an individual if <span class="math inline">\(Z_i=0\)</span> or <span class="math inline">\(Z_i=1\)</span>, respectively. And let us say <span class="math inline">\(Y\)</span> is the outcome, so we have potential outcomes that can be written as <span class="math inline">\(Y_{0,M_0}\)</span> and <span class="math inline">\(Y_{1,M_1}\)</span>. Think about that for a bit.</p>
<p>Using these potential outcomes, we can define the compliers and the CACE. Compliers are people for whom <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span>. (Never-takers look like this: <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 0\)</span>. Always-takers: <span class="math inline">\(M_0 = 1\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span>). Now, the average causal effect is the average difference between potential outcomes. In this case, the CACE is <span class="math inline">\(E[Y_{1,M_1} - Y_{0,M_0}|M_0 = 0 \ \& \ M_1 = 1]\)</span>. The patients for whom <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span> are the compliers.</p>
</div>
<div id="simulating-data" class="section level3">
<h3>Simulating data</h3>
<p>The data simulation will be based on generating potential outcomes. Observed outcomes will be a function of randomization group and complier status.</p>
<pre class="r"><code>options(digits = 3)
library(data.table)
library(simstudy)
library(ggplot2)
# Status :
# 1 = A(lways taker)
# 2 = N(ever taker)
# 3 = C(omplier)
def <- defDataAdd(varname = "Status",
formula = "0.20; 0.40; 0.40", dist = "categorical")
# potential outcomes (PO) for intervention
def <- defDataAdd(def, varname = "M0",
formula = "(Status == 1) * 1", dist = "nonrandom")
def <- defDataAdd(def, varname = "M1",
formula = "(Status != 2) * 1", dist = "nonrandom")
# observed intervention status based on randomization and PO
def <- defDataAdd(def, varname = "m",
formula = "(z==0) * M0 + (z==1) * M1", dist = "nonrandom")
# potential outcome for Y (depends on potential outcome for M)
set.seed(888)
dt <- genData(2000)
dt <- trtAssign(dt, n=2, grpName = "z")
dt <- addColumns(def, dt)
# using data functions here, not simstudy - I need add
# this functionality to simstudy
dt[, AStatus := factor(Status,
labels = c("Always-taker","Never-taker", "Complier"))]
# potential outcomes depend on group status - A, N, or C
dt[Status == 1, Y0 := rnorm(.N, 1.0, sqrt(0.25))]
dt[Status == 2, Y0 := rnorm(.N, 0.0, sqrt(0.36))]
dt[Status == 3, Y0 := rnorm(.N, 0.1, sqrt(0.16))]
dt[Status == 1, Y1 := rnorm(.N, 1.0, sqrt(0.25))]
dt[Status == 2, Y1 := rnorm(.N, 0.0, sqrt(0.36))]
dt[Status == 3, Y1 := rnorm(.N, 0.9, sqrt(0.49))]
# observed outcome function of actual treatment
dt[, y := (m == 0) * Y0 + (m == 1) * Y1]
dt</code></pre>
<pre><code>## id z Status M0 M1 m AStatus Y0 Y1 y
## 1: 1 1 3 0 1 1 Complier 0.5088 0.650 0.6500
## 2: 2 1 3 0 1 1 Complier 0.1503 0.729 0.7292
## 3: 3 1 2 0 0 0 Never-taker 1.4277 0.454 1.4277
## 4: 4 0 3 0 1 0 Complier 0.6393 0.998 0.6393
## 5: 5 0 1 1 1 1 Always-taker 0.6506 1.927 1.9267
## ---
## 1996: 1996 0 3 0 1 0 Complier -0.9554 0.114 -0.9554
## 1997: 1997 0 3 0 1 0 Complier 0.0366 0.903 0.0366
## 1998: 1998 1 3 0 1 1 Complier 0.3606 1.098 1.0982
## 1999: 1999 1 3 0 1 1 Complier 0.6651 1.708 1.7082
## 2000: 2000 0 3 0 1 0 Complier 0.2207 0.531 0.2207</code></pre>
<p>The plot shows outcomes <span class="math inline">\(y\)</span> for the two randomization groups. The ITT estimate would be based on an average of all the points in group, regardless of color or shape. The difference between the average of the black circles in the two groups represents the CACE.</p>
<pre class="r"><code>ggplot(data=dt, aes(y=y, x = factor(z, labels = c("Assigned to control",
"Assigned to treatment")))) +
geom_jitter(aes(shape=factor(m, labels = c("No treatment", "Treatment")),
color=AStatus),
width = 0.35) +
scale_shape_manual(values = c(1,19)) +
scale_color_manual(values = c("#e1d07d", "#7d8ee1", "grey25")) +
scale_y_continuous(breaks = seq(-3, 3, 1), labels = seq(-3, 3, 1)) +
theme(legend.title = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-08-iv-em-two-important-ideas-explored_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>In the real world, we cannot see the colors, yet we need to estimate as if we do, or at least use a method to bypasses that need:</p>
<p><img src="https://www.rdatagen.net/post/2017-09-08-iv-em-two-important-ideas-explored_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
<div id="estimating-cace-using-observed-data" class="section level3">
<h3>Estimating CACE using observed data</h3>
<p>The challenge is to estimate the CACE using <em>observed</em> data only, since that is all we have (along with a couple of key assumptions). We start of by claiming that the average causal effect of treatment <strong>assignment</strong> (<span class="math inline">\(ACE\)</span>) is a weighted average of the three sub-populations of <em>compliers</em>, <em>never-takers</em>, and <em>always-takers</em>:</p>
<p><span class="math display">\[ ACE = \pi_C \times CACE + \pi_N \times NACE + \pi_A \times AACE, \]</span> where <span class="math inline">\(CACE\)</span> is the average causal effect of treatment assignment for the subset of those in the sample who are <em>compliers</em>, <span class="math inline">\(NACE\)</span> is the average causal effect of treatment assignment for the subset who are <em>never-takers</em>, and <span class="math inline">\(AACE\)</span> is the average causal effect for those who are <em>always-takers</em>. <span class="math inline">\(\pi_C\)</span>, <span class="math inline">\(\pi_N\)</span>, and <span class="math inline">\(\pi_A\)</span> represent the sample proportions of compliers, never-takers, and always-takers, respectively.</p>
<p>A key assumption often made to estimate <span class="math inline">\(CACE\)</span> is known as the <em>exclusion restriction</em>: treatment assignment has an effect on the outcome <em>only</em> if it changes the actual treatment taken. (A second key assumption is that there are no <em>deniers</em>, or folks who do the opposite of what they are told. This is called the monotonicity assumption.) This <em>exclusion restriction</em> implies that both <span class="math inline">\(NACE=0\)</span> and <span class="math inline">\(AACE=0\)</span>, since in both cases the treatment <em>received</em> is the same regardless of treatment assignment. In that case, we can re-write the equality as</p>
<p><span class="math display">\[ ACE = \pi_C \times CACE,\]</span></p>
<p>and finally with a little re-arranging,</p>
<p><span class="math display">\[ CACE = \frac{ACE}{\pi_C}. \]</span> So, in order estimate <span class="math inline">\(CACE\)</span>, we need to be able to estimate <span class="math inline">\(ACE\)</span> and <span class="math inline">\(\pi_C\)</span>. Fortunately, we are in a position to do this. Since this is a randomized trial, the average causal effect of treatment assignment is just the difference in observed outcomes for the two treatment assignment groups:</p>
<p><span class="math display">\[ ACE = E[Y | Z = 1] - E[Y | Z = 0] \]</span> This also happens to be the <em>intention-to-treat</em> ) (<span class="math inline">\(ITT\)</span>) estimate.</p>
<p><span class="math inline">\(\pi_C\)</span> is a little harder, but in this simplified scenario, not that hard. We just need to follow a little logic: for the control group, we can identify the <em>always-takers</em> (they’re the ones who actually receive the treatment), so we know <span class="math inline">\(\pi_A\)</span> for the the control group. This can be estimated as <span class="math inline">\(P(M=1|Z=0)\)</span>. And, since the study was randomized, the distribution of <em>always-takers</em> in the treatment group must be the same. So, we can use <span class="math inline">\(\pi_A\)</span> estimated from the control group as an estimate for the treatment group.</p>
<p>For the treatment group, we know that <span class="math inline">\(\pi_C + \pi_A = P(M = 1 | Z = 1)\)</span>. That is everyone who receives treatment in the treatment group is either a complier or always-taker. With this, we can say</p>
<p><span class="math display">\[\pi_C = P(M=1 | Z = 1) - \pi_A.\]</span></p>
<p>But, of course, we argued above that we can estimate <span class="math inline">\(\pi_A\)</span> as <span class="math inline">\(P(M=1|Z=0)\)</span>. So, finally, we have</p>
<p><span class="math display">\[\pi_C = P(M=1 | Z = 1) - P(M=1|Z=0).\]</span> This gives us a method of moments estimator for <span class="math inline">\(CACE\)</span> from observed data:</p>
<p><span class="math display">\[ CACE = \frac{ACE}{\pi_C} = \frac{E[Y | Z = 1] - E[Y | Z = 0]}{P(M=1 | Z = 1) - P(M=1|Z=0)}. \]</span></p>
</div>
<div id="the-simulated-estimate" class="section level2">
<h2>The simulated estimate</h2>
<pre class="r"><code>ACE <- dt[z==1, mean(y)] - dt[z==0, mean(y)] # Also ITT
ACE</code></pre>
<pre><code>## [1] 0.307</code></pre>
<pre class="r"><code>pi_C <- dt[z==1, mean(m)] - dt[z==0, mean(m)] # strength of instrument
pi_C</code></pre>
<pre><code>## [1] 0.372</code></pre>
<pre class="r"><code>truth <- dt[AStatus == "Complier", mean(Y1 - Y0)]
truth</code></pre>
<pre><code>## [1] 0.81</code></pre>
<pre class="r"><code>ACE/pi_C</code></pre>
<pre><code>## [1] 0.826</code></pre>
<p>A method quite commonly used to analyze non-compliance is the instrumental variable model estimated with two-staged least squares regression. The R package <code>ivpack</code> is one of several that facilitates this type of analysis. A discussio of this methodology far exceeds the scope of this post. In any case, we can see that in this simple example, the IV estimate is the same as the method of moments estimator (by looking at the coefficient estimate of <code>m</code>).</p>
<pre class="r"><code>library(ivpack)
ivmodel <- ivreg(formula = y ~ m | z, data = dt, x = TRUE)
summary(ivmodel)</code></pre>
<pre><code>##
## Call:
## ivreg(formula = y ~ m | z, data = dt, x = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19539 -0.36249 0.00248 0.35859 2.27902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0932 0.0302 3.09 0.002 **
## m 0.8262 0.0684 12.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.569 on 1998 degrees of freedom
## Multiple R-Squared: 0.383, Adjusted R-squared: 0.383
## Wald test: 146 on 1 and 1998 DF, p-value: <2e-16</code></pre>
<p>So, again, if I have piqued your interest of this very rich and interesting topic, or if I have totally confused you, go check out the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a>. In my next post, I will describe a simple latent variable model using a maximum likelihood EM (expectation-maximization) algorithm that arrives at an estimate by predicting complier status.</p>
</div>
Further considerations of a hidden process underlying categorical responses
https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/
Tue, 05 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/<p>In my <a href="https://www.rdatagen.net/post/ordinal-regression/">previous post</a>, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have <em>multiple</em> outcomes, but with a slight twist - these groups of interest can be interpreted in an ordered way. This conceptual latent process can provide another perspective on the models that are typically applied to analyze these types of outcomes.</p>
<div id="categorical-outcomes-generally" class="section level3">
<h3>Categorical outcomes, generally</h3>
<p>Certainly, group membership is not necessarily intrinsically ordered. In a general categorical or multinomial outcome, a group does not necessarily have any quantitative relationship vis a vis the other groups. For example, if we were interested in primary type of meat consumption, individuals might be grouped into those favoring (1) chicken, (2) beef, (3) pork, or (4) no meat. We might be interested in estimating the different distributions across the four groups for males and females. However, since there is no natural ranking or ordering of these meat groups (though maybe I am just not creative enough), we are limited to comparing the odds of being in one group relative to another for two exposure groups A and B, such as</p>
<p><span class="math display">\[\small{\frac{P(Beef|Group = A)}{P(Chicken|Group = A)} \ vs. \frac{P(Beef|Group = B)}{P(Chicken|Group = B)}}\]</span>.</p>
</div>
<div id="ordinal-outcomes" class="section level3">
<h3>Ordinal outcomes</h3>
<p>Order becomes relevant when the categories take on meanings related strength of opinion or agreement (as in a Likert-type response) or frequency. In the motivating example I described in the initial post, the response of interest was the frequency meat consumption in a month, so the response categories could be (1) none, (2) 1-3 times per month, (3) once per week, (4) 2-6 times per week, (5) 1 or more times per day. Individuals in group 2 consume meat more frequently than group 1, individuals in group 3 consume meat more frequently than those both groups 1 & 2, and so on. There is a natural quantitative relationship between the groups.</p>
<p>Once we have thrown ordering into the mix, we can expand our possible interpretations of the data. In particular it is quite common to summarize the data by looking at <em>cumulative</em> probabilities, odds, or log-odds. Comparisons of different exposures or individual characteristics typically look at how these cumulative measures vary across the different exposures or characteristics. So, if we were interested in cumulative odds, we would compare <span class="math display">\[\small{\frac{P(Response = 1|Group = A)}{P(Response > 1|Group = A)} \ \ vs. \ \frac{P(Response = 1|Group = B)}{P(Response > 1|Group = B)}},\]</span></p>
<p><span class="math display">\[\small{\frac{P(Response \leq 2|Group = A)}{P(Response > 2|Group = A)} \ \ vs. \ \frac{P(Response \leq 2|Group = B)}{P(Response > 2|Group = B)}},\]</span></p>
<p>and continue until the last (in this case, fourth) comparison</p>
<p><span class="math display">\[\small{\frac{P(Response \leq 4|Group = A)}{P(Response = 5|Group = A)} \ \ vs. \ \frac{P(Response \leq 4|Group = B)}{P(Response = 5|Group = B)}}.\]</span></p>
</div>
<div id="multiple-responses-multiple-thresholds" class="section level3">
<h3>Multiple responses, multiple thresholds</h3>
<p>The latent process that was described for the binary outcome is extended to the multinomial outcome by the addition of more thresholds. These thresholds define the portions of the density that define the probability of each possible response. If there are <span class="math inline">\(k\)</span> possible responses (in the meat example, we have 5), then there will be <span class="math inline">\(k-1\)</span> thresholds. The area under the logistic density curve of each of the regions defined by those thresholds (there will be <span class="math inline">\(k\)</span> distinct regions) represents the probability of each possible response tied to that region. In the example here, we define five regions of a logistic density by setting the four thresholds. We can say that this underlying continuous distribution represents the probability distribution of categorical responses for a specific population, which we are calling <em>Group A</em>.</p>
<pre class="r"><code># preliminary libraries and plotting defaults
library(ggplot2)
library(data.table)
my_theme <- function() {
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
axis.ticks = element_line(colour = "black"),
panel.spacing = unit(0.25, "lines"),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"))
}
# create data points density curve
x <- seq(-6, 6, length = 1000)
pdf <- dlogis(x, location = 0, scale = 1)
dt <- data.table(x, pdf)
# set thresholds for Group A
thresholdA <- c(-2.1, -0.3, 1.4, 3.6)
pdf <- dlogis(thresholdA)
grpA <- data.table(threshold = thresholdA, pdf)
aBreaks <- c(-6, grpA$threshold, 6)
# plot density with cutpoints
dt[, grpA := cut(x, breaks = aBreaks, labels = F, include.lowest = TRUE)]
p1 <- ggplot(data = dt, aes(x = x, y = pdf)) +
geom_line() +
geom_area(aes(x = x, y = pdf, group = grpA, fill = factor(grpA))) +
geom_hline(yintercept = 0, color = "grey50") +
annotate("text", x = -5, y = .28, label = "Group A", size = 5) +
scale_fill_manual(values = c("#d0d7d1", "#bbc5bc", "#a6b3a7", "#91a192", "#7c8f7d"),
labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"),
name = "Frequency") +
scale_x_continuous(breaks = thresholdA) +
scale_y_continuous(limits = c(0, 0.3), name = "Density") +
my_theme() +
theme(legend.position = c(.85, .7),
legend.background = element_rect(fill = "grey90"),
legend.key = element_rect(color = "grey90"))
p1</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-04-a-hidden-process-part-2-of-2_files/figure-html/threshold-1.png" width="480" /></p>
<p>The area for each of the five regions can easily be calculated, where each area represents the probability of each response:</p>
<pre class="r"><code>pA= plogis(c(thresholdA, Inf)) - plogis(c(-Inf, thresholdA))
probs <- data.frame(pA)
rownames(probs) <- c("P(Resp = 1)", "P(Resp = 2)",
"P(Resp = 3)", "P(Resp = 4)", "P(Resp = 5)")
probs</code></pre>
<pre><code>## pA
## P(Resp = 1) 0.109
## P(Resp = 2) 0.316
## P(Resp = 3) 0.377
## P(Resp = 4) 0.171
## P(Resp = 5) 0.027</code></pre>
<p>As I’ve already mentioned, when we characterize a multinomial response, we typically do so in terms of cumulative probabilities. I’ve calculated several quantities below, and we can see that the logs of the cumulative odds for this particular group are indeed the threshold values that we used to define the sub-regions.</p>
<pre class="r"><code># cumulative probabilities defined by the threshold
probA <- data.frame(
cprob = plogis(thresholdA),
codds = plogis(thresholdA)/(1-plogis(thresholdA)),
lcodds = log(plogis(thresholdA)/(1-plogis(thresholdA)))
)
rownames(probA) <- c("P(Grp < 2)", "P(Grp < 3)", "P(Grp < 4)", "P(Grp < 5)")
probA</code></pre>
<pre><code>## cprob codds lcodds
## P(Grp < 2) 0.11 0.12 -2.1
## P(Grp < 3) 0.43 0.74 -0.3
## P(Grp < 4) 0.80 4.06 1.4
## P(Grp < 5) 0.97 36.60 3.6</code></pre>
<p>The last column of the table below matches the thresholds defined in vector <code>thresholdA</code>.</p>
<pre class="r"><code>thresholdA</code></pre>
<pre><code>## [1] -2.1 -0.3 1.4 3.6</code></pre>
</div>
<div id="comparing-response-distributions-of-different-populations" class="section level3">
<h3>Comparing response distributions of different populations</h3>
<p>In the cumulative logit model, the underlying assumption is that the odds ratio of one population relative to another is constant across all the possible responses. This means that all of the cumulative odds ratios are equal:</p>
<p><span class="math display">\[\small{\frac{codds(P(Resp = 1 | A))}{codds(P(Resp = 1 | B))} = \frac{codds(P(Resp \leq 2 | A))}{codds(P(Resp \leq 2 | B))} = \ ... \ = \frac{codds(P(Resp \leq 4 | A))}{codds(P(Resp \leq 4 | B))}}\]</span></p>
<p>In terms of the underlying process, this means that each of the thresholds shifts the same amount, as shown below, where we add 1.1 units to each threshold that was set Group A:</p>
<pre class="r"><code># Group B threshold is an additive shift to the right
thresholdB <- thresholdA + 1.1
pdf <- dlogis(thresholdB)
grpB <- data.table(threshold = thresholdB, pdf)
bBreaks <- c(-6, grpB$threshold, 6)</code></pre>
<p>Based on this shift, we can see that the probability distribution for Group B is quite different:</p>
<pre class="r"><code>pB = plogis(c(thresholdB, Inf)) - plogis(c(-Inf, thresholdB))
probs <- data.frame(pA, pB)
rownames(probs) <- c("P(Resp = 1)", "P(Resp = 2)",
"P(Resp = 3)", "P(Resp = 4)", "P(Resp = 5)")
probs</code></pre>
<pre><code>## pA pB
## P(Resp = 1) 0.109 0.269
## P(Resp = 2) 0.316 0.421
## P(Resp = 3) 0.377 0.234
## P(Resp = 4) 0.171 0.067
## P(Resp = 5) 0.027 0.009</code></pre>
<p>Plotting Group B along with Group A, we can see visually how that shift affects the sizes of the five regions (I’ve left the thresholds of Group A in the Group B plot so you can see clearly the shift).</p>
<pre class="r"><code># Plot density for group B
dt[, grpB := cut(x, breaks = bBreaks, labels = F, include.lowest = TRUE)]
p2 <- ggplot(data = dt, aes(x = x, y = pdf)) +
geom_line() +
geom_area(aes(x = x, y = pdf, group = grpB, fill = factor(grpB))) +
geom_hline(yintercept = 0, color = "grey5") +
geom_segment(data=grpA,
aes(x=threshold, xend = threshold, y=0, yend=pdf),
size = 0.3, lty = 2, color = "#857284") +
annotate("text", x = -5, y = .28, label = "Group B", size = 5) +
scale_fill_manual(values = c("#d0d7d1", "#bbc5bc", "#a6b3a7", "#91a192", "#7c8f7d"),
labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"),
name = "Frequency") +
scale_x_continuous(breaks = thresholdB) +
scale_y_continuous(limits = c(0.0, 0.3), name = "Density") +
my_theme() +
theme(legend.position = "none")
library(gridExtra)
grid.arrange(p1, p2, nrow = 2 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-04-a-hidden-process-part-2-of-2_files/figure-html/plotB-1.png" width="480" /></p>
<p>When we look at the cumulative odds ratio comparing the odds of Group B to Group A for each response category, we see a constant ratio. And, of course, a constant log odds ratio, which also reflects the size of the shift from Group A to Group B.</p>
<pre class="r"><code># cumulative probabilities defined by the threshold
probB <- data.frame(
cprob = plogis(thresholdB),
codds = plogis(thresholdB)/(1-plogis(thresholdB)),
lcodds = log(plogis(thresholdB)/(1-plogis(thresholdB)))
)
oddsratio <- data.frame(coddsA = probA$codds,
coddsB = probB$codds,
cOR = probB$codds / probA$codds,
logcOR = log(probB$codds / probA$codds)
)
rownames(oddsratio) <- c("P(Grp < 2)", "P(Grp < 3)", "P(Grp < 4)", "P(Grp < 5)")
oddsratio</code></pre>
<pre><code>## coddsA coddsB cOR logcOR
## P(Grp < 2) 0.12 0.37 3 1.1
## P(Grp < 3) 0.74 2.23 3 1.1
## P(Grp < 4) 4.06 12.18 3 1.1
## P(Grp < 5) 36.60 109.95 3 1.1</code></pre>
</div>
<div id="the-cumulative-proportional-odds-model" class="section level3">
<h3>The cumulative proportional odds model</h3>
<p>In the <code>R</code> package <code>ordinal</code>, the model is fit using function <code>clm</code>. The model that is being estimated has the form</p>
<p><span class="math display">\[log \left( \frac{P(Resp \leq i)}{P(Resp > i)} | Group \right) = \alpha_i - \beta*I(Group=B) \ \ , \ i \in \{1, 2, 3, 4\}\]</span></p>
<p>The model specifies that the cumulative log-odds for a particular category is a function of two parameters, <span class="math inline">\(\alpha_i\)</span> and <span class="math inline">\(\beta\)</span>. (Note that in this parameterization and the model fit, <span class="math inline">\(-\beta\)</span> is used.) <span class="math inline">\(\alpha_i\)</span> represents the cumulative log odds of being in category <span class="math inline">\(i\)</span> or lower for those in the reference exposure group, which in our example is Group A. <em><span class="math inline">\(\alpha_i\)</span> also represents the threshold of the latent continuous (logistic) data generating process.</em> <span class="math inline">\(\beta\)</span> is the cumulative log-odds ratio for the category <span class="math inline">\(i\)</span> comparing Group B to reference Group A. <em><span class="math inline">\(\beta\)</span> also represents the shift of the threshold on the latent continuous process for Group B relative to Group A</em>. The proportionality assumption implies that the shift of the threshold for each of the categories is identical. This is what I illustrated above.</p>
</div>
<div id="simulation-and-model-fit" class="section level3">
<h3>Simulation and model fit</h3>
<p>To show how this process might actually work, I am simulating data from the standardized logistic distribution and applying the thresholds defined above based on the group status. In practice, each individual could have her own set of thresholds, depending on her characteristics (gender, age, etc.). In this case, group membership is the only characteristic I am using, so all individuals in a particular group share the same set of thresholds. (We could even have random effects, where subgroups have random shifts that are subgroup specific. In the addendum, following the main part of the post, I provide code to generate data from a mixed effects model with group level random effects plus fixed effects for exposure, gender, and a continuous outcome.)</p>
<pre class="r"><code>set.seed(123)
n = 1000
x.A <- rlogis(n)
acuts <- c(-Inf, thresholdA, Inf)
catA <- cut(x.A, breaks = acuts, label = F)
dtA <- data.table(id = 1:n, grp = "A", cat = catA)</code></pre>
<p>Not surprisingly (since we are using a generous sample size of 1000), the simulated proportions are quite close to the hypothetical proportions established by the thresholds:</p>
<pre class="r"><code>cumsum(prop.table(table(catA)))</code></pre>
<pre><code>## 1 2 3 4 5
## 0.11 0.44 0.81 0.97 1.00</code></pre>
<pre class="r"><code>probA$cprob</code></pre>
<pre><code>## [1] 0.11 0.43 0.80 0.97</code></pre>
<p>Now we generate a sample from Group B and combine them into a single data set:</p>
<pre class="r"><code>x.B <- rlogis(n)
bcuts <- c(-Inf, thresholdA + 1.1, Inf)
catB <- cut(x.B, breaks = bcuts, label = F)
dtB <- data.table(id = (n+1):(2*n), grp = "B", cat=catB)
dt <- rbind(dtA, dtB)
dt[, cat := factor(cat, labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"))]
dt</code></pre>
<pre><code>## id grp cat
## 1: 1 A 1-3/month
## 2: 2 A 1/week
## 3: 3 A 1-3/month
## 4: 4 A 2-6/week
## 5: 5 A 2-6/week
## ---
## 1996: 1996 B 1/week
## 1997: 1997 B 1-3/month
## 1998: 1998 B 1/week
## 1999: 1999 B 1-3/month
## 2000: 2000 B 1-3/month</code></pre>
<p>Finally, we estimate the parameters of the model using function <code>clm</code> and we see that we recover the original parameters quite well.</p>
<pre class="r"><code>library(ordinal)
clmFit <- clm(cat ~ grp, data = dt)
summary(clmFit)</code></pre>
<pre><code>## formula: cat ~ grp
## data: dt
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 2000 -2655.03 5320.05 6(0) 1.19e-11 2.3e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## grpB -1.0745 0.0848 -12.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## None|1-3/month -2.0912 0.0784 -26.68
## 1-3/month|1/week -0.2465 0.0612 -4.02
## 1/week|2-6/week 1.4212 0.0728 19.51
## 2-6/week|1+/day 3.5150 0.1643 21.39</code></pre>
<p>In the model output, the <code>grpB</code> coefficient of -1.07 is the estimate of <span class="math inline">\(-\beta\)</span>, which was set to 1.1 in the simulation. The threshold coefficients are the estimates of the <span class="math inline">\(\alpha_i\)</span>’s in the model, and we can see the estimates are not too bad by looking at the initial thresholds:</p>
<pre class="r"><code>coeffit <- coef(clmFit)[1:4]
names(coeffit) <- c(1:4)
rbind( thresholdA, coeffit)</code></pre>
<pre><code>## 1 2 3 4
## thresholdA -2.1 -0.30 1.4 3.6
## coeffit -2.1 -0.25 1.4 3.5</code></pre>
<p>This was a relatively simple simulation. However it highlights how it would be possible to generate more complex scenarios of multinomial response data to more fully explore other types of models. These more flexible models might be able to handle situations where the possibly restrictive assumptions of this model (particularly the proportional odds assumption) do not hold.</p>
</div>
<div id="addendum-1" class="section level2">
<h2>Addendum 1</h2>
<p>Here is code to generate cluster-randomized data with an ordinal outcome that is a function of treatment assignment, gender, and a continuous status measure at the individual level. There is also a group level random effect. Once the data are generated, I fit a mixed cumulative logit model.</p>
<pre class="r"><code>library(simstudy)
# define data
defSchool <- defData(varname = "reS", formula = 0,
variance = 0.10, id = "idS")
defSchool <- defData(defSchool, varname = "n",
formula = 250, dist = "noZeroPoisson")
defInd <- defDataAdd(varname = "male", formula = 0.45, dist = "binary")
defInd <- defDataAdd(defInd, varname = "status",
formula = 0, variance = 1, dist = "normal")
defInd <- defDataAdd(defInd,
varname = "z",
formula = "0.8 * grp + 0.3 * male - 0.2 * status + reS",
dist = "nonrandom")
# generate data
dtS <- genData(100, defSchool)
dtS <- trtAssign(dtS, grpName = "grp")
dt <- genCluster(dtS, "idS", "n", "id")
dt <- addColumns(defInd, dt)
# set reference probabilities for 4-category outcome
probs <- c(0.35, 0.30, 0.25, 0.10)
cprop <- cumsum(probs)
# map cumulative probs to thresholds for reference group
gamma.c <- qlogis(cprop)
matlp <- matrix(rep(gamma.c, nrow(dt)),
ncol = length(cprop),
byrow = TRUE
)
head(matlp)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] -0.62 0.62 2.2 Inf
## [2,] -0.62 0.62 2.2 Inf
## [3,] -0.62 0.62 2.2 Inf
## [4,] -0.62 0.62 2.2 Inf
## [5,] -0.62 0.62 2.2 Inf
## [6,] -0.62 0.62 2.2 Inf</code></pre>
<pre class="r"><code># set individual thresholds based on covariates,
# which is an additive shift from the reference group
# based on z
matlpInd <- matlp - dt[, z]
head(matlpInd)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] -1.52 -0.28 1.3 Inf
## [2,] -1.58 -0.34 1.2 Inf
## [3,] -0.95 0.29 1.9 Inf
## [4,] -1.53 -0.29 1.3 Inf
## [5,] -1.49 -0.25 1.3 Inf
## [6,] -1.13 0.11 1.7 Inf</code></pre>
<pre class="r"><code># convert log odds to cumulative probabability
matcump <- 1 / (1 + exp(-matlpInd))
matcump <- cbind(0, matcump)
head(matcump)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0.18 0.43 0.78 1
## [2,] 0 0.17 0.42 0.78 1
## [3,] 0 0.28 0.57 0.87 1
## [4,] 0 0.18 0.43 0.78 1
## [5,] 0 0.18 0.44 0.79 1
## [6,] 0 0.24 0.53 0.84 1</code></pre>
<pre class="r"><code># convert cumulative probs to category probs:
# originally, I used a loop to do this, but
# thought it would be better to vectorize.
# see 2nd addendum for time comparison - not
# much difference
p <- t(t(matcump)[-1,] - t(matcump)[-5,])
# show some indvidual level probabilities
head(p)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>apply(head(p), 1, sum)</code></pre>
<pre><code>## [1] 1 1 1 1 1 1</code></pre>
<pre class="r"><code># generate indvidual level category outcomes based on p
cat <- simstudy:::matMultinom(p)
catF <- ordered(cat)
dt[, cat := catF]</code></pre>
<p>When we fit the mixed effects model, it is not surprising that we recover the parameters used to generate the data, which were based on the model. The fixed effects were specified as “0.8 * grp + 0.3 * male - 0.2 * status”, the variance of the random group effect was 0.10, and the latent thresholds based on the category probabilities were {-0.62, 0.62, 2.20}:</p>
<pre class="r"><code>fmm <- clmm(cat ~ grp + male + status + (1|idS), data=dt)
summary(fmm)</code></pre>
<pre><code>## Cumulative Link Mixed Model fitted with the Laplace approximation
##
## formula: cat ~ grp + male + status + (1 | idS)
## data: dt
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 24990 -33096.42 66206.85 705(2118) 2.37e-02 1.3e+02
##
## Random effects:
## Groups Name Variance Std.Dev.
## idS (Intercept) 0.109 0.331
## Number of groups: idS 100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## grp 0.8117 0.0702 11.6 <2e-16 ***
## male 0.3163 0.0232 13.7 <2e-16 ***
## status -0.1959 0.0116 -16.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 -0.6478 0.0511 -12.7
## 2|3 0.6135 0.0511 12.0
## 3|4 2.1789 0.0529 41.2</code></pre>
</div>
<div id="addendum-2---vector-vs-loop" class="section level2">
<h2>Addendum 2 - vector vs loop</h2>
<p>In case any one is obsessed with vectorization in <code>R</code>, here is a comparison of two different functions that convert cumulative probabilities into probabilities. One method uses a loop, the other uses matrix operations. In this case, it actually appears that my non-loop approach is slower - maybe there is a faster way? Maybe not, since the loop is actually quite short - determined by the number of possible responses in the categorical measure…</p>
<pre class="r"><code>library(microbenchmark)
loopdif <- function(mat) {
ncols <- ncol(mat)
p <- matrix(0, nrow = nrow(mat), ncol = ( ncols - 1 ))
for (i in 1 : ( ncol(mat) - 1 )) {
p[,i] <- mat[, i+1] - mat[, i]
}
return(p)
}
vecdif <- function(mat) {
ncols <- ncol(mat)
p <- t(t(mat)[-1,] - t(mat)[-ncols,])
return(p)
}
head(loopdif(matcump))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>head(vecdif(matcump))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>microbenchmark(loopdif(matcump), vecdif(matcump),
times = 1000L, unit = "ms")</code></pre>
<pre><code>## Unit: milliseconds
## expr min lq mean median uq max neval
## loopdif(matcump) 0.96 1.4 1.9 1.7 1.9 112 1000
## vecdif(matcump) 0.92 1.7 3.1 2.3 2.7 115 1000</code></pre>
</div>
A hidden process behind binary or other categorical outcomes?
https://www.rdatagen.net/post/ordinal-regression/
Mon, 28 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/ordinal-regression/<p>I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. - six groups in total. The question was how best to evaluate effectiveness of the intervention?</p>
<p>Since the outcome was categorical <em>and</em> ordinal - that is category 1 implied less meat consumption that category 2, category 2 implied less consumption that category 3, and so on - a model that estimates the cumulative probability of ordinal outcomes seemed like a possible way to proceed. Cumulative logit models estimate a number of parameters that represent the cumulative log-odds of an outcome; the parameters are the log-odds of categories 2 through 6 versus category 1, categories 3 through 6 versus 1 & 2, etc. Maybe not the most intuitive way to interpret the data, but seems to plausibly fit the data generating process.</p>
<p>I was concerned about the proportionality assumption of the cumulative logit model, particularly when we started to consider adjusting for baseline characteristics (more on that in the next post). I looked more closely at the data generating assumptions of the cumulative logit model, which are quite frequently framed in the context of a continuous latent measure that follows a logistic distribution. I thought I’d describe that data generating process here to give an alternative view of discrete data models.</p>
<p>I know I have been describing a context that includes an outcome with multiple categories, but in this post I will focus on regular logistic regression with a binary outcome. This will hopefully allow me to establish the idea of a latent threshold. I think it will be useful to explain this simpler case first before moving on to the more involved case of an ordinal response variable, which I plan to tackle in the near future.</p>
<div id="a-latent-continuous-process-underlies-the-observed-binary-process" class="section level3">
<h3>A latent continuous process underlies the observed binary process</h3>
<p>For an event with a binary outcome (true or false, A or B, 0 or 1), the observed outcome may, at least in some cases, be conceived as the manifestation of an unseen, latent continuous outcome. In this conception, the observed (binary) outcome merely reflects whether or not the unseen continuous outcome has exceeded a specified threshold. Think of this threshold as a tipping point, above which the observable characteristic takes on one value (say false), below which it takes on a second value (say true).</p>
</div>
<div id="the-logistic-distribution" class="section level3">
<h3>The logistic distribution</h3>
<p>Logistic regression models are used to estimate relationships of individual characteristics with categorical outcomes. The name of this regression model arises from the logistic distribution, which is a symmetrical continuous distribution. In a latent (or hidden) variable framework, the underlying, unobserved continuous measure is drawn from this logistic distribution. More specifically, the standard logistic distribution is typically assumed, with a location parameter of 0, and a scale parameter of 1. (The mean of this distribution is 0 and variance is approximately 3.29.)</p>
<p>Here is a plot of a logistic pdf, shown in relation to a standard normal pdf (with mean 0 and variance 1):</p>
<pre class="r"><code>library(ggplot2)
library(data.table)
my_theme <- function() {
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
axis.ticks = element_line(colour = "black"),
panel.spacing = unit(0.25, "lines"),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"))
}
x <- seq(-6, 6, length = 1000)
yNorm <- dnorm(x, 0, 1)
yLogis <- dlogis(x, location = 0, scale = 1)
dt <- data.table(x, yNorm, yLogis)
dtm <- melt(dt, id.vars = "x", value.name = "Density")
ggplot(data = dtm) +
geom_line(aes(x = x, y = Density, color = variable)) +
geom_hline(yintercept = 0, color = "grey50") +
my_theme() +
scale_color_manual(values = c("red", "black"),
labels=c("Normal", "Logistic")) +
theme(legend.position = c(0.8, 0.6),
legend.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="the-threshold-defines-the-probability" class="section level3">
<h3>The threshold defines the probability</h3>
<p>Below, I have plotted the standardized logistic pdf with a threshold that defines a tipping point for a particular Group A. In this case the threshold is 1.5, so for everyone with a unseen value of <span class="math inline">\(X < 1.5\)</span>, the observed binary outcome <span class="math inline">\(Y\)</span> will be 1. For those where <span class="math inline">\(X \geq 1.5\)</span>, the observed binary outcome <span class="math inline">\(Y\)</span> will be 0:</p>
<pre class="r"><code>xGrpA <- 1.5
ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) +
geom_line() +
geom_segment(x = xGrpA, y = 0, xend = xGrpA, yend = dlogis(xGrpA), lty = 2) +
geom_area(mapping = aes(ifelse(x < xGrpA, x, xGrpA)), fill = "white") +
geom_hline(yintercept = 0, color = "grey50") +
ylim(0, 0.3) +
my_theme() +
scale_x_continuous(breaks = c(-6, -3, 0, xGrpA, 3, 6))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Since we have plot a probability density (pdf), the area under the entire curve is equal to 1. We are interested in the binary outcome <span class="math inline">\(Y\)</span> defined by the threshold, so we can say that the area below the curve to the left of threshold (filled in white) represents <span class="math inline">\(P(Y = 1|Group=A)\)</span>. The remaining area represents <span class="math inline">\(P(Y = 0|Group=A)\)</span>. The area to the left of the threshold can be calculated in <code>R</code> using the <code>plogis</code> function:</p>
<pre class="r"><code>(p_A <- plogis(xGrpA))</code></pre>
<pre><code>## [1] 0.8175745</code></pre>
<p>Here is the plot for a second group that has a threshold of 2.2:</p>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>The area under the curve to the left of the threshold is <span class="math inline">\(P(X < 2.2)\)</span>, which is also <span class="math inline">\(P(Y = 1 | Group=B)\)</span>:</p>
<pre class="r"><code>(p_B <- plogis(xGrpB))</code></pre>
<pre><code>## [1] 0.9002495</code></pre>
</div>
<div id="log-odds-and-probability" class="section level3">
<h3>Log-odds and probability</h3>
<p>In logistic regression, we are actually estimating the log-odds of an outcome, which can be written as</p>
<p><span class="math display">\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right]\]</span>.</p>
<p>In the case of Group A, log-odds of Y being equal to 1 is</p>
<pre class="r"><code>(logodds_A <- log(p_A / (1 - p_A) ))</code></pre>
<pre><code>## [1] 1.5</code></pre>
<p>And for Group B,</p>
<pre class="r"><code>(logodds_B <- log(p_B / (1 - p_B) ))</code></pre>
<pre><code>## [1] 2.2</code></pre>
<p>As you may have noticed, we’ve recovered the thresholds that we used to define the probabilities for the two groups. The threshold is actually the log-odds for a particular group.</p>
</div>
<div id="logistic-regression" class="section level3">
<h3>Logistic regression</h3>
<p>The logistic regression model that estimates the log-odds for each group can be written as</p>
<p><span class="math display">\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right] = B_0 + B_1 * I(Grp = B) \quad ,\]</span></p>
<p>where <span class="math inline">\(B_0\)</span> represents the threshold for Group A and <span class="math inline">\(B_1\)</span> represents the shift in the threshold for Group B. In our example, the threshold for Group B is 0.7 units (2.2 - 1.5) to the right of the threshold for Group A. If we generate data for both groups, our estimates for <span class="math inline">\(B_0\)</span> and <span class="math inline">\(B_1\)</span> should be close to 1.5 and 0.7, respectively</p>
</div>
<div id="the-process-in-action" class="section level3">
<h3>The process in action</h3>
<p>To put this all together in a simulated data generating process, we can see the direct link with the logistic distribution, the binary outcomes, and an interpretation of estimates from a logistic model. The only stochastic part of this simulation is the generation of continuous outcomes from a logistic distribution. Everything else follows from the pre-defined group assignments and the group-specific thresholds:</p>
<pre class="r"><code>n = 5000
set.seed(999)
# Stochastic step
xlatent <- rlogis(n, location = 0, scale = 1)
# Deterministic part
grp <- rep(c("A","B"), each = n / 2)
dt <- data.table(id = 1:n, grp, xlatent, y = 0)
dt[grp == "A" & xlatent <= xGrpA, y := 1]
dt[grp == "B" & xlatent <= xGrpB, y := 1]
# Look at the data
dt</code></pre>
<pre><code>## id grp xlatent y
## 1: 1 A -0.4512173 1
## 2: 2 A 0.3353507 1
## 3: 3 A -2.2579527 1
## 4: 4 A 1.7553890 0
## 5: 5 A 1.3054260 1
## ---
## 4996: 4996 B -0.2574943 1
## 4997: 4997 B -0.9928283 1
## 4998: 4998 B -0.7297179 1
## 4999: 4999 B -1.6430344 1
## 5000: 5000 B 3.1379593 0</code></pre>
<p>The probability of a “successful” outcome (i.e <span class="math inline">\(P(Y = 1\)</span>)) for each group based on this data generating process is pretty much equal to the areas under the respective densities to the left of threshold used to define success:</p>
<pre class="r"><code>dt[, round(mean(y), 2), keyby = grp]</code></pre>
<pre><code>## grp V1
## 1: A 0.82
## 2: B 0.90</code></pre>
<p>Now let’s estimate a logistic regression model:</p>
<pre class="r"><code>library(broom)
glmfit <- glm(y ~ grp, data = dt, family = "binomial")
tidy(glmfit, quick = TRUE)</code></pre>
<pre><code>## term estimate
## 1 (Intercept) 1.5217770
## 2 grpB 0.6888526</code></pre>
<p>The estimates from the model recover the logistic distribution thresholds for each group. The Group A threshold is estimated to be 1.52 (the intercept) and the Group B threshold is estimated to be 2.21 (intercept + grpB parameter). These estimates can be interpreted as the log-odds of success for each group, but also <em>as the threshold for the underlying continuous data generating process that determines the binary outcome <span class="math inline">\(Y\)</span></em>. And we can interpret the parameter for <code>grpB</code> in the traditional way as the log-odds ratio comparing the log-odds of success for Group B with the log-odds of success for Group A, or <em>as the shift in the logistic threshold for Group A to the logistic threshold for Group B</em>.</p>
<p>In the next week or so, I will extend this to a discussion of an ordinal categorical outcome. I think the idea of shifting the thresholds underscores the proportionality assumption I alluded to earlier …</p>
</div>
Be careful not to control for a post-exposure covariate
https://www.rdatagen.net/post/be-careful/
Mon, 21 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/be-careful/<p>A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also include a measurement for poverty in adulthood - believing it was somehow confounding the relationship of trauma and substance use. A confounder is a common cause of an exposure/treatment and an outcome; it is hard to conceive of adult poverty as a cause of childhood events, even though it might be related to adult substance use (or maybe not). At best, controlling for adult poverty has no impact on the conclusions of the research; less good, though, is the possibility that it will lead to the conclusion that the effect of trauma is less than it actually is.</p>
<p>Using a highly contrived simulation of data and the abstract concept of <em>potential outcomes</em>, I am hoping to illuminate some of the issues raised by this type of analysis.</p>
<div id="potential-outcomes-and-causal-effects" class="section level2">
<h2>Potential outcomes and causal effects</h2>
<p>The field of causal inference is a rich one, and I won’t even scratch the surface here. My goal is to present the concepts of potential outcomes so that we can articulate at least one clear way to think about what a causal effect can be defined. Under this framework, we generate data where we can find out the “true” measure of causal effect. And then we can use simple regression models to see how well (or not) they recapture these “known” causal effects.</p>
<p>If an individual <span class="math inline">\(i\)</span> experiences a traumatic effect as a child, we say that the exposure <span class="math inline">\(X_i = 1\)</span>. Otherwise <span class="math inline">\(X_i = 0\)</span>, there was no traumatic event. (I am going to assume binary exposures just to keep things simple - exposed vs. not exposed.) In the potential outcomes world we say that every individual has possible outcomes <span class="math inline">\(Y_{1i}\)</span> (the outcome we would observe <em>if</em> the individual had experienced trauma) and <span class="math inline">\(Y_{0i}\)</span> (the outcome we would observe <em>if</em> the individual had not. Quite simply, we define the causal effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> as the difference in potential outcomes, <span class="math inline">\(CE_i = Y_{1i} - Y_{0i}\)</span>. If <span class="math inline">\(Y_{1i} = Y_{0i}\)</span> (i.e. the potential outcomes are the same), we would say that <span class="math inline">\(X\)</span> does not cause <span class="math inline">\(Y\)</span>, at least for individual <span class="math inline">\(i\)</span>.</p>
<p>In the real world, we only observe one potential outcome - the one associated with the actual exposure. The field of causal inference has lots to say about the assumptions and conditions that are required for us to use observed data to estimate average causal effects; many would say that unless we use a randomized controlled study, those assumptions will never be reasonable. But in the world of simulation, we can generate potential outcomes and observed outcomes, so we know the causal effect both at the individual level and the average population level. And we can see how well our models do.</p>
</div>
<div id="simple-confounding" class="section level2">
<h2>Simple confounding</h2>
<p>Here’s a relatively straightforward example. Let’s say we are interested in understanding if some measure <span class="math inline">\(X\)</span> causes an outcome <span class="math inline">\(Y\)</span>, where there is a common cause <span class="math inline">\(C\)</span> (the diagram is called a DAG - a directed acyclic graph - and is useful for many things, including laying out data generating process):</p>
<p><img src="https://www.rdatagen.net/img/post-careful/SimpleCausal.png" width="300px" /></p>
<pre class="r"><code>library(broom)
library(data.table)
library(simstudy)
def <- defData(varname = "C", formula = 0.4, dist = "binary")
def <- defData(def, "X", formula = "0.3 + 0.4 * C", dist = "binary")
def <- defData(def, "e", formula = 0, variance = 2, dist = "normal")
def <- defData(def, "Y0", formula = "2 * C + e", dist="nonrandom")
def <- defData(def, "Y1", formula = "0.5 + 2 * C + e", dist="nonrandom")
def <- defData(def, "Y_obs", formula = "Y0 + (Y1 - Y0) * X", dist = "nonrandom")
def</code></pre>
<pre><code>## varname formula variance dist link
## 1: C 0.4 0 binary identity
## 2: X 0.3 + 0.4 * C 0 binary identity
## 3: e 0 2 normal identity
## 4: Y0 2 * C + e 0 nonrandom identity
## 5: Y1 0.5 + 2 * C + e 0 nonrandom identity
## 6: Y_obs Y0 + (Y1 - Y0) * X 0 nonrandom identity</code></pre>
<p>In this example, <span class="math inline">\(X\)</span> does have an effect on <span class="math inline">\(Y\)</span>, but so does <span class="math inline">\(C\)</span>. If we ignore <span class="math inline">\(C\)</span> in assessing the size of the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, we will overestimate that effect, which is 0.5. We can generate data and see that this is the case:</p>
<pre class="r"><code>set.seed(5)
dt <- genData(1000, def)</code></pre>
<p>We see that the true causal effect is easily recovered if we have access to the potential outcomes <span class="math inline">\(Y_1\)</span> and <span class="math inline">\(Y_0\)</span>, but of course we don’t:</p>
<pre class="r"><code>dt[, mean(Y1 - Y0)] # True causal effect</code></pre>
<pre><code>## [1] 0.5</code></pre>
<p>If we compare the average <em>observed</em> outcomes for each exposure group ignoring the confounder, we overestimate the effect of the exposure:</p>
<pre class="r"><code>dt[X == 1, mean(Y_obs)] - dt[X == 0, mean(Y_obs)]</code></pre>
<pre><code>## [1] 1.285009</code></pre>
<p>We can estimate the same effect using simple linear regression:</p>
<pre class="r"><code>lm1 <- lm(Y_obs ~ X, data = dt)
tidy(lm1)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.5515963 0.07325865 7.529436 1.137854e-13
## 2 X 1.2850091 0.10674523 12.038094 2.916261e-31</code></pre>
<p>And finally, if we adjust for the confounder <span class="math inline">\(C\)</span>, we recover the true causal effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, or at least get very close to it:</p>
<pre class="r"><code>lm2 <- lm(Y_obs ~ X + C, data = dt)
tidy(lm2)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.08491216 0.06502545 1.305830 1.919117e-01
## 2 X 0.48935880 0.09678110 5.056347 5.083959e-07
## 3 C 2.05729945 0.09825925 20.937464 5.767575e-81</code></pre>
</div>
<div id="adjusting-for-a-post-exposure-covariate" class="section level2">
<h2>Adjusting for a post-exposure covariate</h2>
<p>Now, we are ready to see what happens in a slightly more complicated setting that is defined by this DAG:</p>
<p><img src="https://www.rdatagen.net/img/post-careful/ComplexCausal.png" width="400px" /></p>
<p>In this example <span class="math inline">\(C\)</span> is measured in two time periods, and exposure in period 1 relates to exposure in period 2. (For example, if a child is poor, he is more likely to be poor as an adult.) We are primarily interested in whether or not <span class="math inline">\(X\)</span> (trauma) causes <span class="math inline">\(Y\)</span> (substance use). The difficulty is that <span class="math inline">\(X\)</span> and <span class="math inline">\(C_2\)</span> are related, as are <span class="math inline">\(C_2\)</span> and <span class="math inline">\(Y\)</span>.</p>
<p>I suggest that in order to fully understand the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, we cannot control for <span class="math inline">\(C_2\)</span>, as tempting as it might be. The intuition is that part of the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> is due to the fact that <span class="math inline">\(X\)</span> has an effect on <span class="math inline">\(C_2\)</span>, at least for some individuals. <em>If we control for <span class="math inline">\(C_2\)</span>, we are actually removing a key component of the causal mechanism.</em> Below in is the data generating process - a few things to note: (1) <span class="math inline">\(C_2\)</span> has potential outcomes based on the exposure <span class="math inline">\(X\)</span>. (2) We have restricted the potential outcome <span class="math inline">\(C_{21}\)</span> to be set to 1 if <span class="math inline">\(C_{20}\)</span> is 1. For example, if someone would have been poor in adulthood <em>without</em> exposure to trauma, we assume that they also would have been poor in adulthood had they been exposed to trauma. (3) The potential outcome for <span class="math inline">\(Y\)</span> is dependent on the relevant potential outcome for <span class="math inline">\(C_2\)</span>. That is <span class="math inline">\(Y_0\)</span> depends on <span class="math inline">\(C_{20}\)</span> and <span class="math inline">\(Y_1\)</span> depends on <span class="math inline">\(C_{21}\)</span>.</p>
<pre><code>## varname formula variance dist link
## 1: C1 0.25 0 binary identity
## 2: X -2 + 0.8 * C1 0 binary logit
## 3: C2.0 -2.0 + 1 * C1 0 binary logit
## 4: C2.1x -1.5 + 1 * C1 0 binary logit
## 5: C2.1 pmax(C2.0, C2.1x) 0 nonrandom identity
## 6: e 0 4 normal identity
## 7: Y0 -3 + 5*C2.0 + e 0 nonrandom identity
## 8: Y1 0 + 5*C2.1 + e 0 nonrandom identity
## 9: C2_obs C2.0 + (C2.1 - C2.0) * X 0 nonrandom identity
## 10: Y_obs Y0 + (Y1 - Y0) * X 0 nonrandom identity</code></pre>
<pre class="r"><code>set.seed(25)
dt <- genData(5000, def2)</code></pre>
<p>Here is the true average causal effect, based on information we will never know:</p>
<pre class="r"><code>dt[, mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 3.903</code></pre>
<p>When we control for <span class="math inline">\(C_2\)</span>, we are essentially estimating the effect of <span class="math inline">\(X\)</span> at each level <span class="math inline">\(C_2\)</span> (and <span class="math inline">\(C_1\)</span>, since we are controlling for that as well), and then averaging across the sub-samples to arrive at an estimate for the entire sample. We can see that, based on the specification of the potential outcomes in the data generation process, the effect at each level of <span class="math inline">\(C_2\)</span> will be centered around 3.0, which is different from the true causal effect of 3.9. The discrepancy is due to the fact each approach is effectively collecting different sub-samples (one defines groups based on set levels of <span class="math inline">\(X\)</span> and <span class="math inline">\(C_2\)</span>, and the other defines groups based on set levels of <span class="math inline">\(X\)</span> alone) and estimating average effects based on weights determined by the sizes of those two sets of sub-samples.</p>
<p>Here is the inappropriate model that adjusts for <span class="math inline">\(C_2\)</span>:</p>
<pre class="r"><code>lm2a <- lm( Y_obs ~ C1 + C2_obs + X , data = dt)
tidy(lm2a)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -3.01360235 0.03481082 -86.5708464 0.000000e+00
## 2 C1 -0.02078171 0.06765129 -0.3071887 7.587126e-01
## 3 C2_obs 4.92972384 0.07625838 64.6450138 0.000000e+00
## 4 X 3.04600204 0.08114223 37.5390478 6.677416e-272</code></pre>
<p>The estimate for the coefficient of <span class="math inline">\(X\)</span> is 3.0, just as anticipated. Here now is the correct model, and you will see that we recover the true causal effect in the coefficient estimate of <span class="math inline">\(X\)</span> (or at least, we get much, much closer):</p>
<pre class="r"><code>lm2b <- lm( Y_obs ~ C1 + X , data = dt)
tidy(lm2b)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -2.4886726 0.04586832 -54.25689 0.000000e+00
## 2 C1 0.9665413 0.08930265 10.82321 5.315059e-27
## 3 X 3.9377832 0.10834926 36.34343 7.868650e-257</code></pre>
<p>Of course, in the real world, we don’t know the underlying data generating process or the true DAG. And what I have described here is a gross oversimplification of the underlying relationships, and have indeed left out many other factors that likely affect the relationship between childhood trauma and adult substance use. Other measures, such as parental substance use, may be related to both childhood trauma and adult substance use, and may affect poverty in the two time periods in different, complicated ways.</p>
<p>But the point is that one should give careful thought to what gets included in a model. We may not want to throw everything we measure into the model. Be careful.</p>
</div>
Should we be concerned about incidence - prevalence bias?
https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/
Wed, 09 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/<p>Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results - patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see <a href="https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/">this</a> for a discussion that touches this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might confuse an artificial effect as an intervention effect.</p>
<p>What could explain this phenomenon? The title of this post provides a hint: cases earlier in a study are more likely to be prevalent ones (i.e. they have been sick for a while), whereas later in the study cases tend to be incident (i.e. they only recently become sick). Even though both prevalent and incident cases are sick, the former may be sicker on average than the latter, simply because their condition has had more time develop.</p>
<p>We didn’t have any data to test out this hypothesis (if our grant proposal is funded, we will be able to do that), so I decided to see if I could simulate this phenomenon. In my continuing series exploring simulation using <code>Rcpp</code>, <code>simstudy</code>, and <code>data.table</code>, I am presenting some code that I used to do this.</p>
<div id="generating-a-population-of-patients" class="section level2">
<h2>Generating a population of patients</h2>
<p>The first task is to generate a population of individuals, each of whom starts out healthy and potentially becomes sicker over time. Time starts in month 1 and ends at some fixed point - in the first example, I end at 400 months. Each individual has a starting health status and a start month. In the examples that follow, health status is 1 through 4, with 1 being healthy, 3 is quite sick, and 4 is death. And, you can think of the start month as the point where the individual ages into the study. (For example, if the study includes only people 65 and over, the start month is the month the individual turns 65.) If an individual starts in month 300, she will have no measurements in periods 1 through 299 (i.e. health status will be 0).</p>
<p>The first part of the simulation generates a start month and starting health status for each individual, and then generates a health status for each individual until the end of time. Some individuals may die, while others may go all the way to the end of the simulation in a healthy state.</p>
<div id="rcpp-function-to-generate-health-status-for-each-period" class="section level4">
<h4>Rcpp function to generate health status for each period</h4>
<p>While it is generally preferable to avoid loops in R, sometimes it cannot be <a href="https://www.rdatagen.net/post/first-blog-entry/">avoided</a>. I believe generating a health status that depends on the previous health status (a Markov process) is one of those situations. So, I have written an Rcpp function to do this - it is orders of magnitude faster than doing this in R:</p>
<pre class="cpp"><code>#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector MCsim( unsigned int nMonths, NumericMatrix P,
int startStatus, unsigned int startMonth ) {
IntegerVector sim( nMonths );
IntegerVector healthStats( P.ncol() );
NumericVector currentP;
IntegerVector newstate;
unsigned int q = P.ncol();
healthStats = Rcpp::seq(1, q);
sim[startMonth - 1] = startStatus;
/* Loop through each month for each individual */
for (unsigned int i = startMonth; i < nMonths; i++) {
/* new state based on health status of last period and
probability of transitioning to different state */
newstate = RcppArmadillo::sample( healthStats,
1,
TRUE,
P.row(sim(i-1) - 1) );
sim(i) = newstate(0);
}
return sim;
}</code></pre>
<p><br></p>
</div>
<div id="generating-the-data" class="section level4">
<h4>Generating the data</h4>
<p>The data generation process is shown below. The general outline of the process is (1) define transition probabilities, (2) define starting health status distribution, (3) generate starting health statuses and start months, and (4) generate health statuses for each follow-up month.</p>
<pre class="r"><code># Transition matrix for moving through health statuses
P <- matrix(c(0.985, 0.015, 0.000, 0.000,
0.000, 0.950, 0.050, 0.000,
0.000, 0.000, 0.850, 0.150,
0.000, 0.000, 0.000, 1.000), nrow = 4, byrow = TRUE)
maxFU = 400
nPerMonth = 350
N = maxFU * nPerMonth
ddef <- defData(varname = "sHealth",
formula = "0.80; 0.15; 0.05",
dist = "categorical")
# generate starting health values (1, 2, or 3) for all individuals
set.seed(123)
did <- genData(n = N, dtDefs = ddef)
# each month, 350 age in to the sample
did[, sMonth := rep(1:maxFU, each = nPerMonth)]
# show table for 10 randomly selected individuals
did[id %in% sample(x = did$id, size = 10, replace = FALSE)] </code></pre>
<pre><code>## id sHealth sMonth
## 1: 15343 2 44
## 2: 19422 2 56
## 3: 41426 1 119
## 4: 50050 1 143
## 5: 63042 1 181
## 6: 83584 1 239
## 7: 93295 1 267
## 8: 110034 1 315
## 9: 112164 3 321
## 10: 123223 1 353</code></pre>
<pre class="r"><code># generate the health status history based on the transition matrix
dhealth <- did[, .(sHealth, sMonth, health = MCsim(maxFU, P, sHealth, sMonth)),
keyby = id]
dhealth[, month := c(1:.N), by = id]
dhealth</code></pre>
<pre><code>## id sHealth sMonth health month
## 1: 1 1 1 1 1
## 2: 1 1 1 1 2
## 3: 1 1 1 1 3
## 4: 1 1 1 1 4
## 5: 1 1 1 1 5
## ---
## 55999996: 140000 1 400 0 396
## 55999997: 140000 1 400 0 397
## 55999998: 140000 1 400 0 398
## 55999999: 140000 1 400 0 399
## 56000000: 140000 1 400 1 400</code></pre>
<p><br></p>
</div>
<div id="simulation-needs-burn-in-period" class="section level4">
<h4>Simulation needs burn-in period</h4>
<p>The simulation process itself is biased in its early phases as there are too many individuals in the sample who have just aged in compared to those who are “older”. (This is sort of the the reverse of the incidence - prevalence bias.) Since individuals tend to have better health status when they are “younger”, the average health status of the simulation in its early phases is biased downwards by the preponderance of young individuals in the population. This suggests that any evaluation of simulated data needs to account for a “burn-in” period that ensures there is a mix of “younger” and “older” individuals. To show this, I have calculated an average health score for each period of the simulation and plotted the results. You can see that the sample stabilizes after about 200 months in this simulation.</p>
<pre class="r"><code># count number of individuals with a particular heath statust each month
cmonth <- dhealth[month > 0, .N, keyby = .(month, health)]
cmonth</code></pre>
<pre><code>## month health N
## 1: 1 0 139650
## 2: 1 1 286
## 3: 1 2 47
## 4: 1 3 17
## 5: 2 0 139300
## ---
## 1994: 399 4 112203
## 1995: 400 1 18610
## 1996: 400 2 6515
## 1997: 400 3 2309
## 1998: 400 4 112566</code></pre>
<pre class="r"><code># transform data from "long" form to "wide" form and calculate average
mtotal <- dcast(data = cmonth,
formula = month ~ health,
fill = 0,
value.var = "N")
mtotal[, total := `1` + `2` + `3`]
mtotal[, wavg := (`1` + 2*`2` + 3*`3`)/total]
mtotal</code></pre>
<pre><code>## month 0 1 2 3 4 total wavg
## 1: 1 139650 286 47 17 0 350 1.231429
## 2: 2 139300 558 106 32 4 696 1.244253
## 3: 3 138950 829 168 45 8 1042 1.247601
## 4: 4 138600 1104 215 66 15 1385 1.250542
## 5: 5 138250 1362 278 87 23 1727 1.261726
## ---
## 396: 396 1400 18616 6499 2351 111134 27466 1.407813
## 397: 397 1050 18613 6537 2321 111479 27471 1.406938
## 398: 398 700 18587 6561 2323 111829 27471 1.407957
## 399: 399 350 18602 6541 2304 112203 27447 1.406201
## 400: 400 0 18610 6515 2309 112566 27434 1.405810</code></pre>
<pre class="r"><code>ggplot(data = mtotal, aes(x=month, y=wavg)) +
geom_line() +
ylim(1.2, 1.5) +
geom_hline(yintercept = 1.411, lty = 3) +
geom_vline(xintercept = 200, lty = 3) +
xlab("Month") +
ylab("Average health status") +
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0) ) +
ggtitle("Average health status of simulated population")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-09-simulating-incidence-prevalence-bias_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
</div>
<div id="generating-monthly-study-cohorts" class="section level2">
<h2>Generating monthly study cohorts</h2>
<p>Now we are ready to see if we can simulate the incidence - prevalence bias. The idea here is to find the first month during which an individual (1) is “active” (i.e. the period being considered is on or after the individual’s start period), (2) has an emergency department visit, and (3) whose health status has reached a specified threshold.</p>
<p>We can set a final parameter that looks back some number of months (say 6 or 12) to see if there have been any previous qualifying emergency room visits before the study start period (which in our case will be month 290 to mitigate an burn-in bias identified above). This “look-back” will be used to mitigate some of the bias by creating a washout period that makes the prevalent cases look more like incident cases. This look-back parameter is calculated each month for each individual using an Rcpp function that loops through each period:</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector cAddPrior(IntegerVector idx,
IntegerVector event,
int lookback) {
int nRow = idx.length();
IntegerVector sumPrior(nRow, NA_INTEGER);
for (unsigned int i = lookback; i < nRow; i++) {
IntegerVector seqx = Rcpp::seq(i-lookback, i-1);
IntegerVector x = event[seqx];
sumPrior[i] = sum(x);
}
return(sumPrior);
}</code></pre>
<p><br></p>
<div id="generating-a-single-cohort" class="section level4">
<h4>Generating a single cohort</h4>
<p>The following code (1) generates a population (as we did above), (2) generates emergency department visits that are dependent on the health status (the sicker an individual is, the more likely they are to go to the ED), (3) calculates the number of eligible ED visits during the look-back period, and (4) creates the monthly cohorts based on the selection criteria. At the end, we calculate average health status for the cohort by month of cohort - this will be used to illustrate the bias.</p>
<pre class="r"><code>maxFU = 325
nPerMonth = 100
N = maxFU * nPerMonth
START = 289 # to allow for adequate burn-in
HEALTH = 2
LOOKBACK = 6 # how far to lookback
set.seed(123)
did <- genData(n = N, dtDefs = ddef)
did[, sMonth := rep(1:maxFU, each = nPerMonth)]
healthStats <- did[, .(sHealth,
sMonth,
health = MCsim(maxFU, P, sHealth, sMonth)),
keyby = id]
healthStats[, month := c(1:.N), by = id]
# eliminate period without status measurement (0) & death (4)
healthStats <- healthStats[!(health %in% c(0,4))]
# ensure burn-in by starting with observations far
# into simulation
healthStats <- healthStats[month > (START - LOOKBACK)]
# set probability of emergency department visit
healthStats[, pED := (health == 1) * 0.02 +
(health == 2) * 0.10 +
(health == 3) * 0.20]
# generate emergency department visit
healthStats[, ed := rbinom(.N, 1, pED)]
healthStats[, edAdj := ed * as.integer(health >= HEALTH)] # if you want to restrict
healthStats[, pSum := cAddPrior(month, edAdj, lookback = LOOKBACK), keyby=id]
# look at one individual
healthStats[id == 28069]</code></pre>
<pre><code>## id sHealth sMonth health month pED ed edAdj pSum
## 1: 28069 1 281 1 284 0.02 0 0 NA
## 2: 28069 1 281 1 285 0.02 0 0 NA
## 3: 28069 1 281 1 286 0.02 0 0 NA
## 4: 28069 1 281 1 287 0.02 0 0 NA
## 5: 28069 1 281 2 288 0.10 0 0 NA
## 6: 28069 1 281 2 289 0.10 0 0 NA
## 7: 28069 1 281 2 290 0.10 1 1 0
## 8: 28069 1 281 2 291 0.10 0 0 1
## 9: 28069 1 281 2 292 0.10 0 0 1
## 10: 28069 1 281 2 293 0.10 0 0 1
## 11: 28069 1 281 2 294 0.10 0 0 1
## 12: 28069 1 281 3 295 0.20 1 1 1</code></pre>
<pre class="r"><code># cohort includes individuals with 1 prior ed visit in
# previous 6 months
cohort <- healthStats[edAdj == 1 & pSum == 0]
cohort <- cohort[, .(month = min(month)), keyby = id]
cohort</code></pre>
<pre><code>## id month
## 1: 53 306
## 2: 82 313
## 3: 140 324
## 4: 585 291
## 5: 790 299
## ---
## 3933: 31718 324
## 3934: 31744 325
## 3935: 31810 325
## 3936: 31860 325
## 3937: 31887 325</code></pre>
<pre class="r"><code># estimate average health status of monthly cohorts
cohortStats <- healthStats[cohort, on = c("id","month")]
sumStats <- cohortStats[ , .(avghealth = mean(health), n = .N), keyby = month]
head(sumStats)</code></pre>
<pre><code>## month avghealth n
## 1: 290 2.248175 137
## 2: 291 2.311765 170
## 3: 292 2.367347 147
## 4: 293 2.291925 161
## 5: 294 2.366906 139
## 6: 295 2.283871 155</code></pre>
</div>
</div>
<div id="exploring-bias" class="section level2">
<h2>Exploring bias</h2>
<p>Finally, we are at the point where we can see what, if any, bias results in selecting our cohorts under the scenario I’ve outlined above. We start by generating multiple iterations of populations and cohorts and estimating average health status by month under the assumption that we will have a look-back period of 0. That is, we will accept an individual into the first possible cohort regardless of her previous emergency department visit history. The plot below shows average across 1000 iterations. What we see is that the average health status of the cohorts in the first 20 months or so exceed the long run average. The incidence - prevalence bias is extremely strong if we ignore prior ED history!</p>
<p><img src="https://www.rdatagen.net/img/post-incidence/plot00.png" /> <br></p>
<div id="taking-history-into-account" class="section level4">
<h4>Taking history into account</h4>
<p>Once we start to incorporate ED history by using look-back periods greater than 0, we see that we can reduce bias considerably. The two plots below show the results of using look-back periods of 6 and 12 months. Both have reduced bias, but only at 12 months are we approaching something that actually looks desirable. In fact, under this scenario, we’d probably like to go back 24 months to eliminate the bias entirely. Of course, these particular results are dependent on the simulation assumptions, so determining an appropriate look-back period will certainly depend on the actual data. (When we do finally get the actual data, I will follow-up to let you know what kind of adjustment we needed to make in the real, non-simulated world.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-incidence/plot06.png" />
</div>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-incidence/plot12.png" />
</div>
</div>
</div>
Using simulation for power analysis: an example based on a stepped wedge study design
https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/
Mon, 10 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/<p>Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which <em>can</em> be a bit painful. My goal here is to show that at least writing the code need not be overwhelming.</p>
<p>Recently, I was helping an investigator plan a stepped wedge cluster randomized trial to study the effects of modifying a physician support system on patient-level diabetes management. While analytic approaches for power calculations do exist in the context of this complex study design, it seemed worth the effort to be explicit about all of the assumptions. So in this case I opted to use simulation. The basic approach is outlined below.</p>
<div id="the-stepped-wedge-design" class="section level2">
<h2>The stepped wedge design</h2>
<p>In cluster randomized trials, the unit of randomization is the group rather than the individual. While outcomes might be collected at the individual (e.g. student or patient) level, the intervention effect is assessed at the group (e.g. school or clinic). In a stepped wedge cluster design, the randomization unit is still the group, but all groups are eventually exposed to the intervention at some point in the study. Randomization determines <em>when</em> the intervention starts.</p>
<p>Below is schematic view of how a stepped wedge study is implemented. In this example, a block of clusters receives the intervention starting in the second period, another block starts the intervention in the third period, and so on. The intervention effect is essentially assessed by making within group comparisons. By staggering the starting points, the study is able to distinguish between time effects and treatment effects. If all groups started intervention at the same point, we would need to make an assumption that any improvements were due only to the intervention rather than changes that were occurring over time. This is not an assumption any one can easily justify.</p>
<p><img src="https://www.rdatagen.net/post/2017-07-10-using-simulation-for-power-analysis-an-example-based-on-a-stepped-wedge-study-design_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="power-and-simulation" class="section level2">
<h2>Power and simulation</h2>
<p>The statistical power of a study is the conditional probability (conditional on a given effect size), that a hypothesis test will correctly reject the null hypothesis (i.e. conclude there is an effect when there actually is one). Power is underscored by the notion that a particular study can be replicated exactly over and over again. So, if the power of a study is 80%, that means in 80% of the replications of that study we will (appropriately) reject the null hypothesis.</p>
<p>So, to estimate power, we can simulate replications of the study many times and conduct repeated hypothesis tests. The proportion of tests where we reject the null hypothesis is the estimated power. Each of these replications is based on the same set of data generating assumptions: effect sizes, sample sizes, individual level variation, group level variation, etc.</p>
</div>
<div id="simulating-from-a-stepped-wedge-design" class="section level2">
<h2>Simulating from a stepped wedge design</h2>
<p>In this example, we are assuming a 3-year study with four groups of clusters randomized to start an intervention at either 12 months, 18 months, 24 months, or 30 months (i.e. every 6 months following the 1st baseline year). The study would enroll patients at baseline in each of the clusters, and a measurement of a binary outcome (say diabetes under control, or not) would be collected at that time. Those patients would be followed over time and the same measurement would be collected every 6 months, concluding with the 7th measurement in the 36th month of the study. (It is totally possible to enroll new patients as the study progresses and have a different follow-up scheme, but this approximates the actual study I was working on.)</p>
<p>The data are generated based on a mixed effects model where there are group level effects (<span class="math inline">\(b_j\)</span> in the model) as well as individual level effects (<span class="math inline">\(b_i\)</span>). The model also assumes a very slight time trend before the intervention (e.g. diabetes control is improving slightly over time for an individual), an intervention effect, and an almost non-existent change in the time trend after the intervention. The outcome in each period is generated based on this formula:</p>
<p><span class="math inline">\(logit(Y_{ijt}) = 0.8 + .01 * period + 0.8 * I_{jt} + 0.001 * I_{jt} * (period-s_j) + b_i + b_j,\)</span></p>
<p>where <span class="math inline">\(period\)</span> goes from 0 to 6 (period 0 is the baseline, period 1 is the 6 month follow, etc.), <span class="math inline">\(I_{jt}\)</span> is 1 if cluster <span class="math inline">\(j\)</span> is in the intervention in period <span class="math inline">\(t\)</span>, <span class="math inline">\(s_j\)</span> is the period where the intervention starts for cluster <span class="math inline">\(j\)</span>, and <span class="math inline">\(logit(Y_{ijt})\)</span> is the log odds of the outcome <span class="math inline">\(Y\)</span> for individual <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(j\)</span> during period <span class="math inline">\(t\)</span>.</p>
<p>We start by defining the data structure using <code>simstudy</code> “data def”" commands. We are assuming that there will be 100 individuals followed at each site for the full study. (We are not assuming any dropout, though we could easily do that.) In this particular case, we are assuming an effect size of 0.8 (which is a log odds ratio):</p>
<pre class="r"><code>library(simstudy)
starts <- "rep(c(2 : 5), each = 10)"
siteDef <- defData(varname = "bj", dist = "normal", formula = 0,
variance = .01, id="site")
siteDef <- defData(siteDef, varname = "sj", dist = "nonrandom",
formula = starts)
siteDef <- defData(siteDef, varname = "ips", dist = "nonrandom",
formula = 100)
indDef <- defDataAdd(varname = "bi", dist = "normal", formula = 0,
variance = 0.01)
trtDef <- defDataAdd(varname = "Ijt" ,
formula = "as.numeric(period >= sj)",
dist = "nonrandom")
f = "0.8 + .01 * period + 0.8 * Ijt + 0.001 * Ijt * (period-sj) + bi + bj"
trtDef <- defDataAdd(trtDef, varname = "Yijt", formula = f,
dist = "binary", link = "logit")</code></pre>
<p>To generate 40 clusters of data, we use the following code:</p>
<pre class="r"><code>set.seed(6789)
dtSite <- genData(40, siteDef)
dtSite <- genCluster(dtSite, cLevelVar = "site", numIndsVar = "ips",
level1ID = "id")
dtSite <- addColumns(indDef, dtSite)
dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
dtSiteTm <- addColumns(trtDef, dtSiteTm)
dtSiteTm</code></pre>
<pre><code>## id period site bj sj ips bi timeID Ijt Yijt
## 1: 1 0 1 -0.1029785 2 100 0.08926153 1 0 1
## 2: 1 1 1 -0.1029785 2 100 0.08926153 2 0 1
## 3: 1 2 1 -0.1029785 2 100 0.08926153 3 1 1
## 4: 1 3 1 -0.1029785 2 100 0.08926153 4 1 1
## 5: 1 4 1 -0.1029785 2 100 0.08926153 5 1 1
## ---
## 27996: 4000 2 40 0.1000898 5 100 0.18869371 27996 0 1
## 27997: 4000 3 40 0.1000898 5 100 0.18869371 27997 0 0
## 27998: 4000 4 40 0.1000898 5 100 0.18869371 27998 0 1
## 27999: 4000 5 40 0.1000898 5 100 0.18869371 27999 1 1
## 28000: 4000 6 40 0.1000898 5 100 0.18869371 28000 1 1</code></pre>
<p>And to visualize what the study data might looks like under these assumptions:</p>
<pre class="r"><code># summary by site
dt <- dtSiteTm[, .(Y = mean(Yijt)), keyby = .(site, period, Ijt, sj)]
ggplot(data = dt, aes(x=period, y=Y, group=site)) +
geom_hline(yintercept = c(.7, .83), color = "grey99") +
geom_line(aes(color=factor(site))) +
geom_point(data = dt[sj == period], color="grey50") +
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
plot.title = element_text(size = 10, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"),
legend.position = "none",
axis.title.x = element_blank()
) +
ylab("Proportion controlled") +
scale_x_continuous(breaks = seq(0, 10, by = 2),
labels = c("Baseline", paste("Year", c(1:5)))) +
scale_y_continuous(limits = c(.5, 1),
breaks = c(.5, .6, .7, .8, .9, 1)) +
ggtitle("Stepped-wedge design with immediate effect") +
facet_grid(sj~.)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-07-10-using-simulation-for-power-analysis-an-example-based-on-a-stepped-wedge-study-design_files/figure-html/unnamed-chunk-4-1.png" width="480" /></p>
</div>
<div id="estimating-power" class="section level2">
<h2>Estimating power</h2>
<p>We are going to estimate power using only 20 clusters and effect size of 0.25. (Assuming 40 clusters and a large effect size was useful for visualizing the data, but not so interesting for illustrating power, since under those assumptions we are virtually guaranteed to find an effect.)</p>
<p>After generating the data (code not shown) for one iteration, we fit a generalized mixed effects model to show the effect estimate. In this case, the effect estimate is 1.46 (95% CI 1.21-1.77) on the odds ratio scale or 0.37 (95% CI 0.19-0.57) on the log odds ratio scale.</p>
<pre class="r"><code>library(lme4)
library(sjPlot)
glmfit <- glmer(data = dtSiteTm,
Yijt ~ period + Ijt + I(Ijt*(period - sj)) + (1|id) + (1|site),
family="binomial" )
sjt.glmer(glmfit, show.icc = FALSE, show.dev = FALSE)</code></pre>
<table style="border-collapse:collapse; border:none;border-bottom:double;">
<tr>
<td style="padding:0.2cm; border-top:double;">
</td>
<td style="border-bottom:1px solid; padding-left:0.5em; padding-right:0.5em; border-top:double;">
</td>
<td style="padding:0.2cm; text-align:center; border-bottom:1px solid; border-top:double;" colspan="3">
Yijt
</td>
</tr>
<tr>
<td style="padding:0.2cm; font-style:italic;">
</td>
<td style="padding-left:0.5em; padding-right:0.5em; font-style:italic;">
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
Odds Ratio
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
CI
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
p
</td>
</tr>
<tr>
<td colspan="5" style="padding:0.2cm; text-align:left; border-top:1px solid; font-weight:bold; text-align:left;">
Fixed Parts
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
(Intercept)
</td>
<td style="padding-left:0.5em; padding-right:0.5em; ">
</td>
<td style="padding:0.2cm; text-align:center; ">
2.15
</td>
<td style="padding:0.2cm; text-align:center; ">
1.90 – 2.44
</td>
<td style="padding:0.2cm; text-align:center; ">
<.001
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
period
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
1.00
</td>
<td style="padding:0.2cm; text-align:center; ">
0.95 – 1.06
</td>
<td style="padding:0.2cm; text-align:center; ">
.959
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
Ijt
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
1.46
</td>
<td style="padding:0.2cm; text-align:center; ">
1.21 – 1.77
</td>
<td style="padding:0.2cm; text-align:center; ">
<.001
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
I(Ijt * (period - sj))
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
0.99
</td>
<td style="padding:0.2cm; text-align:center; ">
0.91 – 1.07
</td>
<td style="padding:0.2cm; text-align:center; ">
.759
</td>
</tr>
<tr>
<td colspan="5" style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; font-weight:bold; text-align:left; padding-top:0.5em;">
Random Parts
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
τ<sub>00, id</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
0.011
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
τ<sub>00, site</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
0.029
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
N<sub>id</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
1000
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
N<sub>site</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
20
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;">
Observations
</td>
<td style="padding-left:0.5em; padding-right:0.5em; border-top:1px solid;">
</td>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:center; border-top:1px solid;" colspan="3">
7000
</td>
</tr>
</table>
<p>In order to estimate power, we need to generate a large number of replications. I created a simple function that generates a new data set every iteration based on the definitions. If we want to vary the model assumptions across different replications, we can write code to modify the data definition part of the process. In this way we could look at power across different sample size, effect size, or variance assumptions. Here, I am only considering a single set of assumptions.</p>
<pre class="r"><code>gData <- function() {
dtSite <- genData(nsites, siteDef)
dtSite <- genCluster(dtSite, cLevelVar = "site",
numIndsVar = "ips", level1ID = "id")
dtSite <- addColumns(indDef, dtSite)
dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
dtSiteTm <- addColumns(trtDef, dtSiteTm)
return(dtSiteTm)
}</code></pre>
<p>And finally, we iterate through a series of replications, keeping track of each hypothesis test in the variable <em>result</em>. Typically, it would be nice to replicate a large number of times (say 1000), but this can sometimes take a long time. In this case, each call to <code>glmer</code> is very resource intensive - unfortunately, I know of know way to speed this up (please get in touch if you have thoughts on this) - so for the purposes of illustration, I’ve only used 99 iterations. Note also that I check to see if the model converges in each iteration, and only include results from valid estimates. This can be an issue with mixed effects models, particularly when sample sizes are small. To estimate the power (which in this case is 78%), calculate the proportion of successful iterations with a p-value smaller than 0.05, the alpha-level threshold we have used in our hypothesis test:</p>
<pre class="r"><code>result <- NULL
i=1
while (i < 100) {
dtSite <- gData()
glmfit <- tryCatch(glmer(data = dtSite,
Yijt ~ period + Ijt + I(Ijt*(period - sj)) + (1|id) + (1|site),
family="binomial" ),
warning = function(w) { "warning" }
)
if (! is.character(glmfit)) {
pvalue <- coef(summary(glmfit))["Ijt", "Pr(>|z|)"]
result <- c(result, pvalue)
i <- i + 1
}
}
mean(result < .05)</code></pre>
<pre><code>## [1] 0.7812</code></pre>
<p>To explore the sensitivity of the power estimates to changing underlying assumptions of effect size, sample size, variation, and time trends, we could vary those parameters and run a sequence of iterations. The code gets a little more complicated (essentially we need to change the “data defs” for each set of iterations), but it is still quite manageable. Of course, you might want to plan for fairly long execution times, particularly if you use 500 or 1000 iterations for each scenario, rather than the 100 I used here.</p>
</div>
simstudy update: two new functions that generate correlated observations from non-normal distributions
https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/
Wed, 05 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/<p>In an earlier <a href="https://www.rdatagen.net/post/correlated-data-copula/">post</a>, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating <code>simstudy</code> with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>. There are now two additional functions to facilitate the generation of correlated data from <em>binomial</em>, <em>poisson</em>, <em>gamma</em>, and <em>uniform</em> distributions: <code>genCorGen</code> and <code>addCorGen</code>. Here’s a brief intro to these functions.</p>
<div id="generate-generally-correlated-data" class="section level3">
<h3>Generate generally correlated data</h3>
<p><code>genCorGen</code> is an extension of <code>genCorData</code>, which was provided in earlier versions of <code>simstudy</code> to generate multivariate normal data. In the first example below, we are generating data from a multivariate Poisson distribution. To do this, we need to specify the mean of the Poisson distribution for each new variable, and then we specify the correlation structure, just as we did with the normal distribution.</p>
<pre class="r"><code>l <- c(8, 10, 12) # lambda for each new variable
dp <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson",
rho = 0.3, corstr = "cs", wide = TRUE)
dp</code></pre>
<pre><code>## id V1 V2 V3
## 1: 1 7 13 12
## 2: 2 7 11 13
## 3: 3 7 8 14
## 4: 4 7 12 9
## 5: 5 8 13 18
## ---
## 996: 996 8 14 15
## 997: 997 10 5 11
## 998: 998 4 9 9
## 999: 999 5 10 9
## 1000: 1000 6 12 17</code></pre>
<p>Here is the the estimated correlation (we would expect an estimate close to 0.3):</p>
<pre class="r"><code>round(cor(as.matrix(dp[, .(V1, V2, V3)])), 2)</code></pre>
<pre><code>## V1 V2 V3
## V1 1.00 0.29 0.26
## V2 0.29 1.00 0.31
## V3 0.26 0.31 1.00</code></pre>
<p>Similarly, we can generate correlated binary data by specifying the probabilities:</p>
<pre class="r"><code>db<- genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary",
rho = 0.8, corstr = "cs", wide = TRUE)
db</code></pre>
<pre><code>## id V1 V2 V3
## 1: 1 1 1 1
## 2: 2 0 0 1
## 3: 3 1 1 1
## 4: 4 0 0 0
## 5: 5 1 1 1
## ---
## 996: 996 0 1 1
## 997: 997 0 0 0
## 998: 998 0 1 1
## 999: 999 1 1 1
## 1000: 1000 0 0 0</code></pre>
<p>In the case of the binary outcome, the observed correlation will be lower that what is specified, which in this case was 0.8. I tried to provide some intuition about this in the earlier <a href="https://www.rdatagen.net/post/correlated-data-copula/">post</a>:</p>
<pre class="r"><code>round(cor(as.matrix(db[, .(V1, V2, V3)])), 2)</code></pre>
<pre><code>## V1 V2 V3
## V1 1.0 0.50 0.40
## V2 0.5 1.00 0.56
## V3 0.4 0.56 1.00</code></pre>
<p>The gamma distribution requires two parameters - the mean and dispersion. (These are converted into shape and rate parameters more commonly used.)</p>
<pre class="r"><code>dg <- genCorGen(1000, nvars = 3, params1 = c(3,5,7), params2 = c(1,1,1),
dist = "gamma", rho = .7, corstr = "cs",
wide = TRUE, cnames="a, b, c")
dg</code></pre>
<pre><code>## id a b c
## 1: 1 0.1957971 0.9902398 2.299307
## 2: 2 0.2566630 2.4271728 1.217599
## 3: 3 1.9550985 13.9248696 5.178042
## 4: 4 3.5525418 2.5711661 7.848605
## 5: 5 6.6981281 8.7494117 12.478329
## ---
## 996: 996 2.2059693 6.3474811 3.054551
## 997: 997 2.3571427 7.7841085 7.887417
## 998: 998 5.5326638 7.3273337 15.965228
## 999: 999 5.6284681 13.3574118 17.215722
## 1000: 1000 0.3749373 1.1480452 0.696243</code></pre>
<pre class="r"><code>round(cor(as.matrix(dg[, .(a, b, c)])), 2)</code></pre>
<pre><code>## a b c
## a 1.00 0.65 0.67
## b 0.65 1.00 0.62
## c 0.67 0.62 1.00</code></pre>
<p>These data sets can be generated in either <em>wide</em> or <em>long</em> form. So far, we have generated <em>wide</em> form data, where there is one row per unique id. The <em>long</em> form, where the correlated data are on different rows, is useful for plotting or fitting models, because there are repeated measurements for each id:</p>
<pre class="r"><code>dgl <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1),
dist = "gamma", rho = .7, corstr = "cs", wide = FALSE,
cnames="NewCol")
dgl</code></pre>
<pre><code>## id period NewCol
## 1: 1 0 1.066558
## 2: 1 1 5.666802
## 3: 1 2 5.366408
## 4: 2 0 1.419593
## 5: 2 1 9.318227
## ---
## 2996: 999 1 21.821011
## 2997: 999 2 21.800972
## 2998: 1000 0 12.082063
## 2999: 1000 1 18.541231
## 3000: 1000 2 12.063846</code></pre>
<p>Here is a plot of a subset of the data:</p>
<pre class="r"><code>ids <- sample(1000,50, replace = FALSE)
ggplot(data=dgl[id %in% ids,], aes(x=factor(period), y=NewCol, group=id)) +
geom_line(aes(color=factor(id)))+
theme(legend.position = "none") +
scale_x_discrete(expand = c(0,0.1))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-07-05-simstudy-update-two-functions-to-generate-correlated-observations-from-non-normal-distributions_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="generate-data-based-on-values-from-existing-data-set" class="section level3">
<h3>Generate data based on values from existing data set</h3>
<p><code>addCorGen</code> allows us to create correlated data from an existing data set, as one can already do using <code>addCorData</code>, but with non-normal data. In the case of <code>addCorGen</code>, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable <code>xbase</code>, which varies by id.</p>
<p>First we define the data and generate a data set:</p>
<pre class="r"><code>def <- defData(varname = "xbase", formula = 5, variance = 0.2,
dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase",
dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1,
dist="nonrandom")
dt <- genData(10000, def)
dt</code></pre>
<pre><code>## cid xbase lambda p gammaMu gammaDis
## 1: 1 12.1128232 5.536056 0.8366960 18.588900 1
## 2: 2 4.9148342 2.695230 0.3715554 4.405998 1
## 3: 3 11.5550282 5.235712 0.8125261 16.626630 1
## 4: 4 3.0802596 2.243475 0.2542785 3.052778 1
## 5: 5 0.9767811 1.817893 0.1535577 2.004423 1
## ---
## 9996: 9996 6.0564517 3.021173 0.4543613 5.536100 1
## 9997: 9997 3.1298866 2.254636 0.2571119 3.083229 1
## 9998: 9998 12.4642670 5.734076 0.8505956 19.942505 1
## 9999: 9999 4.6559318 2.626345 0.3536072 4.183660 1
## 10000: 10000 3.4314285 2.323658 0.2747666 3.274895 1</code></pre>
<p>The Poisson distribution has a single parameter, lambda:</p>
<pre class="r"><code>dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1,
corstr = "cs", dist = "poisson", param1 = "lambda",
cnames = "a, b, c")
dtX1[, .(cid, xbase, lambda, a, b, c)]</code></pre>
<pre><code>## cid xbase lambda a b c
## 1: 1 12.1128232 5.536056 4 6 7
## 2: 2 4.9148342 2.695230 2 4 1
## 3: 3 11.5550282 5.235712 5 6 4
## 4: 4 3.0802596 2.243475 1 3 1
## 5: 5 0.9767811 1.817893 2 1 0
## ---
## 9996: 9996 6.0564517 3.021173 1 3 3
## 9997: 9997 3.1298866 2.254636 2 3 1
## 9998: 9998 12.4642670 5.734076 4 6 8
## 9999: 9999 4.6559318 2.626345 2 3 5
## 10000: 10000 3.4314285 2.323658 0 0 3</code></pre>
<p>The Bernoulli (binary) distribution has a single parameter, p:</p>
<pre class="r"><code>dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4,
corstr = "ar1", dist = "binary", param1 = "p")
dtX2[, .(cid, xbase, p, V1, V2, V3, V4)]</code></pre>
<pre><code>## cid xbase p V1 V2 V3 V4
## 1: 1 12.1128232 0.8366960 1 1 1 1
## 2: 2 4.9148342 0.3715554 0 0 0 0
## 3: 3 11.5550282 0.8125261 1 1 1 1
## 4: 4 3.0802596 0.2542785 0 1 0 0
## 5: 5 0.9767811 0.1535577 0 0 0 1
## ---
## 9996: 9996 6.0564517 0.4543613 0 0 0 0
## 9997: 9997 3.1298866 0.2571119 1 0 0 0
## 9998: 9998 12.4642670 0.8505956 0 1 1 1
## 9999: 9999 4.6559318 0.3536072 1 1 0 0
## 10000: 10000 3.4314285 0.2747666 1 0 1 1</code></pre>
<p>And here is the the Gamma distribution, with its two parameters (mean and dispersion):</p>
<pre class="r"><code>dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4,
corstr = "cs", dist = "gamma",
param1 = "gammaMu", param2 = "gammaDis")
dtX3[, .(cid, xbase, gammaMu, gammaDis,
V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))]</code></pre>
<pre><code>## cid xbase gammaMu gammaDis V1 V2 V3
## 1: 1 12.1128232 18.588900 1 11.24 3.44 9.11
## 2: 2 4.9148342 4.405998 1 0.91 3.77 0.76
## 3: 3 11.5550282 16.626630 1 68.47 12.91 1.72
## 4: 4 3.0802596 3.052778 1 2.54 3.63 2.98
## 5: 5 0.9767811 2.004423 1 0.39 0.14 0.42
## ---
## 9996: 9996 6.0564517 5.536100 1 0.29 4.84 1.80
## 9997: 9997 3.1298866 3.083229 1 4.81 0.38 0.81
## 9998: 9998 12.4642670 19.942505 1 17.10 3.56 4.04
## 9999: 9999 4.6559318 4.183660 1 1.17 0.21 1.47
## 10000: 10000 3.4314285 3.274895 1 1.02 1.61 2.24</code></pre>
</div>
<div id="long-form-data" class="section level3">
<h3>Long form data</h3>
<p>If we have data in <em>long</em> form (e.g. longitudinal data), the function will recognize the structure:</p>
<pre class="r"><code>def <- defData(varname = "xbase", formula = 5, variance = .4,
dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3,
dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "lambda",
formula = "0.5 + 0.5 * period + 0.1 * xbase",
dist="nonrandom", link = "log")
dt <- genData(1000, def)
dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)
dtLong</code></pre>
<pre><code>## cid period xbase nperiods timeID lambda
## 1: 1 0 6.693980 1 1 3.220053
## 2: 1 1 6.693980 1 2 5.308971
## 3: 1 2 6.693980 1 3 8.753013
## 4: 2 0 10.008645 2 4 4.485565
## 5: 2 1 10.008645 2 5 7.395447
## ---
## 2996: 999 1 6.753605 2 2996 5.340720
## 2997: 999 2 6.753605 2 2997 8.805359
## 2998: 1000 0 2.006781 4 2998 2.015119
## 2999: 1000 1 2.006781 4 2999 3.322369
## 3000: 1000 2 2.006781 4 3000 5.477661</code></pre>
<pre class="r"><code>### Generate the data
dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3,
rho = .6, corstr = "cs", dist = "poisson",
param1 = "lambda", cnames = "NewPois")
dtX3</code></pre>
<pre><code>## cid period xbase nperiods timeID lambda NewPois
## 1: 1 0 6.693980 1 1 3.220053 3
## 2: 1 1 6.693980 1 2 5.308971 5
## 3: 1 2 6.693980 1 3 8.753013 9
## 4: 2 0 10.008645 2 4 4.485565 2
## 5: 2 1 10.008645 2 5 7.395447 4
## ---
## 2996: 999 1 6.753605 2 2996 5.340720 6
## 2997: 999 2 6.753605 2 2997 8.805359 11
## 2998: 1000 0 2.006781 4 2998 2.015119 2
## 2999: 1000 1 2.006781 4 2999 3.322369 4
## 3000: 1000 2 2.006781 4 3000 5.477661 7</code></pre>
<p>We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:</p>
<pre class="r"><code>geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid,
family = poisson, corstr = "exchangeable")</code></pre>
<pre><code>## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27</code></pre>
<pre><code>## running glm to get initial regression estimate</code></pre>
<pre><code>## (Intercept) period xbase
## 0.52045259 0.50354885 0.09746544</code></pre>
<pre class="r"><code>round(summary(geefit)$working.correlation, 2)</code></pre>
<pre><code>## [,1] [,2] [,3]
## [1,] 1.00 0.58 0.58
## [2,] 0.58 1.00 0.58
## [3,] 0.58 0.58 1.00</code></pre>
<p>In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, <a href="mailto:keith.goldfeld@nyumc.org">let me know</a>.</p>
</div>
Balancing on multiple factors when the sample is too small to stratify
https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/
Mon, 26 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/<p>Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not <em>guarantee</em> balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.</p>
<p>When the sample size is small and we want to guarantee balance across <em>multiple</em> characteristics, the task is a bit more challenging. Say we have 20 schools that we are randomizing to two groups, 10 in each, and want to make sure the groups are balanced with respect to 4 characteristics: language, poverty, location, and size. Simple stratification may not work so well. If we assume that these four characteristics are binary (e.g. either “yes” or “no”), there are 16 possible combinations. One or more of these combinations could easily be represented by a single school - so it would be impossible to randomize within each of the 16 combinations. What to do?</p>
<p>One possible approach is to generate all possible randomization schemes of the 20 schools, and keep only those schemes that are balanced with respect to the four characteristics. Once we have a list of acceptable randomization schemes, we can just pick one of <em>those</em> at random. (Of course, it is preferable if each school has close to a 50% chance of being assigned to either intervention group.)</p>
<div id="simulate-school-level-data" class="section level2">
<h2>Simulate school-level data</h2>
<p>To start, we generate data for our 20 hypothetical schools using <code>simstudy</code> functions:</p>
<pre class="r"><code>library(simstudy)
set.seed(125)
# define data characteristics for schools
ddef <- defData(varname = "language", formula = .3, dist = "binary")
ddef <- defData(ddef, "poverty", formula = .2, dist = "binary")
ddef <- defData(ddef, "location", formula = .5, dist = "binary")
ddef <- defData(ddef, "size", formula = .5, dist = "binary")
ddef</code></pre>
<pre><code>## varname formula variance dist link
## 1: language 0.3 0 binary identity
## 2: poverty 0.2 0 binary identity
## 3: location 0.5 0 binary identity
## 4: size 0.5 0 binary identity</code></pre>
<pre class="r"><code># generate schools
dt <- genData(20, ddef)
# number of schools in each combination
dt[, .N, keyby = .(language,poverty,location,size)]</code></pre>
<pre><code>## language poverty location size N
## 1: 0 0 0 1 5
## 2: 0 0 1 0 1
## 3: 0 0 1 1 5
## 4: 0 1 0 0 1
## 5: 0 1 1 0 2
## 6: 1 0 0 0 2
## 7: 1 0 0 1 1
## 8: 1 0 1 0 1
## 9: 1 0 1 1 2</code></pre>
<p>In this case, we have nine different combinations of the four characteristics, four of which include only a single school (rows 2, 4, 7, and 8). Stratification wouldn’t work necessarily work here if our goal was balance across all four characteristics.</p>
</div>
<div id="create-randomization-scenarios-to-assess-for-balance" class="section level2">
<h2>Create randomization scenarios to assess for balance</h2>
<p>Ideally, we would generate all possible randomization combinations and check them all for balance. If the number of total units (e.g. schools) is small, this does not pose a challenge (e.g. if N=4, then we only have six possible randomization schemes: TTCC, TCTC, TCCT, CTTC, CTCT, CCTT). However, with N=20, then there are 184,756 possible randomization schemes. Depending on the efficiency of the algorithm, it may be impractical to evaluate all the schemes. So, an alternative is to sample a subset of the schemes and evaluate those. For illustration purposes (so that you can understand what I am doing), I am using some very inefficient <code>R</code> code (using a loops). As a result, I cannot evaluate all possible schemes in a reasonable period of time to get this post out; I decided to sample instead to evaluate 1000 possible randomizations. (At the end of this post, I show results using much more efficient code that uses data.table and Rcpp code much more effectively - so that we can quickly evaluate millions of randomization schemes.)</p>
<p>To start, I create all combinations of randomization schemes:</p>
<pre class="r"><code>totalSchools = 20
rxSchools = 10
xRx <- t(combn(totalSchools, rxSchools))
# show 5 randomly sampled combinations
sampleRows <- sample(nrow(xRx), 5, replace = FALSE)
xRx[sampleRows,]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 3 5 6 7 8 10 12 14 19
## [2,] 5 6 7 8 13 14 15 16 17 18
## [3,] 1 3 4 5 7 9 12 15 17 20
## [4,] 2 3 4 5 9 11 14 15 19 20
## [5,] 3 5 6 7 8 10 11 12 15 16</code></pre>
<p>Below is a function (which I chose to do in Rcpp) that converts the <code>xRx</code> matrix of school ids to a 20-column matrix of 1’s and 0’s indicating whether or not a school is randomized to the intervention in a particular scenario:</p>
<pre class="cpp"><code>#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericMatrix convert01(NumericMatrix xmat, int tcols) {
int xrows = xmat.nrow();
int xcols = xmat.ncol();
NumericMatrix pmat(xrows, tcols);
for (int i=0; i < xrows; i++) {
for (int j=0; j < xcols; j++) {
pmat(i, xmat(i,j) - 1) = 1;
}
}
return(pmat);
}</code></pre>
<pre class="r"><code>x01 <- convert01(xRx, totalSchools)
# show some rows
x01[sampleRows,]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0 1 1 0 1 1 1 1 0 1 0 1
## [2,] 0 0 0 0 1 1 1 1 0 0 0 0
## [3,] 1 0 1 1 1 0 1 0 1 0 0 1
## [4,] 0 1 1 1 1 0 0 0 1 0 1 0
## [5,] 0 0 1 0 1 1 1 1 0 1 1 1
## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 0 1 0 0 0 0 1 0
## [2,] 1 1 1 1 1 1 0 0
## [3,] 0 0 1 0 1 0 0 1
## [4,] 0 1 1 0 0 0 1 1
## [5,] 0 0 1 1 0 0 0 0</code></pre>
<p>Because the evaluation code is so inefficient, I draw 1,000 rows at random from this “intervention” matrix <code>x01</code> (after converting it to a data.table).</p>
<pre class="r"><code># convert matrix to data.table
d01 <- data.table(x01)
d01[, id := .I]
ids <- sample(nrow(d01), 1000, replace = FALSE)
sampleD01 <- d01[id %in% ids]</code></pre>
<p>Now we are ready to evaluate each of the 1,000 schemes. As I mentioned before, this approach is highly inefficient as the algorithm requires us to literally loop through each each combination to find the balanced ones. I have sacrificed efficiency and speed for clarity of code (I hope).</p>
<pre class="r"><code>for (i in 1:1000) {
dt[, grp:= t(sampleD01[i,1:20])]
dx <- dt[ , .N, keyby = .(language, grp)]
dc <- dcast(dx, language ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
# we declare a scheme balanced if counts differ by
# no more than 1 school
sampleD01[i, language := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(poverty, grp)]
dc <- dcast(dx, poverty ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, poverty := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(location, grp)]
dc <- dcast(dx, location ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, location := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(size, grp)]
dc <- dcast(dx, size ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, size := (sum(dc[, diff > 1]) == 0)]
}</code></pre>
<p>The final determination of balance is made if a scheme is balanced across all four characteristics. In this case, 136 of the 1,000 schemes were balanced based on this criterion:</p>
<pre class="r"><code>sampleD01[, balanced := all(language, poverty, location, size), keyby = id]
# proportion of sampled combinations that are balanced ...
sampleD01[,mean(balanced)]</code></pre>
<pre><code>## [1] 0.136</code></pre>
<p>And let’s inspect the actual balance of two randomly selected schemes - one which is balanced, and one which is not:</p>
<pre class="r"><code>sTrue <- sampleD01[balanced == TRUE]
sFalse <- sampleD01[balanced == FALSE]</code></pre>
<div id="a-balanced-scheme" class="section level3">
<h3>A balanced scheme</h3>
<pre class="r"><code>dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sTrue[sample(.N, 1), 1:20]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 7
## 2: 0 1 7
## 3: 1 0 3
## 4: 1 1 3</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 9
## 2: 0 1 8
## 3: 1 0 1
## 4: 1 1 2</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 4
## 2: 0 1 5
## 3: 1 0 6
## 4: 1 1 5</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 3
## 2: 0 1 4
## 3: 1 0 7
## 4: 1 1 6</code></pre>
</div>
<div id="an-unbalanced-scheme" class="section level3">
<h3>An unbalanced scheme</h3>
<p>In this case, language and location are imbalanced, though size and poverty are fine.</p>
<pre class="r"><code>dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFalse[sample(.N, 1), 1:20]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 8
## 2: 0 1 6
## 3: 1 0 2
## 4: 1 1 4</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 8
## 2: 0 1 9
## 3: 1 0 2
## 4: 1 1 1</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 3
## 2: 0 1 6
## 3: 1 0 7
## 4: 1 1 4</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 4
## 2: 0 1 3
## 3: 1 0 6
## 4: 1 1 7</code></pre>
</div>
</div>
<div id="fast-implementation-with-data.table-and-rcpp" class="section level2">
<h2>Fast implementation with data.table and Rcpp</h2>
<p>As I alluded to before, if we want to implement this in the real world, it would be preferable to use code that does not bog down when we want to search 100,000+ possible randomization schemes. I have written a set of <code>R</code> and <code>Rcpp</code> functions the facilitate this. (Code is available <a href="https://github.com/kgoldfeld/RDataGenBlog/tree/master/static/img/post-balance">here</a>.)</p>
<pre class="r"><code># generate all possible schemes
xperm <- xPerms(totalSchools, rxSchools, N=NULL)
nrow(xperm)</code></pre>
<pre><code>## [1] 184756</code></pre>
<pre class="r"><code>xperm[sample(nrow(xperm), 5, replace = FALSE)]</code></pre>
<pre><code>## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 1: 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0
## 2: 1 1 0 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0
## 3: 1 0 1 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0
## 4: 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0 0
## 5: 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0
## V20 id
## 1: 1 94784
## 2: 0 19535
## 3: 0 61644
## 4: 0 14633
## 5: 0 35651</code></pre>
<pre class="r"><code># prepare data for evaluation
dtMat <- as.matrix(dt[,-1])
cc <- parse(text=attr(xperm, "varlist"))
cc</code></pre>
<pre><code>## expression(c(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12,
## V13, V14, V15, V16, V17, V18, V19, V20))</code></pre>
<pre class="r"><code># evaluate each combination
sF <- xperm[, cppChk(eval(cc), dtMat), keyby = id]
sF[sample(nrow(sF), 5, replace = FALSE)]</code></pre>
<pre><code>## id V1
## 1: 15924 FALSE
## 2: 68284 FALSE
## 3: 149360 FALSE
## 4: 62924 FALSE
## 5: 14009 TRUE</code></pre>
<pre class="r"><code># keep only the balanced schemes
sFinal <- xperm[sF$V1]
nrow(sFinal)</code></pre>
<pre><code>## [1] 7742</code></pre>
<pre class="r"><code># randomize from the balanced schemes
selectRow <- sample(nrow(sFinal), 1)
# check balance of randomized scheme
dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFinal[selectRow, -"id"]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 7
## 2: 0 1 7
## 3: 1 0 3
## 4: 1 1 3</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 9
## 2: 0 1 8
## 3: 1 0 1
## 4: 1 1 2</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 5
## 2: 0 1 4
## 3: 1 0 5
## 4: 1 1 6</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 3
## 2: 0 1 4
## 3: 1 0 7
## 4: 1 1 6</code></pre>
</div>
Copulas and correlated data generation: getting beyond the normal distribution
https://www.rdatagen.net/post/correlated-data-copula/
Mon, 19 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/correlated-data-copula/<p>Using the <code>simstudy</code> package, it’s possible to generate correlated data from a normal distribution using the function <em>genCorData</em>. I’ve wanted to extend the functionality so that we can generate correlated data from other sorts of distributions; I thought it would be a good idea to begin with binary and Poisson distributed data, since those come up so frequently in my work. <code>simstudy</code> can already accommodate more general correlated data, but only in the context of a random effects data generation process. This might not be what we want, particularly if we are interested in explicitly generating data to explore marginal models (such as a GEE model) rather than a conditional random effects model (a topic I explored in my <a href="https://www.rdatagen.net/post/marginal-v-conditional/">previous</a> discussion). The extension can quite easily be done using <em>copulas</em>.</p>
<p>Based on <a href="https://en.wikipedia.org/wiki/Copula_%28probability_theory%29">this</a> definition, a copula is a “multivariate probability distribution for which the marginal probability distribution of each variable is uniform.” It can be shown that <span class="math inline">\(U\)</span> is uniformly distributed if <span class="math inline">\(U=F(X)\)</span>, where <span class="math inline">\(F\)</span> is the CDF of a continuous random variable <span class="math inline">\(X\)</span>. Furthermore, if we can generate a multivariate <span class="math inline">\(\mathbf{X}\)</span>, say <span class="math inline">\((X_1, X_2, ..., X_k)\)</span> with a known covariance or correlation structure (e.g. exchangeable, auto-regressive, unstructured), it turns that the corresponding multivariate <span class="math inline">\(\mathbf{U}, (U_1, U_2, ..., U_k)\)</span> will maintain that structure. And in a final step, we can transform <span class="math inline">\(\mathbf{U}\)</span> to another random variable <span class="math inline">\(\mathbf{Y}\)</span> that has a target distribution by applying the inverse CDF <span class="math inline">\(F_i^{-1}(U_i)\)</span> of that target distribution to each <span class="math inline">\(U_i\)</span>. Since we can generate a multivariate normal <span class="math inline">\(\mathbf{X}\)</span>, it is relatively short leap to implement this copula algorithm in order to generate correlated data from other distributions.</p>
<div id="implementing-the-copula-algorithm-in-r" class="section level2">
<h2>Implementing the copula algorithm in R</h2>
<p>While this hasn’t been implemented just yet in <code>simstudy</code>, this is along the lines of what I am thinking:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
set.seed(555)
# Generate 1000 observations of 4 RVs from a multivariate normal
# dist - each N(0,1) - with a correlation matrix where rho = 0.4
dt <- genCorData(1000, mu = c(0, 0, 0, 0), sigma = 1,
rho = 0.4, corstr = "cs" )
dt</code></pre>
<pre><code>## id V1 V2 V3 V4
## 1: 1 -1.1667574 -0.05296536 0.2995360 -0.5232691
## 2: 2 0.4505618 0.57499589 -0.9629426 1.5495697
## 3: 3 -0.1294505 1.68372035 1.1309223 0.4205397
## 4: 4 0.0858846 1.27479473 0.4247491 0.1054230
## 5: 5 0.4654873 3.05566796 0.5846449 1.0906072
## ---
## 996: 996 0.3420099 -0.35783480 -0.8363306 0.2656964
## 997: 997 -1.0928169 0.50081091 -0.8915582 -0.7428976
## 998: 998 0.7490765 -0.09559294 -0.2351121 0.6632157
## 999: 999 0.8143565 -1.00978384 0.2266132 -1.2345192
## 1000: 1000 -1.9795559 -0.16668454 -0.5883966 -1.7424941</code></pre>
<pre class="r"><code>round(cor(dt[,-1]), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.41 0.36 0.44
## V2 0.41 1.00 0.33 0.42
## V3 0.36 0.33 1.00 0.35
## V4 0.44 0.42 0.35 1.00</code></pre>
<pre class="r"><code>### create a long version of the data set
dtM <- melt(dt, id.vars = "id", variable.factor = TRUE,
value.name = "X", variable.name = "seq")
setkey(dtM, "id") # sort data by id
dtM[, seqid := .I] # add index for each record
### apply CDF to X to get uniform distribution
dtM[, U := pnorm(X)]
### Generate correlated Poisson data with mean and variance 8
### apply inverse CDF to U
dtM[, Y_pois := qpois(U, 8), keyby = seqid]
dtM</code></pre>
<pre><code>## id seq X seqid U Y_pois
## 1: 1 V1 -1.16675744 1 0.12165417 5
## 2: 1 V2 -0.05296536 2 0.47887975 8
## 3: 1 V3 0.29953603 3 0.61773446 9
## 4: 1 V4 -0.52326909 4 0.30039350 6
## 5: 2 V1 0.45056179 5 0.67384729 9
## ---
## 3996: 999 V4 -1.23451924 3996 0.10850474 5
## 3997: 1000 V1 -1.97955591 3997 0.02387673 3
## 3998: 1000 V2 -0.16668454 3998 0.43380913 7
## 3999: 1000 V3 -0.58839655 3999 0.27813308 6
## 4000: 1000 V4 -1.74249414 4000 0.04071101 3</code></pre>
<pre class="r"><code>### Check mean and variance of Y_pois
dtM[, .(mean = round(mean(Y_pois), 1),
var = round(var(Y_pois), 1)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 8.0 8.2
## 2: V2 8.1 8.5
## 3: V3 8.1 7.6
## 4: V4 8.0 7.9</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_pois's - I know this code is a bit ugly
### but I just wanted to get the correlation matrix quickly.
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_pois")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.40 0.37 0.43
## V2 0.40 1.00 0.33 0.40
## V3 0.37 0.33 1.00 0.35
## V4 0.43 0.40 0.35 1.00</code></pre>
<p>The correlation matrices for <span class="math inline">\(\mathbf{X}\)</span> and <span class="math inline">\(\mathbf{Y_{Pois}}\)</span> aren’t too far off.</p>
<p>Here are the results for an auto-regressive (AR-1) correlation structure. (I am omitting some of the code for brevity’s sake):</p>
<pre class="r"><code># Generate 1000 observations of 4 RVs from a multivariate normal
# dist - each N(0,1) - with a correlation matrix where rho = 0.4
dt <- genCorData(1000, mu = c(0, 0, 0, 0), sigma = 1,
rho = 0.4, corstr = "ar1" )
round(cor(dt[,-1]), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.43 0.18 0.12
## V2 0.43 1.00 0.39 0.13
## V3 0.18 0.39 1.00 0.38
## V4 0.12 0.13 0.38 1.00</code></pre>
<pre class="r"><code>### Check mean and variance of Y_pois
dtM[, .(mean = round(mean(Y_pois), 1),
var = round(var(Y_pois), 1)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 8.1 8.3
## 2: V2 7.9 7.8
## 3: V3 8.0 8.4
## 4: V4 8.0 7.5</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_pois's
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_pois")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.41 0.18 0.13
## V2 0.41 1.00 0.39 0.14
## V3 0.18 0.39 1.00 0.36
## V4 0.13 0.14 0.36 1.00</code></pre>
<p>Again - comparing the two correlation matrices - the original normal data, and the derivative Poisson data - suggests that this can work pretty well.</p>
<p>Using the last data set, I fit a GEE model to see how well the data generating process is recovered:</p>
<pre class="r"><code>library(geepack)
geefit <- geepack::geeglm(Y_pois ~ 1, data = dtM, family = poisson,
id = id, corstr = "ar1")
summary(geefit)</code></pre>
<pre><code>##
## Call:
## geepack::geeglm(formula = Y_pois ~ 1, family = poisson, data = dtM,
## id = id, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 2.080597 0.007447 78060 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.9984 0.02679
##
## Correlation: Structure = ar1 Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.3987 0.02008
## Number of clusters: 1000 Maximum cluster size: 4</code></pre>
<p>In the GEE output, alpha is an estimate of <span class="math inline">\(\rho\)</span>. The estimated alpha is 0.399, quite close to 0.40, the original value used to generate the normally distributed data.</p>
</div>
<div id="binary-outcomes" class="section level2">
<h2>Binary outcomes</h2>
<p>We can also generate binary data:</p>
<pre class="r"><code>### Generate binary data with p=0.5 (var = 0.25)
dtM[, Y_bin := qbinom(U, 1, .5), keyby = seqid]
dtM</code></pre>
<pre><code>## id seq X seqid U Y_pois Y_bin
## 1: 1 V1 1.7425 1 0.959288 13 1
## 2: 1 V2 1.4915 2 0.932086 12 1
## 3: 1 V3 0.7379 3 0.769722 10 1
## 4: 1 V4 -1.6581 4 0.048644 4 0
## 5: 2 V1 2.3262 5 0.989997 15 1
## ---
## 3996: 999 V4 -0.3805 3996 0.351772 7 0
## 3997: 1000 V1 -0.8724 3997 0.191505 6 0
## 3998: 1000 V2 -1.0085 3998 0.156600 5 0
## 3999: 1000 V3 -2.0451 3999 0.020420 3 0
## 4000: 1000 V4 -2.7668 4000 0.002831 1 0</code></pre>
<pre class="r"><code>### Check mean and variance of Y_bin
dtM[, .(mean = round(mean(Y_bin), 2),
var = round(var(Y_bin), 2)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 0.52 0.25
## 2: V2 0.50 0.25
## 3: V3 0.48 0.25
## 4: V4 0.49 0.25</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_bin's
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_bin")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.29 0.10 0.05
## V2 0.29 1.00 0.27 0.03
## V3 0.10 0.27 1.00 0.23
## V4 0.05 0.03 0.23 1.00</code></pre>
<p>The binary data are correlated, but the correlation coefficient doesn’t replicate as well as the Poisson distribution. While both the Poisson and binary CDF’s are discontinuous, the extreme jump in the binary CDF leads to this discrepancy. Values that are relatively close to each other on the normal scale, and in particular on the uniform scale, can be ‘sent’ to opposite ends of the binary scale (that is to 0 and to 1) if they straddle the cutoff point <span class="math inline">\(p\)</span> (the probability of the outcome in the binary distribution); values similar in the original data are very different in the target data. This bias is partially attenuated by values far apart on the uniform scale yet falling on the same side of <span class="math inline">\(p\)</span> (both driven to 0 or both to 1); in this case values different in the original data are similar (actually identical) in the target data.</p>
<p>The series of plots below show bivariate data for the original multivariate normal data, and the corresponding uniform, Poisson, and binary data. We can see the effect of extreme discontinuity of the binary data. (R code available <a href="https://github.com/kgoldfeld/RDataGenBlog/blob/master/static/img/post-copula/Check%20rho%20for%20dists.R">here</a>.)</p>
<p><img src="https://www.rdatagen.net/post/2017-06-19-generating-correlated-data-using-a-copula-approach_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="some-simulation-results" class="section level2">
<h2>Some simulation results</h2>
<p>A series of simulations shows how well the estimates of <span class="math inline">\(\rho\)</span> compare across a set of different assumptions. In each of the plots below, we see how <span class="math inline">\(\rho\)</span> for the non-normal data changes as a function of <span class="math inline">\(\rho\)</span> from the original normally distributed data. For each value of <span class="math inline">\(\rho\)</span>, I varied the parameter of the non-normal distribution (in the case of the binary data, I varied the probability of the outcome; in the case of the Poisson data, I varied the parameter <span class="math inline">\(\lambda\)</span> which defines the mean and variance). I also considered both covariance structures, exchangeable and ar-1. (R code available <a href="https://github.com/kgoldfeld/RDataGenBlog/blob/master/static/img/post-copula/Copula%20data%20generation.R">here</a>.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-copula/dists.png" />
</div>
<p>These simulations confirm what we saw earlier. The Poisson data generating process recovers the original <span class="math inline">\(\rho\)</span> under both covariance structures reasonably well. The binary data generating process is less successful, with the exchangeable structure doing slightly better than then auto-regressive structure.</p>
<p>Hopefully soon, this will be implemented in <code>simstudy</code> so that we can generate data from more general distributions with a single function call.</p>
</div>
When marginal and conditional logistic model estimates diverge
https://www.rdatagen.net/post/marginal-v-conditional/
Fri, 09 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/marginal-v-conditional/<STYLE TYPE="text/css">
<!--
td{
font-family: Arial;
font-size: 9pt;
height: 2px;
padding:0px;
cellpadding="0";
cellspacing="0";
text-align: center;
}
th {
font-family: Arial;
font-size: 9pt;
height: 20px;
font-weight: bold;
text-align: center;
}
table {
border-spacing: 0px;
border-collapse: collapse;
}
--->
</STYLE>
<p>Say we have an intervention that is assigned at a group or cluster level but the outcome is measured at an individual level (e.g. students in different schools, eyes on different individuals). And, say this outcome is binary; that is, something happens, or it doesn’t. (This is important, because none of this is true if the outcome is continuous and close to normally distributed.) If we want to measure the <em>effect</em> of the intervention - perhaps the risk difference, risk ratio, or odds ratio - it can really matter if we are interested in the <em>marginal</em> effect or the <em>conditional</em> effect, because they likely won’t be the same.</p>
<p>My aim is to show this through a couple of data simulations that allow us to see this visually.</p>
<div id="first-example" class="section level3">
<h3>First example</h3>
<p>In the first scenario, I am going to use a <em>causal inference</em> framework that uses the idea that everyone has a potential outcome under one exposure (such as an intervention of some sort), and another potential outcome under a different exposure (such as treatment as usual or control). (I may discuss potential outcomes and causal inference in more detail in the future.) The potential outcome can be written with a superscript, lie <span class="math inline">\(Y^0\)</span> or <span class="math inline">\(Y^1\)</span>.</p>
<p>To generate the data, I will use this simple model for each potential outcome: <span class="math display">\[ log\left[\frac{P(Y^0_{ij})}{1-P(Y^0_{ij})}\right] = \gamma + \alpha_i\]</span></p>
<p>and <span class="math display">\[ log\left[\frac{P(Y^1_{ij})}{1-P(Y^1_{ij})}\right] = \gamma + \alpha_i + \delta.\]</span> <span class="math inline">\(\delta\)</span> is the treatment effect and is constant across the clusters, on the log-odds scale. <span class="math inline">\(\alpha_i\)</span> is the cluster specific effect for cluster <span class="math inline">\(i\)</span>. <span class="math inline">\(Y^a_{ij}\)</span> is the potential outcome for individual <span class="math inline">\(j\)</span> under exposure <span class="math inline">\(a\)</span>.</p>
<p>Now let’s generate some data and look at it:</p>
<pre class="r"><code># Define data
def1 <- defData(varname = "clustEff", formula = 0, variance = 2,
id = "cID")
def1 <- defData(def1, varname = "nInd", formula = 10000,
dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "Y0", formula = "-1 + clustEff",
dist = "binary", link = "logit")
def2 <- defDataAdd(def2, varname = "Y1",
formula = "-1 + clustEff + 2",
dist = "binary", link = "logit")
options(width = 80)
def1</code></pre>
<pre><code>## varname formula variance dist link
## 1: clustEff 0 2 normal identity
## 2: nInd 10000 0 noZeroPoisson identity</code></pre>
<pre class="r"><code>def2</code></pre>
<pre><code>## varname formula variance dist link
## 1: Y0 -1 + clustEff 0 binary logit
## 2: Y1 -1 + clustEff + 2 0 binary logit</code></pre>
<pre class="r"><code># Generate cluster level data
set.seed(123)
dtC <- genData(n = 100, def1)
# Generate individual level data
dt <- genCluster(dtClust = dtC, cLevelVar = "cID", numIndsVar = "nInd",
level1ID = "id")
dt <- addColumns(def2, dt)</code></pre>
<p>Since we have repeated measurements for each cluster (the two potential outcomes), we can transform this into a “longitudinal” data set, though the periods are not time but different exposures.</p>
<pre class="r"><code>dtLong <- addPeriods(dtName = dt, idvars = c("id","cID"),
nPeriods = 2,timevars = c("Y0","Y1"),
timevarName = "Y"
)</code></pre>
<p>When we look at the data visually, we get a hint that the marginal (or average) effect might not be the same as the conditional (cluster-specific) effects.</p>
<pre class="r"><code># Calculate average potential outcomes by exposure (which is called period)
dtMean <- dtLong[, .(Y = mean(Y)), keyby = .(period, cID)] # conditional mean
dtMMean <- dtLong[, .(Y = mean(Y)), keyby = .(period)] # marginal mean
dtMMean[, cID := 999]
ggplot(data = dtMean, aes(x=factor(period), y = Y, group= cID)) +
# geom_jitter(width= .25, color = "grey75") +
geom_line(color = "grey75", position=position_jitter(w=0.02, h=0.02)) +
geom_point(data=dtMMean) +
geom_line(data=dtMMean, size = 1, color = "red") +
ylab("Estimated cluster probability") +
scale_y_continuous(labels = scales::percent) +
theme(axis.title.x = element_blank()) +
my_theme()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-2-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Looking at the plot, we see that the slopes of the grey lines - each representing the change in probability as a result of the exposure for each cluster - vary quite a bit. When the probability without exposure (<span class="math inline">\(Y^0\)</span>) is particularly low or high, the absolute effect of the intervention is small (the slope is minimal). The slope or absolute effect increases when the starting probability is closer to 50%. The red line represents the averages of <span class="math inline">\(Y^0\)</span> and <span class="math inline">\(Y^1\)</span> across all individuals in all clusters. There is no reason to believe that the average slope of the grey lines is the same as the slope of the red line, which is slope of the averages. We will see that more clearly with the next data generation scenario.</p>
<p>Finally, if we look at cluster-specific effects of exposure, we see that on the risk difference scale (difference in probabilities), there is much variation, but on the log-odds ratio scale there is almost no variation. This is as it should be, because on the log-odds scale (which is how we generated the data), the difference between exposure and non-exposure is additive. On the probability scale, the difference is multiplicative. Here are some estimated differences for a sample of clusters:</p>
<pre class="r"><code>dtChange <- dt[, .(Y0 = mean(Y0), Y1 = mean(Y1)), keyby = cID]
dtChange[, riskdiff := round(Y1 - Y0, 2)]
dtChange[, loratio := round( log( (Y1 / (1-Y1)) / (Y0 / (1-Y0) )), 2)]
dtChange[sample(1:100, 10, replace = F),
.(Y0 = round(Y0,2), Y1 = round(Y1,2), riskdiff, loratio),
keyby=cID]</code></pre>
<pre><code>## cID Y0 Y1 riskdiff loratio
## 1: 18 0.02 0.14 0.12 1.95
## 2: 19 0.50 0.89 0.39 2.06
## 3: 22 0.22 0.67 0.45 1.97
## 4: 24 0.12 0.49 0.37 1.94
## 5: 30 0.69 0.94 0.26 2.00
## 6: 31 0.40 0.83 0.42 1.95
## 7: 34 0.56 0.90 0.35 1.99
## 8: 38 0.26 0.72 0.46 2.01
## 9: 72 0.01 0.09 0.08 1.97
## 10: 99 0.22 0.66 0.44 1.93</code></pre>
</div>
<div id="second-example" class="section level2">
<h2>Second example</h2>
<p>This time around, we will add an additional individual level covariate that will help us visualize the difference a bit more clearly. Let us say that <em>age</em> is positively associated with increased probability in the outcome. (In this case, we measured age and then normalized it so that the mean age in the sample is 0.) And this time around, we are not going to use potential outcomes, but will randomly assign clusters to an intervention or treatment group.</p>
<p>This is the data generating model and the code:</p>
<p><span class="math display">\[ log\left[\frac{P(Y_{ij})}{1-P(Y_{ij})}\right] = \gamma + \alpha_j + \beta_1*Trt_j + \beta_2*Age_i\]</span></p>
<pre class="r"><code>def1 <- defData(varname = "clustEff", formula = 0, variance = 2, id = "cID")
def1 <- defData(def1, varname = "nInd", formula = 100, dist = "noZeroPoisson")
# Each individual now has a measured age
def2 <- defDataAdd(varname = "age", formula = 0, variance = 2)
def2 <- defDataAdd(def2, varname = "Y",
formula = "-4 + clustEff + 2*trt + 2*age",
dist = "binary", link = "logit")
# Generate cluster level data
dtC <- genData(200, def1)
dtC <- trtAssign(dtC, grpName = "trt") #
# Generate individual level data
dt <- genCluster(dtClust = dtC, cLevelVar = "cID", numIndsVar = "nInd",
level1ID = "id")
dt <- addColumns(def2, dt)</code></pre>
<p>By fitting a conditional model (generalized linear mixed effects model) and a marginal model (we should fit a generalized estimating equation model to get the proper standard error estimates, but will estimate a generalized linear model, because the GEE model does not have a “predict” option in R; the point estimates for both marginal models should be quite close), we can see that indeed the conditional and marginal averages can be quite different.</p>
<pre class="r"><code>glmerFit1 <- glmer(Y ~ trt + age + (1 | cID), data = dt, family="binomial")
glmFit1 <- glm(Y ~ trt + age, family = binomial, data = dt)</code></pre>
<pre><code>## Intercept Trt Age
## conditional model -3.82 1.99 2.01
## marginal model -2.97 1.60 1.54</code></pre>
<p>Now, we’d like to visualize how the conditional and marginal treatment effects diverge. We can use the model estimates from the conditional model to predict probabilities for each cluster, age, and treatment group. (These will appear as grey lines in the plots below). We can also predict marginal probabilities from the marginal model based on age and treatment group while ignoring cluster. (These marginal estimates appear as red lines.) Finally, we can predict probability of outcomes for the conditional model also based on age and treatment group alone, but fixed at a mythical cluster whose random effect is 0. (These “average” conditional estimates appear as black lines.)</p>
<pre class="r"><code>newCond <- expand.grid(cID = unique(dt$cID), age=seq(-4, 4, by =.1))
newCond0 <- data.table(trt = 0, newCond)
newCond1 <- data.table(trt = 1, newCond)
newMarg0 <- data.table(trt = 0, age = seq(-4, 4, by = .1))
newMarg1 <- data.table(trt = 1, age = seq(-4, 4, by = .1))
newCond0[, pCond0 := predict(glmerFit1, newCond0, type = "response")]
newCond1[, pCond1 := predict(glmerFit1, newCond1, type = "response")]
newMarg0[, pMarg0 := predict(glmFit1, newMarg0, type = "response")]
newMarg0[, pCAvg0 := predict(glmerFit1, newMarg0[,c(1,2)],
re.form = NA, type="response")]
newMarg1[, pMarg1 := predict(glmFit1, newMarg1, type = "response")]
newMarg1[, pCAvg1 := predict(glmerFit1, newMarg1[,c(1,2)],
re.form = NA, type="response")]
dtAvg <- data.table(age = newMarg1$age,
avgMarg = newMarg1$pMarg1 - newMarg0$pMarg0,
avgCond = newMarg1$pCAvg1 - newMarg0$pCAvg0
)
p1 <- ggplot(aes(x = age, y = pCond1), data=newCond1) +
geom_line(color="grey", aes(group = cID)) +
geom_line(data=newMarg1, aes(x = age, y = pMarg1), color = "red", size = 1) +
geom_line(data=newMarg1, aes(x = age, y = pCAvg1), color = "black", size = 1) +
ggtitle("Treatment group") +
xlab("Age") +
ylab("Probability") +
my_theme()
p0 <- ggplot(aes(x = age, y = pCond0), data=newCond0) +
geom_line(color="grey", aes(group = cID)) +
geom_line(data=newMarg0, aes(x = age, y = pMarg0), color = "red", size = 1) +
geom_line(data=newMarg0, aes(x = age, y = pCAvg0), color = "black", size = 1) +
ggtitle("Control group") +
xlab("Age") +
ylab("Probability") +
my_theme()
pdiff <- ggplot(data = dtAvg) +
geom_line(aes(x = age, y = avgMarg), color = "red", size = 1) +
geom_line(aes(x = age, y = avgCond), color = "black", size = 1) +
ggtitle("Risk difference") +
xlab("Age") +
ylab("Probability") +
my_theme()
grid.arrange(p1, p0, pdiff)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /></p>
<p>We see pretty clearly across all ages that the marginal and conditional estimates of average treatment differences differ quite dramatically.</p>
<p>Below are point estimates and plots for data generated with very little variance across clusters, that is <span class="math inline">\(var(\alpha_i)\)</span> is close to 0. (We change this in the simulation by setting <code>def1 <- defData(varname = "clustEff", formula = 0, variance = 0.05, id = "cID")</code>.)</p>
<pre><code>## Intercept Trt Age
## conditional model -4.03 2.08 2.01
## marginal model -4.00 2.07 1.99</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" /></p>
<p>The black lines obscure the red - the marginal model estimate is not much different from the conditional model estimate - because the variance across clusters is negligible.</p>
</div>
It can be easy to explore data generating mechanisms with the simstudy package
https://www.rdatagen.net/post/intro-to-simstudy/
Tue, 16 May 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/intro-to-simstudy/<STYLE TYPE="text/css">
<!--
td{
font-family: Arial;
font-size: 9pt;
height: 2px;
padding:0px;
cellpadding="0";
cellspacing="0";
text-align: center;
}
th {
font-family: Arial;
font-size: 9pt;
height: 20px;
font-weight: bold;
text-align: center;
}
table {
border-spacing: 0px;
border-collapse: collapse;
}
--->
</STYLE>
<p>I learned statistics and probability by simulating data. Sure, I did the occasional proof, but I never believed the results until I saw it in a simulation. I guess I have it backwards, but I that’s just the way I am. And now that I am a so-called professional, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure - I’ll use the occasional formula when one exists, but I always feel the need to check it with simulation. It’s just the way I am.</p>
<p>Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called <a href = https://cran.r-project.org/web/packages/simstudy/index.html>simstudy</a>, which obviously means <em>simulating study data</em>. The purpose here in this blog entyr is to introduce the basic idea behind simstudy, and provide a relatively brief example that actually comes from a question a user posed about generating correlated longitudinal data.</p>
<div id="the-basic-idea" class="section level2">
<h2>The basic idea</h2>
<p>Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved, extending to allow simulators to generate observed or randomized <em>treatment assignment/exposures</em>, <em>survival</em> data, <em>longitudinal/panel</em> data, <em>multi-level/hierarchical</em> data, datasets with <em>correlated variables</em> based on a specified covariance structure, and to data sets with <em>missing</em> data based on a variety of missingness patterns.</p>
<p>The key to simulating data in simstudy is the creation of series of data defintion tables that look like this:</p>
<table>
<thead>
<tr class="header">
<th align="left">varname</th>
<th align="left">formula</th>
<th align="right">variance</th>
<th align="left">dist</th>
<th align="left">link</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">nr</td>
<td align="left">7</td>
<td align="right">0</td>
<td align="left">nonrandom</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">x1</td>
<td align="left">10;20</td>
<td align="right">0</td>
<td align="left">uniform</td>
<td align="left">identity</td>
</tr>
<tr class="odd">
<td align="left">y1</td>
<td align="left">nr + x1 * 2</td>
<td align="right">8</td>
<td align="left">normal</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">y2</td>
<td align="left">nr - 0.2 * x1</td>
<td align="right">0</td>
<td align="left">poisson</td>
<td align="left">log</td>
</tr>
<tr class="odd">
<td align="left">xCat</td>
<td align="left">0.3;0.2;0.5</td>
<td align="right">0</td>
<td align="left">categorical</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">g1</td>
<td align="left">5+xCat</td>
<td align="right">1</td>
<td align="left">gamma</td>
<td align="left">log</td>
</tr>
<tr class="odd">
<td align="left">a1</td>
<td align="left">-3 + xCat</td>
<td align="right">0</td>
<td align="left">binary</td>
<td align="left">logit</td>
</tr>
</tbody>
</table>
<p>Here’s the code that is used to generate this definition, which is stored as a <a href = https://github.com/Rdatatable/data.table/wiki>data.table </a>:</p>
<pre class="r"><code>def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1",
link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1,
link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat",
link = "logit")</code></pre>
<p>To create a simple data set based on these definitions, all one needs to do is execute a single <code>genData</code> command. In this example, we generate 500 records that are based on the definition in the <code>def</code>table:</p>
<pre class="r"><code>dt <- genData(500, def)
dt</code></pre>
<pre><code>## idnum nr x1 y1 y2 xCat g1 a1
## 1: 1 7 11.78709 30.24305 109 1 297.03396 0
## 2: 2 7 13.02129 30.58814 87 3 929.80474 0
## 3: 3 7 15.37784 37.31205 44 2 3738.52996 1
## 4: 4 7 18.66916 42.51506 24 2 78.17338 0
## 5: 5 7 10.55980 26.39898 131 2 147.42608 0
## ---
## 496: 496 7 19.74327 46.62290 24 1 15.49462 0
## 497: 497 7 18.41837 43.35002 24 1 201.66739 0
## 498: 498 7 18.64485 47.62767 26 1 215.74763 0
## 499: 499 7 12.07687 28.65423 105 3 864.30964 1
## 500: 500 7 11.37337 29.71361 132 2 149.21883 0</code></pre>
<p>There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want give a little more introduction by way of an example that came in from around the world a couple of days ago. (I’d say the best thing about building a package is hearing from folks literally all over the world and getting to talk to them about statistics and R. It is really incredible to be able to do that.)</p>
</div>
<div id="going-a-bit-further-simulating-a-prosepctive-cohort-study-with-repeated-measures" class="section level2">
<h2>Going a bit further: simulating a prosepctive cohort study with repeated measures</h2>
<p>The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? Of course.</p>
<p>This was what I sent back to my correspondent:</p>
<pre class="r"><code># Define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T",
variance = 3)
# Generate a 'blank' data.table with 24 observations and assign them to
# groups
set.seed(1234)
indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
# Create a longitudinal data set of 3 records for each id
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
# Let's look at the data
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>If we generate a data set based on 1,000 indviduals and estimate a linear regression model we see that the parameter estimates are quite good. However, my correspondent wrote back saying she wanted correlated data, which makes sense. We can see from the alpha estimate of approximately 0.02 (at the bottom of the output), we don’t have much correlation:</p>
<pre class="r"><code># Fit a GEE model to the data
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 4.98268 0.07227 4753.4 <2e-16 ***
## factor(T)Yes 1.48555 0.10059 218.1 <2e-16 ***
## period 2.53946 0.05257 2333.7 <2e-16 ***
## factor(T)Yes:period 3.51294 0.07673 2096.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2.952 0.07325
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.01737 0.01862
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
</div>
<div id="one-way-to-generate-correlated-data" class="section level2">
<h2>One way to generate correlated data</h2>
<p>The first way to approach this is to use the simstudy function <code>genCorData</code> to generate correlated errors that are normally distributed with mean 0, variance of 3, and and common correlation coeffcient of 0.7. This approach is a bit mysterious, because we are acknowledging that we don’t know what is driving the relationship between the three outcomes, just that they have a common cause.</p>
<pre class="r"><code># define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")
# define the correlated errors
mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)
# generate correlated data for each id and assign treatment
dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")
# create longitudinal data set and generate outcome based on definition
longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1",
"V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
# look at the data, outcomes should appear more correlated, lines a bit
# straighter
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>Again, we recover the true parameters. And this time, if we look at the estimated correlation, we see that indeed the outcomes are correlated within each indivdual. The estimate is 0.77, close to the our specified value of 0.7.</p>
<pre class="r"><code>fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 5.0636 0.0762 4411 <2e-16 ***
## factor(T)Yes 1.4945 0.1077 192 <2e-16 ***
## period 2.4972 0.0303 6798 <2e-16 ***
## factor(T)Yes:period 3.5204 0.0426 6831 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 3.07 0.117
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.711 0.0134
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
</div>
<div id="another-way-to-generate-correlated-data" class="section level2">
<h2>Another way to generate correlated data</h2>
<p>A second way to generate correlatd data is through an individual level random-effect or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). This random effect contributes equally to all instances of an individuals outcomes, but the outcomes for a particular individual deviate slightly from the hypothetical straight line as a result of unmeasured noise.</p>
<pre class="r"><code>ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect",
variance = 1)
indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
indData[1:6]</code></pre>
<pre><code>## id T randomEffect
## 1: 1 0 -1.3101
## 2: 2 1 0.3423
## 3: 3 0 0.5716
## 4: 4 1 2.6723
## 5: 5 0 -0.9996
## 6: 6 1 -0.0722</code></pre>
<pre class="r"><code>longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<pre class="r"><code>fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 4.9230 0.0694 5028 <2e-16 ***
## factor(T)Yes 1.4848 0.1003 219 <2e-16 ***
## period 2.5310 0.0307 6793 <2e-16 ***
## factor(T)Yes:period 3.5076 0.0449 6104 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2.63 0.0848
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.619 0.0146
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
<p>I sent all this back to my correspondent, but I haven’t heard yet if it is what she wanted. I certainly hope so. If there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.</p>
</div>
Everyone knows that loops in R are to be avoided, but vectorization is not always possible
https://www.rdatagen.net/post/first-blog-entry/
Wed, 10 May 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/first-blog-entry/<p>It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of observations, possibly over many observations. Two tools immediately came to mind: <a href = https://github.com/Rdatatable/data.table/wiki>data.table </a> and <a href = http://adv-r.had.co.nz/Rcpp.html>Rcpp </a>. This brief description explains the background of the simulation problem I was working on and walks through the evolution of ideas to address the problems I ran into when I tried to simulate a large number of inviduals. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation over night wasn’t enough.</p>
<div id="setting-up-the-problem" class="section level2">
<h2>Setting up the problem</h2>
<p>The task in question here is not the focus, but needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the <a href = https://cran.r-project.org/web/packages/simstudy/index.html>simstudy</a> package developed to help facilitate simulated data.</p>
<p>The functions <code>defDataAdd</code> and <code>genData</code> are both from <code>simstudy</code>. The first part of the simulation involves specifying the transition matrix <code>P</code> that determine a state I am calling <code>status</code>, and then defining the probability of an event that are based on a particular status level at a particular time point. For each individual, I generate 36 months of data and a status and event for each month.</p>
<pre class="r"><code>library(data.table)
library(simstudy)
set.seed(123)
P <- matrix(c(0.985, 0.015, 0.000, 0.000,
0.000, 0.950, 0.050, 0.000,
0.000, 0.000, 0.850, 0.150,
0.000, 0.000, 0.000, 1.000),
nrow = 4, byrow = TRUE)
form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"
dtDef <- defDataAdd(varname = "event",
formula = form,
dist = "binary",
link = "identity")
N = 5000
did <- genData(N)</code></pre>
<p>In order to simulate the Markov process, I decided immediately that <code>Rcpp</code> would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding, states need to be generated sequentially, which means no obvious way to vectorize (if someone has figured that out, let me know.)</p>
<pre class="cpp"><code>#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector MCsim( unsigned int nMonths, NumericMatrix P,
int startStatus, unsigned int startMonth ) {
IntegerVector sim( nMonths );
IntegerVector m( P.ncol());
NumericVector currentP;
IntegerVector newstate;
unsigned int q = P.ncol();
m = Rcpp::seq(1, q);
sim[startMonth - 1] = startStatus;
for (unsigned int i = startMonth; i < nMonths; i++) {
newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
sim(i) = newstate(0);
}
return sim;
}</code></pre>
<p>The process is simulated for each individual using the <strong>Rcpp</strong> function <code>MCsim</code>, but is done in the context of a <code>data.table</code> statement. The key here is that each individual is processed separately through the <code>keyby = id</code> statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.</p>
<pre class="r"><code>dt <- did[, .(status = MCsim(36, P, 1, 1)),
keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)
dt</code></pre>
<pre><code>## id status month event
## 1: 1 1 1 0
## 2: 1 1 2 0
## 3: 1 1 3 0
## 4: 1 1 4 0
## 5: 1 1 5 0
## ---
## 179996: 5000 4 32 0
## 179997: 5000 4 33 0
## 179998: 5000 4 34 0
## 179999: 5000 4 35 0
## 180000: 5000 4 36 0</code></pre>
</div>
<div id="this-is-where-things-begin-to-slow-down" class="section level2">
<h2>This is where things begin to slow down</h2>
<p>It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of <code>1</code>, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An indivdual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met.)</p>
<p>Again, the specifics of the simulation are not important here. What is important, is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long, and they must be repeated a large number of times.</p>
<p>My first, naïve, approach was to create an <strong>R</strong> function that loops through all the individuals and loops within each individual until a cohort is found:</p>
<pre class="r"><code>rAssignCohortID <- function(id, month, status,
event, nInds,
startMonth, thresholdNum) {
cohort <- rep(0, length(id));
for (j in (1 : nInds)) {
idMonth = month[id == j];
idEvent = event[id == j];
idStatus = status[id == j];
endMonth = length(idMonth);
done = FALSE;
i = max(startMonth - idMonth[1], 13);
while (i <= endMonth && !done) {
if (idEvent[i] == 1 && idStatus[i] > 1) {
begin = i-12;
end = i-1;
sumED = sum(idEvent[begin:end]);
if (sumED >= thresholdNum) {
cohort[id == j] <- i - 1 + month[1];
done = TRUE;
}
}
i = i + 1;
}
}
return(cohort);
} </code></pre>
</div>
<div id="working-through-possible-solutions" class="section level2">
<h2>Working through possible solutions</h2>
<pre class="r"><code>system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2)])</code></pre>
<pre><code>## user system elapsed
## 12.555 0.180 13.256</code></pre>
<p>The naïve approach works, but can we do better? I thought <strong>Rcpp</strong> might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id,
IntegerVector month,
IntegerVector status,
IntegerVector event,
int nInds,
int startMonth,
int thresholdNum) {
IntegerVector cohort(id.length(), 0);
IntegerVector idMonth;
IntegerVector idEvent;
IntegerVector idStatus;
for (int j = 0; j < nInds; j++) {
idMonth = month[id == j+1];
idEvent = event[id == j+1];
idStatus = status[id == j+1];
int endMonth = idMonth.length();
int sumED;
bool done = FALSE;
int i = std::max(startMonth - idMonth(0), 12);
int begin;
int end;
while (i < endMonth && !done) {
if (idEvent(i) == 1 && idStatus(i) > 1) {
begin = i-12;
end = i-1;
sumED = sum(idEvent[Rcpp::seq(begin, end)]);
if (sumED >= thresholdNum) {
cohort[id == j + 1] = i + month(0);
done = TRUE;
}
}
i += 1;
}
}
return(cohort);
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2)])</code></pre>
<pre><code>## user system elapsed
## 12.405 0.140 12.886</code></pre>
<p>I know that the function <code>cAssignCohortID</code> bogs down not in the loop, but in each phase where I need to subset the data set to work on a single <code>id</code>. For example, I need to execute the statement <code>idMonth = month[id == j+1]</code> for each <code>id</code>, and this apparently uses a lot of resources. I tried variations on this theme, alternatives to subset the data set within the Rcpp function, but could get no improvements.</p>
<p>But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things <code>data.table</code> is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual <code>id</code> at a time, and let <code>data.table</code> do the hard work of splitting up the data set to process by individual.</p>
<p>As you can see, things got markedly faster.</p>
<pre class="r"><code>rAssignCohort <- function(id, month, status, event,
nInds, startMonth, thresholdNum) {
cohort <- 0
endMonth = length(month);
done = FALSE;
i = max(startMonth - month[1], 13);
while (i <= endMonth && !done) {
if (event[i] == 1 && status[i] > 1) {
begin = i-12;
end = i-1;
sumED = sum(event[begin:end]);
if (sumED >= thresholdNum) {
cohort <- i - 1 + month[1];
done = TRUE;
}
}
i = i + 1;
}
return(cohort)
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort3 := rAssignCohort(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2),
keyby=id])</code></pre>
<pre><code>## user system elapsed
## 0.150 0.009 0.171</code></pre>
<p>Finally, it occurred to me that an <code>Rcpp</code> function that is not required to subset the data might offer more yet improvements in speed. So, for the last iteration, I combined the strengths of looping in <code>Rcpp</code> with the strengths of subsetting in <code>data.table</code> to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int cAssignCohort( IntegerVector month,
IntegerVector status,
IntegerVector event,
int startMonth, int thresholdNum) {
int endMonth = month.length();
int sumED;
int cohort = 0;
bool done = FALSE;
int i = std::max(startMonth - month(0), 12);
int begin;
int end;
while (i < endMonth && !done) {
if (event(i) == 1 && status(i) > 1) {
begin = i-12;
end = i-1;
sumED = sum(event[Rcpp::seq(begin, end)]);
if (sumED >= thresholdNum) {
cohort = i + month(0);
done = TRUE;
}
}
i += 1;
}
return(cohort);
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort4 := cAssignCohort(month, status, event,
startMonth=13, thresholdNum = 2), keyby=id])</code></pre>
<pre><code>## user system elapsed
## 0.029 0.004 0.035</code></pre>
<p>For a more robust comparison, let’s use the <code>benchmark</code> function in package <code>rbenchmark</code>, and you can see how well <code>data.table</code> performs and how much <code>Rcpp</code> can add when used efficiently.</p>
<pre class="r"><code>library(rbenchmark)
benchmark(
dt[, cohort1 := rAssignCohortID(id, month, status, event, # Naïve approach
nInds = N, startMonth = 13, thresholdNum = 2)],
dt[, cohort2 := cAssignCohortID(id, month, status, event, # Rcpp approach
nInds = N, startMonth = 13, thresholdNum = 2)],
dt[, cohort3 := rAssignCohort(id, month, status, event, # data.table approach
nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
dt[, cohort4 := cAssignCohort(month, status, event, # combined data.table/Rcpp
startMonth=13, thresholdNum = 2), keyby=id],
replications = 5,
columns = c("replications", "elapsed", "relative"))</code></pre>
<pre><code>## replications elapsed relative
## 1 5 52.578 395.323
## 2 5 66.528 500.211
## 3 5 0.752 5.654
## 4 5 0.133 1.000</code></pre>
</div>
<div id="postscript" class="section level2">
<h2>Postscript</h2>
<p>I shared all of this with the incredibly helpful folks who have created data.table, and they offered a data.table only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (R function with data.table statment <code>keyby</code>), it is still no match for the fastest solution. (But, this all just goes to show you there will always be new approaches to consider, and I don’t claim to have come any where near to trying them all out.)</p>
<pre class="r"><code>dtfunc <- function(dx) {
dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end),
.I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
minans <- ans[, .(I=min(I)), by=id]
dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]
return(dx)
}
system.time(dtfunc(dt))</code></pre>
<pre><code>## user system elapsed
## 0.195 0.011 0.210</code></pre>
<p>And here is a more complete comparison of the fastest version with this additional approach:</p>
<pre class="r"><code>benchmark(
dt[, cohort6 := cAssignCohort(month, status, event, # combined data.table/Rcpp
startMonth=13, thresholdNum = 2), keyby=id],
dt2 <- dtfunc(dt),
replications = 5,
columns = c("replications", "elapsed", "relative"))</code></pre>
<pre><code>## replications elapsed relative
## 1 5 0.112 1.000
## 2 5 0.899 8.027</code></pre>
</div>