ouR data generation
https://www.rdatagen.net/
Recent content on ouR data generationHugo -- gohugo.iokeith.goldfeld@nyumc.org (Keith Goldfeld)keith.goldfeld@nyumc.org (Keith Goldfeld)Sun, 18 Mar 2018 00:00:00 +0000Exploring the underlying theory of the chi-square test through simulation - part 1
https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/
Sun, 18 Mar 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/<p>Kids today are so sophisticated (at least they are in New York City, where I live). While I didn’t hear about the chi-square test of independence until my first stint in graduate school, they’re already talking about it in high school. When my kids came home and started talking about it, I did what I usually do when they come home asking about a new statistical concept. I opened up R and started generating some data. Of course, they rolled their eyes, but when the evening was done, I had something that might illuminate some of what underlies the theory of this ubiquitous test.</p>
<p>Actually, I created enough simulations to justify two posts - so this is just part 1, focusing on the <span class="math inline">\(\chi^2\)</span> distribution and its relationship to the Poisson distribution. Part 2 will consider contingency tables, where we are often interested in understanding the nature of the relationship between two categorical variables. More on that the next time.</p>
<div id="the-chi-square-distribution" class="section level3">
<h3>The chi-square distribution</h3>
<p>The chi-square (or <span class="math inline">\(\chi^2\)</span>) distribution can be described in many ways (for example as a special case of the Gamma distribution), but it is most intuitively characterized in relation to the standard normal distribution, <span class="math inline">\(N(0,1)\)</span>. The <span class="math inline">\(\chi^2_k\)</span> distribution has a single parameter <span class="math inline">\(k\)</span> which represents the <em>degrees of freedom</em>. If <span class="math inline">\(U\)</span> is standard normal, (i.e <span class="math inline">\(U \sim N(0,1)\)</span>), then <span class="math inline">\(U^2\)</span> has a <span class="math inline">\(\chi^2_1\)</span> distribution. If <span class="math inline">\(V\)</span> is also standard normal, then <span class="math inline">\((U^2 + V^2) \sim \chi^2_2\)</span>. That is, if we add two squared standard normal random variables, the distribution of the sum is chi-squared with 2 degrees of freedom. More generally, <span class="math display">\[\sum_{j=1}^k X^2_j \sim \chi^2_k,\]</span></p>
<p>where each <span class="math inline">\(X_j \sim N(0,1)\)</span>.</p>
<p>The following code defines a data set with two standard normal random variables and their sum:</p>
<pre class="r"><code>library(simstudy)
def <- defData(varname = "x", formula = 0, variance = 1, dist = "normal")
def <- defData(def, "chisq1df", formula = "x^2", dist = "nonrandom")
def <- defData(def, "y", formula = 0, variance = 1, dist = "normal")
def <- defData(def, "chisq2df",
formula = "(x^2) + (y^2)", dist = "nonrandom")
set.seed(2018)
dt <- genData(10000, def)
dt[1:5,]</code></pre>
<pre><code>## id x chisq1df y chisq2df
## 1: 1 -0.42298398 0.178915450 0.05378131 0.181807879
## 2: 2 -1.54987816 2.402122316 0.70312385 2.896505464
## 3: 3 -0.06442932 0.004151137 -0.07412058 0.009644997
## 4: 4 0.27088135 0.073376707 -1.09181873 1.265444851
## 5: 5 1.73528367 3.011209400 -0.79937643 3.650212075</code></pre>
<p>The standard normal has mean zero and variance one. Approximately 95% of the values will be expected to fall within two standard deviations of zero. Here is your classic “bell” curve:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Since the statistic <span class="math inline">\(X^2\)</span> (try not to confuse <span class="math inline">\(X^2\)</span> and <span class="math inline">\(\chi^2\)</span>, unfortunate I know) is the sum of the squares of a continuous random variable and is always greater or equal to zero, the <span class="math inline">\(\chi^2\)</span> is a distribution of positive, continuous measures. Here is a histogram of <code>chisq1df</code> from the data set <code>dt</code>, which has a <span class="math inline">\(\chi^2_1\)</span> distribution:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>And here is a plot of <code>chisq2df</code>, which has two degrees of freedom, and has a <span class="math inline">\(\chi^2_2\)</span> distribution. Unsurprisingly, since we are adding positive numbers, we start to see values further away from zero:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Just to show that the data we generated by adding two squared standard normal random variables is actually distributed as a <span class="math inline">\(\chi^2_2\)</span>, we can generate data from this distribution directly, and overlay the plots:</p>
<pre class="r"><code>actual_chisq2 <- rchisq(10000, 2)</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="recycling-and-the-poisson-distribution" class="section level3">
<h3>Recycling and the Poisson distribution</h3>
<p>When we talk about counts, we are often dealing with a Poisson distribution. An example I use below is the number of glass bottles that end up in an apartment building’s recycling bin every day (as I mentioned, I do live in New York City). The Poisson distribution is a non-negative, discrete distribution that is characterized by a single parameter <span class="math inline">\(\lambda\)</span>. If <span class="math inline">\(H \sim Poisson(\lambda)\)</span>, then <span class="math inline">\(E(H) = Var(H) = \lambda\)</span>.</p>
<pre class="r"><code>def <- defData(varname = "h", formula = 40, dist = "poisson")
dh <- genData(10000, def)
round(dh[, .(avg = mean(h), var = var(h))], 1)</code></pre>
<pre><code>## avg var
## 1: 40.1 40</code></pre>
<p>To standardize a <em>normally</em> distributed variable (such as <span class="math inline">\(W \sim N(\mu,\sigma^2)\)</span>), we subtract the mean and divide by the standard deviation:</p>
<p><span class="math display">\[ W_i^{s} = \frac{W_i - \mu}{\sigma},\]</span></p>
<p>and <span class="math inline">\(W^s \sim N(0,1)\)</span>. Analogously, to standardize a Poisson variable we do the same, since <span class="math inline">\(\lambda\)</span> is the mean and the variance:</p>
<p><span class="math display">\[ S_{i} = \frac{H_i - \lambda}{\sqrt{\lambda}}\]</span></p>
<p>The distribution of this standardized variable <span class="math inline">\(S\)</span> will be close to a standard normal. We can generate some data and check this out. In this case, the mean and variance of the Poisson variable is 40:</p>
<pre class="r"><code>defA <- defDataAdd(varname = "s", formula = "(h-40)/sqrt(40)",
dist = "nonrandom")
dh <- addColumns(defA, dh)
dh[1:5, ]</code></pre>
<pre><code>## id h s
## 1: 1 34 -0.9486833
## 2: 2 44 0.6324555
## 3: 3 37 -0.4743416
## 4: 4 46 0.9486833
## 5: 5 42 0.3162278</code></pre>
<p>The mean and variance of the standardized data do suggest a standardized normal distribution:</p>
<pre class="r"><code>round(dh[ , .(mean = mean(s), var = var(s))], 1)</code></pre>
<pre><code>## mean var
## 1: 0 1</code></pre>
<p>Overlaying the plots of the standardized poisson distribution with the standard normal distribution, we can see that they <em>are</em> quite similar:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Since the standardized Poisson is roughly standard normal, the square of the standardized Poisson should be roughly <span class="math inline">\(\chi^2_1\)</span>. If we square normalized Poisson, this is what we have:</p>
<p><span class="math display">\[ S_i^2 = \frac{(H_i - \lambda)^2}{\lambda}\]</span></p>
<p>Or maybe in a more familiar form (think Pearson):</p>
<p><span class="math display">\[ S_i^2 = \frac{(O_i - E_i)^2}{E_i},\]</span></p>
<p>where <span class="math inline">\(O_i\)</span> is the observed value and <span class="math inline">\(E_i\)</span> is the expected value. Since <span class="math inline">\(\lambda\)</span> is the expected value (and variance) of the Poisson random variable, the two formulations are equivalent.</p>
<p>Adding the transformed data to the data set, and calculating the mean and variance, it is apparent that these observations are close to a <span class="math inline">\(\chi^2_1\)</span> distribution:</p>
<pre class="r"><code>defA <- defDataAdd(varname = "h.chisq", formula = "(h-40)^2/40",
dist = "nonrandom")
dh <- addColumns(defA, dh)
round(dh[, .(avg = mean(h.chisq), var = var(h.chisq))], 2)</code></pre>
<pre><code>## avg var
## 1: 1 1.97</code></pre>
<pre class="r"><code>actual_chisq1 <- rchisq(10000, 1)
round(c(avg = mean(actual_chisq1), var = var(actual_chisq1)), 2)</code></pre>
<pre><code>## avg var
## 0.99 2.04</code></pre>
<p>Once again, an overlay of the two distributions based on the data we generated shows that this is plausible:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>Just for fun, let’s repeatedly generate 10 Poisson variables each with its own value of <span class="math inline">\(\lambda\)</span> and calculate <span class="math inline">\(X^2\)</span> for each iteration to compare with data generated from a <span class="math inline">\(\chi^2_{10}\)</span> distribution:</p>
<pre class="r"><code>nObs <- 10000
nMeasures <- 10
lambdas <- rpois(nMeasures, 50)
poisMat <- matrix(rpois(n = nMeasures*nObs, lambda = lambdas),
ncol = nMeasures, byrow = T)
poisMat[1:5,]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 48 51 49 61 59 51 67 35 43 39
## [2,] 32 58 49 67 57 35 69 40 57 55
## [3,] 44 50 60 56 57 49 68 49 48 32
## [4,] 44 44 42 49 52 50 63 39 51 38
## [5,] 42 38 62 57 62 40 68 34 41 58</code></pre>
<p>Each column (variable) has its own mean and variance:</p>
<pre class="r"><code>rbind(lambdas,
mean = apply(poisMat, 2, function(x) round(mean(x), 0)),
var = apply(poisMat, 2, function(x) round(var(x), 0))
)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## lambdas 43 45 51 61 55 46 62 35 48 47
## mean 43 45 51 61 55 46 62 35 48 47
## var 43 46 51 61 55 46 62 35 47 47</code></pre>
<p>Calculate <span class="math inline">\(X^2\)</span> for each iteration (i.e. each row of the matrix <code>poisMat</code>), and estimate mean and variance across all estimated values of <span class="math inline">\(X^2\)</span>:</p>
<pre class="r"><code>X2 <- sapply(seq_len(nObs),
function(x) sum((poisMat[x,] - lambdas)^2 / lambdas))
round(c(mean(X2), var(X2)), 1)</code></pre>
<pre><code>## [1] 10.0 20.2</code></pre>
<p>The true <span class="math inline">\(\chi^2\)</span> distribution with 10 degrees of freedom:</p>
<pre class="r"><code>chisqkdf <- rchisq(nObs, nMeasures)
round(c(mean(chisqkdf), var(chisqkdf)), 1)</code></pre>
<pre><code>## [1] 10.0 19.8</code></pre>
<p>These simulations strongly suggest that summing across independent standardized Poisson variables generates a statistic that has a <span class="math inline">\(\chi^2\)</span> distribution.</p>
</div>
<div id="the-consequences-of-conditioning" class="section level3">
<h3>The consequences of conditioning</h3>
<p>If we find ourselves in the situation where we have some number of bins or containers or cells into which we are throwing a <em>fixed</em> number of something, we are no longer in the realm of independent, unconditional Poisson random variables. This has implications for our <span class="math inline">\(X^2\)</span> statistic.</p>
<p>As an example, say we have those recycling bins again (this time five) and a total of 100 glass bottles. If each bottle has an equal chance of ending up in any of the five bins, we would expect on average 20 bottles to end up in each. Typically, we highlight the fact that under this constraint (of having 100 bottles) information about about four of the bins is the same as having information about all five. If I tell you that the first four bins contain a total of 84 bottles, we know that the last bin must have exactly 16. Actually counting those bottles in the fifth bin provides <em>no</em> additional information. In this case (where we really only have 4 pieces of information, and not the the 5 we are looking at), we say we have lost 1 degree of freedom due to the constraint. This loss gets translated into the chi-square test.</p>
<p>I want to explore more concretely how the constraint on the total number of bottles affects the distribution of the <span class="math inline">\(X^2\)</span> statistic and ultimately the chi-square test.</p>
<div id="unconditional-counting" class="section level4">
<h4>Unconditional counting</h4>
<p>Consider a simpler example of three glass recycling bins in three different buildings. We know that, on average, the bin in building 1 typically has 20 bottles deposited daily, the bin in building 2 usually has 40, and the bin in building 3 has 80. These number of bottles in each bin is Poisson distributed, with <span class="math inline">\(\lambda_i, \ i \in \{1,2, 3\}\)</span> equal to 20, 40, and 80, respectively. Note, while we would expect on average 140 total bottles across the three buildings, some days we have fewer, some days we have more - all depending on what happens in each individual building. The total is also Poisson distributed with <span class="math inline">\(\lambda_{total} = 140\)</span>.</p>
<p>Let’s generate 10,000 days worth of data (under the assumption that bottle disposal patterns are consistent over a very long time, a dubious assumption).</p>
<pre class="r"><code>library(simstudy)
def <- defData(varname = "bin_1", formula = 20, dist = "poisson")
def <- defData(def, "bin_2", formula = 40, dist = "poisson")
def <- defData(def, "bin_3", formula = 80, dist = "poisson")
def <- defData(def, varname = "N",
formula = "bin_1 + bin_2 + bin_3",
dist = "nonrandom")
set.seed(1234)
dt <- genData(10000, def)
dt[1:5, ]</code></pre>
<pre><code>## id bin_1 bin_2 bin_3 N
## 1: 1 14 44 59 117
## 2: 2 21 36 81 138
## 3: 3 21 34 68 123
## 4: 4 16 43 81 140
## 5: 5 22 44 86 152</code></pre>
<p>The means and variances are as expected:</p>
<pre class="r"><code>round(dt[ ,.(mean(bin_1), mean(bin_2), mean(bin_3))], 1)</code></pre>
<pre><code>## V1 V2 V3
## 1: 20 39.9 80.1</code></pre>
<pre class="r"><code>round(dt[ ,.(var(bin_1), var(bin_2), var(bin_3))], 1)</code></pre>
<pre><code>## V1 V2 V3
## 1: 19.7 39.7 80.6</code></pre>
<p>This plot shows the actual numbers of bottles in each bin in each building over the 10,000 days:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
<p>There is also quite a lot of variability in the daily totals calculated by adding up the bins across three buildings. (While it is clear based on the mean and variance that this total has a <span class="math inline">\(Poisson(140)\)</span> distribution, the plot looks quite symmetrical. It is the case that as <span class="math inline">\(\lambda\)</span> increases, the Poisson distribution becomes well approximated by the normal distribution.)</p>
<pre class="r"><code>round(dt[, .(avgN = mean(N), varN = var(N))], 1)</code></pre>
<pre><code>## avgN varN
## 1: 140 139.6</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-21-1.png" width="672" /></p>
</div>
<div id="conditional-counting" class="section level4">
<h4>Conditional counting</h4>
<p>Now, let’s say that the three bins are actually in the <em>same</em> (very large) building, located in different rooms in the basement, just to make it more convenient for residents (in case you are wondering, my bins are right next to the service elevator). But, let’s also make the assumption (and condition) that there are always between 138 and 142 total bottles on any given day. The expected values for each bin remain 20, 40, and 80, respectively</p>
<p>We calculate the total number of bottles every day and identify all cases where the sum of the three bins is within the fixed range. For this subset of the sample, we see that the means are unchanged:</p>
<pre class="r"><code>defAdd <- defDataAdd(varname = "condN",
formula = "(N >= 138 & N <= 142)",
dist = "nonrandom")
dt <- addColumns(defAdd, dt)
round(dt[condN == 1,
.(mean(bin_1), mean(bin_2), mean(bin_3))],
1)</code></pre>
<pre><code>## V1 V2 V3
## 1: 20.1 40 79.9</code></pre>
<p>However, <strong>and this is really the key point</strong>, the variance of the sample (which is conditional on the sum being between 138 and 142) is reduced:</p>
<pre class="r"><code>round(dt[condN == 1,
.(var(bin_1), var(bin_2), var(bin_3))],
1)</code></pre>
<pre><code>## V1 V2 V3
## 1: 17.2 28.3 35.4</code></pre>
<p>The red points in the plot below represent all daily totals <span class="math inline">\(\sum_i bin_i\)</span> that fall between 138 and 142 bottles. The spread from top to bottom is contained by the rest of the (unconstrained) sample, an indication that the variance for this conditional scenario is smaller:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-24-1.png" width="672" /></p>
<p>Not surprisingly, the distribution of the totals across the bins is quite narrow. But, this is almost a tautology, since this is how we defined the sample:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-25-1.png" width="672" /></p>
</div>
</div>
<div id="biased-standardization" class="section level3">
<h3>Biased standardization</h3>
<p>And here is the grand finale of part 1. When we calculate <span class="math inline">\(X^2\)</span> using the standard formula under a constrained data generating process, we are not dividing by the proper variance. We just saw that the conditional variance within each bin is smaller than the variance of the unconstrained Poisson distribution. So, <span class="math inline">\(X^2\)</span>, as defined by</p>
<p><span class="math display">\[ X^2 = \sum_{i=1}^{k \ bins} {\frac{(O_i - E_i)^2}{E_i}}\]</span></p>
<p>is not a sum of approximately standard normal variables - the variance used in the formula is too high. <span class="math inline">\(X^2\)</span> will be smaller than a <span class="math inline">\(\chi^2_k\)</span>. How much smaller? Well, if the constraint is even tighter, limited to where the total equals exactly 140 bottles every day, <span class="math inline">\(X^2\)</span> has a <span class="math inline">\(\chi^2_{k-1}\)</span> distribution.</p>
<p>Even using our slightly looser constraint of fixing the total between 138 and 142, the distribution is quite close to a <span class="math inline">\(chi^2_2\)</span> distribution:</p>
<pre class="r"><code>defA <- defDataAdd(varname = "X2.1",
formula = "(bin_1-20)^2 / 20", dist = "nonrandom")
defA <- defDataAdd(defA, "X2.2",
formula = "(bin_2-40)^2 / 40", dist = "nonrandom")
defA <- defDataAdd(defA, "X2.3",
formula = "(bin_3-80)^2 / 80", dist = "nonrandom")
defA <- defDataAdd(defA, "X2",
formula = "X2.1 + X2.2 + X2.3", dist = "nonrandom")
dt <- addColumns(defA, dt)</code></pre>
<p>Comparison with <span class="math inline">\(\chi^2_3\)</span> shows clear bias:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-27-1.png" width="672" /></p>
<p>Here it is with a <span class="math inline">\(\chi^2_2\)</span> distribution:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-18-a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence_files/figure-html/unnamed-chunk-28-1.png" width="672" /></p>
</div>
<div id="recycling-more-than-glass" class="section level3">
<h3>Recycling more than glass</h3>
<p>Part 2 will extend this discussion to the contingency table, which is essentially a 2-dimensional array of bins. If we have different types of materials to recycle - glass bottles, plastic containers, cardboard boxes, and metal cans - we need four bins at each location. We might be interested in knowing if the distribution of these four materials is different across the 3 different locations - this is where the chi-square test for independence can be useful.</p>
<p>As an added bonus, you can expect to see lots of code that allows you to simulate contingency tables under different assumptions of conditioning. I know my kids are psyched.</p>
</div>
Another reason to be careful about what you control for
https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/
Wed, 07 Mar 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/<p>Modeling data without any underlying causal theory can sometimes lead you down the wrong path, particularly if you are interested in understanding the <em>way</em> things work rather than making <em>predictions.</em> A while back, I <a href="https://www.rdatagen.net/post/be-careful/">described</a> what can go wrong when you control for a mediator when you are interested in an exposure and an outcome. Here, I describe the potential biases that are introduced when you inadvertently control for a variable that turns out to be a <strong><em>collider</em></strong>.</p>
<p>A collider, like a mediator, is a post-exposure/post-intervention outcome. Unlike a mediator, a collider is not necessarily causally related to the outcome of interest. (This is not to say that it cannot be, which is why this concept came up in a talk I gave about marginal structural models, described <a href="https://www.rdatagen.net/post/potential-outcomes-confounding/">here</a>, <a href="https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/">here</a>, and <a href="https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/">here</a>.) The key distinction of a collider is that it is an outcome that has two causes. In a directed acyclic graph (or <a href="http://www.dagitty.net/learn/index.html">DAG</a>), a collider is a variable with two arrows pointing towards it. This is easier to see visually:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-collider/Collider_names.png" />
</div>
<p>In this (admittedly thoroughly made-up though not entirely implausible) network diagram, the <em>test score</em> outcome is a collider, influenced by a <em>test preparation</em> class and <em>socio-economic status</em> (SES). In particular, both the test prep course and high SES are related to the probability of having a high test score. One might expect an arrow of some sort to connect SES and the test prep class; in this case, participation in test prep is randomized so there is no causal link (and I am assuming that everyone randomized to the class actually takes it, a compliance issue I addressed in a series of posts starting with <a href="https://www.rdatagen.net/post/cace-explored/">this one</a>.)</p>
<p>The researcher who carried out the randomization had a hypothesis that test prep actually is detrimental to college success down the road, because it de-emphasizes deep thinking in favor of wrote memorization. In reality, it turns out that the course and subsequent college success are not related, indicated by an <em>absence</em> of a connection between the course and the long term outcome.</p>
<div id="simulate-data" class="section level3">
<h3>Simulate data</h3>
<p>We can simulate data from this hypothetical world (using functions from package <code>simstudy</code>):</p>
<pre class="r"><code># define data
library(simstudy)
defCollide <- defData(varname = "SES",
formula = "0;1",
dist = "uniform")
defCollide <- defData(defCollide, varname = "testPrep",
formula = 0.5,
dist = "binary")
defCollide <- defData(defCollide, varname = "highScore",
formula = "-1.2 + 3*SES + 3*testPrep",
dist = "binary", link="logit")
defCollide <- defData(defCollide, varname = "successMeasure",
formula = "20 + SES*40", variance = 9,
dist = "normal")
defCollide</code></pre>
<pre><code>## varname formula variance dist link
## 1: SES 0;1 0 uniform identity
## 2: testPrep 0.5 0 binary identity
## 3: highScore -1.2 + 3*SES + 3*testPrep 0 binary logit
## 4: successMeasure 20 + SES*40 9 normal identity</code></pre>
<pre class="r"><code># generate data
set.seed(139)
dt <- genData(1500, defCollide)
dt[1:6]</code></pre>
<pre><code>## id SES testPrep highScore successMeasure
## 1: 1 0.52510665 1 1 40.89440
## 2: 2 0.31565690 0 1 34.72037
## 3: 3 0.47978492 1 1 41.79532
## 4: 4 0.19114934 0 0 30.05569
## 5: 5 0.06889896 0 0 21.28575
## 6: 6 0.10139604 0 0 21.30306</code></pre>
<p>We can see that the distribution of the long-term (continuous) success outcome is the same for those who are randomized to test prep compared to those who are not, indicating there is no causal relationship between the test and the college outcome:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-07-another-reason-to-be-careful-about-what-you-control-for_files/figure-html/unnamed-chunk-4-1.png" width="768" /></p>
<p>An unadjusted linear model leads us to the same conclusion, since the parameter estimate representing the treatment effect is quite small (and the hypothesis test is not statistically significant):</p>
<pre class="r"><code>library(broom)
rndtidy( lm(successMeasure ~ testPrep, data = dt))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 40.112 0.44 91.209 0.000
## 2 testPrep -0.495 0.61 -0.811 0.418</code></pre>
</div>
<div id="but-dont-we-need-to-adjust-for-some-measure-of-intellectual-ability" class="section level3">
<h3>But, don’t we need to adjust for some measure of intellectual ability?</h3>
<p>Or so the researcher might ask after looking at the initial results, questioning the model. He believes that differences in ability could be related to future outcomes. While this may be the case, the question isn’t about ability but the impact of test prep. Based on his faulty logic, the researcher decides to fit a second model and control for the test score that followed the experiment. And this is where things go awry. Take a look at the following model where there appears to be a relationship between test prep and college success after controlling for the test score:</p>
<pre class="r"><code># adjusted model
rndtidy( lm(successMeasure ~ highScore + testPrep, data = dt))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 35.525 0.619 57.409 0
## 2 highScore 8.027 0.786 10.207 0
## 3 testPrep -3.564 0.662 -5.380 0</code></pre>
<p>It does indeed appear that the test prep course is causing problems for real learning in college later on!</p>
</div>
<div id="what-is-going-on" class="section level3">
<h3>What is going on?</h3>
<p>Because the test score (here I am treating it as binary - either a high score or not), is related to both SES and test prep, the fact that someone does well on the test is due either to the fact that the student took the course, has high SES, or both. But, let’s consider the students who are possibly high SES or maybe took the course, but not not both, <strong><em>and</em></strong> who had a high test score. If a student is low SES, she probably took the course, or if she did not take the course, she is probably high SES. So, within the group that scored well, SES and the probability of taking the course are slightly negatively correlated.</p>
<p>If we “control” for test scores in the model, we are essentially comparing students within two distinct groups - those who scored well and those who did not. The updated network diagram shows a relationship between SES and test prep that didn’t exist before. This is the induced relationship we get by controlling a collider. (Control is shown in the diagram by removing the connection of SES and test prep to the test score.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-collider/Collider_names_adjust.png" />
</div>
<p>If we look at the entire sample and compare the SES distribution (which is a continuous measure uniformly distributed between 0 and 1) for each test prep group, we see that both groups have the same distribution (i.e. there is no relationship):</p>
<p><img src="https://www.rdatagen.net/post/2018-03-07-another-reason-to-be-careful-about-what-you-control-for_files/figure-html/unnamed-chunk-7-1.png" width="768" /></p>
<p>But if we look at the relationship between SES and test prep within each test score group, the distributions no longer completely overlap - within each test score group, there is a relationship between SES and test prep.</p>
<p><img src="https://www.rdatagen.net/post/2018-03-07-another-reason-to-be-careful-about-what-you-control-for_files/figure-html/unnamed-chunk-8-1.png" width="768" /></p>
</div>
<div id="why-does-this-matter" class="section level3">
<h3>Why does this matter?</h3>
<p>If the researcher has no good measure for SES or no measure at all, he cannot control for SES in the model. And now, because of the induced relationship between test prep and (unmeasured) SES, there is unmeasured confounding. This confounding leads to the biased estimate that we saw in the second model. And we see this bias in the densities shown for each test score group:</p>
<p><img src="https://www.rdatagen.net/post/2018-03-07-another-reason-to-be-careful-about-what-you-control-for_files/figure-html/unnamed-chunk-9-1.png" width="768" /></p>
<p>If it turns out that we <em>can</em> control for SES as well, because we have an adequate measure for it, then the artificial link between SES and test prep is severed, and so is the relationship between test prep and the long term college outcome.</p>
<pre class="r"><code>rndtidy( lm(successMeasure ~ SES + highScore + testPrep, data = dt))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 19.922 0.194 102.519 0.000
## 2 SES 40.091 0.279 143.528 0.000
## 3 highScore -0.098 0.212 -0.462 0.644
## 4 testPrep 0.137 0.174 0.788 0.431</code></pre>
<p>The researcher can create problems by controlling for all the variables he has and not controlling for the variables he doesn’t have. Of course, if there are no colliders and mediators, then there is no harm. And unfortunately, without theory, it may be hard to know the structure of the DAG, particularly if there are important unmeasured variables. But, the researcher needs to proceed with a bit of caution.</p>
</div>
<div id="addendum-selection-bias" class="section level2">
<h2>Addendum: selection bias</h2>
<p>“Selection bias” is used in a myriad of ways to characterize the improper assessment of an exposure-outcome relationship. For example, unmeasured confounding (where there is an unmeasured factor that influences both an exposure and an outcome) is often called selection bias, in the sense that the exposure is “selected” based on that particular characteristic.</p>
<p>Epidemiologists talk about selection bias in a very specific way, related to how individuals are selected or self-select into a study. In particular, if selection into a study depends on the exposure of interest and some other factor that is associated with the outcome, we can have selection bias.</p>
<p>How is this relevant to this post? Selection bias results from controlling a collider. In this case, however, control is done on through the study design, rather than through statistical modeling. Let’s say we have the same scenario with a randomized trial of a test prep course and we are primarily interested in the impact on the near-term test score. But, later on, we decide to explore the relationship of the course with the long-term college outcome and we send out a survey to collect the college outcome data. It turns out that those who did well on the near-term test were much more likely to respond to the survey - so those who have been selected (or in this case self-selected) will have an induced relationship between the test prep course and SES, just as before. Here is the new DAG:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-collider/Collider_names_select.png" />
</div>
<div id="simulate-new-study-selection-variable" class="section level3">
<h3>Simulate new study selection variable</h3>
<p>The study response or selection variable is dependent on the near-term test score. The selected group is explicitly defined by the value of <code>inStudy</code></p>
<pre class="r"><code># selection bias
defS <- defDataAdd(varname = "inStudy",
formula = "-2.0 + 2.2 * highScore",
dist = "binary", link = "logit")
dt <- addColumns(defS, dt)
dSelect <- dt[inStudy == 1]</code></pre>
<p>We can see that a large proportion of the the selected group has a high probability of having scored high on the test score:</p>
<pre class="r"><code>dSelect[, mean(highScore)]</code></pre>
<pre><code>## [1] 0.9339207</code></pre>
</div>
<div id="selection-bias-is-a-muted-version-of-full-on-collider-bias" class="section level3">
<h3>Selection bias is a muted version of full-on collider bias</h3>
<p>Within this group of selected students, there is an (incorrectly) estimated relationship between the test prep course and subsequent college success. This bias is what epidemiologists are talking about when they talk about selection bias:</p>
<pre class="r"><code>rndtidy( lm(successMeasure ~ testPrep, data = dSelect))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 41.759 0.718 58.154 0.000
## 2 testPrep -2.164 0.908 -2.383 0.017</code></pre>
</div>
</div>
“I have to randomize by cluster. Is it OK if I only have 6 sites?"
https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/
Wed, 21 Feb 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/<p>The answer is probably no, because there is a not-so-low chance (perhaps considerably higher than 5%) you will draw the wrong conclusions from the study. I have heard variations on this question not so infrequently, so I thought it would be useful (of course) to do a few quick simulations to see what happens when we try to conduct a study under these conditions. (Another question I get every so often, after a study has failed to find an effect: “can we get a post-hoc estimate of the power?” I was all set to post on the issue, but then I found <a href="http://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html">this</a>, which does a really good job of explaining why this is not a very useful exercise.) But, back to the question at hand.</p>
<p>Here is the bottom line: if there are differences between clusters that relate to the outcome, there is a good chance that we might confuse those inherent differences for treatment effects. These inherent differences could be the characteristics of people in the different clusters; for example, a health care clinic might attract healthier people than others. Or the differences could be characteristics of the clusters themselves; for example, we could imagine that some health care clinics are better at managing high blood pressure than others. In both scenarios, individuals in a particular cluster are likely to have good outcomes regardless of the intervention. And if these clusters happen to get assigned to the intervention, we could easily confuse the underlying structure or characteristics as an intervention effect.</p>
<p>This problem easiest to observe if we generate data with the underlying assumption that there is no treatment effect. Actually, I will generate lots of these data sets, and for each one I am going to test for statistical significance. (I am comfortable doing that in this situation, since I literally can repeat the identical experiment over an over again, a key pre-requisite for properly interpreting a p-value.) I am going to estimate the proportion of cases where the test statistic would lead me to incorrectly reject the null hypothesis, or make a Type I error. (I am not getting into the case where there is actually a treatment effect.)</p>
<div id="a-single-cluster-randomized-trial-with-6-sites" class="section level3">
<h3>A single cluster randomized trial with 6 sites</h3>
<p>First, I define the cluster level data. Each cluster or site will have a “fixed” effect that will apply to all individuals within that site. I will generate the fixed effect so that on average (across all sites) it is 0 with a variance of 0.053. (I will explain that arbitrary number in a second.) Each site will have exactly 50 individuals.</p>
<pre class="r"><code>library(simstudy)
defC <- defData(varname = "siteFix", formula = 0,
variance = .053, dist = "normal", id = "cID")
defC <- defData(defC, varname = "nsite", formula = 50,
dist = "nonrandom")
defC</code></pre>
<pre><code>## varname formula variance dist link
## 1: siteFix 0 0.053 normal identity
## 2: nsite 50 0.000 nonrandom identity</code></pre>
<p>Now, I generate the cluster-level data and assign treatment:</p>
<pre class="r"><code>set.seed(7)
dtC <- genData(6, defC)
dtC <- trtAssign(dtC)
dtC</code></pre>
<p>Once the cluster-level are ready, I can define and generate the individual-level data. Each cluster will have 50 records, for a total of 300 individuals.</p>
<pre class="r"><code>defI <- defDataAdd(varname = "y", formula = "siteFix", variance = 1 )
dtI <- genCluster(dtClust = dtC, cLevelVar = "cID", numIndsVar = "nsite",
level1ID = "id")
dtI <- addColumns(defI, dtI)
dtI</code></pre>
<pre><code>## cID trtGrp siteFix nsite id y
## 1: 1 0 0.5265638 50 1 2.7165419
## 2: 1 0 0.5265638 50 2 0.8835501
## 3: 1 0 0.5265638 50 3 3.2433156
## 4: 1 0 0.5265638 50 4 2.8080158
## 5: 1 0 0.5265638 50 5 0.8505844
## ---
## 296: 6 1 -0.2180802 50 296 -0.6351033
## 297: 6 1 -0.2180802 50 297 -1.3822554
## 298: 6 1 -0.2180802 50 298 1.5197839
## 299: 6 1 -0.2180802 50 299 -0.4721576
## 300: 6 1 -0.2180802 50 300 -1.1917988</code></pre>
<p>I promised a little explanation of why the variance of the sites was specified as 0.053. The statistic that characterizes the extent of clustering is the inter-class coefficient, or ICC. This is calculated by</p>
<p><span class="math display">\[ICC = \frac{\sigma^2_{clust}}{\sigma^2_{clust}+\sigma^2_{ind}}\]</span> where <span class="math inline">\(\sigma^2_{clust}\)</span> is the variance of the cluster means, and <span class="math inline">\(\sigma^2_{ind}\)</span> is the variance of the individuals within the clusters. (We are assuming that the within-cluster variance is constant across all clusters.) The denominator represents the total variation across all individuals. The ICC ranges from 0 (no clustering) to 1 (maximal clustering). When <span class="math inline">\(\sigma^2_{clust} = 0\)</span> then the <span class="math inline">\(ICC=0\)</span>. This means that all variation is due to individual variation. And when <span class="math inline">\(\sigma^2_{ind}=0\)</span>, <span class="math inline">\(ICC=1\)</span>. In this case, there is no variation across individuals within a cluster (i.e. they are all the same with respect to this measure) and any variation across individuals more generally is due entirely to the cluster variation. I used a cluster-level variance of 0.053 so that the ICC is 0.05:</p>
<p><span class="math display">\[ICC = \frac{0.053}{0.053+1.00} \approx 0.05\]</span></p>
<p>OK - back to the data. Let’s take a quick look at it:</p>
<pre class="r"><code>library(ggplot2)
ggplot(data=dtI, aes(x=factor(cID), y=y)) +
geom_jitter(aes(color=factor(trtGrp)), width = .1) +
scale_color_manual(labels=c("control", "rx"),
values = c("#ffc734", "#346cff")) +
theme(panel.grid.minor = element_blank(),
legend.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-02-21-i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>Remember, there is no treatment effect (either positive or negative). But, due to cluster variation, Site 1 (randomized to control) has higher than average outcomes. We estimate the treatment effect using a fixed effects model. This model seems reasonable, since we don’t have enough sites to estimates the variability of a random effects model. We conclude that the treatment has a (deleterious) effect (assuming higher <span class="math inline">\(y\)</span> is a good thing), based on the p-value for the treatment effect estimate that is considerably less than 0.05.</p>
<pre class="r"><code>library(broom)
library(lme4)
lmfit <- lm(y ~ trtGrp + factor(cID), data = dtI)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.8267802 0.1394788 5.9276404 8.597761e-09
## 2 trtGrp -0.9576641 0.1972528 -4.8550088 1.958238e-06
## 3 factor(cID)2 -0.1162042 0.1972528 -0.5891129 5.562379e-01
## 4 factor(cID)3 0.1344241 0.1972528 0.6814812 4.961035e-01
## 5 factor(cID)4 -0.8148341 0.1972528 -4.1309123 4.714672e-05
## 6 factor(cID)5 -1.2684515 0.1972528 -6.4305878 5.132896e-10</code></pre>
<p> </p>
</div>
<div id="a-more-systematic-evaluation" class="section level3">
<h3>A more systematic evaluation</h3>
<p>OK, so I was able to pick a seed that generated the outcomes in that single instance that seemed to illustrate my point. But, what happens if we look at this a bit more systematically? The series of plots that follow seem to tell a story. Each one represents a series of simulations, similar to the one above (I am not including the code, because it is a bit convoluted, but would be happy to share if anyone wants it.)</p>
<p>The first scenario shown below is based on six sites using ICCs that range from 0 to 0.10. For each level of ICC, I generated 100 different samples of six sites. For each of those 100 samples, I generated 100 different randomization schemes (which I know is overkill in the case of 6 sites since there are only 20 possible randomization schemes) and generated a new set of individuals. For each of those 100 randomization schemes, I estimated a fixed effects model and recorded the proportion of the 100 where the p-values were below the 0.05 threshold.</p>
<p><img src="https://www.rdatagen.net/img/post-smallcluster/Fixed6.png" /> How do we interpret this plot? When there is no clustering (<span class="math inline">\(ICC=0\)</span>), the probability of a Type I error is close to 5%, which is what we would expect based on theory. But, once we have any kind of clustering, things start to go a little haywire. Even when <span class="math inline">\(ICC=0.025\)</span>, we would make a lot of mistakes. The error rate only increases as the extent of clustering increases. There is quite a lot variability in the error rate, which is a function of the variability of the site specific effects.</p>
<p>If we use 24 sites, and continue to fit a fixed effect model, we see largely the same thing. Here, we have a much bigger sample size, so a smaller treatment effect is more likely to be statistically significant:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-smallcluster/Fixed24.png" />
</div>
<p>One could make the case that instead of fitting a fixed effects model, we should be using a random effects model (particularly if the sites themselves are randomly pulled from a population of sites, though this is hardly likely to be the case when you are having a hard time recruiting sites to participate in your study). The next plot shows that the error rate goes down for 6 sites, but still not enough for my comfort:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-smallcluster/Random6.png" />
</div>
<p>With 24 sites, the random effects model seems much safer to use:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-smallcluster/Random24.png" />
</div>
<p>But, in reality, if we only have 6 sites, the best that we could do is randomize within site and use a fixed effect model to draw our conclusions. Even at high levels of clustering, this approach will generally lead us towards a valid conclusion (assuming, of course, the study itself is well designed and implemented):</p>
<p><img src="https://www.rdatagen.net/img/post-smallcluster/RwithinC6.png" /> But, I assume the researcher couldn’t randomize at the individual level, otherwise they wouldn’t have asked that question. In which case I would say, “It might not be the best use of resources.”</p>
</div>
Have you ever asked yourself, "how should I approach the classic pre-post analysis?"
https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/
Sun, 28 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/<p>Well, maybe you haven’t, but this seems to come up all the time. An investigator wants to assess the effect of an intervention on a outcome. Study participants are randomized either to receive the intervention (could be a new drug, new protocol, behavioral intervention, whatever) or treatment as usual. For each participant, the outcome measure is recorded at baseline - this is the <em>pre</em> in pre/post analysis. The intervention is delivered (or not, in the case of the control group), some time passes, and the outcome is measured a second time. This is our <em>post</em>. The question is, how should we analyze this study to draw conclusions about the intervention’s effect on the outcome?</p>
<p>There are at least three possible ways to approach this. (1) Ignore the <em>pre</em> outcome measure and just compare the average <em>post</em> scores of the two groups. (2) Calculate a <em>change</em> score for each individual (<span class="math inline">\(\Delta_i = post_i - pre_i\)</span>), and compare the average <span class="math inline">\(\Delta\)</span>’s for each group. Or (3), use a more sophisticated regression model to estimate the intervention effect while <em>controlling</em> for the <em>pre</em> or baseline measure of the outcome. Here are three models associated with each approach (<span class="math inline">\(T_i\)</span> is 1 if the individual <span class="math inline">\(i\)</span> received the treatment, 0 if not, and <span class="math inline">\(\epsilon_i\)</span> is an error term):</p>
<span class="math display">\[\begin{aligned}
&(1) \ \ post_i = \beta_0 + \beta_1T_i + \epsilon_i \\
\\
&(2) \ \ \Delta_i = \alpha_0 + \alpha_1T_i + \epsilon_i \\
\\
&(3) \ \ post_i = \gamma_0 + \gamma_1 pre_i+ \gamma_2 T_i + \epsilon_i
\end{aligned}\]</span>
<p>I’ve explored various scenarios (i.e. different data generating assumptions) to see if it matters which approach we use. (Of course it does.)</p>
<div id="scenario-1-pre-and-post-not-correlated" class="section level3">
<h3>Scenario 1: pre and post not correlated</h3>
<p>In the simulations that follow, I am generating potential outcomes for each individual. So, the variable <code>post0</code> represents the follow-up outcome for the individual under the control condition, and <code>post1</code> is the outcome in the intervention condition. <code>pre0</code> and <code>pre1</code> are the same, because the intervention does not affect the baseline measurement. The effect of the intervention is specified by <code>eff</code>. In the first scenario, the baseline and follow-up measures are not related to each other, and the effect size is 1. All of the data definitions and data generation are done using package <code>simstudy</code>.</p>
<pre class="r"><code>library(simstudy)
# generate potential outcomes
defPO <- defData(varname = "pre0", formula = 8.5,
variance = 4, dist = "normal")
defPO <- defData(defPO, varname = "post0", formula = 7.5,
variance = 4, dist = "normal")
defPO <- defData(defPO, varname = "pre1", formula = "pre0",
dist = "nonrandom")
defPO <- defData(defPO, varname = "eff", formula = 1,
variance = 0.2, dist = "normal")
defPO <- defData(defPO, varname = "post1", formula = "post0 + eff",
dist = "nonrandom")</code></pre>
<p>The baseline, follow-up, and change that are actually <em>observed</em> are merely a function of the group assignment.</p>
<pre class="r"><code># generate observed data
defObs <- defDataAdd(varname = "pre",
formula = "pre0 * (trtGrp == 0) + pre1 * (trtGrp == 1)",
dist = "nonrandom")
defObs <- defDataAdd(defObs, varname = "post",
formula = "post0 * (trtGrp == 0) + post1 * (trtGrp == 1)",
dist = "nonrandom")
defObs <- defDataAdd(defObs, varname = "delta",
formula = "post - pre",
dist = "nonrandom")</code></pre>
<p>Now we generate the potential outcomes, the group assignment, and observed data for 1000 individuals. (I’m using package <code>stargazer</code>, definitely worth checking out, to print out the first five rows of the dataset.)</p>
<pre class="r"><code>set.seed(123)
dt <- genData(1000, defPO)
dt <- trtAssign(dt)
dt <- addColumns(defObs, dt)
stargazer::stargazer(dt[1:5,], type = "text", summary=FALSE, digits = 2)</code></pre>
<pre><code>##
## =========================================================
## id trtGrp pre0 post0 pre1 eff post1 pre post delta
## ---------------------------------------------------------
## 1 1 1 7.38 5.51 7.38 0.77 6.28 7.38 6.28 -1.10
## 2 2 1 8.04 5.42 8.04 1.11 6.53 8.04 6.53 -1.51
## 3 3 1 11.62 7.46 11.62 0.76 8.22 11.62 8.22 -3.40
## 4 4 0 8.64 7.24 8.64 1.55 8.78 8.64 7.24 -1.41
## 5 5 1 8.76 2.40 8.76 1.08 3.48 8.76 3.48 -5.28
## ---------------------------------------------------------</code></pre>
<p>The plots show the three different types of analysis - follow-up measurement alone, change, or follow-up controlling for baseline:</p>
<p><img src="https://www.rdatagen.net/post/2018-01-28-thinking-about-the-run-of-the-mill-pre-post-analysis_files/figure-html/unnamed-chunk-4-1.png" width="1152" /></p>
<p>I compare the different modeling approaches by using simulation to estimate statistical power for each. That is, given that there is some effect, how often is the p-value of the test less than 0.05. I’ve written a function to handle the data generation and power estimation. The function generates 1000 data sets of a specified sample size, each time fitting the three models, and keeping track of the relevant p-values for each iteration.</p>
<pre class="r"><code>powerFunc <- function(def, addDef, ss, rct = TRUE) {
presults <- data.table()
iter <- 1000
for (i in 1:iter) {
dt <- genData(ss, def)
if (rct) {
dt <- trtAssign(dt)
} else {
dt <- trtObserve(dt, "-4.5 + .5*pre0", logit.link = TRUE)
}
dt <- addColumns(addDef, dt)
lmfit1 <- lm(post ~ trtGrp, data = dt)
lmfit2 <- lm(delta ~ trtGrp, data = dt)
lmfit3 <- lm(post ~ pre + trtGrp, data = dt)
lmfit3x <- lm(post ~ pre + trtGrp + pre*trtGrp, data = dt)
p1 <- coef(summary(lmfit1))["trtGrp","Pr(>|t|)" ]
p2 <- coef(summary(lmfit2))["trtGrp","Pr(>|t|)" ]
p3 <- coef(summary(lmfit3))["trtGrp","Pr(>|t|)" ]
p3x <- coef(summary(lmfit3x))["pre:trtGrp","Pr(>|t|)" ]
presults <- rbind(presults, data.table(p1, p2, p3, p3x))
}
return(presults)
}</code></pre>
<p>The results for the first data set are based on a sample size of 150 individuals (75 in each group). The <em>post-only</em> model does just as well as the <em>post adjusted for baseline</em> model. The model evaluating change in this scenario is way under powered.</p>
<pre class="r"><code>presults <- powerFunc(defPO, defObs, 150)
presults[, .(postonly = mean(p1 <= 0.05),
change = mean(p2 <= 0.05),
adjusted = mean(p3 <= 0.05))]</code></pre>
<pre><code>## postonly change adjusted
## 1: 0.85 0.543 0.845</code></pre>
<p> </p>
</div>
<div id="scenario-2-pre-and-post-are-moderately-correlated" class="section level3">
<h3>Scenario 2: pre and post are moderately correlated</h3>
<p>Now, we update the definition of <code>post0</code> so that it is now a function of <code>pre0</code>, so that the correlation is around 0.45.</p>
<pre class="r"><code>defPO <- updateDef(defPO, changevar = "post0",
newformula = "3.5 + 0.47 * pre0",
newvariance = 3) </code></pre>
<p><img src="https://www.rdatagen.net/post/2018-01-28-thinking-about-the-run-of-the-mill-pre-post-analysis_files/figure-html/unnamed-chunk-8-1.png" width="1152" /></p>
<p>The correlation actually increases power, so we use a reduced sample size of 120 for the power estimation. In this case, the three models actually all do pretty well, but the <em>adjusted</em> model is slightly superior.</p>
<pre><code>## postonly change adjusted
## 1: 0.776 0.771 0.869</code></pre>
<p> </p>
</div>
<div id="scenario-3-pre-and-post-are-almost-perfectly-correlated" class="section level3">
<h3>Scenario 3: pre and post are almost perfectly correlated</h3>
<p>When baseline and follow-up measurements are almost perfectly correlated (in this case about 0.85), we would be indifferent between the <em>change</em> and <em>adjusted</em> analyses - the power of the tests is virtually identical. However, the analysis that considers the follow-up measure alone does is much less adequate, due primarily to the measure’s relatively high variability.</p>
<pre class="r"><code>defPO <- updateDef(defPO, changevar = "post0",
newformula = "0.9 * pre0",
newvariance = 1) </code></pre>
<p><img src="https://www.rdatagen.net/post/2018-01-28-thinking-about-the-run-of-the-mill-pre-post-analysis_files/figure-html/unnamed-chunk-11-1.png" width="1152" /></p>
<pre><code>## postonly change adjusted
## 1: 0.358 0.898 0.894</code></pre>
<p> </p>
</div>
<div id="when-the-effect-differs-by-baseline-measurement" class="section level3">
<h3>When the effect differs by baseline measurement</h3>
<p>In a slight variation of the previous scenario, the <em>effect</em> of the intervention itself is a now function of the baseline score. Those who score higher will benefit less from the intervention - they simply have less room to improve. In this case, the adjusted model appears slightly inferior to the change model, while the unadjusted <em>post-only</em> model is still relatively low powered.</p>
<pre class="r"><code>defPO <- updateDef(defPO, changevar = "eff",
newformula = "1.9 - 1.9 * pre0/15") </code></pre>
<p><img src="https://www.rdatagen.net/post/2018-01-28-thinking-about-the-run-of-the-mill-pre-post-analysis_files/figure-html/unnamed-chunk-13-1.png" width="1152" /></p>
<pre class="r"><code>presults[, .(postonly = mean(p1 <= 0.05),
change = mean(p2 <= 0.05),
adjusted = mean(p3 <= 0.025 | p3x <= 0.025))]</code></pre>
<pre><code>## postonly change adjusted
## 1: 0.425 0.878 0.863</code></pre>
<p>The <em>adjusted</em> model has less power than the <em>change</em> model, because I used a reduced <span class="math inline">\(\alpha\)</span>-level for the hypothesis test of the <em>adjusted</em> models. I am testing for interaction first, then if that fails, for main effects, so I need to adjust for multiple comparisons. (I have another <a href="https://www.rdatagen.net/post/sub-group-analysis-in-rct/">post</a> that shows why this might be a good thing to do.) I have used a Bonferroni adjustment, which can be a more conservative test. I still prefer the <em>adjusted</em> model, because it provides more insight into the underlying process than the <em>change</em> model.</p>
</div>
<div id="treatment-assignment-depends-on-baseline-measurement" class="section level3">
<h3>Treatment assignment depends on baseline measurement</h3>
<p>Now, slightly off-topic. So far, we’ve been talking about situations where treatment assignment is randomized. What happens in a scenario where those with higher baseline scores are more likely to receive the intervention? Well, if we don’t adjust for the baseline score, we will have unmeasured confounding. A comparison of follow-up scores in the two groups will be biased towards the intervention group if the baseline scores are correlated with follow-up scores - as we see visually with a scenario in which the effect size is set to 0. Also notice that the p-values for the unadjusted model are consistently below 0.05 - we are almost always drawing the wrong conclusion if we use this model. On the other hand, the error rate for the adjusted model is close to 0.05, what we would expect.</p>
<pre class="r"><code>defPO <- updateDef(defPO, changevar = "eff",
newformula = 0)
dt <- genData(1000, defPO)
dt <- trtObserve(dt, "-4.5 + 0.5 * pre0", logit.link = TRUE)
dt <- addColumns(defObs, dt)</code></pre>
<p><img src="https://www.rdatagen.net/post/2018-01-28-thinking-about-the-run-of-the-mill-pre-post-analysis_files/figure-html/unnamed-chunk-16-1.png" width="1152" /></p>
<pre><code>## postonly change adjusted
## 1: 0.872 0.095 0.046</code></pre>
<p>I haven’t proved anything here, but these simulations suggest that we should certainly think twice about using an unadjusted model if we happen to have baseline measurements. And it seems like you are likely to maximize power (and maybe minimize bias) if you compare follow-up scores while adjusting for baseline scores rather than analyzing change in scores by group.</p>
</div>
Importance sampling adds an interesting twist to Monte Carlo simulation
https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/
Thu, 18 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/<p>I’m contemplating the idea of teaching a course on simulation next fall, so I have been exploring various topics that I might include. (If anyone has great ideas either because you have taught such a course or taken one, definitely drop me a note.) Monte Carlo (MC) simulation is an obvious one. I like the idea of talking about <em>importance sampling</em>, because it sheds light on the idea that not all MC simulations are created equally. I thought I’d do a brief blog to share some code I put together that demonstrates MC simulation generally, and shows how importance sampling can be an improvement.</p>
<p>Like many of the topics I’ve written about, this is a vast one that certainly warrants much, much more than a blog entry. MC simulation in particular, since it is so fundamental to the practice of statistics. MC methods are an essential tool to understand the behavior of statistical models. In fact, I’ve probably used MC simulations in just about all of my posts - to generate repeated samples from a model to explore bias, variance, and other distributional characteristics of a particular method.</p>
<p>For example, if we want to assess the variability of a regression parameter estimate, we can repeatedly generate data from a particular “hidden” model, and for each data set fit a regression model to estimate the parameter in question. For each iteration, we will arrive at a different estimate; the variation of all those estimates might be of great interest. In particular, the standard deviation of those estimates is the standard error of the estimate. (Of course, with certain problems, there are ways to analytically derive the standard errors. In these cases, MC simulation can be used to verify an analysis was correct. That’s the beauty of statistics - you can actually show yourself you’ve gotten the right answer.)</p>
<div id="a-simple-problem" class="section level3">
<h3>A simple problem</h3>
<p>In this post, I am considering a simple problem. We are interested in estimating the probability of drawing a value between 2 and 2.5 from a standard normal distribution. That is, we want to use MC simulation to estimate</p>
<p><span class="math display">\[p =P(2.0 <= X <= 2.5), \ \ \ X\sim N(0,1)\]</span></p>
<p>Of course, we can use <code>R</code> to get us the true <span class="math inline">\(p\)</span> directly without any simulation at all, but that is no fun:</p>
<pre class="r"><code>pnorm(2.5, 0, 1) - pnorm(2, 0, 1)</code></pre>
<pre><code>## [1] 0.01654047</code></pre>
<p>To do this using simulation, I wrote a simple function that checks to see if a value falls between two numbers.</p>
<pre class="r"><code>inSet <- function(x, minx, maxx) {
result <- (x >= minx & x <= maxx)
return(as.integer(result))
}</code></pre>
<p>To estimate the desired probability, we just repeatedly draw from the standard normal distribution. After each draw, we check to see if the value falls between 2.0 and 2.5, and store that information in a vector. The vector will have a value of 1 each time a value falls into the range, and 0 otherwise. The proportion of 1’s is the desired probability. Or in other words, <span class="math inline">\(\hat{p} = \bar{z}\)</span>, where <span class="math inline">\(\bar{z} = \frac{1}{1000} \sum{z}\)</span>.</p>
<pre class="r"><code>set.seed(1234)
z <- vector("numeric", 1000)
for (i in 1:1000) {
y <- rnorm(1, 0, 1)
z[i] <- inSet(y, 2, 2.5)
}
mean(z)</code></pre>
<pre><code>## [1] 0.018</code></pre>
<p>The estimate is close to the true value, but there is no reason it would be exact. In fact, I would be suspicious if it were. Now, we can also use the variance of <span class="math inline">\(z\)</span> to estimate the standard error of <span class="math inline">\(\hat{p}\)</span>:</p>
<pre class="r"><code>sd(z)/sqrt(1000)</code></pre>
<pre><code>## [1] 0.004206387</code></pre>
</div>
<div id="faster-code" class="section level3">
<h3>Faster code?</h3>
<p>If you’ve read any of my other posts, you know I am often interested in trying to make <code>R</code> run a little faster. This can be particularly important if we need to repeat tasks over and over, as we will be doing here. The <code>for</code> loop I used here is not ideal. Maybe <code>simstudy</code> (and you could do this without simstudy) can do better. Let’s first see if it provides the same estimates:</p>
<pre class="r"><code>library(data.table)
library(simstudy)
# define the data
defMC <- defData(varname = "y", formula = 0,
variance = 1, dist = "normal")
defMC <- defData(defMC, varname = "z", formula = "inSet(y, 2, 2.5)",
dist = "nonrandom")
# generate the data - the MC simulation without a loop
set.seed(1234)
dMC <- genData(1000, defMC)
# evaluate mean and standard error
dMC[ , .(mean(z), sd(z)/sqrt(1000))]</code></pre>
<pre><code>## V1 V2
## 1: 0.018 0.004206387</code></pre>
<p>So, the results are identical - no surprise there. But which approach used fewer computing resources. To find this out, we turn to the <code>microbenchmark</code> package. (I created a function out of the loop above that returns a vector of 1’s and 0’s.)</p>
<pre class="r"><code>library(microbenchmark)
mb <- microbenchmark(tradMCsim(1000), genData(1000, defMC))
summary(mb)[, c("expr", "lq", "mean", "uq", "neval")]</code></pre>
<pre><code>## expr lq mean uq neval
## 1 tradMCsim(1000) 1.428656 2.186979 2.573003 100
## 2 genData(1000, defMC) 1.376450 1.668248 1.674146 100</code></pre>
<p>With 1000 draws, there is actually very little difference between the two approaches. But if we start to increase the number of simulations, the differences become apparent. With 10000 draws, the simstudy approach is more than 7 times faster. The relative improvement continues to increase as the number of draws increases.</p>
<pre class="r"><code>mb <- microbenchmark(tradMCsim(10000), genData(10000, defMC))
summary(mb)[, c("expr", "lq", "mean", "uq", "neval")]</code></pre>
<pre><code>## expr lq mean uq neval
## 1 tradMCsim(10000) 18.453128 21.619022 22.226165 100
## 2 genData(10000, defMC) 2.006622 2.432078 2.508662 100</code></pre>
</div>
<div id="estimating-variation-of-hatp" class="section level3">
<h3>Estimating variation of <span class="math inline">\(\hat{p}\)</span></h3>
<p>Now, we can stop using the loop, at least to generate a single set of draws. But, in order to use MC simulation to estimate the variance of <span class="math inline">\(\hat{p}\)</span>, we still need to use a loop. In this case, we will generate 1500 data sets of 1000 draws each, so we will have 1500 estimates of <span class="math inline">\(\hat{p}\)</span>. (It would probably be best to do all of this using Rcpp, where we can loop with impunity.)</p>
<pre class="r"><code>iter <- 1500
estMC <- vector("numeric", iter)
for (i in 1:iter) {
dtMC <- genData(1000, defMC)
estMC[i] <- dtMC[, mean(z)]
}
head(estMC)</code></pre>
<pre><code>## [1] 0.020 0.013 0.023 0.017 0.016 0.019</code></pre>
<p>We can estimate the average of the <span class="math inline">\(\hat{p}\)</span>’s, which should be close to the true value of <span class="math inline">\(p \approx 0.0165\)</span>. And we can check to see if the standard error of <span class="math inline">\(\hat{p}\)</span> is close to our earlier estimate of 0.004.</p>
<pre class="r"><code>c(mean(estMC), sd(estMC))</code></pre>
<pre><code>## [1] 0.016820000 0.004113094</code></pre>
</div>
<div id="importance-sampling" class="section level3">
<h3>Importance sampling</h3>
<p>As we were trying to find an estimate for <span class="math inline">\(p\)</span> using the simulations above, we spent a lot of time drawing values far outside the range of 2 to 2.5. In fact, almost all of the draws were outside that range. You could almost so that most of those draws were providing little if any information. What if we could focus our attention on the area we are interested in - in this case the 2 to 2.5, without sacrificing our ability to make an unbiased estimate? That would be great, wouldn’t it? That is the idea behind importance sampling.</p>
<p>The idea is to draw from a distribution that is (a) easy to draw from and (b) is close to the region of interest. Obviously, if 100% of our draws is from the set/range in question, then we’ve way over-estimated the proportion. So, we need to reweight the draws in such a way that we get an unbiased estimate.</p>
</div>
<div id="a-very-small-amount-of-theory" class="section level3">
<h3>A very small amount of theory</h3>
<p>A teeny bit of stats theory here (hope you don’t jump ship). The expected value of a draw falling between 2 and 2.5 is</p>
<p><span class="math display">\[E_x(I_R) = \int_{-\infty}^{\infty}{I_{R}(x)f(x)dx} \ ,\]</span></p>
<p>where <span class="math inline">\(I_R(x)=1\)</span> when <span class="math inline">\(2.0 \le x \le 2.5\)</span>, and is 0 otherwise, and <span class="math inline">\(f(x)\)</span> is the standard normal density. This is the quantity that we were estimating above. Now, let’s say we want to draw closer to the range in question - say using <span class="math inline">\(Y\sim N(2.25, 1)\)</span>. We will certainly get more values around 2 and 2.5. If <span class="math inline">\(g(y)\)</span> represents this new density, we can write <span class="math inline">\(E(I_R)\)</span> another way:</p>
<p><span class="math display">\[E_y(I_R) = \int_{-\infty}^{\infty}{I_{R}(y)\frac{f(y)}{g(y)}g(y)dy} \ .\]</span> Notice that the <span class="math inline">\(g(y)\)</span>’s cancel out and we end up with the same expectation as above, except it is with respect to <span class="math inline">\(y\)</span>. Also, notice that the second equation is also a representation of <span class="math inline">\(E_y \left( I_{R}(y)\frac{f(y)}{g(y)} \right)\)</span>.</p>
<p>I know I am doing a lot of hand waving here, but the point is that</p>
<p><span class="math display">\[E_x(I_R) = E_y \left( I_{R}\frac{f}{g} \right)\]</span></p>
<p>Again, <span class="math inline">\(f\)</span> and <span class="math inline">\(g\)</span> are just the original density of interest - <span class="math inline">\(N(0,1)\)</span> - and the “important” density - <span class="math inline">\(N(2.25, 1)\)</span> - respectively. In our modified MC simulation, we draw a <span class="math inline">\(y_i\)</span> from the <span class="math inline">\(N(2.25, 1)\)</span>, and then we calculate <span class="math inline">\(f(y_i)\)</span>, <span class="math inline">\(g(y_i)\)</span>, and <span class="math inline">\(I_R(y_i)\)</span>, or more precisely, <span class="math inline">\(z_i = I_{R}(y_i)\frac{f(y_i)}{g(y_i)}\)</span>. To get <span class="math inline">\(\hat{p}\)</span>, we average the <span class="math inline">\(z_i\)</span>’s, as we did before.</p>
</div>
<div id="beyond-theory" class="section level3">
<h3>Beyond theory</h3>
<p>Why go to all of this trouble? Well, it turns out that the <span class="math inline">\(z_i\)</span>’s will be much less variable if we use importance sampling. And, as a result, the standard error of our estimate can be reduced. This is always a good thing, because it means a reduction in uncertainty.</p>
<p>Maybe a pretty plot will provide a little intuition? Our goal is to estimate the area under the black curve between 2 and 2.5. An importance sample from a <span class="math inline">\(N(2.25, 1)\)</span> distribution is represented by the green curve. I think, however, it might be easiest to understand the adjustment mechanism by looking at the orange curve, which represents the uniform distribution between 2 and 2.5. The density is <span class="math inline">\(g(y) = 2\)</span> for all values within the range, and <span class="math inline">\(g(y) = 0\)</span> outside the range. Each time we generate a <span class="math inline">\(y_i\)</span> from the <span class="math inline">\(U(2,2.5)\)</span>, the value is guaranteed to be in the target range. As calculated, the average of all the <span class="math inline">\(z_i\)</span>’s is the ratio of the area below the black line relative to the area below the orange line, but only in the range between 2 and 2.5. (This may not be obvious, but perhaps staring at the plot for a couple of minutes will help.)</p>
<p><img src="https://www.rdatagen.net/post/2018-01-18-importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation_files/figure-html/unnamed-chunk-11-1.png" width="672" /></p>
</div>
<div id="reducing-standard-errors-by-improving-focus" class="section level3">
<h3>Reducing standard errors by improving focus</h3>
<p>Now we can generate data and estimate <span class="math inline">\(\hat{p}\)</span> and <span class="math inline">\(se(\hat{p})\)</span>. First, here is a simple function to calculate <span class="math inline">\(z\)</span>.</p>
<pre class="r"><code>h <- function(I, f, g) {
dx <- data.table(I, f, g)
dx[I != 0, result := I * f / g]
dx[I == 0, result := 0]
return(dx$result)
}</code></pre>
<p>We can define the three Monte Carlo simulations based on the three different distributions using <code>simstudy</code>. The elements that differ across the three MC simulations are the distribution we are drawing from and the density <span class="math inline">\(g\)</span> of that function.</p>
<pre class="r"><code># Normal(2.5, 1)
def1 <- defData(varname = "y", formula = 2.25,
variance = 1, dist = "normal")
def1 <- defData(def1, varname = "f", formula = "dnorm(y, 0, 1)",
dist = "nonrandom")
def1 <- defData(def1, varname = "g", formula = "dnorm(y, 2.25, 1)",
dist = "nonrandom")
def1 <- defData(def1, varname = "I", formula = "inSet(y, 2, 2.5)",
dist = "nonrandom")
def1 <- defData(def1, varname = "z", formula = "h(I, f, g)",
dist = "nonrandom")
# Normal(2.5, .16)
def2 <- updateDef(def1, "y", newvariance = 0.4^2)
def2 <- updateDef(def2, "g", newformula = "dnorm(y, 2.25, 0.4)")
# Uniform(2, 3)
def3 <- updateDef(def1, "y", newformula = "2;2.5",
newvariance = 0, newdist = "uniform")
def3 <- updateDef(def3, "g", newformula = "dunif(y, 2, 2.5)")</code></pre>
<p>Here is a peek at one data set using the uniform sampling approach:</p>
<pre class="r"><code>genData(1000, def3)</code></pre>
<pre><code>## id y f g I z
## 1: 1 2.181324 0.03695603 2 1 0.018478013
## 2: 2 2.381306 0.02341805 2 1 0.011709023
## 3: 3 2.338066 0.02593364 2 1 0.012966819
## 4: 4 2.200399 0.03544350 2 1 0.017721749
## 5: 5 2.461919 0.01926509 2 1 0.009632543
## ---
## 996: 996 2.118506 0.04229983 2 1 0.021149914
## 997: 997 2.433722 0.02064175 2 1 0.010320876
## 998: 998 2.265325 0.03066025 2 1 0.015330127
## 999: 999 2.107219 0.04332075 2 1 0.021660374
## 1000: 1000 2.444599 0.02010130 2 1 0.010050651</code></pre>
<p>And here are the estimates based on the three different importance samples. Again each iteration is 1000 draws from the distribution - and we use 1500 iterations:</p>
<pre class="r"><code>iter <- 1500
N <- 1000
est1 <- vector("numeric", iter)
est2 <- vector("numeric", iter)
est3 <- vector("numeric", iter)
for (i in 1:iter) {
dt1 <- genData(N, def1)
est1[i] <- dt1[, mean(z)]
dt2 <- genData(N, def2)
est2[i] <- dt2[, mean(z)]
dt3 <- genData(N, def3)
est3[i] <- dt3[, mean(z)]
}
# N(2.25, 1)
c(mean(est1), sd(est1))</code></pre>
<pre><code>## [1] 0.016525503 0.001128918</code></pre>
<pre class="r"><code># N(2.25, .16)
c(mean(est2), sd(est2))</code></pre>
<pre><code>## [1] 0.0165230677 0.0005924007</code></pre>
<pre class="r"><code># Uniform(2, 2.5)
c(mean(est3), sd(est3))</code></pre>
<pre><code>## [1] 0.0165394920 0.0001643243</code></pre>
<p>In each case, the average <span class="math inline">\(\hat{p}\)</span> is 0.0165, and the standard errors are all below the standard MC standard error of 0.0040. The estimates based on draws from the uniform distribution are the most efficient, with a standard error below 0.0002.</p>
</div>
Simulating a cost-effectiveness analysis to highlight new functions for generating correlated data
https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/
Mon, 08 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/<p>My dissertation work (which I only recently completed - in 2012 - even though I am not exactly young, a whole story on its own) focused on inverse probability weighting methods to estimate a causal cost-effectiveness model. I don’t really do any cost-effectiveness analysis (CEA) anymore, but it came up very recently when some folks in the Netherlands contacted me about using <code>simstudy</code> to generate correlated (and clustered) data to compare different approaches to estimating cost-effectiveness. As part of this effort, I developed two more functions in simstudy that allow users to generate correlated data drawn from different types of distributions. Earlier I had created the <code>CorGen</code> functions to generate multivariate data from a single distribution – e.g. multivariate gamma. Now, with the new <code>CorFlex</code> functions (<code>genCorFlex</code> and <code>addCorFlex</code>), users can mix and match distributions. The new version of simstudy is not yet up on CRAN, but is available for download from my <a href="https://github.com/kgoldfeld/simstudy">github</a> site. If you use RStudio, you can install using <code>devtools::install.github("kgoldfeld/simstudy")</code>. [Update: <code>simstudy</code> version 0.1.8 is now available on <a href="https://cran.rstudio.com/web/packages/simstudy/">CRAN</a>.]</p>
<p>I thought I’d introduce this new functionality by generating some correlated cost and outcome data, and show how to estimate a cost-effectiveness analysis curve (CEAC). The CEAC is based on a measure called the incremental net benefit (INB). It is far more common in cost-effectiveness analysis to measure the incremental cost-effectiveness ratio (ICER). I was never enamored of ICERs, because ratios can behave poorly when denominators (in this case the changes in outcomes) get very small. Since it is a difference, the INB behaves much better. Furthermore, it seems relatively intuitive that a negative INB is not a good thing (i.e., it is not good if costs are greater than benefits), but a negative ICER has an unclear interpretation. My goal isn’t to give you a full explanation of CEA, but to provide an application to demonstrate the new simstudy functions. If you really want to learn more about this topic, you can find a paper <a href="http://onlinelibrary.wiley.com/doi/10.1002/sim.6017/full">here</a> that described my dissertation work. Of course, this is a well-established field of study, so naturally there is much more out there…</p>
<div id="simulation-scenario" class="section level3">
<h3>Simulation scenario</h3>
<p>In the simulation scenario I’ve concocted, the goal is to increase the number of patients that come in for an important test. A group of public health professionals have developed a new outreach program that they think will be able to draw in more patients. The study is conducted at the site level - some sites will implement the new approach, and the others, serving as controls, will continue with the existing approach. The cost for the new approach is expected to be higher, and will vary by site. In the first scenario, we assume that costs and recruitment are correlated with each other. That is, sites that tend to spend more generally have higher recruitment levels, even before introducing the new recruitment method.</p>
<p>The data are simulated using the assumption that costs have a gamma distribution (since costs are positive, continuous and skewed to the right) and that recruitment numbers are Poisson distributed (since they are non-negative counts). The intervention sites will have costs that are on average $1000 greater than the control sites. Recruitment will be 10 patients higher for the intervention sites. This is an average expenditure of $100 per additional patient recruited:</p>
<pre class="r"><code>library(simstudy)
# Total of 500 sites, 250 control/250 intervention
set.seed(2018)
dt <- genData(500)
dt <- trtAssign(dtName = dt, nTrt = 2,
balanced = TRUE, grpName = "trtSite")
# Define data - intervention costs $1000 higher on average
def <- defDataAdd(varname = "cost", formula = "1000 + 1000*trtSite",
variance = 0.2, dist = "gamma")
def <- defDataAdd(def, varname = "nRecruits",
formula = "100 + 10*trtSite",
dist = "poisson")
# Set correlation paramater (based on Kendall's tau)
tau <- 0.2
# Generate correlated data using new function addCorFlex
dOutcomes <- addCorFlex(dt, defs = def, tau = tau)
dOutcomes</code></pre>
<pre><code>## id trtSite cost nRecruits
## 1: 1 1 1553.7862 99
## 2: 2 1 913.2466 90
## 3: 3 1 1314.5522 91
## 4: 4 1 1610.5535 112
## 5: 5 1 3254.1100 99
## ---
## 496: 496 1 1452.5903 99
## 497: 497 1 292.8769 109
## 498: 498 0 835.3930 85
## 499: 499 1 1618.0447 92
## 500: 500 0 363.2429 101</code></pre>
<p>The data have been generated, so now we can examine the means and standard deviations of costs and recruitment:</p>
<pre class="r"><code>dOutcomes[, .(meanCost = mean(cost), sdCost = sd(cost)),
keyby = trtSite]</code></pre>
<pre><code>## trtSite meanCost sdCost
## 1: 0 992.2823 449.8359
## 2: 1 1969.2057 877.1947</code></pre>
<pre class="r"><code>dOutcomes[, .(meanRecruit = mean(nRecruits), sdRecruit = sd(nRecruits)),
keyby = trtSite]</code></pre>
<pre><code>## trtSite meanRecruit sdRecruit
## 1: 0 99.708 10.23100
## 2: 1 108.600 10.10308</code></pre>
<p>And here is the estimate of Kendall’s tau within each intervention arm:</p>
<pre class="r"><code>dOutcomes[, .(tau = cor(cost, nRecruits, method = "kendall")),
keyby = trtSite]</code></pre>
<pre><code>## trtSite tau
## 1: 0 0.2018365
## 2: 1 0.1903694</code></pre>
</div>
<div id="cost-effectiveness-icer" class="section level3">
<h3>Cost-effectiveness: ICER</h3>
<p>The question is, are the added expenses of the program worth it when we look at the difference in recruitment? In the traditional approach, the incremental cost-effectiveness ratio is defined as</p>
<p><span class="math display">\[ICER = \frac{ \bar{C}_{intervention} - \bar{C}_{control} }{ \bar{R}_{intervention} - \bar{R}_{control}}\]</span></p>
<p>where <span class="math inline">\(\bar{C}\)</span> and <span class="math inline">\(\bar{R}\)</span> represent the average costs and recruitment levels, respectively.</p>
<p>We can calculate the ICER in this simulated study:</p>
<pre class="r"><code>(costDif <- dOutcomes[trtSite == 1, mean(cost)] -
dOutcomes[trtSite == 0, mean(cost)])</code></pre>
<pre><code>## [1] 976.9235</code></pre>
<pre class="r"><code>(nDif <- dOutcomes[trtSite == 1, mean(nRecruits)] -
dOutcomes[trtSite == 0, mean(nRecruits)])</code></pre>
<pre><code>## [1] 8.892</code></pre>
<pre class="r"><code># ICER
costDif/nDif</code></pre>
<pre><code>## [1] 109.8654</code></pre>
<p>In this case the average cost for the intervention group is $976 higher than the control group, and recruitment goes up by about 9 people. Based on this, the ICER is $110 per additional recruited individual. We would deem the initiative cost-effective if we are willing to pay at least $110 to recruit a single person. If, for example, we save $150 in future health care costs for every additional person we recruit, we should be willing to invest $110 for a new recruit. Under this scenario, we would deem the program cost effective (assuming, of course, we have some measure of uncertainty for our estimate).</p>
</div>
<div id="cost-effectiveness-inb-the-ceac" class="section level3">
<h3>Cost-effectiveness: INB & the CEAC</h3>
<p>I alluded to the fact that I believe that the incremental net benefit (INB) might be a preferable way to measure cost-effectiveness, just because the measure is more stable and easier to interpret. This is how it is defined:</p>
<p><span class="math display">\[INB = \lambda (\bar{R}_{intervention} - \bar{R}_{control}) - (\bar{C}_{intervention} - \bar{C}_{control})\]</span></p>
<p>where <span class="math inline">\(\lambda\)</span> is the willingness-to-pay I mentioned above. One of the advantages to using the INB is that we don’t need to specify <span class="math inline">\(\lambda\)</span>, but can estimate a range of INBs based on a range of willingness-to-pay values. For all values of <span class="math inline">\(\lambda\)</span> where the INB exceeds $0, the intervention is cost-effective.</p>
<p>The CEAC is a graphical approach to cost-effectiveness analysis that takes into consideration uncertainty. We estimate uncertainty using a bootstrap approach, which entails sampling repeatedly from the original “observed” data set with replacement. Each time we draw a sample, we estimate the mean differences in cost and recruitment for the two treatment arms. A plot of these estimated means gives a sense of the variability of our estimates (and we can see how strongly these means are correlated). Once we have all these bootstrapped means, we can calculate a range of INB’s for each pair of means and a range of <span class="math inline">\(\lambda\)</span>’s. The CEAC represents <em>the proportion of bootstrapped estimates with a positive INB at a particular level of <span class="math inline">\(\lambda\)</span>.</em></p>
<p>This is much easier to see in action. To implement this, I wrote a little function that randomly samples the original data set and estimates the means:</p>
<pre class="r"><code>estMeans <- function(dt, grp, boot = FALSE) {
dGrp <- dt[trtSite == grp]
if (boot) {
size <- nrow(dGrp)
bootIds <- dGrp[, sample(id, size = size, replace = TRUE)]
dGrp <- dt[bootIds]
}
dGrp[, .(mC = mean(cost), mN = mean(nRecruits))]
}</code></pre>
<p>First, we calculate the differences in means of the observed data:</p>
<pre class="r"><code>(estResult <- estMeans(dOutcomes, 1) - estMeans(dOutcomes, 0))</code></pre>
<pre><code>## mC mN
## 1: 976.9235 8.892</code></pre>
<p>Next, we draw 1000 bootstrap samples:</p>
<pre class="r"><code>bootResults <- data.table()
for (i in 1:1000) {
changes <- estMeans(dOutcomes, 1, boot = TRUE) -
estMeans(dOutcomes, 0, boot = TRUE)
bootResults <- rbind(bootResults, changes)
}
bootResults</code></pre>
<pre><code>## mC mN
## 1: 971.3087 9.784
## 2: 953.2996 8.504
## 3: 1053.0340 9.152
## 4: 849.5292 8.992
## 5: 1008.9378 8.452
## ---
## 996: 894.0251 8.116
## 997: 1002.0393 7.948
## 998: 981.6729 8.784
## 999: 1109.8255 9.596
## 1000: 995.6786 8.736</code></pre>
<p>Finally, we calculate the proportion of INBs that exceed zero for a range of <span class="math inline">\(\lambda\)</span>’s from $75 to $150. We can see that at willingness-to-pay levels higher than $125, there is a very high probability (~90%) of the intervention being cost-effective. (At the ICER level of $110, the probability of cost-effectiveness is only around 50%.)</p>
<pre class="r"><code>CEAC <- data.table()
for (wtp in seq(75, 150, 5)) {
propPos <- bootResults[, mean((wtp * mN - mC) > 0)]
CEAC <- rbind(CEAC, data.table(wtp, propPos))
}
CEAC</code></pre>
<pre><code>## wtp propPos
## 1: 75 0.000
## 2: 80 0.000
## 3: 85 0.002
## 4: 90 0.018
## 5: 95 0.075
## 6: 100 0.183
## 7: 105 0.339
## 8: 110 0.505
## 9: 115 0.659
## 10: 120 0.776
## 11: 125 0.871
## 12: 130 0.941
## 13: 135 0.965
## 14: 140 0.984
## 15: 145 0.992
## 16: 150 0.998</code></pre>
</div>
<div id="a-visual-cea" class="section level3">
<h3>A visual CEA</h3>
<p>Here are three series of plots, shown for different levels of correlation between cost and recruitment. Each series includes a plot of the original cost and recruitment data, where each point represents a site. The second plot shows the average difference in means between the intervention and control sites in purple and the bootstrapped differences in grey. The third plot is the CEAC with a horizontal line drawn at 90%. The first series is the data set we generated with tau = 0.2:</p>
<p><img src="https://www.rdatagen.net/post/2018-01-08-generating-correlated-data-for-a-simulated-cost-effectiveness-analysis_files/figure-html/plot1-1.png" width="1056" /></p>
<p>When there is no correlation between costs and recruitment across sites (tau = 0):</p>
<p><img src="https://www.rdatagen.net/post/2018-01-08-generating-correlated-data-for-a-simulated-cost-effectiveness-analysis_files/figure-html/tau2-1.png" width="1056" /></p>
<p>And finally - when there is a higher degree of correlation, tau = 0.4:</p>
<p><img src="https://www.rdatagen.net/post/2018-01-08-generating-correlated-data-for-a-simulated-cost-effectiveness-analysis_files/figure-html/tau3-1.png" width="1056" /></p>
</div>
<div id="effect-of-correlation" class="section level3">
<h3>Effect of correlation?</h3>
<p>In all three scenarios (with different levels of tau), the ICER is approximately $110. Of course, this is directly related to the fact that the estimated differences in means of the two intervention groups is the same across the scenarios. But, when we look at the three site-level and bootstrap plots, we can see the varying levels of correlation.</p>
<p>And while there also appears to be a subtle visual difference between the CEAC’s for different levels of correlation, it is not clear if this is a real difference or random variation. To explore this a bit further, I generated 250 data sets and their associated CEACs (which in turn are generated by 1000 bootstrap steps eacj) under a range of tau’s, starting with no correlation (tau = 0) up to a considerable level of correlation (tau = 0.4). In these simulations, I used a larger sample size of 2000 sites to reduce the variation a bit. Here are the results:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-cea/tauplots.png" />
</div>
<p>It appears that the variability of the CEAC curves decreases as correlation between cost and recruitment (determined by tau) increases; the range of the curves is smallest when tau is 0.4. In addition, in looks like the “median” CEAC moves slightly rightward as tau increases, which suggests that probability of cost-effectiveness will vary across different levels of tau. All this is to say that correlation appears to matter, so it might be an important factor to consider when both simulating these sorts of data and actually conducting a CEA.</p>
</div>
<div id="next-steps" class="section level3">
<h3>Next steps?</h3>
<p>In this example, I based the entire analysis on a simple non-parametric estimate of the means. In the future, I might explore copula-based methods to fit joint models of costs and outcomes. In simstudy, a Gaussian copula generates the correlated data. However there is a much larger world of copulas out there that can be used to model correlation between measures regardless of their marginal distributions. And some of these methods have been applied in the context of CEA. Stay tuned on this front (though it might be a while).</p>
</div>
When there's a fork in the road, take it. Or, taking a look at marginal structural models.
https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/
Mon, 11 Dec 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/<p>I am going to cut right to the chase, since this is the third of three posts related to confounding and weighting, and it’s kind of a long one. (If you want to catch up, the first two are <a href="https://www.rdatagen.net/post/potential-outcomes-confounding/">here</a> and <a href="https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/">here</a>.) My aim with these three posts is to provide a basic explanation of the <em>marginal structural model</em> (MSM) and how we should interpret the estimates. This is obviously a very rich topic with a vast literature, so if you remain interested in the topic, I recommend checking out this (as of yet unpublished) <a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">text book</a> by Hernán & Robins for starters.</p>
<p>The DAG below is a simple version of how things can get complicated very fast if we have sequential treatments or exposures that both affect and are affected by intermediate factors or conditions.</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/MSM_DAG_observed.png" />
</div>
<p><span class="math inline">\(A_0\)</span> and <span class="math inline">\(A_1\)</span> represent two treatment points and <span class="math inline">\(L_0\)</span> and <span class="math inline">\(L_1\)</span> represent measurements taken before and after treatments, respectively. Both treatments and at least <span class="math inline">\(L_1\)</span> affect outcome <span class="math inline">\(Y\)</span>. (I am assuming that the <span class="math inline">\(A\)</span>’s and <span class="math inline">\(L\)</span>’s are binary and that <span class="math inline">\(Y\)</span> is continuous. <span class="math inline">\(\epsilon\)</span> is <span class="math inline">\(N(0, \sigma_\epsilon^2)\)</span>.)</p>
<p>An example of this might be a situation where we are interested in the effect of a drug treatment on mental health well-being for patients with prehypertension or hypertension. A physician’s decision to administer the drug at each visit is influenced by the patient’s level of hypertension. In turn, the treatment <span class="math inline">\(A_0\)</span> potentially reduces the probability of hypertension - <span class="math inline">\(P(L_1=1)\)</span>. And finally, <span class="math inline">\(L_1\)</span> influences the next treatment decision and ultimately the mental health outcome <span class="math inline">\(Y\)</span>.</p>
<p>The complicating factor is that the hypertension level following the first treatment (<span class="math inline">\(L_1\)</span>) is both a mediator the effect of treatment <span class="math inline">\(A_0\)</span> and confounder of the treatment effect <span class="math inline">\(A_1\)</span> on <span class="math inline">\(Y\)</span>. To get an unbiased estimate the effect of the combined treatment regime (<span class="math inline">\(A_0\)</span> and <span class="math inline">\(A_1\)</span>) we need to both control for <span class="math inline">\(L_1\)</span> and not control for <span class="math inline">\(L_1\)</span>. This is where MSMs and inverse probability weighting (IPW) come into play.</p>
<p>The MSM is marginal in the sense that we’ve been talking about in this series - the estimate will be a population-wide estimate that reflects the mixture of the covariates that influence the treatments and outcomes (in this case, the <span class="math inline">\(L\)</span>’s). It is structural in the sense that we are modeling <em>potential outcomes</em>. Nothing has changed from the last <a href="https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/">post</a> except for the fact that we are now defining the exposures as a sequence of different treatments (here <span class="math inline">\(A_0\)</span> and <span class="math inline">\(A_1\)</span>, but could easily extend to <span class="math inline">\(n\)</span> treatments - up to <span class="math inline">\(A_n\)</span>.)</p>
<div id="imagine-an-experiment" class="section level3">
<h3>Imagine an experiment …</h3>
<p>To understand the MSM, it is actually helpful to think about how a single individual fits into the picture. The tree diagram below literally shows that. The MSM posits a weird experiment where measurements (of <span class="math inline">\(L\)</span>) are collected and treatments (<span class="math inline">\(A\)</span>) are assigned repeatedly until a final outcome is measured. In this experiment, the patient is not just assigned to one treatment arm, but to both! Impossible of course, but that is the world of potential outcomes.</p>
<p>At the start of the experiment, a measurement of <span class="math inline">\(L_0\)</span> is collected. This sends the patient down the one of the branches of the tree. Since the patient is assigned to both <span class="math inline">\(A_0=0\)</span> and <span class="math inline">\(A_0=1\)</span>, she actually heads down two <em>different</em> branches simultaneously. Following the completion of the first treatment period <span class="math inline">\(A_0\)</span>, the second measurement (<span class="math inline">\(L_1\)</span>) is collected. But, two measurements are taken for the patient - one for each branch. The results need not be the same. In fact, if the treatment has an individual-level effect on <span class="math inline">\(L_1\)</span>, then the results will be different for this patient. In the example below, this is indeed the case. Following each of those measurements (in parallel universes), the patient is sent down the next treatment branches (<span class="math inline">\(A_1\)</span>). At this point, the patient finds herself in four branches. At the end of each, the measurement of <span class="math inline">\(Y\)</span> is taken, and we have four potential outcomes for individual {i}: <span class="math inline">\(Y^i_{00}\)</span>, <span class="math inline">\(Y^i_{10}\)</span>, <span class="math inline">\(Y^i_{01}\)</span>, and <span class="math inline">\(Y^i_{11}\)</span>.</p>
<p>A different patient will head down different branches based on his own values of <span class="math inline">\(L_0\)</span> and <span class="math inline">\(L_1\)</span>, and will thus end up with different potential outcomes. (Note: the values represented in the figure are intended to be average values for that particular path.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/IPW_MSM_Ind.png" />
</div>
</div>
<div id="how-do-we-define-the-causal-effect" class="section level3">
<h3>How do we define the causal effect?</h3>
<p>With four potential outcomes rather than two, it is less obvious how to define the causal effect. We could, for example, consider three separate causal effects by comparing each of the treatment “regimes” that include at least one exposure to the intervention to the single regime that leaves the patient entirely unexposed. That is, we could be interested in (at the individual <span class="math inline">\(i\)</span> level) <span class="math inline">\(E^i_1 = Y^i_{10}-Y^i_{00}\)</span>, <span class="math inline">\(E^i_2 = Y^i_{01}-Y^i_{00}\)</span>, and <span class="math inline">\(E^i_3 = Y^i_{11}-Y^i_{00}\)</span>. This is just one possibility; the effects of interest are driven entirely by the research question.</p>
<p>When we have three or four or more intervention periods, the potential outcomes can start to pile up rapidly (we will have <span class="math inline">\(2^n\)</span> potential outcomes for a sequence of <span class="math inline">\(n\)</span> treatments.) So, the researcher might want to be judicious in deciding which contrasts to be made. Maybe something like <span class="math inline">\(Y_{1111} - Y_{0000}\)</span>, <span class="math inline">\(Y_{0111} - Y_{0000}\)</span>, <span class="math inline">\(Y_{0011} - Y_{0000}\)</span>, and <span class="math inline">\(Y_{0001} - Y_{0000}\)</span> for a four-period intervention. This would allow us to consider the effect of starting (and never stopping) the intervention in each period compared to never starting the intervention at all. By doing this, though, we would be using only 5 out of the 16 potential outcomes. If the remaining 11 paths are not so rare, we might be ignoring a lot of data.</p>
</div>
<div id="the-marginal-effect" class="section level3">
<h3>The marginal effect</h3>
<p>The tree below represents an aggregate set of branches for a sample of 5000 individuals. The sample is initially characterized only by the distribution of <span class="math inline">\(L_0\)</span>. Each individual will go down her own set of four paths, which depend on the starting value of <span class="math inline">\(L_0\)</span> and how each value of <span class="math inline">\(L_1\)</span> responds in the context of each treatment arm.</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/IPW_MSM_PO.png" />
</div>
<p>Each individual <span class="math inline">\(i\)</span> (at least in theory) has four potential outcomes: <span class="math inline">\(Y^i_{00}\)</span>, <span class="math inline">\(Y^i_{10}\)</span>, <span class="math inline">\(Y^i_{01}\)</span>, and <span class="math inline">\(Y^i_{11}\)</span>. Averaging across the sample provides a marginal estimate of each of these potential outcomes. For example, <span class="math inline">\(E(Y_{00})=\sum_i{Y^i_{00}}/5000\)</span>. This can be calculated from the tree as <span class="math display">\[(1742*53 + 1908*61 + 392*61 + 958*69)/5000 = 59.7\]</span> Similarly, <span class="math inline">\(E(Y_{11}) = 40.1\)</span> The sample average causal effects are estimated using the respective averages of the potential outcomes. For example, <span class="math inline">\(E_3\)</span> at the sample level would be defined as <span class="math inline">\(E(Y_{11}) - E(Y_{00}) = 40.1 - 59.7 = -19.6\)</span>.</p>
</div>
<div id="back-in-the-real-world" class="section level3">
<h3>Back in the real world</h3>
<p>In reality, there are no parallel universes. Maybe we could come up with an actual randomized experiment to mimic this, but it may be difficult. More likely, we’ll have observed data that looks like this:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/IPW_MSM_obs_noIPW.png" />
</div>
<p>Each individual heads down his or her own path, receiving a single treatment at each time point. Since this is not a randomized trial, the probability of treatment is different across different levels of <span class="math inline">\(L_0\)</span> and <span class="math inline">\(L_1\)</span> and that <span class="math inline">\(L_0\)</span> and <span class="math inline">\(L_1\)</span> are associated with the outcome (i.e. there is confounding).</p>
</div>
<div id="estimating-the-marginal-effects" class="section level3">
<h3>Estimating the marginal effects</h3>
<p>In the previous posts in this series, I provided some insight as to how we might justify using observed data only to estimate these sample-wide average potential outcomes. The most important assumption is that when we have measured all confounders, we may be able to say, for example, <span class="math inline">\(E(Y_{01}) = E(Y | A_0 = 0 \ \& \ A_1 = 1 )\)</span>. The <em>potential outcome</em> for everyone in the sample is equal to the <em>observed</em> outcome for the subgroup who actually followed the particular path that represents that potential outcome. We will make the same assumption here.</p>
<p>At the start of this post, I argued that given the complex nature of the data generating process (in particular given that <span class="math inline">\(L_1\)</span> is both a mediator and confounder), it is challenging to get unbiased estimates of the intervention effects. One way to do this with marginal structural models (another way is using <a href="https://academic.oup.com/aje/article/173/7/731/104142"><em>g-computation</em></a>, but I won’t talk about that here). Inverse probability weighting converts the observed tree graph from the real world to the marginal tree graph so that we can estimate sample-wide average (marginal) potential outcomes as an estimate for some population causal effects.</p>
<p>In this case, the inverse probability weight is calculated as <span class="math display">\[IPW = \frac{1}{P(A_0=a_0 | L_0=l_0) \times P(A_1=a_1 | L_0=l_0, A_0=a_0, L_1=l_1)}\]</span> In practice, we estimate both probabilities using logistic regression or some other modeling technique. But here, we can read the probabilities off the tree graph. For example, if we are interested in the weight associated with individuals observed with <span class="math inline">\(L_0=1, A_0=0, L_1=0, \textbf{and } A_1=1\)</span>, the probabilities are <span class="math display">\[P(A_0 = 0 | L_0=1) = \frac{676}{1350}=0.50\]</span> and <span class="math display">\[P(A_1=1 | L_0=1, A_0=0, L_1=0) = \frac{59}{196} = 0.30\]</span></p>
<p>So, the inverse probability weight for these individuals is <span class="math display">\[IPW = \frac{1}{0.50 \times 0.30} = 6.67\]</span> For the 59 individuals that followed this pathway, the weighted number is <span class="math inline">\(59 \times 6.67 = 393\)</span>. In the marginal world of parallel universes, there were 394 individuals.</p>
</div>
<div id="simulating-data-from-an-msm" class="section level3">
<h3>Simulating data from an MSM</h3>
<p>Before I jump into the simulation, I do want to reference a paper by <a href="http://onlinelibrary.wiley.com/doi/10.1002/sim.5472/full">Havercroft and Didelez</a> that describes in great detail how to generate data from a MSM with time-dependent confounding. It turns out that the data can’t be generated exactly using the intial DAG (presented above), but rather needs to come from something like this:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/MSM_DAG_dataGen.png" />
</div>
<p>where <span class="math inline">\(U\)</span> is an unmeasured, maybe latent, covariate. The observed data (that ignores <span class="math inline">\(U\)</span>) will indeed have a DAG that looks like the one that we started with.</p>
<p>When doing simulations with potential outcomes, we can generate all the potential outcomes for each individual using a parallel universe approach. The observed data (treatment choices and observed outcomes) are generated separately. The advantage of this is that we can confirm the <em>true</em> causal effects because we have actually generated potential outcomes. The disadvantage is that the code is considerably more complicated and the quantity of data generated grows. The situation is not so bad with just two treatment periods, but the size of the data increases exponentially with the number of treatments: as I mentioned earlier, there will be <span class="math inline">\(2^n\)</span> potential outcomes for each individual.</p>
<p>Alternatively, we can just generate the observed data directly. Since we know the true causal parameters we actually “know” the causal effects and can compare our estimates.</p>
<p>I will go through the convoluted approach because I think it clarifies (at least a bit) what is going on. As an addendum, I will show how all of this could be done in a few lines of code if we take the second approach …</p>
<pre class="r"><code>library(broom)
library(simstudy)
# define U, e and L0
defA0 <- defData(varname = "U", formula = "0;1", dist = "uniform")
defA0 <- defData(defA0, varname = "e", formula = 0,
variance = 4, dist = "normal")
defA0<- defData(defA0, varname = "L0", formula = "-2.66+ 3*U",
dist = "binary", link = "logit")
# generate the data
set.seed(1234)
dtA0 <- genData(n = 50000, defA0)
dtA0[1:6]</code></pre>
<pre><code>## id U e L0
## 1: 1 0.1137034 -3.5951796 0
## 2: 2 0.6222994 -0.5389197 0
## 3: 3 0.6092747 1.0675660 0
## 4: 4 0.6233794 -0.7226909 1
## 5: 5 0.8609154 0.8280401 0
## 6: 6 0.6403106 3.3532399 0</code></pre>
<p>Now we need to create the two parallel universes - assigning each individual to both treatments. <code>simstudy</code> has a function <code>addPeriods</code> to generate longitudinal data. I am not doing that here, but can generate 2-period data and change the name of the “period” field to “A0”.</p>
<pre class="r"><code>dtA0 <- addPeriods(dtA0, 2)
setnames(dtA0, "period", "A0")
dtA0[1:6]</code></pre>
<pre><code>## id A0 U e L0 timeID
## 1: 1 0 0.1137034 -3.5951796 0 1
## 2: 1 1 0.1137034 -3.5951796 0 2
## 3: 2 0 0.6222994 -0.5389197 0 3
## 4: 2 1 0.6222994 -0.5389197 0 4
## 5: 3 0 0.6092747 1.0675660 0 5
## 6: 3 1 0.6092747 1.0675660 0 6</code></pre>
<p>Now we are ready to randomly assign a value of <span class="math inline">\(L_1\)</span>. The probability is lower for cases where <span class="math inline">\(A_0 = 1\)</span>, so individuals themselves may have different values of <span class="math inline">\(L_1\)</span> in the alternative paths.</p>
<pre class="r"><code># generate L1 as a function of U, L0, and A0
addA0 <- defDataAdd(varname = "L1",
formula = "-1.2 + 3*U + 0.2*L0 - 2.5*A0",
dist= "binary", link="logit")
dtA0 <- addColumns(addA0, dtOld = dtA0)
dtA0[1:6]</code></pre>
<pre><code>## id A0 U e L0 timeID L1
## 1: 1 0 0.1137034 -3.5951796 0 1 0
## 2: 1 1 0.1137034 -3.5951796 0 2 0
## 3: 2 0 0.6222994 -0.5389197 0 3 1
## 4: 2 1 0.6222994 -0.5389197 0 4 0
## 5: 3 0 0.6092747 1.0675660 0 5 0
## 6: 3 1 0.6092747 1.0675660 0 6 0</code></pre>
<pre class="r"><code># L1 is clearly a function of A0
dtA0[, .(prob_L1 = mean(L1)), keyby = .(L0,A0)]</code></pre>
<pre><code>## L0 A0 prob_L1
## 1: 0 0 0.5238369
## 2: 0 1 0.1080039
## 3: 1 0 0.7053957
## 4: 1 1 0.2078551</code></pre>
<p>Now we create two additional parallel universes for treatment <span class="math inline">\(A_1\)</span> and the potential outcomes. This will result in four records per individual:</p>
<pre class="r"><code>dtA1 <- addPeriods(dtA0, 2)
setnames(dtA1, "period", "A1")
addA1 <- defDataAdd(varname = "Y_PO",
formula = "39.95 + U*40 - A0 * 8 - A1 * 12 + e",
dist = "nonrandom")
dtA1 <- addColumns(addA1, dtA1)
dtA1[1:8]</code></pre>
<pre><code>## id A1 A0 U e L0 timeID L1 Y_PO
## 1: 1 0 0 0.1137034 -3.5951796 0 1 0 40.90296
## 2: 1 0 1 0.1137034 -3.5951796 0 2 0 32.90296
## 3: 1 1 0 0.1137034 -3.5951796 0 3 0 28.90296
## 4: 1 1 1 0.1137034 -3.5951796 0 4 0 20.90296
## 5: 2 0 0 0.6222994 -0.5389197 0 5 1 64.30306
## 6: 2 0 1 0.6222994 -0.5389197 0 6 0 56.30306
## 7: 2 1 0 0.6222994 -0.5389197 0 7 1 52.30306
## 8: 2 1 1 0.6222994 -0.5389197 0 8 0 44.30306</code></pre>
<p>Not surprisingly, the estimates for the causal effects mirror the parameters we used to generate the <span class="math inline">\(Y\)</span>’s above …</p>
<pre class="r"><code># estimate for Y_00 is close to what we estimated from the tree
Y_00 <- dtA1[A0 == 0 & A1 == 0, mean(Y_PO)]
Y_00</code></pre>
<pre><code>## [1] 59.96619</code></pre>
<pre class="r"><code>Y_10 <- dtA1[A0 == 1 & A1 == 0, mean(Y_PO)]
Y_01 <- dtA1[A0 == 0 & A1 == 1, mean(Y_PO)]
Y_11 <- dtA1[A0 == 1 & A1 == 1, mean(Y_PO)]
# estimate 3 causal effects
c(Y_10 - Y_00, Y_01 - Y_00, Y_11 - Y_00)</code></pre>
<pre><code>## [1] -8 -12 -20</code></pre>
<p>Now that we’ve generated the four parallel universes with four potential outcomes per individual, we will generate an observed treatment sequence and measurements of the <span class="math inline">\(L\)</span>’s and <span class="math inline">\(Y\)</span> for each individual. The observed data set will have a single record for each individual:</p>
<pre class="r"><code>dt <- dtA1[A0 == 0 & A1 == 0, .(id, L0)]
dt</code></pre>
<pre><code>## id L0
## 1: 1 0
## 2: 2 0
## 3: 3 0
## 4: 4 1
## 5: 5 0
## ---
## 49996: 49996 1
## 49997: 49997 0
## 49998: 49998 1
## 49999: 49999 0
## 50000: 50000 1</code></pre>
<p><span class="math inline">\(A_0\)</span> is a function of <span class="math inline">\(L_0\)</span>:</p>
<pre class="r"><code>dtAdd <- defDataAdd(varname = "A0",
formula = "0.3 + L0 * 0.2", dist = "binary" )
dt <- addColumns(dtAdd, dt)
dt[, mean(A0), keyby= L0]</code></pre>
<pre><code>## L0 V1
## 1: 0 0.3015964
## 2: 1 0.4994783</code></pre>
<p>Now, we need to pull the appropriate value of <span class="math inline">\(L_1\)</span> from the original data set that includes both possible values for each individual. The value that gets pulled will be based on the observed value of <span class="math inline">\(A_0\)</span>:</p>
<pre class="r"><code>setkeyv(dt, c("id", "A0"))
setkeyv(dtA1, c("id", "A0"))
dt <- merge(dt, dtA1[, .(id, A0, L1, A1) ], by = c("id", "A0"))
dt <- dt[A1 == 0, .(id, L0, A0, L1)]
dt</code></pre>
<pre><code>## id L0 A0 L1
## 1: 1 0 1 0
## 2: 2 0 1 0
## 3: 3 0 0 0
## 4: 4 1 1 1
## 5: 5 0 0 1
## ---
## 49996: 49996 1 1 0
## 49997: 49997 0 1 0
## 49998: 49998 1 1 0
## 49999: 49999 0 0 1
## 50000: 50000 1 0 0</code></pre>
<p>Finally, we generate <span class="math inline">\(A_1\)</span> based on the observed values of <span class="math inline">\(A_0\)</span> and <span class="math inline">\(L_1\)</span>, and select the appropriate value of <span class="math inline">\(Y\)</span>:</p>
<pre class="r"><code>dtAdd <- defDataAdd(varname = "A1",
formula = "0.3 + L1 * 0.2 + A0 * .2", dist = "binary")
dt <- addColumns(dtAdd, dt)
# merge to get potential outcome that matches actual path
setkey(dt, id, L0, A0, L1, A1)
setkey(dtA1, id, L0, A0, L1, A1)
dtObs <- merge(dt, dtA1[,.(id, L0, A0, L1, A1, Y = Y_PO)])
dtObs</code></pre>
<pre><code>## id L0 A0 L1 A1 Y
## 1: 1 0 1 0 0 32.90296
## 2: 2 0 1 0 1 44.30306
## 3: 3 0 0 0 1 53.38856
## 4: 4 1 1 1 1 44.16249
## 5: 5 0 0 1 0 75.21466
## ---
## 49996: 49996 1 1 0 0 74.09161
## 49997: 49997 0 1 0 0 50.26162
## 49998: 49998 1 1 0 0 73.29376
## 49999: 49999 0 0 1 0 52.96703
## 50000: 50000 1 0 0 0 57.13109</code></pre>
<p>If we do a crude estimate of the causal effects using the unadjusted observed data, we know we are going to get biased estimates (remember the true causal effects are -8, -12, and -20):</p>
<pre class="r"><code>Y_00 <- dtObs[A0 == 0 & A1 == 0, mean(Y)]
Y_10 <- dtObs[A0 == 1 & A1 == 0, mean(Y)]
Y_01 <- dtObs[A0 == 0 & A1 == 1, mean(Y)]
Y_11 <- dtObs[A0 == 1 & A1 == 1, mean(Y)]
c(Y_10 - Y_00, Y_01 - Y_00, Y_11 - Y_00)</code></pre>
<pre><code>## [1] -6.272132 -10.091513 -17.208856</code></pre>
<p>This biased result is confirmed using an unadjusted regression model:</p>
<pre class="r"><code>lmfit <- lm(Y ~ A0 + A1, data = dtObs)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 58.774695 0.07805319 753.00828 0
## 2 A0 -6.681213 0.10968055 -60.91520 0
## 3 A1 -10.397080 0.10544448 -98.60241 0</code></pre>
<p>Now, shouldn’t we do better if we adjust for the confounders? I don’t think so - the parameter estimate for <span class="math inline">\(A_0\)</span> should be close to <span class="math inline">\(8\)</span>; the estimate for <span class="math inline">\(A_1\)</span> should be approximately <span class="math inline">\(12\)</span>, but this is not the case, at least not for both of the estimates:</p>
<pre class="r"><code>lmfit <- lm(Y ~ L0 + L1 + A0 + A1, data = dtObs)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 53.250244 0.08782653 606.31157 0
## 2 L0 7.659460 0.10798594 70.93016 0
## 3 L1 8.203983 0.10644683 77.07119 0
## 4 A0 -4.369547 0.11096204 -39.37875 0
## 5 A1 -12.037274 0.09592735 -125.48323 0</code></pre>
<p>Maybe if we just adjust for <span class="math inline">\(L_0\)</span> or <span class="math inline">\(L_1\)</span>?</p>
<pre class="r"><code>lmfit <- lm(Y ~ L1 + A0 + A1, data = dtObs)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 54.247394 0.09095074 596.44808 0.000000e+00
## 2 L1 9.252919 0.11059038 83.66839 0.000000e+00
## 3 A0 -2.633981 0.11354466 -23.19775 2.031018e-118
## 4 A1 -12.016545 0.10063687 -119.40499 0.000000e+00</code></pre>
<pre class="r"><code>lmfit <- lm(Y ~ L0 + A0 + A1, data = dtObs)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 57.036320 0.07700591 740.67459 0
## 2 L0 8.815691 0.11311215 77.93761 0
## 3 A0 -8.150706 0.10527255 -77.42480 0
## 4 A1 -10.632238 0.09961593 -106.73231 0</code></pre>
<p>So, none of these approaches seem to work. This is where IPW can provide a solution. First we estimate the treatment/exposure models, then we estimate the IPW, and finally we use weighted regression or just estimate weighted average outcomes directly (we’d have to bootstrap here if we want standard errors for the simple average approach):</p>
<pre class="r"><code># estimate P(A0|L0) and P(A1|L0, A0, L1)
fitA0 <- glm(A0 ~ L0, data = dtObs, family=binomial)
fitA1 <- glm(A1 ~ L0 + A0 + L1, data = dtObs, family=binomial)
dtObs[, predA0 := predict(fitA0, type = "response")]
dtObs[, predA1 := predict(fitA1, type = "response")]
# function to convert propenisty scores to IPW
getWeight <- function(predA0, actA0, predA1, actA1) {
predActA0 <- actA0*predA0 + (1-actA0)*(1-predA0)
predActA1 <- actA1*predA1 + (1-actA1)*(1-predA1)
p <- predActA0 * predActA1
return(1/p)
}
dtObs[, wgt := getWeight(predA0, A0, predA1, A1)]
# fit weighted model
lmfit <- lm(Y ~ A0 + A1, weights = wgt, data = dtObs)
tidy(lmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 59.982379 0.09059652 662.08257 0
## 2 A0 -7.986486 0.10464257 -76.32157 0
## 3 A1 -12.051805 0.10464258 -115.17114 0</code></pre>
<pre class="r"><code># non-parametric estimation
Y_00 <- dtObs[A0 == 0 & A1 == 0, weighted.mean(Y, wgt)]
Y_10 <- dtObs[A0 == 1 & A1 == 0, weighted.mean(Y, wgt)]
Y_01 <- dtObs[A0 == 0 & A1 == 1, weighted.mean(Y, wgt)]
Y_11 <- dtObs[A0 == 1 & A1 == 1, weighted.mean(Y, wgt)]
round(c(Y_10 - Y_00, Y_01 - Y_00, Y_11 - Y_00), 2)</code></pre>
<pre><code>## [1] -8.04 -12.10 -20.04</code></pre>
</div>
<div id="addendum" class="section level2">
<h2>Addendum</h2>
<p>This post has been quite long, so I probably shouldn’t go on. But, I wanted to show that we can do the data generation in a much less convoluted way that avoids generating all possible forking paths for each individual. As always in <code>simstudy</code> the data generation process needs us to create a data definition table. In this example, I’ve created that table in an external file named <code>msmDef.csv</code>. In the end, this simpler approach has reduced necessary code by about 95%.</p>
<pre class="r"><code>defMSM <- defRead("msmDef.csv")
defMSM</code></pre>
<pre><code>## varname formula variance dist link
## 1: U 0;1 0 uniform identity
## 2: e 0 9 normal identity
## 3: L0 -2.66+ 3*U 0 binary logit
## 4: A0 0.3 + L0 * 0.2 0 binary identity
## 5: L1 -1.2 + 3*U + 0.2*L0 - 2.5*A0 0 binary logit
## 6: A1 0.3 + L1*0.2 + A0*0.2 0 binary identity
## 7: Y 39.95 + U*40 - A0*8 - A1*12 + e 0 nonrandom identity</code></pre>
<pre class="r"><code>dt <- genData(50000, defMSM)
fitA0 <- glm(A0 ~ L0, data = dt, family=binomial)
fitA1 <- glm(A1 ~ L0 + A0 + L1, data = dt, family=binomial)
dt[, predA0 := predict(fitA0, type = "response")]
dt[, predA1 := predict(fitA1, type = "response")]
dt[, wgt := getWeight(predA0, A0, predA1, A1)]
tidy(lm(Y ~ A0 + A1, weights = wgt, data = dt))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 60.061609 0.09284532 646.89967 0
## 2 A0 -7.931715 0.10715916 -74.01808 0
## 3 A1 -12.131829 0.10715900 -113.21335 0</code></pre>
<div id="does-the-msm-still-work-with-more-complicated-effects" class="section level3">
<h3>Does the MSM still work with more complicated effects?</h3>
<p>In conclusion, I wanted to show that MSMs still function well even when the causal effects do not follow a simple linear pattern. (And I wanted to be able to end with a figure.) I generated 10000 datasets of 900 observations each, and calculated the crude and marginal causal effects after each iteration. The true treatment effects are described by an “interaction” between <span class="math inline">\(A_0\)</span> and <span class="math inline">\(A_1\)</span>. If treatment is received in <em>both</em> periods (i.e. <span class="math inline">\(A_0=1\)</span> and <span class="math inline">\(A_1=1\)</span>), there is an extra additive effect:</p>
<p><span class="math display">\[ Y = 39.95 + U*40 - A0*8 - A1*12 - A0*A1*3 + e\]</span></p>
<p>The purple density is the (biased) observed estimates and the green density is the (unbiased) IPW-based estimate. Again the true causal effects are -8, -12, and -23:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-msm/densities.png" />
</div>
</div>
</div>
When you use inverse probability weighting for estimation, what are the weights actually doing?
https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/
Mon, 04 Dec 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/<p>Towards the end of <a href="https://www.rdatagen.net/post/potential-outcomes-confounding/">Part 1</a> of this short series on confounding, IPW, and (hopefully) marginal structural models, I talked a little bit about the fact that inverse probability weighting (IPW) can provide unbiased estimates of marginal causal effects in the context of confounding just as more traditional regression models like OLS can. I used an example based on a normally distributed outcome. Now, that example wasn’t super interesting, because in the case of a linear model with homogeneous treatment effects (i.e. no interaction), the marginal causal effect is the same as the conditional effect (that is, conditional on the confounders.) There was no real reason to use IPW in that example - I just wanted to illustrate that the estimates looked reasonable.</p>
<p>But in many cases, the conditional effect <em>is</em> different from the marginal effect. (And in other cases, there may not even be an obvious way to estimate the conditional effect - that will be the topic for the last post in this series). When the outcome is binary, the notion that conditional effects are equal to marginal effects is no longer the case. (I’ve touched on this <a href="https://www.rdatagen.net/post/marginal-v-conditional/">before</a>.) What this means, is that we can recover the true conditional effects using logistic regression, but we cannot estimate the marginal effect. This is directly related to the fact that logistic regression is linear on the logit (or log-odds) scale, not on the probability scale. The issue here is collapsibility, or rather, non-collapsibility.</p>
<div id="a-simulation" class="section level3">
<h3>A simulation</h3>
<p>Because binary outcomes are less amenable to visual illustration, I am going to stick with model estimation to see how this plays out:</p>
<pre class="r"><code>library(simstudy)
# define the data
defB <- defData(varname = "L", formula =0.27,
dist = "binary")
defB <- defData(defB, varname = "Y0", formula = "-2.5 + 1.75*L",
dist = "binary", link = "logit")
defB <- defData(defB, varname = "Y1", formula = "-1.5 + 1.75*L",
dist = "binary", link = "logit")
defB <- defData(defB, varname = "A", formula = "0.315 + 0.352 * L",
dist = "binary")
defB <- defData(defB, varname = "Y", formula = "Y0 + A * (Y1 - Y0)",
dist = "nonrandom")
# generate the data
set.seed(2002)
dtB <- genData(200000, defB)
dtB[1:6]</code></pre>
<pre><code>## id L Y0 Y1 A Y
## 1: 1 0 0 0 0 0
## 2: 2 0 0 0 0 0
## 3: 3 1 0 1 1 1
## 4: 4 0 1 1 1 1
## 5: 5 1 0 0 1 0
## 6: 6 1 0 0 0 0</code></pre>
<p>We can look directly at the potential outcomes to see the true causal effect, measured as a log odds ratio (LOR):</p>
<pre class="r"><code>odds <- function (p) {
return((p/(1 - p)))
}
# log odds ratio for entire sample (marginal LOR)
dtB[, log( odds( mean(Y1) ) / odds( mean(Y0) ) )]</code></pre>
<pre><code>## [1] 0.8651611</code></pre>
<p>In the linear regression context, the conditional effect measured using observed data from the exposed and unexposed subgroups was in fact a good estimate of the marginal effect in the population. Not the case here, as the conditional causal effect (LOR) of A is 1.0, which is greater than the true marginal effect of 0.86:</p>
<pre class="r"><code>library(broom)
tidy(glm(Y ~ A + L , data = dtB, family="binomial")) </code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -2.4895846 0.01053398 -236.33836 0
## 2 A 0.9947154 0.01268904 78.39167 0
## 3 L 1.7411358 0.01249180 139.38225 0</code></pre>
<p>This regression estimate for the coefficient of <span class="math inline">\(A\)</span> <em>is</em> a good estimate of the conditional effect in the population (based on the potential outcomes at each level of <span class="math inline">\(L\)</span>):</p>
<pre class="r"><code>dtB[, .(LOR = log( odds( mean(Y1) ) / odds( mean(Y0) ) ) ), keyby = L]</code></pre>
<pre><code>## L LOR
## 1: 0 0.9842565
## 2: 1 0.9865561</code></pre>
<p>Of course, ignoring the confounder <span class="math inline">\(L\)</span> is not very useful if we are interested in recovering the marginal effect. The estimate of 1.4 is biased for <em>both</em> the conditional effect <em>and</em> the marginal effect - it is not really useful for anything:</p>
<pre class="r"><code>tidy(glm(Y ~ A , data = dtB, family="binomial"))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -2.049994 0.009164085 -223.6987 0
## 2 A 1.433094 0.011723767 122.2384 0</code></pre>
</div>
<div id="how-weighting-reshapes-the-data" class="section level3">
<h3>How weighting reshapes the data …</h3>
<p>Here is a simple tree graph that shows the potential outcomes for 1000 individuals (based on the same distributions we’ve been using in our simulation). For 27% of the individuals, <span class="math inline">\(L=1\)</span>, for 73% <span class="math inline">\(L=0\)</span>. Each individual has a potential outcome under each level of treatment <span class="math inline">\(A\)</span>. So, that is why there are 730 individuals with <span class="math inline">\(L=0\)</span> who are both with and without treatment. Likewise each treatment arm for those with <span class="math inline">\(L=0\)</span> has 270 individuals. We are not double counting.</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-ipw/PO_flow_large.png" />
</div>
<p>Both the marginal and conditional estimates that we estimated before using the simulated data can be calculated directly using information from this tree. The conditional effects on the log-odds scale can be calculated as …</p>
<p><span class="math display">\[LOR_{A=1 \textbf{ vs } A=0|L = 0} = log \left (\frac{0.182/0.818}{0.076/0.924} \right)=log(2.705) = 0.995\]</span></p>
<p>and</p>
<p><span class="math display">\[LOR_{A=1 \textbf{ vs } A=0|L = 1} = log \left (\frac{0.562/0.438}{0.324/0.676} \right)=log(2.677) = 0.984\]</span></p>
<p>The marginal effect on the log odds scale is estimated marginal probabilities: <span class="math inline">\(P(Y=1|A=0)\)</span> and <span class="math inline">\(P(Y=1|A=1)\)</span>. Again, we can take this right from the tree …</p>
<p><span class="math display">\[P(Y=1|A=0) = 0.73\times0.076 + 0.27\times0.324 = 0.143\]</span> and</p>
<p><span class="math display">\[P(Y=1|A=1) = 0.73\times0.182 + 0.27\times0.562 = 0.285\]</span></p>
<p>Based on these average outcomes (probabilities) by exposure, the marginal log-odds for the sample is:</p>
<p><span class="math display">\[LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.285/0.715}{0.143/0.857} \right)=log(2.389) = 0.871\]</span></p>
<p>Back in the real world of observed data, this is what the tree diagram looks like:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-ipw/Obs_flow_large.png" />
</div>
<p>This tree tells us that the probability of exposure <span class="math inline">\(A=1\)</span> is different depending upon that value of <span class="math inline">\(L\)</span>. For <span class="math inline">\(L=1\)</span>, <span class="math inline">\(P(A=1) = 230/730 = 0.315\)</span> and for <span class="math inline">\(L=0\)</span>, <span class="math inline">\(P(A=1) = 180/270 = 0.667\)</span>. Because of this disparity, the crude estimate of the effect (ignoring <span class="math inline">\(L\)</span>) is biased for the marginal causal effect:</p>
<p><span class="math display">\[P(Y=1|A=0) = \frac{500\times0.076 + 90\times0.324}{500+90}=0.114\]</span></p>
<p>and</p>
<p><span class="math display">\[P(Y=1|A=1) = \frac{230\times0.182 + 180\times0.562}{230+180}=0.349\]</span></p>
<p>The crude log odds ratio is</p>
<p><span class="math display">\[LOR_{A=1 \textbf{ vs } A=0} = log \left (\frac{0.349/0.651}{0.114/0.886} \right)=log(4.170) = 1.420\]</span></p>
<p>And now we finally get to the weights. As mentioned in the prior post, the IPW is based on the probability of the actual exposure at each level of <span class="math inline">\(L\)</span>: <span class="math inline">\(P(A=a | L)\)</span>, where <span class="math inline">\(a\in(0,1)\)</span> (and not on <span class="math inline">\(P(A=1|L)\)</span>, the propensity score). Here are the simple weights for each group:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-ipw/Weights.png" />
</div>
<p>If we apply the weights to each of the respective groups, you can see that the number of individuals in each treatment arm is adjusted to the total number of individuals in the sub-group defined the level of <span class="math inline">\(L\)</span>. For example, if we apply the weight of 3.17 (730/230) to each person observed with <span class="math inline">\(L=0\)</span> and <span class="math inline">\(A=1\)</span>, we end up with <span class="math inline">\(230\times3.17=730\)</span>. Applying each of the respective weights to the subgroups of <span class="math inline">\(L\)</span> and <span class="math inline">\(A\)</span> results in a new sample of individuals that looks exactly like the one we started out with in the potential outcomes world:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-ipw/PO_flow_large.png" />
</div>
<p>This all works only if we make these two assumptions: <span class="math display">\[P(Y=1|A=0, L=l) = P(Y_0=1 | A=1, L=l)\]</span> and <span class="math display">\[P(Y=1|A=1, L=l) = P(Y_1=1 | A=0, L=l)\]</span></p>
<p>That is, we can make this claim <em>only under the assumption of no unmeasured confounding</em>. (This was discussed in the <a href="https://www.rdatagen.net/post/potential-outcomes-confounding/">Part 1</a> post.)</p>
</div>
<div id="applying-ipw-to-our-data" class="section level3">
<h3>Applying IPW to our data</h3>
<p>We need to estimate the weights using logistic regression (though other, more flexible methods, can also be used). First, we estimate <span class="math inline">\(P(A=1|L)\)</span> …</p>
<pre class="r"><code>exposureModel <- glm(A ~ L, data = dtB, family = "binomial")
dtB[, pA := predict(exposureModel, type = "response")]</code></pre>
<p>Now we can derive an estimate for <span class="math inline">\(P(A=a|L=l)\)</span> and get the weight itself…</p>
<pre class="r"><code># Define two new columns
defB2 <- defDataAdd(varname = "pA_actual",
formula = "(A * pA) + ((1 - A) * (1 - pA))",
dist = "nonrandom")
defB2 <- defDataAdd(defB2, varname = "IPW",
formula = "1/pA_actual",
dist = "nonrandom")
# Add weights
dtB <- addColumns(defB2, dtB)
dtB[1:6]</code></pre>
<pre><code>## id L Y0 Y1 A Y pA pA_actual IPW
## 1: 1 0 0 0 0 0 0.3146009 0.6853991 1.459004
## 2: 2 0 0 0 0 0 0.3146009 0.6853991 1.459004
## 3: 3 1 0 1 1 1 0.6682351 0.6682351 1.496479
## 4: 4 0 1 1 1 1 0.3146009 0.3146009 3.178631
## 5: 5 1 0 0 1 0 0.6682351 0.6682351 1.496479
## 6: 6 1 0 0 0 0 0.6682351 0.3317649 3.014183</code></pre>
<p>To estimate the marginal effect on the log-odds scale, we use the function <code>glm</code> with weights specified by IPW. The true value of marginal effect (based on the population-wide potential outcomes) was 0.87 (as we estimated from the potential outcomes directly and from the tree graph). Our estimate here is spot on (but with such a large sample size, this is not so surprising):</p>
<pre class="r"><code>tidy(glm(Y ~ A , data = dtB, family="binomial", weights = IPW)) </code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -1.7879512 0.006381189 -280.1909 0
## 2 A 0.8743154 0.008074115 108.2862 0</code></pre>
<p>It may not seem like a big deal to be able to estimate the marginal effect - we may actually not be interested in it. However, in the next post, I will touch on the issue of estimating causal effects when there are repeated exposures (for example, administering a drug over time) and time dependent confounders that are both affected by prior exposures and affect future exposures <em>and</em> affect the outcome. Under this scenario, it is very difficult if not impossible to control for these confounders - the best we might be able to do is estimate a marginal, population-wide causal effect. That is where weighting will be really useful.</p>
</div>
Characterizing the variance for clustered data that are Gamma distributed
https://www.rdatagen.net/post/icc-for-gamma-distribution/
Mon, 27 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/icc-for-gamma-distribution/<p>Way back when I was studying algebra and wrestling with one word problem after another (I think now they call them story problems), I complained to my father. He laughed and told me to get used to it. “Life is one big word problem,” is how he put it. Well, maybe one could say any statistical analysis is really just some form of multilevel data analysis, whether we treat it that way or not.</p>
<p>A key feature of the multilevel model is the ability to explicitly untangle the variation that occurs at different levels. Variation of individuals within a sub-group, variation across sub-groups, variation across groups of sub-groups, and so on. The intra-class coefficient (ICC) is one summarizing statistic that attempts to characterize the <em>relative</em> variability across the different levels.</p>
<p>The amount of clustering as measured by the ICC has implications for study design, because it communicates how much information is available at different levels of the hierarchy. We may have thousands of individuals that fall into ten or twenty clusters, and think we have a lot of information. But if most of the variation is at the cluster/group level (and not across individuals within a cluster), we don’t have thousands of observations, but more like ten or twenty. This has important implications for our measures of uncertainty.</p>
<p>Recently, a researcher was trying to use <code>simstudy</code> to generate cost and quality-of-life measurements to simulate clustered data for a cost-effectiveness analysis. (They wanted the cost and quality measurements to correlate within individuals, but I am going to ignore that aspect here.) Cost data are typically <em>right skewed</em> with most values falling on the lower end, but with some extremely high values on the upper end. (These dollar values cannot be negative.)</p>
<p>Because of this characteristic shape, cost data are often modeled using a Gamma distribution. The challenge here was that in simulating the data, the researcher wanted to control the group level variation relative to the individual-level variation. If the data were normally distributed, it would be natural to talk about that control in terms of the ICC. But, with the Gamma distribution, it is not as obvious how to partition the variation.</p>
<p>As most of my posts do, this one provides simulation and plots to illuminate some of these issues.</p>
<div id="gamma-distribtution" class="section level3">
<h3>Gamma distribtution</h3>
<p>The Gamma distribution is a continuous probability distribution that includes all non-negative numbers. The probability density function is typically written as a function of two parameters - the shape <span class="math inline">\(\alpha\)</span> and the rate <span class="math inline">\(\beta\)</span>:</p>
<p><span class="math display">\[f(x) = \frac{\beta ^ \alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x},\]</span></p>
<p>with <span class="math inline">\(\text{E}(x) = \alpha / \beta\)</span>, and <span class="math inline">\(\text{Var}(x)=\alpha / \beta^2\)</span>. <span class="math inline">\(\Gamma(.)\)</span> is the continuous Gamma function, which lends its name to the distribution. (When <span class="math inline">\(\alpha\)</span> is a positive integer, <span class="math inline">\(\Gamma(\alpha)=(\alpha - 1 )!\)</span>) In <code>simstudy</code>, I decided to parameterize the pdf using <span class="math inline">\(\mu\)</span> to represent the mean and a dispersion parameter <span class="math inline">\(\nu\)</span>, where <span class="math inline">\(\text{Var}(x) = \nu\mu^2\)</span>. In this parameterization, shape <span class="math inline">\(\alpha = \frac{1}{\nu}\)</span> and rate <span class="math inline">\(\beta = \frac{1}{\nu\mu}\)</span>. (There is a simstudy function <code>gammaGetShapeRate</code> that maps <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\nu\)</span> to <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>.) With this parameterization, it is clear that the variance of a Gamma distributed random variable is a function of the (square) of the mean.</p>
<p>Simulating data gives a sense of the shape of the distribution and also makes clear that the variance depends on the mean (which is not the case for the normal distribution):</p>
<pre class="r"><code>mu <- 20
nu <- 1.2
# theoretical mean and variance
c(mean = mu, variance = mu^2 * nu) </code></pre>
<pre><code>## mean variance
## 20 480</code></pre>
<pre class="r"><code>library(simstudy)
(ab <- gammaGetShapeRate(mu, nu))</code></pre>
<pre><code>## $shape
## [1] 0.8333333
##
## $rate
## [1] 0.04166667</code></pre>
<pre class="r"><code># simulate data using R function
set.seed(1)
g.rfunc <- rgamma(100000, ab$shape, ab$rate)
round(c(mean(g.rfunc), var(g.rfunc)), 2)</code></pre>
<pre><code>## [1] 19.97 479.52</code></pre>
<pre class="r"><code># simulate data using simstudy function - no difference
set.seed(1)
defg <- defData(varname = "g.sim", formula = mu, variance = nu,
dist = "gamma")
dt.g1 <- genData(100000, defg)
dt.g1[, .(round(mean(g.sim),2), round(var(g.sim),2))]</code></pre>
<pre><code>## V1 V2
## 1: 19.97 479.52</code></pre>
<pre class="r"><code># doubling dispersion factor
defg <- updateDef(defg, changevar = "g.sim", newvariance = nu * 2)
dt.g0 <- genData(100000, defg)
dt.g0[, .(round(mean(g.sim),2), round(var(g.sim),2))]</code></pre>
<pre><code>## V1 V2
## 1: 20.09 983.01</code></pre>
<pre class="r"><code># halving dispersion factor
defg <- updateDef(defg, changevar = "g.sim", newvariance = nu * 0.5)
dt.g2 <- genData(100000, defg)
dt.g2[, .(round(mean(g.sim),2), round(var(g.sim),2))]</code></pre>
<pre><code>## V1 V2
## 1: 19.98 240.16</code></pre>
<p>Generating data sets with the same mean but decreasing levels of dispersion makes it appear as if the distribution is “moving” to the right: the peak shifts to the right and variance decreases …</p>
<pre class="r"><code>library(ggplot2)
dt.g0[, nugrp := 0]
dt.g1[, nugrp := 1]
dt.g2[, nugrp := 2]
dt.g <- rbind(dt.g0, dt.g1, dt.g2)
ggplot(data = dt.g, aes(x=g.sim), group = nugrp) +
geom_density(aes(fill=factor(nugrp)), alpha = .5) +
scale_fill_manual(values = c("#226ab2","#b22222","#22b26a"),
labels = c(nu*2, nu, nu*0.5),
name = bquote(nu)) +
scale_y_continuous(limits = c(0, 0.10)) +
scale_x_continuous(limits = c(0, 100)) +
theme(panel.grid.minor = element_blank()) +
ggtitle(paste0("Varying dispersion with mean = ", mu))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Conversely, generating data with constant dispersion but increasing the mean does not shift the location but makes the distribution appear less “peaked”. In this case, variance increases with higher means (we can see that longer tails are associated with higher means) …</p>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
<div id="icc-for-clustered-data-where-within-group-observations-have-a-gaussian-normal-distribution" class="section level3">
<h3>ICC for clustered data where within-group observations have a Gaussian (normal) distribution</h3>
<p>In a 2-level world, with multiple groups each containing individuals, a normally distributed continuous outcome can be described by this simple model: <span class="math display">\[Y_{ij} = \mu + a_j + e_{ij},\]</span> where <span class="math inline">\(Y_{ij}\)</span> is the outcome for individual <span class="math inline">\(i\)</span> who is a member of group <span class="math inline">\(j\)</span>. <span class="math inline">\(\mu\)</span> is the average across all groups and individuals. <span class="math inline">\(a_j\)</span> is the group level effect and is typically assumed to be normally distributed as <span class="math inline">\(N(0, \sigma^2_a)\)</span>, and <span class="math inline">\(e_{ij}\)</span> is the individual level effect that is <span class="math inline">\(N(0, \sigma^2_e)\)</span>. The variance of <span class="math inline">\(Y_{ij}\)</span> is <span class="math inline">\(\text{Var}(a_j + e_{ij}) = \text{Var}(a_j) + \text{Var}(e_{ij}) = \sigma^2_a + \sigma^2_e\)</span>. The ICC is the proportion of total variation of <span class="math inline">\(Y\)</span> explained by the group variation: <span class="math display">\[ICC = \frac{\sigma^2_a}{\sigma^2_a+\sigma^2_e}\]</span> If individual level variation is relatively low or variation across groups is relatively high, then the ICC will be higher. Conversely, higher individual variation or lower variation between groups implies a smaller ICC.</p>
<p>Here is a simulation of data for 50 groups, where each group has 250 individuals. The ICC is 0.10:</p>
<pre class="r"><code># define the group level data
defgrp <- defData(varname = "a", formula = 0,
variance = 2.8, dist = "normal", id = "cid")
defgrp <- defData(defgrp, varname = "n", formula = 250,
dist = "nonrandom")
# define the individual level data
defind <- defDataAdd(varname = "ynorm", formula = "30 + a",
variance = 25.2, dist = "normal")
# generate the group and individual level data
set.seed(3017)
dt <- genData(50, defgrp)
dc <- genCluster(dt, "cid", "n", "id")
dc <- addColumns(defind, dc)
dc</code></pre>
<pre><code>## cid a n id ynorm
## 1: 1 -2.133488 250 1 30.78689
## 2: 1 -2.133488 250 2 25.48245
## 3: 1 -2.133488 250 3 22.48975
## 4: 1 -2.133488 250 4 30.61370
## 5: 1 -2.133488 250 5 22.51571
## ---
## 12496: 50 -1.294690 250 12496 25.26879
## 12497: 50 -1.294690 250 12497 27.12190
## 12498: 50 -1.294690 250 12498 34.82744
## 12499: 50 -1.294690 250 12499 27.93607
## 12500: 50 -1.294690 250 12500 32.33438</code></pre>
<pre class="r"><code># mean Y by group
davg <- dc[, .(avgy = mean(ynorm)), keyby = cid]
# variance of group means
(between.var <- davg[, var(avgy)])</code></pre>
<pre><code>## [1] 2.70381</code></pre>
<pre class="r"><code># overall (marginal) mean and var of Y
gavg <- dc[, mean(ynorm)]
gvar <- dc[, var(ynorm)]
# individual variance within each group
dvar <- dc[, .(vary = var(ynorm)), keyby = cid]
(within.var <- dvar[, mean(vary)])</code></pre>
<pre><code>## [1] 25.08481</code></pre>
<pre class="r"><code># estimate of ICC
(ICCest <- between.var/(between.var + within.var))</code></pre>
<pre><code>## [1] 0.09729918</code></pre>
<pre class="r"><code>ggplot(data=dc, aes(y = ynorm, x = factor(cid))) +
geom_jitter(size = .5, color = "grey50", width = 0.2) +
geom_point(data = davg, aes(y = avgy, x = factor(cid)),
shape = 21, fill = "firebrick3", size = 3) +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title = element_text(size = 14)
) +
xlab("Group") +
scale_y_continuous(limits = c(0, 60), name = "Measure") +
ggtitle(bquote("ICC:" ~ .(round(ICCest, 2)) ~
(sigma[a]^2 == .(round(between.var, 1)) ~ "," ~
sigma[e]^2 == .(round(within.var, 1)))
)) </code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-4-1.png" width="960" /></p>
<p>Here is a plot of data generated using the same overall variance of 28, but based on a much higher ICC of 0.80. Almost all of the variation in the data is driven by the clusters rather than the individuals. This has implications for a study, because (in contrast to the first data set generated above) the individual-level data is not providing as much information or insight into the variation of <span class="math inline">\(Y\)</span>. The most useful information (from this extreme example) can be derived from the difference between the groups (so we really have more like 50 data points rather than 125K).</p>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-6-1.png" width="960" /></p>
<p>Of course, if we look at the individual-level data for each of the two data sets while ignoring the group membership, the two data sets are indistinguishable. That is, the marginal (or population level) distributions are both normally distributed with mean 30 and variance 28:</p>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="icc-for-clustered-data-with-gamma-distribution" class="section level3">
<h3>ICC for clustered data with Gamma distribution</h3>
<p>Now, back to the original question … how do we think about the ICC with clustered data that is Gamma distributed? The model (and data generating process) for these type of data can be described as:</p>
<p><span class="math display">\[Y_{ij} \sim \text{gamma}(\mu_{j}, \nu),\]</span> where <span class="math inline">\(\text{E}(Y_{j}) = \mu_j\)</span> and <span class="math inline">\(\text{Var}(Y_j) = \nu\mu_j^2\)</span>. In addition, the mean of each group is often modeled as:</p>
<p><span class="math display">\[\text{log}(\mu_j) = \beta + a_j,\]</span> where <span class="math inline">\(\beta\)</span> is log of the mean for the group whose group effect is 0, and <span class="math inline">\(a_j \sim N(0, \sigma^2_a)\)</span>. So, the group means are normally distributed on the log scale (or are lognormal) with variance <span class="math inline">\(\sigma^2_a\)</span>. (Although the individual observations within each cluster are Gamma-distributed, the means of the groups are not themselves Gamma-distributed.)</p>
<p>But what is the within group (individual) variation, which <em>is</em> Gamma-distributed? It is not so clear, as the variance within each group depends on both the group mean <span class="math inline">\(\mu_j\)</span> and the dispersion factor <span class="math inline">\(\nu\)</span>. A <a href="https://www.biorxiv.org/content/early/2016/12/21/095851">paper</a> by Nakagawa <em>et al</em> shows that <span class="math inline">\(\sigma^2_e\)</span> on the log scale is also lognormal and can be estimated using the trigamma function (the 2nd derivative of the gamma function) of the dispersion factor. So, the ICC of clustered Gamma observations can be defined on the the log scale:</p>
<p><span class="math display">\[\text{ICC}_\text{gamma-log} = \frac{\sigma^2_a}{\sigma^2_a + \psi_1 \left( \frac{1}{\nu}\right)}\]</span> <span class="math inline">\(\psi_1\)</span> is the <em>trigamma</em> function. I’m quoting from the paper here: “the variance of a gamma-distributed variable on the log scale is equal to <span class="math inline">\(\psi_1 (\frac{1}{\nu})\)</span>, where <span class="math inline">\(\frac{1}{\nu}\)</span> is the shape parameter of the gamma distribution and hence <span class="math inline">\(\sigma^2_e\)</span> is <span class="math inline">\(\psi_1 (\frac{1}{\nu})\)</span>.” (The formula I have written here is slightly different, as I define the dispersion factor as the reciprocal of the the dispersion factor used in the paper.)</p>
<pre class="r"><code>sigma2a <- 0.8
nuval <- 2.5
(sigma2e <- trigamma(1/nuval))</code></pre>
<pre><code>## [1] 7.275357</code></pre>
<pre class="r"><code># Theoretical ICC on log scale
(ICC <- sigma2a/(sigma2a + sigma2e))</code></pre>
<pre><code>## [1] 0.09906683</code></pre>
<pre class="r"><code># generate clustered gamma data
def <- defData(varname = "a", formula = 0, variance = sigma2a,
dist = "normal")
def <- defData(def, varname = "n", formula = 250, dist = "nonrandom")
defc <- defDataAdd(varname = "g", formula = "2 + a",
variance = nuval, dist = "gamma", link = "log")
dt <- genData(1000, def)
dc <- genCluster(dt, "id", "n", "id1")
dc <- addColumns(defc, dc)
dc</code></pre>
<pre><code>## id a n id1 g
## 1: 1 -0.5127477 250 1 0.21708707
## 2: 1 -0.5127477 250 2 2.28008091
## 3: 1 -0.5127477 250 3 3.00000226
## 4: 1 -0.5127477 250 4 0.01637102
## 5: 1 -0.5127477 250 5 0.92374322
## ---
## 249996: 1000 1.6738901 250 249996 0.10547825
## 249997: 1000 1.6738901 250 249997 0.08740397
## 249998: 1000 1.6738901 250 249998 1.24423215
## 249999: 1000 1.6738901 250 249999 41.52665306
## 250000: 1000 1.6738901 250 250000 89.10427825</code></pre>
<p>Here is an estimation of the ICC on the log scale using the raw data …</p>
<pre class="r"><code>dc[, lg := log(g)]
davg <- dc[, .(avgg = mean(lg)), keyby = id]
(between <- davg[, var(avgg)])</code></pre>
<pre><code>## [1] 0.8581585</code></pre>
<pre class="r"><code>dvar <- dc[, .(varg = var(lg)), keyby = id]
(within <- dvar[, mean(varg)])</code></pre>
<pre><code>## [1] 7.202411</code></pre>
<pre class="r"><code>(ICCest <- between/(between + within))</code></pre>
<pre><code>## [1] 0.1064638</code></pre>
<p>Here is an estimation of the ICC (on the log scale) based on the estimated variance of the random effects using a generalized mixed effects model. The between-group variance is a ratio of the intercept variance and the residual variance. An estimate of <span class="math inline">\(\nu\)</span> is just the residual variance …</p>
<pre class="r"><code>library(lme4)
glmerfit <- glmer(g ~ 1 + (1|id),
family = Gamma(link="log"), data= dc)
summary(glmerfit)</code></pre>
<pre><code>## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: Gamma ( log )
## Formula: g ~ 1 + (1 | id)
## Data: dc
##
## AIC BIC logLik deviance df.resid
## 1317249.8 1317281.1 -658621.9 1317243.8 249997
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -0.6392 -0.6008 -0.4058 0.1744 13.9665
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 2.001 1.414
## Residual 2.448 1.564
## Number of obs: 250000, groups: id, 1000
##
## Fixed effects:
## Estimate Std. Error t value Pr(>|z|)
## (Intercept) 2.0099 0.0287 70.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<pre class="r"><code>estnu <- as.data.table(VarCorr(glmerfit))[2,4]
estsig <- as.data.table(VarCorr(glmerfit))[1,4] / estnu
estsig/(estsig + trigamma(1/estnu))</code></pre>
<pre><code>## vcov
## 1: 0.1044648</code></pre>
<p>Finally, here are some plots of the generated observations and the group means on the log scale. The plots in each row have the same ICC but different underlying mean and dispersion parameters. I find these plots interesting because looking across the columns or up and down the two rows, they provide some insight to the interplay of group means and dispersion on the ICC …</p>
<p><img src="https://www.rdatagen.net/post/2017-11-27-icc-for-clustered-data-that-happen-to-have-a-gamma-distribution_files/figure-html/unnamed-chunk-12-1.png" width="960" /></p>
</div>
Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects
https://www.rdatagen.net/post/potential-outcomes-confounding/
Thu, 16 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/potential-outcomes-confounding/<p>When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias.</p>
<div id="confounding-potential-outcomes-and-causal-effects" class="section level3">
<h3>Confounding, potential outcomes, and causal effects</h3>
<p>Typically, we think of a confounder as a factor that influences <em>both</em> exposure <em>and</em> outcome. If we ignore the confounding factor in estimating the effect of an exposure, we can easily over- or underestimate the size of the effect due to the exposure. If sicker patients are more likely than healthier patients to take a particular drug, the relatively poor outcomes of those who took the drug may be due to the initial health status rather than the drug.</p>
<p>A slightly different view of confounding is tied to the more conceptual framework of potential outcomes, which I <a href="https://www.rdatagen.net/post/be-careful/">wrote</a> a bit about earlier. A potential outcome is the outcome we <em>would</em> observe <em>if</em> an individual were subjected to a particular exposure. We may or may not observe the potential outcome - this depends on the actual exposure. (To simplify things here, I will assume we are interested only in two different exposures.) <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span> represent the potential outcomes for an individual with and without exposure, respectively. We observe <span class="math inline">\(Y_0\)</span> if the individual is not exposed, and <span class="math inline">\(Y_1\)</span> if she is.</p>
<p>The causal effect of the exposure for the individual <span class="math inline">\(i\)</span> can be defined as <span class="math inline">\(Y_{1i} - Y_{0i}\)</span>. If we can observe each individual in both states (with and without the exposure) long enough to measure the outcome <span class="math inline">\(Y\)</span>, we are observing both potential outcomes and can measure the causal effect for each individual. Averaging across all individuals in the sample provides an estimate the population average causal effect. (Think of a crossover or N-of-1 study.)</p>
<p>Unfortunately, in the real world, it is rarely feasible to expose an individual to multiple conditions. Instead, we use one group as a proxy for the other. For example, the control group represents what would have happened to the exposed group had the exposed group not been exposed. This approach only makes sense if the control group is identical in every way to the exposure group (except for the exposure, of course.)</p>
<p>Our goal is to compare the distribution of outcomes for the control group with the exposed group. We often simplify this comparison by looking at the means of each distribution. The average causal effect (across all individuals) can be written as <span class="math inline">\(E(Y_1 - Y_0)\)</span>, where <span class="math inline">\(E()\)</span> is the expectation or average. In reality, we cannot directly measure this since only one potential outcome is observed for each individual.</p>
<p>Using the following logic, we might be able to convince ourselves that we can use <em>observed</em> measurements to estimate unobservable average causal effects. First, we can say <span class="math inline">\(E(Y_1 - Y_0) = E(Y_1) - E(Y_0)\)</span>, because expectation is linear. Next, it seems fairly reasonable to say that <span class="math inline">\(E(Y_1 | A = 1) = E(Y | A = 1)\)</span>, where <span class="math inline">\(A=1\)</span> for exposure, <span class="math inline">\(A=0\)</span> for control. In words, this states that the average <strong>potential outcome of exposure</strong> for the <strong><em>exposed group</em></strong> is the same as what we actually <strong>observe</strong> for the <strong><em>exposed group</em></strong> (this is the consistency assumption in causal inference theory). Along the same lines, <span class="math inline">\(E(Y_0 | A = 0) = E(Y | A = 0)\)</span>. Finally, <em>if</em> we can say that <span class="math inline">\(E(Y_1) = E(Y_1 | A = 1)\)</span> - the potential outcome of exposure for <strong><em>everyone</em></strong> is equal to the potential outcome of exposure for those <strong><em>exposed</em></strong> - then we can say that <span class="math inline">\(E(Y_1) = E(Y | A = 1)\)</span> (the potential outcome with exposure for <strong><em>everyone</em></strong> is the same as the observed outcome for <strong><em>the exposed</em></strong>. Similarly, we can make the same argument to conclude that <span class="math inline">\(E(Y_0) = E(Y | A = 0)\)</span>. At the end of this train of logic, we conclude that we can estimate <span class="math inline">\(E(Y_1 - Y_0)\)</span> using observed data only: <span class="math inline">\(E(Y | A = 1) - E(Y | A = 0)\)</span>.</p>
<p>This nice logic fails if <span class="math inline">\(E(Y_1) \ne E(Y | A = 1)\)</span> and/or <span class="math inline">\(E(Y_0) \ne E(Y | A = 0)\)</span>. That is, this nice logic fails when there is confounding.</p>
<p>This is all a very long-winded way of saying that confounding arises when the distributions of potential outcomes <strong><em>for the population</em></strong> are different from those distributions for <strong><em>the subgroups</em></strong> we are using for analysis. For example, if the potential outcome under exposure for the population as a whole (<span class="math inline">\(Y_1\)</span>) differs from the observed outcome for the subgroup that was exposed (<span class="math inline">\(Y|A=1\)</span>), or the potential outcome without exposure for the entire population (<span class="math inline">\(Y_0\)</span>) differs from the observed outcome for the subgroup that was not exposed (<span class="math inline">\(Y|A=0\)</span>), any estimates of population level causal effects using observed data will be biased.</p>
<p>However, if we can find a factor <span class="math inline">\(L\)</span> (or factors) where</p>
<p><span class="math display">\[ \begin{aligned}
P(Y_1 | L=l) &= P(Y | A = 1 \text{ and } L=l) \\
P(Y_0 | L=l) &= P(Y | A = 0 \text{ and } L=l)
\end{aligned}
\]</span> both hold for all levels or values of <span class="math inline">\(L\)</span>, we can remove confounding (and get unbiased estimates of the causal effect) by “controlling” for <span class="math inline">\(L\)</span>. In some cases, the causal effect we measure will be conditional on <span class="math inline">\(L\)</span>, sometimes it will be a population-wide average (or marginal) causal effect, and sometimes it will be both.</p>
</div>
<div id="what-confounding-looks-like" class="section level3">
<h3>What confounding looks like …</h3>
<p>The easiest way to illustrate the population/subgroup contrast is to generate data from a process that includes confounding. In this first example, the outcome is continuous, and is a function of both the exposure (<span class="math inline">\(A\)</span>) and a covariate (<span class="math inline">\(L\)</span>). For each individual, we can generate both potential outcomes <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span>. (Note that both potential outcomes share the same individual level noise term <span class="math inline">\(e\)</span> - this is not a necessary assumption.) This way, we can “know” the true population, or marginal causal effect of exposure. The observed outcome <span class="math inline">\(Y\)</span> is determined by the exposure status. For the purposes of plotting a smooth density curve, we generate a very large sample - 2 million.</p>
<pre class="r"><code>library(simstudy)
defC <- defData(varname = "e", formula = 0, variance = 2,
dist = "normal")
defC <- defData(defC, varname = "L", formula = 0.4,
dist = "binary")
defC <- defData(defC, varname = "Y0", formula = "1 + 4*L + e",
dist = "nonrandom")
defC <- defData(defC, varname = "Y1", formula = "5 + 4*L + e",
dist = "nonrandom")
defC <- defData(defC, varname = "A", formula = "0.3 + 0.3 * L",
dist = "binary")
defC <- defData(defC, varname = "Y", formula = "1 + 4*A + 4*L + e",
dist = "nonrandom")
set.seed(2017)
dtC <- genData(n = 2000000, defC)
dtC[1:5]</code></pre>
<pre><code>## id e L Y0 Y1 A Y
## 1: 1 2.02826718 1 7.0282672 11.028267 1 11.0282672
## 2: 2 -0.10930734 0 0.8906927 4.890693 0 0.8906927
## 3: 3 1.04529790 0 2.0452979 6.045298 0 2.0452979
## 4: 4 -2.48704266 1 2.5129573 6.512957 1 6.5129573
## 5: 5 -0.09874778 0 0.9012522 4.901252 0 0.9012522</code></pre>
<p>Feel free to skip over this code - I am just including in case anyone finds it useful to see how I generated the following series of annotated density curves:</p>
<pre class="r"><code>library(ggplot2)
getDensity <- function(vector, weights = NULL) {
if (!is.vector(vector)) stop("Not a vector!")
if (is.null(weights)) {
avg <- mean(vector)
} else {
avg <- weighted.mean(vector, weights)
}
close <- min(which(avg < density(vector)$x))
x <- density(vector)$x[close]
if (is.null(weights)) {
y = density(vector)$y[close]
} else {
y = density(vector, weights = weights)$y[close]
}
return(data.table(x = x, y = y))
}
plotDens <- function(dtx, var, xPrefix, title, textL = NULL, weighted = FALSE) {
dt <- copy(dtx)
if (weighted) {
dt[, nIPW := IPW/sum(IPW)]
dMarginal <- getDensity(dt[, get(var)], weights = dt$nIPW)
} else {
dMarginal <- getDensity(dt[, get(var)])
}
d0 <- getDensity(dt[L==0, get(var)])
d1 <- getDensity(dt[L==1, get(var)])
dline <- rbind(d0, dMarginal, d1)
brk <- round(dline$x, 1)
p <- ggplot(aes(x=get(var)), data=dt) +
geom_density(data=dt[L==0], fill = "#ce682f", alpha = .4) +
geom_density(data=dt[L==1], fill = "#96ce2f", alpha = .4)
if (weighted) {
p <- p + geom_density(aes(weight = nIPW),
fill = "#2f46ce", alpha = .8)
} else p <- p + geom_density(fill = "#2f46ce", alpha = .8)
p <- p + geom_segment(data = dline, aes(x = x, xend = x,
y = 0, yend = y),
size = .7, color = "white", lty=3) +
annotate(geom="text", x = 12.5, y = .24,
label = title, size = 5, fontface = 2) +
scale_x_continuous(limits = c(-2, 15),
breaks = brk,
name = paste(xPrefix, var)) +
theme(panel.grid = element_blank(),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 13)
)
if (!is.null(textL)) {
p <- p +
annotate(geom = "text", x = textL[1], y = textL[2],
label = "L=0", size = 4, fontface = 2) +
annotate(geom = "text", x = textL[3], y = textL[4],
label="L=1", size = 4, fontface = 2) +
annotate(geom = "text", x = textL[5], y = textL[6],
label="Population distribution", size = 4, fontface = 2)
}
return(p)
}</code></pre>
<pre class="r"><code>library(gridExtra)
grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Full\npopulation",
c(1, .24, 5, .22, 2.6, .06)),
plotDens(dtC[A==0], "Y", "Observed", "Unexposed\nonly"),
plotDens(dtC, "Y1", "Potential outcome", "Full\npopulation"),
plotDens(dtC[A==1], "Y", "Observed", "Exposed\nonly"),
nrow = 2
)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-16-potential-outcomes-confounding_files/figure-html/unnamed-chunk-3-1.png" width="1152" /></p>
<p>Looking at the various plots, we can see a few interesting things. The density curves on the left represent the entire population. The conditional distributions of the potential outcomes at the population level are all normally distributed, with means that depend on the exposure and covariate <span class="math inline">\(L\)</span>. We can also see that the population-wide distribution of <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span> (in blue) are non-symmetrically shaped, because they are a mixture of the conditional normal distributions, weighted by the proportion of each level of <span class="math inline">\(L\)</span>. Since the proportions for the top and bottom plots are in fact the population proportion, the population-level density curves for <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span> are similarly shaped, with less mass on the higher end, because individuals are less likely to have an <span class="math inline">\(L\)</span> value of 1:</p>
<pre class="r"><code>dtC[, .(propLis1 = mean(L))]</code></pre>
<pre><code>## propLis1
## 1: 0.399822</code></pre>
<p>The shape of the marginal distribution of <span class="math inline">\(Y_1\)</span> is identical to <span class="math inline">\(Y_0\)</span> (in this case, because that is the way I generated the data), but shifted to the right by an amount equal to the causal effect. The conditional effect sizes are 4, as is the population or marginal effect size.</p>
<p>The subgroup plots on the right are a different story. In this case, the distributions of <span class="math inline">\(L\)</span> vary across the exposed and unexposed groups:</p>
<pre class="r"><code>dtC[, .(propLis1 = mean(L)), keyby = A]</code></pre>
<pre><code>## A propLis1
## 1: 0 0.2757937
## 2: 1 0.5711685</code></pre>
<p>So, even though the distributions of (observed) <span class="math inline">\(Y\)</span> conditional on <span class="math inline">\(L\)</span> are identical to their potential outcome counterparts in the whole population - for example, <span class="math inline">\(P(Y | A=0 \text{ and } L = 1) = P(Y_0 | L = 1)\)</span> - the marginal distributions of <span class="math inline">\(Y\)</span> are quite different for the exposed and unexposed. For example, <span class="math inline">\(P(Y | A = 0) \ne P(Y_0)\)</span>. This is directly due to the fact that the mixing weights (the proportions of <span class="math inline">\(L\)</span>) are different for each of the groups. In the unexposed group, about 28% have <span class="math inline">\(L=1\)</span>, but for the exposed group, about 57% do. Using the subgroup data only, the conditional effect sizes are still 4 (comparing mean outcomes <span class="math inline">\(Y\)</span> within each level of <span class="math inline">\(L\)</span>). However the difference in means between the marginal distributions of each subgroup is about 5.2 (calculated by 7.3 - 2.1). This is confounding.</p>
</div>
<div id="no-confounding" class="section level3">
<h3>No confounding</h3>
<p>Just so we can see that when the covariate <span class="math inline">\(L\)</span> has nothing to do with the probability of exposure, the marginal distributions of the subgroups do in fact look like their population-level potential outcome marginal distributions:</p>
<pre class="r"><code>defC <- updateDef(defC, "A", newformula = 0.5) # change data generation
dtC <- genData(n = 2000000, defC)
dtC[, .(propLis1 = mean(L)), keyby = A] # subgroup proportions</code></pre>
<pre><code>## A propLis1
## 1: 0 0.4006499
## 2: 1 0.3987437</code></pre>
<pre class="r"><code>dtC[, .(propLis1 = mean(L))] # population/marginal props</code></pre>
<pre><code>## propLis1
## 1: 0.3996975</code></pre>
<pre class="r"><code>grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population",
c(1, .24, 5, .22, 2.6, .06)),
plotDens(dtC[A==0], "Y", "Observed", "Unexposed"),
plotDens(dtC, "Y1", "Potential outcome", "Population"),
plotDens(dtC[A==1], "Y", "Observed", "Exposed"),
nrow = 2
)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-16-potential-outcomes-confounding_files/figure-html/unnamed-chunk-6-1.png" width="1152" /></p>
</div>
<div id="estimation-of-causal-effects-now-with-confounding" class="section level3">
<h3>Estimation of causal effects (now with confounding)</h3>
<p>Generating a smaller data set, we estimate the causal effects using simple calculations and linear regression:</p>
<pre class="r"><code>library(broom)
# change back to confounding
defC <- updateDef(defC, "A", newformula = ".3 + .3 * L")
dtC <- genData(2500, defC)</code></pre>
<p>The true average (marginal) causal effect from the average difference in potential outcomes for the entire population:</p>
<pre class="r"><code>dtC[, mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 4</code></pre>
<p>And the true average causal effects conditional on the covariate <span class="math inline">\(L\)</span>:</p>
<pre class="r"><code>dtC[, mean(Y1 - Y0), keyby = L]</code></pre>
<pre><code>## L V1
## 1: 0 4
## 2: 1 4</code></pre>
<p>If we try to estimate the marginal causal effect by using a regression model that does not include <span class="math inline">\(L\)</span>, we run into problems. The estimate of 5.2 we see below is the same biased estimate we saw in the plot above. This model is reporting the differences of the means (across both levels of <span class="math inline">\(L\)</span>) for the two subgroups, which we know (because we saw) are not the same as the potential outcome distributions in the population due to different proportions of <span class="math inline">\(L\)</span> in each subgroup:</p>
<pre class="r"><code>tidy(lm(Y ~ A, data = dtC))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 2.027132 0.06012997 33.71251 1.116211e-205
## 2 A 5.241004 0.09386145 55.83766 0.000000e+00</code></pre>
<p>If we estimate a model that conditions on <span class="math inline">\(L\)</span>, the estimates are on target because in the context of normal linear regression without interaction terms, conditional effects are the same as marginal effects (when confounding has been removed, or think of the comparisons being made within the orange groups and green groups in the fist set of plots above, not within the purple groups):</p>
<pre class="r"><code>tidy(lm(Y ~ A + L , data = dtC))</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.9178849 0.03936553 23.31697 5.809202e-109
## 2 A 4.0968358 0.05835709 70.20288 0.000000e+00
## 3 L 3.9589109 0.05862583 67.52844 0.000000e+00</code></pre>
</div>
<div id="inverse-probability-weighting-ipw" class="section level3">
<h3>Inverse probability weighting (IPW)</h3>
<p>What follows briefly here is just a sneak preview of IPW (without any real explanation), which is one way to recover the marginal mean using observed data with confounding. For now, I am ignoring the question of why you might be interested in knowing the marginal effect when the conditional effect estimate provides the same information. Suffice it to say that the conditional effect is <em>not</em> always the same as the marginal effect (think of data generating processes that include interactions or non-linear relationships), and sometimes the marginal effect estimate may the best that we can do, or at least that we can do easily.</p>
<p>If we weight each individual observation by the inverse probability of exposure, we can remove confounding and estimate the <em>marginal</em> effect of exposure on the outcome. Here is a quick simulation example.</p>
<p>After generating the dataset (the same large one we started out with so you can compare) we estimate the probability of exposure <span class="math inline">\(P(A=1 | L)\)</span>, assuming that we know the correct exposure model. This is definitely a questionable assumption, but in this case, we actually do. Once the model has been fit, we assign the predicted probability to each individual based on her value of <span class="math inline">\(L\)</span>.</p>
<pre class="r"><code>set.seed(2017)
dtC <- genData(2000000, defC)
exposureModel <- glm(A ~ L, data = dtC, family = "binomial")
tidy(exposureModel)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -0.847190 0.001991708 -425.3584 0
## 2 L 1.252043 0.003029343 413.3053 0</code></pre>
<pre class="r"><code>dtC[, pA := predict(exposureModel, type = "response")]</code></pre>
<p>The IPW is <em>not</em> based exactly on <span class="math inline">\(P(A=1 | L)\)</span> (which is commonly used in propensity score analysis), but rather, the probability of the actual exposure at each level of <span class="math inline">\(L\)</span>: <span class="math inline">\(P(A=a | L)\)</span>, where <span class="math inline">\(a\in(0,1)\)</span>:</p>
<pre class="r"><code># Define two new columns
defC2 <- defDataAdd(varname = "pA_actual",
formula = "A * pA + (1-A) * (1-pA)",
dist = "nonrandom")
defC2 <- defDataAdd(defC2, varname = "IPW",
formula = "1/pA_actual",
dist = "nonrandom")
# Add weights
dtC <- addColumns(defC2, dtC)
round(dtC[1:5], 2)</code></pre>
<pre><code>## id e L Y0 Y1 A Y pA pA_actual IPW
## 1: 1 2.03 1 7.03 11.03 1 11.03 0.6 0.6 1.67
## 2: 2 -0.11 0 0.89 4.89 0 0.89 0.3 0.7 1.43
## 3: 3 1.05 0 2.05 6.05 0 2.05 0.3 0.7 1.43
## 4: 4 -2.49 1 2.51 6.51 1 6.51 0.6 0.6 1.67
## 5: 5 -0.10 0 0.90 4.90 0 0.90 0.3 0.7 1.43</code></pre>
<p>To estimate the marginal effect on the log-odds scale, we use function <code>lm</code> again, but with weights specified by IPW. The true value of the marginal effect of exposure (based on the population-wide potential outcomes) was 4.0. I know I am repeating myself here, but first I am providing the biased estimate that we get when we ignore covariate <span class="math inline">\(L\)</span> to convince you that the relationship between exposure and outcome is indeed confounded:</p>
<pre class="r"><code>tidy(lm(Y ~ A , data = dtC)) </code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 2.101021 0.002176711 965.2275 0
## 2 A 5.184133 0.003359132 1543.2956 0</code></pre>
<p>And now, with the simple addition of the weights but still <em>not</em> including <span class="math inline">\(L\)</span> in the model, our weighted estimate of the marginal effect is spot on (but with such a large sample size, this is not so surprising):</p>
<pre class="r"><code>tidy(lm(Y ~ A , data = dtC, weights = IPW)) </code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 2.596769 0.002416072 1074.789 0
## 2 A 4.003122 0.003416842 1171.585 0</code></pre>
<p>And finally, here is a plot of the IPW-adjusted density. You might think I am just showing you the plots for the unconfounded data again, but you can see from the code (and I haven’t hidden anything) that I am still using the data set with confounding. In particular, you can see that I am calling the routine <code>plotDens</code> with weights.</p>
<pre class="r"><code>grid.arrange(plotDens(dtC, "Y0", "Potential outcome", "Population",
c(1, .24, 5, .22, 2.6, .06)),
plotDens(dtC[A==0], "Y", "Observed", "Unexposed",
weighted = TRUE),
plotDens(dtC, "Y1", "Potential outcome", "Population"),
plotDens(dtC[A==1], "Y", "Observed", "Exposed",
weighted = TRUE),
nrow = 2
)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-16-potential-outcomes-confounding_files/figure-html/unnamed-chunk-16-1.png" width="1152" /></p>
<p>As I mentioned, I hope to write more on <em>IPW</em>, and <em>marginal structural models</em>, which make good use of this methodology to estimate effects that can be challenging to get a handle on.</p>
</div>
A simstudy update provides an excuse to generate and display Likert-type data
https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/
Tue, 07 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/<p>I just updated <code>simstudy</code> to version 0.1.7. It is available on CRAN.</p>
<p>To mark the occasion, I wanted to highlight a new function, <code>genOrdCat</code>, which puts into practice some code that I presented a little while back as part of a discussion of <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">ordinal logistic regression</a>. The new function was motivated by a reader/researcher who came across my blog while wrestling with a simulation study. After a little back and forth about how to generate ordinal categorical data, I ended up with a function that might be useful. Here’s a little example that uses the <code>likert</code> package, which makes plotting Likert-type easy and attractive.</p>
<div id="defining-the-data" class="section level3">
<h3>Defining the data</h3>
<p>The proportional odds model assumes a baseline distribution of probabilities. In the case of a survey item, this baseline is the probability of responding at a particular level - in this example I assume a range of 1 (strongly disagree) to 4 (strongly agree) - given a value of zero for all of the covariates. In this example, there is a single predictor <span class="math inline">\(x\)</span> that ranges from -0.5 to 0.5. The baseline probabilities of the response variable <span class="math inline">\(r\)</span> will apply in cases where <span class="math inline">\(x = 0\)</span>. In the proportional odds data generating process, the covariates “influence” the response through an additive shift (either positive or negative) on the logistic scale. (If this makes no sense at all, maybe check out my <a href="https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/">earlier post</a> for a little explanation.) Here, this additive shift is represented by the variable <span class="math inline">\(z\)</span>, which is a function of <span class="math inline">\(x\)</span>.</p>
<pre class="r"><code>library(simstudy)
baseprobs<-c(0.40, 0.25, 0.15, 0.20)
def <- defData(varname="x", formula="-0.5;0.5", dist = "uniform")
def <- defData(def, varname = "z", formula = "2*x", dist = "nonrandom")</code></pre>
</div>
<div id="generate-data" class="section level3">
<h3>Generate data</h3>
<p>The ordinal data is generated after a data set has been created with an adjustment variable. We have to provide the data.table name, the name of the adjustment variable, and the base probabilities. That’s really it.</p>
<pre class="r"><code>set.seed(2017)
dx <- genData(2500, def)
dx <- genOrdCat(dx, adjVar = "z", baseprobs, catVar = "r")
dx <- genFactor(dx, "r", c("Strongly disagree", "Disagree",
"Agree", "Strongly agree"))
print(dx)</code></pre>
<pre><code>## id x z r fr
## 1: 1 0.42424261 0.84848522 2 Disagree
## 2: 2 0.03717641 0.07435283 3 Agree
## 3: 3 -0.03080435 -0.06160871 3 Agree
## 4: 4 -0.21137382 -0.42274765 1 Strongly disagree
## 5: 5 0.27008816 0.54017632 1 Strongly disagree
## ---
## 2496: 2496 -0.32250407 -0.64500815 4 Strongly agree
## 2497: 2497 -0.10268875 -0.20537751 2 Disagree
## 2498: 2498 -0.17037112 -0.34074223 2 Disagree
## 2499: 2499 0.14778233 0.29556465 2 Disagree
## 2500: 2500 0.10665252 0.21330504 3 Agree</code></pre>
<p>The expected cumulative log odds when <span class="math inline">\(x=0\)</span> can be calculated from the base probabilities:</p>
<pre class="r"><code>dp <- data.table(baseprobs,
cumProb = cumsum(baseprobs),
cumOdds = cumsum(baseprobs)/(1 - cumsum(baseprobs))
)
dp[, cumLogOdds := log(cumOdds)]
dp</code></pre>
<pre><code>## baseprobs cumProb cumOdds cumLogOdds
## 1: 0.40 0.40 0.6666667 -0.4054651
## 2: 0.25 0.65 1.8571429 0.6190392
## 3: 0.15 0.80 4.0000000 1.3862944
## 4: 0.20 1.00 Inf Inf</code></pre>
<p>If we fit a cumulative odds model (using package <code>ordinal</code>), we recover those cumulative log odds (see the output under the section labeled “Threshold coefficients”). Also, we get an estimate for the coefficient of <span class="math inline">\(x\)</span> (where the true value used to generate the data was 2.00):</p>
<pre class="r"><code>library(ordinal)
model.fit <- clm(fr ~ x, data = dx, link = "logit")
summary(model.fit)</code></pre>
<pre><code>## formula: fr ~ x
## data: dx
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 2500 -3185.75 6379.51 5(0) 3.19e-11 3.3e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## x 2.096 0.134 15.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Strongly disagree|Disagree -0.46572 0.04243 -10.98
## Disagree|Agree 0.60374 0.04312 14.00
## Agree|Strongly agree 1.38954 0.05049 27.52</code></pre>
</div>
<div id="looking-at-the-data" class="section level3">
<h3>Looking at the data</h3>
<p>Below is a plot of the response as a function of the predictor <span class="math inline">\(x\)</span>. I “jitter” the data prior to plotting; otherwise, individual responses would overlap and obscure each other.</p>
<pre class="r"><code>library(ggplot2)
dx[, rjitter := jitter(as.numeric(r), factor = 0.5)]
ggplot(data = dx, aes(x = x, y = rjitter)) +
geom_point(color = "forestgreen", size = 0.5) +
scale_y_continuous(breaks = c(1:4),
labels = c("Strongly disagree", "Disagree",
"Agree", "Strongly Agree")) +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.title.y = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-07-generating-and-displaying-likert-type-data_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<p>You can see that when <span class="math inline">\(x\)</span> is smaller (closer to -0.5), a response of “Strongly disagree” is more likely. Conversely, when <span class="math inline">\(x\)</span> is closer to +0.5, the proportion of folks responding with “Strongly agree” increases.</p>
<p>If we “bin” the individual responses by ranges of <span class="math inline">\(x\)</span>, say grouping by tenths, -0.5 to -0.4, -0.4 to -0.3, all the way to 0.4 to 0.5, we can get another view of how the probabilities shift with respect to <span class="math inline">\(x\)</span>.</p>
<p>The <code>likert</code> package requires very little data manipulation, and once the data are set, it is easy to look at the data in a number of different ways, a couple of which I plot here. I encourage you to look at the <a href="http://jason.bryer.org/likert/">website</a> for many more examples and instructions on how to download the latest version from github.</p>
<pre class="r"><code>library(likert)
bins <- cut(dx$x, breaks = seq(-.5, .5, .1), include.lowest = TRUE)
dx[ , xbin := bins]
item <- data.frame(dx[, fr])
names(item) <- "r"
bin.grp <- factor(dx[, xbin])
likert.bin <- likert(item, grouping = bin.grp)
likert.bin</code></pre>
<pre><code>## Group Item Strongly disagree Disagree Agree Strongly agree
## 1 [-0.5,-0.4] r 65.63877 18.50220 7.048458 8.810573
## 2 (-0.4,-0.3] r 53.33333 27.40741 8.888889 10.370370
## 3 (-0.3,-0.2] r 52.84553 19.51220 10.975610 16.666667
## 4 (-0.2,-0.1] r 48.00000 22.80000 12.800000 16.400000
## 5 (-0.1,0] r 40.24390 24.39024 17.886179 17.479675
## 6 (0,0.1] r 35.20599 25.46816 15.355805 23.970037
## 7 (0.1,0.2] r 32.06107 27.09924 17.175573 23.664122
## 8 (0.2,0.3] r 25.00000 25.40984 21.721311 27.868852
## 9 (0.3,0.4] r 23.91304 27.39130 17.391304 31.304348
## 10 (0.4,0.5] r 17.82946 21.70543 20.155039 40.310078</code></pre>
<pre class="r"><code>plot(likert.bin)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-07-generating-and-displaying-likert-type-data_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<pre class="r"><code>plot(likert.bin, centered = FALSE)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-11-07-generating-and-displaying-likert-type-data_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>These plots show what data look like when the cumulative log odds are proportional as we move across different levels of a covariate. (Note that the two center groups should be closest to the baseline probabilities that were used to generate the data.) If you have real data, obviously it is useful to look at it first to see if this type of pattern emerges from the data. When we have more than one or two covariates, the pictures are not as useful, but then it also is probably harder to justify the proportional odds assumption.</p>
</div>
Thinking about different ways to analyze sub-groups in an RCT
https://www.rdatagen.net/post/sub-group-analysis-in-rct/
Wed, 01 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/sub-group-analysis-in-rct/<p>Here’s the scenario: we have an intervention that we think will improve outcomes for a particular population. Furthermore, there are two sub-groups (let’s say defined by which of two medical conditions each person in the population has) and we are interested in knowing if the intervention effect is different for each sub-group.</p>
<p>And here’s the question: what is the ideal way to set up a study so that we can assess (1) the intervention effects on the group as a whole, but also (2) the sub-group specific intervention effects?</p>
<p>This is a pretty straightforward, text-book scenario. Sub-group analysis is common in many areas of research, including health services research where I do most of my work. It is definitely an advantage to know ahead of time if you want to do a sub-group analysis, as you would in designing a stratified randomized controlled trial. Much of the criticism of these sub-group analyses arises when they are not pre-specified and conducted in an <em>ad hoc</em> manner after the study data have been collected. The danger there, of course, is that the assumptions underlying the validity of a hypothesis test are violated. (It may not be easy to convince folks to avoid hypothesis testing.) In planning ahead for these analyses, researchers are less likely to be accused of snooping through data in search of findings.</p>
<p>So, given that you know you want to do these analyses, the primary issue is how they should be structured. In particular, how should the statistical tests be set up so that we can draw reasonable conclusions? In my mind there are a few ways to answer the question.</p>
<div id="three-possible-models" class="section level2">
<h2>Three possible models</h2>
<p>Here are three models that can help us assess the effect of an intervention on an outcome in a population with at least two sub-groups:</p>
<p><span class="math display">\[ \text{Model 1: } Y_i = \beta_0 + \beta_1 D_i \]</span></p>
<p><span class="math display">\[ \text{Model 2: } Y_i = \beta_0^{\prime} + \beta_1^{\prime} D_i + \beta^{\prime}_2 T_i \]</span></p>
<p><span class="math display">\[ \text{Model 3: } Y_i = \beta_0^{\prime\prime} + \beta_1^{\prime\prime} D_i +\beta^{\prime\prime}_2 T_i +\beta^{\prime\prime}_3 T_i D_i\]</span></p>
<p>where <span class="math inline">\(Y_i\)</span> is the outcome for subject <span class="math inline">\(i\)</span>, <span class="math inline">\(T_i\)</span> is an indicator of treatment and equals 1 if the subject received the treatment, and <span class="math inline">\(D_i\)</span> is an indicator of having the condition that defines the second sub-group. <em>Model 1</em> assumes the medical condition can only affect the outcome. <em>Model 2</em> assumes that if the intervention does have an effect, it is the same regardless of sub-group. And <em>Model 3</em> allows for the possibility that intervention effects might vary between sub-groups.</p>
<div id="main-effects" class="section level3">
<h3>1. Main effects</h3>
<p>In the first approach, we would estimate both <em>Model 2</em> and <em>Model 3</em>, and conduct a hypothesis test using the null hypothesis <span class="math inline">\(\text{H}_{01}\)</span>: <span class="math inline">\(\beta_2^{\prime} = 0\)</span>. In this case we would reject <span class="math inline">\(\text{H}_{01}\)</span> if the p-value for the estimated value of <span class="math inline">\(\beta_2^{\prime}\)</span> was less than 0.05. If in fact we do reject <span class="math inline">\(\text{H}_{01}\)</span> (and conclude that there is an overall main effect), we could then (and only then) proceed to a second hypothesis test of the interaction term in <em>Model 3</em>, testing <span class="math inline">\(\text{H}_{02}\)</span>: <span class="math inline">\(\beta_3^{\prime\prime} = 0\)</span>. In this second test we can also evaluate the test using a cutoff of 0.05, because we only do this test if we reject the first one.</p>
<p>This is not a path typically taken, for reasons we will see at the end when we explore the relative power of each test under different effect size scenarios.</p>
</div>
<div id="interaction-effects" class="section level3">
<h3>2. Interaction effects</h3>
<p>In the second approach, we would also estimate just <em>Models 2</em> and <em>3</em>, but would reverse the order of the tests. We would first test for interaction in <em>Model 3</em>: <span class="math inline">\(\text{H}_{01}\)</span>: <span class="math inline">\(\beta_3^{\prime\prime} = 0\)</span>. If we reject <span class="math inline">\(\text{H}_{01}\)</span> (and conclude that the intervention effects are different across the two sub-groups), we stop there, because we have evidence that the intervention has some sort of effect, and that it is different across the sub-groups. (Of course, we can report the point estimates.) However, if we fail to reject <span class="math inline">\(\text{H}_{01}\)</span>, we would proceed to test the main effect from <em>Model 2</em>. In this case we would test <span class="math inline">\(\text{H}_{02}\)</span>: <span class="math inline">\(\beta_2^{\prime} = 0\)</span>.</p>
<p>In this approach, we are forced to adjust the size of our tests (and use, for example, 0.025 as a cutoff for both). Here is a little intuition for why. If we use a cutoff of 0.05 for the first test and in fact there is no effect, 5% of the time we will draw the wrong conclusion (by wrongly rejecting <span class="math inline">\(\text{H}_{01}\)</span>). However, 95% of the time we will <em>correctly</em> fail to reject the (true) null hypothesis in step one, and thus proceed to step two. Of all the times we proceed to the second step (which will be 95% of the time), we will err 5% of the time (again assuming the null is true). So, 95% of the time we will have an additional 5% error due to the second step, for an error rate of 4.75% due to the second test (95% <span class="math inline">\(\times\)</span> 5%). In total - adding up the errors from steps 1 and 2 - we will draw the wrong conclusion almost 10% of the time. However, if we use a cutoff of 0.025, then we will be wrong 2.5% of the time in step 1, and about 2.4% (97.5% <span class="math inline">\(\times\)</span> 2.5%) of the time in the second step, for a total error rate of just under 5%.</p>
<p>In the first approach (looking at the main effect first), we need to make no adjustment, because we only do the second test when we’ve rejected (incorrectly) the null hypothesis. By definition, errors we make in the second step will only occur in cases where we have made an error in the first step. In the first approach where we evaluate main effects first, the errors are nested. In the second, they are not nested but additive.</p>
</div>
<div id="global-test" class="section level3">
<h3>3. Global test</h3>
<p>In the third and last approach, we start by comparing <em>Model 3</em> with <em>Model 1</em> using a global F-test. In this case, we are asking the question of whether or not a model that includes treatment as a predictor does “better” than a model that only adjust for sub-group membership. The null hypothesis can crudely be stated as <span class="math inline">\(\text{H}_{01}\)</span>: <span class="math inline">\(\text{Model }3 = \text{Model }1\)</span>. If we reject this hypothesis (and conclude that the intervention does have some sort of effect, either generally or differentially for each sub-group), then we are free to evaluate <em>Models 2</em> and <em>3</em> to see if the there is a varying affect or not.</p>
<p>Here we can use cutoffs of 0.05 in our hypothesis tests. Again, we only make errors in the second step if we’ve made a mistake in the first step. The errors are nested and not additive.</p>
</div>
</div>
<div id="simulating-error-rates" class="section level2">
<h2>Simulating error rates</h2>
<p>This first simulation shows that the error rates of the three approaches are all approximately 5% under the assumption of no intervention effect. That is, given that there is no effect of the intervention on either sub-group (on average), we will draw the wrong conclusion about 5% of the time. In these simulations, the outcome depends only on disease status and not the treatment. Or, in other words, the null hypothesis is in fact true:</p>
<pre class="r"><code>library(simstudy)
# define the data
def <- defData(varname = "disease", formula = .5, dist = "binary")
# outcome depends only on sub-group status, not intervention
def2 <- defCondition(condition = "disease == 0",
formula = 0.0, variance = 1,
dist = "normal")
def2 <- defCondition(def2, condition = "disease == 1",
formula = 0.5, variance = 1,
dist = "normal")
set.seed(1987) # the year I graduated from college, in case
# you are wondering ...
pvals <- data.table() # store simulation results
# run 2500 simulations
for (i in 1: 2500) {
# generate data set
dx <- genData(400, def)
dx <- trtAssign(dx, nTrt = 2, balanced = TRUE,
strata = "disease", grpName = "trt")
dx <- addCondition(def2, dx, "y")
# fit 3 models
lm1 <- lm(y ~ disease, data = dx)
lm2 <- lm(y ~ disease + trt, data = dx)
lm3 <- lm(y ~ disease + trt + trt*disease, data = dx)
# extract relevant p-values
cM <- coef(summary(lm2))["trt", 4]
cI <- coef(summary(lm3))["disease:trt", 4]
fI <- anova(lm1, lm3)$`Pr(>F)`[2]
# store the p-values from each iteration
pvals <- rbind(pvals, data.table(cM, cI, fI))
}
pvals</code></pre>
<pre><code>## cM cI fI
## 1: 0.72272413 0.727465073 0.883669625
## 2: 0.20230262 0.243850267 0.224974909
## 3: 0.83602639 0.897635326 0.970757254
## 4: 0.70949192 0.150259496 0.331072131
## 5: 0.85990787 0.449130976 0.739087609
## ---
## 2496: 0.76142389 0.000834619 0.003572901
## 2497: 0.03942419 0.590363493 0.103971344
## 2498: 0.16305568 0.757882365 0.360893205
## 2499: 0.81873930 0.004805028 0.018188997
## 2500: 0.69122281 0.644801480 0.830958227</code></pre>
<pre class="r"><code># Approach 1
pvals[, mEffect := (cM <= 0.05)] # cases where we would reject null
pvals[, iEffect := (cI <= 0.05)]
# total error rate
pvals[, mean(mEffect & iEffect)] +
pvals[, mean(mEffect & !iEffect)]</code></pre>
<pre><code>## [1] 0.0496</code></pre>
<pre class="r"><code># Approach 2
pvals[, iEffect := (cI <= 0.025)]
pvals[, mEffect := (cM <= 0.025)]
# total error rate
pvals[, mean(iEffect)] +
pvals[, mean((!iEffect) & mEffect)]</code></pre>
<pre><code>## [1] 0.054</code></pre>
<pre class="r"><code># Approach 3
pvals[, fEffect := (fI <= 0.05)]
pvals[, iEffect := (cI <= 0.05)]
pvals[, mEffect := (cM <= 0.05)]
# total error rate
pvals[, mean(fEffect & iEffect)] +
pvals[, mean(fEffect & !(iEffect) & mEffect)]</code></pre>
<pre><code>## [1] 0.05</code></pre>
<p>If we use a cutoff of 0.05 for the second approach, we can see that the overall error rate is indeed inflated to close to 10%:</p>
<pre class="r"><code># Approach 2 - with invalid cutoff
pvals[, iEffect := (cI <= 0.05)]
pvals[, mEffect := (cM <= 0.05)]
# total error rate
pvals[, mean(iEffect)] +
pvals[, mean((!iEffect) & mEffect)]</code></pre>
<pre><code>## [1] 0.1028</code></pre>
</div>
<div id="exploring-power" class="section level2">
<h2>Exploring power</h2>
<p>Now that we have established at least three valid testing schemes, we can compare them by assessing the <em>power</em> of the tests. For the uninitiated, power is simply the probability of concluding that there is an effect when in fact there truly is an effect. Power depends on a number of factors, such as sample size, effect size, variation, and importantly for this post, the testing scheme.</p>
<p>The plot below shows the results of estimating power using a range of assumptions about an intervention’s effect in the two subgroups and the different approaches to testing. (The sample size and variation were fixed across all simulations.) The effect sizes ranged from -0.5 to +0.5. (I have not included the code here, because it is quite similar to what I did to assess the error rates. If anyone wants it, please let me know, and I can post it on github or send it to you.)</p>
<p>The estimated power reflects the probability that the tests correctly rejected at least one null hypothesis. So, if there was no interaction (say both group effects were +0.5) but there was a main effect, we would be correct if we rejected the hypothesis associated with the main effect. Take a look a the plot:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-interaction/Models.png" />
</div>
<p>What can we glean from this power simulation? Well, it looks like the global test that compares the interaction model with the null model (Approach 3) is the way to go, but just barely when compared to the approach that focuses solely on the interaction model first.</p>
<p>And, we see clearly that the first approach suffers from a fatal flaw. When the sub-group effects are offsetting, as they are when the effect is -0.5 in subgroup 1 and +0.5 in subgroup 2, we will fail to reject the null that says there is no main effect. As a result, we will never test for interaction and see that in fact the intervention does have an effect on both sub-groups (one positive and one negative). We don’t get to test for interaction, because the rule was designed to keep the error rate at 5% when in fact there is no effect, main or otherwise.</p>
<p>Of course, things are not totally clear cut. If we are quite certain that the effects are going to be positive for both groups, the second approach is not such a disaster. In fact, if we suspect that one of the sub-group effects will be large, it may be preferable to go with this approach. (Look at the right-hand side of the bottom plot to see this.) But, it is still hard to argue (though please do if you feel so inclined), at least based on the assumptions I used in the simulation, that we should take any approach other than the global test.</p>
</div>
Who knew likelihood functions could be so pretty?
https://www.rdatagen.net/post/mle-can-be-pretty/
Mon, 23 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/mle-can-be-pretty/<p>I just released a new iteration of <code>simstudy</code> (version 0.1.6), which fixes a bug or two and adds several spline related routines (available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>). The <a href="https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/">previous post</a> focused on using spline curves to generate data, so I won’t repeat myself here. And, apropos of nothing really - I thought I’d take the opportunity to do a simple simulation to briefly explore the likelihood function. It turns out if we generate lots of them, it can be pretty, and maybe provide a little insight.</p>
<p>If a probability density (or mass) function is more or less forward-looking - answering the question of what is the probability of seeing some future outcome based on some known probability model, the likelihood function is essentially backward-looking. The likelihood takes the data as given or already observed - and allows us to assess how likely that outcome was under different assumptions the underlying probability model. While the form of the model is not necessarily in question (normal, Poisson, binomial, etc) - though it certainly should be - the specific values of the parameters that define the location and shape of that distribution are not known. The likelihood function provides a guide as to how the backward-looking probability varies across different values of the distribution’s parameters for a <em>given</em> data set.</p>
<p>We are generally most interested in finding out where the peak of that curve is, because the parameters associated with that point (the maximum likelihood estimates) are often used to describe the “true” underlying data generating process. However, we are also quite interested in the shape of the likelihood curve itself, because that provides information about how certain we can be about our conclusions about the “true” model. In short, a function that has a more clearly defined peak provides more information than one that is pretty flat. When you are climbing Mount Everest, you are pretty sure you know when you reach the peak. But when you are walking across the rolling hills of Tuscany, you can never be certain if you are at the top.</p>
<div id="the-setup" class="section level3">
<h3>The setup</h3>
<p>A likelihood curve is itself a function of the observed data. That is, if we were able to draw different samples of data from a single population, the curves associated with each of those sample will vary. In effect, the function is a random variable. For this simulation, I repeatedly make draws from an underlying known model - in this case a very simple linear model with only one unknown slope parameter - and plot the likelihood function for each dataset set across a range of possible slopes along with the maximum point for each curve.</p>
<p>In this example, I am interested in understanding the relationship between a variable <span class="math inline">\(X\)</span> and some outcome <span class="math inline">\(Y\)</span>. In truth, there is a simple relationship between the two:</p>
<p><span class="math display">\[ Y_i = 1.5 \times X_i + \epsilon_i \ ,\]</span> where <span class="math inline">\(\epsilon_i \sim Normal(0, \sigma^2)\)</span>. In this case, we have <span class="math inline">\(n\)</span> individual observations, so that <span class="math inline">\(i \in (1,...n)\)</span>. Under this model, the likelihood where we do know <span class="math inline">\(\sigma^2\)</span> but don’t know the coefficient <span class="math inline">\(\beta\)</span> can be written as:</p>
<p><span class="math display">\[L(\beta;y_1, y_2,..., y_n, x_1, x_2,..., x_n,\sigma^2) = (2\pi\sigma^2)^{-n/2}\text{exp}\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta x_i)^2\right)\]</span></p>
<p>Since it is much easier to work with sums than products, we generally work with the log-likelihood function:</p>
<p><span class="math display">\[l(\beta;y_1, y_2,..., y_n, x_1, x_2,..., x_n, \sigma^2) = -\frac{n}{2}\text{ln}(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta x_i)^2\]</span> In the log-likelihood function, <span class="math inline">\(n\)</span>, <span class="math inline">\(x_i\)</span>’s, <span class="math inline">\(y_i\)</span>’s, and <span class="math inline">\(\sigma^2\)</span> are all fixed and known - we are trying to estimate <span class="math inline">\(\beta\)</span>, the slope. That is, the likelihood (or log-likelihood) is a function of <span class="math inline">\(\beta\)</span> only. Typically, we will have more than unknown one parameter - say multiple regression coefficients, or an unknown variance parameter (<span class="math inline">\(\sigma^2\)</span>) - but visualizing the likelihood function gets very hard or impossible; I am not great in imagining (or plotting) in <span class="math inline">\(p\)</span>-dimensions, which is what we need to do if we have <span class="math inline">\(p\)</span> parameters.</p>
</div>
<div id="the-simulation" class="section level3">
<h3>The simulation</h3>
<p>To start, here is a one-line function that returns the log-likelihood of a data set (containing <span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s) based on a specific value of <span class="math inline">\(\beta\)</span>.</p>
<pre class="r"><code>library(data.table)
ll <- function(b, dt, var) {
dt[, sum(dnorm(x = y, mean = b*x, sd = sqrt(var), log = TRUE))]
}
test <- data.table(x=c(1,1,4), y =c(2.0, 1.8, 6.3))
ll(b = 1.8, test, var = 1)</code></pre>
<pre><code>## [1] -3.181816</code></pre>
<pre class="r"><code>ll(b = 0.5, test, var = 1)</code></pre>
<pre><code>## [1] -13.97182</code></pre>
<p>Next, I generate a single draw of 200 observations of <span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s:</p>
<pre class="r"><code>library(simstudy)
b <- c(seq(0, 3, length.out = 500))
truevar = 1
defX <- defData(varname = "x", formula = 0,
variance = 9, dist = "normal")
defA <- defDataAdd(varname = "y", formula = "1.5*x",
variance = truevar, dist = "normal")
set.seed(21)
dt <- genData(200, defX)
dt <- addColumns(defA, dt)
dt</code></pre>
<pre><code>## id x y
## 1: 1 2.379040 4.3166333
## 2: 2 1.566754 0.9801416
## 3: 3 5.238667 8.4869651
## 4: 4 -3.814008 -5.6348268
## 5: 5 6.592169 9.6706410
## ---
## 196: 196 3.843341 4.5740967
## 197: 197 -1.334778 -1.5701510
## 198: 198 3.583162 5.0193182
## 199: 199 1.112866 1.5506167
## 200: 200 4.913644 8.2063354</code></pre>
<p>The likelihood function is described with a series of calls to function <code>ll</code> using <code>sapply</code>. Each iteration uses one value of the <code>b</code> vector. What we end up with is a likelihood estimation for each potential value of <span class="math inline">\(\beta\)</span> given the data.</p>
<pre class="r"><code>loglik <- sapply(b, ll, dt = dt, var = truevar)
bt <- data.table(b, loglike = loglik)
bt</code></pre>
<pre><code>## b loglike
## 1: 0.000000000 -2149.240
## 2: 0.006012024 -2134.051
## 3: 0.012024048 -2118.924
## 4: 0.018036072 -2103.860
## 5: 0.024048096 -2088.858
## ---
## 496: 2.975951904 -2235.436
## 497: 2.981963928 -2251.036
## 498: 2.987975952 -2266.697
## 499: 2.993987976 -2282.421
## 500: 3.000000000 -2298.206</code></pre>
<p>In a highly simplified approach to maximizing the likelihood, I simply select the <span class="math inline">\(\beta\)</span> that has the largest likelihood based on my calls to <code>ll</code> (I am limiting my search to values between 0 and 3, just because I happen to know the true value of the parameter). Of course, this is not how things work in the real world, particularly when you have more than one parameter to estimate - the estimation process requires elaborate algorithms. In the case of a normal regression model, it is actually the case that the ordinary least estimate of the regression parameters is the maximum likelihood estimate (you can see in the above equations that maximizing the likelihood <em>is</em> minimizing the sum of the squared differences of the observed and expected values).</p>
<pre class="r"><code>maxlik <- dt[, max(loglik)]
lmfit <- lm(y ~ x - 1, data =dt) # OLS estimate
(maxest <- bt[loglik == maxlik, b]) # value of beta that maxmizes likelihood</code></pre>
<pre><code>## [1] 1.472946</code></pre>
<p>The plot below on the left shows the data and the estimated slope using OLS. The plot on the right shows the likelihood function. The <span class="math inline">\(x\)</span>-axis represents the values of <span class="math inline">\(\beta\)</span>, and the <span class="math inline">\(y\)</span>-axis is the log-likelihood as a function of those <span class="math inline">\(\beta's\)</span>:</p>
<pre class="r"><code>library(ggplot2)
slopetxt <- paste0("OLS estimate: ", round(coef(lmfit), 2))
p1 <- ggplot(data = dt, aes(x = x, y= y)) +
geom_point(color = "grey50") +
theme(panel.grid = element_blank()) +
geom_smooth(method = "lm", se = FALSE,
size = 1, color = "#1740a6") +
annotate(geom = "text", label = slopetxt,
x = -5, y = 7.5,
family = "sans")
p2 <- ggplot(data = bt) +
scale_y_continuous(name = "Log likelihood") +
scale_x_continuous(limits = c(0, 3),
breaks = seq(0, 3, 0.5),
name = expression(beta)) +
theme(panel.grid.minor = element_blank()) +
geom_line(aes(x = b, y = loglike),
color = "#a67d17", size = 1) +
geom_point(x = maxest, y = maxlik, color = "black", size = 3)
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-23-repeated-sampling-to-see-what-the-likelihood-function-looks-like-literally_files/figure-html/unnamed-chunk-5-1.png" width="864" /></p>
</div>
<div id="adding-variation" class="section level3">
<h3>Adding variation</h3>
<p>Now, for the pretty part. Below, I show plots of multiple likelihood functions under three scenarios. The only thing that differs across each of those scenarios is the level of variance in the error term, which is specified in <span class="math inline">\(\sigma^2\)</span>. (I have not included the code here since essentially loop through the process describe above.) If you want the code just let me know, and I will make sure to post it. I do want to highlight the fact that I used package <code>randomcoloR</code> to generate the colors in the plots.)</p>
<p><img src="https://www.rdatagen.net/post/2017-10-23-repeated-sampling-to-see-what-the-likelihood-function-looks-like-literally_files/figure-html/unnamed-chunk-6-1.png" width="672" /><img src="https://www.rdatagen.net/post/2017-10-23-repeated-sampling-to-see-what-the-likelihood-function-looks-like-literally_files/figure-html/unnamed-chunk-6-2.png" width="672" /><img src="https://www.rdatagen.net/post/2017-10-23-repeated-sampling-to-see-what-the-likelihood-function-looks-like-literally_files/figure-html/unnamed-chunk-6-3.png" width="672" /></p>
<p>What we can see here is that as the variance increases, we move away from Mt. Everest towards the Tuscan hills. The variance of the underlying process clearly has an impact on the uncertainty of the maximum likelihood estimates. The likelihood functions flatten out and the MLEs have more variability with increased underlying variance of the outcomes <span class="math inline">\(y\)</span>. Of course, this is all consistent with maximum likelihood theory.</p>
</div>
Can we use B-splines to generate non-linear data?
https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/
Mon, 16 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/<p>I’m exploring the idea of adding a function or set of functions to the <code>simstudy</code> package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit to use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated. It would also make it easy to explore spline methods, because as with any other simulated data set, we would know the underlying data generating process.</p>
<div id="b-splines" class="section level3">
<h3>B-splines</h3>
<p>A B-spline is a linear combination of a set of basis functions that are determined by the number and location of specified knots or cut-points, as well as the (polynomial) degree of curvature. A degree of one implies a set of straight lines, degree of two implies a quadratic curve, three a cubic curve, etc. This <a href="https://cran.r-project.org/web/packages/crs/vignettes/spline_primer.pdf">nice quick intro</a> provides much more insight into issues B-splines than I can provide here. Or if you want even more detail, check out this <a href="http://www.springer.com/us/book/9780387953663">book</a>. It is a very rich topic.</p>
<p>Within a cut-point region, the sum of the basis functions always equals 1. This is easy to see by looking at a plot of basis functions, several of which are provided below. The definition and shape of the basis functions do not in any way depend on the data, only on the degree and cut-points. Of course, these functions can be added together in infinitely different ways using weights. If one is trying to fit a B-spline line to data, those weights can be estimated using regression models.</p>
</div>
<div id="splines-in-r" class="section level2">
<h2>Splines in R</h2>
<p>The <code>bs</code> function in the <code>splines</code> package, returns values from these basis functions based on the specification of knots and degree of curvature. I wrote a wrapper function that uses the <code>bs</code> function to generate the basis function, and then I do a linear transformation of these functions by multiplying the vector parameter <em>theta</em>, which is just a vector of coefficients. The linear combination at each value of <span class="math inline">\(x\)</span> (the support of the basis functions) generates a value (which I call <span class="math inline">\(y.spline\)</span>) on the desired curve. The wrapper returns a list of objects, including a data.table that includes <span class="math inline">\(x\)</span> and <span class="math inline">\(y.spline\)</span>, as well as the basis functions, and knots.</p>
<pre class="r"><code>library(splines)
library(data.table)
library(ggplot2)
library(broom)
genSpline <- function(x, knots, degree, theta) {
basis <- bs(x = x, knots = knots, degree = degree,
Boundary.knots = c(0,1), intercept = TRUE)
y.spline <- basis %*% theta
dt <- data.table(x, y.spline = as.vector(y.spline))
return(list(dt = dt, basis = basis, knots = knots))
}</code></pre>
<p>I’ve also written two functions that make it easy to print the basis function and the spline curve. This will enable us to look at a variety of splines.</p>
<pre class="r"><code>plot.basis <- function(basisdata) {
dtbasis <- as.data.table(basisdata$basis)
dtbasis[, x := seq(0, 1, length.out = .N)]
dtmelt <- melt(data = dtbasis, id = "x",
variable.name = "basis", variable.factor = TRUE)
ggplot(data=dtmelt, aes(x=x, y=value, group = basis)) +
geom_line(aes(color=basis), size = 1) +
theme(legend.position = "none") +
scale_x_continuous(limits = c(0, 1),
breaks = c(0, basisdata$knots, 1)) +
theme(panel.grid.minor = element_blank())
}</code></pre>
<pre class="r"><code>plot.spline <- function(basisdata, points = FALSE) {
p <- ggplot(data = basisdata$dt)
if (points) p <- p + geom_point(aes(x=x, y = y), color = "grey75")
p <- p +
geom_line(aes(x = x, y = y.spline), color = "red", size = 1) +
scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(0, 1), breaks = knots) +
theme(panel.grid.minor = element_blank())
return(p)
}</code></pre>
<div id="linear-spline-with-quartile-cut-points" class="section level3">
<h3>Linear spline with quartile cut-points</h3>
<p>Here is a simple linear spline that has four regions defined by three cut-points, and the slope of the line in each region varies. The first value of <em>theta</em> is essentially the intercept. When you look at the basis plot, you will see that any single region has two “active” basis functions (represented by two colors), the other functions are all 0 in that region. The slope of the line in each is determined by the relevant values of theta. It is probably just easier to take a look:</p>
<pre class="r"><code>x <- seq(0, 1, length.out = 1000)
knots <- c(0.25, 0.5, 0.75)
theta = c(0.6, 0.1, 0.3, 0.2, 0.9)
sdata <- genSpline(x, knots, 1, theta)</code></pre>
<p>For this example, I am printing out the basis function for the first few values of <span class="math inline">\(x\)</span>.</p>
<pre class="r"><code>round( head(cbind(x = sdata$dt$x, sdata$basis)), 4 )</code></pre>
<pre><code>## x 1 2 3 4 5
## [1,] 0.000 1.000 0.000 0 0 0
## [2,] 0.001 0.996 0.004 0 0 0
## [3,] 0.002 0.992 0.008 0 0 0
## [4,] 0.003 0.988 0.012 0 0 0
## [5,] 0.004 0.984 0.016 0 0 0
## [6,] 0.005 0.980 0.020 0 0 0</code></pre>
<pre class="r"><code>plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-5-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-5-2.png" width="672" /></p>
</div>
<div id="same-knots-cut-points-but-different-theta-coefficients" class="section level3">
<h3>Same knots (cut-points) but different theta (coefficients)</h3>
<p>If use the same knot and degree specification, but change the vector <span class="math inline">\(theta\)</span>, we change the slope of the lines in each of the four regions:</p>
<pre class="r"><code>theta = c(0.2, 0.3, 0.8, 0.2, 0.1)
sdata <- genSpline(x, knots, 1, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-6-2.png" width="672" /></p>
</div>
<div id="quadratic-spline-with-quartile-cut-points" class="section level3">
<h3>Quadratic spline with quartile cut-points</h3>
<p>The basis functions get a little more elaborate with a quadratic spline. With this added degree, we get an additional basis function in each region, so you should see 3 colors instead of 2. The resulting spline is parabolic in each region, but with a different shape, each of which is determined by <em>theta</em>.</p>
<pre class="r"><code>knots <- c(0.25, 0.5, 0.75)
theta = c(0.6, 0.1, 0.5, 0.2, 0.8, 0.3)
sdata <- genSpline(x, knots, 2, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-7-2.png" width="672" /></p>
</div>
<div id="quadratic-spline-with-two-cut-points-three-regions" class="section level3">
<h3>Quadratic spline with two cut-points (three regions)</h3>
<pre class="r"><code>knots <- c(0.333, 0.666)
theta = c(0.2, 0.4, 0.1, 0.9, 0.6)
sdata <- genSpline(x, knots, 2, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-8-2.png" width="672" /></p>
</div>
<div id="cubic-spline-with-two-cut-points-three-regions" class="section level3">
<h3>Cubic spline with two cut-points (three regions)</h3>
<p>And in this last example, we generate basis functions for a cubic spline the differs in three regions. The added curvature is apparent:</p>
<pre class="r"><code>knots <- c(0.333, 0.666)
theta = c(0.2, 0.6, 0.1, 0.9, 0.2, 0.8)
sdata <- genSpline(x, knots, 3, theta)
plot.basis(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-9-1.png" width="672" /></p>
<pre class="r"><code>plot.spline(sdata)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-9-2.png" width="672" /></p>
</div>
<div id="generating-new-data-from-the-underlying-spline" class="section level3">
<h3>Generating new data from the underlying spline</h3>
<p>It is a simple step to generate data from the spline. Each value on the line is treated as the mean, and “observed” data can be generated by adding variation. In this case, I use the normal distribution, but there is no reason other distributions can’t be used. I’m generating data based on the the parameters in the previous example. And this time, the spline plot includes the randomly generated data:</p>
<pre class="r"><code>set.seed(5)
x <- runif(250)
sdata <- genSpline(x, knots, 3, theta)
sdata$dt[, y := rnorm(.N, y.spline, 0.1)]
plot.spline(sdata, points = TRUE)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<p>Now that we have generated new data, why don’t we go ahead and fit a model to see if we can recover the coefficients specified in <em>theta</em>? We are interested in the relationship of <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, but the relationship is not linear and changes across <span class="math inline">\(x\)</span>. To estimate a model, we regress the outcome data <span class="math inline">\(y\)</span> on the values of the basis function that correspond to each value of <span class="math inline">\(x\)</span>:</p>
<pre class="r"><code>dxbasis <- as.data.table(sdata$basis)
setnames(dxbasis, paste0("x", names(dxbasis)))
dxbasis[, y := sdata$dt$y]
round(dxbasis, 3)</code></pre>
<pre><code>## x1 x2 x3 x4 x5 x6 y
## 1: 0.063 0.557 0.343 0.036 0.000 0.000 0.443
## 2: 0.000 0.000 0.140 0.565 0.295 0.000 0.542
## 3: 0.000 0.000 0.003 0.079 0.495 0.424 0.634
## 4: 0.003 0.370 0.523 0.104 0.000 0.000 0.232
## 5: 0.322 0.553 0.120 0.005 0.000 0.000 0.269
## ---
## 246: 0.000 0.023 0.442 0.494 0.041 0.000 0.520
## 247: 0.613 0.356 0.031 0.001 0.000 0.000 0.440
## 248: 0.246 0.584 0.161 0.009 0.000 0.000 0.236
## 249: 0.000 0.000 0.014 0.207 0.597 0.182 0.505
## 250: 0.002 0.344 0.539 0.115 0.000 0.000 0.313</code></pre>
<pre class="r"><code># fit the model - explicitly exclude intercept since x1 is intercept
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 - 1, data = dxbasis)
cbind(tidy(lmfit)[,1:3], true = theta)</code></pre>
<pre><code>## term estimate std.error true
## 1 x1 0.16465186 0.03619581 0.2
## 2 x2 0.57855125 0.03996219 0.6
## 3 x3 0.09093425 0.04267027 0.1
## 4 x4 0.94938718 0.04395370 0.9
## 5 x5 0.13579559 0.03805510 0.2
## 6 x6 0.85867619 0.03346704 0.8</code></pre>
<p>Using the parameter estimates (estimated here using OLS), we can get predicted values and plot them to see how well we did:</p>
<pre class="r"><code># get the predicted values so we can plot
dxbasis[ , y.pred := predict(object = lmfit)]
dxbasis[ , x := x]
# blue line represents predicted values
plot.spline(sdata, points = TRUE) +
geom_line(data=dxbasis, aes(x=x, y=y.pred), color = "blue", size = 1 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>The model did quite a good job, because we happened to assume the correct underlying assumptions of the spline. However, let’s say we suspected that the data were generated by a quadratic spline. We need to get the basis function assuming the same cut-points for the knots but now using a degree equal to two. Since a reduction in curvature reduces the number of basis functions by one, the linear model changes slightly. (Note that this model is not quite nested in the previous (cubic) model, because the values of the basis functions are different.)</p>
<pre class="r"><code>xdata <- genSpline(x, knots, 2, theta = rep(1,5))
dxbasis <- as.data.table(xdata$basis)
setnames(dxbasis, paste0("x", names(dxbasis)))
dxbasis[, y := sdata$dt$y]
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 - 1, data = dxbasis)
dxbasis[ , y.pred := predict(object = lmfit)]
dxbasis[ , x := x]
plot.spline(sdata, points = TRUE) +
geom_line(data=dxbasis, aes(x=x, y=y.pred),
color = "forestgreen", size = 1 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-16-generating-non-linear-data-using-b-splines_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p>If we compare the two models in terms of model fit, the cubic model only does slightly better in term of <span class="math inline">\(R^2\)</span>: 0.96 vs. 0.94. In this case, it probably wouldn’t be so obvious which model to use.</p>
</div>
</div>
A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions
https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/
Thu, 05 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/<p>I just updated <code>simstudy</code> to version 0.1.5 (available on <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>) so that it now includes several new distributions - <em>exponential</em>, <em>discrete uniform</em>, and <em>negative binomial</em>.</p>
<p>As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete (integer) distribution of outcomes of non-negative values that is often used to describe count outcomes. It is characterized by a mean (or rate) and its variance equals its mean.</p>
<div id="added-variation" class="section level3">
<h3>Added variation</h3>
<p>In many situations, when count data are modeled, it turns out that the variance of the data exceeds the mean (a situation called <em>over-dispersion</em>). In this case an alternative model is used that allows for the greater variance, which is based on the negative binomial distribution. It turns out that if the negative binomial distribution has mean <span class="math inline">\(\mu\)</span>, it has a variance of <span class="math inline">\(\mu + \theta \mu^2\)</span>, where <span class="math inline">\(\theta\)</span> is called a <em>dispersion</em> parameter. If <span class="math inline">\(\theta = 0\)</span>, we have the Poisson distribution, but otherwise the variance of a negative binomial random variable will exceed the variance of a Poisson random variable as long as they share the same mean, because <span class="math inline">\(\mu > 0\)</span> and <span class="math inline">\(\theta \ge 0\)</span>.</p>
<p>We can see this by generating data from each distribution with mean 15, and a dispersion parameter of 0.2 for the negative binomial. We expect a variance around 15 for the Poisson distribution, and 60 for the negative binomial distribution.</p>
<pre class="r"><code>library(simstudy)
library(ggplot2)
# for a less cluttered look
theme_no_minor <- function(color = "grey90") {
theme(panel.grid.minor = element_blank(),
panel.background = element_rect(fill="grey95")
)
}
options(digits = 2)
# define data
defC <- defCondition(condition = "dist == 0", formula = 15,
dist = "poisson", link = "identity")
defC <- defCondition(defC, condition = "dist == 1", formula = 15,
variance = 0.2, dist = "negBinomial",
link = "identity")
# generate data
set.seed(50)
dt <- genData(500)
dt <- trtAssign(dt, 2, grpName = "dist")
dt <- addCondition(defC, dt, "y")
genFactor(dt, "dist", c("Poisson", "Negative binomial"))
# compare distributions
dt[, .(mean = mean(y), var = var(y)), keyby = fdist]</code></pre>
<pre><code>## fdist mean var
## 1: Poisson 15 15
## 2: Negative binomial 15 54</code></pre>
<pre class="r"><code>ggplot(data = dt, aes(x = y, group = fdist)) +
geom_density(aes(fill=fdist), alpha = .4) +
scale_fill_manual(values = c("#808000", "#000080")) +
scale_x_continuous(limits = c(0,60),
breaks = seq(0, 60, by = 20)) +
theme_no_minor() +
theme(legend.title = element_blank(),
legend.position = c(0.80, 0.83))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="underestimating-standard-errors" class="section level3">
<h3>Underestimating standard errors</h3>
<p>In the context of a regression, misspecifying a model as Poisson rather than negative binomial, can lead to an underestimation of standard errors, even though the point estimates may be quite reasonable (or may not). The Poisson model will force the variance estimate to be equal to the mean at any particular point on the regression curve. The Poisson model will effectively ignore the true extent of the variation, which can lead to problems of interpretation. We might conclude that there is an association when in fact there is none.</p>
<p>In this simple simulation, we generate two predictors (<span class="math inline">\(x\)</span> and <span class="math inline">\(b\)</span>) and an outcome (<span class="math inline">\(y\)</span>). The outcome is a function of <span class="math inline">\(x\)</span> only:</p>
<pre class="r"><code>library(broom)
library(MASS)
# Generating data from negative binomial dist
def <- defData(varname = "x", formula = 0, variance = 1,
dist = "normal")
def <- defData(def, varname = "b", formula = 0, variance = 1,
dist = "normal")
def <- defData(def, varname = "y", formula = "0.9 + 0.6*x",
variance = 0.3, dist = "negBinomial", link = "log")
set.seed(35)
dt <- genData(500, def)
ggplot(data = dt, aes(x=x, y = y)) +
geom_jitter(width = .1) +
ggtitle("Outcome as function of 1st predictor") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<pre class="r"><code>ggplot(data = dt, aes(x=b, y = y)) +
geom_jitter(width = 0) +
ggtitle("Outcome as function of 2nd predictor") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-2-2.png" width="672" /></p>
<p>I fit two models using both predictors. The first assumes (incorrectly) a Poisson distribution, and the second assumes (correctly) a negative binomial distribution. We can see that although the point estimates are quite close, the standard error estimates for the predictors in the Poisson model are considerably greater (about 50% higher) than the negative binomial model. And if we were basing any conclusion on the p-value (which is not always the obvious way to do <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/abandon.pdf">things</a>), we might make the wrong call since the p-value for the slope of <span class="math inline">\(b\)</span> is estimated to be 0.029. Under the correct model model, the p-value is 0.29.</p>
<pre class="r"><code>glmfit <- glm(y ~ x + b, data = dt, family = poisson (link = "log") )
tidy(glmfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.956 0.030 32.3 1.1e-228
## 2 x 0.516 0.024 21.9 1.9e-106
## 3 b -0.052 0.024 -2.2 2.9e-02</code></pre>
<pre class="r"><code>nbfit <- glm.nb(y ~ x + b, data = dt)
tidy(nbfit)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.954 0.039 24.2 1.1e-129
## 2 x 0.519 0.037 14.2 7.9e-46
## 3 b -0.037 0.036 -1.1 2.9e-01</code></pre>
<p>A plot of the fitted regression curve and confidence bands of <span class="math inline">\(b\)</span> estimated by each model reinforces the difference. The lighter shaded region is the wider confidence band of the negative binomial model, and the darker shaded region the based on the Poisson model.</p>
<pre class="r"><code>newb <- data.table(b=seq(-3,3,length = 100), x = 0)
poispred <- predict(glmfit, newdata = newb, se.fit = TRUE,
type = "response")
nbpred <-predict(nbfit, newdata = newb, se.fit = TRUE,
type = "response")
poisdf <- data.table(b = newb$b, y = poispred$fit,
lwr = poispred$fit - 1.96*poispred$se.fit,
upr = poispred$fit + 1.96*poispred$se.fit)
nbdf <- data.table(b = newb$b, y = nbpred$fit,
lwr = nbpred$fit - 1.96*nbpred$se.fit,
upr = nbpred$fit + 1.96*nbpred$se.fit)
ggplot(data = poisdf, aes(x=b, y = y)) +
geom_line() +
geom_ribbon(data=nbdf, aes(ymin = lwr, ymax=upr), alpha = .3,
fill = "red") +
geom_ribbon(aes(ymin = lwr, ymax=upr), alpha = .5,
fill = "red") +
theme_no_minor()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>And finally, if we take 500 samples of size 500, and estimate slope for <span class="math inline">\(b\)</span> each time and calculate the standard deviation of those estimates, it is quite close to the standard error estimate we saw in the model of the original simulated data set using the negative binomial assumption (0.036). And the mean of those estimates is quite close to zero, the true value.</p>
<pre class="r"><code>result <- data.table()
for (i in 1:500) {
dt <- genData(500, def)
glmfit <- glm(y ~ x + b, data = dt, family = poisson)
nbfit <- glm.nb(y ~ x + b, data = dt)
result <- rbind(result, data.table(bPois = coef(glmfit)["b"],
bNB = coef(nbfit)["b"])
)
}
result[,.(sd(bPois), sd(bNB))] # observed standard error</code></pre>
<pre><code>## V1 V2
## 1: 0.037 0.036</code></pre>
<pre class="r"><code>result[,.(mean(bPois), mean(bNB))] # observed mean</code></pre>
<pre><code>## V1 V2
## 1: 0.0025 0.0033</code></pre>
</div>
<div id="negative-binomial-as-mixture-of-poissons" class="section level3">
<h3>Negative binomial as mixture of Poissons</h3>
<p>An interesting relationship between the two distributions is that a negative binomial distribution can be generated from a mixture of individuals whose outcomes come from a Poisson distribution, but each individual has her own rate or mean. Furthermore, those rates must have a specific distribution - a Gamma. (For much more on this, you can take a look <a href="https://probabilityandstats.wordpress.com/tag/poisson-gamma-mixture/">here</a>.) Here is a little simulation:</p>
<pre class="r"><code>mu = 15
disp = 0.2
# Gamma distributed means
def <- defData(varname = "gmu", formula = mu, variance = disp,
dist = "gamma")
# generate data from each distribution
defC <- defCondition(condition = "nb == 0", formula = "gmu",
dist = "poisson")
defC <- defCondition(defC, condition = "nb == 1", formula = mu,
variance = disp, dist = "negBinomial")
dt <- genData(5000, def)
dt <- trtAssign(dt, 2, grpName = "nb")
genFactor(dt, "nb", labels = c("Poisson-Gamma", "Negative binomial"))
dt <- addCondition(defC, dt, "y")
# means and variances should be very close
dt[, .(Mean = mean(y), Var = var(y)), keyby = fnb]</code></pre>
<pre><code>## fnb Mean Var
## 1: Poisson-Gamma 15 62
## 2: Negative binomial 15 57</code></pre>
<pre class="r"><code># plot
ggplot(data = dt, aes(x = y, group = fnb)) +
geom_density(aes(fill=fnb), alpha = .4) +
scale_fill_manual(values = c("#808000", "#000080")) +
scale_x_continuous(limits = c(0,60),
breaks = seq(0, 60, by = 20)) +
theme_no_minor() +
theme(legend.title = element_blank(),
legend.position = c(0.80, 0.83))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-10-05-a-small-update-to-simstudy-provides-an-excuse-to-compare-the-negative-binomial-and-poisson-distributions_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>```</p>
</div>
CACE closed: EM opens up exclusion restriction (among other things)
https://www.rdatagen.net/post/em-estimation-of-cace/
Thu, 28 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/em-estimation-of-cace/<p>This is the third, and probably last, of a series of posts touching on the estimation of <a href="https://www.rdatagen.net/post/cace-explored/">complier average causal effects</a> (CACE) and <a href="https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/">latent variable modeling techniques</a> using an expectation-maximization (EM) algorithm . What follows is a simplistic way to implement an EM algorithm in <code>R</code> to do principal strata estimation of CACE.</p>
<div id="the-em-algorithm" class="section level3">
<h3>The EM algorithm</h3>
<p>In this approach, we assume that individuals fall into one of three possible groups - <em>never-takers</em>, <em>always-takers</em>, and <em>compliers</em> - but we cannot see who is who (except in a couple of cases). For each group, we are interested in estimating the unobserved potential outcomes <span class="math inline">\(Y_0\)</span> and <span class="math inline">\(Y_1\)</span> using observed outcome measures of <span class="math inline">\(Y\)</span>. The EM algorithm does this in two steps. The <em>E-step</em> estimates the missing class membership for each individual, and the <em>M-step</em> provides maximum likelihood estimates of the group-specific potential outcomes and variation.</p>
<p>An estimate group membership was presented in this <a href="https://projecteuclid.org/euclid.aos/1034276631">Imbens & Rubin 1997 paper</a>. The probability that an individual is a member of a particular group is a function of how close the individual’s observed outcome is to the mean of the group and the overall probability of group membership:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/table.png" />
</div>
<p>where <span class="math inline">\(Z\)</span> is treatment assignment and <span class="math inline">\(M\)</span> is treatment received. In addition, <span class="math inline">\(g_{c0}^i = \phi\left( \frac{Y_{obs,i} - \mu_{c0}}{\sigma_{c0}} \right)/\sigma_{c0}\)</span>, where <span class="math inline">\(\phi(.)\)</span> is the standard normal density. (And the same goes for the other <span class="math inline">\(g^i\)</span>’s.) <span class="math inline">\(\pi_a\)</span>, <span class="math inline">\(\pi_n\)</span>, and <span class="math inline">\(\pi_c\)</span> are estimated in the prior stage (or with starting values). <span class="math inline">\(\mu_{c0}\)</span>, <span class="math inline">\(\mu_{c1}\)</span>, <span class="math inline">\(\sigma_{c0}\)</span>, <span class="math inline">\(\sigma_{c1}\)</span>, etc. are also estimated in the prior <em>M-step</em> or with starting values in the case of the first <em>E-step</em>. Note that because we <em>are</em> assuming monotonicity (no <em>deniers</em> - which is not a necessary assumption for the EM approach, but used here to simplify things a bit), the probability of group membership is 1 for those randomized to control but who receive treatment (<em>always-takers</em>) and for those randomized to intervention but refuse (<em>never-takers</em>).</p>
</div>
<div id="em-steps" class="section level3">
<h3>EM steps</h3>
<p>I’ve created a separate function for each step in the algorithm. The <em>E-step</em> follows the Imbens & Rubin specification just described. The <em>M-step</em> just calculates the weighted averages and variances of the outcomes within each <span class="math inline">\(Z\)</span>/<span class="math inline">\(M\)</span> pair, with the weights coming from the probabilities estimated in the <em>E-step</em>. (These are, in fact, maximum likelihood estimates of the means and variances.) There are a pair of functions to estimate the log likelihood after each iteration. We stop iterating once the log likelihood has reached a stable state. And finally, there is a function to initialize the 15 parameters.</p>
<p>One thing to highlight here is that a strong motivation for using the EM algorithm is that we do <em>not</em> need to assume the exclusion restriction. That is, it is possible that randomizing someone to the intervention may have an effect on the outcome even if there is no effect on whether or not the intervention is used. Or in other words, we are saying it is possible that randomization has an effect on <em>always-takers</em> and <em>never-takers</em>, an assumption we <em>cannot</em> make using an instrumental variable (IV) approach. I mention that here, because the <em>M-step</em> function as written here explicitly drops the exclusion restriction assumption. However, I will first illustrate the model estimates in a case where data are indeed based on that assumption; while my point is to show that the EM estimates are unbiased as are the IV estimates in this scenario, I may actually be introducing a small amount of bias into the EM estimate by not re-writing the function to create a single mean for <em>always-takers</em> and <em>never-takers</em>. But, for brevity’s sake, this seems adequate.</p>
<pre class="r"><code>estep <- function(params, y, z, m) {
piC <- 0
piN <- 0
piA <- 0
if (z == 0 & m == 0) {
gC0 <- dnorm((y - params$mC0)/params$sC0) / params$sC0
gN0 <- dnorm((y - params$mN0)/params$sN0) / params$sN0
piC <- params$pC * gC0 / ( params$pC * gC0 + params$pN * gN0)
piN <- 1- piC
}
if (z == 0 & m == 1) {
piA <- 1
}
if (z == 1 & m == 0) {
piN <- 1
}
if (z == 1 & m == 1) {
gC1 <- dnorm((y - params$mC1)/params$sC1) / params$sC1
gA1 <- dnorm((y - params$mA1)/params$sA1) / params$sA1
piC <- params$pC * gC1 / ( params$pC * gC1 + params$pA * gA1)
piA <- 1 - piC
}
return(list(piC = piC, piN = piN, piA = piA))
}
library(Weighted.Desc.Stat)
mstep <- function(params, dx) {
params$mN0 <- dx[z == 0 & m == 0, w.mean(y, piN)] # never-taker
params$sN0 <- dx[z == 0 & m == 0, sqrt(w.var(y, piN))] # never-taker
params$mN1 <- dx[z == 1 & m == 0, w.mean(y, piN)] # never-taker
params$sN1 <- dx[z == 1 & m == 0, sqrt(w.var(y, piN))] # never-taker
params$mA0 <- dx[z == 0 & m == 1, w.mean(y, piA)]# always-taker
params$sA0 <- dx[z == 0 & m == 1, sqrt(w.var(y, piA))] # always-taker
params$mA1 <- dx[z == 1 & m == 1, w.mean(y, piA)]# always-taker
params$sA1 <- dx[z == 1 & m == 1, sqrt(w.var(y, piA))] # always-taker
params$mC0 <- dx[z == 0 & m == 0, w.mean(y, piC)] # complier, z=0
params$sC0 <- dx[z == 0 & m == 0, sqrt(w.var(y, piC))] # complier, z=0
params$mC1 <- dx[z == 1 & m == 1, w.mean(y, piC)] # complier, z=1
params$sC1 <- dx[z == 1 & m == 1, sqrt(w.var(y, piC))] # complier, z=1
nC <- dx[, sum(piC)]
nN <- dx[, sum(piN)]
nA <- dx[, sum(piA)]
params$pC <- (nC / sum(nC, nN, nA))
params$pN <- (nN / sum(nC, nN, nA))
params$pA <- (nA / sum(nC, nN, nA))
return(params)
}
like.i <- function(params, y, z, m) {
if (z == 0 & m == 0) {
l <- params$pC * dnorm(x = y, mean = params$mC0, sd = params$sC0) +
params$pN * dnorm(x = y, mean = params$mN0, sd = params$sN0)
}
if (z == 0 & m == 1) {
l <- params$pA * dnorm(x = y, mean = params$mA0, sd = params$sA0)
}
if (z == 1 & m == 0) {
l <- params$pN * dnorm(x = y, mean = params$mN1, sd = params$sN1)
}
if (z == 1 & m == 1) {
l <- params$pC * dnorm(x = y, mean = params$mC1, sd = params$sC1) +
params$pA * dnorm(x = y, mean = params$mA1, sd = params$sA1)
}
return(l)
}
loglike <- function(dt, params){
dl <- dt[, .(l.i = like.i(params, y, z, m)), keyby = id]
return(dl[, sum(log(l.i))])
}
initparams <- function() {
params = list(pC = 1/3, pN = 1/3, pA = 1/3,
mC0 = rnorm(1,0,.1), sC0 = 0.2,
mC1 = rnorm(1,0,.1), sC1 = 0.2,
mN0 = rnorm(1,0,.1), sN0 = 0.2,
mN1 = rnorm(1,0,.1), sN1 = 0.2,
mA0 = rnorm(1,0,.1), sA0 = 0.2,
mA1 = rnorm(1,0,.1), sA1 = 0.2)
return(params)
}</code></pre>
</div>
<div id="data-defintions" class="section level3">
<h3>Data defintions</h3>
<p>These next set of statements define the data that will be generated. I define the distribution of group assignment as well as potential outcomes for the intervention and the outcome <span class="math inline">\(Y\)</span>. We also define how the observed data will be generated, which is a function of treatment randomization …</p>
<pre class="r"><code>library(simstudy)
### Define data distributions
# Status :
# 1 = A(lways taker)
# 2 = N(ever taker)
# 3 = C(omplier)
def <- defDataAdd(varname = "Status",
formula = "0.25; 0.40; 0.35", dist = "categorical")
# potential outcomes (PO) for intervention depends on group status
def <- defDataAdd(def, varname = "M0",
formula = "(Status == 1) * 1", dist = "nonrandom")
def <- defDataAdd(def, varname = "M1",
formula = "(Status != 2) * 1", dist = "nonrandom")
# observed intervention status based on randomization and PO
def <- defDataAdd(def, varname = "m",
formula = "(z==0) * M0 + (z==1) * M1",
dist = "nonrandom")
# potential outcome for Y (depends group status - A, N, or C)
# under assumption of exclusion restriction
defY0 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .25, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 3",
formula = 0.1, variance = .16, dist = "normal")
defY1 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .25, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 3",
formula = 0.9, variance = .49, dist = "normal")
# observed outcome function of actual treatment
defy <- defDataAdd(varname = "y",
formula = "(z == 0) * Y0 + (z == 1) * Y1",
dist = "nonrandom")</code></pre>
</div>
<div id="data-generation" class="section level3">
<h3>Data generation</h3>
<p>I am generating multiple data sets and estimating the causal effects for each using the EM and IV approaches. This gives better picture of the bias and variation under the two different scenarios (exclusion restriction & no exclusion restriction) and different methods (EM & IV). To simplify the code a bit, I’ve written a function to consolidate the data generating process:</p>
<pre class="r"><code>createDT <- function(n, def, defY0, defY1, defy) {
dt <- genData(n)
dt <- trtAssign(dt, n=2, grpName = "z")
dt <- addColumns(def, dt)
genFactor(dt, "Status",
labels = c("Always-taker","Never-taker", "Complier"),
prefix = "A")
dt <- addCondition(defY0, dt, "Y0")
dt <- addCondition(defY1, dt, "Y1")
dt <- addColumns(defy, dt)
}
set.seed(16)
dt <- createDT(2500, def, defY0, defY1, defy)
options(digits = 3)
dt</code></pre>
<pre><code>## id Y1 Y0 z Status M0 M1 m AStatus y
## 1: 1 0.12143 -0.400007 0 2 0 0 0 Never-taker -0.4000
## 2: 2 0.13114 0.713202 1 1 1 1 1 Always-taker 0.1311
## 3: 3 0.73766 -0.212530 1 3 0 1 1 Complier 0.7377
## 4: 4 -0.07531 0.209330 1 1 1 1 1 Always-taker -0.0753
## 5: 5 -0.25214 -0.696207 0 2 0 0 0 Never-taker -0.6962
## ---
## 2496: 2496 -0.00882 0.206581 0 2 0 0 0 Never-taker 0.2066
## 2497: 2497 0.39226 0.749465 1 2 0 0 0 Never-taker 0.3923
## 2498: 2498 -0.81486 0.000605 1 2 0 0 0 Never-taker -0.8149
## 2499: 2499 0.10359 -0.417344 0 2 0 0 0 Never-taker -0.4173
## 2500: 2500 -0.68397 0.304398 1 2 0 0 0 Never-taker -0.6840</code></pre>
</div>
<div id="cace-estimation" class="section level3">
<h3>CACE estimation</h3>
<p>Finally, we are ready to put all of this together and estimate the CACE using the EM algorithm. After initializing the parameters (here we just use random values except for the probabilities of group membership, which we assume to be 1/3 to start), we loop through the E and M steps, checking the change in log likelihood each time. For this single data set, we provide a point estimate of the CACE using EM and IV. (We could provide an estimate of standard error using a bootstrap approach.) We see that both do a reasonable job, getting fairly close to the truth.</p>
<pre class="r"><code>params <- initparams()
prev.loglike <- -Inf
continue <- TRUE
while (continue) {
dtPIs <- dt[, estep(params, y, z, m), keyby = id]
dx <- dt[dtPIs]
params <- mstep(params, dx)
EM.CACE <- params$mC1 - params$mC0
current.loglike <- loglike(dt, params)
diff <- current.loglike - prev.loglike
prev.loglike <- current.loglike
if ( diff < 1.00e-07 ) continue = FALSE
}
library(ivpack)
ivmodel <- ivreg(formula = y ~ m | z, data = dt, x = TRUE)
data.table(truthC = dt[AStatus == "Complier", mean(Y1 - Y0)],
IV.CACE = coef(ivmodel)[2],
EM.CACE)</code></pre>
<pre><code>## truthC IV.CACE EM.CACE
## 1: 0.806 0.823 0.861</code></pre>
</div>
<div id="more-general-performance" class="section level3">
<h3>More general performance</h3>
<p>I am not providing the code here (it is just a slight modification of what has come before), but I want to show the results of generating 1000 data sets of 500 observations in each. The first plot assumes all data sets were generated using an exclusion restriction - just as we did with the single data set. The IV approach, as expected is unbiased (estimated bias 0.01), while the EM approach is slightly biased (-0.13). We can also see that the EM approach (standard deviation 0.30) has more variation than IV (standard deviation 0.15), while the actual sample CACE (calculated based on the actual group membership and potential outcomes) had a standard deviation of 0.05, which we can see from the narrow vertical band in the plot:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/Exclusion_restriction.png" />
</div>
<p>In the second set of simulations, I change the potential outcomes definition so that the exclusion restriction is no longer relevant.</p>
<pre class="r"><code>defY0 <- defCondition(condition = "Status == 1",
formula = 0.3, variance = .20, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 2",
formula = 0.0, variance = .36, dist = "normal")
defY0 <- defCondition(defY0, condition = "Status == 3",
formula = 0.1, variance = .16, dist = "normal")
defY1 <- defCondition(condition = "Status == 1",
formula = 0.7, variance = .25, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 2",
formula = 0.2, variance = .40, dist = "normal")
defY1 <- defCondition(defY1, condition = "Status == 3",
formula = 0.9, variance = .49, dist = "normal")</code></pre>
<p>In this second case, the IV estimate is biased (0.53), while the EM estimated does quite well (-.03). (I suspect EM did worse in the first example above, because estimates were made without the assumption of the exclusion restriction, even though that was the case.) However, EM estimates still have more variation than IV: standard deviation 0.26 vs 0.17, consistent with the estimates under the exclusion restriction assumption. This variation arises from the fact that we don’t know what the true group membership is, and we need to estimate it. Here is what the estimates look like:</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-em-cace/No_exclusion_restriction.png" />
</div>
</div>
<div id="can-we-expand-on-this" class="section level3">
<h3>Can we expand on this?</h3>
<p>The whole point of this was to illustrate that there might be a way around some rather restrictive assumptions, which in some cases might not seem so reasonable. EM methods provide an alternative way to approach things - more of which you can see in the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">free online course</a> that inspired these last few posts. Unfortunately, there is no obvious way to tackle these problems in <code>R</code> using existing packages, and I am not suggesting that what I have done here is the best way to go about it. The course suggests using <code>Mplus</code>. While that is certainly a great software package, maybe it would be worthwhile to build an R package to implement these methods more completely in R? Or maybe someone has already done this, and I just haven’t come across it yet?</p>
</div>
A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm
https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/
Wed, 20 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/<p>I was just going to make a quick announcement to let folks know that I’ve updated the <code>simstudy</code> package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package <code>flexmix</code>, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.</p>
<div id="defining-data-conditionally" class="section level3">
<h3>Defining data conditionally</h3>
<p>Of course, simstudy has always enabled conditional distributions based on sequentially defined variables. That is really the whole point of simstudy. But, what if I wanted to specify completely different families of distributions or very different regression curves based on different individual characteristics? With the previous version of simstudy, it was not really easy to do. Now, with the addition of two key functions, <code>defCondition</code> and <code>addCondition</code> the process is much improved. <code>defCondition</code> is analogous to the function <code>defData</code>, in that this new function provides an easy way to specify conditional definitions (as does <code>defReadCond</code>, which is analogous to <code>defRead</code>). <code>addCondition</code> is used to actually add the data column, just as <code>addColumns</code> adds columns.</p>
<p>It is probably easiest to see in action:</p>
<pre class="r"><code>library(simstudy)
# Define baseline data set
def <- defData(varname="x", dist="normal", formula=0, variance=9)
def <- defData(def, varname = "group", formula = "0.2;0.5;0.3",
dist = "categorical")
# Generate data
set.seed(111)
dt <- genData(1000, def)
# Convert group to factor - new function
dt <- genFactor(dt, "group", replace = TRUE)
dt</code></pre>
<p><code>defCondition</code> is the same as <code>defData</code>, except that instead of specifying a variable name, we need to specify a condition that is based on a pre-defined field:</p>
<pre class="r"><code>defC <- defCondition(condition = "fgroup == 1", formula = "5 + 2*x",
variance = 4, dist = "normal")
defC <- defCondition(defC, condition = "fgroup == 2", formula = 4,
variance = 3, dist="normal")
defC <- defCondition(defC, condition = "fgroup == 3", formula = "3 - 2*x",
variance = 2, dist="normal")
defC</code></pre>
<pre><code>## condition formula variance dist link
## 1: fgroup == 1 5 + 2*x 4 normal identity
## 2: fgroup == 2 4 3 normal identity
## 3: fgroup == 3 3 - 2*x 2 normal identity</code></pre>
<p>A subsequent call to <code>addCondition</code> generates a data table with the new variable, in this case <span class="math inline">\(y\)</span>:</p>
<pre class="r"><code>dt <- addCondition(defC, dt, "y")
dt</code></pre>
<pre><code>## id y x fgroup
## 1: 1 5.3036869 0.7056621 2
## 2: 2 2.1521853 -0.9922076 2
## 3: 3 4.7422359 -0.9348715 3
## 4: 4 16.1814232 -6.9070370 3
## 5: 5 4.3958893 -0.5126281 3
## ---
## 996: 996 -0.8115245 -2.7092396 1
## 997: 997 1.9946074 0.7126094 2
## 998: 998 11.8384871 2.3895135 1
## 999: 999 3.3569664 0.8123200 1
## 1000: 1000 3.4662074 -0.4653198 3</code></pre>
<p>In this example, I’ve partitioned the data into three subsets, each of which has a very different linear relationship between variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, and different variation. In this particular case, all relationships are linear with normally distributed noise, but this is absolutely not required.</p>
<p>Here is what the data look like:</p>
<pre class="r"><code>library(ggplot2)
mycolors <- c("#555bd4","#d4555b","#d4ce55")
ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
geom_point(aes(color = fgroup), size = 1, alpha = .4) +
geom_smooth(aes(color = fgroup), se = FALSE, method = "lm") +
scale_color_manual(name = "Cluster", values = mycolors) +
scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "grey96", color = "grey80"))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-4-1.png" width="576" /></p>
</div>
<div id="latent-class-regression-models" class="section level3">
<h3>Latent class regression models</h3>
<p>Suppose we come across the same data set, but are not privy to the group classification, and we are still interested in the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. This is what the data set would look like - not as user-friendly:</p>
<pre class="r"><code>rawp <- ggplot(data = dt, aes(x = x, y = y, group = fgroup)) +
geom_point(color = "grey75", size = .5) +
scale_x_continuous(limits = c(-10,10), breaks = c(-10, -5, 0, 5, 10)) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "grey96", color = "grey80"))
rawp</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-5-1.png" width="504" /></p>
<p>We might see from the plot, or we might have some subject-matter knowledge that suggests there are are several sub-clusters within the data, each of which appears to have a different relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. (Obviously, we know this is the case, since we generated the data.) The question is, how can we estimate the regression lines if we don’t know the class membership? That is where the EM algorithm comes into play.</p>
</div>
<div id="the-em-algorithm-very-very-briefly" class="section level3">
<h3>The EM algorithm, very, very briefly</h3>
<p>The EM algorithm handles model parameter estimation in the context of incomplete or missing data. In the example I’ve been discussing here, the subgroups or cluster membership are the missing data. There is an extensive literature on EM methods (starting with <a href="http://www.jstor.org/stable/2984875">this article</a> by Dempster, Laird & Rubin), and I am barely even touching the surface, let alone scratching it.</p>
<p>The missing data (cluster memberships) are estimated in the <em>Expectation-</em> or <em>E-step</em>. These are replaced with their expected values as given by the posterior probabilities. The mixture model assumes that each observation is exactly from one cluster, but this information has not been observed. The unknown model parameters (intercept, slope, and variance) for each of the clusters is estimated in the <em>Maximization-</em> or <em>M-step</em>, which in this case assumes the data come from a linear process with normally distributed noise - both the linear coefficients and variation around the line are conditional on cluster membership. The process is iterative. First, the <em>E-step</em>, which is based on some starting model parameters at first and then updated with the most recent parameter estimates from the prior <em>M-step</em>. Second, the <em>M-step</em> is based on estimates of the maximum likelihood of all the data (including the ‘missing’ data estimated in the prior <em>E-step</em>). We iterate back and forth until the parameter estimates in the <em>M-step</em> reach a steady state, or the overal likelihood estimate becomes stable.</p>
<p>The strength or usefulness of the EM method is that the likelihood of the full data (both observed data - <span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s - and unobserved data - cluster probabilities) is much easier to write down and estimate than the likelihood of the observed data only (<span class="math inline">\(x\)</span>’s and <span class="math inline">\(y\)</span>’s). Think of the first plot above with the structure given by the colors compared to the second plot in grey without the structure. The first seems so much more manageable than the second - if only we knew the underlying structure defined by the clusters. The EM algorithm builds the underlying structure so that the maximum likelihood estimation problem becomes much easier.</p>
<p>Here is a little more detail on what the EM algorithm is estimating in our application. (See <a href="https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf">this</a> for the much more detail.) First, we estimate the probability of membership in cluster <span class="math inline">\(j\)</span> for our linear regression model with three clusters:</p>
<p><span class="math display">\[P_i(j|x_i, y_i, \mathbf{\pi}, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma}) = p_{ij}= \frac{\pi_jf(y_i|x_i, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma})}{\sum_{k=1}^3 \pi_k f(y_i|x_i, \mathbf{\alpha_0}, \mathbf{\alpha_1}, \mathbf{\sigma})},\]</span> where <span class="math inline">\(\mathbf{\alpha_0}\)</span>, <span class="math inline">\(\mathbf{\alpha_1}\)</span>, and <span class="math inline">\(\mathbf{\sigma}\)</span> are the vectors of intercepts, slopes, and standard deviations for the three clusters. <span class="math inline">\(\pi\)</span> is the vector of probabilities that any individual is in the respective clusters, and each <span class="math inline">\(\pi_j\)</span> is estimated by averaging the <span class="math inline">\(p_{ij}\)</span>’s across all individuals. Finally, <span class="math inline">\(f(.|.)\)</span> is the density from the normal distribution <span class="math inline">\(N(\alpha_{j0} + \alpha_{j1}x, \sigma_j^2)\)</span>, with cluster-specific parameters.</p>
<p>Second, we maximize each of the three cluster-specific log-likelihoods, where each individual is weighted by its probability of cluster membership (which is <span class="math inline">\(P_i(j)\)</span>, estimated in the <em>E-step</em>). In particular, we are maximizing the cluster-specific likelihood with respect to the three unknown parameters <span class="math inline">\(\alpha_{j0}\)</span>, <span class="math inline">\(\alpha_{j1}\)</span>, and <span class="math inline">\(\sigma_j\)</span>:</p>
<p><span class="math display">\[\sum_{n=1}^N \hat{p}_{nk} \text{log} (f(y_n|x_n,\alpha_{j0},\alpha_{j1},\sigma_j)\]</span> In <code>R</code>, the <code>flexmix</code> package has implemented an EM algorithm to estimate latent class regression models. The package documentation provides a really nice, accessible <a href="https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf">description</a> of the two-step procedure, with much more detail than I have provided here. I encourage you to check it out.</p>
</div>
<div id="iterating-slowly-through-the-em-algorithm" class="section level3">
<h3>Iterating slowly through the EM algorithm</h3>
<p>Here is a slow-motion version of the EM estimation process. I show the parameter estimates (visually) at the early stages of estimation, checking in after every three steps. In addition, I highlight two individuals and show the estimated probabilities of cluster membership. At the beginning, there is little differentiation between the regression lines for each cluster. However, by the 10th iteration the parameter estimates for the regression lines are looking pretty similar to the original plot.</p>
<pre class="r"><code>library(flexmix)
selectIDs <- c(508, 775) # select two individuals
ps <- list()
count <- 0
p.ij <- data.table() # keep track of estimated probs
pi.j <- data.table() # keep track of average probs
for (i in seq(1,10, by=3)) {
count <- count + 1
set.seed(5)
# fit model up to "i" iterations - either 1, 4, 7, or 10
exMax <- flexmix(y ~ x,
data = dt, k = 3,
control = list(iter.max = i)
)
p.ij <- rbind(p.ij,
data.table(i, selectIDs, posterior(exMax)[selectIDs,]))
pi.j <- rbind(pi.j,
data.table(i, t(apply(posterior(exMax), 2, mean))))
dp <- as.data.table(t(parameters(exMax)))
setnames(dp, c("int","slope", "sigma"))
# flexmix rearranges columns/clusters
dp[, grp := c(3, 1, 2)]
setkey(dp, grp)
# create plot for each iteration
ps[[count]] <- rawp +
geom_abline(data = dp, aes(intercept = int, slope = slope,
color=factor(grp)), size = 1) +
geom_point(data = dt[id %in% selectIDs], color = "black") +
scale_color_manual(values = mycolors) +
ggtitle(paste("Iteration", i)) +
theme(legend.position = "none",
plot.title = element_text(size = 9))
}</code></pre>
<pre class="r"><code>library(gridExtra)
grid.arrange(ps[[1]], ps[[2]], ps[[3]], ps[[4]], nrow = 1)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-7-1.png" width="864" style="display: block; margin: auto;" /></p>
<p>For the two individuals, we can see the probabilities converging to a level of certainty/uncertainty. The individual with ID #775 lies right on the regression line for cluster 3, far from the other lines, and the algorithm quickly assigns a probability of 100% to cluster 3 (its actual cluster). The cluster assignment is less certain for ID #508, which lies between the two regression lines for clusters 1 and 2.</p>
<pre class="r"><code># actual cluster membership
dt[id %in% selectIDs, .(id, fgroup)]</code></pre>
<pre><code>## id fgroup
## 1: 508 2
## 2: 775 3</code></pre>
<pre class="r"><code>setkey(p.ij, selectIDs, i)
p.ij[, .(selectIDs, i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))]</code></pre>
<pre><code>## selectIDs i C1 C2 C3
## 1: 508 1 0.32 0.36 0.32
## 2: 508 4 0.29 0.44 0.27
## 3: 508 7 0.25 0.65 0.10
## 4: 508 10 0.24 0.76 0.00
## 5: 775 1 0.35 0.28 0.37
## 6: 775 4 0.33 0.14 0.53
## 7: 775 7 0.11 0.01 0.88
## 8: 775 10 0.00 0.00 1.00</code></pre>
<p>In addition, we can see how the estimate of overall group membership (for all individuals) changes through the iterations. The algorithm starts by assigning equal probability to each cluster (1/3) and slowly moves towards the actual distribution used to generate the data (20%, 50%, and 30%).</p>
<pre class="r"><code>pi.j[, .(i, C1 = round(V2, 2), C2 = round(V3,2), C3 = round(V1,2))]</code></pre>
<pre><code>## i C1 C2 C3
## 1: 1 0.33 0.34 0.33
## 2: 4 0.31 0.34 0.35
## 3: 7 0.25 0.39 0.36
## 4: 10 0.23 0.44 0.33</code></pre>
</div>
<div id="final-estimation-of-linear-models" class="section level3">
<h3>Final estimation of linear models</h3>
<p>The final estimation is shown below, and we can see that the parameters have largely converged to the values used to generate the data.</p>
<pre class="r"><code># Estimation until convergence
set.seed(5)
ex1 <- flexmix(y ~ x, data = dt, k = 3)
# paramter estimates
data.table(parameters(ex1))[, .(param = c("int", "slope", "sd"),
C1 = round(Comp.2, 2),
C2 = round(Comp.3, 2),
C3 = round(Comp.1, 2))]</code></pre>
<pre><code>## param C1 C2 C3
## 1: int 5.18 3.94 3.00
## 2: slope 1.97 -0.03 -1.99
## 3: sd 2.07 1.83 1.55</code></pre>
<pre class="r"><code># estimates of cluster probabilities
round(apply(posterior(ex1), 2, mean), 2)[c(2,3,1)]</code></pre>
<pre><code>## [1] 0.19 0.51 0.30</code></pre>
<pre class="r"><code># estimates of individual probabilities
data.table(posterior(exMax)[selectIDs,])[,.(selectIDs,
C1 = round(V2, 2),
C2 = round(V3, 2),
C3 = round(V1, 2))]</code></pre>
<pre><code>## selectIDs C1 C2 C3
## 1: 508 0.24 0.76 0
## 2: 775 0.00 0.00 1</code></pre>
</div>
<div id="how-do-we-know-the-relationship-is-linear" class="section level3">
<h3>How do we know the relationship is linear?</h3>
<p>In reality, there is no reason to assume that the relationship between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> is simply linear. We might want to look at other possibilities, such as a quadratic relationship. So, we use flexmix to estimate an expanded model, and then we plot the fitted lines on the original data:</p>
<pre class="r"><code>ex2 <- flexmix(y ~ x + I(x^2), data = dt, k = 3)
dp <- as.data.table(t(parameters(ex2)))
setnames(dp, c("int","slope", "slope2", "sigma"))
dp[, grp := c(1,2,3)]
x <- c(seq(-10,10, by =.1))
dp1 <- data.table(grp = 1, x, dp[1, int + slope*x + slope2*(x^2)])
dp2 <- data.table(grp = 2, x, dp[2, int + slope*x + slope2*(x^2)])
dp3 <- data.table(grp = 3, x, dp[3, int + slope*x + slope2*(x^2)])
dp <- rbind(dp1, dp2, dp3)
rawp +
geom_line(data=dp, aes(x=x, y=V3, group = grp, color = factor(grp)),
size = 1) +
scale_color_manual(values = mycolors) +
theme(legend.position = "none")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-20-simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class_files/figure-html/unnamed-chunk-11-1.png" width="576" /></p>
<p>And even though the parameter estimates appear to be reasonable, we would want to compare the simple linear model with the quadratic model, which we can use with something like the BIC. We see that the linear model is a better fit (lower BIC value) - not surprising since this is how we generated the data.</p>
<pre class="r"><code>summary(refit(ex2))</code></pre>
<pre><code>## $Comp.1
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.440736 0.309576 4.6539 3.257e-06 ***
## x -0.405118 0.048808 -8.3003 < 2.2e-16 ***
## I(x^2) -0.246075 0.012162 -20.2337 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.2
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.955542 0.289914 23.9918 < 2.2e-16 ***
## x 0.305995 0.049584 6.1712 6.777e-10 ***
## I(x^2) 0.263160 0.014150 18.5983 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $Comp.3
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.9061090 0.1489738 26.2201 < 2e-16 ***
## x -0.0681887 0.0277366 -2.4584 0.01395 *
## I(x^2) 0.0113305 0.0060884 1.8610 0.06274 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<pre class="r"><code># Comparison of the two models
BIC(ex1)</code></pre>
<pre><code>## [1] 5187.862</code></pre>
<pre class="r"><code>BIC(ex2)</code></pre>
<pre><code>## [1] 5316.034</code></pre>
</div>
Complier average causal effect? Exploring what we learn from an RCT with participants who don't do what they are told
https://www.rdatagen.net/post/cace-explored/
Tue, 12 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/cace-explored/<p>Inspired by a free online <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a> titled <em>Complier Average Causal Effects (CACE) Analysis</em> and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming. And I end by describing two simple methods to estimate the CACE, which we can compare to the truth (since this is a simulation); next time, I will describe a third way.</p>
<div id="non-compliance-in-randomized-trials" class="section level3">
<h3>Non-compliance in randomized trials</h3>
<p>Here’s the problem. In a randomized trial, investigators control the randomization process; they determine if an individual is assigned to the treatment group or control group (I am talking about randomized trials here, but many of these issues can apply in the context of observed or quasi-experimental settings, but require more data and assumptions). However, those investigators may not have as much control over the actual treatments that study participants receive. For example, an individual randomized to some type of behavioral intervention may opt not to take advantage of the intervention. Likewise, someone assigned to control may, under some circumstances, figure out a way to get services that are quite similar to the intervention. In all cases, the investigator is able to collect outcome data on all of these patients, regardless of whether or not they followed directions. (This is different from drop-out or loss-to-followup, where outcome data may be missing.)</p>
</div>
<div id="cace" class="section level3">
<h3>CACE</h3>
<p>Typically, studies analyze data based on treatment <em>assignment</em> rather than treatment <em>received</em>. This focus on assignment is called an intention-to-treat (ITT) analysis. In a policy environment, the ITT may make a lot of sense; we are answering this specific question: “What is the overall effect in the real world where the intervention is made available yet some people take advantage of it while others do not?” Alternatively, researchers may be interested in different question: “What is the causal effect of actually receiving the treatment?”</p>
<p>Now, to answer the second question, there are numerous subtle issues that you need to wrestle with (again, go take the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a>). But, long story short, we need to (1) identify the folks in the <em>intervention</em> group who actually do what they have been encouraged to do (receive the intervention) but only because they were encouraged, and not because they would have received the intervention anyways had they not been randomized, and compare their outcomes with (2) the folks in the control group who did not seek out the intervention on their own initiative but would have received the intervention had they been encouraged. These two groups are considered to be <em>compliers</em> - they would always do what they are told in the context of the study. And the effect of the intervention that is based on outcomes from this type of patient is called the <em>complier average causal effect</em> (CACE).</p>
<p>The biggest challenge in estimating the CACE is that we cannot actually identify if people are compliers or not. Some of those receiving the treatment in the intervention group are <em>compliers</em>, but the rest are <em>always-takers</em>. Some of those not receiving the treatment in the control arm are also <em>compliers</em>, but the others are <em>never-takers</em>. There are several methods available to overcome this challenge, two of which I will briefly mention here: method of moments and instrumental variables.</p>
</div>
<div id="using-potential-outcomes-to-define-cace" class="section level3">
<h3>Using potential outcomes to define CACE</h3>
<p>In an earlier <a href="https://www.rdatagen.net/post/be-careful/">post</a>, I briefly introduced the idea of potential outcomes. Since we are talking about causal relationships, they are useful here. If <span class="math inline">\(Z\)</span> is the randomization indicator, <span class="math inline">\(Z=1\)</span> for those randomized to the intervention, <span class="math inline">\(Z=0\)</span> for those in control. <span class="math inline">\(M\)</span> is the indicator of whether or not the individual received the intervention. Since <span class="math inline">\(M\)</span> is an outcome, we can imagine the potential outcomes <span class="math inline">\(M_{0i}\)</span> and <span class="math inline">\(M_{1i}\)</span>, or what the value of <span class="math inline">\(M_i\)</span> would be for an individual if <span class="math inline">\(Z_i=0\)</span> or <span class="math inline">\(Z_i=1\)</span>, respectively. And let us say <span class="math inline">\(Y\)</span> is the outcome, so we have potential outcomes that can be written as <span class="math inline">\(Y_{0,M_0}\)</span> and <span class="math inline">\(Y_{1,M_1}\)</span>. Think about that for a bit.</p>
<p>Using these potential outcomes, we can define the compliers and the CACE. Compliers are people for whom <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span>. (Never-takers look like this: <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 0\)</span>. Always-takers: <span class="math inline">\(M_0 = 1\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span>). Now, the average causal effect is the average difference between potential outcomes. In this case, the CACE is <span class="math inline">\(E[Y_{1,M_1} - Y_{0,M_0}|M_0 = 0 \ \& \ M_1 = 1]\)</span>. The patients for whom <span class="math inline">\(M_0 = 0\)</span> <em>and</em> <span class="math inline">\(M_1 = 1\)</span> are the compliers.</p>
</div>
<div id="simulating-data" class="section level3">
<h3>Simulating data</h3>
<p>The data simulation will be based on generating potential outcomes. Observed outcomes will be a function of randomization group and complier status.</p>
<pre class="r"><code>options(digits = 3)
library(data.table)
library(simstudy)
library(ggplot2)
# Status :
# 1 = A(lways taker)
# 2 = N(ever taker)
# 3 = C(omplier)
def <- defDataAdd(varname = "Status",
formula = "0.20; 0.40; 0.40", dist = "categorical")
# potential outcomes (PO) for intervention
def <- defDataAdd(def, varname = "M0",
formula = "(Status == 1) * 1", dist = "nonrandom")
def <- defDataAdd(def, varname = "M1",
formula = "(Status != 2) * 1", dist = "nonrandom")
# observed intervention status based on randomization and PO
def <- defDataAdd(def, varname = "m",
formula = "(z==0) * M0 + (z==1) * M1", dist = "nonrandom")
# potential outcome for Y (depends on potential outcome for M)
set.seed(888)
dt <- genData(2000)
dt <- trtAssign(dt, n=2, grpName = "z")
dt <- addColumns(def, dt)
# using data functions here, not simstudy - I need add
# this functionality to simstudy
dt[, AStatus := factor(Status,
labels = c("Always-taker","Never-taker", "Complier"))]
# potential outcomes depend on group status - A, N, or C
dt[Status == 1, Y0 := rnorm(.N, 1.0, sqrt(0.25))]
dt[Status == 2, Y0 := rnorm(.N, 0.0, sqrt(0.36))]
dt[Status == 3, Y0 := rnorm(.N, 0.1, sqrt(0.16))]
dt[Status == 1, Y1 := rnorm(.N, 1.0, sqrt(0.25))]
dt[Status == 2, Y1 := rnorm(.N, 0.0, sqrt(0.36))]
dt[Status == 3, Y1 := rnorm(.N, 0.9, sqrt(0.49))]
# observed outcome function of actual treatment
dt[, y := (m == 0) * Y0 + (m == 1) * Y1]
dt</code></pre>
<pre><code>## id z Status M0 M1 m AStatus Y0 Y1 y
## 1: 1 1 3 0 1 1 Complier 0.5088 0.650 0.6500
## 2: 2 1 3 0 1 1 Complier 0.1503 0.729 0.7292
## 3: 3 1 2 0 0 0 Never-taker 1.4277 0.454 1.4277
## 4: 4 0 3 0 1 0 Complier 0.6393 0.998 0.6393
## 5: 5 0 1 1 1 1 Always-taker 0.6506 1.927 1.9267
## ---
## 1996: 1996 0 3 0 1 0 Complier -0.9554 0.114 -0.9554
## 1997: 1997 0 3 0 1 0 Complier 0.0366 0.903 0.0366
## 1998: 1998 1 3 0 1 1 Complier 0.3606 1.098 1.0982
## 1999: 1999 1 3 0 1 1 Complier 0.6651 1.708 1.7082
## 2000: 2000 0 3 0 1 0 Complier 0.2207 0.531 0.2207</code></pre>
<p>The plot shows outcomes <span class="math inline">\(y\)</span> for the two randomization groups. The ITT estimate would be based on an average of all the points in group, regardless of color or shape. The difference between the average of the black circles in the two groups represents the CACE.</p>
<pre class="r"><code>ggplot(data=dt, aes(y=y, x = factor(z, labels = c("Assigned to control",
"Assigned to treatment")))) +
geom_jitter(aes(shape=factor(m, labels = c("No treatment", "Treatment")),
color=AStatus),
width = 0.35) +
scale_shape_manual(values = c(1,19)) +
scale_color_manual(values = c("#e1d07d", "#7d8ee1", "grey25")) +
scale_y_continuous(breaks = seq(-3, 3, 1), labels = seq(-3, 3, 1)) +
theme(legend.title = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-08-iv-em-two-important-ideas-explored_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>In the real world, we cannot see the colors, yet we need to estimate as if we do, or at least use a method to bypasses that need:</p>
<p><img src="https://www.rdatagen.net/post/2017-09-08-iv-em-two-important-ideas-explored_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
<div id="estimating-cace-using-observed-data" class="section level3">
<h3>Estimating CACE using observed data</h3>
<p>The challenge is to estimate the CACE using <em>observed</em> data only, since that is all we have (along with a couple of key assumptions). We start of by claiming that the average causal effect of treatment <strong>assignment</strong> (<span class="math inline">\(ACE\)</span>) is a weighted average of the three sub-populations of <em>compliers</em>, <em>never-takers</em>, and <em>always-takers</em>:</p>
<p><span class="math display">\[ ACE = \pi_C \times CACE + \pi_N \times NACE + \pi_A \times AACE, \]</span> where <span class="math inline">\(CACE\)</span> is the average causal effect of treatment assignment for the subset of those in the sample who are <em>compliers</em>, <span class="math inline">\(NACE\)</span> is the average causal effect of treatment assignment for the subset who are <em>never-takers</em>, and <span class="math inline">\(AACE\)</span> is the average causal effect for those who are <em>always-takers</em>. <span class="math inline">\(\pi_C\)</span>, <span class="math inline">\(\pi_N\)</span>, and <span class="math inline">\(\pi_A\)</span> represent the sample proportions of compliers, never-takers, and always-takers, respectively.</p>
<p>A key assumption often made to estimate <span class="math inline">\(CACE\)</span> is known as the <em>exclusion restriction</em>: treatment assignment has an effect on the outcome <em>only</em> if it changes the actual treatment taken. (A second key assumption is that there are no <em>deniers</em>, or folks who do the opposite of what they are told. This is called the monotonicity assumption.) This <em>exclusion restriction</em> implies that both <span class="math inline">\(NACE=0\)</span> and <span class="math inline">\(AACE=0\)</span>, since in both cases the treatment <em>received</em> is the same regardless of treatment assignment. In that case, we can re-write the equality as</p>
<p><span class="math display">\[ ACE = \pi_C \times CACE,\]</span></p>
<p>and finally with a little re-arranging,</p>
<p><span class="math display">\[ CACE = \frac{ACE}{\pi_C}. \]</span> So, in order estimate <span class="math inline">\(CACE\)</span>, we need to be able to estimate <span class="math inline">\(ACE\)</span> and <span class="math inline">\(\pi_C\)</span>. Fortunately, we are in a position to do this. Since this is a randomized trial, the average causal effect of treatment assignment is just the difference in observed outcomes for the two treatment assignment groups:</p>
<p><span class="math display">\[ ACE = E[Y | Z = 1] - E[Y | Z = 0] \]</span> This also happens to be the <em>intention-to-treat</em> ) (<span class="math inline">\(ITT\)</span>) estimate.</p>
<p><span class="math inline">\(\pi_C\)</span> is a little harder, but in this simplified scenario, not that hard. We just need to follow a little logic: for the control group, we can identify the <em>always-takers</em> (they’re the ones who actually receive the treatment), so we know <span class="math inline">\(\pi_A\)</span> for the the control group. This can be estimated as <span class="math inline">\(P(M=1|Z=0)\)</span>. And, since the study was randomized, the distribution of <em>always-takers</em> in the treatment group must be the same. So, we can use <span class="math inline">\(\pi_A\)</span> estimated from the control group as an estimate for the treatment group.</p>
<p>For the treatment group, we know that <span class="math inline">\(\pi_C + \pi_A = P(M = 1 | Z = 1)\)</span>. That is everyone who receives treatment in the treatment group is either a complier or always-taker. With this, we can say</p>
<p><span class="math display">\[\pi_C = P(M=1 | Z = 1) - \pi_A.\]</span></p>
<p>But, of course, we argued above that we can estimate <span class="math inline">\(\pi_A\)</span> as <span class="math inline">\(P(M=1|Z=0)\)</span>. So, finally, we have</p>
<p><span class="math display">\[\pi_C = P(M=1 | Z = 1) - P(M=1|Z=0).\]</span> This gives us a method of moments estimator for <span class="math inline">\(CACE\)</span> from observed data:</p>
<p><span class="math display">\[ CACE = \frac{ACE}{\pi_C} = \frac{E[Y | Z = 1] - E[Y | Z = 0]}{P(M=1 | Z = 1) - P(M=1|Z=0)}. \]</span></p>
</div>
<div id="the-simulated-estimate" class="section level2">
<h2>The simulated estimate</h2>
<pre class="r"><code>ACE <- dt[z==1, mean(y)] - dt[z==0, mean(y)] # Also ITT
ACE</code></pre>
<pre><code>## [1] 0.307</code></pre>
<pre class="r"><code>pi_C <- dt[z==1, mean(m)] - dt[z==0, mean(m)] # strength of instrument
pi_C</code></pre>
<pre><code>## [1] 0.372</code></pre>
<pre class="r"><code>truth <- dt[AStatus == "Complier", mean(Y1 - Y0)]
truth</code></pre>
<pre><code>## [1] 0.81</code></pre>
<pre class="r"><code>ACE/pi_C</code></pre>
<pre><code>## [1] 0.826</code></pre>
<p>A method quite commonly used to analyze non-compliance is the instrumental variable model estimated with two-staged least squares regression. The R package <code>ivpack</code> is one of several that facilitates this type of analysis. A discussio of this methodology far exceeds the scope of this post. In any case, we can see that in this simple example, the IV estimate is the same as the method of moments estimator (by looking at the coefficient estimate of <code>m</code>).</p>
<pre class="r"><code>library(ivpack)
ivmodel <- ivreg(formula = y ~ m | z, data = dt, x = TRUE)
summary(ivmodel)</code></pre>
<pre><code>##
## Call:
## ivreg(formula = y ~ m | z, data = dt, x = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.19539 -0.36249 0.00248 0.35859 2.27902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0932 0.0302 3.09 0.002 **
## m 0.8262 0.0684 12.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.569 on 1998 degrees of freedom
## Multiple R-Squared: 0.383, Adjusted R-squared: 0.383
## Wald test: 146 on 1 and 1998 DF, p-value: <2e-16</code></pre>
<p>So, again, if I have piqued your interest of this very rich and interesting topic, or if I have totally confused you, go check out the <a href="https://courseplus.jhu.edu/core/index.cfm/go/course.home/coid/8155/">course</a>. In my next post, I will describe a simple latent variable model using a maximum likelihood EM (expectation-maximization) algorithm that arrives at an estimate by predicting complier status.</p>
</div>
Further considerations of a hidden process underlying categorical responses
https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/
Tue, 05 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/<p>In my <a href="https://www.rdatagen.net/post/ordinal-regression/">previous post</a>, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have <em>multiple</em> outcomes, but with a slight twist - these groups of interest can be interpreted in an ordered way. This conceptual latent process can provide another perspective on the models that are typically applied to analyze these types of outcomes.</p>
<div id="categorical-outcomes-generally" class="section level3">
<h3>Categorical outcomes, generally</h3>
<p>Certainly, group membership is not necessarily intrinsically ordered. In a general categorical or multinomial outcome, a group does not necessarily have any quantitative relationship vis a vis the other groups. For example, if we were interested in primary type of meat consumption, individuals might be grouped into those favoring (1) chicken, (2) beef, (3) pork, or (4) no meat. We might be interested in estimating the different distributions across the four groups for males and females. However, since there is no natural ranking or ordering of these meat groups (though maybe I am just not creative enough), we are limited to comparing the odds of being in one group relative to another for two exposure groups A and B, such as</p>
<p><span class="math display">\[\small{\frac{P(Beef|Group = A)}{P(Chicken|Group = A)} \ vs. \frac{P(Beef|Group = B)}{P(Chicken|Group = B)}}\]</span>.</p>
</div>
<div id="ordinal-outcomes" class="section level3">
<h3>Ordinal outcomes</h3>
<p>Order becomes relevant when the categories take on meanings related strength of opinion or agreement (as in a Likert-type response) or frequency. In the motivating example I described in the initial post, the response of interest was the frequency meat consumption in a month, so the response categories could be (1) none, (2) 1-3 times per month, (3) once per week, (4) 2-6 times per week, (5) 1 or more times per day. Individuals in group 2 consume meat more frequently than group 1, individuals in group 3 consume meat more frequently than those both groups 1 & 2, and so on. There is a natural quantitative relationship between the groups.</p>
<p>Once we have thrown ordering into the mix, we can expand our possible interpretations of the data. In particular it is quite common to summarize the data by looking at <em>cumulative</em> probabilities, odds, or log-odds. Comparisons of different exposures or individual characteristics typically look at how these cumulative measures vary across the different exposures or characteristics. So, if we were interested in cumulative odds, we would compare <span class="math display">\[\small{\frac{P(Response = 1|Group = A)}{P(Response > 1|Group = A)} \ \ vs. \ \frac{P(Response = 1|Group = B)}{P(Response > 1|Group = B)}},\]</span></p>
<p><span class="math display">\[\small{\frac{P(Response \leq 2|Group = A)}{P(Response > 2|Group = A)} \ \ vs. \ \frac{P(Response \leq 2|Group = B)}{P(Response > 2|Group = B)}},\]</span></p>
<p>and continue until the last (in this case, fourth) comparison</p>
<p><span class="math display">\[\small{\frac{P(Response \leq 4|Group = A)}{P(Response = 5|Group = A)} \ \ vs. \ \frac{P(Response \leq 4|Group = B)}{P(Response = 5|Group = B)}}.\]</span></p>
</div>
<div id="multiple-responses-multiple-thresholds" class="section level3">
<h3>Multiple responses, multiple thresholds</h3>
<p>The latent process that was described for the binary outcome is extended to the multinomial outcome by the addition of more thresholds. These thresholds define the portions of the density that define the probability of each possible response. If there are <span class="math inline">\(k\)</span> possible responses (in the meat example, we have 5), then there will be <span class="math inline">\(k-1\)</span> thresholds. The area under the logistic density curve of each of the regions defined by those thresholds (there will be <span class="math inline">\(k\)</span> distinct regions) represents the probability of each possible response tied to that region. In the example here, we define five regions of a logistic density by setting the four thresholds. We can say that this underlying continuous distribution represents the probability distribution of categorical responses for a specific population, which we are calling <em>Group A</em>.</p>
<pre class="r"><code># preliminary libraries and plotting defaults
library(ggplot2)
library(data.table)
my_theme <- function() {
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
axis.ticks = element_line(colour = "black"),
panel.spacing = unit(0.25, "lines"),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"))
}
# create data points density curve
x <- seq(-6, 6, length = 1000)
pdf <- dlogis(x, location = 0, scale = 1)
dt <- data.table(x, pdf)
# set thresholds for Group A
thresholdA <- c(-2.1, -0.3, 1.4, 3.6)
pdf <- dlogis(thresholdA)
grpA <- data.table(threshold = thresholdA, pdf)
aBreaks <- c(-6, grpA$threshold, 6)
# plot density with cutpoints
dt[, grpA := cut(x, breaks = aBreaks, labels = F, include.lowest = TRUE)]
p1 <- ggplot(data = dt, aes(x = x, y = pdf)) +
geom_line() +
geom_area(aes(x = x, y = pdf, group = grpA, fill = factor(grpA))) +
geom_hline(yintercept = 0, color = "grey50") +
annotate("text", x = -5, y = .28, label = "Group A", size = 5) +
scale_fill_manual(values = c("#d0d7d1", "#bbc5bc", "#a6b3a7", "#91a192", "#7c8f7d"),
labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"),
name = "Frequency") +
scale_x_continuous(breaks = thresholdA) +
scale_y_continuous(limits = c(0, 0.3), name = "Density") +
my_theme() +
theme(legend.position = c(.85, .7),
legend.background = element_rect(fill = "grey90"),
legend.key = element_rect(color = "grey90"))
p1</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-04-a-hidden-process-part-2-of-2_files/figure-html/threshold-1.png" width="480" /></p>
<p>The area for each of the five regions can easily be calculated, where each area represents the probability of each response:</p>
<pre class="r"><code>pA= plogis(c(thresholdA, Inf)) - plogis(c(-Inf, thresholdA))
probs <- data.frame(pA)
rownames(probs) <- c("P(Resp = 1)", "P(Resp = 2)",
"P(Resp = 3)", "P(Resp = 4)", "P(Resp = 5)")
probs</code></pre>
<pre><code>## pA
## P(Resp = 1) 0.109
## P(Resp = 2) 0.316
## P(Resp = 3) 0.377
## P(Resp = 4) 0.171
## P(Resp = 5) 0.027</code></pre>
<p>As I’ve already mentioned, when we characterize a multinomial response, we typically do so in terms of cumulative probabilities. I’ve calculated several quantities below, and we can see that the logs of the cumulative odds for this particular group are indeed the threshold values that we used to define the sub-regions.</p>
<pre class="r"><code># cumulative probabilities defined by the threshold
probA <- data.frame(
cprob = plogis(thresholdA),
codds = plogis(thresholdA)/(1-plogis(thresholdA)),
lcodds = log(plogis(thresholdA)/(1-plogis(thresholdA)))
)
rownames(probA) <- c("P(Grp < 2)", "P(Grp < 3)", "P(Grp < 4)", "P(Grp < 5)")
probA</code></pre>
<pre><code>## cprob codds lcodds
## P(Grp < 2) 0.11 0.12 -2.1
## P(Grp < 3) 0.43 0.74 -0.3
## P(Grp < 4) 0.80 4.06 1.4
## P(Grp < 5) 0.97 36.60 3.6</code></pre>
<p>The last column of the table below matches the thresholds defined in vector <code>thresholdA</code>.</p>
<pre class="r"><code>thresholdA</code></pre>
<pre><code>## [1] -2.1 -0.3 1.4 3.6</code></pre>
</div>
<div id="comparing-response-distributions-of-different-populations" class="section level3">
<h3>Comparing response distributions of different populations</h3>
<p>In the cumulative logit model, the underlying assumption is that the odds ratio of one population relative to another is constant across all the possible responses. This means that all of the cumulative odds ratios are equal:</p>
<p><span class="math display">\[\small{\frac{codds(P(Resp = 1 | A))}{codds(P(Resp = 1 | B))} = \frac{codds(P(Resp \leq 2 | A))}{codds(P(Resp \leq 2 | B))} = \ ... \ = \frac{codds(P(Resp \leq 4 | A))}{codds(P(Resp \leq 4 | B))}}\]</span></p>
<p>In terms of the underlying process, this means that each of the thresholds shifts the same amount, as shown below, where we add 1.1 units to each threshold that was set Group A:</p>
<pre class="r"><code># Group B threshold is an additive shift to the right
thresholdB <- thresholdA + 1.1
pdf <- dlogis(thresholdB)
grpB <- data.table(threshold = thresholdB, pdf)
bBreaks <- c(-6, grpB$threshold, 6)</code></pre>
<p>Based on this shift, we can see that the probability distribution for Group B is quite different:</p>
<pre class="r"><code>pB = plogis(c(thresholdB, Inf)) - plogis(c(-Inf, thresholdB))
probs <- data.frame(pA, pB)
rownames(probs) <- c("P(Resp = 1)", "P(Resp = 2)",
"P(Resp = 3)", "P(Resp = 4)", "P(Resp = 5)")
probs</code></pre>
<pre><code>## pA pB
## P(Resp = 1) 0.109 0.269
## P(Resp = 2) 0.316 0.421
## P(Resp = 3) 0.377 0.234
## P(Resp = 4) 0.171 0.067
## P(Resp = 5) 0.027 0.009</code></pre>
<p>Plotting Group B along with Group A, we can see visually how that shift affects the sizes of the five regions (I’ve left the thresholds of Group A in the Group B plot so you can see clearly the shift).</p>
<pre class="r"><code># Plot density for group B
dt[, grpB := cut(x, breaks = bBreaks, labels = F, include.lowest = TRUE)]
p2 <- ggplot(data = dt, aes(x = x, y = pdf)) +
geom_line() +
geom_area(aes(x = x, y = pdf, group = grpB, fill = factor(grpB))) +
geom_hline(yintercept = 0, color = "grey5") +
geom_segment(data=grpA,
aes(x=threshold, xend = threshold, y=0, yend=pdf),
size = 0.3, lty = 2, color = "#857284") +
annotate("text", x = -5, y = .28, label = "Group B", size = 5) +
scale_fill_manual(values = c("#d0d7d1", "#bbc5bc", "#a6b3a7", "#91a192", "#7c8f7d"),
labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"),
name = "Frequency") +
scale_x_continuous(breaks = thresholdB) +
scale_y_continuous(limits = c(0.0, 0.3), name = "Density") +
my_theme() +
theme(legend.position = "none")
library(gridExtra)
grid.arrange(p1, p2, nrow = 2 )</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-09-04-a-hidden-process-part-2-of-2_files/figure-html/plotB-1.png" width="480" /></p>
<p>When we look at the cumulative odds ratio comparing the odds of Group B to Group A for each response category, we see a constant ratio. And, of course, a constant log odds ratio, which also reflects the size of the shift from Group A to Group B.</p>
<pre class="r"><code># cumulative probabilities defined by the threshold
probB <- data.frame(
cprob = plogis(thresholdB),
codds = plogis(thresholdB)/(1-plogis(thresholdB)),
lcodds = log(plogis(thresholdB)/(1-plogis(thresholdB)))
)
oddsratio <- data.frame(coddsA = probA$codds,
coddsB = probB$codds,
cOR = probB$codds / probA$codds,
logcOR = log(probB$codds / probA$codds)
)
rownames(oddsratio) <- c("P(Grp < 2)", "P(Grp < 3)", "P(Grp < 4)", "P(Grp < 5)")
oddsratio</code></pre>
<pre><code>## coddsA coddsB cOR logcOR
## P(Grp < 2) 0.12 0.37 3 1.1
## P(Grp < 3) 0.74 2.23 3 1.1
## P(Grp < 4) 4.06 12.18 3 1.1
## P(Grp < 5) 36.60 109.95 3 1.1</code></pre>
</div>
<div id="the-cumulative-proportional-odds-model" class="section level3">
<h3>The cumulative proportional odds model</h3>
<p>In the <code>R</code> package <code>ordinal</code>, the model is fit using function <code>clm</code>. The model that is being estimated has the form</p>
<p><span class="math display">\[log \left( \frac{P(Resp \leq i)}{P(Resp > i)} | Group \right) = \alpha_i - \beta*I(Group=B) \ \ , \ i \in \{1, 2, 3, 4\}\]</span></p>
<p>The model specifies that the cumulative log-odds for a particular category is a function of two parameters, <span class="math inline">\(\alpha_i\)</span> and <span class="math inline">\(\beta\)</span>. (Note that in this parameterization and the model fit, <span class="math inline">\(-\beta\)</span> is used.) <span class="math inline">\(\alpha_i\)</span> represents the cumulative log odds of being in category <span class="math inline">\(i\)</span> or lower for those in the reference exposure group, which in our example is Group A. <em><span class="math inline">\(\alpha_i\)</span> also represents the threshold of the latent continuous (logistic) data generating process.</em> <span class="math inline">\(\beta\)</span> is the cumulative log-odds ratio for the category <span class="math inline">\(i\)</span> comparing Group B to reference Group A. <em><span class="math inline">\(\beta\)</span> also represents the shift of the threshold on the latent continuous process for Group B relative to Group A</em>. The proportionality assumption implies that the shift of the threshold for each of the categories is identical. This is what I illustrated above.</p>
</div>
<div id="simulation-and-model-fit" class="section level3">
<h3>Simulation and model fit</h3>
<p>To show how this process might actually work, I am simulating data from the standardized logistic distribution and applying the thresholds defined above based on the group status. In practice, each individual could have her own set of thresholds, depending on her characteristics (gender, age, etc.). In this case, group membership is the only characteristic I am using, so all individuals in a particular group share the same set of thresholds. (We could even have random effects, where subgroups have random shifts that are subgroup specific. In the addendum, following the main part of the post, I provide code to generate data from a mixed effects model with group level random effects plus fixed effects for exposure, gender, and a continuous outcome.)</p>
<pre class="r"><code>set.seed(123)
n = 1000
x.A <- rlogis(n)
acuts <- c(-Inf, thresholdA, Inf)
catA <- cut(x.A, breaks = acuts, label = F)
dtA <- data.table(id = 1:n, grp = "A", cat = catA)</code></pre>
<p>Not surprisingly (since we are using a generous sample size of 1000), the simulated proportions are quite close to the hypothetical proportions established by the thresholds:</p>
<pre class="r"><code>cumsum(prop.table(table(catA)))</code></pre>
<pre><code>## 1 2 3 4 5
## 0.11 0.44 0.81 0.97 1.00</code></pre>
<pre class="r"><code>probA$cprob</code></pre>
<pre><code>## [1] 0.11 0.43 0.80 0.97</code></pre>
<p>Now we generate a sample from Group B and combine them into a single data set:</p>
<pre class="r"><code>x.B <- rlogis(n)
bcuts <- c(-Inf, thresholdA + 1.1, Inf)
catB <- cut(x.B, breaks = bcuts, label = F)
dtB <- data.table(id = (n+1):(2*n), grp = "B", cat=catB)
dt <- rbind(dtA, dtB)
dt[, cat := factor(cat, labels = c("None", "1-3/month", "1/week", "2-6/week", "1+/day"))]
dt</code></pre>
<pre><code>## id grp cat
## 1: 1 A 1-3/month
## 2: 2 A 1/week
## 3: 3 A 1-3/month
## 4: 4 A 2-6/week
## 5: 5 A 2-6/week
## ---
## 1996: 1996 B 1/week
## 1997: 1997 B 1-3/month
## 1998: 1998 B 1/week
## 1999: 1999 B 1-3/month
## 2000: 2000 B 1-3/month</code></pre>
<p>Finally, we estimate the parameters of the model using function <code>clm</code> and we see that we recover the original parameters quite well.</p>
<pre class="r"><code>library(ordinal)
clmFit <- clm(cat ~ grp, data = dt)
summary(clmFit)</code></pre>
<pre><code>## formula: cat ~ grp
## data: dt
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 2000 -2655.03 5320.05 6(0) 1.19e-11 2.3e+01
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## grpB -1.0745 0.0848 -12.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## None|1-3/month -2.0912 0.0784 -26.68
## 1-3/month|1/week -0.2465 0.0612 -4.02
## 1/week|2-6/week 1.4212 0.0728 19.51
## 2-6/week|1+/day 3.5150 0.1643 21.39</code></pre>
<p>In the model output, the <code>grpB</code> coefficient of -1.07 is the estimate of <span class="math inline">\(-\beta\)</span>, which was set to 1.1 in the simulation. The threshold coefficients are the estimates of the <span class="math inline">\(\alpha_i\)</span>’s in the model, and we can see the estimates are not too bad by looking at the initial thresholds:</p>
<pre class="r"><code>coeffit <- coef(clmFit)[1:4]
names(coeffit) <- c(1:4)
rbind( thresholdA, coeffit)</code></pre>
<pre><code>## 1 2 3 4
## thresholdA -2.1 -0.30 1.4 3.6
## coeffit -2.1 -0.25 1.4 3.5</code></pre>
<p>This was a relatively simple simulation. However it highlights how it would be possible to generate more complex scenarios of multinomial response data to more fully explore other types of models. These more flexible models might be able to handle situations where the possibly restrictive assumptions of this model (particularly the proportional odds assumption) do not hold.</p>
</div>
<div id="addendum-1" class="section level2">
<h2>Addendum 1</h2>
<p>Here is code to generate cluster-randomized data with an ordinal outcome that is a function of treatment assignment, gender, and a continuous status measure at the individual level. There is also a group level random effect. Once the data are generated, I fit a mixed cumulative logit model.</p>
<pre class="r"><code>library(simstudy)
# define data
defSchool <- defData(varname = "reS", formula = 0,
variance = 0.10, id = "idS")
defSchool <- defData(defSchool, varname = "n",
formula = 250, dist = "noZeroPoisson")
defInd <- defDataAdd(varname = "male", formula = 0.45, dist = "binary")
defInd <- defDataAdd(defInd, varname = "status",
formula = 0, variance = 1, dist = "normal")
defInd <- defDataAdd(defInd,
varname = "z",
formula = "0.8 * grp + 0.3 * male - 0.2 * status + reS",
dist = "nonrandom")
# generate data
dtS <- genData(100, defSchool)
dtS <- trtAssign(dtS, grpName = "grp")
dt <- genCluster(dtS, "idS", "n", "id")
dt <- addColumns(defInd, dt)
# set reference probabilities for 4-category outcome
probs <- c(0.35, 0.30, 0.25, 0.10)
cprop <- cumsum(probs)
# map cumulative probs to thresholds for reference group
gamma.c <- qlogis(cprop)
matlp <- matrix(rep(gamma.c, nrow(dt)),
ncol = length(cprop),
byrow = TRUE
)
head(matlp)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] -0.62 0.62 2.2 Inf
## [2,] -0.62 0.62 2.2 Inf
## [3,] -0.62 0.62 2.2 Inf
## [4,] -0.62 0.62 2.2 Inf
## [5,] -0.62 0.62 2.2 Inf
## [6,] -0.62 0.62 2.2 Inf</code></pre>
<pre class="r"><code># set individual thresholds based on covariates,
# which is an additive shift from the reference group
# based on z
matlpInd <- matlp - dt[, z]
head(matlpInd)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] -1.52 -0.28 1.3 Inf
## [2,] -1.58 -0.34 1.2 Inf
## [3,] -0.95 0.29 1.9 Inf
## [4,] -1.53 -0.29 1.3 Inf
## [5,] -1.49 -0.25 1.3 Inf
## [6,] -1.13 0.11 1.7 Inf</code></pre>
<pre class="r"><code># convert log odds to cumulative probabability
matcump <- 1 / (1 + exp(-matlpInd))
matcump <- cbind(0, matcump)
head(matcump)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0.18 0.43 0.78 1
## [2,] 0 0.17 0.42 0.78 1
## [3,] 0 0.28 0.57 0.87 1
## [4,] 0 0.18 0.43 0.78 1
## [5,] 0 0.18 0.44 0.79 1
## [6,] 0 0.24 0.53 0.84 1</code></pre>
<pre class="r"><code># convert cumulative probs to category probs:
# originally, I used a loop to do this, but
# thought it would be better to vectorize.
# see 2nd addendum for time comparison - not
# much difference
p <- t(t(matcump)[-1,] - t(matcump)[-5,])
# show some indvidual level probabilities
head(p)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>apply(head(p), 1, sum)</code></pre>
<pre><code>## [1] 1 1 1 1 1 1</code></pre>
<pre class="r"><code># generate indvidual level category outcomes based on p
cat <- simstudy:::matMultinom(p)
catF <- ordered(cat)
dt[, cat := catF]</code></pre>
<p>When we fit the mixed effects model, it is not surprising that we recover the parameters used to generate the data, which were based on the model. The fixed effects were specified as “0.8 * grp + 0.3 * male - 0.2 * status”, the variance of the random group effect was 0.10, and the latent thresholds based on the category probabilities were {-0.62, 0.62, 2.20}:</p>
<pre class="r"><code>fmm <- clmm(cat ~ grp + male + status + (1|idS), data=dt)
summary(fmm)</code></pre>
<pre><code>## Cumulative Link Mixed Model fitted with the Laplace approximation
##
## formula: cat ~ grp + male + status + (1 | idS)
## data: dt
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 24990 -33096.42 66206.85 705(2118) 2.37e-02 1.3e+02
##
## Random effects:
## Groups Name Variance Std.Dev.
## idS (Intercept) 0.109 0.331
## Number of groups: idS 100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## grp 0.8117 0.0702 11.6 <2e-16 ***
## male 0.3163 0.0232 13.7 <2e-16 ***
## status -0.1959 0.0116 -16.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 -0.6478 0.0511 -12.7
## 2|3 0.6135 0.0511 12.0
## 3|4 2.1789 0.0529 41.2</code></pre>
</div>
<div id="addendum-2---vector-vs-loop" class="section level2">
<h2>Addendum 2 - vector vs loop</h2>
<p>In case any one is obsessed with vectorization in <code>R</code>, here is a comparison of two different functions that convert cumulative probabilities into probabilities. One method uses a loop, the other uses matrix operations. In this case, it actually appears that my non-loop approach is slower - maybe there is a faster way? Maybe not, since the loop is actually quite short - determined by the number of possible responses in the categorical measure…</p>
<pre class="r"><code>library(microbenchmark)
loopdif <- function(mat) {
ncols <- ncol(mat)
p <- matrix(0, nrow = nrow(mat), ncol = ( ncols - 1 ))
for (i in 1 : ( ncol(mat) - 1 )) {
p[,i] <- mat[, i+1] - mat[, i]
}
return(p)
}
vecdif <- function(mat) {
ncols <- ncol(mat)
p <- t(t(mat)[-1,] - t(mat)[-ncols,])
return(p)
}
head(loopdif(matcump))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>head(vecdif(matcump))</code></pre>
<pre><code>## [,1] [,2] [,3] [,4]
## [1,] 0.18 0.25 0.36 0.22
## [2,] 0.17 0.24 0.36 0.22
## [3,] 0.28 0.29 0.29 0.13
## [4,] 0.18 0.25 0.36 0.22
## [5,] 0.18 0.25 0.35 0.21
## [6,] 0.24 0.28 0.32 0.16</code></pre>
<pre class="r"><code>microbenchmark(loopdif(matcump), vecdif(matcump),
times = 1000L, unit = "ms")</code></pre>
<pre><code>## Unit: milliseconds
## expr min lq mean median uq max neval
## loopdif(matcump) 0.96 1.4 1.9 1.7 1.9 112 1000
## vecdif(matcump) 0.92 1.7 3.1 2.3 2.7 115 1000</code></pre>
</div>
A hidden process behind binary or other categorical outcomes?
https://www.rdatagen.net/post/ordinal-regression/
Mon, 28 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/ordinal-regression/<p>I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. - six groups in total. The question was how best to evaluate effectiveness of the intervention?</p>
<p>Since the outcome was categorical <em>and</em> ordinal - that is category 1 implied less meat consumption that category 2, category 2 implied less consumption that category 3, and so on - a model that estimates the cumulative probability of ordinal outcomes seemed like a possible way to proceed. Cumulative logit models estimate a number of parameters that represent the cumulative log-odds of an outcome; the parameters are the log-odds of categories 2 through 6 versus category 1, categories 3 through 6 versus 1 & 2, etc. Maybe not the most intuitive way to interpret the data, but seems to plausibly fit the data generating process.</p>
<p>I was concerned about the proportionality assumption of the cumulative logit model, particularly when we started to consider adjusting for baseline characteristics (more on that in the next post). I looked more closely at the data generating assumptions of the cumulative logit model, which are quite frequently framed in the context of a continuous latent measure that follows a logistic distribution. I thought I’d describe that data generating process here to give an alternative view of discrete data models.</p>
<p>I know I have been describing a context that includes an outcome with multiple categories, but in this post I will focus on regular logistic regression with a binary outcome. This will hopefully allow me to establish the idea of a latent threshold. I think it will be useful to explain this simpler case first before moving on to the more involved case of an ordinal response variable, which I plan to tackle in the near future.</p>
<div id="a-latent-continuous-process-underlies-the-observed-binary-process" class="section level3">
<h3>A latent continuous process underlies the observed binary process</h3>
<p>For an event with a binary outcome (true or false, A or B, 0 or 1), the observed outcome may, at least in some cases, be conceived as the manifestation of an unseen, latent continuous outcome. In this conception, the observed (binary) outcome merely reflects whether or not the unseen continuous outcome has exceeded a specified threshold. Think of this threshold as a tipping point, above which the observable characteristic takes on one value (say false), below which it takes on a second value (say true).</p>
</div>
<div id="the-logistic-distribution" class="section level3">
<h3>The logistic distribution</h3>
<p>Logistic regression models are used to estimate relationships of individual characteristics with categorical outcomes. The name of this regression model arises from the logistic distribution, which is a symmetrical continuous distribution. In a latent (or hidden) variable framework, the underlying, unobserved continuous measure is drawn from this logistic distribution. More specifically, the standard logistic distribution is typically assumed, with a location parameter of 0, and a scale parameter of 1. (The mean of this distribution is 0 and variance is approximately 3.29.)</p>
<p>Here is a plot of a logistic pdf, shown in relation to a standard normal pdf (with mean 0 and variance 1):</p>
<pre class="r"><code>library(ggplot2)
library(data.table)
my_theme <- function() {
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
axis.ticks = element_line(colour = "black"),
panel.spacing = unit(0.25, "lines"),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"))
}
x <- seq(-6, 6, length = 1000)
yNorm <- dnorm(x, 0, 1)
yLogis <- dlogis(x, location = 0, scale = 1)
dt <- data.table(x, yNorm, yLogis)
dtm <- melt(dt, id.vars = "x", value.name = "Density")
ggplot(data = dtm) +
geom_line(aes(x = x, y = Density, color = variable)) +
geom_hline(yintercept = 0, color = "grey50") +
my_theme() +
scale_color_manual(values = c("red", "black"),
labels=c("Normal", "Logistic")) +
theme(legend.position = c(0.8, 0.6),
legend.title = element_blank())</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="the-threshold-defines-the-probability" class="section level3">
<h3>The threshold defines the probability</h3>
<p>Below, I have plotted the standardized logistic pdf with a threshold that defines a tipping point for a particular Group A. In this case the threshold is 1.5, so for everyone with a unseen value of <span class="math inline">\(X < 1.5\)</span>, the observed binary outcome <span class="math inline">\(Y\)</span> will be 1. For those where <span class="math inline">\(X \geq 1.5\)</span>, the observed binary outcome <span class="math inline">\(Y\)</span> will be 0:</p>
<pre class="r"><code>xGrpA <- 1.5
ggplot(data = dtm[variable == "yLogis"], aes(x = x, y = Density)) +
geom_line() +
geom_segment(x = xGrpA, y = 0, xend = xGrpA, yend = dlogis(xGrpA), lty = 2) +
geom_area(mapping = aes(ifelse(x < xGrpA, x, xGrpA)), fill = "white") +
geom_hline(yintercept = 0, color = "grey50") +
ylim(0, 0.3) +
my_theme() +
scale_x_continuous(breaks = c(-6, -3, 0, xGrpA, 3, 6))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-2-1.png" width="672" /></p>
<p>Since we have plot a probability density (pdf), the area under the entire curve is equal to 1. We are interested in the binary outcome <span class="math inline">\(Y\)</span> defined by the threshold, so we can say that the area below the curve to the left of threshold (filled in white) represents <span class="math inline">\(P(Y = 1|Group=A)\)</span>. The remaining area represents <span class="math inline">\(P(Y = 0|Group=A)\)</span>. The area to the left of the threshold can be calculated in <code>R</code> using the <code>plogis</code> function:</p>
<pre class="r"><code>(p_A <- plogis(xGrpA))</code></pre>
<pre><code>## [1] 0.8175745</code></pre>
<p>Here is the plot for a second group that has a threshold of 2.2:</p>
<p><img src="https://www.rdatagen.net/post/2017-08-23-ordinal-regression_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>The area under the curve to the left of the threshold is <span class="math inline">\(P(X < 2.2)\)</span>, which is also <span class="math inline">\(P(Y = 1 | Group=B)\)</span>:</p>
<pre class="r"><code>(p_B <- plogis(xGrpB))</code></pre>
<pre><code>## [1] 0.9002495</code></pre>
</div>
<div id="log-odds-and-probability" class="section level3">
<h3>Log-odds and probability</h3>
<p>In logistic regression, we are actually estimating the log-odds of an outcome, which can be written as</p>
<p><span class="math display">\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right]\]</span>.</p>
<p>In the case of Group A, log-odds of Y being equal to 1 is</p>
<pre class="r"><code>(logodds_A <- log(p_A / (1 - p_A) ))</code></pre>
<pre><code>## [1] 1.5</code></pre>
<p>And for Group B,</p>
<pre class="r"><code>(logodds_B <- log(p_B / (1 - p_B) ))</code></pre>
<pre><code>## [1] 2.2</code></pre>
<p>As you may have noticed, we’ve recovered the thresholds that we used to define the probabilities for the two groups. The threshold is actually the log-odds for a particular group.</p>
</div>
<div id="logistic-regression" class="section level3">
<h3>Logistic regression</h3>
<p>The logistic regression model that estimates the log-odds for each group can be written as</p>
<p><span class="math display">\[log \left[ \frac{P(Y=1)}{P(Y=0)} \right] = B_0 + B_1 * I(Grp = B) \quad ,\]</span></p>
<p>where <span class="math inline">\(B_0\)</span> represents the threshold for Group A and <span class="math inline">\(B_1\)</span> represents the shift in the threshold for Group B. In our example, the threshold for Group B is 0.7 units (2.2 - 1.5) to the right of the threshold for Group A. If we generate data for both groups, our estimates for <span class="math inline">\(B_0\)</span> and <span class="math inline">\(B_1\)</span> should be close to 1.5 and 0.7, respectively</p>
</div>
<div id="the-process-in-action" class="section level3">
<h3>The process in action</h3>
<p>To put this all together in a simulated data generating process, we can see the direct link with the logistic distribution, the binary outcomes, and an interpretation of estimates from a logistic model. The only stochastic part of this simulation is the generation of continuous outcomes from a logistic distribution. Everything else follows from the pre-defined group assignments and the group-specific thresholds:</p>
<pre class="r"><code>n = 5000
set.seed(999)
# Stochastic step
xlatent <- rlogis(n, location = 0, scale = 1)
# Deterministic part
grp <- rep(c("A","B"), each = n / 2)
dt <- data.table(id = 1:n, grp, xlatent, y = 0)
dt[grp == "A" & xlatent <= xGrpA, y := 1]
dt[grp == "B" & xlatent <= xGrpB, y := 1]
# Look at the data
dt</code></pre>
<pre><code>## id grp xlatent y
## 1: 1 A -0.4512173 1
## 2: 2 A 0.3353507 1
## 3: 3 A -2.2579527 1
## 4: 4 A 1.7553890 0
## 5: 5 A 1.3054260 1
## ---
## 4996: 4996 B -0.2574943 1
## 4997: 4997 B -0.9928283 1
## 4998: 4998 B -0.7297179 1
## 4999: 4999 B -1.6430344 1
## 5000: 5000 B 3.1379593 0</code></pre>
<p>The probability of a “successful” outcome (i.e <span class="math inline">\(P(Y = 1\)</span>)) for each group based on this data generating process is pretty much equal to the areas under the respective densities to the left of threshold used to define success:</p>
<pre class="r"><code>dt[, round(mean(y), 2), keyby = grp]</code></pre>
<pre><code>## grp V1
## 1: A 0.82
## 2: B 0.90</code></pre>
<p>Now let’s estimate a logistic regression model:</p>
<pre class="r"><code>library(broom)
glmfit <- glm(y ~ grp, data = dt, family = "binomial")
tidy(glmfit, quick = TRUE)</code></pre>
<pre><code>## term estimate
## 1 (Intercept) 1.5217770
## 2 grpB 0.6888526</code></pre>
<p>The estimates from the model recover the logistic distribution thresholds for each group. The Group A threshold is estimated to be 1.52 (the intercept) and the Group B threshold is estimated to be 2.21 (intercept + grpB parameter). These estimates can be interpreted as the log-odds of success for each group, but also <em>as the threshold for the underlying continuous data generating process that determines the binary outcome <span class="math inline">\(Y\)</span></em>. And we can interpret the parameter for <code>grpB</code> in the traditional way as the log-odds ratio comparing the log-odds of success for Group B with the log-odds of success for Group A, or <em>as the shift in the logistic threshold for Group A to the logistic threshold for Group B</em>.</p>
<p>In the next week or so, I will extend this to a discussion of an ordinal categorical outcome. I think the idea of shifting the thresholds underscores the proportionality assumption I alluded to earlier …</p>
</div>
Be careful not to control for a post-exposure covariate
https://www.rdatagen.net/post/be-careful/
Mon, 21 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/be-careful/<p>A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also include a measurement for poverty in adulthood - believing it was somehow confounding the relationship of trauma and substance use. A confounder is a common cause of an exposure/treatment and an outcome; it is hard to conceive of adult poverty as a cause of childhood events, even though it might be related to adult substance use (or maybe not). At best, controlling for adult poverty has no impact on the conclusions of the research; less good, though, is the possibility that it will lead to the conclusion that the effect of trauma is less than it actually is.</p>
<p>Using a highly contrived simulation of data and the abstract concept of <em>potential outcomes</em>, I am hoping to illuminate some of the issues raised by this type of analysis.</p>
<div id="potential-outcomes-and-causal-effects" class="section level2">
<h2>Potential outcomes and causal effects</h2>
<p>The field of causal inference is a rich one, and I won’t even scratch the surface here. My goal is to present the concepts of potential outcomes so that we can articulate at least one clear way to think about what a causal effect can be defined. Under this framework, we generate data where we can find out the “true” measure of causal effect. And then we can use simple regression models to see how well (or not) they recapture these “known” causal effects.</p>
<p>If an individual <span class="math inline">\(i\)</span> experiences a traumatic effect as a child, we say that the exposure <span class="math inline">\(X_i = 1\)</span>. Otherwise <span class="math inline">\(X_i = 0\)</span>, there was no traumatic event. (I am going to assume binary exposures just to keep things simple - exposed vs. not exposed.) In the potential outcomes world we say that every individual has possible outcomes <span class="math inline">\(Y_{1i}\)</span> (the outcome we would observe <em>if</em> the individual had experienced trauma) and <span class="math inline">\(Y_{0i}\)</span> (the outcome we would observe <em>if</em> the individual had not. Quite simply, we define the causal effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> as the difference in potential outcomes, <span class="math inline">\(CE_i = Y_{1i} - Y_{0i}\)</span>. If <span class="math inline">\(Y_{1i} = Y_{0i}\)</span> (i.e. the potential outcomes are the same), we would say that <span class="math inline">\(X\)</span> does not cause <span class="math inline">\(Y\)</span>, at least for individual <span class="math inline">\(i\)</span>.</p>
<p>In the real world, we only observe one potential outcome - the one associated with the actual exposure. The field of causal inference has lots to say about the assumptions and conditions that are required for us to use observed data to estimate average causal effects; many would say that unless we use a randomized controlled study, those assumptions will never be reasonable. But in the world of simulation, we can generate potential outcomes and observed outcomes, so we know the causal effect both at the individual level and the average population level. And we can see how well our models do.</p>
</div>
<div id="simple-confounding" class="section level2">
<h2>Simple confounding</h2>
<p>Here’s a relatively straightforward example. Let’s say we are interested in understanding if some measure <span class="math inline">\(X\)</span> causes an outcome <span class="math inline">\(Y\)</span>, where there is a common cause <span class="math inline">\(C\)</span> (the diagram is called a DAG - a directed acyclic graph - and is useful for many things, including laying out data generating process):</p>
<p><img src="https://www.rdatagen.net/img/post-careful/SimpleCausal.png" width="300px" /></p>
<pre class="r"><code>library(broom)
library(data.table)
library(simstudy)
def <- defData(varname = "C", formula = 0.4, dist = "binary")
def <- defData(def, "X", formula = "0.3 + 0.4 * C", dist = "binary")
def <- defData(def, "e", formula = 0, variance = 2, dist = "normal")
def <- defData(def, "Y0", formula = "2 * C + e", dist="nonrandom")
def <- defData(def, "Y1", formula = "0.5 + 2 * C + e", dist="nonrandom")
def <- defData(def, "Y_obs", formula = "Y0 + (Y1 - Y0) * X", dist = "nonrandom")
def</code></pre>
<pre><code>## varname formula variance dist link
## 1: C 0.4 0 binary identity
## 2: X 0.3 + 0.4 * C 0 binary identity
## 3: e 0 2 normal identity
## 4: Y0 2 * C + e 0 nonrandom identity
## 5: Y1 0.5 + 2 * C + e 0 nonrandom identity
## 6: Y_obs Y0 + (Y1 - Y0) * X 0 nonrandom identity</code></pre>
<p>In this example, <span class="math inline">\(X\)</span> does have an effect on <span class="math inline">\(Y\)</span>, but so does <span class="math inline">\(C\)</span>. If we ignore <span class="math inline">\(C\)</span> in assessing the size of the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, we will overestimate that effect, which is 0.5. We can generate data and see that this is the case:</p>
<pre class="r"><code>set.seed(5)
dt <- genData(1000, def)</code></pre>
<p>We see that the true causal effect is easily recovered if we have access to the potential outcomes <span class="math inline">\(Y_1\)</span> and <span class="math inline">\(Y_0\)</span>, but of course we don’t:</p>
<pre class="r"><code>dt[, mean(Y1 - Y0)] # True causal effect</code></pre>
<pre><code>## [1] 0.5</code></pre>
<p>If we compare the average <em>observed</em> outcomes for each exposure group ignoring the confounder, we overestimate the effect of the exposure:</p>
<pre class="r"><code>dt[X == 1, mean(Y_obs)] - dt[X == 0, mean(Y_obs)]</code></pre>
<pre><code>## [1] 1.285009</code></pre>
<p>We can estimate the same effect using simple linear regression:</p>
<pre class="r"><code>lm1 <- lm(Y_obs ~ X, data = dt)
tidy(lm1)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.5515963 0.07325865 7.529436 1.137854e-13
## 2 X 1.2850091 0.10674523 12.038094 2.916261e-31</code></pre>
<p>And finally, if we adjust for the confounder <span class="math inline">\(C\)</span>, we recover the true causal effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, or at least get very close to it:</p>
<pre class="r"><code>lm2 <- lm(Y_obs ~ X + C, data = dt)
tidy(lm2)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) 0.08491216 0.06502545 1.305830 1.919117e-01
## 2 X 0.48935880 0.09678110 5.056347 5.083959e-07
## 3 C 2.05729945 0.09825925 20.937464 5.767575e-81</code></pre>
</div>
<div id="adjusting-for-a-post-exposure-covariate" class="section level2">
<h2>Adjusting for a post-exposure covariate</h2>
<p>Now, we are ready to see what happens in a slightly more complicated setting that is defined by this DAG:</p>
<p><img src="https://www.rdatagen.net/img/post-careful/ComplexCausal.png" width="400px" /></p>
<p>In this example <span class="math inline">\(C\)</span> is measured in two time periods, and exposure in period 1 relates to exposure in period 2. (For example, if a child is poor, he is more likely to be poor as an adult.) We are primarily interested in whether or not <span class="math inline">\(X\)</span> (trauma) causes <span class="math inline">\(Y\)</span> (substance use). The difficulty is that <span class="math inline">\(X\)</span> and <span class="math inline">\(C_2\)</span> are related, as are <span class="math inline">\(C_2\)</span> and <span class="math inline">\(Y\)</span>.</p>
<p>I suggest that in order to fully understand the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>, we cannot control for <span class="math inline">\(C_2\)</span>, as tempting as it might be. The intuition is that part of the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> is due to the fact that <span class="math inline">\(X\)</span> has an effect on <span class="math inline">\(C_2\)</span>, at least for some individuals. <em>If we control for <span class="math inline">\(C_2\)</span>, we are actually removing a key component of the causal mechanism.</em> Below in is the data generating process - a few things to note: (1) <span class="math inline">\(C_2\)</span> has potential outcomes based on the exposure <span class="math inline">\(X\)</span>. (2) We have restricted the potential outcome <span class="math inline">\(C_{21}\)</span> to be set to 1 if <span class="math inline">\(C_{20}\)</span> is 1. For example, if someone would have been poor in adulthood <em>without</em> exposure to trauma, we assume that they also would have been poor in adulthood had they been exposed to trauma. (3) The potential outcome for <span class="math inline">\(Y\)</span> is dependent on the relevant potential outcome for <span class="math inline">\(C_2\)</span>. That is <span class="math inline">\(Y_0\)</span> depends on <span class="math inline">\(C_{20}\)</span> and <span class="math inline">\(Y_1\)</span> depends on <span class="math inline">\(C_{21}\)</span>.</p>
<pre><code>## varname formula variance dist link
## 1: C1 0.25 0 binary identity
## 2: X -2 + 0.8 * C1 0 binary logit
## 3: C2.0 -2.0 + 1 * C1 0 binary logit
## 4: C2.1x -1.5 + 1 * C1 0 binary logit
## 5: C2.1 pmax(C2.0, C2.1x) 0 nonrandom identity
## 6: e 0 4 normal identity
## 7: Y0 -3 + 5*C2.0 + e 0 nonrandom identity
## 8: Y1 0 + 5*C2.1 + e 0 nonrandom identity
## 9: C2_obs C2.0 + (C2.1 - C2.0) * X 0 nonrandom identity
## 10: Y_obs Y0 + (Y1 - Y0) * X 0 nonrandom identity</code></pre>
<pre class="r"><code>set.seed(25)
dt <- genData(5000, def2)</code></pre>
<p>Here is the true average causal effect, based on information we will never know:</p>
<pre class="r"><code>dt[, mean(Y1 - Y0)]</code></pre>
<pre><code>## [1] 3.903</code></pre>
<p>When we control for <span class="math inline">\(C_2\)</span>, we are essentially estimating the effect of <span class="math inline">\(X\)</span> at each level <span class="math inline">\(C_2\)</span> (and <span class="math inline">\(C_1\)</span>, since we are controlling for that as well), and then averaging across the sub-samples to arrive at an estimate for the entire sample. We can see that, based on the specification of the potential outcomes in the data generation process, the effect at each level of <span class="math inline">\(C_2\)</span> will be centered around 3.0, which is different from the true causal effect of 3.9. The discrepancy is due to the fact each approach is effectively collecting different sub-samples (one defines groups based on set levels of <span class="math inline">\(X\)</span> and <span class="math inline">\(C_2\)</span>, and the other defines groups based on set levels of <span class="math inline">\(X\)</span> alone) and estimating average effects based on weights determined by the sizes of those two sets of sub-samples.</p>
<p>Here is the inappropriate model that adjusts for <span class="math inline">\(C_2\)</span>:</p>
<pre class="r"><code>lm2a <- lm( Y_obs ~ C1 + C2_obs + X , data = dt)
tidy(lm2a)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -3.01360235 0.03481082 -86.5708464 0.000000e+00
## 2 C1 -0.02078171 0.06765129 -0.3071887 7.587126e-01
## 3 C2_obs 4.92972384 0.07625838 64.6450138 0.000000e+00
## 4 X 3.04600204 0.08114223 37.5390478 6.677416e-272</code></pre>
<p>The estimate for the coefficient of <span class="math inline">\(X\)</span> is 3.0, just as anticipated. Here now is the correct model, and you will see that we recover the true causal effect in the coefficient estimate of <span class="math inline">\(X\)</span> (or at least, we get much, much closer):</p>
<pre class="r"><code>lm2b <- lm( Y_obs ~ C1 + X , data = dt)
tidy(lm2b)</code></pre>
<pre><code>## term estimate std.error statistic p.value
## 1 (Intercept) -2.4886726 0.04586832 -54.25689 0.000000e+00
## 2 C1 0.9665413 0.08930265 10.82321 5.315059e-27
## 3 X 3.9377832 0.10834926 36.34343 7.868650e-257</code></pre>
<p>Of course, in the real world, we don’t know the underlying data generating process or the true DAG. And what I have described here is a gross oversimplification of the underlying relationships, and have indeed left out many other factors that likely affect the relationship between childhood trauma and adult substance use. Other measures, such as parental substance use, may be related to both childhood trauma and adult substance use, and may affect poverty in the two time periods in different, complicated ways.</p>
<p>But the point is that one should give careful thought to what gets included in a model. We may not want to throw everything we measure into the model. Be careful.</p>
</div>
Should we be concerned about incidence - prevalence bias?
https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/
Wed, 09 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/<p>Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results - patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see <a href="https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/">this</a> for a discussion that touches this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might confuse an artificial effect as an intervention effect.</p>
<p>What could explain this phenomenon? The title of this post provides a hint: cases earlier in a study are more likely to be prevalent ones (i.e. they have been sick for a while), whereas later in the study cases tend to be incident (i.e. they only recently become sick). Even though both prevalent and incident cases are sick, the former may be sicker on average than the latter, simply because their condition has had more time develop.</p>
<p>We didn’t have any data to test out this hypothesis (if our grant proposal is funded, we will be able to do that), so I decided to see if I could simulate this phenomenon. In my continuing series exploring simulation using <code>Rcpp</code>, <code>simstudy</code>, and <code>data.table</code>, I am presenting some code that I used to do this.</p>
<div id="generating-a-population-of-patients" class="section level2">
<h2>Generating a population of patients</h2>
<p>The first task is to generate a population of individuals, each of whom starts out healthy and potentially becomes sicker over time. Time starts in month 1 and ends at some fixed point - in the first example, I end at 400 months. Each individual has a starting health status and a start month. In the examples that follow, health status is 1 through 4, with 1 being healthy, 3 is quite sick, and 4 is death. And, you can think of the start month as the point where the individual ages into the study. (For example, if the study includes only people 65 and over, the start month is the month the individual turns 65.) If an individual starts in month 300, she will have no measurements in periods 1 through 299 (i.e. health status will be 0).</p>
<p>The first part of the simulation generates a start month and starting health status for each individual, and then generates a health status for each individual until the end of time. Some individuals may die, while others may go all the way to the end of the simulation in a healthy state.</p>
<div id="rcpp-function-to-generate-health-status-for-each-period" class="section level4">
<h4>Rcpp function to generate health status for each period</h4>
<p>While it is generally preferable to avoid loops in R, sometimes it cannot be <a href="https://www.rdatagen.net/post/first-blog-entry/">avoided</a>. I believe generating a health status that depends on the previous health status (a Markov process) is one of those situations. So, I have written an Rcpp function to do this - it is orders of magnitude faster than doing this in R:</p>
<pre class="cpp"><code>#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector MCsim( unsigned int nMonths, NumericMatrix P,
int startStatus, unsigned int startMonth ) {
IntegerVector sim( nMonths );
IntegerVector healthStats( P.ncol() );
NumericVector currentP;
IntegerVector newstate;
unsigned int q = P.ncol();
healthStats = Rcpp::seq(1, q);
sim[startMonth - 1] = startStatus;
/* Loop through each month for each individual */
for (unsigned int i = startMonth; i < nMonths; i++) {
/* new state based on health status of last period and
probability of transitioning to different state */
newstate = RcppArmadillo::sample( healthStats,
1,
TRUE,
P.row(sim(i-1) - 1) );
sim(i) = newstate(0);
}
return sim;
}</code></pre>
<p><br></p>
</div>
<div id="generating-the-data" class="section level4">
<h4>Generating the data</h4>
<p>The data generation process is shown below. The general outline of the process is (1) define transition probabilities, (2) define starting health status distribution, (3) generate starting health statuses and start months, and (4) generate health statuses for each follow-up month.</p>
<pre class="r"><code># Transition matrix for moving through health statuses
P <- matrix(c(0.985, 0.015, 0.000, 0.000,
0.000, 0.950, 0.050, 0.000,
0.000, 0.000, 0.850, 0.150,
0.000, 0.000, 0.000, 1.000), nrow = 4, byrow = TRUE)
maxFU = 400
nPerMonth = 350
N = maxFU * nPerMonth
ddef <- defData(varname = "sHealth",
formula = "0.80; 0.15; 0.05",
dist = "categorical")
# generate starting health values (1, 2, or 3) for all individuals
set.seed(123)
did <- genData(n = N, dtDefs = ddef)
# each month, 350 age in to the sample
did[, sMonth := rep(1:maxFU, each = nPerMonth)]
# show table for 10 randomly selected individuals
did[id %in% sample(x = did$id, size = 10, replace = FALSE)] </code></pre>
<pre><code>## id sHealth sMonth
## 1: 15343 2 44
## 2: 19422 2 56
## 3: 41426 1 119
## 4: 50050 1 143
## 5: 63042 1 181
## 6: 83584 1 239
## 7: 93295 1 267
## 8: 110034 1 315
## 9: 112164 3 321
## 10: 123223 1 353</code></pre>
<pre class="r"><code># generate the health status history based on the transition matrix
dhealth <- did[, .(sHealth, sMonth, health = MCsim(maxFU, P, sHealth, sMonth)),
keyby = id]
dhealth[, month := c(1:.N), by = id]
dhealth</code></pre>
<pre><code>## id sHealth sMonth health month
## 1: 1 1 1 1 1
## 2: 1 1 1 1 2
## 3: 1 1 1 1 3
## 4: 1 1 1 1 4
## 5: 1 1 1 1 5
## ---
## 55999996: 140000 1 400 0 396
## 55999997: 140000 1 400 0 397
## 55999998: 140000 1 400 0 398
## 55999999: 140000 1 400 0 399
## 56000000: 140000 1 400 1 400</code></pre>
<p><br></p>
</div>
<div id="simulation-needs-burn-in-period" class="section level4">
<h4>Simulation needs burn-in period</h4>
<p>The simulation process itself is biased in its early phases as there are too many individuals in the sample who have just aged in compared to those who are “older”. (This is sort of the the reverse of the incidence - prevalence bias.) Since individuals tend to have better health status when they are “younger”, the average health status of the simulation in its early phases is biased downwards by the preponderance of young individuals in the population. This suggests that any evaluation of simulated data needs to account for a “burn-in” period that ensures there is a mix of “younger” and “older” individuals. To show this, I have calculated an average health score for each period of the simulation and plotted the results. You can see that the sample stabilizes after about 200 months in this simulation.</p>
<pre class="r"><code># count number of individuals with a particular heath statust each month
cmonth <- dhealth[month > 0, .N, keyby = .(month, health)]
cmonth</code></pre>
<pre><code>## month health N
## 1: 1 0 139650
## 2: 1 1 286
## 3: 1 2 47
## 4: 1 3 17
## 5: 2 0 139300
## ---
## 1994: 399 4 112203
## 1995: 400 1 18610
## 1996: 400 2 6515
## 1997: 400 3 2309
## 1998: 400 4 112566</code></pre>
<pre class="r"><code># transform data from "long" form to "wide" form and calculate average
mtotal <- dcast(data = cmonth,
formula = month ~ health,
fill = 0,
value.var = "N")
mtotal[, total := `1` + `2` + `3`]
mtotal[, wavg := (`1` + 2*`2` + 3*`3`)/total]
mtotal</code></pre>
<pre><code>## month 0 1 2 3 4 total wavg
## 1: 1 139650 286 47 17 0 350 1.231429
## 2: 2 139300 558 106 32 4 696 1.244253
## 3: 3 138950 829 168 45 8 1042 1.247601
## 4: 4 138600 1104 215 66 15 1385 1.250542
## 5: 5 138250 1362 278 87 23 1727 1.261726
## ---
## 396: 396 1400 18616 6499 2351 111134 27466 1.407813
## 397: 397 1050 18613 6537 2321 111479 27471 1.406938
## 398: 398 700 18587 6561 2323 111829 27471 1.407957
## 399: 399 350 18602 6541 2304 112203 27447 1.406201
## 400: 400 0 18610 6515 2309 112566 27434 1.405810</code></pre>
<pre class="r"><code>ggplot(data = mtotal, aes(x=month, y=wavg)) +
geom_line() +
ylim(1.2, 1.5) +
geom_hline(yintercept = 1.411, lty = 3) +
geom_vline(xintercept = 200, lty = 3) +
xlab("Month") +
ylab("Average health status") +
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
plot.title = element_text(size = 12, vjust = 0.5, hjust = 0) ) +
ggtitle("Average health status of simulated population")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-08-09-simulating-incidence-prevalence-bias_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
</div>
</div>
<div id="generating-monthly-study-cohorts" class="section level2">
<h2>Generating monthly study cohorts</h2>
<p>Now we are ready to see if we can simulate the incidence - prevalence bias. The idea here is to find the first month during which an individual (1) is “active” (i.e. the period being considered is on or after the individual’s start period), (2) has an emergency department visit, and (3) whose health status has reached a specified threshold.</p>
<p>We can set a final parameter that looks back some number of months (say 6 or 12) to see if there have been any previous qualifying emergency room visits before the study start period (which in our case will be month 290 to mitigate an burn-in bias identified above). This “look-back” will be used to mitigate some of the bias by creating a washout period that makes the prevalent cases look more like incident cases. This look-back parameter is calculated each month for each individual using an Rcpp function that loops through each period:</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector cAddPrior(IntegerVector idx,
IntegerVector event,
int lookback) {
int nRow = idx.length();
IntegerVector sumPrior(nRow, NA_INTEGER);
for (unsigned int i = lookback; i < nRow; i++) {
IntegerVector seqx = Rcpp::seq(i-lookback, i-1);
IntegerVector x = event[seqx];
sumPrior[i] = sum(x);
}
return(sumPrior);
}</code></pre>
<p><br></p>
<div id="generating-a-single-cohort" class="section level4">
<h4>Generating a single cohort</h4>
<p>The following code (1) generates a population (as we did above), (2) generates emergency department visits that are dependent on the health status (the sicker an individual is, the more likely they are to go to the ED), (3) calculates the number of eligible ED visits during the look-back period, and (4) creates the monthly cohorts based on the selection criteria. At the end, we calculate average health status for the cohort by month of cohort - this will be used to illustrate the bias.</p>
<pre class="r"><code>maxFU = 325
nPerMonth = 100
N = maxFU * nPerMonth
START = 289 # to allow for adequate burn-in
HEALTH = 2
LOOKBACK = 6 # how far to lookback
set.seed(123)
did <- genData(n = N, dtDefs = ddef)
did[, sMonth := rep(1:maxFU, each = nPerMonth)]
healthStats <- did[, .(sHealth,
sMonth,
health = MCsim(maxFU, P, sHealth, sMonth)),
keyby = id]
healthStats[, month := c(1:.N), by = id]
# eliminate period without status measurement (0) & death (4)
healthStats <- healthStats[!(health %in% c(0,4))]
# ensure burn-in by starting with observations far
# into simulation
healthStats <- healthStats[month > (START - LOOKBACK)]
# set probability of emergency department visit
healthStats[, pED := (health == 1) * 0.02 +
(health == 2) * 0.10 +
(health == 3) * 0.20]
# generate emergency department visit
healthStats[, ed := rbinom(.N, 1, pED)]
healthStats[, edAdj := ed * as.integer(health >= HEALTH)] # if you want to restrict
healthStats[, pSum := cAddPrior(month, edAdj, lookback = LOOKBACK), keyby=id]
# look at one individual
healthStats[id == 28069]</code></pre>
<pre><code>## id sHealth sMonth health month pED ed edAdj pSum
## 1: 28069 1 281 1 284 0.02 0 0 NA
## 2: 28069 1 281 1 285 0.02 0 0 NA
## 3: 28069 1 281 1 286 0.02 0 0 NA
## 4: 28069 1 281 1 287 0.02 0 0 NA
## 5: 28069 1 281 2 288 0.10 0 0 NA
## 6: 28069 1 281 2 289 0.10 0 0 NA
## 7: 28069 1 281 2 290 0.10 1 1 0
## 8: 28069 1 281 2 291 0.10 0 0 1
## 9: 28069 1 281 2 292 0.10 0 0 1
## 10: 28069 1 281 2 293 0.10 0 0 1
## 11: 28069 1 281 2 294 0.10 0 0 1
## 12: 28069 1 281 3 295 0.20 1 1 1</code></pre>
<pre class="r"><code># cohort includes individuals with 1 prior ed visit in
# previous 6 months
cohort <- healthStats[edAdj == 1 & pSum == 0]
cohort <- cohort[, .(month = min(month)), keyby = id]
cohort</code></pre>
<pre><code>## id month
## 1: 53 306
## 2: 82 313
## 3: 140 324
## 4: 585 291
## 5: 790 299
## ---
## 3933: 31718 324
## 3934: 31744 325
## 3935: 31810 325
## 3936: 31860 325
## 3937: 31887 325</code></pre>
<pre class="r"><code># estimate average health status of monthly cohorts
cohortStats <- healthStats[cohort, on = c("id","month")]
sumStats <- cohortStats[ , .(avghealth = mean(health), n = .N), keyby = month]
head(sumStats)</code></pre>
<pre><code>## month avghealth n
## 1: 290 2.248175 137
## 2: 291 2.311765 170
## 3: 292 2.367347 147
## 4: 293 2.291925 161
## 5: 294 2.366906 139
## 6: 295 2.283871 155</code></pre>
</div>
</div>
<div id="exploring-bias" class="section level2">
<h2>Exploring bias</h2>
<p>Finally, we are at the point where we can see what, if any, bias results in selecting our cohorts under the scenario I’ve outlined above. We start by generating multiple iterations of populations and cohorts and estimating average health status by month under the assumption that we will have a look-back period of 0. That is, we will accept an individual into the first possible cohort regardless of her previous emergency department visit history. The plot below shows average across 1000 iterations. What we see is that the average health status of the cohorts in the first 20 months or so exceed the long run average. The incidence - prevalence bias is extremely strong if we ignore prior ED history!</p>
<p><img src="https://www.rdatagen.net/img/post-incidence/plot00.png" /> <br></p>
<div id="taking-history-into-account" class="section level4">
<h4>Taking history into account</h4>
<p>Once we start to incorporate ED history by using look-back periods greater than 0, we see that we can reduce bias considerably. The two plots below show the results of using look-back periods of 6 and 12 months. Both have reduced bias, but only at 12 months are we approaching something that actually looks desirable. In fact, under this scenario, we’d probably like to go back 24 months to eliminate the bias entirely. Of course, these particular results are dependent on the simulation assumptions, so determining an appropriate look-back period will certainly depend on the actual data. (When we do finally get the actual data, I will follow-up to let you know what kind of adjustment we needed to make in the real, non-simulated world.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-incidence/plot06.png" />
</div>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-incidence/plot12.png" />
</div>
</div>
</div>
Using simulation for power analysis: an example based on a stepped wedge study design
https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/
Mon, 10 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/<p>Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which <em>can</em> be a bit painful. My goal here is to show that at least writing the code need not be overwhelming.</p>
<p>Recently, I was helping an investigator plan a stepped wedge cluster randomized trial to study the effects of modifying a physician support system on patient-level diabetes management. While analytic approaches for power calculations do exist in the context of this complex study design, it seemed worth the effort to be explicit about all of the assumptions. So in this case I opted to use simulation. The basic approach is outlined below.</p>
<div id="the-stepped-wedge-design" class="section level2">
<h2>The stepped wedge design</h2>
<p>In cluster randomized trials, the unit of randomization is the group rather than the individual. While outcomes might be collected at the individual (e.g. student or patient) level, the intervention effect is assessed at the group (e.g. school or clinic). In a stepped wedge cluster design, the randomization unit is still the group, but all groups are eventually exposed to the intervention at some point in the study. Randomization determines <em>when</em> the intervention starts.</p>
<p>Below is schematic view of how a stepped wedge study is implemented. In this example, a block of clusters receives the intervention starting in the second period, another block starts the intervention in the third period, and so on. The intervention effect is essentially assessed by making within group comparisons. By staggering the starting points, the study is able to distinguish between time effects and treatment effects. If all groups started intervention at the same point, we would need to make an assumption that any improvements were due only to the intervention rather than changes that were occurring over time. This is not an assumption any one can easily justify.</p>
<p><img src="https://www.rdatagen.net/post/2017-07-10-using-simulation-for-power-analysis-an-example-based-on-a-stepped-wedge-study-design_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
</div>
<div id="power-and-simulation" class="section level2">
<h2>Power and simulation</h2>
<p>The statistical power of a study is the conditional probability (conditional on a given effect size), that a hypothesis test will correctly reject the null hypothesis (i.e. conclude there is an effect when there actually is one). Power is underscored by the notion that a particular study can be replicated exactly over and over again. So, if the power of a study is 80%, that means in 80% of the replications of that study we will (appropriately) reject the null hypothesis.</p>
<p>So, to estimate power, we can simulate replications of the study many times and conduct repeated hypothesis tests. The proportion of tests where we reject the null hypothesis is the estimated power. Each of these replications is based on the same set of data generating assumptions: effect sizes, sample sizes, individual level variation, group level variation, etc.</p>
</div>
<div id="simulating-from-a-stepped-wedge-design" class="section level2">
<h2>Simulating from a stepped wedge design</h2>
<p>In this example, we are assuming a 3-year study with four groups of clusters randomized to start an intervention at either 12 months, 18 months, 24 months, or 30 months (i.e. every 6 months following the 1st baseline year). The study would enroll patients at baseline in each of the clusters, and a measurement of a binary outcome (say diabetes under control, or not) would be collected at that time. Those patients would be followed over time and the same measurement would be collected every 6 months, concluding with the 7th measurement in the 36th month of the study. (It is totally possible to enroll new patients as the study progresses and have a different follow-up scheme, but this approximates the actual study I was working on.)</p>
<p>The data are generated based on a mixed effects model where there are group level effects (<span class="math inline">\(b_j\)</span> in the model) as well as individual level effects (<span class="math inline">\(b_i\)</span>). The model also assumes a very slight time trend before the intervention (e.g. diabetes control is improving slightly over time for an individual), an intervention effect, and an almost non-existent change in the time trend after the intervention. The outcome in each period is generated based on this formula:</p>
<p><span class="math inline">\(logit(Y_{ijt}) = 0.8 + .01 * period + 0.8 * I_{jt} + 0.001 * I_{jt} * (period-s_j) + b_i + b_j,\)</span></p>
<p>where <span class="math inline">\(period\)</span> goes from 0 to 6 (period 0 is the baseline, period 1 is the 6 month follow, etc.), <span class="math inline">\(I_{jt}\)</span> is 1 if cluster <span class="math inline">\(j\)</span> is in the intervention in period <span class="math inline">\(t\)</span>, <span class="math inline">\(s_j\)</span> is the period where the intervention starts for cluster <span class="math inline">\(j\)</span>, and <span class="math inline">\(logit(Y_{ijt})\)</span> is the log odds of the outcome <span class="math inline">\(Y\)</span> for individual <span class="math inline">\(i\)</span> in cluster <span class="math inline">\(j\)</span> during period <span class="math inline">\(t\)</span>.</p>
<p>We start by defining the data structure using <code>simstudy</code> “data def”" commands. We are assuming that there will be 100 individuals followed at each site for the full study. (We are not assuming any dropout, though we could easily do that.) In this particular case, we are assuming an effect size of 0.8 (which is a log odds ratio):</p>
<pre class="r"><code>library(simstudy)
starts <- "rep(c(2 : 5), each = 10)"
siteDef <- defData(varname = "bj", dist = "normal", formula = 0,
variance = .01, id="site")
siteDef <- defData(siteDef, varname = "sj", dist = "nonrandom",
formula = starts)
siteDef <- defData(siteDef, varname = "ips", dist = "nonrandom",
formula = 100)
indDef <- defDataAdd(varname = "bi", dist = "normal", formula = 0,
variance = 0.01)
trtDef <- defDataAdd(varname = "Ijt" ,
formula = "as.numeric(period >= sj)",
dist = "nonrandom")
f = "0.8 + .01 * period + 0.8 * Ijt + 0.001 * Ijt * (period-sj) + bi + bj"
trtDef <- defDataAdd(trtDef, varname = "Yijt", formula = f,
dist = "binary", link = "logit")</code></pre>
<p>To generate 40 clusters of data, we use the following code:</p>
<pre class="r"><code>set.seed(6789)
dtSite <- genData(40, siteDef)
dtSite <- genCluster(dtSite, cLevelVar = "site", numIndsVar = "ips",
level1ID = "id")
dtSite <- addColumns(indDef, dtSite)
dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
dtSiteTm <- addColumns(trtDef, dtSiteTm)
dtSiteTm</code></pre>
<pre><code>## id period site bj sj ips bi timeID Ijt Yijt
## 1: 1 0 1 -0.1029785 2 100 0.08926153 1 0 1
## 2: 1 1 1 -0.1029785 2 100 0.08926153 2 0 1
## 3: 1 2 1 -0.1029785 2 100 0.08926153 3 1 1
## 4: 1 3 1 -0.1029785 2 100 0.08926153 4 1 1
## 5: 1 4 1 -0.1029785 2 100 0.08926153 5 1 1
## ---
## 27996: 4000 2 40 0.1000898 5 100 0.18869371 27996 0 1
## 27997: 4000 3 40 0.1000898 5 100 0.18869371 27997 0 0
## 27998: 4000 4 40 0.1000898 5 100 0.18869371 27998 0 1
## 27999: 4000 5 40 0.1000898 5 100 0.18869371 27999 1 1
## 28000: 4000 6 40 0.1000898 5 100 0.18869371 28000 1 1</code></pre>
<p>And to visualize what the study data might looks like under these assumptions:</p>
<pre class="r"><code># summary by site
dt <- dtSiteTm[, .(Y = mean(Yijt)), keyby = .(site, period, Ijt, sj)]
ggplot(data = dt, aes(x=period, y=Y, group=site)) +
geom_hline(yintercept = c(.7, .83), color = "grey99") +
geom_line(aes(color=factor(site))) +
geom_point(data = dt[sj == period], color="grey50") +
theme(panel.background = element_rect(fill = "grey90"),
panel.grid = element_blank(),
plot.title = element_text(size = 10, hjust = 0),
panel.border = element_rect(fill = NA, colour = "gray90"),
legend.position = "none",
axis.title.x = element_blank()
) +
ylab("Proportion controlled") +
scale_x_continuous(breaks = seq(0, 10, by = 2),
labels = c("Baseline", paste("Year", c(1:5)))) +
scale_y_continuous(limits = c(.5, 1),
breaks = c(.5, .6, .7, .8, .9, 1)) +
ggtitle("Stepped-wedge design with immediate effect") +
facet_grid(sj~.)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-07-10-using-simulation-for-power-analysis-an-example-based-on-a-stepped-wedge-study-design_files/figure-html/unnamed-chunk-4-1.png" width="480" /></p>
</div>
<div id="estimating-power" class="section level2">
<h2>Estimating power</h2>
<p>We are going to estimate power using only 20 clusters and effect size of 0.25. (Assuming 40 clusters and a large effect size was useful for visualizing the data, but not so interesting for illustrating power, since under those assumptions we are virtually guaranteed to find an effect.)</p>
<p>After generating the data (code not shown) for one iteration, we fit a generalized mixed effects model to show the effect estimate. In this case, the effect estimate is 1.46 (95% CI 1.21-1.77) on the odds ratio scale or 0.37 (95% CI 0.19-0.57) on the log odds ratio scale.</p>
<pre class="r"><code>library(lme4)
library(sjPlot)
glmfit <- glmer(data = dtSiteTm,
Yijt ~ period + Ijt + I(Ijt*(period - sj)) + (1|id) + (1|site),
family="binomial" )
sjt.glmer(glmfit, show.icc = FALSE, show.dev = FALSE)</code></pre>
<table style="border-collapse:collapse; border:none;border-bottom:double;">
<tr>
<td style="padding:0.2cm; border-top:double;">
</td>
<td style="border-bottom:1px solid; padding-left:0.5em; padding-right:0.5em; border-top:double;">
</td>
<td style="padding:0.2cm; text-align:center; border-bottom:1px solid; border-top:double;" colspan="3">
Yijt
</td>
</tr>
<tr>
<td style="padding:0.2cm; font-style:italic;">
</td>
<td style="padding-left:0.5em; padding-right:0.5em; font-style:italic;">
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
Odds Ratio
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
CI
</td>
<td style="padding:0.2cm; text-align:center; font-style:italic; ">
p
</td>
</tr>
<tr>
<td colspan="5" style="padding:0.2cm; text-align:left; border-top:1px solid; font-weight:bold; text-align:left;">
Fixed Parts
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
(Intercept)
</td>
<td style="padding-left:0.5em; padding-right:0.5em; ">
</td>
<td style="padding:0.2cm; text-align:center; ">
2.15
</td>
<td style="padding:0.2cm; text-align:center; ">
1.90 – 2.44
</td>
<td style="padding:0.2cm; text-align:center; ">
<.001
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
period
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
1.00
</td>
<td style="padding:0.2cm; text-align:center; ">
0.95 – 1.06
</td>
<td style="padding:0.2cm; text-align:center; ">
.959
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
Ijt
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
1.46
</td>
<td style="padding:0.2cm; text-align:center; ">
1.21 – 1.77
</td>
<td style="padding:0.2cm; text-align:center; ">
<.001
</td>
</tr>
<tr>
<td style="padding:0.2cm; text-align:left;">
I(Ijt * (period - sj))
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; ">
0.99
</td>
<td style="padding:0.2cm; text-align:center; ">
0.91 – 1.07
</td>
<td style="padding:0.2cm; text-align:center; ">
.759
</td>
</tr>
<tr>
<td colspan="5" style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; font-weight:bold; text-align:left; padding-top:0.5em;">
Random Parts
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
τ<sub>00, id</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
0.011
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
τ<sub>00, site</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
0.029
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
N<sub>id</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
1000
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;">
N<sub>site</sub>
</td>
<td style="padding-left:0.5em; padding-right:0.5em;">
</td>
<td style="padding:0.2cm; text-align:center; padding-top:0.1cm; padding-bottom:0.1cm;" colspan="3">
20
</td>
</tr>
<tr>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;">
Observations
</td>
<td style="padding-left:0.5em; padding-right:0.5em; border-top:1px solid;">
</td>
<td style="padding:0.2cm; padding-top:0.1cm; padding-bottom:0.1cm; text-align:center; border-top:1px solid;" colspan="3">
7000
</td>
</tr>
</table>
<p>In order to estimate power, we need to generate a large number of replications. I created a simple function that generates a new data set every iteration based on the definitions. If we want to vary the model assumptions across different replications, we can write code to modify the data definition part of the process. In this way we could look at power across different sample size, effect size, or variance assumptions. Here, I am only considering a single set of assumptions.</p>
<pre class="r"><code>gData <- function() {
dtSite <- genData(nsites, siteDef)
dtSite <- genCluster(dtSite, cLevelVar = "site",
numIndsVar = "ips", level1ID = "id")
dtSite <- addColumns(indDef, dtSite)
dtSiteTm <- addPeriods(dtSite, nPeriods = 7, idvars = "id")
dtSiteTm <- addColumns(trtDef, dtSiteTm)
return(dtSiteTm)
}</code></pre>
<p>And finally, we iterate through a series of replications, keeping track of each hypothesis test in the variable <em>result</em>. Typically, it would be nice to replicate a large number of times (say 1000), but this can sometimes take a long time. In this case, each call to <code>glmer</code> is very resource intensive - unfortunately, I know of know way to speed this up (please get in touch if you have thoughts on this) - so for the purposes of illustration, I’ve only used 99 iterations. Note also that I check to see if the model converges in each iteration, and only include results from valid estimates. This can be an issue with mixed effects models, particularly when sample sizes are small. To estimate the power (which in this case is 78%), calculate the proportion of successful iterations with a p-value smaller than 0.05, the alpha-level threshold we have used in our hypothesis test:</p>
<pre class="r"><code>result <- NULL
i=1
while (i < 100) {
dtSite <- gData()
glmfit <- tryCatch(glmer(data = dtSite,
Yijt ~ period + Ijt + I(Ijt*(period - sj)) + (1|id) + (1|site),
family="binomial" ),
warning = function(w) { "warning" }
)
if (! is.character(glmfit)) {
pvalue <- coef(summary(glmfit))["Ijt", "Pr(>|z|)"]
result <- c(result, pvalue)
i <- i + 1
}
}
mean(result < .05)</code></pre>
<pre><code>## [1] 0.7812</code></pre>
<p>To explore the sensitivity of the power estimates to changing underlying assumptions of effect size, sample size, variation, and time trends, we could vary those parameters and run a sequence of iterations. The code gets a little more complicated (essentially we need to change the “data defs” for each set of iterations), but it is still quite manageable. Of course, you might want to plan for fairly long execution times, particularly if you use 500 or 1000 iterations for each scenario, rather than the 100 I used here.</p>
</div>
simstudy update: two new functions that generate correlated observations from non-normal distributions
https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/
Wed, 05 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/<p>In an earlier <a href="https://www.rdatagen.net/post/correlated-data-copula/">post</a>, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating <code>simstudy</code> with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from <a href="https://cran.r-project.org/web/packages/simstudy/index.html">CRAN</a>. There are now two additional functions to facilitate the generation of correlated data from <em>binomial</em>, <em>poisson</em>, <em>gamma</em>, and <em>uniform</em> distributions: <code>genCorGen</code> and <code>addCorGen</code>. Here’s a brief intro to these functions.</p>
<div id="generate-generally-correlated-data" class="section level3">
<h3>Generate generally correlated data</h3>
<p><code>genCorGen</code> is an extension of <code>genCorData</code>, which was provided in earlier versions of <code>simstudy</code> to generate multivariate normal data. In the first example below, we are generating data from a multivariate Poisson distribution. To do this, we need to specify the mean of the Poisson distribution for each new variable, and then we specify the correlation structure, just as we did with the normal distribution.</p>
<pre class="r"><code>l <- c(8, 10, 12) # lambda for each new variable
dp <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson",
rho = 0.3, corstr = "cs", wide = TRUE)
dp</code></pre>
<pre><code>## id V1 V2 V3
## 1: 1 7 13 12
## 2: 2 7 11 13
## 3: 3 7 8 14
## 4: 4 7 12 9
## 5: 5 8 13 18
## ---
## 996: 996 8 14 15
## 997: 997 10 5 11
## 998: 998 4 9 9
## 999: 999 5 10 9
## 1000: 1000 6 12 17</code></pre>
<p>Here is the the estimated correlation (we would expect an estimate close to 0.3):</p>
<pre class="r"><code>round(cor(as.matrix(dp[, .(V1, V2, V3)])), 2)</code></pre>
<pre><code>## V1 V2 V3
## V1 1.00 0.29 0.26
## V2 0.29 1.00 0.31
## V3 0.26 0.31 1.00</code></pre>
<p>Similarly, we can generate correlated binary data by specifying the probabilities:</p>
<pre class="r"><code>db<- genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary",
rho = 0.8, corstr = "cs", wide = TRUE)
db</code></pre>
<pre><code>## id V1 V2 V3
## 1: 1 1 1 1
## 2: 2 0 0 1
## 3: 3 1 1 1
## 4: 4 0 0 0
## 5: 5 1 1 1
## ---
## 996: 996 0 1 1
## 997: 997 0 0 0
## 998: 998 0 1 1
## 999: 999 1 1 1
## 1000: 1000 0 0 0</code></pre>
<p>In the case of the binary outcome, the observed correlation will be lower that what is specified, which in this case was 0.8. I tried to provide some intuition about this in the earlier <a href="https://www.rdatagen.net/post/correlated-data-copula/">post</a>:</p>
<pre class="r"><code>round(cor(as.matrix(db[, .(V1, V2, V3)])), 2)</code></pre>
<pre><code>## V1 V2 V3
## V1 1.0 0.50 0.40
## V2 0.5 1.00 0.56
## V3 0.4 0.56 1.00</code></pre>
<p>The gamma distribution requires two parameters - the mean and dispersion. (These are converted into shape and rate parameters more commonly used.)</p>
<pre class="r"><code>dg <- genCorGen(1000, nvars = 3, params1 = c(3,5,7), params2 = c(1,1,1),
dist = "gamma", rho = .7, corstr = "cs",
wide = TRUE, cnames="a, b, c")
dg</code></pre>
<pre><code>## id a b c
## 1: 1 0.1957971 0.9902398 2.299307
## 2: 2 0.2566630 2.4271728 1.217599
## 3: 3 1.9550985 13.9248696 5.178042
## 4: 4 3.5525418 2.5711661 7.848605
## 5: 5 6.6981281 8.7494117 12.478329
## ---
## 996: 996 2.2059693 6.3474811 3.054551
## 997: 997 2.3571427 7.7841085 7.887417
## 998: 998 5.5326638 7.3273337 15.965228
## 999: 999 5.6284681 13.3574118 17.215722
## 1000: 1000 0.3749373 1.1480452 0.696243</code></pre>
<pre class="r"><code>round(cor(as.matrix(dg[, .(a, b, c)])), 2)</code></pre>
<pre><code>## a b c
## a 1.00 0.65 0.67
## b 0.65 1.00 0.62
## c 0.67 0.62 1.00</code></pre>
<p>These data sets can be generated in either <em>wide</em> or <em>long</em> form. So far, we have generated <em>wide</em> form data, where there is one row per unique id. The <em>long</em> form, where the correlated data are on different rows, is useful for plotting or fitting models, because there are repeated measurements for each id:</p>
<pre class="r"><code>dgl <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1),
dist = "gamma", rho = .7, corstr = "cs", wide = FALSE,
cnames="NewCol")
dgl</code></pre>
<pre><code>## id period NewCol
## 1: 1 0 1.066558
## 2: 1 1 5.666802
## 3: 1 2 5.366408
## 4: 2 0 1.419593
## 5: 2 1 9.318227
## ---
## 2996: 999 1 21.821011
## 2997: 999 2 21.800972
## 2998: 1000 0 12.082063
## 2999: 1000 1 18.541231
## 3000: 1000 2 12.063846</code></pre>
<p>Here is a plot of a subset of the data:</p>
<pre class="r"><code>ids <- sample(1000,50, replace = FALSE)
ggplot(data=dgl[id %in% ids,], aes(x=factor(period), y=NewCol, group=id)) +
geom_line(aes(color=factor(id)))+
theme(legend.position = "none") +
scale_x_discrete(expand = c(0,0.1))</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-07-05-simstudy-update-two-functions-to-generate-correlated-observations-from-non-normal-distributions_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
</div>
<div id="generate-data-based-on-values-from-existing-data-set" class="section level3">
<h3>Generate data based on values from existing data set</h3>
<p><code>addCorGen</code> allows us to create correlated data from an existing data set, as one can already do using <code>addCorData</code>, but with non-normal data. In the case of <code>addCorGen</code>, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable <code>xbase</code>, which varies by id.</p>
<p>First we define the data and generate a data set:</p>
<pre class="r"><code>def <- defData(varname = "xbase", formula = 5, variance = 0.2,
dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase",
dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase",
dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1,
dist="nonrandom")
dt <- genData(10000, def)
dt</code></pre>
<pre><code>## cid xbase lambda p gammaMu gammaDis
## 1: 1 12.1128232 5.536056 0.8366960 18.588900 1
## 2: 2 4.9148342 2.695230 0.3715554 4.405998 1
## 3: 3 11.5550282 5.235712 0.8125261 16.626630 1
## 4: 4 3.0802596 2.243475 0.2542785 3.052778 1
## 5: 5 0.9767811 1.817893 0.1535577 2.004423 1
## ---
## 9996: 9996 6.0564517 3.021173 0.4543613 5.536100 1
## 9997: 9997 3.1298866 2.254636 0.2571119 3.083229 1
## 9998: 9998 12.4642670 5.734076 0.8505956 19.942505 1
## 9999: 9999 4.6559318 2.626345 0.3536072 4.183660 1
## 10000: 10000 3.4314285 2.323658 0.2747666 3.274895 1</code></pre>
<p>The Poisson distribution has a single parameter, lambda:</p>
<pre class="r"><code>dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1,
corstr = "cs", dist = "poisson", param1 = "lambda",
cnames = "a, b, c")
dtX1[, .(cid, xbase, lambda, a, b, c)]</code></pre>
<pre><code>## cid xbase lambda a b c
## 1: 1 12.1128232 5.536056 4 6 7
## 2: 2 4.9148342 2.695230 2 4 1
## 3: 3 11.5550282 5.235712 5 6 4
## 4: 4 3.0802596 2.243475 1 3 1
## 5: 5 0.9767811 1.817893 2 1 0
## ---
## 9996: 9996 6.0564517 3.021173 1 3 3
## 9997: 9997 3.1298866 2.254636 2 3 1
## 9998: 9998 12.4642670 5.734076 4 6 8
## 9999: 9999 4.6559318 2.626345 2 3 5
## 10000: 10000 3.4314285 2.323658 0 0 3</code></pre>
<p>The Bernoulli (binary) distribution has a single parameter, p:</p>
<pre class="r"><code>dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4,
corstr = "ar1", dist = "binary", param1 = "p")
dtX2[, .(cid, xbase, p, V1, V2, V3, V4)]</code></pre>
<pre><code>## cid xbase p V1 V2 V3 V4
## 1: 1 12.1128232 0.8366960 1 1 1 1
## 2: 2 4.9148342 0.3715554 0 0 0 0
## 3: 3 11.5550282 0.8125261 1 1 1 1
## 4: 4 3.0802596 0.2542785 0 1 0 0
## 5: 5 0.9767811 0.1535577 0 0 0 1
## ---
## 9996: 9996 6.0564517 0.4543613 0 0 0 0
## 9997: 9997 3.1298866 0.2571119 1 0 0 0
## 9998: 9998 12.4642670 0.8505956 0 1 1 1
## 9999: 9999 4.6559318 0.3536072 1 1 0 0
## 10000: 10000 3.4314285 0.2747666 1 0 1 1</code></pre>
<p>And here is the the Gamma distribution, with its two parameters (mean and dispersion):</p>
<pre class="r"><code>dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4,
corstr = "cs", dist = "gamma",
param1 = "gammaMu", param2 = "gammaDis")
dtX3[, .(cid, xbase, gammaMu, gammaDis,
V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))]</code></pre>
<pre><code>## cid xbase gammaMu gammaDis V1 V2 V3
## 1: 1 12.1128232 18.588900 1 11.24 3.44 9.11
## 2: 2 4.9148342 4.405998 1 0.91 3.77 0.76
## 3: 3 11.5550282 16.626630 1 68.47 12.91 1.72
## 4: 4 3.0802596 3.052778 1 2.54 3.63 2.98
## 5: 5 0.9767811 2.004423 1 0.39 0.14 0.42
## ---
## 9996: 9996 6.0564517 5.536100 1 0.29 4.84 1.80
## 9997: 9997 3.1298866 3.083229 1 4.81 0.38 0.81
## 9998: 9998 12.4642670 19.942505 1 17.10 3.56 4.04
## 9999: 9999 4.6559318 4.183660 1 1.17 0.21 1.47
## 10000: 10000 3.4314285 3.274895 1 1.02 1.61 2.24</code></pre>
</div>
<div id="long-form-data" class="section level3">
<h3>Long form data</h3>
<p>If we have data in <em>long</em> form (e.g. longitudinal data), the function will recognize the structure:</p>
<pre class="r"><code>def <- defData(varname = "xbase", formula = 5, variance = .4,
dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3,
dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "lambda",
formula = "0.5 + 0.5 * period + 0.1 * xbase",
dist="nonrandom", link = "log")
dt <- genData(1000, def)
dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)
dtLong</code></pre>
<pre><code>## cid period xbase nperiods timeID lambda
## 1: 1 0 6.693980 1 1 3.220053
## 2: 1 1 6.693980 1 2 5.308971
## 3: 1 2 6.693980 1 3 8.753013
## 4: 2 0 10.008645 2 4 4.485565
## 5: 2 1 10.008645 2 5 7.395447
## ---
## 2996: 999 1 6.753605 2 2996 5.340720
## 2997: 999 2 6.753605 2 2997 8.805359
## 2998: 1000 0 2.006781 4 2998 2.015119
## 2999: 1000 1 2.006781 4 2999 3.322369
## 3000: 1000 2 2.006781 4 3000 5.477661</code></pre>
<pre class="r"><code>### Generate the data
dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3,
rho = .6, corstr = "cs", dist = "poisson",
param1 = "lambda", cnames = "NewPois")
dtX3</code></pre>
<pre><code>## cid period xbase nperiods timeID lambda NewPois
## 1: 1 0 6.693980 1 1 3.220053 3
## 2: 1 1 6.693980 1 2 5.308971 5
## 3: 1 2 6.693980 1 3 8.753013 9
## 4: 2 0 10.008645 2 4 4.485565 2
## 5: 2 1 10.008645 2 5 7.395447 4
## ---
## 2996: 999 1 6.753605 2 2996 5.340720 6
## 2997: 999 2 6.753605 2 2997 8.805359 11
## 2998: 1000 0 2.006781 4 2998 2.015119 2
## 2999: 1000 1 2.006781 4 2999 3.322369 4
## 3000: 1000 2 2.006781 4 3000 5.477661 7</code></pre>
<p>We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:</p>
<pre class="r"><code>geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid,
family = poisson, corstr = "exchangeable")</code></pre>
<pre><code>## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27</code></pre>
<pre><code>## running glm to get initial regression estimate</code></pre>
<pre><code>## (Intercept) period xbase
## 0.52045259 0.50354885 0.09746544</code></pre>
<pre class="r"><code>round(summary(geefit)$working.correlation, 2)</code></pre>
<pre><code>## [,1] [,2] [,3]
## [1,] 1.00 0.58 0.58
## [2,] 0.58 1.00 0.58
## [3,] 0.58 0.58 1.00</code></pre>
<p>In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, <a href="mailto:keith.goldfeld@nyumc.org">let me know</a>.</p>
</div>
Balancing on multiple factors when the sample is too small to stratify
https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/
Mon, 26 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/<p>Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not <em>guarantee</em> balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.</p>
<p>When the sample size is small and we want to guarantee balance across <em>multiple</em> characteristics, the task is a bit more challenging. Say we have 20 schools that we are randomizing to two groups, 10 in each, and want to make sure the groups are balanced with respect to 4 characteristics: language, poverty, location, and size. Simple stratification may not work so well. If we assume that these four characteristics are binary (e.g. either “yes” or “no”), there are 16 possible combinations. One or more of these combinations could easily be represented by a single school - so it would be impossible to randomize within each of the 16 combinations. What to do?</p>
<p>One possible approach is to generate all possible randomization schemes of the 20 schools, and keep only those schemes that are balanced with respect to the four characteristics. Once we have a list of acceptable randomization schemes, we can just pick one of <em>those</em> at random. (Of course, it is preferable if each school has close to a 50% chance of being assigned to either intervention group.)</p>
<div id="simulate-school-level-data" class="section level2">
<h2>Simulate school-level data</h2>
<p>To start, we generate data for our 20 hypothetical schools using <code>simstudy</code> functions:</p>
<pre class="r"><code>library(simstudy)
set.seed(125)
# define data characteristics for schools
ddef <- defData(varname = "language", formula = .3, dist = "binary")
ddef <- defData(ddef, "poverty", formula = .2, dist = "binary")
ddef <- defData(ddef, "location", formula = .5, dist = "binary")
ddef <- defData(ddef, "size", formula = .5, dist = "binary")
ddef</code></pre>
<pre><code>## varname formula variance dist link
## 1: language 0.3 0 binary identity
## 2: poverty 0.2 0 binary identity
## 3: location 0.5 0 binary identity
## 4: size 0.5 0 binary identity</code></pre>
<pre class="r"><code># generate schools
dt <- genData(20, ddef)
# number of schools in each combination
dt[, .N, keyby = .(language,poverty,location,size)]</code></pre>
<pre><code>## language poverty location size N
## 1: 0 0 0 1 5
## 2: 0 0 1 0 1
## 3: 0 0 1 1 5
## 4: 0 1 0 0 1
## 5: 0 1 1 0 2
## 6: 1 0 0 0 2
## 7: 1 0 0 1 1
## 8: 1 0 1 0 1
## 9: 1 0 1 1 2</code></pre>
<p>In this case, we have nine different combinations of the four characteristics, four of which include only a single school (rows 2, 4, 7, and 8). Stratification wouldn’t work necessarily work here if our goal was balance across all four characteristics.</p>
</div>
<div id="create-randomization-scenarios-to-assess-for-balance" class="section level2">
<h2>Create randomization scenarios to assess for balance</h2>
<p>Ideally, we would generate all possible randomization combinations and check them all for balance. If the number of total units (e.g. schools) is small, this does not pose a challenge (e.g. if N=4, then we only have six possible randomization schemes: TTCC, TCTC, TCCT, CTTC, CTCT, CCTT). However, with N=20, then there are 184,756 possible randomization schemes. Depending on the efficiency of the algorithm, it may be impractical to evaluate all the schemes. So, an alternative is to sample a subset of the schemes and evaluate those. For illustration purposes (so that you can understand what I am doing), I am using some very inefficient <code>R</code> code (using a loops). As a result, I cannot evaluate all possible schemes in a reasonable period of time to get this post out; I decided to sample instead to evaluate 1000 possible randomizations. (At the end of this post, I show results using much more efficient code that uses data.table and Rcpp code much more effectively - so that we can quickly evaluate millions of randomization schemes.)</p>
<p>To start, I create all combinations of randomization schemes:</p>
<pre class="r"><code>totalSchools = 20
rxSchools = 10
xRx <- t(combn(totalSchools, rxSchools))
# show 5 randomly sampled combinations
sampleRows <- sample(nrow(xRx), 5, replace = FALSE)
xRx[sampleRows,]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 3 5 6 7 8 10 12 14 19
## [2,] 5 6 7 8 13 14 15 16 17 18
## [3,] 1 3 4 5 7 9 12 15 17 20
## [4,] 2 3 4 5 9 11 14 15 19 20
## [5,] 3 5 6 7 8 10 11 12 15 16</code></pre>
<p>Below is a function (which I chose to do in Rcpp) that converts the <code>xRx</code> matrix of school ids to a 20-column matrix of 1’s and 0’s indicating whether or not a school is randomized to the intervention in a particular scenario:</p>
<pre class="cpp"><code>#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
NumericMatrix convert01(NumericMatrix xmat, int tcols) {
int xrows = xmat.nrow();
int xcols = xmat.ncol();
NumericMatrix pmat(xrows, tcols);
for (int i=0; i < xrows; i++) {
for (int j=0; j < xcols; j++) {
pmat(i, xmat(i,j) - 1) = 1;
}
}
return(pmat);
}</code></pre>
<pre class="r"><code>x01 <- convert01(xRx, totalSchools)
# show some rows
x01[sampleRows,]</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 0 1 1 0 1 1 1 1 0 1 0 1
## [2,] 0 0 0 0 1 1 1 1 0 0 0 0
## [3,] 1 0 1 1 1 0 1 0 1 0 0 1
## [4,] 0 1 1 1 1 0 0 0 1 0 1 0
## [5,] 0 0 1 0 1 1 1 1 0 1 1 1
## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 0 1 0 0 0 0 1 0
## [2,] 1 1 1 1 1 1 0 0
## [3,] 0 0 1 0 1 0 0 1
## [4,] 0 1 1 0 0 0 1 1
## [5,] 0 0 1 1 0 0 0 0</code></pre>
<p>Because the evaluation code is so inefficient, I draw 1,000 rows at random from this “intervention” matrix <code>x01</code> (after converting it to a data.table).</p>
<pre class="r"><code># convert matrix to data.table
d01 <- data.table(x01)
d01[, id := .I]
ids <- sample(nrow(d01), 1000, replace = FALSE)
sampleD01 <- d01[id %in% ids]</code></pre>
<p>Now we are ready to evaluate each of the 1,000 schemes. As I mentioned before, this approach is highly inefficient as the algorithm requires us to literally loop through each each combination to find the balanced ones. I have sacrificed efficiency and speed for clarity of code (I hope).</p>
<pre class="r"><code>for (i in 1:1000) {
dt[, grp:= t(sampleD01[i,1:20])]
dx <- dt[ , .N, keyby = .(language, grp)]
dc <- dcast(dx, language ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
# we declare a scheme balanced if counts differ by
# no more than 1 school
sampleD01[i, language := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(poverty, grp)]
dc <- dcast(dx, poverty ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, poverty := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(location, grp)]
dc <- dcast(dx, location ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, location := (sum(dc[, diff > 1]) == 0)]
dx <- dt[ , .N, keyby = .(size, grp)]
dc <- dcast(dx, size ~ grp, fill = 0, value.var = "N" )
dc[, diff := abs(`1` - `0`)]
sampleD01[i, size := (sum(dc[, diff > 1]) == 0)]
}</code></pre>
<p>The final determination of balance is made if a scheme is balanced across all four characteristics. In this case, 136 of the 1,000 schemes were balanced based on this criterion:</p>
<pre class="r"><code>sampleD01[, balanced := all(language, poverty, location, size), keyby = id]
# proportion of sampled combinations that are balanced ...
sampleD01[,mean(balanced)]</code></pre>
<pre><code>## [1] 0.136</code></pre>
<p>And let’s inspect the actual balance of two randomly selected schemes - one which is balanced, and one which is not:</p>
<pre class="r"><code>sTrue <- sampleD01[balanced == TRUE]
sFalse <- sampleD01[balanced == FALSE]</code></pre>
<div id="a-balanced-scheme" class="section level3">
<h3>A balanced scheme</h3>
<pre class="r"><code>dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sTrue[sample(.N, 1), 1:20]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 7
## 2: 0 1 7
## 3: 1 0 3
## 4: 1 1 3</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 9
## 2: 0 1 8
## 3: 1 0 1
## 4: 1 1 2</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 4
## 2: 0 1 5
## 3: 1 0 6
## 4: 1 1 5</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 3
## 2: 0 1 4
## 3: 1 0 7
## 4: 1 1 6</code></pre>
</div>
<div id="an-unbalanced-scheme" class="section level3">
<h3>An unbalanced scheme</h3>
<p>In this case, language and location are imbalanced, though size and poverty are fine.</p>
<pre class="r"><code>dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFalse[sample(.N, 1), 1:20]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 8
## 2: 0 1 6
## 3: 1 0 2
## 4: 1 1 4</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 8
## 2: 0 1 9
## 3: 1 0 2
## 4: 1 1 1</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 3
## 2: 0 1 6
## 3: 1 0 7
## 4: 1 1 4</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 4
## 2: 0 1 3
## 3: 1 0 6
## 4: 1 1 7</code></pre>
</div>
</div>
<div id="fast-implementation-with-data.table-and-rcpp" class="section level2">
<h2>Fast implementation with data.table and Rcpp</h2>
<p>As I alluded to before, if we want to implement this in the real world, it would be preferable to use code that does not bog down when we want to search 100,000+ possible randomization schemes. I have written a set of <code>R</code> and <code>Rcpp</code> functions the facilitate this. (Code is available <a href="https://github.com/kgoldfeld/RDataGenBlog/tree/master/static/img/post-balance">here</a>.)</p>
<pre class="r"><code># generate all possible schemes
xperm <- xPerms(totalSchools, rxSchools, N=NULL)
nrow(xperm)</code></pre>
<pre><code>## [1] 184756</code></pre>
<pre class="r"><code>xperm[sample(nrow(xperm), 5, replace = FALSE)]</code></pre>
<pre><code>## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 1: 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0
## 2: 1 1 0 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0
## 3: 1 0 1 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0
## 4: 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0 0
## 5: 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0
## V20 id
## 1: 1 94784
## 2: 0 19535
## 3: 0 61644
## 4: 0 14633
## 5: 0 35651</code></pre>
<pre class="r"><code># prepare data for evaluation
dtMat <- as.matrix(dt[,-1])
cc <- parse(text=attr(xperm, "varlist"))
cc</code></pre>
<pre><code>## expression(c(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12,
## V13, V14, V15, V16, V17, V18, V19, V20))</code></pre>
<pre class="r"><code># evaluate each combination
sF <- xperm[, cppChk(eval(cc), dtMat), keyby = id]
sF[sample(nrow(sF), 5, replace = FALSE)]</code></pre>
<pre><code>## id V1
## 1: 15924 FALSE
## 2: 68284 FALSE
## 3: 149360 FALSE
## 4: 62924 FALSE
## 5: 14009 TRUE</code></pre>
<pre class="r"><code># keep only the balanced schemes
sFinal <- xperm[sF$V1]
nrow(sFinal)</code></pre>
<pre><code>## [1] 7742</code></pre>
<pre class="r"><code># randomize from the balanced schemes
selectRow <- sample(nrow(sFinal), 1)
# check balance of randomized scheme
dtAssigned <- copy(dt)
dtAssigned[, group := as.vector(t(sFinal[selectRow, -"id"]))]
dtAssigned[, .N, keyby=.(language, group)]</code></pre>
<pre><code>## language group N
## 1: 0 0 7
## 2: 0 1 7
## 3: 1 0 3
## 4: 1 1 3</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(poverty, group)]</code></pre>
<pre><code>## poverty group N
## 1: 0 0 9
## 2: 0 1 8
## 3: 1 0 1
## 4: 1 1 2</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(location, group)]</code></pre>
<pre><code>## location group N
## 1: 0 0 5
## 2: 0 1 4
## 3: 1 0 5
## 4: 1 1 6</code></pre>
<pre class="r"><code>dtAssigned[, .N, keyby=.(size, group)]</code></pre>
<pre><code>## size group N
## 1: 0 0 3
## 2: 0 1 4
## 3: 1 0 7
## 4: 1 1 6</code></pre>
</div>
Copulas and correlated data generation: getting beyond the normal distribution
https://www.rdatagen.net/post/correlated-data-copula/
Mon, 19 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/correlated-data-copula/<p>Using the <code>simstudy</code> package, it’s possible to generate correlated data from a normal distribution using the function <em>genCorData</em>. I’ve wanted to extend the functionality so that we can generate correlated data from other sorts of distributions; I thought it would be a good idea to begin with binary and Poisson distributed data, since those come up so frequently in my work. <code>simstudy</code> can already accommodate more general correlated data, but only in the context of a random effects data generation process. This might not be what we want, particularly if we are interested in explicitly generating data to explore marginal models (such as a GEE model) rather than a conditional random effects model (a topic I explored in my <a href="https://www.rdatagen.net/post/marginal-v-conditional/">previous</a> discussion). The extension can quite easily be done using <em>copulas</em>.</p>
<p>Based on <a href="https://en.wikipedia.org/wiki/Copula_%28probability_theory%29">this</a> definition, a copula is a “multivariate probability distribution for which the marginal probability distribution of each variable is uniform.” It can be shown that <span class="math inline">\(U\)</span> is uniformly distributed if <span class="math inline">\(U=F(X)\)</span>, where <span class="math inline">\(F\)</span> is the CDF of a continuous random variable <span class="math inline">\(X\)</span>. Furthermore, if we can generate a multivariate <span class="math inline">\(\mathbf{X}\)</span>, say <span class="math inline">\((X_1, X_2, ..., X_k)\)</span> with a known covariance or correlation structure (e.g. exchangeable, auto-regressive, unstructured), it turns that the corresponding multivariate <span class="math inline">\(\mathbf{U}, (U_1, U_2, ..., U_k)\)</span> will maintain that structure. And in a final step, we can transform <span class="math inline">\(\mathbf{U}\)</span> to another random variable <span class="math inline">\(\mathbf{Y}\)</span> that has a target distribution by applying the inverse CDF <span class="math inline">\(F_i^{-1}(U_i)\)</span> of that target distribution to each <span class="math inline">\(U_i\)</span>. Since we can generate a multivariate normal <span class="math inline">\(\mathbf{X}\)</span>, it is relatively short leap to implement this copula algorithm in order to generate correlated data from other distributions.</p>
<div id="implementing-the-copula-algorithm-in-r" class="section level2">
<h2>Implementing the copula algorithm in R</h2>
<p>While this hasn’t been implemented just yet in <code>simstudy</code>, this is along the lines of what I am thinking:</p>
<pre class="r"><code>library(simstudy)
library(data.table)
set.seed(555)
# Generate 1000 observations of 4 RVs from a multivariate normal
# dist - each N(0,1) - with a correlation matrix where rho = 0.4
dt <- genCorData(1000, mu = c(0, 0, 0, 0), sigma = 1,
rho = 0.4, corstr = "cs" )
dt</code></pre>
<pre><code>## id V1 V2 V3 V4
## 1: 1 -1.1667574 -0.05296536 0.2995360 -0.5232691
## 2: 2 0.4505618 0.57499589 -0.9629426 1.5495697
## 3: 3 -0.1294505 1.68372035 1.1309223 0.4205397
## 4: 4 0.0858846 1.27479473 0.4247491 0.1054230
## 5: 5 0.4654873 3.05566796 0.5846449 1.0906072
## ---
## 996: 996 0.3420099 -0.35783480 -0.8363306 0.2656964
## 997: 997 -1.0928169 0.50081091 -0.8915582 -0.7428976
## 998: 998 0.7490765 -0.09559294 -0.2351121 0.6632157
## 999: 999 0.8143565 -1.00978384 0.2266132 -1.2345192
## 1000: 1000 -1.9795559 -0.16668454 -0.5883966 -1.7424941</code></pre>
<pre class="r"><code>round(cor(dt[,-1]), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.41 0.36 0.44
## V2 0.41 1.00 0.33 0.42
## V3 0.36 0.33 1.00 0.35
## V4 0.44 0.42 0.35 1.00</code></pre>
<pre class="r"><code>### create a long version of the data set
dtM <- melt(dt, id.vars = "id", variable.factor = TRUE,
value.name = "X", variable.name = "seq")
setkey(dtM, "id") # sort data by id
dtM[, seqid := .I] # add index for each record
### apply CDF to X to get uniform distribution
dtM[, U := pnorm(X)]
### Generate correlated Poisson data with mean and variance 8
### apply inverse CDF to U
dtM[, Y_pois := qpois(U, 8), keyby = seqid]
dtM</code></pre>
<pre><code>## id seq X seqid U Y_pois
## 1: 1 V1 -1.16675744 1 0.12165417 5
## 2: 1 V2 -0.05296536 2 0.47887975 8
## 3: 1 V3 0.29953603 3 0.61773446 9
## 4: 1 V4 -0.52326909 4 0.30039350 6
## 5: 2 V1 0.45056179 5 0.67384729 9
## ---
## 3996: 999 V4 -1.23451924 3996 0.10850474 5
## 3997: 1000 V1 -1.97955591 3997 0.02387673 3
## 3998: 1000 V2 -0.16668454 3998 0.43380913 7
## 3999: 1000 V3 -0.58839655 3999 0.27813308 6
## 4000: 1000 V4 -1.74249414 4000 0.04071101 3</code></pre>
<pre class="r"><code>### Check mean and variance of Y_pois
dtM[, .(mean = round(mean(Y_pois), 1),
var = round(var(Y_pois), 1)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 8.0 8.2
## 2: V2 8.1 8.5
## 3: V3 8.1 7.6
## 4: V4 8.0 7.9</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_pois's - I know this code is a bit ugly
### but I just wanted to get the correlation matrix quickly.
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_pois")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.40 0.37 0.43
## V2 0.40 1.00 0.33 0.40
## V3 0.37 0.33 1.00 0.35
## V4 0.43 0.40 0.35 1.00</code></pre>
<p>The correlation matrices for <span class="math inline">\(\mathbf{X}\)</span> and <span class="math inline">\(\mathbf{Y_{Pois}}\)</span> aren’t too far off.</p>
<p>Here are the results for an auto-regressive (AR-1) correlation structure. (I am omitting some of the code for brevity’s sake):</p>
<pre class="r"><code># Generate 1000 observations of 4 RVs from a multivariate normal
# dist - each N(0,1) - with a correlation matrix where rho = 0.4
dt <- genCorData(1000, mu = c(0, 0, 0, 0), sigma = 1,
rho = 0.4, corstr = "ar1" )
round(cor(dt[,-1]), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.43 0.18 0.12
## V2 0.43 1.00 0.39 0.13
## V3 0.18 0.39 1.00 0.38
## V4 0.12 0.13 0.38 1.00</code></pre>
<pre class="r"><code>### Check mean and variance of Y_pois
dtM[, .(mean = round(mean(Y_pois), 1),
var = round(var(Y_pois), 1)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 8.1 8.3
## 2: V2 7.9 7.8
## 3: V3 8.0 8.4
## 4: V4 8.0 7.5</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_pois's
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_pois")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.41 0.18 0.13
## V2 0.41 1.00 0.39 0.14
## V3 0.18 0.39 1.00 0.36
## V4 0.13 0.14 0.36 1.00</code></pre>
<p>Again - comparing the two correlation matrices - the original normal data, and the derivative Poisson data - suggests that this can work pretty well.</p>
<p>Using the last data set, I fit a GEE model to see how well the data generating process is recovered:</p>
<pre class="r"><code>library(geepack)
geefit <- geepack::geeglm(Y_pois ~ 1, data = dtM, family = poisson,
id = id, corstr = "ar1")
summary(geefit)</code></pre>
<pre><code>##
## Call:
## geepack::geeglm(formula = Y_pois ~ 1, family = poisson, data = dtM,
## id = id, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 2.080597 0.007447 78060 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.9984 0.02679
##
## Correlation: Structure = ar1 Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.3987 0.02008
## Number of clusters: 1000 Maximum cluster size: 4</code></pre>
<p>In the GEE output, alpha is an estimate of <span class="math inline">\(\rho\)</span>. The estimated alpha is 0.399, quite close to 0.40, the original value used to generate the normally distributed data.</p>
</div>
<div id="binary-outcomes" class="section level2">
<h2>Binary outcomes</h2>
<p>We can also generate binary data:</p>
<pre class="r"><code>### Generate binary data with p=0.5 (var = 0.25)
dtM[, Y_bin := qbinom(U, 1, .5), keyby = seqid]
dtM</code></pre>
<pre><code>## id seq X seqid U Y_pois Y_bin
## 1: 1 V1 1.7425 1 0.959288 13 1
## 2: 1 V2 1.4915 2 0.932086 12 1
## 3: 1 V3 0.7379 3 0.769722 10 1
## 4: 1 V4 -1.6581 4 0.048644 4 0
## 5: 2 V1 2.3262 5 0.989997 15 1
## ---
## 3996: 999 V4 -0.3805 3996 0.351772 7 0
## 3997: 1000 V1 -0.8724 3997 0.191505 6 0
## 3998: 1000 V2 -1.0085 3998 0.156600 5 0
## 3999: 1000 V3 -2.0451 3999 0.020420 3 0
## 4000: 1000 V4 -2.7668 4000 0.002831 1 0</code></pre>
<pre class="r"><code>### Check mean and variance of Y_bin
dtM[, .(mean = round(mean(Y_bin), 2),
var = round(var(Y_bin), 2)), keyby = seq]</code></pre>
<pre><code>## seq mean var
## 1: V1 0.52 0.25
## 2: V2 0.50 0.25
## 3: V3 0.48 0.25
## 4: V4 0.49 0.25</code></pre>
<pre class="r"><code>### Check correlation matrix of Y_bin's
round(cor(as.matrix(dcast(data = dtM, id~seq,
value.var = "Y_bin")[,-1])), 2)</code></pre>
<pre><code>## V1 V2 V3 V4
## V1 1.00 0.29 0.10 0.05
## V2 0.29 1.00 0.27 0.03
## V3 0.10 0.27 1.00 0.23
## V4 0.05 0.03 0.23 1.00</code></pre>
<p>The binary data are correlated, but the correlation coefficient doesn’t replicate as well as the Poisson distribution. While both the Poisson and binary CDF’s are discontinuous, the extreme jump in the binary CDF leads to this discrepancy. Values that are relatively close to each other on the normal scale, and in particular on the uniform scale, can be ‘sent’ to opposite ends of the binary scale (that is to 0 and to 1) if they straddle the cutoff point <span class="math inline">\(p\)</span> (the probability of the outcome in the binary distribution); values similar in the original data are very different in the target data. This bias is partially attenuated by values far apart on the uniform scale yet falling on the same side of <span class="math inline">\(p\)</span> (both driven to 0 or both to 1); in this case values different in the original data are similar (actually identical) in the target data.</p>
<p>The series of plots below show bivariate data for the original multivariate normal data, and the corresponding uniform, Poisson, and binary data. We can see the effect of extreme discontinuity of the binary data. (R code available <a href="https://github.com/kgoldfeld/RDataGenBlog/blob/master/static/img/post-copula/Check%20rho%20for%20dists.R">here</a>.)</p>
<p><img src="https://www.rdatagen.net/post/2017-06-19-generating-correlated-data-using-a-copula-approach_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
</div>
<div id="some-simulation-results" class="section level2">
<h2>Some simulation results</h2>
<p>A series of simulations shows how well the estimates of <span class="math inline">\(\rho\)</span> compare across a set of different assumptions. In each of the plots below, we see how <span class="math inline">\(\rho\)</span> for the non-normal data changes as a function of <span class="math inline">\(\rho\)</span> from the original normally distributed data. For each value of <span class="math inline">\(\rho\)</span>, I varied the parameter of the non-normal distribution (in the case of the binary data, I varied the probability of the outcome; in the case of the Poisson data, I varied the parameter <span class="math inline">\(\lambda\)</span> which defines the mean and variance). I also considered both covariance structures, exchangeable and ar-1. (R code available <a href="https://github.com/kgoldfeld/RDataGenBlog/blob/master/static/img/post-copula/Copula%20data%20generation.R">here</a>.)</p>
<div class="figure">
<img src="https://www.rdatagen.net/img/post-copula/dists.png" />
</div>
<p>These simulations confirm what we saw earlier. The Poisson data generating process recovers the original <span class="math inline">\(\rho\)</span> under both covariance structures reasonably well. The binary data generating process is less successful, with the exchangeable structure doing slightly better than then auto-regressive structure.</p>
<p>Hopefully soon, this will be implemented in <code>simstudy</code> so that we can generate data from more general distributions with a single function call.</p>
</div>
When marginal and conditional logistic model estimates diverge
https://www.rdatagen.net/post/marginal-v-conditional/
Fri, 09 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/marginal-v-conditional/<STYLE TYPE="text/css">
<!--
td{
font-family: Arial;
font-size: 9pt;
height: 2px;
padding:0px;
cellpadding="0";
cellspacing="0";
text-align: center;
}
th {
font-family: Arial;
font-size: 9pt;
height: 20px;
font-weight: bold;
text-align: center;
}
table {
border-spacing: 0px;
border-collapse: collapse;
}
--->
</STYLE>
<p>Say we have an intervention that is assigned at a group or cluster level but the outcome is measured at an individual level (e.g. students in different schools, eyes on different individuals). And, say this outcome is binary; that is, something happens, or it doesn’t. (This is important, because none of this is true if the outcome is continuous and close to normally distributed.) If we want to measure the <em>effect</em> of the intervention - perhaps the risk difference, risk ratio, or odds ratio - it can really matter if we are interested in the <em>marginal</em> effect or the <em>conditional</em> effect, because they likely won’t be the same.</p>
<p>My aim is to show this through a couple of data simulations that allow us to see this visually.</p>
<div id="first-example" class="section level3">
<h3>First example</h3>
<p>In the first scenario, I am going to use a <em>causal inference</em> framework that uses the idea that everyone has a potential outcome under one exposure (such as an intervention of some sort), and another potential outcome under a different exposure (such as treatment as usual or control). (I may discuss potential outcomes and causal inference in more detail in the future.) The potential outcome can be written with a superscript, lie <span class="math inline">\(Y^0\)</span> or <span class="math inline">\(Y^1\)</span>.</p>
<p>To generate the data, I will use this simple model for each potential outcome: <span class="math display">\[ log\left[\frac{P(Y^0_{ij})}{1-P(Y^0_{ij})}\right] = \gamma + \alpha_i\]</span></p>
<p>and <span class="math display">\[ log\left[\frac{P(Y^1_{ij})}{1-P(Y^1_{ij})}\right] = \gamma + \alpha_i + \delta.\]</span> <span class="math inline">\(\delta\)</span> is the treatment effect and is constant across the clusters, on the log-odds scale. <span class="math inline">\(\alpha_i\)</span> is the cluster specific effect for cluster <span class="math inline">\(i\)</span>. <span class="math inline">\(Y^a_{ij}\)</span> is the potential outcome for individual <span class="math inline">\(j\)</span> under exposure <span class="math inline">\(a\)</span>.</p>
<p>Now let’s generate some data and look at it:</p>
<pre class="r"><code># Define data
def1 <- defData(varname = "clustEff", formula = 0, variance = 2,
id = "cID")
def1 <- defData(def1, varname = "nInd", formula = 10000,
dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "Y0", formula = "-1 + clustEff",
dist = "binary", link = "logit")
def2 <- defDataAdd(def2, varname = "Y1",
formula = "-1 + clustEff + 2",
dist = "binary", link = "logit")
options(width = 80)
def1</code></pre>
<pre><code>## varname formula variance dist link
## 1: clustEff 0 2 normal identity
## 2: nInd 10000 0 noZeroPoisson identity</code></pre>
<pre class="r"><code>def2</code></pre>
<pre><code>## varname formula variance dist link
## 1: Y0 -1 + clustEff 0 binary logit
## 2: Y1 -1 + clustEff + 2 0 binary logit</code></pre>
<pre class="r"><code># Generate cluster level data
set.seed(123)
dtC <- genData(n = 100, def1)
# Generate individual level data
dt <- genCluster(dtClust = dtC, cLevelVar = "cID", numIndsVar = "nInd",
level1ID = "id")
dt <- addColumns(def2, dt)</code></pre>
<p>Since we have repeated measurements for each cluster (the two potential outcomes), we can transform this into a “longitudinal” data set, though the periods are not time but different exposures.</p>
<pre class="r"><code>dtLong <- addPeriods(dtName = dt, idvars = c("id","cID"),
nPeriods = 2,timevars = c("Y0","Y1"),
timevarName = "Y"
)</code></pre>
<p>When we look at the data visually, we get a hint that the marginal (or average) effect might not be the same as the conditional (cluster-specific) effects.</p>
<pre class="r"><code># Calculate average potential outcomes by exposure (which is called period)
dtMean <- dtLong[, .(Y = mean(Y)), keyby = .(period, cID)] # conditional mean
dtMMean <- dtLong[, .(Y = mean(Y)), keyby = .(period)] # marginal mean
dtMMean[, cID := 999]
ggplot(data = dtMean, aes(x=factor(period), y = Y, group= cID)) +
# geom_jitter(width= .25, color = "grey75") +
geom_line(color = "grey75", position=position_jitter(w=0.02, h=0.02)) +
geom_point(data=dtMMean) +
geom_line(data=dtMMean, size = 1, color = "red") +
ylab("Estimated cluster probability") +
scale_y_continuous(labels = scales::percent) +
theme(axis.title.x = element_blank()) +
my_theme()</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-2-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Looking at the plot, we see that the slopes of the grey lines - each representing the change in probability as a result of the exposure for each cluster - vary quite a bit. When the probability without exposure (<span class="math inline">\(Y^0\)</span>) is particularly low or high, the absolute effect of the intervention is small (the slope is minimal). The slope or absolute effect increases when the starting probability is closer to 50%. The red line represents the averages of <span class="math inline">\(Y^0\)</span> and <span class="math inline">\(Y^1\)</span> across all individuals in all clusters. There is no reason to believe that the average slope of the grey lines is the same as the slope of the red line, which is slope of the averages. We will see that more clearly with the next data generation scenario.</p>
<p>Finally, if we look at cluster-specific effects of exposure, we see that on the risk difference scale (difference in probabilities), there is much variation, but on the log-odds ratio scale there is almost no variation. This is as it should be, because on the log-odds scale (which is how we generated the data), the difference between exposure and non-exposure is additive. On the probability scale, the difference is multiplicative. Here are some estimated differences for a sample of clusters:</p>
<pre class="r"><code>dtChange <- dt[, .(Y0 = mean(Y0), Y1 = mean(Y1)), keyby = cID]
dtChange[, riskdiff := round(Y1 - Y0, 2)]
dtChange[, loratio := round( log( (Y1 / (1-Y1)) / (Y0 / (1-Y0) )), 2)]
dtChange[sample(1:100, 10, replace = F),
.(Y0 = round(Y0,2), Y1 = round(Y1,2), riskdiff, loratio),
keyby=cID]</code></pre>
<pre><code>## cID Y0 Y1 riskdiff loratio
## 1: 18 0.02 0.14 0.12 1.95
## 2: 19 0.50 0.89 0.39 2.06
## 3: 22 0.22 0.67 0.45 1.97
## 4: 24 0.12 0.49 0.37 1.94
## 5: 30 0.69 0.94 0.26 2.00
## 6: 31 0.40 0.83 0.42 1.95
## 7: 34 0.56 0.90 0.35 1.99
## 8: 38 0.26 0.72 0.46 2.01
## 9: 72 0.01 0.09 0.08 1.97
## 10: 99 0.22 0.66 0.44 1.93</code></pre>
</div>
<div id="second-example" class="section level2">
<h2>Second example</h2>
<p>This time around, we will add an additional individual level covariate that will help us visualize the difference a bit more clearly. Let us say that <em>age</em> is positively associated with increased probability in the outcome. (In this case, we measured age and then normalized it so that the mean age in the sample is 0.) And this time around, we are not going to use potential outcomes, but will randomly assign clusters to an intervention or treatment group.</p>
<p>This is the data generating model and the code:</p>
<p><span class="math display">\[ log\left[\frac{P(Y_{ij})}{1-P(Y_{ij})}\right] = \gamma + \alpha_j + \beta_1*Trt_j + \beta_2*Age_i\]</span></p>
<pre class="r"><code>def1 <- defData(varname = "clustEff", formula = 0, variance = 2, id = "cID")
def1 <- defData(def1, varname = "nInd", formula = 100, dist = "noZeroPoisson")
# Each individual now has a measured age
def2 <- defDataAdd(varname = "age", formula = 0, variance = 2)
def2 <- defDataAdd(def2, varname = "Y",
formula = "-4 + clustEff + 2*trt + 2*age",
dist = "binary", link = "logit")
# Generate cluster level data
dtC <- genData(200, def1)
dtC <- trtAssign(dtC, grpName = "trt") #
# Generate individual level data
dt <- genCluster(dtClust = dtC, cLevelVar = "cID", numIndsVar = "nInd",
level1ID = "id")
dt <- addColumns(def2, dt)</code></pre>
<p>By fitting a conditional model (generalized linear mixed effects model) and a marginal model (we should fit a generalized estimating equation model to get the proper standard error estimates, but will estimate a generalized linear model, because the GEE model does not have a “predict” option in R; the point estimates for both marginal models should be quite close), we can see that indeed the conditional and marginal averages can be quite different.</p>
<pre class="r"><code>glmerFit1 <- glmer(Y ~ trt + age + (1 | cID), data = dt, family="binomial")
glmFit1 <- glm(Y ~ trt + age, family = binomial, data = dt)</code></pre>
<pre><code>## Intercept Trt Age
## conditional model -3.82 1.99 2.01
## marginal model -2.97 1.60 1.54</code></pre>
<p>Now, we’d like to visualize how the conditional and marginal treatment effects diverge. We can use the model estimates from the conditional model to predict probabilities for each cluster, age, and treatment group. (These will appear as grey lines in the plots below). We can also predict marginal probabilities from the marginal model based on age and treatment group while ignoring cluster. (These marginal estimates appear as red lines.) Finally, we can predict probability of outcomes for the conditional model also based on age and treatment group alone, but fixed at a mythical cluster whose random effect is 0. (These “average” conditional estimates appear as black lines.)</p>
<pre class="r"><code>newCond <- expand.grid(cID = unique(dt$cID), age=seq(-4, 4, by =.1))
newCond0 <- data.table(trt = 0, newCond)
newCond1 <- data.table(trt = 1, newCond)
newMarg0 <- data.table(trt = 0, age = seq(-4, 4, by = .1))
newMarg1 <- data.table(trt = 1, age = seq(-4, 4, by = .1))
newCond0[, pCond0 := predict(glmerFit1, newCond0, type = "response")]
newCond1[, pCond1 := predict(glmerFit1, newCond1, type = "response")]
newMarg0[, pMarg0 := predict(glmFit1, newMarg0, type = "response")]
newMarg0[, pCAvg0 := predict(glmerFit1, newMarg0[,c(1,2)],
re.form = NA, type="response")]
newMarg1[, pMarg1 := predict(glmFit1, newMarg1, type = "response")]
newMarg1[, pCAvg1 := predict(glmerFit1, newMarg1[,c(1,2)],
re.form = NA, type="response")]
dtAvg <- data.table(age = newMarg1$age,
avgMarg = newMarg1$pMarg1 - newMarg0$pMarg0,
avgCond = newMarg1$pCAvg1 - newMarg0$pCAvg0
)
p1 <- ggplot(aes(x = age, y = pCond1), data=newCond1) +
geom_line(color="grey", aes(group = cID)) +
geom_line(data=newMarg1, aes(x = age, y = pMarg1), color = "red", size = 1) +
geom_line(data=newMarg1, aes(x = age, y = pCAvg1), color = "black", size = 1) +
ggtitle("Treatment group") +
xlab("Age") +
ylab("Probability") +
my_theme()
p0 <- ggplot(aes(x = age, y = pCond0), data=newCond0) +
geom_line(color="grey", aes(group = cID)) +
geom_line(data=newMarg0, aes(x = age, y = pMarg0), color = "red", size = 1) +
geom_line(data=newMarg0, aes(x = age, y = pCAvg0), color = "black", size = 1) +
ggtitle("Control group") +
xlab("Age") +
ylab("Probability") +
my_theme()
pdiff <- ggplot(data = dtAvg) +
geom_line(aes(x = age, y = avgMarg), color = "red", size = 1) +
geom_line(aes(x = age, y = avgCond), color = "black", size = 1) +
ggtitle("Risk difference") +
xlab("Age") +
ylab("Probability") +
my_theme()
grid.arrange(p1, p0, pdiff)</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /></p>
<p>We see pretty clearly across all ages that the marginal and conditional estimates of average treatment differences differ quite dramatically.</p>
<p>Below are point estimates and plots for data generated with very little variance across clusters, that is <span class="math inline">\(var(\alpha_i)\)</span> is close to 0. (We change this in the simulation by setting <code>def1 <- defData(varname = "clustEff", formula = 0, variance = 0.05, id = "cID")</code>.)</p>
<pre><code>## Intercept Trt Age
## conditional model -4.03 2.08 2.01
## marginal model -4.00 2.07 1.99</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-06-09-marginal-v-conditional_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" /></p>
<p>The black lines obscure the red - the marginal model estimate is not much different from the conditional model estimate - because the variance across clusters is negligible.</p>
</div>
It can be easy to explore data generating mechanisms with the simstudy package
https://www.rdatagen.net/post/intro-to-simstudy/
Tue, 16 May 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/intro-to-simstudy/<STYLE TYPE="text/css">
<!--
td{
font-family: Arial;
font-size: 9pt;
height: 2px;
padding:0px;
cellpadding="0";
cellspacing="0";
text-align: center;
}
th {
font-family: Arial;
font-size: 9pt;
height: 20px;
font-weight: bold;
text-align: center;
}
table {
border-spacing: 0px;
border-collapse: collapse;
}
--->
</STYLE>
<p>I learned statistics and probability by simulating data. Sure, I did the occasional proof, but I never believed the results until I saw it in a simulation. I guess I have it backwards, but I that’s just the way I am. And now that I am a so-called professional, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure - I’ll use the occasional formula when one exists, but I always feel the need to check it with simulation. It’s just the way I am.</p>
<p>Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called <a href = https://cran.r-project.org/web/packages/simstudy/index.html>simstudy</a>, which obviously means <em>simulating study data</em>. The purpose here in this blog entyr is to introduce the basic idea behind simstudy, and provide a relatively brief example that actually comes from a question a user posed about generating correlated longitudinal data.</p>
<div id="the-basic-idea" class="section level2">
<h2>The basic idea</h2>
<p>Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved, extending to allow simulators to generate observed or randomized <em>treatment assignment/exposures</em>, <em>survival</em> data, <em>longitudinal/panel</em> data, <em>multi-level/hierarchical</em> data, datasets with <em>correlated variables</em> based on a specified covariance structure, and to data sets with <em>missing</em> data based on a variety of missingness patterns.</p>
<p>The key to simulating data in simstudy is the creation of series of data defintion tables that look like this:</p>
<table>
<thead>
<tr class="header">
<th align="left">varname</th>
<th align="left">formula</th>
<th align="right">variance</th>
<th align="left">dist</th>
<th align="left">link</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">nr</td>
<td align="left">7</td>
<td align="right">0</td>
<td align="left">nonrandom</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">x1</td>
<td align="left">10;20</td>
<td align="right">0</td>
<td align="left">uniform</td>
<td align="left">identity</td>
</tr>
<tr class="odd">
<td align="left">y1</td>
<td align="left">nr + x1 * 2</td>
<td align="right">8</td>
<td align="left">normal</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">y2</td>
<td align="left">nr - 0.2 * x1</td>
<td align="right">0</td>
<td align="left">poisson</td>
<td align="left">log</td>
</tr>
<tr class="odd">
<td align="left">xCat</td>
<td align="left">0.3;0.2;0.5</td>
<td align="right">0</td>
<td align="left">categorical</td>
<td align="left">identity</td>
</tr>
<tr class="even">
<td align="left">g1</td>
<td align="left">5+xCat</td>
<td align="right">1</td>
<td align="left">gamma</td>
<td align="left">log</td>
</tr>
<tr class="odd">
<td align="left">a1</td>
<td align="left">-3 + xCat</td>
<td align="right">0</td>
<td align="left">binary</td>
<td align="left">logit</td>
</tr>
</tbody>
</table>
<p>Here’s the code that is used to generate this definition, which is stored as a <a href = https://github.com/Rdatatable/data.table/wiki>data.table </a>:</p>
<pre class="r"><code>def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1",
link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1,
link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat",
link = "logit")</code></pre>
<p>To create a simple data set based on these definitions, all one needs to do is execute a single <code>genData</code> command. In this example, we generate 500 records that are based on the definition in the <code>def</code>table:</p>
<pre class="r"><code>dt <- genData(500, def)
dt</code></pre>
<pre><code>## idnum nr x1 y1 y2 xCat g1 a1
## 1: 1 7 11.78709 30.24305 109 1 297.03396 0
## 2: 2 7 13.02129 30.58814 87 3 929.80474 0
## 3: 3 7 15.37784 37.31205 44 2 3738.52996 1
## 4: 4 7 18.66916 42.51506 24 2 78.17338 0
## 5: 5 7 10.55980 26.39898 131 2 147.42608 0
## ---
## 496: 496 7 19.74327 46.62290 24 1 15.49462 0
## 497: 497 7 18.41837 43.35002 24 1 201.66739 0
## 498: 498 7 18.64485 47.62767 26 1 215.74763 0
## 499: 499 7 12.07687 28.65423 105 3 864.30964 1
## 500: 500 7 11.37337 29.71361 132 2 149.21883 0</code></pre>
<p>There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want give a little more introduction by way of an example that came in from around the world a couple of days ago. (I’d say the best thing about building a package is hearing from folks literally all over the world and getting to talk to them about statistics and R. It is really incredible to be able to do that.)</p>
</div>
<div id="going-a-bit-further-simulating-a-prosepctive-cohort-study-with-repeated-measures" class="section level2">
<h2>Going a bit further: simulating a prosepctive cohort study with repeated measures</h2>
<p>The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? Of course.</p>
<p>This was what I sent back to my correspondent:</p>
<pre class="r"><code># Define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T",
variance = 3)
# Generate a 'blank' data.table with 24 observations and assign them to
# groups
set.seed(1234)
indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
# Create a longitudinal data set of 3 records for each id
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
# Let's look at the data
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-4-1.png" width="672" /></p>
<p>If we generate a data set based on 1,000 indviduals and estimate a linear regression model we see that the parameter estimates are quite good. However, my correspondent wrote back saying she wanted correlated data, which makes sense. We can see from the alpha estimate of approximately 0.02 (at the bottom of the output), we don’t have much correlation:</p>
<pre class="r"><code># Fit a GEE model to the data
fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 4.98268 0.07227 4753.4 <2e-16 ***
## factor(T)Yes 1.48555 0.10059 218.1 <2e-16 ***
## period 2.53946 0.05257 2333.7 <2e-16 ***
## factor(T)Yes:period 3.51294 0.07673 2096.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2.952 0.07325
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.01737 0.01862
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
</div>
<div id="one-way-to-generate-correlated-data" class="section level2">
<h2>One way to generate correlated data</h2>
<p>The first way to approach this is to use the simstudy function <code>genCorData</code> to generate correlated errors that are normally distributed with mean 0, variance of 3, and and common correlation coeffcient of 0.7. This approach is a bit mysterious, because we are acknowledging that we don’t know what is driving the relationship between the three outcomes, just that they have a common cause.</p>
<pre class="r"><code># define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + e")
# define the correlated errors
mu <- c(0, 0, 0)
sigma <- rep(sqrt(3), 3)
# generate correlated data for each id and assign treatment
dtCor <- genCorData(24, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")
dtCor <- trtAssign(dtCor, nTrt = 2, balanced = TRUE, grpName = "T")
# create longitudinal data set and generate outcome based on definition
longData <- addPeriods(dtCor, nPeriods = 3, idvars = "id", timevars = c("V1",
"V2", "V3"), timevarName = "e")
longData <- addColumns(ydef, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
# look at the data, outcomes should appear more correlated, lines a bit
# straighter
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-7-1.png" width="672" /></p>
<p>Again, we recover the true parameters. And this time, if we look at the estimated correlation, we see that indeed the outcomes are correlated within each indivdual. The estimate is 0.77, close to the our specified value of 0.7.</p>
<pre class="r"><code>fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 5.0636 0.0762 4411 <2e-16 ***
## factor(T)Yes 1.4945 0.1077 192 <2e-16 ***
## period 2.4972 0.0303 6798 <2e-16 ***
## factor(T)Yes:period 3.5204 0.0426 6831 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 3.07 0.117
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.711 0.0134
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
</div>
<div id="another-way-to-generate-correlated-data" class="section level2">
<h2>Another way to generate correlated data</h2>
<p>A second way to generate correlatd data is through an individual level random-effect or random intercept. This could be considered some unmeasured characteristic of the individuals (which happens to have a convenient normal distribution with mean zero). This random effect contributes equally to all instances of an individuals outcomes, but the outcomes for a particular individual deviate slightly from the hypothetical straight line as a result of unmeasured noise.</p>
<pre class="r"><code>ydef1 <- defData(varname = "randomEffect", dist = "normal", formula = 0, variance = sqrt(3))
ydef2 <- defDataAdd(varname = "Y", dist = "normal", formula = "5 + 2.5*period + 1.5*T + 3.5*period*T + randomEffect",
variance = 1)
indData <- genData(24, ydef1)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
indData[1:6]</code></pre>
<pre><code>## id T randomEffect
## 1: 1 0 -1.3101
## 2: 2 1 0.3423
## 3: 3 0 0.5716
## 4: 4 1 2.6723
## 5: 5 0 -0.9996
## 6: 6 1 -0.0722</code></pre>
<pre class="r"><code>longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef2, longData)
longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]
ggplot(data = longData, aes(x = factor(period), y = Y)) + geom_line(aes(color = T,
group = id)) + scale_color_manual(values = c("#e38e17", "#8e17e3")) + xlab("Time")</code></pre>
<p><img src="https://www.rdatagen.net/post/2017-05-16-intro-to-simstudy_files/figure-html/unnamed-chunk-10-1.png" width="672" /></p>
<pre class="r"><code>fit <- geeglm(Y ~ factor(T) + period + factor(T) * period, family = gaussian(link = "identity"),
data = longData, id = id, corstr = "exchangeable")
summary(fit)</code></pre>
<pre><code>##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 4.9230 0.0694 5028 <2e-16 ***
## factor(T)Yes 1.4848 0.1003 219 <2e-16 ***
## period 2.5310 0.0307 6793 <2e-16 ***
## factor(T)Yes:period 3.5076 0.0449 6104 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2.63 0.0848
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.619 0.0146
## Number of clusters: 1000 Maximum cluster size: 3</code></pre>
<p>I sent all this back to my correspondent, but I haven’t heard yet if it is what she wanted. I certainly hope so. If there are specific topics you’d like me to discuss related to simstudy, definitely get in touch, and I will try to write something up.</p>
</div>
Everyone knows that loops in R are to be avoided, but vectorization is not always possible
https://www.rdatagen.net/post/first-blog-entry/
Wed, 10 May 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/first-blog-entry/<p>It goes without saying that there are always many ways to solve a problem in R, but clearly some ways are better (for example, faster) than others. Recently, I found myself in a situation where I could not find a way to avoid using a loop, and I was immediately concerned, knowing that I would want this code to be flexible enough to run with a very large number of observations, possibly over many observations. Two tools immediately came to mind: <a href = https://github.com/Rdatatable/data.table/wiki>data.table </a> and <a href = http://adv-r.had.co.nz/Rcpp.html>Rcpp </a>. This brief description explains the background of the simulation problem I was working on and walks through the evolution of ideas to address the problems I ran into when I tried to simulate a large number of inviduals. In particular, when I tried to simulate a very large number of individuals, say over 1 million, running the simulation over night wasn’t enough.</p>
<div id="setting-up-the-problem" class="section level2">
<h2>Setting up the problem</h2>
<p>The task in question here is not the focus, but needs a little explanation to understand what is motivating the programming issue. I am conducting a series of simulations that involve generating an individual-level stochastic (Markov) process for any number of individuals. For the data generation, I am using the <a href = https://cran.r-project.org/web/packages/simstudy/index.html>simstudy</a> package developed to help facilitate simulated data.</p>
<p>The functions <code>defDataAdd</code> and <code>genData</code> are both from <code>simstudy</code>. The first part of the simulation involves specifying the transition matrix <code>P</code> that determine a state I am calling <code>status</code>, and then defining the probability of an event that are based on a particular status level at a particular time point. For each individual, I generate 36 months of data and a status and event for each month.</p>
<pre class="r"><code>library(data.table)
library(simstudy)
set.seed(123)
P <- matrix(c(0.985, 0.015, 0.000, 0.000,
0.000, 0.950, 0.050, 0.000,
0.000, 0.000, 0.850, 0.150,
0.000, 0.000, 0.000, 1.000),
nrow = 4, byrow = TRUE)
form <- "(status == 1) * 0.02 + (status == 2) * 0.10 + (status == 3) * 0.20"
dtDef <- defDataAdd(varname = "event",
formula = form,
dist = "binary",
link = "identity")
N = 5000
did <- genData(N)</code></pre>
<p>In order to simulate the Markov process, I decided immediately that <code>Rcpp</code> would be most appropriate because I knew I could not avoid looping. Since each state of a Markov process depends on the state immediately preceding, states need to be generated sequentially, which means no obvious way to vectorize (if someone has figured that out, let me know.)</p>
<pre class="cpp"><code>#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector MCsim( unsigned int nMonths, NumericMatrix P,
int startStatus, unsigned int startMonth ) {
IntegerVector sim( nMonths );
IntegerVector m( P.ncol());
NumericVector currentP;
IntegerVector newstate;
unsigned int q = P.ncol();
m = Rcpp::seq(1, q);
sim[startMonth - 1] = startStatus;
for (unsigned int i = startMonth; i < nMonths; i++) {
newstate = RcppArmadillo::sample(m, 1, TRUE, P.row(sim(i-1) - 1));
sim(i) = newstate(0);
}
return sim;
}</code></pre>
<p>The process is simulated for each individual using the <strong>Rcpp</strong> function <code>MCsim</code>, but is done in the context of a <code>data.table</code> statement. The key here is that each individual is processed separately through the <code>keyby = id</code> statement. This obviates the requirement to loop through individuals even though I still need to loop within individuals for the stochastic process. This algorithm is quite fast, even with very large numbers of individuals and large numbers of observations (in this case months) per individual.</p>
<pre class="r"><code>dt <- did[, .(status = MCsim(36, P, 1, 1)),
keyby = id]
dt[, month := 1 : .N, keyby = id]
dt <- addColumns(dtDefs = dtDef, dtOld = dt)
dt</code></pre>
<pre><code>## id status month event
## 1: 1 1 1 0
## 2: 1 1 2 0
## 3: 1 1 3 0
## 4: 1 1 4 0
## 5: 1 1 5 0
## ---
## 179996: 5000 4 32 0
## 179997: 5000 4 33 0
## 179998: 5000 4 34 0
## 179999: 5000 4 35 0
## 180000: 5000 4 36 0</code></pre>
</div>
<div id="this-is-where-things-begin-to-slow-down" class="section level2">
<h2>This is where things begin to slow down</h2>
<p>It is the next phase of the simulation that started to cause me problems. For the simulation, I need to assign individuals to a group or cohort which is defined by a month and is based on several factors: (1) whether an event occurred in that month, (2) whether the status of that individual in that month exceeded a value of <code>1</code>, and (3) whether or not the individual experienced 2 or more events in the prior 12 months. An indivdual might be eligible for more than one cohort, but will be assigned to the first possible cohort (i.e. the earliest month where all three criteria are met.)</p>
<p>Again, the specifics of the simulation are not important here. What is important, is the notion that the problem requires looking through individual data sequentially, something R is generally not so good at when the sequences get particularly long, and they must be repeated a large number of times.</p>
<p>My first, naïve, approach was to create an <strong>R</strong> function that loops through all the individuals and loops within each individual until a cohort is found:</p>
<pre class="r"><code>rAssignCohortID <- function(id, month, status,
event, nInds,
startMonth, thresholdNum) {
cohort <- rep(0, length(id));
for (j in (1 : nInds)) {
idMonth = month[id == j];
idEvent = event[id == j];
idStatus = status[id == j];
endMonth = length(idMonth);
done = FALSE;
i = max(startMonth - idMonth[1], 13);
while (i <= endMonth && !done) {
if (idEvent[i] == 1 && idStatus[i] > 1) {
begin = i-12;
end = i-1;
sumED = sum(idEvent[begin:end]);
if (sumED >= thresholdNum) {
cohort[id == j] <- i - 1 + month[1];
done = TRUE;
}
}
i = i + 1;
}
}
return(cohort);
} </code></pre>
</div>
<div id="working-through-possible-solutions" class="section level2">
<h2>Working through possible solutions</h2>
<pre class="r"><code>system.time(dt[, cohort1 := rAssignCohortID(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2)])</code></pre>
<pre><code>## user system elapsed
## 12.555 0.180 13.256</code></pre>
<p>The naïve approach works, but can we do better? I thought <strong>Rcpp</strong> might be a solution, because we know that loops in C++ are much more efficient. However, things did not turn out so well after I translated the function into C++; in fact, they got a little worse.</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector cAssignCohortID( IntegerVector id,
IntegerVector month,
IntegerVector status,
IntegerVector event,
int nInds,
int startMonth,
int thresholdNum) {
IntegerVector cohort(id.length(), 0);
IntegerVector idMonth;
IntegerVector idEvent;
IntegerVector idStatus;
for (int j = 0; j < nInds; j++) {
idMonth = month[id == j+1];
idEvent = event[id == j+1];
idStatus = status[id == j+1];
int endMonth = idMonth.length();
int sumED;
bool done = FALSE;
int i = std::max(startMonth - idMonth(0), 12);
int begin;
int end;
while (i < endMonth && !done) {
if (idEvent(i) == 1 && idStatus(i) > 1) {
begin = i-12;
end = i-1;
sumED = sum(idEvent[Rcpp::seq(begin, end)]);
if (sumED >= thresholdNum) {
cohort[id == j + 1] = i + month(0);
done = TRUE;
}
}
i += 1;
}
}
return(cohort);
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort2 := cAssignCohortID(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2)])</code></pre>
<pre><code>## user system elapsed
## 12.405 0.140 12.886</code></pre>
<p>I know that the function <code>cAssignCohortID</code> bogs down not in the loop, but in each phase where I need to subset the data set to work on a single <code>id</code>. For example, I need to execute the statement <code>idMonth = month[id == j+1]</code> for each <code>id</code>, and this apparently uses a lot of resources. I tried variations on this theme, alternatives to subset the data set within the Rcpp function, but could get no improvements.</p>
<p>But a light bulb went off in my head (dim as it might be), telling me that this is one of the many things <code>data.table</code> is particularly good at. In fact, I used this trick earlier in generating the stochastic process data. So, rather than subsetting the data within the function, I created a regular R function that handles only a single individual <code>id</code> at a time, and let <code>data.table</code> do the hard work of splitting up the data set to process by individual.</p>
<p>As you can see, things got markedly faster.</p>
<pre class="r"><code>rAssignCohort <- function(id, month, status, event,
nInds, startMonth, thresholdNum) {
cohort <- 0
endMonth = length(month);
done = FALSE;
i = max(startMonth - month[1], 13);
while (i <= endMonth && !done) {
if (event[i] == 1 && status[i] > 1) {
begin = i-12;
end = i-1;
sumED = sum(event[begin:end]);
if (sumED >= thresholdNum) {
cohort <- i - 1 + month[1];
done = TRUE;
}
}
i = i + 1;
}
return(cohort)
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort3 := rAssignCohort(id, month, status, event,
nInds = N, startMonth = 13, thresholdNum = 2),
keyby=id])</code></pre>
<pre><code>## user system elapsed
## 0.150 0.009 0.171</code></pre>
<p>Finally, it occurred to me that an <code>Rcpp</code> function that is not required to subset the data might offer more yet improvements in speed. So, for the last iteration, I combined the strengths of looping in <code>Rcpp</code> with the strengths of subsetting in <code>data.table</code> to create a formidable combination. (Even when sample sizes exceed 1 million, the data are generated in a flash.)</p>
<pre class="cpp"><code>#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int cAssignCohort( IntegerVector month,
IntegerVector status,
IntegerVector event,
int startMonth, int thresholdNum) {
int endMonth = month.length();
int sumED;
int cohort = 0;
bool done = FALSE;
int i = std::max(startMonth - month(0), 12);
int begin;
int end;
while (i < endMonth && !done) {
if (event(i) == 1 && status(i) > 1) {
begin = i-12;
end = i-1;
sumED = sum(event[Rcpp::seq(begin, end)]);
if (sumED >= thresholdNum) {
cohort = i + month(0);
done = TRUE;
}
}
i += 1;
}
return(cohort);
}</code></pre>
<pre class="r"><code>system.time(dt[, cohort4 := cAssignCohort(month, status, event,
startMonth=13, thresholdNum = 2), keyby=id])</code></pre>
<pre><code>## user system elapsed
## 0.029 0.004 0.035</code></pre>
<p>For a more robust comparison, let’s use the <code>benchmark</code> function in package <code>rbenchmark</code>, and you can see how well <code>data.table</code> performs and how much <code>Rcpp</code> can add when used efficiently.</p>
<pre class="r"><code>library(rbenchmark)
benchmark(
dt[, cohort1 := rAssignCohortID(id, month, status, event, # Naïve approach
nInds = N, startMonth = 13, thresholdNum = 2)],
dt[, cohort2 := cAssignCohortID(id, month, status, event, # Rcpp approach
nInds = N, startMonth = 13, thresholdNum = 2)],
dt[, cohort3 := rAssignCohort(id, month, status, event, # data.table approach
nInds = N, startMonth = 13, thresholdNum = 2), keyby=id],
dt[, cohort4 := cAssignCohort(month, status, event, # combined data.table/Rcpp
startMonth=13, thresholdNum = 2), keyby=id],
replications = 5,
columns = c("replications", "elapsed", "relative"))</code></pre>
<pre><code>## replications elapsed relative
## 1 5 52.578 395.323
## 2 5 66.528 500.211
## 3 5 0.752 5.654
## 4 5 0.133 1.000</code></pre>
</div>
<div id="postscript" class="section level2">
<h2>Postscript</h2>
<p>I shared all of this with the incredibly helpful folks who have created data.table, and they offered a data.table only solution that avoids all looping, which I will share here for completeness. While it is an improvement over the third approach presented above (R function with data.table statment <code>keyby</code>), it is still no match for the fastest solution. (But, this all just goes to show you there will always be new approaches to consider, and I don’t claim to have come any where near to trying them all out.)</p>
<pre class="r"><code>dtfunc <- function(dx) {
dx[, prev12 := Reduce(`+`, shift(event, 1:12)), by=id]
map <- CJ(id=1:N, start=13L, end=36L, event=1L, statusx=1L, prev12x=1L)
ans <- dx[map, on=.(id, event, status > statusx, prev12 > prev12x, month >= start, month <= end),
.I, allow=TRUE, by=.EACHI, nomatch=0L][, .(id, I)]
minans <- ans[, .(I=min(I)), by=id]
dx <- dx[, cohort5 := 0L][minans, cohort5 := min(month) - 1L + dx$month[I], on="id", by=.EACHI]
return(dx)
}
system.time(dtfunc(dt))</code></pre>
<pre><code>## user system elapsed
## 0.195 0.011 0.210</code></pre>
<p>And here is a more complete comparison of the fastest version with this additional approach:</p>
<pre class="r"><code>benchmark(
dt[, cohort6 := cAssignCohort(month, status, event, # combined data.table/Rcpp
startMonth=13, thresholdNum = 2), keyby=id],
dt2 <- dtfunc(dt),
replications = 5,
columns = c("replications", "elapsed", "relative"))</code></pre>
<pre><code>## replications elapsed relative
## 1 5 0.112 1.000
## 2 5 0.899 8.027</code></pre>
</div>