I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. So, in the end, this turned into a bit more than a brief service announcement.
This site is a compendium of R code meant to highlight the various uses of simulation to aid in the understanding of probability, statistics, and study design. I frequently draw on examples using my R package simstudy. Occasionally, I opine on other topics related to causal inference, evidence, and research more generally.
Complier average causal effect? Exploring what we learn from an RCT with participants who don't do what they are told
Inspired by a free online course titled Complier Average Causal Effects (CACE) Analysis and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming. And I end by describing two simple methods to estimate the CACE, which we can compare to the truth (since this is a simulation); next time, I will describe a third way.
[Read More]Further considerations of a hidden process underlying categorical responses
In my previous post, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have multiple outcomes, but with a slight twist - these groups of interest can be interpreted in an ordered way. This conceptual latent process can provide another perspective on the models that are typically applied to analyze these types of outcomes.
[Read More]A hidden process behind binary or other categorical outcomes?
I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.) We opted for a question that was based on 30-day meat consumption: none, 1-3 times per month, 1 time per week, etc. - six groups in total. The question was how best to evaluate effectiveness of the intervention?
[Read More]Be careful not to control for a post-exposure covariate
A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also include a measurement for poverty in adulthood - believing it was somehow confounding the relationship of trauma and substance use. A confounder is a common cause of an exposure/treatment and an outcome; it is hard to conceive of adult poverty as a cause of childhood events, even though it might be related to adult substance use (or maybe not). At best, controlling for adult poverty has no impact on the conclusions of the research; less good, though, is the possibility that it will lead to the conclusion that the effect of trauma is less than it actually is.
[Read More]Should we be concerned about incidence - prevalence bias?
Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results - patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. We might confuse an artificial effect as an intervention effect.
[Read More]Using simulation for power analysis: an example based on a stepped wedge study design
Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which can be a bit painful. My goal here is to show that at least writing the code need not be overwhelming.
[Read More]simstudy update: two new functions that generate correlated observations from non-normal distributions
In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, poisson, gamma, and uniform distributions: genCorGen and addCorGen. Here’s a brief intro to these functions.
Balancing on multiple factors when the sample is too small to stratify
Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.
[Read More]Copulas and correlated data generation: getting beyond the normal distribution
Using the simstudy package, it’s possible to generate correlated data from a normal distribution using the function genCorData. I’ve wanted to extend the functionality so that we can generate correlated data from other sorts of distributions; I thought it would be a good idea to begin with binary and Poisson distributed data, since those come up so frequently in my work. simstudy can already accommodate more general correlated data, but only in the context of a random effects data generation process. This might not be what we want, particularly if we are interested in explicitly generating data to explore marginal models (such as a GEE model) rather than a conditional random effects model (a topic I explored in my previous discussion). The extension can quite easily be done using copulas.