This site is a compendium of R code meant to highlight the various uses of simulation to aid in the understanding of probability, statistics, and study design. I frequently draw on examples using my R package simstudy. Occasionally, I opine on other topics related to causal inference, evidence, and research more generally.

In regression, we assume noise is independent of all measured predictors. What happens if it isn't?

A number of key assumptions underlie the linear regression model - among them linearity and normally distributed noise (error) terms with constant variance In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model.

In this simple model:

\[Y_i = \beta_0 + \beta_1X_i + e_i,\]

\(Y_i\) has both a structural and stochastic (random) component. The structural component is the linear relationship of \(Y\) with \(X\). The random element is often called the \(error\) term, but I prefer to think of it as \(noise\). \(e_i\) is not measuring something that has gone awry, but rather it is variation emanating from some unknown, unmeasurable source or sources for each individual \(i\). It represents everything we haven’t been able to measure.

[Read More]
R 

simstudy update: improved correlated binary outcomes

An updated version of the simstudy package (0.1.10) is now available on CRAN. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last post, I described a beta-binomial data generating process that uses the recently added beta distribution. In addition to that update, I’ve added functionality to genCorGen and addCorGen, functions which generate correlated data from non-Gaussian or normally distributed data such as Poisson, Gamma, and binary data. Most significantly, there is a newly implemented algorithm based on the work of Emrich & Piedmonte, which I mentioned the last time around.

[Read More]
R 

Binary, beta, beta-binomial

I’ve been working on updates for the simstudy package. In the past few weeks, a couple of folks independently reached out to me about generating correlated binary data. One user was not impressed by the copula algorithm that is already implemented. I’ve added an option to use an algorithm developed by Emrich and Piedmonte in 1991, and will be incorporating that option soon in the functions genCorGen and addCorGen. I’ll write about that change some point soon.

[Read More]
R 

The power of stepped-wedge designs

Just before heading out on vacation last month, I put up a post that purported to compare stepped-wedge study designs with more traditional cluster randomized trials. Either because I rushed or was just lazy, I didn’t exactly do what I set out to do. I did confirm that a multi-site randomized clinical trial can be more efficient than a cluster randomized trial when there is variability across clusters. (I compared randomizing within a cluster with randomization by cluster.) But, this really had nothing to with stepped-wedge designs.

[Read More]
R 

Multivariate ordinal categorical data generation

An economist contacted me about the ability of simstudy to generate correlated ordinal categorical outcomes. He is trying to generate data as an aide to teaching cost-effectiveness analysis, and is hoping to simulate responses to a quality-of-life survey instrument, the EQ-5D. The particular instrument has five questions related to mobility, self-care, activities, pain, and anxiety. Each item has three possible responses: (1) no problems, (2) some problems, and (3) a lot of problems. Although the instrument has been designed so that each item is orthogonal (independent) from the others, it is impossible to avoid correlation. So, in generating (and analyzing) these kinds of data, it is important to take this into consideration.

[Read More]
R 

Randomize by, or within, cluster?

I am involved with a stepped-wedge designed study that is exploring whether we can improve care for patients with end-stage disease who show up in the emergency room. The plan is to train nurses and physicians in palliative care. (A while ago, I described what the stepped wedge design is.)

Under this design, 33 sites around the country will receive the training at some point, which is no small task (and fortunately as the statistician, this is a part of the study I have little involvement). After hearing about this ambitious plan, a colleague asked why we didn’t just randomize half the sites to the intervention and conduct a more standard cluster randomized trial, where a site would either get the training or not. I quickly simulated some data to see what we would give up (or gain) if we had decided to go that route. (It is actually a moot point, since there would be no way to simultaneously train 16 or so sites, which is why we opted for the stepped-wedge design in the first place.)

[Read More]
R 

How the odds ratio confounds: a brief study in a few colorful figures

The odds ratio always confounds: while it may be constant across different groups or clusters, the risk ratios or risk differences across those groups may vary quite substantially. This makes it really hard to interpret an effect. And then there is inconsistency between marginal and conditional odds ratios, a topic I seem to be visiting frequently, most recently last month.

My aim here is to generate a few figures that might highlight some of these issues.

[Read More]
R 

Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. My collaborator suspected that the relationship between EK and academics would be different for immigrant and non-immigrant children, so we agreed that this would be a key focus of the analysis.

[Read More]
R 

Late anniversary edition redux: conditional vs marginal models for clustered data

This afternoon, I was looking over some simulations I plan to use in an upcoming lecture on multilevel models. I created these examples a while ago, before I started this blog. But since it was just about a year ago that I first wrote about this topic (and started the blog), I thought I’d post this now to mark the occasion.

The code below provides another way to visualize the difference between marginal and conditional logistic regression models for clustered data (see here for an earlier post that discusses in greater detail some of the key issues raised here.) The basic idea is that both models for a binary outcome are valid, but they provide estimates for different quantities.

[Read More]
R 

A little function to help generate ICCs in simple clustered data

In health services research, experiments are often conducted at the provider or site level rather than the patient level. However, we might still be interested in the outcome at the patient level. For example, we could be interested in understanding the effect of a training program for physicians on their patients. It would be very difficult to randomize patients to be exposed or not to the training if a group of patients all see the same doctor. So the experiment is set up so that only some doctors get the training and others serve as the control; we still compare the outcome at the patient level.

[Read More]
R