This site is a compendium of R code meant to highlight the various uses of simulation to aid in the understanding of probability, statistics, and study design. I frequently draw on examples using my R package `simstudy`

. Occasionally, I opine on other topics related to causal inference, evidence, and research more generally.

About two years ago, someone inquired whether simstudy had the functionality to generate data from a logistic model with a specific AUC. It did not, but now it does, thanks to a paper by Peter Austin that describes a nice algorithm to accomplish this. The paper actually describes a series of related algorithms for generating coefficients that target specific prevalence rates, risk ratios, and risk differences, in addition the AUC.
[Read More]

## A demo of power estimation by simulation for a cluster randomized trial with a time-to-event outcome

A colleague reached out for help designing a cluster randomized trial to evaluate a clinical decision support tool for primary care physicians (PCPs), which aims to improve care for high-risk patients. The outcome will be a time-to-event measure, collected at the patient level. The unit of randomization will be the PCP, and one of the key design issues is settling on the number to randomize.
[Read More]

## Generating variable cluster sizes to assess power in cluster randomized trials

In recent discussions with a number of collaborators at the NIA IMPACT Collaboratory about setting the sample size for a proposed cluster randomized trial, the question of variable cluster sizes has come up a number of times. Given a fixed overall sample size, it is generally better (in terms of statistical power) if the sample is equally distributed across the different clusters; highly variable cluster sizes increase the standard errors of effect size estimates and reduce the ability to determine if an intervention or treatment is effective.
[Read More]

## Implementing a one-step GEE algorithm for very large cluster sizes in R

Very large data sets can present estimation problems for some statistical models, particularly ones that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models that are used when individual observations are correlated within groups can have severe computation challenges when the cluster sizes get too large. GEE are often used when repeated measures for an individual are collected over time; the individual is considered the cluster in this analysis.
[Read More]

## simstudy 0.6.0 released: more flexible correlation patterns

The new version (0.6.0) of simstudy is available for download from CRAN. In addition to some important bug fixes, I’ve added new functionality that should make data generation with correlated data a little more flexible. In the previous post, I described enhancements to the function genCorMat. As part of this release announcement, I’m describing blockExchangeMat and blockDecayMat, two new functions that can be used to generate correlation matrices when there is a temporal element to the data generation.
[Read More]

## Flexible correlation generation: an update to genCorMat in simstudy

I’ve been slowly working on some updates to simstudy, focusing mostly on the functionality to generate correlation matrices (which can be used to simulate correlated data). Here, I’m briefly describing the function genCorMat, which has been updated to facilitate the generation of correlation matrices for clusters of different sizes and with potentially different correlation coefficients.
I’ll briefly describe what the existing function can currently do, and then give an idea about what the enhancements will provide.
[Read More]

## A GAM for time trends in a stepped-wedge trial with a binary outcome

In a previous post, I described some ways one might go about analyzing data from a stepped-wedge, cluster-randomized trial using a generalized additive model (a GAM), focusing on continuous outcomes. I have spent the past few weeks developing a similar model for a binary outcome, and have started to explore model comparison and methods to evaluate goodness-of-fit. The following describes some of my thought process.
Data generation The data generation process I am using here follows along pretty closely with the earlier post, except, of course, the outcome has changed from continuous to binary.
[Read More]

## Modeling the secular trend in a stepped-wedge design

Recently I started a discussion about modeling secular trends using flexible models in the context of cluster randomized trials. I’ve been motivated by a trial I am involved with that is using a stepped-wedge study design. The initial post focused on more standard parallel designs; here, I want to extend the discussion explicitly to the stepped-wedge design.
The stepped-wedge design Stepped-wedge designs are a special class of cluster randomized trial where each cluster is observed in both treatment arms (as opposed to the classic parallel design where only some of the clusters receive the treatment).
[Read More]

## Generating clustered data with marginal correlations

A student is working on a project to derive an analytic solution to the problem of sample size determination in the context of cluster randomized trials and repeated individual-level measurement (something I’ve thought a little bit about before). Though the goal is an analytic solution, we do want confirmation with simulation. So, I was a little disheartened to discover that the routines I’d developed in simstudy for this were not quite up to the task.
[Read More]

## Modeling the secular trend in a cluster randomized trial using very flexible models

A key challenge - maybe the key challenge - of a stepped wedge clinical trial design is the threat of confounding by time. This is a cross-over design where the unit of randomization is a group or cluster, where each cluster begins in the control state and transitions to the intervention. It is the transition point that is randomized. Since outcomes could be changing over time regardless of the intervention, it is important to model the time trends when conducting the efficacy analysis.
[Read More]