This site is a compendium of R code meant to highlight the various uses of simulation to aid in the understanding of probability, statistics, and study design. I frequently draw on examples using my R package `simstudy`

. Occasionally, I opine on other topics related to causal inference, evidence, and research more generally.

As a biostatistician, I like to be involved in the design of a study as early as possible. I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis.
[Read More]

## Yes, unbalanced randomization can improve power, in some situations

Last time I provided some simulations that suggested that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test.
[Read More]

## Can unbalanced randomization improve power?

Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.
[Read More]

## When you want more than a chi-squared test, consider a measure of association

In my last post, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the \(\chi^2\) statistic.
I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal.
[Read More]

## Alternatives to reporting a p-value: the case of a contingency table

I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the American Statistical Association has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. In 2016, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.
[Read More]

## Clustered randomized trials and the design effect

I am always saying that simulation can help illuminate interesting statistical concepts or ideas. The design effect that underlies much of clustered analysis is could benefit from a little exploration through simulation. I’ve written about clustered-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect.
[Read More]

## Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes

I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model.
[Read More]

## A brief account (via simulation) of the ROC (and its AUC)

The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC), quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.
A prediction problem The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease.
[Read More]

## Repeated measures can improve estimation when we only care about a single endpoint

I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.
There is one issue that should be fairly generalizable to other studies.
[Read More]

## Adding a "mixture" distribution to the simstudy package

I am contemplating adding a new distribution option to the package simstudy that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new mixture option is to step through a few examples and see it in action.
Specifying the “mixture” distribution As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities.
[Read More]