A couple of weeks ago, I was inspired by a study to write about a classic design issue that arises in cluster randomized trials: should we focus on the number of clusters or the size of those clusters? That trial, which aims to prevent opioid use disorder among at-risk patients in primary care clinics, has also motivated this second post, which takes up another important issue: over-dispersion.
A count outcome
In this study, one of the primary outcomes is the number of days of opioid use over a six-month follow-up period (to be recorded monthly by patient report and aggregated for the six-month measure). While one might get away with assuming that the outcome is continuous, it really is not; it is a count outcome, and the possible range is 0 to 180. There are two related questions here: what model will be used to analyze the data once the study is complete? And how should we generate simulated data to estimate the power of the study?
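Just to make the second question concrete, here is a minimal sketch of one way to simulate a bounded, over-dispersed count like this: a beta-binomial, where each patient's probability of use is drawn from a beta distribution and the number of days is binomial out of 180. The sample size and parameter values below are made up purely for illustration; this is not the study's actual data generating process.

```r
set.seed(123)

n <- 1000        # hypothetical number of patients
days_max <- 180  # six-month follow-up

mu <- 0.15       # assumed mean proportion of days using
disp <- 10       # smaller values -> more over-dispersion

# subject-specific probability of use, then days of use out of 180
p <- rbeta(n, shape1 = mu * disp, shape2 = (1 - mu) * disp)
y <- rbinom(n, size = days_max, prob = p)

c(mean = mean(y), var = var(y), binomial_var = days_max * mu * (1 - mu))
```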
What matters more in a cluster randomized trial: number or size?
I am involved with a trial of an intervention designed to prevent full-blown opioid use disorder for patients who may have an incipient opioid use problem. Given the nature of the intervention, it was clear that the only feasible way to conduct this particular study was to randomize at the physician rather than the patient level.
There was a concern that the number of patients eligible for the study might be limited, so that each physician might only have a handful of patients able to participate, if that many. A question arose: can we make up for this limitation by increasing the number of physicians who participate? That is, what is the trade-off between the number of clusters and cluster size?
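One quick way to get a feel for that trade-off (my own back-of-the-envelope illustration, not an analysis from the trial) is the standard design effect for a cluster randomized trial, \(1 + (m - 1)\rho\), where \(m\) is the cluster size and \(\rho\) is the intra-class correlation. The sketch below assumes an ICC of 0.05 and compares effective sample sizes for a few hypothetical combinations of physicians and patients per physician.

```r
icc <- 0.05                                # assumed intra-class correlation

designs <- expand.grid(
  k = c(20, 40, 80),                       # number of physicians (clusters)
  m = c(5, 10, 20)                         # patients per physician
)

designs$n_total <- designs$k * designs$m
designs$deff <- 1 + (designs$m - 1) * icc  # design effect
designs$n_effective <- designs$n_total / designs$deff

designs[order(-designs$n_effective), ]
```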
Musings on missing data
I’ve been meaning to share an analysis I recently did to estimate the strength of the relationship between a young child’s ability to recognize emotions in others (e.g., teachers and fellow students) and her longer-term academic success. The study itself is quite interesting (hopefully it will be published sometime soon), but I really wanted to write about it here as it involved the challenging problem of missing data in the context of heterogeneous effects (different across sub-groups) and clustering (by schools).
A case where prospective matching may limit bias in a randomized trial
Analysis is important, but study design is paramount. I am involved with the Diabetes Research, Education, and Action for Minorities (DREAM) Initiative, which is, among other things, estimating the effect of a group-based therapy program on weight loss for patients who have been identified as pre-diabetic (which means they have elevated HbA1c levels). The original plan was to randomize patients at a clinic to treatment or control, and then follow up with those assigned to the treatment group to see if they wanted to participate. The primary outcome will be measured using medical records, so those randomized to control (which basically means nothing special happens to them) will not need to interact with the researchers in any way.
An example in causal inference designed to frustrate: an estimate pretty much guaranteed to be biased
I am putting together a brief lecture introducing causal inference for graduate students studying biostatistics. As part of this lecture, I thought it would be helpful to spend a little time describing directed acyclic graphs (DAGs), since they are an extremely useful tool for communicating assumptions about the causal relationships underlying a researcher’s data.
The strength of DAGs is that they help us think about how these underlying relationships in the data might lead to biases in causal effect estimation, and they suggest ways to estimate causal effects that eliminate these biases. (For a real introduction to DAGs, you could take a look at this paper by Greenland, Pearl, and Robins or, better yet, Part I of this book on causal inference by Hernán and Robins.)
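To make the bias concrete, a toy simulation along these lines might help (this is my own sketch, not the lecture's actual example): \(U\) is a common cause of an exposure \(A\) and an outcome \(Y\), so the unadjusted estimate of the effect of \(A\) is confounded, while adjusting for \(U\) recovers the true effect.

```r
set.seed(42)

n <- 10000
U <- rnorm(n)                          # common cause of A and Y
A <- rbinom(n, 1, plogis(0.8 * U))     # exposure depends on U
Y <- 0.5 * A + 1.0 * U + rnorm(n)      # true effect of A is 0.5

coef(lm(Y ~ A))["A"]                   # confounded (biased) estimate
coef(lm(Y ~ A + U))["A"]               # adjusting for U recovers ~0.5
```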
Using the uniform sum distribution to introduce probability
I’ve never taught an intro probability/statistics course. If I ever did, I would certainly want to bring the underlying wonder of the subject to life. I’ve always found it almost magical the way mathematical formulation can be mirrored by computer simulation, the way proof can be guided by observed data generation processes (DGPs), and the way DGPs can confirm analytic solutions.
I would like to begin such a course with a somewhat unusual but accessible problem that would evoke these themes from the start. The concepts would not necessarily be immediately comprehensible, but rather would pique the interest of the students.
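The uniform sum (Irwin-Hall) distribution in the title is exactly that kind of problem, and a quick simulation gives the flavor. This sketch is only my guess at how it might be set up: summing twelve Uniform(0, 1) draws, comparing the sample mean and variance to the analytic \(n/2\) and \(n/12\), and seeing how close to normal the distribution already looks.

```r
set.seed(2019)

n_unif <- 12
sums <- replicate(10000, sum(runif(n_unif)))   # 10,000 draws of the uniform sum

c(mean = mean(sums), var = var(sums))          # compare with n/2 = 6 and n/12 = 1

hist(sums, breaks = 50, freq = FALSE, main = "Sum of 12 uniforms")
curve(dnorm(x, mean = n_unif / 2, sd = sqrt(n_unif / 12)), add = TRUE)
```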
Correlated longitudinal data with varying time intervals
I was recently contacted to see if simstudy can create a data set of correlated outcomes that are measured over time, but at different intervals for each individual. The quick answer is that there is no specific function to do this. However, if you are willing to assume an “exchangeable” correlation structure, where measurements far apart in time are just as correlated as measurements taken close together, then you could just generate individual-level random effects (intercepts and/or slopes) and pretty much call it a day. Unfortunately, the researcher had something more challenging in mind: he wanted to generate auto-regressive correlation, so that proximal measurements are more strongly correlated than distal measurements.
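For reference, here is a minimal base-R sketch of the easy case described above (random intercepts inducing an exchangeable correlation, with each person measured at their own irregular times); it is my own illustration rather than simstudy code, and all parameter values are arbitrary.

```r
set.seed(7)

n_id <- 200
b0 <- rnorm(n_id, 0, sqrt(2))                   # individual random intercepts (variance 2)

dd <- do.call(rbind, lapply(1:n_id, function(i) {
  times <- sort(runif(sample(3:6, 1), 0, 12))   # 3-6 irregularly spaced visits per person
  data.frame(id = i,
             time = times,
             y = 5 + 0.25 * times + b0[i] + rnorm(length(times)))
}))

head(dd)
# implied within-person correlation: 2 / (2 + 1), about 0.67, regardless of spacing
```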
Considering sensitivity to unmeasured confounding: part 2
In part 1 of this 2-part series, I introduced the notion of sensitivity to unmeasured confounding in the context of an observational data analysis. I argued that an estimate of an association between an observed exposure \(D\) and outcome \(Y\) is sensitive to unmeasured confounding if we can conceive of a reasonable alternative data generating process (DGP) that includes some unmeasured confounder and that will generate the same distribution as the observed data. I further argued that reasonableness can be quantified or parameterized by the two correlation coefficients \(\rho_{UD}\) and \(\rho_{UY}\), which measure the strength of the relationship of the unmeasured confounder \(U\) with each of the observed measures. Alternative DGPs characterized by high correlation coefficients can be viewed as less realistic, in which case the estimate could be considered less sensitive to unmeasured confounding. On the other hand, if DGPs characterized by lower correlation coefficients can reproduce the observed data, the estimate would be considered more sensitive.
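As a quick reminder of what that setup looks like in data (the parameter values here are mine, chosen only for illustration): an unmeasured \(U\) correlated with both \(D\) and \(Y\) can produce an apparent association between \(D\) and \(Y\) even when the true effect of \(D\) is zero.

```r
set.seed(11)

n <- 5000
U <- rnorm(n)
D <- 0.6 * U + rnorm(n, 0, sqrt(1 - 0.6^2))   # induces rho_UD of about 0.6
Y <- 0.5 * U + rnorm(n, 0, sqrt(1 - 0.5^2))   # true effect of D on Y is 0

round(c(rho_UD = cor(U, D), rho_UY = cor(U, Y)), 2)
coef(lm(Y ~ D))["D"]   # non-zero "effect" generated entirely by U
```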
Considering sensitivity to unmeasured confounding: part 1
Principled causal inference methods can be used to compare the effects of different exposures or treatments we have observed in non-experimental settings. These methods, which include matching (with or without propensity scores), inverse probability weighting, and various g-methods, help us create comparable groups to simulate a randomized experiment. All of these approaches rely on a key assumption of no unmeasured confounding. The problem is, short of subject matter knowledge, there is no way to test this assumption empirically.
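To fix ideas, here is a bare-bones sketch of one of those approaches, inverse probability weighting, applied to simulated data in which the single confounder \(X\) happens to be fully measured; the data generating process and parameter values are invented for illustration only.

```r
set.seed(3)

n <- 5000
X <- rnorm(n)                                  # a measured confounder
D <- rbinom(n, 1, plogis(0.7 * X))             # exposure depends on X
Y <- 1.0 * D + 0.8 * X + rnorm(n)              # true effect of D is 1

ps <- glm(D ~ X, family = binomial)$fitted.values
w <- ifelse(D == 1, 1 / ps, 1 / (1 - ps))      # inverse probability weights

coef(lm(Y ~ D))["D"]                           # unadjusted (confounded) estimate
coef(lm(Y ~ D, weights = w))["D"]              # IPW estimate, close to 1
```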
Parallel processing to add a little zip to power simulations (and other replication studies)
It’s always nice to be able to speed things up a bit. My first blog post ever described an approach using Rcpp to make huge improvements in a particularly intensive computational process. Here, I want to show how simple it is to speed things up by using the R package parallel and its function mclapply. I’ve been using this function more and more, so I want to explicitly demonstrate it in case anyone is wondering.
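To show the basic pattern (this is my own toy example, not the code from the post): each replication simulates a small two-arm trial and records whether a t-test rejects, and mclapply simply spreads the replications across cores. Note that mc.cores greater than 1 relies on forking and so is not available on Windows.

```r
library(parallel)

# one replication: simulate a two-arm trial and record whether the t-test rejects
one_rep <- function(i, n = 100, delta = 0.4) {
  y0 <- rnorm(n, 0, 1)
  y1 <- rnorm(n, delta, 1)
  t.test(y1, y0)$p.value < 0.05
}

res <- mclapply(1:1000, one_rep, mc.cores = 4)
mean(unlist(res))   # estimated power
```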