R on ouR data generation

https://www.rdatagen.net/tags/r/
Recent content in R on ouR data generation
Hugo -- gohugo.io
keith.goldfeld@nyumc.org (Keith Goldfeld)
Tue, 01 Apr 2025 00:00:00 +0000

Bayesian proportional hazards model for a stepped-wedge design
https://www.rdatagen.net/post/2025-04-01-bayesian-proportional-hazards-model-for-a-stepped-wedge-design/
Tue, 01 Apr 2025 00:00:00 +0000
We’ve finally reached the end of the road. This is the fifth and last post in a series building up to a Bayesian proportional hazards model for analyzing a stepped-wedge cluster-randomized trial. If you are just joining in, you may want to start at the beginning. The model presented here integrates non-linear time trends and cluster-specific random effects, elements we’ve previously explored in isolation. There’s nothing fundamentally new in this post; it brings everything together.

A Bayesian proportional hazards model for a cluster randomized trial
https://www.rdatagen.net/post/2025-03-25-a-bayesian-proportional-hazards-model-for-a-cluster-randomized-trial/
Tue, 25 Mar 2025 00:00:00 +0000
In recent posts, I introduced a Bayesian approach to proportional hazards modeling and then extended it to incorporate a penalized spline. (There was also a third post on handling ties when multiple individuals share the same event time.) This post describes another extension: a random effect to account for clustering in a cluster randomized trial. With this in place, I’ll be ready to tackle the final step: building a model for analyzing a stepped-wedge cluster-randomized trial that incorporates both splines and site-specific random effects.

Accounting for ties in a Bayesian proportional hazards model
https://www.rdatagen.net/post/2025-03-20-bayesian-survival-model-that-can-appropriately-handle-ties/
Thu, 20 Mar 2025 00:00:00 +0000
Over my past few posts, I’ve been progressively building towards a Bayesian model for a stepped-wedge cluster randomized trial with a time-to-event outcome, where time will be modeled using a spline function. I started with a simple Cox proportional hazards model for a traditional RCT, ignoring time as a factor. In the next post, I introduced a nonlinear time effect. For the third post (one I initially thought was ready to publish), I extended the model to a cluster randomized trial without explicitly incorporating time.

A Bayesian proportional hazards model with a penalized spline
https://www.rdatagen.net/post/2025-03-04-a-bayesian-proportional-hazards-model-with-splines/
Tue, 04 Mar 2025 00:00:00 +0000
In my previous post, I outlined a Bayesian approach to proportional hazards modeling. This post serves as an addendum, providing code to incorporate a spline to model a time-varying hazard ratio nonlinearly. In a second addendum to come, I will present a separate model with a site-specific random effect, essential for a cluster-randomized trial. These components lay the groundwork for analyzing a stepped-wedge cluster-randomized trial, where both splines and site-specific random effects will be integrated into a single model.

Estimating a Bayesian proportional hazards model
https://www.rdatagen.net/post/2025-02-11-estimating-a-bayesian-proportional-hazards-model/
Tue, 11 Feb 2025 00:00:00 +0000
A recent conversation with a colleague about a large stepped-wedge cluster randomized trial (SW-CRT) piqued my interest, because the primary outcome is time-to-event. This is not something I’ve seen before. A quick dive into the literature suggested that time-to-event outcomes are uncommon in SW-CRTs, and that the best analytic approach is not obvious. I was intrigued by how to analyze the data to estimate a hazard ratio while accounting for clustering and potential secular trends that might influence the time to the event.

Thinking about covariates in an analysis of an RCT
https://www.rdatagen.net/post/2025-01-28-handling-covariates-in-an-analysis-of-an-rct/
Tue, 28 Jan 2025 00:00:00 +0000
I was recently discussing the analytic plan for a randomized controlled trial (RCT) with a clinical collaborator when she asked whether it’s appropriate to adjust for pre-specified baseline covariates. This question is so interesting because it touches on fundamental issues of inference, both causal and statistical. What is the target estimand in an RCT; that is, what effect are we actually measuring? What do we hope to learn from the specific sample recruited for the trial (i.

Can ChatGPT help construct non-trivial statistical models? An example with Bayesian "random" splines
https://www.rdatagen.net/post/2024-10-08-can-chatgpt-help-construct-non-trivial-bayesian-models-with-cluster-specific-splines/
Tue, 08 Oct 2024 00:00:00 +0000
I’ve been curious to see how helpful ChatGPT can be for implementing relatively complicated models in R. About two years ago, I described a model for estimating a treatment effect in a cluster-randomized stepped wedge trial. We used a generalized additive model (GAM) with site-specific splines to account for general time trends, implemented using the mgcv package. I’ve been interested in exploring a Bayesian version of this model, but hadn’t found the time to try - until I happened to pose this simple question to ChatGPT:

An IV study design to estimate an effect size when randomization is not ethical
https://www.rdatagen.net/post/2024-09-03-an-instrumental-variable-study-design-to-estimate-an-effect-size-when-randomization-may-not-be-ethical/
Tue, 03 Sep 2024 00:00:00 +0000
An investigator I frequently consult with seeks to estimate the effect of a palliative care treatment protocol for patients nearing end-stage disease, compared to a more standard, though potentially overly burdensome, therapeutic approach. Ideally, we would conduct a two-arm randomized clinical trial (RCT) to create comparable groups and obtain an unbiased estimate of the intervention effect. However, in this case, it may be considered unethical to randomize patients to a non-standard protocol.

Generating binary data by specifying the relative risk, with simulations
https://www.rdatagen.net/post/2024-07-02-generating-binary-data-by-specifying-relative-risk/
Tue, 02 Jul 2024 00:00:00 +0000
The most traditional approach for analyzing binary outcome data is logistic regression, where the estimated parameters are interpreted as log odds ratios or, if exponentiated, as odds ratios (ORs). No one other than statisticians (and maybe not even statisticians) finds the odds ratio to be a very intuitive statistic, and many feel that a risk difference or a risk ratio/relative risk (RR) is much more interpretable. Indeed, there seems to be a strong belief that readers will, more often than not, interpret odds ratios as risk ratios.

simstudy: another way to generate data from a non-standard density
https://www.rdatagen.net/post/2024-06-04-simstudy-another-way-to-generate-data-from-a-non-standard-density/
Tue, 04 Jun 2024 00:00:00 +0000
One of my goals for the simstudy package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDist and addDataDist, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download latest development version.

simstudy 0.8.0: customized distributions
https://www.rdatagen.net/post/2024-05-21-simstudy-customized-distributions/
Tue, 21 May 2024 00:00:00 +0000
Over the past few years, a number of folks have asked if simstudy accommodates customized distributions. There’s been interest in truncated, zero-inflated, or even more standard distributions that haven’t been implemented in simstudy. While I’ve come up with approaches for some of the specific cases, I was never able to develop a general solution that could provide broader flexibility. This shortcoming changes with the latest version of simstudy, now available on CRAN.

simstudy enhancement: specifying idiosyncratic follow-up times for longitudinal data
https://www.rdatagen.net/post/2024-04-16-simstudy-update-specifying-idiosyncratic-follow-up-times-for-longitudinal-data/
Tue, 16 Apr 2024 00:00:00 +0000
A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are \(\{0, 1, 2, \ldots, n - 1\}\), where there are \(n\) total periods. However, when follow-up periods required more specificity, such as \(\{0, 90, 180, 365\}\) days from baseline, users had to manually add them.

Perfectly balanced treatment arm distribution in a multifactorial CRT using stratified randomization
https://www.rdatagen.net/post/2024-02-20-ensuring-balance-with-a-cluster-randomized-factorial-design/
Tue, 20 Feb 2024 00:00:00 +0000
Over two years ago, I wrote a series of posts (starting here) that described possible analytic approaches for a proposed cluster-randomized trial with a factorial design. That proposal was recently funded by NIA/NIH, and now the Emergency departments leading the transformation of Alzheimer’s and dementia care (ED-LEAD) trial is just getting underway. Since the trial is in its early planning phase, I am starting to think about how we will do the randomization, and I’m sharing some of those thoughts (and code) here.

A three-arm trial using two-step randomization
https://www.rdatagen.net/post/2023-12-19-a-three-arm-trial-using-two-step-randomization/
Tue, 19 Dec 2023 00:00:00 +0000
Clinical Decision Support (CDS) tools are systems created to support clinical decision-making. Health care professionals using these tools can get guidance about diagnostic and treatment options when providing care to a patient. I’m currently involved with designing a trial focused on comparing a standard CDS tool with an enhanced version (CDS+). The main goal is to directly compare patient-level outcomes for those who have been exposed to the different versions of the CDS.

Creating a nice looking Table 1 with standardized mean differences
https://www.rdatagen.net/post/2023-09-26-nice-looking-table-1-with-standardized-mean-difference/
Tue, 26 Sep 2023 00:00:00 +0000
I’m in the middle of a perfect storm, winding down three randomized clinical trials (RCTs), with patient recruitment long finished and data collection all wrapped up. This means a lot of data analysis, presentation prep, and paper writing (and not so much blogging). One common (and not so glamorous) thread cutting across all of these RCTs is the need to generate a Table 1, the comparison of baseline characteristics that convinces readers that randomization worked its magic (i.

Finding logistic models to generate data with desired risk ratio, risk difference and AUC profiles
https://www.rdatagen.net/post/2023-06-20-finding-coefficients-for-logistic-models-that-generate-data-with-desired-characteristics/
Tue, 20 Jun 2023 00:00:00 +0000
About two years ago, someone inquired whether simstudy had the functionality to generate data from a logistic model with a specific AUC. It did not, but now it does, thanks to a paper by Peter Austin that describes a nice algorithm to accomplish this. The paper actually describes a series of related algorithms for generating coefficients that target specific prevalence rates, risk ratios, and risk differences, in addition to the AUC.

A demo of power estimation by simulation for a cluster randomized trial with a time-to-event outcome
https://www.rdatagen.net/post/2023-05-23-just-a-simple-demo-power-estimates-for-cluster-randomized-trial-with-time-to-event-outcome/
Tue, 23 May 2023 00:00:00 +0000
A colleague reached out for help designing a cluster randomized trial to evaluate a clinical decision support tool for primary care physicians (PCPs), which aims to improve care for high-risk patients. The outcome will be a time-to-event measure, collected at the patient level. The unit of randomization will be the PCP, and one of the key design issues is settling on the number to randomize.

Generating variable cluster sizes to assess power in cluster randomized trials
https://www.rdatagen.net/post/2023-04-18-generating-variable-cluster-sizes-to-assess-power-in-cluster-randomize-trials/
Tue, 18 Apr 2023 00:00:00 +0000
In recent discussions with a number of collaborators at the NIA IMPACT Collaboratory about setting the sample size for a proposed cluster randomized trial, the question of variable cluster sizes has come up a number of times. Given a fixed overall sample size, it is generally better (in terms of statistical power) if the sample is equally distributed across the different clusters; highly variable cluster sizes increase the standard errors of effect size estimates and reduce the ability to determine if an intervention or treatment is effective.

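The point made in the entry above, that unequal cluster sizes cost power at a fixed total sample size, can be illustrated with a commonly used design-effect approximation, DE = 1 + ((1 + cv^2) * m - 1) * ICC, where m is the mean cluster size and cv is the coefficient of variation of the cluster sizes. What follows is only a minimal Python sketch under that assumed approximation; the post itself works in R and by simulation, and the function names and example numbers here are hypothetical.

```python
def design_effect(mean_size: float, cv: float, icc: float) -> float:
    """Approximate design effect for a cluster randomized trial.

    Assumption (not taken from the post): the approximation
    DE = 1 + ((1 + cv^2) * mean_size - 1) * icc,
    where cv is the coefficient of variation of cluster sizes.
    With cv = 0 this reduces to the familiar 1 + (m - 1) * icc.
    """
    if mean_size <= 0 or cv < 0 or not 0.0 <= icc <= 1.0:
        raise ValueError("need mean_size > 0, cv >= 0, and 0 <= icc <= 1")
    return 1.0 + ((1.0 + cv ** 2) * mean_size - 1.0) * icc


def effective_sample_size(n_total: float, mean_size: float,
                          cv: float, icc: float) -> float:
    """Total sample size divided by the approximate design effect."""
    return n_total / design_effect(mean_size, cv, icc)


if __name__ == "__main__":
    # Hypothetical scenario: 1000 patients, mean cluster size 50, ICC 0.05.
    for cv in (0.0, 0.5, 1.0):
        de = design_effect(mean_size=50, cv=cv, icc=0.05)
        ess = effective_sample_size(1000, mean_size=50, cv=cv, icc=0.05)
        print(f"cv={cv:.1f}  design effect={de:.2f}  effective n={ess:.0f}")
```

Under this approximation the effective sample size shrinks as cv grows, which is the direction of the effect described in the entry; a simulation along the lines of the post would be the way to check it for a specific design.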
Implementing a one-step GEE algorithm for very large cluster sizes in R
https://www.rdatagen.net/post/2023-03-21-implementing-a-1-step-gee-with-large-cluster-sizes-in-r/
Tue, 21 Mar 2023 00:00:00 +0000
Very large data sets can present estimation problems for some statistical models, particularly ones that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models that are used when individual observations are correlated within groups can have severe computational challenges when the cluster sizes get too large. GEE are often used when repeated measures for an individual are collected over time; the individual is considered the cluster in this analysis.

simstudy 0.6.0 released: more flexible correlation patterns
https://www.rdatagen.net/post/2023-02-21-flexible-correlation-generation-revisiting-block-matrices-for-temporal-patterns-in-simstudy/
Tue, 21 Feb 2023 00:00:00 +0000
The new version (0.6.0) of simstudy is available for download from CRAN. In addition to some important bug fixes, I’ve added new functionality that should make data generation with correlated data a little more flexible. In the previous post, I described enhancements to the function genCorMat. As part of this release announcement, I’m describing blockExchangeMat and blockDecayMat, two new functions that can be used to generate correlation matrices when there is a temporal element to the data generation.

Flexible correlation generation: an update to genCorMat in simstudy
https://www.rdatagen.net/post/2023-02-14-flexible-correlation-generation-an-update-to-gencorgen-in-simstudy/
Tue, 14 Feb 2023 00:00:00 +0000
I’ve been slowly working on some updates to simstudy, focusing mostly on the functionality to generate correlation matrices (which can be used to simulate correlated data). Here, I’m briefly describing the function genCorMat, which has been updated to facilitate the generation of correlation matrices for clusters of different sizes and with potentially different correlation coefficients. I’ll briefly describe what the existing function can currently do, and then give an idea about what the enhancements will provide.

A GAM for time trends in a stepped-wedge trial with a binary outcome
https://www.rdatagen.net/post/2023-01-17-a-gam-model-for-time-trends-in-a-stepped-wedge-trial-with-a-binary-outcome/
Tue, 17 Jan 2023 00:00:00 +0000
In a previous post, I described some ways one might go about analyzing data from a stepped-wedge, cluster-randomized trial using a generalized additive model (a GAM), focusing on continuous outcomes. I have spent the past few weeks developing a similar model for a binary outcome, and have started to explore model comparison and methods to evaluate goodness-of-fit. The following describes some of my thought process.
Data generation
The data generation process I am using here follows along pretty closely with the earlier post, except, of course, the outcome has changed from continuous to binary.

Modeling the secular trend in a stepped-wedge design
https://www.rdatagen.net/post/2022-12-13-modeling-the-secular-trend-in-a-stepped-wedge-design/
Tue, 13 Dec 2022 00:00:00 +0000
Recently I started a discussion about modeling secular trends using flexible models in the context of cluster randomized trials. I’ve been motivated by a trial I am involved with that is using a stepped-wedge study design. The initial post focused on more standard parallel designs; here, I want to extend the discussion explicitly to the stepped-wedge design.
The stepped-wedge design
Stepped-wedge designs are a special class of cluster randomized trial where each cluster is observed in both treatment arms (as opposed to the classic parallel design where only some of the clusters receive the treatment).

Generating clustered data with marginal correlations
https://www.rdatagen.net/post/2022-11-22-generating-cluster-data-with-marginal-correlations/
Tue, 22 Nov 2022 00:00:00 +0000
A student is working on a project to derive an analytic solution to the problem of sample size determination in the context of cluster randomized trials and repeated individual-level measurement (something I’ve thought a little bit about before). Though the goal is an analytic solution, we do want confirmation with simulation. So, I was a little disheartened to discover that the routines I’d developed in simstudy for this were not quite up to the task.

Modeling the secular trend in a cluster randomized trial using very flexible models
https://www.rdatagen.net/post/2022-11-01-modeling-secular-trend-in-crt-using-gam/
Tue, 01 Nov 2022 00:00:00 +0000
A key challenge - maybe the key challenge - of a stepped wedge clinical trial design is the threat of confounding by time. This is a cross-over design where the unit of randomization is a group or cluster, where each cluster begins in the control state and transitions to the intervention. It is the transition point that is randomized. Since outcomes could be changing over time regardless of the intervention, it is important to model the time trends when conducting the efficacy analysis.

Presenting results for multinomial logistic regression: a marginal approach using propensity scores
https://www.rdatagen.net/post/2022-09-20-presenting-results-for-multinomial-logistic-regression-a-marginal-approach-using-propensity-scores/
Tue, 20 Sep 2022 00:00:00 +0000
Multinomial logistic regression modeling can provide an understanding of the factors influencing an unordered, categorical outcome. For example, if we are interested in identifying individual-level characteristics associated with political parties in the United States (Democratic, Republican, Libertarian, Green), a multinomial model would be a reasonable approach for estimating the strength of the associations. In the case of a randomized trial or epidemiological study, we might be primarily interested in the effect of a specific intervention or exposure while controlling for other covariates.

Flexible simulation in simstudy with customized distribution functions
https://www.rdatagen.net/post/2022-08-30-expanding-the-possibilities-of-simulation-in-simstudy-with-customized-distribution-funcdtions/
Tue, 30 Aug 2022 00:00:00 +0000
Really, the only problem with the simstudy package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 - see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as you are willing to provide some extra code.

Simulating data from a non-linear function by specifying a handful of points
https://www.rdatagen.net/post/2022-08-09-simulating-data-from-a-non-linear-function-by-specifying-some-points-on-the-curve/
Tue, 09 Aug 2022 00:00:00 +0000
Trying to simulate data with non-linear relationships can be frustrating, since there is not always an obvious mathematical expression that will give you the shape you are looking for. I’ve come up with a relatively simple solution for somewhat complex scenarios that only requires the specification of a few points that lie on or near the desired curve. (Clearly, if the relationships are straightforward, such as relationships that can easily be represented by quadratic or cubic polynomials, there is no need to go through all this trouble.

simstudy updated to version 0.5.0
https://www.rdatagen.net/post/2022-07-20-simstudy-updated-to-version-0-5-0/
Wed, 20 Jul 2022 00:00:00 +0000
A new version of simstudy is available on CRAN. There are two major enhancements and several new features. In the “major” category, I would include (1) changes to survival data generation that accommodate hazard ratios that can change over time, as well as competing risks, and (2) the addition of functions to allow users to sample from existing data sets with replacement to generate “synthetic” data with real-life distribution properties.

To impute or not: the case of an RCT with baseline and follow-up measurements
https://www.rdatagen.net/post/2022-04-12-to-impute-or-not-the-case-of-an-rct-with-baseline-and-follow-up-measurements/
Tue, 12 Apr 2022 00:00:00 +0000
Under normal conditions, conducting a randomized clinical trial is challenging. Throw in a pandemic and things like site selection, patient recruitment and patient follow-up can be particularly vexing. In any study, subjects need to be retained long enough so that outcomes can be measured; during a period when there are so many potential disruptions, this can become quite difficult. This issue of loss to follow-up recently came up during a conversation among a group of researchers who were troubleshooting challenges they are all experiencing in their ongoing trials.

Simulating time-to-event outcomes with non-proportional hazards
https://www.rdatagen.net/post/2022-03-29-simulating-non-proportional-hazards/
Tue, 29 Mar 2022 00:00:00 +0000
As I mentioned last time, I am working on an update of simstudy that will make generating survival/time-to-event data a bit more flexible. I previously presented the functionality related to competing risks, and this time I’ll describe generating survival data that has time-dependent hazard ratios. (As I mentioned last time, if you want to try this at home, you will need the development version of simstudy that you can install using devtools::install_github("kgoldfeld/simstudy").

Adding competing risks in survival data generation
https://www.rdatagen.net/post/2022-03-15-adding-competing-risks-in-survival-data-generation/
Tue, 15 Mar 2022 00:00:00 +0000
I am working on an update of simstudy that will make generating survival/time-to-event data a bit more flexible. There are two biggish enhancements. The first facilitates generation of competing events, and the second allows for the possibility of generating survival data that has time-dependent hazard ratios. This post focuses on the first enhancement, and a follow-up will provide examples of the second. (If you want to try this at home, you will need the development version of simstudy, which you can install using devtools::install_github("kgoldfeld/simstudy").

Follow-up: simstudy function for generating parameters for survival distribution
https://www.rdatagen.net/post/2022-02-22-follow-up-simstudy-function-for-generating-parameters-for-survival-distribution/
Tue, 22 Feb 2022 00:00:00 +0000
In the previous post I described how to determine the parameter values for generating a Weibull survival curve that reflects a desired distribution defined by two points along the curve. I went ahead and implemented these ideas in the development version of simstudy 0.4.0.9000, expanding the idea to allow for any number of points rather than just two. This post provides a brief overview of the approach, the code, and a simple example using the parameters to generate simulated data.

Simulating survival outcomes: setting the parameters for the desired distribution
https://www.rdatagen.net/post/2022-02-08-simulating-survival-outcomes-setting-the-parameters-for-the-desired-distribution/
Tue, 08 Feb 2022 00:00:00 +0000
The package simstudy has some functions that facilitate generating survival data using an underlying Weibull distribution. Originally, I added this to the package because I thought it would be interesting to try to do, and I figured it would be useful for me someday (and hopefully some others, as well). Well, now I am working on a project that involves evaluating at least two survival-type processes that are occurring simultaneously.

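The idea mentioned in the two entries above, choosing Weibull parameters so that the survival curve passes through two chosen points, can be sketched in a few lines. This is a hedged illustration in Python, not the simstudy implementation: it assumes the textbook parameterization S(t) = exp(-(t / scale)^shape), which is not the parameterization simstudy uses, and all function names and example values are hypothetical.

```python
import math
import random


def weibull_from_two_points(t1: float, p1: float, t2: float, p2: float):
    """Return (shape, scale) so that S(t1) = p1 and S(t2) = p2.

    Assumes S(t) = exp(-(t / scale) ** shape). Taking logs twice gives
    log(-log S(t)) = shape * (log t - log scale), which is linear in log t,
    so two points determine the slope (shape) and then the scale.
    """
    if not (0 < t1 < t2) or not (0 < p2 < p1 < 1):
        raise ValueError("need 0 < t1 < t2 and 0 < p2 < p1 < 1")
    y1 = math.log(-math.log(p1))
    y2 = math.log(-math.log(p2))
    shape = (y2 - y1) / (math.log(t2) - math.log(t1))
    scale = t1 / (-math.log(p1)) ** (1.0 / shape)
    return shape, scale


def survival(t: float, shape: float, scale: float) -> float:
    """Weibull survival probability at time t."""
    return math.exp(-((t / scale) ** shape))


def simulate_times(n: int, shape: float, scale: float, seed: int = 0):
    """Draw n event times by inverting the survival function."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return [scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
            for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical targets: 90% survival at day 30, 50% at day 180.
    shape, scale = weibull_from_two_points(30, 0.9, 180, 0.5)
    print(shape, scale, survival(30, shape, scale), survival(180, shape, scale))
```

With more than two target points, as the follow-up entry describes, an exact solution generally no longer exists and some form of fitting is needed; this sketch covers only the two-point case.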
simstudy update: ordinal data generation that violates proportionality
https://www.rdatagen.net/post/2022-01-25-simstudy-update-ordinal-data-without-the-proportionality-assumption/
Tue, 25 Jan 2022 00:00:00 +0000
Version 0.4.0 of simstudy is now available on CRAN and GitHub. This update includes two enhancements (and at least one major bug fix). genOrdCat now includes an argument to generate ordinal data without an assumption of cumulative proportional odds. And two new functions defRepeat and defRepeatAdd make it a bit easier to define multiple variables that share the same distribution assumptions.
Ordinal data
In simstudy, it is relatively easy to specify multinomial distributions that characterize categorical data.

Including uncertainty when comparing response rates across clusters
https://www.rdatagen.net/post/2022-01-18-including-uncertainty-when-comparing-response-rates-across-clusters/
Tue, 18 Jan 2022 00:00:00 +0000
Since this is a holiday weekend here in the US, I thought I would write up something relatively short and simple since I am supposed to be relaxing. A few weeks ago, someone presented me with some data that showed response rates to a survey that was conducted at about 30 different locations. The team that collected the data was interested in understanding if there were some sites that had response rates that might have been too low.

Skeptical Bayesian priors might help minimize skepticism about subgroup analyses
https://www.rdatagen.net/post/2022-01-04-reducing-the-risk-of-spurious-findings-with-bayesian-decison-rules/
Tue, 04 Jan 2022 00:00:00 +0000
Over the past couple of years, I have been working with an amazing group of investigators as part of the CONTAIN trial to study whether COVID-19 convalescent plasma (CCP) can improve the clinical status of patients hospitalized with COVID-19 and requiring noninvasive supplemental oxygen. This was a multi-site study in the US that randomized 941 patients to either CCP or a saline solution placebo. The overall findings suggest that CCP did not benefit the patients who received it, but if you drill down a little deeper, the story may be more complicated than that.

Controlling Type I error in RCTs with interim looks: a Bayesian perspective
https://www.rdatagen.net/post/2021-12-21-controling-type-1-error-rates-in-rcts-with-interim-looks-a-bayesian-perspective/
Tue, 21 Dec 2021 00:00:00 +0000
Recently, a colleague submitted a paper describing the results of a Bayesian adaptive trial where the research team estimated the probability of effectiveness at various points during the trial. This trial was designed to stop as soon as the probability of effectiveness exceeded a pre-specified threshold. The journal rejected the paper on the grounds that these repeated interim looks inflated the Type I error rate, and increased the chances that any conclusions drawn from the study could have been misleading.

Exploring design effects of stepped wedge designs with baseline measurements https://www.rdatagen.net/post/2021-12-07-exploring-design-effects-of-stepped-wedge-designs-with-baseline-measurements/ Tue, 07 Dec 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-12-07-exploring-design-effects-of-stepped-wedge-designs-with-baseline-measurements/ In the previous post, I described an incipient effort that I am undertaking with two colleagues, Monica Taljaard and Fan Li, to better understand the implications for collecting baseline measurements on sample size requirements for stepped wedge cluster randomized trials. (The three of us are on the Design and Statistics Core of the NIA IMPACT Collaboratory.) In that post, I conducted a series of simulations that illustrated the design effects in parallel cluster randomized trials derived analytically in a paper by Teerenstra et al. The design effect of a cluster randomized trial with baseline measurements https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/ Tue, 23 Nov 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-11-23-design-effects-with-baseline-measurements/ Is it possible to reduce the sample size requirements of a stepped wedge cluster randomized trial simply by collecting baseline information? In a trial with randomization at the individual level, it is generally the case that if we are able to measure an outcome for subjects at two time periods, first at baseline and then at follow-up, we can reduce the overall sample size. But does this extend to (a) cluster randomized trials generally, and to (b) stepped wedge designs more specifically? 
simstudy update: adding flexibility to data generation https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/ Tue, 09 Nov 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-11-09-simstudy-0-3-0-update-summary/ A new version of simstudy (0.3.0) is now available on CRAN and on the package website. Along with some less exciting bug fixes, we have added capabilities to a few existing features: double-dot variable reference, treatment assignment, and categorical data definition. These simple additions should make the data generation process a little smoother and more flexible. Using non-scalar double-dot variable reference Double-dot notation was introduced in the last version of simstudy to allow data definitions to be more dynamic. Sample size requirements for a Bayesian factorial study design https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/ Tue, 26 Oct 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-10-26-sample-size-requirements-for-a-factorial-study-design/ How do you determine sample size when the goal of a study is not to conduct a null hypothesis test but to provide an estimate of multiple effect sizes? I needed to get a handle on this for a recent grant submission, which I’ve been writing about over the past month, here and here. (I provide a little more context for all of this in those earlier posts.) The statistical inference in the study will be based on the estimated posterior distributions from a Bayesian model, so it seems like we’d like those distributions to be as informative as possible. 
A Bayesian analysis of a factorial design focusing on effect size estimates https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/ Tue, 12 Oct 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-10-12-analyzing-a-factorial-design-with-a-bayesian-shrinkage-model/ Factorial study designs present a number of analytic challenges, not least of which is how to best understand whether simultaneously applying multiple interventions is beneficial. Last time I presented a possible approach that focuses on estimating the variance of effect size estimates using a Bayesian model. The scenario I used there focused on a hypothetical study evaluating two interventions with four different levels each. This time around, I am considering a proposed study to reduce emergency department (ED) use for patients living with dementia that I am actually involved with. Analyzing a factorial design by focusing on the variance of effect sizes https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/ Tue, 28 Sep 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-09-28-analyzing-a-factorial-trial-with-a-bayesian-model/ Way back in 2018, long before the pandemic, I described a soon-to-be implemented simstudy function genMultiFac that facilitates the generation of multi-factorial study data. I followed up that post with a description of how we can use these types of efficient designs to answer multiple questions in the context of a single study. Fast forward three years, and I am thinking about these designs again for a new grant application that proposes to study simultaneously three interventions aimed at reducing emergency department (ED) use for people living with dementia. 
Drawing the wrong conclusion about subgroups: a comparison of Bayes and frequentist methods https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/ Tue, 14 Sep 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-09-14-drawing-the-wrong-conclusion-a-comparison-of-bayes-and-frequentist-methods/ In the previous post, I simulated data from a hypothetical RCT that had heterogeneous treatment effects across subgroups defined by three covariates. I presented two Bayesian models, a strongly pooled model and an unpooled version, that could be used to estimate all the subgroup effects in a single model. I compared the estimates to a set of linear regression models that were estimated for each subgroup separately. My goal in doing these comparisons is to see how often we might draw the wrong conclusion about subgroup effects when we conduct these types of analyses. Subgroup analysis using a Bayesian hierarchical model https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/ Tue, 31 Aug 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-08-31-subgroup-analysis-using-a-bayesian-hierarchical-model/ I’m part of a team that recently submitted the results of a randomized clinical trial for publication in a journal. The overall findings of the study were inconclusive, and we certainly didn’t try to hide that fact in our paper. Of course, the story was a bit more complicated, as the RCT was conducted during various phases of the COVID-19 pandemic; the context in which the therapeutic treatment was provided changed over time. 
Posterior probability checking with rvars: a quick follow-up https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/ Tue, 17 Aug 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-08-17-quick-follow-up-on-posterior-probability-checks-with-rvars/ This is a relatively brief addendum to last week’s post, where I described how the rvar datatype implemented in the R package posterior makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and estimated parameters for a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), equally unsurprisingly, the model wasn’t a great fit. Fitting your model is only the beginning: Bayesian posterior probability checks with rvars https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/ Mon, 09 Aug 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-08-10-fitting-your-model-is-only-the-begining-bayesian-posterior-probability-checks/ Say we’ve collected data and estimated parameters of a model that give structure to the data. An important question to ask is whether the model is a reasonable approximation of the true underlying data generating process. If we did a good job, we should be able to turn around and generate data from the model itself that looks similar to the data we started with. And if we didn’t do such a great job, the newly generated data will diverge from the original. 
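The idea in the last summary above (generate data from the fitted model and see whether it resembles the observed data) can be sketched in a few lines. This is only an illustration in plain Python, not the rvar-based R code from the post: the function name `posterior_predictive_pvalue`, the flat prior on the mean, and the shortcut of fixing sigma at the sample standard deviation are all assumptions made here to keep the posterior available in closed form.

```python
import random
import statistics


def posterior_predictive_pvalue(y, stat, n_draws=2000, seed=1):
    """Share of replicated data sets whose statistic is >= the observed one.

    Assumed model (a simplification for illustration): y ~ Normal(mu, sigma^2)
    with sigma fixed at the sample SD and a flat prior on mu, so the posterior
    for mu is Normal(mean(y), sigma^2 / n).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    sigma = statistics.stdev(y)
    observed = stat(y)
    at_least_observed = 0
    for _ in range(n_draws):
        mu = rng.gauss(ybar, sigma / n ** 0.5)          # draw mu from its posterior
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]  # replicate the data set
        if stat(y_rep) >= observed:
            at_least_observed += 1
    return at_least_observed / n_draws


data_rng = random.Random(42)
y_normal = [data_rng.gauss(10, 2) for _ in range(200)]        # model is appropriate
y_skewed = [data_rng.expovariate(1 / 10) for _ in range(200)]  # model is not

p_normal = posterior_predictive_pvalue(y_normal, min)
p_skewed = posterior_predictive_pvalue(y_skewed, min)
print(p_normal, p_skewed)
```

The sample minimum is used as the check statistic because a normal model fit to strictly positive, right-skewed data will routinely generate negative values that the observed data never show, so the proportion for the skewed data should sit at or very near zero.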
Estimating a risk difference (and confidence intervals) using logistic regression https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/ Tue, 15 Jun 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-06-15-estimating-a-risk-difference-using-logistic-regression/ The odds ratio (OR) – the effect size parameter estimated in logistic regression – is notoriously difficult to interpret. It is a ratio of two quantities (odds, under different conditions) that are themselves ratios of probabilities. I think it is pretty clear that a very large or small OR implies a strong treatment effect, but translating that effect into a clinical context can be challenging, particularly since ORs cannot be mapped to unique probabilities. Sample size determination in the context of Bayesian analysis https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/ Tue, 01 Jun 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-06-01-bayesian-power-analysis/ Given my recent involvement with the design of a somewhat complex trial centered around a Bayesian data analysis, I am appreciating more and more that Bayesian approaches are a very real option for clinical trial design. A key element of any study design is sample size. While some would argue that sample size considerations are not critical to the Bayesian design (since Bayesian inference is agnostic to any pre-specified sample size and is not really affected by how frequently you look at the data along the way), it might be a bit of a challenge to submit a grant without telling the potential funders how many subjects you plan on recruiting (since that could have a rather big effect on the level of resources - financial and time - required). 
Generating random lists of names with errors to explore fuzzy word matching https://www.rdatagen.net/post/2021-04-13-generating-random-lists-of-names-with-errors-to-explore-fuzzy-word-matching/ Tue, 13 Apr 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-04-13-generating-random-lists-of-names-with-errors-to-explore-fuzzy-word-matching/ Health data systems are not always perfect, a point that was made quite obvious when a study I am involved with required a matched list of nursing home residents taken from one system with a set of results from PCR tests for COVID-19 drawn from another. Name spellings for the same person from the second list were not always consistent across different PCR tests, nor were they always consistent with the cohort we were interested in studying defined by the first list. The case of three MAR mechanisms: when is multiple imputation mandatory? https://www.rdatagen.net/post/2021-03-30-some-cases-where-imputing-missing-data-matters/ Tue, 30 Mar 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-03-30-some-cases-where-imputing-missing-data-matters/ I thought I’d written about this before, but I searched through my posts and I couldn’t find what I was looking for. If I am repeating myself, my apologies. I explored missing data two years ago, using directed acyclic graphs (DAGs) to help understand the various missing data mechanisms (MAR, MCAR, and MNAR). The DAGs provide insight into when it is appropriate to use observed data to get unbiased estimates of population quantities even though some of the observations are missing information. 
Framework for power analysis using simulation https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/ Tue, 16 Mar 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-03-16-framework-for-power-analysis-using-simulation/ The simstudy package started as a collection of functions I developed as I found myself repeating many of the same types of simulations for different projects. It was a way of organizing my work that I decided to share with others in case they wanted a routine way to generate data as well. simstudy has expanded a bit from that, but replicability is still a key motivation. What I have here is another attempt to document and organize a process that I find myself doing quite often - repeated data generation and model fitting. Randomization tests make fewer assumptions and seem pretty intuitive https://www.rdatagen.net/post/2021-03-02-randomization-tests/ Tue, 02 Mar 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-03-02-randomization-tests/ I’m preparing a lecture on simulation for a statistical modeling class, and I plan on describing a couple of cases where simulation is intrinsic to the analytic method rather than as a tool for exploration and planning. MCMC methods used for Bayesian estimation, bootstrapping, and randomization tests all come to mind. Randomization tests are particularly interesting as an approach to conducting hypothesis tests, because they allow us to avoid making unrealistic assumptions. 
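To make the randomization-test idea in the last summary above concrete, here is a minimal sketch in plain Python (not code from the post). The function name `randomization_test`, the choice of the difference in means as the test statistic, and the Monte Carlo approximation with `n_perm` reshuffles are assumptions made for illustration.

```python
import random
import statistics


def randomization_test(treated, control, n_perm=5000, seed=1):
    """Two-sided Monte Carlo randomization test for a difference in means.

    Group labels are reshuffled repeatedly; the p-value is the share of
    reshuffled differences at least as extreme as the observed one (with the
    usual +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize who is labeled "treated"
        diff = statistics.fmean(pooled[:n_t]) - statistics.fmean(pooled[n_t:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Clearly separated groups: almost no reshuffle reproduces the observed gap.
obs_sep, p_sep = randomization_test(list(range(10, 20)), list(range(0, 10)))

# Identical groups: the observed difference is 0, so every reshuffle is "as extreme".
obs_same, p_same = randomization_test([1, 2, 3, 4], [1, 2, 3, 4])
print(obs_sep, p_sep, obs_same, p_same)
```

The only assumption the test leans on is that, under the null hypothesis, the group labels are exchangeable, which is exactly what random assignment provides; no distributional form for the outcome is needed.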
Visualizing the treatment effect with an ordinal outcome https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/ Tue, 16 Feb 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-02-16-visualizing-the-treatment-effect-when-outcome-is-ordinal/ If it’s true that many readers of a journal article focus on the abstract, figures and tables while skimming the rest, it is particularly important to tell your story with a well-conceived graphic or two. Along with a group of collaborators, I am trying to figure out the best way to represent an ordered categorical outcome from an RCT. In this case, there are a lot of categories, so the images can get confusing. How useful is it to show uncertainty in a plot comparing proportions? https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/ Tue, 02 Feb 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-02-02-uncertainty-in-a-plot-comparing-proportions/ I recently created a simple plot for a paper describing a pilot study of an intervention targeting depression. This small study was largely conducted to assess the feasibility and acceptability of implementing an existing intervention in a new population. The primary outcome measure that was collected was the proportion of patients in each study arm who remained depressed following the intervention. 
The plot of the study results that we included in the paper looked something like this: Finding answers faster for COVID-19: an application of Bayesian predictive probabilities https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/ Tue, 19 Jan 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-01-19-should-we-continue-recruiting-patients-an-application-of-bayesian-predictive-probabilities/ As we evaluate therapies for COVID-19 to help improve outcomes during the pandemic, researchers need to be able to make recommendations as quickly as possible. There really is no time to lose. The Data & Safety Monitoring Board (DSMB) of COMPILE, a prospective individual patient data meta-analysis, recognizes this. They are regularly monitoring the data to determine if there is a sufficiently strong signal to indicate effectiveness of convalescent plasma (CP) for hospitalized patients not on ventilation. Coming soon: effortlessly generate ordinal data without assuming proportional odds https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/ Tue, 05 Jan 2021 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2021-01-05-coming-soon-new-feature-to-easily-generate-cumulative-odds-without-proportionality-assumption/ I’m starting off 2021 with my 99th post ever to introduce a new feature that will be incorporated into simstudy soon to make it a bit easier to generate ordinal data without requiring an assumption of proportional odds. I should wait until this feature has been incorporated into the development version, but I want to put it out there in case anyone has any further suggestions. In any case, having this out in plain view will motivate me to get back to work on the package. 
Constrained randomization to evaluate the vaccine rollout in nursing homes https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/ Tue, 22 Dec 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/2020-12-22-constrained-randomization-to-evaulate-the-vaccine-rollout-in-nursing-homes/ On an incredibly heartening note, two COVID-19 vaccines have been approved for use in the US and other countries around the world. More are possibly on the way. The big challenge, at least here in the United States, is to convince people that these vaccines are safe and effective; we need people to get vaccinated as soon as they are able to slow the spread of this disease. I for one will not hesitate for a moment to get a shot when I have the opportunity, though I don’t think biostatisticians are too high on the priority list. A Bayesian implementation of a latent threshold model https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/ Tue, 08 Dec 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-latent-threshold-model-to-estimate-treatment-effects/ In the previous post, I described a latent threshold model that might be helpful if we want to dichotomize a continuous predictor but we don’t know the appropriate cut-off point. This was motivated by a need to identify a threshold of antibody levels present in convalescent plasma that is currently being tested as a therapy for hospitalized patients with COVID in a number of RCTs, including those that are participating in the ongoing COMPILE meta-analysis. A latent threshold model to dichotomize a continuous predictor https://www.rdatagen.net/post/a-latent-threshold-model/ Tue, 24 Nov 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-latent-threshold-model/ This is the context. 
In the convalescent plasma pooled individual patient level meta-analysis we are conducting as part of the COMPILE study, there is great interest in understanding the impact of antibody levels on outcomes. (I’ve described various aspects of the analysis in previous posts, most recently here). In other words, not all convalescent plasma is equal. If we had a clear measure of antibodies, we could model the relationship of these levels with the outcome of interest, such as health status as captured by the WHO 11-point scale or mortality, and call it a day. Exploring the properties of a Bayesian model using high performance computing https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/ Tue, 10 Nov 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-frequentist-bayesian-exploring-frequentist-properties-of-bayesian-models/ An obvious downside to estimating Bayesian models is that it can take a considerable amount of time merely to fit a model. And if you need to estimate the same model repeatedly, that considerable amount becomes a prohibitive amount. In this post, which is part of a series (last one here) where I’ve been describing various aspects of the Bayesian analyses we plan to conduct for the COMPILE meta-analysis of convalescent plasma RCTs, I’ll present a somewhat elaborate model to illustrate how we have addressed these computing challenges to explore the properties of these models. A refined brute force method to inform simulation of ordinal response data https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/ Tue, 27 Oct 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/can-empirical-mean-and-variance-data-inform-simulation-of-ordinal-response-variables/ Francisco, a researcher from Spain, reached out to me with a challenge. 
He is interested in exploring various models that estimate correlation across multiple responses to survey questions. This is the context: He doesn’t have access to actual data, so to explore analytic methods he needs to simulate responses. It would be ideal if the simulated data reflect the properties of real-world responses, some of which can be gleaned from the literature. simstudy just got a little more dynamic: version 0.2.1 https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/ Tue, 13 Oct 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/ simstudy version 0.2.1 has just been submitted to CRAN. Along with this release, the big news is that I’ve been joined by Jacob Wujciak-Jens as a co-author of the package. He initially reached out to me from Germany with some suggestions for improvements, we had a little back and forth, and now here we are. He has substantially reworked the underbelly of simstudy, making the package much easier to maintain, and positioning it for much easier extension. Permuted block randomization using simstudy https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/ Tue, 29 Sep 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/ Along with preparing power analyses and statistical analysis plans (SAPs), generating study randomization lists is something a practicing biostatistician is occasionally asked to do. While not a particularly interesting activity, it offers the opportunity to tackle a small programming challenge. The title is a little misleading because you should probably skip all this and just use the blockrand package if you want to generate randomization schemes; don’t try to reinvent the wheel. 
Generating probabilities for ordinal categorical data https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/ Tue, 15 Sep 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/ Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently here. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution. Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/ Tue, 01 Sep 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/ The federal government recently granted emergency approval for the use of antibody rich blood plasma when treating hospitalized COVID-19 patients. This announcement is unfortunate, because we really don’t know if this promising treatment works. The best way to determine this, of course, is to conduct an experiment, though this approval makes this more challenging to do; with the general availability of convalescent plasma (CP), there may be resistance from patients and providers against participating in a randomized trial. 
Generating data from a truncated distribution https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/ Tue, 18 Aug 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/ A researcher reached out to me the other day to see if the simstudy package provides a quick and easy way to generate data from a truncated distribution. Other than the noZeroPoisson distribution option (which is a very specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy. A hurdle model for COVID-19 infections in nursing homes https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/ Tue, 04 Aug 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/ Late last year, I added a mixture distribution to the simstudy package, largely motivated to accommodate zero-inflated Poisson or negative binomial distributions. (I really thought I had added this two years ago - but time is moving so slowly these days.) These distributions are useful when modeling count data, but we anticipate observing more than the expected frequency of zeros that would arise from a non-inflated (i.e. “regular”) Poisson or negative binomial distribution. 
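The zero-inflation idea in the last summary above can be illustrated with a small mixture simulation. This is a plain-Python sketch, not simstudy code and not the hurdle model itself: the helper names `rpois` and `rzip`, the mixing probability `p_zero`, and the use of Knuth's multiplication method for Poisson draws are all assumptions introduced for the example.

```python
import math
import random


def rpois(lam, rng):
    """One Poisson(lam) draw via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def rzip(n, lam, p_zero, seed=1):
    """n draws from a zero-inflated Poisson: a structural zero with probability
    p_zero, otherwise an ordinary Poisson(lam) count (which may also be zero)."""
    rng = random.Random(seed)
    return [0 if rng.random() < p_zero else rpois(lam, rng) for _ in range(n)]


lam, p_zero = 3.0, 0.3
counts = rzip(10_000, lam, p_zero)

obs_zero = counts.count(0) / len(counts)
exp_zero_poisson = math.exp(-lam)                    # zeros under a regular Poisson
exp_zero_zip = p_zero + (1 - p_zero) * math.exp(-lam)  # zeros under the mixture
print(obs_zero, exp_zero_poisson, exp_zero_zip)
```

Comparing `obs_zero` with `exp_zero_poisson` shows the excess of zeros that a regular Poisson with the same rate cannot account for, which is the situation the mixture distribution is meant to represent.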
A Bayesian model for a simulated meta-analysis https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/ Tue, 21 Jul 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/ This is essentially an addendum to the previous post where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the nlme package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package rstan. Create the data set We’ll use the exact same data generating process as described in some detail in the previous post. Simulating multiple RCTs to simulate a meta-analysis https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/ Tue, 07 Jul 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/ I am currently involved with an RCT that is struggling to recruit eligible patients (by no means an unusual problem), increasing the risk that findings might be inconclusive. A possible solution to this conundrum is to find similar, ongoing trials with the aim of pooling data in a single analysis, to conduct a meta-analysis of sorts. In an ideal world, this theoretical collection of sites would have joined forces to develop a single study protocol, but often there is no structure or funding mechanism to make that happen. 
Consider a permutation test for a small pilot study https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/ Tue, 23 Jun 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/ Recently I wrote about the challenges of trying to learn too much from a small pilot study, even if it is a randomized controlled trial. There are limitations on how much you can learn about a treatment effect given the small sample size and relatively high variability of the estimate. However, the temptation for researchers is usually just too great; it is only natural to want to see if there is any kind of signal of an intervention effect, even though the pilot study is focused on questions of feasibility and acceptability. When proportional odds is a poor assumption, collapsing categories is probably not going to save you https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/ Tue, 09 Jun 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/ Continuing the discussion on cumulative odds models I started last time, I want to investigate a solution I always assumed would help mitigate a failure to meet the proportional odds assumption. I’ve believed if there is a large number of categories and the relative cumulative odds between two groups don’t appear proportional across all categorical levels, then a reasonable approach is to reduce the number of categories. In other words, fewer categories translates to proportional odds. 
Considering the number of categories in an ordinal outcome https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/ Tue, 26 May 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/ In two Covid-19-related trials I’m involved with, the primary or key secondary outcome is the status of a patient at 14 days based on a World Health Organization ordered rating scale. In this particular ordinal scale, there are 11 categories ranging from 0 (uninfected) to 10 (death). In between, a patient can be infected but well enough to remain at home, hospitalized with milder symptoms, or hospitalized with severe disease. To stratify or not? It might not actually matter... https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/ Tue, 12 May 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/ Continuing with the theme of exploring small issues that come up in trial design, I recently used simulation to assess the impact of stratifying (or not) in the context of a multi-site Covid-19 trial with a binary outcome. The investigators are concerned that baseline health status will affect the probability of an outcome event, and are interested in randomizing by health status. The goal is to ensure balance across the two treatment arms with respect to this important variable. Simulation for power in designing cluster randomized trials https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/ Tue, 28 Apr 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/ As a biostatistician, I like to be involved in the design of a study as early as possible. 
I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis. Yes, unbalanced randomization can improve power, in some situations https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/ Tue, 14 Apr 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/ Last time I provided some simulations that suggested that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test. Can unbalanced randomization improve power? https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/ Tue, 31 Mar 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/ Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) 
And for me, working means blogging, at least for a few hours every couple of weeks. When you want more than a chi-squared test, consider a measure of association https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/ Tue, 17 Mar 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/ In my last post, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the \(\chi^2\) statistic. I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal. Alternatives to reporting a p-value: the case of a contingency table https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/ Tue, 03 Mar 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/ I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the American Statistical Association has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. 
In 2016, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community. Clustered randomized trials and the design effect https://www.rdatagen.net/post/what-exactly-is-the-design-effect/ Tue, 18 Feb 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/what-exactly-is-the-design-effect/ I am always saying that simulation can help illuminate interesting statistical concepts or ideas. Such an exploration might provide some insight into the concept of the design effect, which underlies clustered randomized trial designs. I’ve written about clustered-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect. Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/ Tue, 04 Feb 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/ I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model. 
A brief account (via simulation) of the ROC (and its AUC) https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/ Tue, 21 Jan 2020 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/ The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC), quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve. A prediction problem The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease. Repeated measures can improve estimation when we only care about a single endpoint https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/ Tue, 10 Dec 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/ I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here. There is one issue that should be fairly generalizable to other studies. 
Adding a "mixture" distribution to the simstudy package https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/ Tue, 26 Nov 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/ I am contemplating adding a new distribution option to the package simstudy that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new mixture option is to step through a few examples and see it in action. Specifying the “mixture” distribution As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities. What can we really expect to learn from a pilot study? https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/ Tue, 12 Nov 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/ I am involved with a very interesting project - the NIA IMPACT Collaboratory - where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention. Any one interested in a function to quickly generate data with many predictors? 
https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/ Tue, 29 Oct 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/ A couple of months ago, I was contacted about the possibility of creating a simple function in simstudy to generate a large dataset that could include possibly 10’s or 100’s of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other “regularization” methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process. Selection bias, death, and dying https://www.rdatagen.net/post/selection-bias-death-and-dying/ Tue, 15 Oct 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/selection-bias-death-and-dying/ I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death. 
There's always at least two ways to do the same thing: an example generating 3-level hierarchical data using simstudy https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/ Thu, 03 Oct 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/ “I am working on a simulation study that requires me to generate data for individuals within clusters, but each individual will have repeated measures (say baseline and two follow-ups). I’m new to simstudy and have been going through the examples in R this afternoon, but I wondered if this was possible in the package, and if so whether you could offer any tips to get me started with how I would do this? Simulating an open cohort stepped-wedge trial https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/ Tue, 17 Sep 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/ In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with stepped-wedge designs, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time. 
Analyzing a binary outcome arising out of within-cluster, pair-matched randomization https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/ Tue, 03 Sep 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/ A key motivating factor for the simstudy package and much of this blog is that simulation can be super helpful in understanding how best to approach an unusual, or at least unfamiliar, analytic problem. About six months ago, I described the DREAM Initiative (Diabetes Research, Education, and Action for Minorities), a study that used a slightly innovative randomization scheme to ensure that two comparison groups were evenly balanced across important covariates. simstudy updated to version 0.1.14: implementing Markov chains https://www.rdatagen.net/post/simstudy-1-14-update/ Tue, 20 Aug 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-1-14-update/ I’m developing study simulations that require me to generate a sequence of health statuses for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends only on the previous health status (in the immediate past). While there are packages to do this sort of thing (see for example the markovchain package), I hadn’t yet stumbled upon them while I was tackling my problem.
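The Markov process described in the simstudy 0.1.14 item above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the simstudy implementation: the three state labels, the transition matrix `P`, and the helper `sim_chain` are all hypothetical, chosen so that individuals tend to grow sicker but occasionally recover.

```r
set.seed(2019)

# Hypothetical health states and transition probabilities (assumptions,
# not values from the post). Row i gives P(next state | current state i).
states <- c("well", "sick", "very sick")
P <- matrix(
  c(0.80, 0.15, 0.05,
    0.10, 0.70, 0.20,
    0.00, 0.10, 0.90),
  nrow = 3, byrow = TRUE, dimnames = list(states, states)
)

# Simulate one individual's path: the status at time t is drawn using only
# the row of P that corresponds to the status at time t - 1.
sim_chain <- function(n_periods, P, start = 1) {
  path <- integer(n_periods)
  path[1] <- start
  for (t in seq_len(n_periods)[-1]) {
    path[t] <- sample(ncol(P), size = 1, prob = P[path[t - 1], ])
  }
  path
}

# 500 individuals followed for 12 periods (rows = individuals)
paths <- t(replicate(500, sim_chain(12, P)))

# Share of individuals in each state at the final period
final_dist <- table(factor(states[paths[, 12]], levels = states)) / nrow(paths)
```

Because each draw conditions only on the previous status, changing a single row of `P` changes how individuals leave that state without touching the rest of the process.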
Bayes models for estimation in stepped-wedge trials with non-trivial ICC patterns https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/ Tue, 06 Aug 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/ Continuing a series of posts discussing the structure of intra-cluster correlations (ICC’s) in the context of a stepped-wedge trial, this latest edition is primarily interested in fitting Bayesian hierarchical models for more complex cases (though I do talk a bit more about the linear mixed effects models). The first two posts in the series focused on generating data to simulate various scenarios; the third post considered linear mixed effects and Bayesian hierarchical models to estimate ICC’s under the simplest scenario of constant between-period ICC’s. Estimating treatment effects (and ICCs) for stepped-wedge designs https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/ Tue, 16 Jul 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/ In the last two posts, I introduced the notion of time-varying intra-cluster correlations in the context of stepped-wedge study designs. (See here and here). Though I generated lots of data for those posts, I didn’t fit any models to see if I could recover the estimates and any underlying assumptions. That’s what I am doing now. My focus here is on the simplest case, where the ICC’s are constant over time and between time. 
More on those stepped-wedge design assumptions: varying intra-cluster correlations over time https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/ Tue, 09 Jul 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/ In my last post, I wrote about within- and between-period intra-cluster correlations in the context of stepped-wedge cluster randomized study designs. These are quite important to understand when figuring out sample size requirements (and models for analysis, which I’ll be writing about soon.) Here, I’m extending the constant ICC assumption I presented last time around by introducing some complexity into the correlation structure. Much of the code I am using can be found in last week’s post, so if anything seems a little unclear, hop over here. Planning a stepped-wedge trial? Make sure you know what you're assuming about intra-cluster correlations ... https://www.rdatagen.net/post/intra-cluster-correlations-over-time/ Tue, 25 Jun 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/intra-cluster-correlations-over-time/ A few weeks ago, I was at the annual meeting of the NIH Collaboratory, which is an innovative collection of collaboratory cores, demonstration projects, and NIH Institutes and Centers that is developing new models for implementing and supporting large-scale health services research. A study I am involved with - Primary Palliative Care for Emergency Medicine - is one of the demonstration projects in this collaboratory. The second day of this meeting included four panels devoted to the design and analysis of embedded pragmatic clinical trials, and focused on the challenges of conducting rigorous research in the real-world context of a health delivery system. 
Don't get too excited - it might just be regression to the mean https://www.rdatagen.net/post/regression-to-the-mean/ Tue, 11 Jun 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/regression-to-the-mean/ It is always exciting to find an interesting pattern in the data that seems to point to some important difference or relationship. A while ago, one of my colleagues shared a figure with me that looked something like this: It looks like something is going on. On average low scorers in the first period increased a bit in the second period, and high scorers decreased a bit. Something is going on, but nothing specific to the data in question; it is just probability working its magic. simstudy update - stepped-wedge design treatment assignment https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/ Tue, 28 May 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/ simstudy has just been updated (version 0.1.13 on CRAN), and includes one interesting addition (and a couple of bug fixes). I am working on a post (or two) about intra-cluster correlations (ICCs) and stepped-wedge study designs (which I’ve written about before), and I was getting tired of going through the convoluted process of generating data from a time-dependent treatment assignment process. So, I wrote a new function, trtStepWedge, that should simplify things. Generating and modeling over-dispersed binomial data https://www.rdatagen.net/post/overdispersed-binomial-data/ Tue, 14 May 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/overdispersed-binomial-data/ A couple of weeks ago, I was inspired by a study to write about a classic design issue that arises in cluster randomized trials: should we focus on the number of clusters or the size of those clusters? 
This trial, which is concerned with preventing opioid use disorder for at-risk patients in primary care clinics, has also motivated this second post, which concerns another important issue - over-dispersion. A count outcome In this study, one of the primary outcomes is the number of days of opioid use over a six-month follow-up period (to be recorded monthly by patient-report and aggregated for the six-month measure). What matters more in a cluster randomized trial: number or size? https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/ Tue, 30 Apr 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/ I am involved with a trial of an intervention designed to prevent full-blown opioid use disorder for patients who may have an incipient opioid use problem. Given the nature of the intervention, it was clear the only feasible way to conduct this particular study is to randomize at the physician rather than the patient level. There was a concern that the number of patients eligible for the study might be limited, so that each physician might only have a handful of patients able to participate, if that many. Musings on missing data https://www.rdatagen.net/post/musings-on-missing-data/ Tue, 02 Apr 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/musings-on-missing-data/ I’ve been meaning to share an analysis I recently did to estimate the strength of the relationship between a young child’s ability to recognize emotions in others (e.g. teachers and fellow students) and her longer term academic success. The study itself is quite interesting (hopefully it will be published sometime soon), but I really wanted to write about it here as it involved the challenging problem of missing data in the context of heterogeneous effects (different across sub-groups) and clustering (by schools). 
A case where prospective matching may limit bias in a randomized trial https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/ Tue, 12 Mar 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/ Analysis is important, but study design is paramount. I am involved with the Diabetes Research, Education, and Action for Minorities (DREAM) Initiative, which is, among other things, estimating the effect of a group-based therapy program on weight loss for patients who have been identified as pre-diabetic (which means they have elevated HbA1c levels). The original plan was to randomize patients at a clinic to treatment or control, and then follow up with those assigned to the treatment group to see if they wanted to participate. A example in causal inference designed to frustrate: an estimate pretty much guaranteed to be biased https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/ Tue, 26 Feb 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/ I am putting together a brief lecture introducing causal inference for graduate students studying biostatistics. As part of this lecture, I thought it would be helpful to spend a little time describing directed acyclic graphs (DAGs), since they are an extremely helpful tool for communicating assumptions about the causal relationships underlying a researcher’s data. The strength of DAGs is that they help us think how these underlying relationships in the data might lead to biases in causal effect estimation, and suggest ways to estimate causal effects that eliminate these biases. 
Using the uniform sum distribution to introduce probability https://www.rdatagen.net/post/a-fun-example-to-explore-probability/ Tue, 05 Feb 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-fun-example-to-explore-probability/ I’ve never taught an intro probability/statistics course. If I ever did, I would certainly want to bring the underlying wonder of the subject to life. I’ve always found it almost magical the way mathematical formulation can be mirrored by computer simulation, the way proof can be guided by observed data generation processes, and the way DGPs can confirm analytic solutions. I would like to begin such a course with a somewhat unusual but accessible problem that would evoke these themes from the start. Correlated longitudinal data with varying time intervals https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/ Tue, 22 Jan 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/ I was recently contacted to see if simstudy can create a data set of correlated outcomes that are measured over time, but at different intervals for each individual. The quick answer is there is no specific function to do this. However, if you are willing to assume an “exchangeable” correlation structure, where measurements far apart in time are just as correlated as measurements taken close together, then you could just generate individual-level random effects (intercepts and/or slopes) and pretty much call it a day. 
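The "exchangeable" shortcut mentioned in the item above can be sketched in base R. This is a rough illustration under assumed values, not a simstudy function: the sample size, the variance components `s2_b` and `s2_e`, and the three measurement days are all hypothetical. With only a random intercept, every pair of measurements shares the same correlation, no matter how far apart in time they are.

```r
set.seed(2019)

n    <- 2000   # individuals (assumed)
s2_b <- 4      # between-individual (random intercept) variance, assumed
s2_e <- 1      # within-individual noise variance, assumed

b <- rnorm(n, mean = 0, sd = sqrt(s2_b))   # one random intercept per person

# Outcomes at three hypothetical measurement times (say days 1, 2, and 30);
# time itself plays no role in the correlation under this structure.
y <- sapply(1:3, function(j) b + rnorm(n, mean = 0, sd = sqrt(s2_e)))
colnames(y) <- c("day1", "day2", "day30")

icc     <- s2_b / (s2_b + s2_e)   # implied correlation for every pair: 0.8
obs_cor <- cor(y)                 # off-diagonal entries should all be near icc
```

The correlation between day 1 and day 30 is estimated to be about the same as between day 1 and day 2, which is exactly the assumption one has to be willing to accept before calling it a day.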
Considering sensitivity to unmeasured confounding: part 2 https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/ Thu, 10 Jan 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/ In part 1 of this 2-part series, I introduced the notion of sensitivity to unmeasured confounding in the context of an observational data analysis. I argued that an estimate of an association between an observed exposure \(D\) and outcome \(Y\) is sensitive to unmeasured confounding if we can conceive of a reasonable alternative data generating process (DGP) that includes some unmeasured confounder and that will generate the same distribution as the observed data. Considering sensitivity to unmeasured confounding: part 1 https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/ Wed, 02 Jan 2019 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/ Principled causal inference methods can be used to compare the effects of different exposures or treatments we have observed in non-experimental settings. These methods, which include matching (with or without propensity scores), inverse probability weighting, and various g-methods, help us create comparable groups to simulate a randomized experiment. All of these approaches rely on a key assumption of no unmeasured confounding. The problem is, short of subject matter knowledge, there is no way to test this assumption empirically.
Parallel processing to add a little zip to power simulations (and other replication studies) https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/ Mon, 10 Dec 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/ It’s always nice to be able to speed things up a bit. My first blog post ever described an approach using Rcpp to make huge improvements in a particularly intensive computational process. Here, I want to show how simple it is to speed things up by using the R package parallel and its function mclapply. I’ve been using this function more and more, so I want to explicitly demonstrate it in case any one is wondering. Horses for courses, or to each model its own (causal effect) https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/ Wed, 28 Nov 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/ In my previous post, I described a (relatively) simple way to simulate observational data in order to compare different methods to estimate the causal effect of some exposure or treatment on an outcome. The underlying data generating process (DGP) included a possibly unmeasured confounder and an instrumental variable. (If you haven’t already, you should probably take a quick look.) A key point in considering causal effect estimation is that the average causal effect depends on the individuals included in the average. 
Generating data to explore the myriad causal effects that can be estimated in observational data analysis https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/ Tue, 20 Nov 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/ I’ve been inspired by two recent talks describing the challenges of using instrumental variable (IV) methods. IV methods are used to estimate the causal effects of an exposure or intervention when there is unmeasured confounding. This estimated causal effect is very specific: the complier average causal effect (CACE). But, the CACE is just one of several possible causal estimands that we might be interested in. For example, there’s the average causal effect (ACE) that represents a population average (not just based on the subset of compliers). Causal mediation estimation measures the unobservable https://www.rdatagen.net/post/causal-mediation/ Tue, 06 Nov 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/causal-mediation/ I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit. Quite often we are interested in understanding the relationship between an exposure or intervention and an outcome. Does exposure \(A\) (could be randomized or not) have an effect on outcome \(Y\)? Cross-over study design with a major constraint https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/ Tue, 23 Oct 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/ Every new study presents its own challenges. 
(I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention \(A\) better than doing nothing and (2) is \(A^+\), the slightly augmented version of \(A\), better than just \(A\)? In regression, we assume noise is independent of all measured predictors. What happens if it isn't? https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/ Tue, 09 Oct 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/ A number of key assumptions underlie the linear regression model - among them linearity and normally distributed noise (error) terms with constant variance. In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model. In this simple model: \[Y_i = \beta_0 + \beta_1X_i + e_i,\] \(Y_i\) has both a structural and stochastic (random) component. The structural component is the linear relationship of \(Y\) with \(X\). simstudy update: improved correlated binary outcomes https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/ Tue, 25 Sep 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/ An updated version of the simstudy package (0.1.10) is now available on CRAN. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last post, I described a beta-binomial data generating process that uses the recently added beta distribution. 
In addition to that update, I’ve added functionality to genCorGen and addCorGen, functions which generate correlated data from non-Gaussian or normally distributed data such as Poisson, Gamma, and binary data. Binary, beta, beta-binomial https://www.rdatagen.net/post/binary-beta-beta-binomial/ Tue, 11 Sep 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/binary-beta-beta-binomial/ I’ve been working on updates for the simstudy package. In the past few weeks, a couple of folks independently reached out to me about generating correlated binary data. One user was not impressed by the copula algorithm that is already implemented. I’ve added an option to use an algorithm developed by Emrich and Piedmonte in 1991, and will be incorporating that option soon in the functions genCorGen and addCorGen. The power of stepped-wedge designs https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/ Tue, 28 Aug 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/ Just before heading out on vacation last month, I put up a post that purported to compare stepped-wedge study designs with more traditional cluster randomized trials. Either because I rushed or was just lazy, I didn’t exactly do what I set out to do. I did confirm that a multi-site randomized clinical trial can be more efficient than a cluster randomized trial when there is variability across clusters. (I compared randomizing within a cluster with randomization by cluster. Multivariate ordinal categorical data generation https://www.rdatagen.net/post/multivariate-ordinal-categorical-data-generation/ Wed, 15 Aug 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/multivariate-ordinal-categorical-data-generation/ An economist contacted me about the ability of simstudy to generate correlated ordinal categorical outcomes. 
He is trying to generate data as an aid to teaching cost-effectiveness analysis, and is hoping to simulate responses to a quality-of-life survey instrument, the EQ-5D. The particular instrument has five questions related to mobility, self-care, activities, pain, and anxiety. Each item has three possible responses: (1) no problems, (2) some problems, and (3) a lot of problems. Randomize by, or within, cluster? https://www.rdatagen.net/post/by-vs-within/ Thu, 19 Jul 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/by-vs-within/ I am involved with a stepped-wedge designed study that is exploring whether we can improve care for patients with end-stage disease who show up in the emergency room. The plan is to train nurses and physicians in palliative care. (A while ago, I described what the stepped wedge design is.) Under this design, 33 sites around the country will receive the training at some point, which is no small task (and fortunately as the statistician, this is a part of the study in which I have little involvement). How the odds ratio confounds: a brief study in a few colorful figures https://www.rdatagen.net/post/log-odds/ Tue, 10 Jul 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/log-odds/ The odds ratio always confounds: while it may be constant across different groups or clusters, the risk ratios or risk differences across those groups may vary quite substantially. This makes it really hard to interpret an effect. And then there is inconsistency between marginal and conditional odds ratios, a topic I seem to be visiting frequently, most recently last month. My aim here is to generate a few figures that might highlight some of these issues. 
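The odds ratio claim in the item just above can be checked with a little arithmetic. What follows is a minimal sketch, not code from the post: the baseline risks and the odds ratio of 2 are hypothetical values picked for illustration, and the language is Python rather than the blog's usual R. It holds the odds ratio fixed and shows how the implied risk difference and risk ratio shift with the baseline risk.

```python
def risk_under_odds_ratio(p0, odds_ratio):
    """Treated-group risk implied by baseline risk p0 and a fixed odds ratio."""
    odds1 = odds_ratio * p0 / (1 - p0)  # treated odds = OR * baseline odds
    return odds1 / (1 + odds1)          # convert odds back to a probability


ODDS_RATIO = 2.0                            # hypothetical, same in every group
BASELINE_RISKS = [0.05, 0.20, 0.50, 0.80]   # hypothetical group-level risks

rows = []
for p0 in BASELINE_RISKS:
    p1 = risk_under_odds_ratio(p0, ODDS_RATIO)
    # Store baseline risk, treated risk, risk difference, risk ratio.
    rows.append((p0, p1, p1 - p0, p1 / p0))

print("baseline  treated  risk_diff  risk_ratio")
for p0, p1, rd, rr in rows:
    print(f"{p0:8.2f}  {p1:7.3f}  {rd:9.3f}  {rr:10.3f}")
```

With the odds ratio held at 2 in every group, the risk difference is largest near a baseline risk of 0.5 and shrinks toward both extremes, while the risk ratio falls steadily as the baseline risk rises.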
Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution https://www.rdatagen.net/post/re-referencing-to-estimate-effects-when-there-is-interaction/ Tue, 26 Jun 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/re-referencing-to-estimate-effects-when-there-is-interaction/ Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. Late anniversary edition redux: conditional vs marginal models for clustered data https://www.rdatagen.net/post/mixed-effect-models-vs-gee/ Wed, 13 Jun 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/mixed-effect-models-vs-gee/ This afternoon, I was looking over some simulations I plan to use in an upcoming lecture on multilevel models. I created these examples a while ago, before I started this blog. But since it was just about a year ago that I first wrote about this topic (and started the blog), I thought I’d post this now to mark the occasion. The code below provides another way to visualize the difference between marginal and conditional logistic regression models for clustered data (see here for an earlier post that discusses in greater detail some of the key issues raised here. 
A little function to help generate ICCs in simple clustered data https://www.rdatagen.net/post/a-little-function-to-help-generate-iccs-in-simple-clustered-data/ Thu, 24 May 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-little-function-to-help-generate-iccs-in-simple-clustered-data/ In health services research, experiments are often conducted at the provider or site level rather than the patient level. However, we might still be interested in the outcome at the patient level. For example, we could be interested in understanding the effect of a training program for physicians on their patients. It would be very difficult to randomize patients to be exposed or not to the training if a group of patients all see the same doctor. How efficient are multifactorial experiments? https://www.rdatagen.net/post/so-how-efficient-are-multifactorial-experiments-part/ Wed, 02 May 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/so-how-efficient-are-multifactorial-experiments-part/ I recently described why we might want to conduct a multi-factorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test simultaneously the impact of multiple interventions using an overall sample size that would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome. Testing multiple interventions in a single experiment https://www.rdatagen.net/post/testing-many-interventions-in-a-single-experiment/ Thu, 19 Apr 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/testing-many-interventions-in-a-single-experiment/ A reader recently inquired about functions in simstudy that could generate data for a balanced multi-factorial design. I had to report that nothing really exists. 
A few weeks later, a colleague of mine asked if I could help estimate the appropriate sample size for a study that plans to use a multi-factorial design to choose among a set of interventions to improve rates of smoking cessation. In the course of exploring this, I realized it would be super helpful if the function suggested by the reader actually existed. Exploring the underlying theory of the chi-square test through simulation - part 2 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ Sun, 25 Mar 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/ In the last post, I tried to provide a little insight into the chi-square test. In particular, I used simulation to demonstrate the relationship between the Poisson distribution of counts and the chi-squared distribution. The key point in that post was the role conditioning plays in that relationship by reducing variance. To motivate some of the key issues, I talked a bit about recycling. I asked you to imagine a set of bins placed in different locations to collect glass bottles. Exploring the underlying theory of the chi-square test through simulation - part 1 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/ Sun, 18 Mar 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/ Kids today are so sophisticated (at least they are in New York City, where I live). While I didn’t hear about the chi-square test of independence until my first stint in graduate school, they’re already talking about it in high school. When my kids came home and started talking about it, I did what I usually do when they come home asking about a new statistical concept. 
I opened up R and started generating some data. Another reason to be careful about what you control for https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/ Wed, 07 Mar 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/ Modeling data without any underlying causal theory can sometimes lead you down the wrong path, particularly if you are interested in understanding the way things work rather than making predictions. A while back, I described what can go wrong when you control for a mediator when you are interested in an exposure and an outcome. Here, I describe the potential biases that are introduced when you inadvertently control for a variable that turns out to be a collider. “I have to randomize by cluster. Is it OK if I only have 6 sites?" https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/ Wed, 21 Feb 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/ The answer is probably no, because there is a not-so-low chance (perhaps considerably higher than 5%) you will draw the wrong conclusions from the study. I have heard variations on this question not so infrequently, so I thought it would be useful (of course) to do a few quick simulations to see what happens when we try to conduct a study under these conditions. (Another question I get every so often, after a study has failed to find an effect: “can we get a post-hoc estimate of the power? Have you ever asked yourself, "how should I approach the classic pre-post analysis?" 
https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/ Sun, 28 Jan 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/ Well, maybe you haven’t, but this seems to come up all the time. An investigator wants to assess the effect of an intervention on an outcome. Study participants are randomized either to receive the intervention (could be a new drug, new protocol, behavioral intervention, whatever) or treatment as usual. For each participant, the outcome measure is recorded at baseline - this is the pre in pre/post analysis. The intervention is delivered (or not, in the case of the control group), some time passes, and the outcome is measured a second time. Importance sampling adds an interesting twist to Monte Carlo simulation https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/ Thu, 18 Jan 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/ I’m contemplating the idea of teaching a course on simulation next fall, so I have been exploring various topics that I might include. (If anyone has great ideas either because you have taught such a course or taken one, definitely drop me a note.) Monte Carlo (MC) simulation is an obvious one. I like the idea of talking about importance sampling, because it sheds light on the idea that not all MC simulations are created equally. 
Simulating a cost-effectiveness analysis to highlight new functions for generating correlated data https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/ Mon, 08 Jan 2018 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/ My dissertation work (which I only recently completed - in 2012 - even though I am not exactly young, a whole story on its own) focused on inverse probability weighting methods to estimate a causal cost-effectiveness model. I don’t really do any cost-effectiveness analysis (CEA) anymore, but it came up very recently when some folks in the Netherlands contacted me about using simstudy to generate correlated (and clustered) data to compare different approaches to estimating cost-effectiveness. When there's a fork in the road, take it. Or, taking a look at marginal structural models. https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/ Mon, 11 Dec 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/ I am going to cut right to the chase, since this is the third of three posts related to confounding and weighting, and it’s kind of a long one. (If you want to catch up, the first two are here and here.) My aim with these three posts is to provide a basic explanation of the marginal structural model (MSM) and how we should interpret the estimates. This is obviously a very rich topic with a vast literature, so if you remain interested in the topic, I recommend checking out this (as of yet unpublished) textbook by Hernán & Robins for starters. When you use inverse probability weighting for estimation, what are the weights actually doing? 
https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/ Mon, 04 Dec 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/ Towards the end of Part 1 of this short series on confounding, IPW, and (hopefully) marginal structural models, I talked a little bit about the fact that inverse probability weighting (IPW) can provide unbiased estimates of marginal causal effects in the context of confounding just as more traditional regression models like OLS can. I used an example based on a normally distributed outcome. Now, that example wasn’t super interesting, because in the case of a linear model with homogeneous treatment effects (i. Characterizing the variance for clustered data that are Gamma distributed https://www.rdatagen.net/post/icc-for-gamma-distribution/ Mon, 27 Nov 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/icc-for-gamma-distribution/ Way back when I was studying algebra and wrestling with one word problem after another (I think now they call them story problems), I complained to my father. He laughed and told me to get used to it. “Life is one big word problem,” is how he put it. Well, maybe one could say any statistical analysis is really just some form of multilevel data analysis, whether we treat it that way or not. Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects https://www.rdatagen.net/post/potential-outcomes-confounding/ Thu, 16 Nov 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/potential-outcomes-confounding/ When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. 
My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias. Confounding, potential outcomes, and causal effects Typically, we think of a confounder as a factor that influences both exposure and outcome. A simstudy update provides an excuse to generate and display Likert-type data https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/ Tue, 07 Nov 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/ I just updated simstudy to version 0.1.7. It is available on CRAN. To mark the occasion, I wanted to highlight a new function, genOrdCat, which puts into practice some code that I presented a little while back as part of a discussion of ordinal logistic regression. The new function was motivated by a reader/researcher who came across my blog while wrestling with a simulation study. After a little back and forth about how to generate ordinal categorical data, I ended up with a function that might be useful. Thinking about different ways to analyze sub-groups in an RCT https://www.rdatagen.net/post/sub-group-analysis-in-rct/ Wed, 01 Nov 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/sub-group-analysis-in-rct/ Here’s the scenario: we have an intervention that we think will improve outcomes for a particular population. Furthermore, there are two sub-groups (let’s say defined by which of two medical conditions each person in the population has) and we are interested in knowing if the intervention effect is different for each sub-group. And here’s the question: what is the ideal way to set up a study so that we can assess (1) the intervention effects on the group as a whole, but also (2) the sub-group specific intervention effects? 
Who knew likelihood functions could be so pretty? https://www.rdatagen.net/post/mle-can-be-pretty/ Mon, 23 Oct 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/mle-can-be-pretty/ I just released a new iteration of simstudy (version 0.1.6), which fixes a bug or two and adds several spline related routines (available on CRAN). The previous post focused on using spline curves to generate data, so I won’t repeat myself here. And, apropos of nothing really - I thought I’d take the opportunity to do a simple simulation to briefly explore the likelihood function. It turns out if we generate lots of them, it can be pretty, and maybe provide a little insight. Can we use B-splines to generate non-linear data? https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/ Mon, 16 Oct 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/ I’m exploring the idea of adding a function or set of functions to the simstudy package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit to use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated. A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/ Thu, 05 Oct 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/ I just updated simstudy to version 0.1.5 (available on CRAN) so that it now includes several new distributions - exponential, discrete uniform, and negative binomial. 
As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete (integer) distribution of outcomes of non-negative values that is often used to describe count outcomes. It is characterized by a mean (or rate) and its variance equals its mean. CACE closed: EM opens up exclusion restriction (among other things) https://www.rdatagen.net/post/em-estimation-of-cace/ Thu, 28 Sep 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/em-estimation-of-cace/ This is the third, and probably last, of a series of posts touching on the estimation of complier average causal effects (CACE) and latent variable modeling techniques using an expectation-maximization (EM) algorithm. What follows is a simplistic way to implement an EM algorithm in R to do principal strata estimation of CACE. The EM algorithm In this approach, we assume that individuals fall into one of three possible groups - never-takers, always-takers, and compliers - but we cannot see who is who (except in a couple of cases). A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/ Wed, 20 Sep 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/ I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. 
But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models. Complier average causal effect? Exploring what we learn from an RCT with participants who don't do what they are told https://www.rdatagen.net/post/cace-explored/ Tue, 12 Sep 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/cace-explored/ Inspired by a free online course titled Complier Average Causal Effects (CACE) Analysis and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming. Further considerations of a hidden process underlying categorical responses https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/ Tue, 05 Sep 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/ In my previous post, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have multiple outcomes, but with a slight twist - these groups of interest can be interpreted in an ordered way. A hidden process behind binary or other categorical outcomes? 
https://www.rdatagen.net/post/ordinal-regression/ Mon, 28 Aug 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/ordinal-regression/ I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior. Be careful not to control for a post-exposure covariate https://www.rdatagen.net/post/be-careful/ Mon, 21 Aug 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/be-careful/ A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also included a measurement for poverty in adulthood - believing it was somehow confounding the relationship of trauma and substance use. Should we be concerned about incidence - prevalence bias? https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/ Wed, 09 Aug 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/ Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. 
My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results - patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches on this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes. Using simulation for power analysis: an example based on a stepped wedge study design https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/ Mon, 10 Jul 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/ Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which can be a bit painful. simstudy update: two new functions that generate correlated observations from non-normal distributions https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/ Wed, 05 Jul 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/ In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy with functions that would make generating these kinds of data relatively painless. 
Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, Poisson, gamma, and uniform distributions: genCorGen and addCorGen. Balancing on multiple factors when the sample is too small to stratify https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/ Mon, 26 Jun 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/ Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization. Copulas and correlated data generation: getting beyond the normal distribution https://www.rdatagen.net/post/correlated-data-copula/ Mon, 19 Jun 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/correlated-data-copula/ Using the simstudy package, it’s possible to generate correlated data from a normal distribution using the function genCorData. I’ve wanted to extend the functionality so that we can generate correlated data from other sorts of distributions; I thought it would be a good idea to begin with binary and Poisson distributed data, since those come up so frequently in my work. simstudy can already accommodate more general correlated data, but only in the context of a random effects data generation process. 
When marginal and conditional logistic model estimates diverge https://www.rdatagen.net/post/marginal-v-conditional/ Fri, 09 Jun 2017 00:00:00 +0000 keith.goldfeld@nyumc.org (Keith Goldfeld) https://www.rdatagen.net/post/marginal-v-conditional/ Say we have an intervention that is assigned at a group or cluster level but the outcome is measured at an individual level (e.g. students in different schools, eyes on different individuals). And, say this outcome is binary; that is, something happens, or it doesn’t. (This is important, because none of this is true if the outcome is continuous and close to normally distributed.) If we want to measure the effect of the intervention - perhaps the risk difference, risk ratio, or odds ratio - it can really matter if we are interested in the marginal effect or the conditional effect, because they likely won’t be the same.
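The gap between marginal and conditional effects described in the item above can be seen without any simulation. The following is a rough sketch under stated assumptions, not the post's own code: the three cluster-specific intercepts and the conditional log odds ratio of 1 are hypothetical, clusters are assumed to be equally sized, and the language is Python rather than R. Every cluster shares the same conditional odds ratio, yet the odds ratio computed from the population-averaged risks is smaller.

```python
import math


def expit(x):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical cluster-level baseline log-odds (equally sized clusters assumed).
CLUSTER_INTERCEPTS = [-2.0, 0.0, 2.0]
# Hypothetical treatment effect on the log-odds scale, identical in every cluster.
CONDITIONAL_LOG_OR = 1.0

n_clusters = len(CLUSTER_INTERCEPTS)

# Population-averaged (marginal) risks: average the cluster-specific risks.
p_control = sum(expit(b) for b in CLUSTER_INTERCEPTS) / n_clusters
p_treated = sum(expit(b + CONDITIONAL_LOG_OR) for b in CLUSTER_INTERCEPTS) / n_clusters

conditional_or = math.exp(CONDITIONAL_LOG_OR)
marginal_or = odds(p_treated) / odds(p_control)

print(f"conditional odds ratio: {conditional_or:.3f}")
print(f"marginal odds ratio:    {marginal_or:.3f}")
```

The marginal odds ratio comes out closer to 1 than the conditional value of about 2.72. Spreading the hypothetical intercepts further apart widens the gap, and setting them all equal makes the two odds ratios coincide.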