R on ouR data generation
https://www.rdatagen.net/tags/r/
Recent content in R on ouR data generation
keith.goldfeld@nyumc.org (Keith Goldfeld)
Tue, 13 Oct 2020 00:00:00 +0000
simstudy just got a little more dynamic: version 0.2.1
https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/
Tue, 13 Oct 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-just-got-a-little-more-dynamic-version-0-2-0/simstudy version 0.2.1 has just been submitted to CRAN. Along with this release, the big news is that I’ve been joined by Jacob Wujciak-Jens as a co-author of the package. He initially reached out to me from Germany with some suggestions for improvements, we had a little back and forth, and now here we are. He has substantially reworked the underbelly of simstudy, making the package much easier to maintain, and positioning it for much easier extension.Permuted block randomization using simstudy
https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/
Tue, 29 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permuted-block-randomization-using-simstudy/Along with preparing power analyses and statistical analysis plans (SAPs), generating study randomization lists is something a practicing biostatistician is occasionally asked to do. While not a particularly interesting activity, it offers the opportunity to tackle a small programming challenge. The title is a little misleading because you should probably skip all this and just use the blockrand package if you want to generate randomization schemes; don’t try to reinvent the wheel.Generating probabilities for ordinal categorical data
https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/
Tue, 15 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-probabilities-for-ordinal-categorical-data/Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently here. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.Diagnosing and dealing with degenerate estimation in a Bayesian meta-analysis
https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/
Tue, 01 Sep 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/diagnosing-and-dealing-with-estimation-issues-in-the-bayesian-meta-analysis/The federal government recently granted emergency approval for the use of antibody rich blood plasma when treating hospitalized COVID-19 patients. This announcement is unfortunate, because we really don’t know if this promising treatment works. The best way to determine this, of course, is to conduct an experiment, though this approval makes this more challenging to do; with the general availability of convalescent plasma (CP), there may be resistance from patients and providers against participating in a randomized trial.Generating data from a truncated distribution
https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/
Tue, 18 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-data-from-a-truncated-distribution/A researcher reached out to me the other day to see if the simstudy package provides a quick and easy way to generate data from a truncated distribution. Other than the noZeroPoisson distribution option (which is a very specific truncated distribution), there is no way to do this directly. You can always generate data from the full distribution and toss out the observations that fall outside of the truncation range, but this is not exactly efficient, and in practice can get a little messy.A hurdle model for COVID-19 infections in nursing homes
https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/
Tue, 04 Aug 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hurdle-model-for-covid-19-infections-in-nursing-homes-sample-size-considerations/Late last year, I added a mixture distribution to the simstudy package, largely motivated to accommodate zero-inflated Poisson or negative binomial distributions. (I really thought I had added this two years ago - but time is moving so slowly these days.) These distributions are useful when modeling count data, but we anticipate observing more than the expected frequency of zeros that would arise from a non-inflated (i.e. “regular”) Poisson or negative binomial distribution.A Bayesian model for a simulated meta-analysis
https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/
Tue, 21 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-bayesian-model-for-a-simulated-meta-analysis/This is essentially an addendum to the previous post where I simulated data from multiple RCTs to explore an analytic method to pool data across different studies. In that post, I used the nlme package to conduct a meta-analysis based on individual level data of 12 studies. Here, I am presenting an alternative hierarchical modeling approach that uses the Bayesian package rstan.
Create the data set We’ll use the exact same data generating process as described in some detail in the previous post.Simulating multiple RCTs to simulate a meta-analysis
https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/
Tue, 07 Jul 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-mutliple-studies-to-simulate-a-meta-analysis/I am currently involved with an RCT that is struggling to recruit eligible patients (by no means an unusual problem), increasing the risk that findings might be inconclusive. A possible solution to this conundrum is to find similar, ongoing trials with the aim of pooling data in a single analysis, to conduct a meta-analysis of sorts.
In an ideal world, this theoretical collection of sites would have joined forces to develop a single study protocol, but often there is no structure or funding mechanism to make that happen.Consider a permutation test for a small pilot study
https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/
Tue, 23 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/permutation-test-for-a-covid-19-pilot-nursing-home-study/Recently I wrote about the challenges of trying to learn too much from a small pilot study, even if it is a randomized controlled trial. There are limitations on how much you can learn about a treatment effect given the small sample size and relatively high variability of the estimate. However, the temptation for researchers is usually just too great; it is only natural to want to see if there is any kind of signal of an intervention effect, even though the pilot study is focused on questions of feasibility and acceptability.When proportional odds is a poor assumption, collapsing categories is probably not going to save you
https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/
Tue, 09 Jun 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/more-fun-with-ordinal-scales-combining-categories-may-not-make-solve-the-problem-of-non-proportionality/Continuing the discussion on cumulative odds models I started last time, I want to investigate a solution I always assumed would help mitigate a failure to meet the proportional odds assumption. I’ve believed if there is a large number of categories and the relative cumulative odds between two groups don’t appear proportional across all categorical levels, then a reasonable approach is to reduce the number of categories. In other words, fewer categories translates to proportional odds.Considering the number of categories in an ordinal outcome
https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/
Tue, 26 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/the-advantage-of-increasing-the-number-of-categories-in-an-ordinal-outcome/In two Covid-19-related trials I’m involved with, the primary or key secondary outcome is the status of a patient at 14 days based on a World Health Organization ordered rating scale. In this particular ordinal scale, there are 11 categories ranging from 0 (uninfected) to 10 (death). In between, a patient can be infected but well enough to remain at home, hospitalized with milder symptoms, or hospitalized with severe disease.To stratify or not? It might not actually matter...
https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/
Tue, 12 May 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-stratify-or-not-to-stratify/Continuing with the theme of exploring small issues that come up in trial design, I recently used simulation to assess the impact of stratifying (or not) in the context of a multi-site Covid-19 trial with a binary outcome. The investigators are concerned that baseline health status will affect the probability of an outcome event, and are interested in randomizing by health status. The goal is to ensure balance across the two treatment arms with respect to this important variable.Simulation for power in designing cluster randomized trials
https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/
Tue, 28 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulation-for-power-calculations-in-designing-cluster-randomized-trials/As a biostatistician, I like to be involved in the design of a study as early as possible. I always like to say that I hope one of the first conversations an investigator has is with me, so that I can help clarify the research questions before getting into the design questions related to measurement, unit of randomization, and sample size. In the worst case scenario - and this actually doesn’t happen to me any more - a researcher would approach me after everything is done except the analysis.Yes, unbalanced randomization can improve power, in some situations
https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/
Tue, 14 Apr 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/unbalanced-randomization-can-improve-power-in-some-situations/Last time I provided some simulations that suggested that there might not be any efficiency-related benefits to using unbalanced randomization when the outcome is binary. This is a quick follow-up to provide a counter-example where the outcome in a two-group comparison is continuous. If the groups have different amounts of variability, intuitively it makes sense to allocate more patients to the more variable group. Doing this should reduce the variability in the estimate of the mean for that group, which in turn could improve the power of the test.Can unbalanced randomization improve power?
https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/
Tue, 31 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/can-unbalanced-randomization-improve-power/Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.When you want more than a chi-squared test, consider a measure of association
https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/
Tue, 17 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-a-chi-squared-statistic-is-not-enough-a-measure-of-association-for-contingency-tables/In my last post, I made the point that p-values should not necessarily be considered sufficient evidence (or evidence at all) in drawing conclusions about associations we are interested in exploring. When it comes to contingency tables that represent the outcomes for two categorical variables, it isn’t so obvious what measure of association should augment (or replace) the \(\chi^2\) statistic.
I described a model-based measure of effect to quantify the strength of an association in the particular case where one of the categorical variables is ordinal.Alternatives to reporting a p-value: the case of a contingency table
https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/
Tue, 03 Mar 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/to-report-a-p-value-or-not-the-case-of-a-contingency-table/I frequently find myself in discussions with collaborators about the merits of reporting p-values, particularly in the context of pilot studies or exploratory analysis. Over the past several years, the American Statistical Association has made several strong statements about the need to consider approaches that measure the strength of evidence or uncertainty that don’t necessarily rely on p-values. In 2016, the ASA attempted to clarify the proper use and interpretation of the p-value by highlighting key principles “that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.Clustered randomized trials and the design effect
https://www.rdatagen.net/post/what-exactly-is-the-design-effect/
Tue, 18 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-exactly-is-the-design-effect/I am always saying that simulation can help illuminate interesting statistical concepts or ideas. The design effect that underlies much of clustered analysis could benefit from a little exploration through simulation. I’ve written about cluster-related methods so much on this blog that I won’t provide links - just peruse the list of entries on the home page and you are sure to spot a few. But, I haven’t written explicitly about the design effect.Analysing an open cohort stepped-wedge clustered trial with repeated individual binary outcomes
https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/
Tue, 04 Feb 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-the-open-cohort-stepped-wedge-trial-with-binary-outcomes/I am currently wrestling with how to analyze data from a stepped-wedge designed cluster randomized trial. A few factors make this analysis particularly interesting. First, we want to allow for the possibility that between-period site-level correlation will decrease (or decay) over time. Second, there is possibly additional clustering at the patient level since individual outcomes will be measured repeatedly over time. And third, given that these outcomes are binary, there are no obvious software tools that can handle generalized linear models with this particular variance structure we want to model.A brief account (via simulation) of the ROC (and its AUC)
https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/
Tue, 21 Jan 2020 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-simple-explanation-of-what-the-roc-and-auc-represent/The ROC (receiver operating characteristic) curve visually depicts the ability of a measure or classification model to distinguish two groups. The area under the ROC (AUC) quantifies the extent of that ability. My goal here is to describe as simply as possible a process that serves as a foundation for the ROC, and to provide an interpretation of the AUC that is defined by that curve.
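One common way to read the AUC is as the probability that a randomly chosen member of the "condition" group scores higher than a randomly chosen member of the other group. The sketch below illustrates that reading; it is not the code from the post. The two normal distributions, the sample sizes, and the function name `auc_by_pairs` are all assumptions made for illustration.

```python
import random


def auc_by_pairs(cases, controls):
    """Share of (case, control) pairs in which the case scores higher.

    Ties count as half a win. This pairwise share equals the area
    under the empirical ROC curve.
    """
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical test scores: the two groups overlap, with cases shifted upward.
rng = random.Random(2020)
controls = [rng.gauss(0.0, 1.0) for _ in range(500)]
cases = [rng.gauss(1.0, 1.0) for _ in range(500)]

auc = auc_by_pairs(cases, controls)
print(f"estimated AUC: {auc:.3f}")
```

With a one standard deviation shift between the groups, the estimate should sit well above 0.5 (no discrimination) and well below 1 (perfect separation).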
A prediction problem The classic application for the ROC is a medical test designed to identify individuals with a particular medical condition or disease.Repeated measures can improve estimation when we only care about a single endpoint
https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/
Tue, 10 Dec 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-repeated-measures-might-improve-effect-estimation-even-when-single-endpoint-is-the-focus/I’m participating in the design of a new study that will evaluate interventions aimed at reducing both pain and opioid use for patients on dialysis. This study is likely to be somewhat complicated, possibly involving multiple clusters, multiple interventions, a sequential and/or adaptive randomization scheme, and a composite binary outcome. I’m not going into any of that here.
There is one issue that should be fairly generalizable to other studies.Adding a "mixture" distribution to the simstudy package
https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/
Tue, 26 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/adding-mixture-distributions-to-simstudy/I am contemplating adding a new distribution option to the package simstudy that would allow users to define a new variable as a mixture of previously defined (or already generated) variables. I think the easiest way to explain how to apply the new mixture option is to step through a few examples and see it in action.
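The idea of a mixture, a random draw from a set of variables based on a defined set of probabilities, can be sketched in a few lines. This is not simstudy syntax. The function `draw_mixture`, the two component variables, and the 0.3/0.7 probabilities are hypothetical, chosen only to show the mechanism.

```python
import random


def draw_mixture(values, probs, rng):
    """Return one of the already-generated values, chosen with the given probabilities."""
    return rng.choices(values, weights=probs, k=1)[0]


rng = random.Random(2019)
mixed = []
for _ in range(10000):
    x1 = rng.gauss(0.0, 1.0)   # hypothetical first component, centered at 0
    x2 = rng.gauss(10.0, 1.0)  # hypothetical second component, centered at 10
    mixed.append(draw_mixture([x1, x2], [0.3, 0.7], rng))

# The components are far apart, so values above 5 almost surely came from x2.
share_from_x2 = sum(value > 5 for value in mixed) / len(mixed)
print(f"share drawn from x2: {share_from_x2:.3f}")
```

Because each record keeps only one of its component values, the resulting variable is bimodal rather than a weighted average of the two components.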
Specifying the “mixture” distribution As defined here, a mixture of variables is a random draw from a set of variables based on a defined set of probabilities.What can we really expect to learn from a pilot study?
https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/
Tue, 12 Nov 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-can-we-really-expect-to-learn-from-a-pilot-study/I am involved with a very interesting project - the NIA IMPACT Collaboratory - where a primary goal is to fund a large group of pragmatic pilot studies to investigate promising interventions to improve health care and quality of life for people living with Alzheimer’s disease and related dementias. One of my roles on the project team is to advise potential applicants on the development of their proposals. In order to provide helpful advice, it is important that we understand what we should actually expect to learn from a relatively small pilot study of a new intervention.Any one interested in a function to quickly generate data with many predictors?
https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/
Tue, 29 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/any-one-interested-in-a-function-to-quickly-generate-data-with-many-predictors/A couple of months ago, I was contacted about the possibility of creating a simple function in simstudy to generate a large dataset that could include possibly 10’s or 100’s of potential predictors and an outcome. In this function, only a subset of the variables would actually be predictors. The idea is to be able to easily generate data for exploring ridge regression, Lasso regression, or other “regularization” methods. Alternatively, this can be used to very quickly generate correlated data (with one line of code) without going through the definition process.Selection bias, death, and dying
https://www.rdatagen.net/post/selection-bias-death-and-dying/
Tue, 15 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/selection-bias-death-and-dying/I am collaborating with a number of folks who think a lot about palliative or supportive care for people who are facing end-stage disease, such as advanced dementia, cancer, COPD, or congestive heart failure. A major concern for this population (which really includes just about everyone at some point) is the quality of life at the end of life and what kind of experiences, including interactions with the health care system, they have (and don’t have) before death.There's always at least two ways to do the same thing: an example generating 3-level hierarchical data using simstudy
https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/
Thu, 03 Oct 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/in-simstudy-as-in-r-there-s-always-at-least-two-ways-to-do-the-same-thing/“I am working on a simulation study that requires me to generate data for individuals within clusters, but each individual will have repeated measures (say baseline and two follow-ups). I’m new to simstudy and have been going through the examples in R this afternoon, but I wondered if this was possible in the package, and if so whether you could offer any tips to get me started with how I would do this?Simulating an open cohort stepped-wedge trial
https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/
Tue, 17 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-an-open-cohort-stepped-wedge-trial/In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with stepped-wedge designs, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time.Analyzing a binary outcome arising out of within-cluster, pair-matched randomization
https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/
Tue, 03 Sep 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/analyzing-a-binary-outcome-in-a-study-with-within-cluster-pair-matched-randomization/A key motivating factor for the simstudy package and much of this blog is that simulation can be super helpful in understanding how best to approach an unusual, or at least unfamiliar, analytic problem. About six months ago, I described the DREAM Initiative (Diabetes Research, Education, and Action for Minorities), a study that used a slightly innovative randomization scheme to ensure that two comparison groups were evenly balanced across important covariates.simstudy updated to version 0.1.14: implementing Markov chains
https://www.rdatagen.net/post/simstudy-1-14-update/
Tue, 20 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-1-14-update/I’m developing study simulations that require me to generate a sequence of health status for a collection of individuals. In these simulations, individuals gradually grow sicker over time, though sometimes they recover slightly. To facilitate this, I am using a stochastic Markov process, where the probability of a health status at a particular time depends only on the previous health status (in the immediate past). While there are packages to do this sort of thing (see for example the markovchain package), I hadn’t yet stumbled upon them while I was tackling my problem.Bayes models for estimation in stepped-wedge trials with non-trivial ICC patterns
https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/
Tue, 06 Aug 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/bayes-model-to-estimate-stepped-wedge-trial-with-non-trivial-icc-structure/Continuing a series of posts discussing the structure of intra-cluster correlations (ICC’s) in the context of a stepped-wedge trial, this latest edition is primarily interested in fitting Bayesian hierarchical models for more complex cases (though I do talk a bit more about the linear mixed effects models). The first two posts in the series focused on generating data to simulate various scenarios; the third post considered linear mixed effects and Bayesian hierarchical models to estimate ICC’s under the simplest scenario of constant between-period ICC’s.Estimating treatment effects (and ICCs) for stepped-wedge designs
https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/
Tue, 16 Jul 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/estimating-treatment-effects-and-iccs-for-stepped-wedge-designs/In the last two posts, I introduced the notion of time-varying intra-cluster correlations in the context of stepped-wedge study designs. (See here and here). Though I generated lots of data for those posts, I didn’t fit any models to see if I could recover the estimates and any underlying assumptions. That’s what I am doing now.
My focus here is on the simplest case, where the ICC’s are constant over time and between time.More on those stepped-wedge design assumptions: varying intra-cluster correlations over time
https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/
Tue, 09 Jul 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/varying-intra-cluster-correlations-over-time/In my last post, I wrote about within- and between-period intra-cluster correlations in the context of stepped-wedge cluster randomized study designs. These are quite important to understand when figuring out sample size requirements (and models for analysis, which I’ll be writing about soon.) Here, I’m extending the constant ICC assumption I presented last time around by introducing some complexity into the correlation structure. Much of the code I am using can be found in last week’s post, so if anything seems a little unclear, hop over here.Planning a stepped-wedge trial? Make sure you know what you're assuming about intra-cluster correlations ...
https://www.rdatagen.net/post/intra-cluster-correlations-over-time/
Tue, 25 Jun 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/intra-cluster-correlations-over-time/A few weeks ago, I was at the annual meeting of the NIH Collaboratory, which is an innovative collection of collaboratory cores, demonstration projects, and NIH Institutes and Centers that is developing new models for implementing and supporting large-scale health services research. A study I am involved with - Primary Palliative Care for Emergency Medicine - is one of the demonstration projects in this collaboratory.
The second day of this meeting included four panels devoted to the design and analysis of embedded pragmatic clinical trials, and focused on the challenges of conducting rigorous research in the real-world context of a health delivery system.Don't get too excited - it might just be regression to the mean
https://www.rdatagen.net/post/regression-to-the-mean/
Tue, 11 Jun 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/regression-to-the-mean/It is always exciting to find an interesting pattern in the data that seems to point to some important difference or relationship. A while ago, one of my colleagues shared a figure with me that looked something like this:
It looks like something is going on. On average low scorers in the first period increased a bit in the second period, and high scorers decreased a bit. Something is going on, but nothing specific to the data in question; it is just probability working its magic.simstudy update - stepped-wedge design treatment assignment
https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/
Tue, 28 May 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-stepped-wedge-treatment-assignment/simstudy has just been updated (version 0.1.13 on CRAN), and includes one interesting addition (and a couple of bug fixes). I am working on a post (or two) about intra-cluster correlations (ICCs) and stepped-wedge study designs (which I’ve written about before), and I was getting tired of going through the convoluted process of generating data from a time-dependent treatment assignment process. So, I wrote a new function, trtStepWedge, that should simplify things.Generating and modeling over-dispersed binomial data
https://www.rdatagen.net/post/overdispersed-binomial-data/
Tue, 14 May 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/overdispersed-binomial-data/A couple of weeks ago, I was inspired by a study to write about a classic design issue that arises in cluster randomized trials: should we focus on the number of clusters or the size of those clusters? This trial, which is concerned with preventing opioid use disorder for at-risk patients in primary care clinics, has also motivated this second post, which concerns another important issue - over-dispersion.
A count outcome In this study, one of the primary outcomes is the number of days of opioid use over a six-month follow-up period (to be recorded monthly by patient-report and aggregated for the six-month measure).What matters more in a cluster randomized trial: number or size?
https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/
Tue, 30 Apr 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-matters-more-in-a-cluster-randomized-trial-number-or-size/I am involved with a trial of an intervention designed to prevent full-blown opioid use disorder for patients who may have an incipient opioid use problem. Given the nature of the intervention, it was clear the only feasible way to conduct this particular study is to randomize at the physician rather than the patient level.
There was a concern that the number of patients eligible for the study might be limited, so that each physician might only have a handful of patients able to participate, if that many.Musings on missing data
https://www.rdatagen.net/post/musings-on-missing-data/
Tue, 02 Apr 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/musings-on-missing-data/I’ve been meaning to share an analysis I recently did to estimate the strength of the relationship between a young child’s ability to recognize emotions in others (e.g. teachers and fellow students) and her longer term academic success. The study itself is quite interesting (hopefully it will be published sometime soon), but I really wanted to write about it here as it involved the challenging problem of missing data in the context of heterogeneous effects (different across sub-groups) and clustering (by schools).A case where prospective matching may limit bias in a randomized trial
https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/
Tue, 12 Mar 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-case-where-prospecitve-matching-may-limit-bias/Analysis is important, but study design is paramount. I am involved with the Diabetes Research, Education, and Action for Minorities (DREAM) Initiative, which is, among other things, estimating the effect of a group-based therapy program on weight loss for patients who have been identified as pre-diabetic (which means they have elevated HbA1c levels). The original plan was to randomize patients at a clinic to treatment or control, and then follow up with those assigned to the treatment group to see if they wanted to participate.An example in causal inference designed to frustrate: an estimate pretty much guaranteed to be biased
https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/
Tue, 26 Feb 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/dags-colliders-and-an-example-of-variance-bias-tradeoff/I am putting together a brief lecture introducing causal inference for graduate students studying biostatistics. As part of this lecture, I thought it would be helpful to spend a little time describing directed acyclic graphs (DAGs), since they are an extremely helpful tool for communicating assumptions about the causal relationships underlying a researcher’s data.
The strength of DAGs is that they help us think about how these underlying relationships in the data might lead to biases in causal effect estimation, and suggest ways to estimate causal effects that eliminate these biases.Using the uniform sum distribution to introduce probability
https://www.rdatagen.net/post/a-fun-example-to-explore-probability/
Tue, 05 Feb 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-fun-example-to-explore-probability/I’ve never taught an intro probability/statistics course. If I ever did, I would certainly want to bring the underlying wonder of the subject to life. I’ve always found it almost magical the way mathematical formulation can be mirrored by computer simulation, the way proof can be guided by observed data generation processes, and the way DGPs can confirm analytic solutions.
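As a minimal sketch of that idea (not the problem the post goes on to pose, which this excerpt does not show), base R can check a known analytic result by simulation. For the sum S of two independent Uniform(0, 1) draws, the probability that S is at most 0.5 is 0.5^2 / 2 = 0.125. The seed, sample size, and variable names below are illustrative assumptions.

```r
# Illustrative stand-in: compare a simulated probability with its analytic value.
set.seed(2019)                 # arbitrary seed, for reproducibility only
n <- 1e5                       # number of simulated sums (assumed)

s <- runif(n) + runif(n)       # sum of two independent Uniform(0, 1) draws

analytic  <- 0.5^2 / 2         # P(S <= 0.5): area of a triangle with legs 0.5
simulated <- mean(s <= 0.5)    # proportion of simulated sums at or below 0.5

print(c(analytic = analytic, simulated = simulated))
```

With a sample this large, the simulated proportion should sit very close to the analytic value, which is the kind of agreement between formula and simulation described above.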
I would like to begin such a course with a somewhat unusual but accessible problem that would evoke these themes from the start.Correlated longitudinal data with varying time intervals
https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/
Tue, 22 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/correlated-longitudinal-data-with-varying-time-intervals/I was recently contacted to see if simstudy can create a data set of correlated outcomes that are measured over time, but at different intervals for each individual. The quick answer is there is no specific function to do this. However, if you are willing to assume an “exchangeable” correlation structure, where measurements far apart in time are just as correlated as measurements taken close together, then you could just generate individual-level random effects (intercepts and/or slopes) and pretty much call it a day.Considering sensitivity to unmeasured confounding: part 2
https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/
Thu, 10 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding-ii/In part 1 of this 2-part series, I introduced the notion of sensitivity to unmeasured confounding in the context of an observational data analysis. I argued that an estimate of an association between an observed exposure \(D\) and outcome \(Y\) is sensitive to unmeasured confounding if we can conceive of a reasonable alternative data generating process (DGP) that includes some unmeasured confounder that will generate the same distribution as the observed data.Considering sensitivity to unmeasured confounding: part 1
https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/
Wed, 02 Jan 2019 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/Principled causal inference methods can be used to compare the effects of different exposures or treatments we have observed in non-experimental settings. These methods, which include matching (with or without propensity scores), inverse probability weighting, and various g-methods, help us create comparable groups to simulate a randomized experiment. All of these approaches rely on a key assumption of no unmeasured confounding. The problem is, short of subject matter knowledge, there is no way to test this assumption empirically.Parallel processing to add a little zip to power simulations (and other replication studies)
https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/
Mon, 10 Dec 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/parallel-processing-to-add-a-little-zip-to-power-simulations/It’s always nice to be able to speed things up a bit. My first blog post ever described an approach using Rcpp to make huge improvements in a particularly intensive computational process. Here, I want to show how simple it is to speed things up by using the R package parallel and its function mclapply. I’ve been using this function more and more, so I want to explicitly demonstrate it in case anyone is wondering.Horses for courses, or to each model its own (causal effect)
https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/
Wed, 28 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/different-models-estimate-different-causal-effects-part-ii/In my previous post, I described a (relatively) simple way to simulate observational data in order to compare different methods to estimate the causal effect of some exposure or treatment on an outcome. The underlying data generating process (DGP) included a possibly unmeasured confounder and an instrumental variable. (If you haven’t already, you should probably take a quick look.)
A key point in considering causal effect estimation is that the average causal effect depends on the individuals included in the average.Generating data to explore the myriad causal effects that can be estimated in observational data analysis
https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/
Tue, 20 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-data-to-explore-the-myriad-causal-effects/I’ve been inspired by two recent talks describing the challenges of using instrumental variable (IV) methods. IV methods are used to estimate the causal effects of an exposure or intervention when there is unmeasured confounding. This estimated causal effect is very specific: the complier average causal effect (CACE). But, the CACE is just one of several possible causal estimands that we might be interested in. For example, there’s the average causal effect (ACE) that represents a population average (not just based on the subset of compliers).Causal mediation estimation measures the unobservable
https://www.rdatagen.net/post/causal-mediation/
Tue, 06 Nov 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/causal-mediation/I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit.
Quite often we are interested in understanding the relationship between an exposure or intervention and an outcome. Does exposure \(A\) (could be randomized or not) have an effect on outcome \(Y\)?Cross-over study design with a major constraint
https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/
Tue, 23 Oct 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-the-research-question-doesn-t-fit-nicely-into-a-standard-study-design/Every new study presents its own challenges. (I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention \(A\) better than doing nothing and (2) is \(A^+\), the slightly augmented version of \(A\), better than just \(A\)?In regression, we assume noise is independent of all measured predictors. What happens if it isn't?
https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/
Tue, 09 Oct 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/linear-regression-models-assume-noise-is-independent/A number of key assumptions underlie the linear regression model - among them linearity and normally distributed noise (error) terms with constant variance. In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model.
In this simple model:
\[Y_i = \beta_0 + \beta_1X_i + e_i,\]
\(Y_i\) has both a structural and stochastic (random) component. The structural component is the linear relationship of \(Y\) with \(X\).simstudy update: improved correlated binary outcomes
https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/
Tue, 25 Sep 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-to-version-0-1-10/An updated version of the simstudy package (0.1.10) is now available on CRAN. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last post, I described a beta-binomial data generating process that uses the recently added beta distribution. In addition to that update, I’ve added functionality to genCorGen and addCorGen, functions which generate correlated data from normal or non-Gaussian distributions such as Poisson, Gamma, and binary.Binary, beta, beta-binomial
https://www.rdatagen.net/post/binary-beta-beta-binomial/
Tue, 11 Sep 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/binary-beta-beta-binomial/I’ve been working on updates for the simstudy package. In the past few weeks, a couple of folks independently reached out to me about generating correlated binary data. One user was not impressed by the copula algorithm that is already implemented. I’ve added an option to use an algorithm developed by Emrich and Piedmonte in 1991, and will be incorporating that option soon in the functions genCorGen and addCorGen. I’ll write about that change at some point soon.The power of stepped-wedge designs
https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/
Tue, 28 Aug 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/alternatives-to-stepped-wedge-designs/Just before heading out on vacation last month, I put up a post that purported to compare stepped-wedge study designs with more traditional cluster randomized trials. Either because I rushed or was just lazy, I didn’t exactly do what I set out to do. I did confirm that a multi-site randomized clinical trial can be more efficient than a cluster randomized trial when there is variability across clusters. (I compared randomizing within a cluster with randomization by cluster.Multivariate ordinal categorical data generation
https://www.rdatagen.net/post/multivariate-ordinal-categorical-data-generation/
Wed, 15 Aug 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/multivariate-ordinal-categorical-data-generation/An economist contacted me about the ability of simstudy to generate correlated ordinal categorical outcomes. He is trying to generate data as an aid to teaching cost-effectiveness analysis, and is hoping to simulate responses to a quality-of-life survey instrument, the EQ-5D. The particular instrument has five questions related to mobility, self-care, activities, pain, and anxiety. Each item has three possible responses: (1) no problems, (2) some problems, and (3) a lot of problems.Randomize by, or within, cluster?
https://www.rdatagen.net/post/by-vs-within/
Thu, 19 Jul 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/by-vs-within/I am involved with a stepped-wedge designed study that is exploring whether we can improve care for patients with end-stage disease who show up in the emergency room. The plan is to train nurses and physicians in palliative care. (A while ago, I described what the stepped wedge design is.)
Under this design, 33 sites around the country will receive the training at some point, which is no small task (and fortunately, as the statistician, this is a part of the study in which I have little involvement).How the odds ratio confounds: a brief study in a few colorful figures
https://www.rdatagen.net/post/log-odds/
Tue, 10 Jul 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/log-odds/The odds ratio always confounds: while it may be constant across different groups or clusters, the risk ratios or risk differences across those groups may vary quite substantially. This makes it really hard to interpret an effect. And then there is inconsistency between marginal and conditional odds ratios, a topic I seem to be visiting frequently, most recently last month.
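The first point can be made concrete with a little arithmetic. The sketch below is not the post's code: the three baseline risks and the common odds ratio of 2 are hypothetical values chosen for illustration. It holds the odds ratio fixed across groups and shows how much the implied risk differences and risk ratios move.

```r
# Hypothetical inputs: baseline (control) risks in three groups and one shared odds ratio.
p0 <- c(0.05, 0.30, 0.70)
or <- 2

odds0 <- p0 / (1 - p0)         # control odds in each group
odds1 <- or * odds0            # treated odds under the common odds ratio
p1    <- odds1 / (1 + odds1)   # treated risk implied by those odds

res <- data.frame(
  p0 = p0,
  p1 = p1,
  rd = p1 - p0,                                # risk difference
  rr = p1 / p0,                                # risk ratio
  or = (p1 / (1 - p1)) / (p0 / (1 - p0))       # recovered odds ratio (constant)
)
print(round(res, 3))
```

The odds ratio column is identical in every row, while the risk difference and risk ratio columns are not, which is why a single odds ratio is hard to translate into an effect on the probability scale.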
My aim here is to generate a few figures that might highlight some of these issues.Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution
https://www.rdatagen.net/post/re-referencing-to-estimate-effects-when-there-is-interaction/
Tue, 26 Jun 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/re-referencing-to-estimate-effects-when-there-is-interaction/Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn.Late anniversary edition redux: conditional vs marginal models for clustered data
https://www.rdatagen.net/post/mixed-effect-models-vs-gee/
Wed, 13 Jun 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/mixed-effect-models-vs-gee/This afternoon, I was looking over some simulations I plan to use in an upcoming lecture on multilevel models. I created these examples a while ago, before I started this blog. But since it was just about a year ago that I first wrote about this topic (and started the blog), I thought I’d post this now to mark the occasion.
The code below provides another way to visualize the difference between marginal and conditional logistic regression models for clustered data (see here for an earlier post that discusses in greater detail some of the key issues raised here.A little function to help generate ICCs in simple clustered data
https://www.rdatagen.net/post/a-little-function-to-help-generate-iccs-in-simple-clustered-data/
Thu, 24 May 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-little-function-to-help-generate-iccs-in-simple-clustered-data/In health services research, experiments are often conducted at the provider or site level rather than the patient level. However, we might still be interested in the outcome at the patient level. For example, we could be interested in understanding the effect of a training program for physicians on their patients. It would be very difficult to randomize patients to be exposed or not to the training if a group of patients all see the same doctor.How efficient are multifactorial experiments?
https://www.rdatagen.net/post/so-how-efficient-are-multifactorial-experiments-part/
Wed, 02 May 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/so-how-efficient-are-multifactorial-experiments-part/I recently described why we might want to conduct a multi-factorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test simultaneously the impact of multiple interventions using an overall sample size that would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome.Testing multiple interventions in a single experiment
https://www.rdatagen.net/post/testing-many-interventions-in-a-single-experiment/
Thu, 19 Apr 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/testing-many-interventions-in-a-single-experiment/A reader recently inquired about functions in simstudy that could generate data for a balanced multi-factorial design. I had to report that nothing really exists. A few weeks later, a colleague of mine asked if I could help estimate the appropriate sample size for a study that plans to use a multi-factorial design to choose among a set of interventions to improve rates of smoking cessation. In the course of exploring this, I realized it would be super helpful if the function suggested by the reader actually existed.Exploring the underlying theory of the chi-square test through simulation - part 2
https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/
Sun, 25 Mar 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/In the last post, I tried to provide a little insight into the chi-square test. In particular, I used simulation to demonstrate the relationship between the Poisson distribution of counts and the chi-squared distribution. The key point in that post was the role conditioning plays in that relationship by reducing variance.
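A rough base R sketch of that point follows; it is not the code from either post, and the number of bins, the Poisson rate, and the number of replications are assumed for illustration. It computes the same Pearson-type statistic twice: once for independent Poisson counts with a known mean, and once for counts whose total is fixed (a multinomial draw). Fixing the total removes one degree of freedom, so the conditional statistic has a smaller mean and variance.

```r
set.seed(2018)     # arbitrary seed, for reproducibility only
k      <- 10       # number of bins (assumed)
lambda <- 50       # expected count per bin (assumed)
reps   <- 10000    # number of simulated data sets (assumed)

# Unconditional: k independent Poisson counts per replication (one column each).
x_pois <- matrix(rpois(k * reps, lambda), nrow = k)
stat_uncond <- colSums((x_pois - lambda)^2 / lambda)

# Conditional: the total is fixed at k * lambda and spread evenly over the bins.
x_mult <- rmultinom(reps, size = k * lambda, prob = rep(1 / k, k))
stat_cond <- colSums((x_mult - lambda)^2 / lambda)

print(c(mean_uncond = mean(stat_uncond), var_uncond = var(stat_uncond),
        mean_cond   = mean(stat_cond),   var_cond   = var(stat_cond)))
```

The unconditional statistic should average close to k and the conditional one close to k - 1, in line with chi-squared distributions on k and k - 1 degrees of freedom.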
To motivate some of the key issues, I talked a bit about recycling. I asked you to imagine a set of bins placed in different locations to collect glass bottles.Exploring the underlying theory of the chi-square test through simulation - part 1
https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/
Sun, 18 Mar 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/Kids today are so sophisticated (at least they are in New York City, where I live). While I didn’t hear about the chi-square test of independence until my first stint in graduate school, they’re already talking about it in high school. When my kids came home and started talking about it, I did what I usually do when they come home asking about a new statistical concept. I opened up R and started generating some data.Another reason to be careful about what you control for
https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/
Wed, 07 Mar 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/another-reason-to-be-careful-about-what-you-control-for/Modeling data without any underlying causal theory can sometimes lead you down the wrong path, particularly if you are interested in understanding the way things work rather than making predictions. A while back, I described what can go wrong when you control for a mediator when you are interested in an exposure and an outcome. Here, I describe the potential biases that are introduced when you inadvertently control for a variable that turns out to be a collider.“I have to randomize by cluster. Is it OK if I only have 6 sites?”
https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/
Wed, 21 Feb 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/i-have-to-randomize-by-site-is-it-ok-if-i-only-have-6/The answer is probably no, because there is a not-so-low chance (perhaps considerably higher than 5%) you will draw the wrong conclusions from the study. I have heard variations on this question not so infrequently, so I thought it would be useful (of course) to do a few quick simulations to see what happens when we try to conduct a study under these conditions. (Another question I get every so often, after a study has failed to find an effect: “can we get a post-hoc estimate of the power?Have you ever asked yourself, "how should I approach the classic pre-post analysis?"
https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/
Sun, 28 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/thinking-about-the-run-of-the-mill-pre-post-analysis/Well, maybe you haven’t, but this seems to come up all the time. An investigator wants to assess the effect of an intervention on an outcome. Study participants are randomized either to receive the intervention (could be a new drug, new protocol, behavioral intervention, whatever) or treatment as usual. For each participant, the outcome measure is recorded at baseline - this is the pre in pre/post analysis. The intervention is delivered (or not, in the case of the control group), some time passes, and the outcome is measured a second time.Importance sampling adds an interesting twist to Monte Carlo simulation
https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/
Thu, 18 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/importance-sampling-adds-a-little-excitement-to-monte-carlo-simulation/I’m contemplating the idea of teaching a course on simulation next fall, so I have been exploring various topics that I might include. (If anyone has great ideas either because you have taught such a course or taken one, definitely drop me a note.) Monte Carlo (MC) simulation is an obvious one. I like the idea of talking about importance sampling, because it sheds light on the idea that not all MC simulations are created equally.Simulating a cost-effectiveness analysis to highlight new functions for generating correlated data
https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/
Mon, 08 Jan 2018 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-correlated-data-for-a-simulated-cost-effectiveness-analysis/My dissertation work (which I only recently completed - in 2012 - even though I am not exactly young, a whole story on its own) focused on inverse probability weighting methods to estimate a causal cost-effectiveness model. I don’t really do any cost-effectiveness analysis (CEA) anymore, but it came up very recently when some folks in the Netherlands contacted me about using simstudy to generate correlated (and clustered) data to compare different approaches to estimating cost-effectiveness.When there's a fork in the road, take it. Or, taking a look at marginal structural models.
https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/
Mon, 11 Dec 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/when-a-covariate-is-a-confounder-and-a-mediator/I am going to cut right to the chase, since this is the third of three posts related to confounding and weighting, and it’s kind of a long one. (If you want to catch up, the first two are here and here.) My aim with these three posts is to provide a basic explanation of the marginal structural model (MSM) and how we should interpret the estimates. This is obviously a very rich topic with a vast literature, so if you remain interested in the topic, I recommend checking out this (as of yet unpublished) textbook by Hernán & Robins for starters.When you use inverse probability weighting for estimation, what are the weights actually doing?
https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/
Mon, 04 Dec 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/inverse-probability-weighting-when-the-outcome-is-binary/Towards the end of Part 1 of this short series on confounding, IPW, and (hopefully) marginal structural models, I talked a little bit about the fact that inverse probability weighting (IPW) can provide unbiased estimates of marginal causal effects in the context of confounding just as more traditional regression models like OLS can. I used an example based on a normally distributed outcome. Now, that example wasn’t super interesting, because in the case of a linear model with homogeneous treatment effects (i.Characterizing the variance for clustered data that are Gamma distributed
https://www.rdatagen.net/post/icc-for-gamma-distribution/
Mon, 27 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/icc-for-gamma-distribution/Way back when I was studying algebra and wrestling with one word problem after another (I think now they call them story problems), I complained to my father. He laughed and told me to get used to it. “Life is one big word problem,” is how he put it. Well, maybe one could say any statistical analysis is really just some form of multilevel data analysis, whether we treat it that way or not.Visualizing how confounding biases estimates of population-wide (or marginal) average causal effects
https://www.rdatagen.net/post/potential-outcomes-confounding/
Thu, 16 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/potential-outcomes-confounding/When we are trying to assess the effect of an exposure or intervention on an outcome, confounding is an ever-present threat to our ability to draw the proper conclusions. My goal (starting here and continuing in upcoming posts) is to think a bit about how to characterize confounding in a way that makes it possible to literally see why improperly estimating intervention effects might lead to bias.
Confounding, potential outcomes, and causal effects
Typically, we think of a confounder as a factor that influences both exposure and outcome.A simstudy update provides an excuse to generate and display Likert-type data
https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/
Tue, 07 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-and-displaying-likert-type-data/I just updated simstudy to version 0.1.7. It is available on CRAN.
To mark the occasion, I wanted to highlight a new function, genOrdCat, which puts into practice some code that I presented a little while back as part of a discussion of ordinal logistic regression. The new function was motivated by a reader/researcher who came across my blog while wrestling with a simulation study. After a little back and forth about how to generate ordinal categorical data, I ended up with a function that might be useful.Thinking about different ways to analyze sub-groups in an RCT
https://www.rdatagen.net/post/sub-group-analysis-in-rct/
Wed, 01 Nov 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/sub-group-analysis-in-rct/Here’s the scenario: we have an intervention that we think will improve outcomes for a particular population. Furthermore, there are two sub-groups (let’s say defined by which of two medical conditions each person in the population has) and we are interested in knowing if the intervention effect is different for each sub-group.
And here’s the question: what is the ideal way to set up a study so that we can assess (1) the intervention effects on the group as a whole, but also (2) the sub-group specific intervention effects?Who knew likelihood functions could be so pretty?
https://www.rdatagen.net/post/mle-can-be-pretty/
Mon, 23 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/mle-can-be-pretty/I just released a new iteration of simstudy (version 0.1.6), which fixes a bug or two and adds several spline related routines (available on CRAN). The previous post focused on using spline curves to generate data, so I won’t repeat myself here. And, apropos of nothing really - I thought I’d take the opportunity to do a simple simulation to briefly explore the likelihood function. It turns out if we generate lots of them, it can be pretty, and maybe provide a little insight.Can we use B-splines to generate non-linear data?
https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/
Mon, 16 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/generating-non-linear-data-using-b-splines/I’m exploring the idea of adding a function or set of functions to the simstudy package that would make it possible to easily generate non-linear data. One way to do this would be using B-splines. Typically, one uses splines to fit a curve to data, but I thought it might be useful to switch things around a bit to use the underlying splines to generate data. This would facilitate exploring models where we know the assumption of linearity is violated.A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions
https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/
Thu, 05 Oct 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-small-update-to-simstudy-neg-bin/I just updated simstudy to version 0.1.5 (available on CRAN) so that it now includes several new distributions - exponential, discrete uniform, and negative binomial.
As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete (integer) distribution of outcomes of non-negative values that is often used to describe count outcomes. It is characterized by a mean (or rate) and its variance equals its mean.CACE closed: EM opens up exclusion restriction (among other things)
https://www.rdatagen.net/post/em-estimation-of-cace/
Thu, 28 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/em-estimation-of-cace/This is the third, and probably last, of a series of posts touching on the estimation of complier average causal effects (CACE) and latent variable modeling techniques using an expectation-maximization (EM) algorithm. What follows is a simplistic way to implement an EM algorithm in R to do principal strata estimation of CACE.
The EM algorithm
In this approach, we assume that individuals fall into one of three possible groups - never-takers, always-takers, and compliers - but we cannot see who is who (except in a couple of cases).A simstudy update provides an excuse to talk a little bit about latent class regression and the EM algorithm
https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/
Wed, 20 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-provides-an-excuse-to-talk-a-little-bit-about-the-em-algorithm-and-latent-class/I was just going to make a quick announcement to let folks know that I’ve updated the simstudy package to version 0.1.4 (now available on CRAN) to include functions that allow conversion of columns to factors, creation of dummy variables, and most importantly, specification of outcomes that are more flexibly conditional on previously defined variables. But, as I was coming up with an example that might illustrate the added conditional functionality, I found myself playing with package flexmix, which uses an Expectation-Maximization (EM) algorithm to estimate latent classes and fit regression models.Complier average causal effect? Exploring what we learn from an RCT with participants who don't do what they are told
https://www.rdatagen.net/post/cace-explored/
Tue, 12 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/cace-explored/Inspired by a free online course titled Complier Average Causal Effects (CACE) Analysis and taught by Booil Jo and Elizabeth Stuart (through Johns Hopkins University), I’ve decided to explore the topic a little bit. My goal here isn’t to explain CACE analysis in extensive detail (you should definitely go take the course for that), but to describe the problem generally and then (of course) simulate some data. A plot of the simulated data gives a sense of what we are estimating and assuming.Further considerations of a hidden process underlying categorical responses
https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/
Tue, 05 Sep 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/a-hidden-process-part-2-of-2/In my previous post, I described a continuous data generating process that can be used to generate discrete, categorical outcomes. In that post, I focused largely on binary outcomes and simple logistic regression just because things are always easier to follow when there are fewer moving parts. Here, I am going to focus on a situation where we have multiple outcomes, but with a slight twist - these groups of interest can be interpreted in an ordered way.A hidden process behind binary or other categorical outcomes?
https://www.rdatagen.net/post/ordinal-regression/
Mon, 28 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/ordinal-regression/I was thinking a lot about proportional-odds cumulative logit models last fall while designing a study to evaluate an intervention’s effect on meat consumption. After a fairly extensive pilot study, we had determined that participants can have quite a difficult time recalling precise quantities of meat consumption, so we were forced to move to a categorical response. (This was somewhat unfortunate, because we would not have continuous or even count outcomes, and as a result, might not be able to pick up small changes in behavior.Be careful not to control for a post-exposure covariate
https://www.rdatagen.net/post/be-careful/
Mon, 21 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/be-careful/A researcher was presenting an analysis of the impact various types of childhood trauma might have on subsequent substance abuse in adulthood. Obviously, a very interesting and challenging research question. The statistical model included adjustments for several factors that are plausible confounders of the relationship between trauma and substance use, such as childhood poverty. However, the model also included a measurement for poverty in adulthood - believing it was somehow confounding the relationship of trauma and substance use.Should we be concerned about incidence - prevalence bias?
https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/
Wed, 09 Aug 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simulating-incidence-prevalence-bias/Recently, we were planning a study to evaluate the effect of an intervention on outcomes for very sick patients who show up in the emergency department. My collaborator had concerns about a phenomenon that she had observed in other studies that might affect the results - patients measured earlier in the study tend to be sicker than those measured later in the study. This might not be a problem, but in the context of a stepped-wedge study design (see this for a discussion that touches on this type of study design), this could definitely generate biased estimates: when the intervention occurs later in the study (as it does in a stepped-wedge design), the “exposed” and “unexposed” populations could differ, and in turn so could the outcomes.Using simulation for power analysis: an example based on a stepped wedge study design
https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/
Mon, 10 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/using-simulation-for-power-analysis-an-example/Simulation can be super helpful for estimating power or sample size requirements when the study design is complex. This approach has some advantages over an analytic one (i.e. one based on a formula), particularly the flexibility it affords in setting up the specific assumptions in the planned study, such as time trends, patterns of missingness, or effects of different levels of clustering. A downside is certainly the complexity of writing the code as well as the computation time, which can be a bit painful.simstudy update: two new functions that generate correlated observations from non-normal distributions
https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/
Wed, 05 Jul 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/simstudy-update-two-functions-for-correlation/In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy with functions that would make generating these kinds of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, Poisson, gamma, and uniform distributions: genCorGen and addCorGen.Balancing on multiple factors when the sample is too small to stratify
https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/
Mon, 26 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/balancing-when-sample-is-too-small-to-stratify/Ideally, a study that uses randomization provides a balance of characteristics that might be associated with the outcome being studied. This way, we can be more confident that any differences in outcomes between the groups are due to the group assignments and not to differences in characteristics. Unfortunately, randomization does not guarantee balance, especially with smaller sample sizes. If we want to be certain that groups are balanced with respect to a particular characteristic, we need to do something like stratified randomization.Copulas and correlated data generation: getting beyond the normal distribution
https://www.rdatagen.net/post/correlated-data-copula/
Mon, 19 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/correlated-data-copula/Using the simstudy package, it’s possible to generate correlated data from a normal distribution using the function genCorData. I’ve wanted to extend the functionality so that we can generate correlated data from other sorts of distributions; I thought it would be a good idea to begin with binary and Poisson distributed data, since those come up so frequently in my work. simstudy can already accommodate more general correlated data, but only in the context of a random effects data generation process.When marginal and conditional logistic model estimates diverge
https://www.rdatagen.net/post/marginal-v-conditional/
Fri, 09 Jun 2017 00:00:00 +0000keith.goldfeld@nyumc.org (Keith Goldfeld)https://www.rdatagen.net/post/marginal-v-conditional/Say we have an intervention that is assigned at a group or cluster level but the outcome is measured at an individual level (e.g. students in different schools, eyes on different individuals). And, say this outcome is binary; that is, something happens, or it doesn’t. (This is important, because none of this is true if the outcome is continuous and close to normally distributed.) If we want to measure the effect of the intervention - perhaps the risk difference, risk ratio, or odds ratio - it can really matter if we are interested in the marginal effect or the conditional effect, because they likely won’t be the same.
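To make the marginal versus conditional distinction concrete, here is a minimal sketch in plain Python. It is not the code from the post. It assumes a random-intercept logistic model, in which each cluster shifts the log-odds by a normally distributed effect; the function names and the parameter values (`b0`, `b1`, `sd_cluster`) are hypothetical and chosen only for illustration. The conditional odds ratio is `exp(b1)`. The marginal quantities come from averaging the within-cluster probabilities over the distribution of cluster effects.

```python
import math


def expit(x):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_prob(linear_predictor, sd_cluster, n_grid=2001, width=8.0):
    """Average P(y = 1 | u) over cluster effects u ~ N(0, sd_cluster^2).

    Uses a normalized, evenly spaced grid over +/- `width` standard
    deviations as a simple numerical integration (an illustrative choice).
    """
    step = 2.0 * width / (n_grid - 1)
    total = 0.0
    weight_sum = 0.0
    for i in range(n_grid):
        z = -width + i * step            # standardized cluster effect
        w = math.exp(-0.5 * z * z)       # unnormalized normal density
        total += w * expit(linear_predictor + sd_cluster * z)
        weight_sum += w
    return total / weight_sum


def odds(p):
    return p / (1.0 - p)


def effects(b0, b1, sd_cluster):
    """Compare conditional and marginal effects for a binary outcome.

    b0: log-odds for control in a cluster with effect 0 (assumed value)
    b1: conditional log odds ratio for the intervention (assumed value)
    sd_cluster: standard deviation of the cluster effects (assumed value)
    """
    p0 = marginal_prob(b0, sd_cluster)       # population-average risk, control
    p1 = marginal_prob(b0 + b1, sd_cluster)  # population-average risk, treated
    return {
        "conditional_or": math.exp(b1),
        "marginal_or": odds(p1) / odds(p0),
        "risk_difference": p1 - p0,
        "risk_ratio": p1 / p0,
    }


# Hypothetical parameter values, for illustration only.
result = effects(b0=-1.0, b1=1.0, sd_cluster=2.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the marginal odds ratio should come out closer to 1 than the conditional odds ratio whenever `sd_cluster` is greater than zero. Setting `sd_cluster=0` should make the two agree, since without between-cluster variation there is nothing to average over.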