The most traditional approach for analyzing binary outcome data is logistic regression, where the estimated parameters are interpreted as log odds ratios or, if exponentiated, as odds ratios (ORs). No one other than statisticians (and maybe not even statisticians) finds the odds ratio to be a very intuitive statistic, and many feel that a risk difference or a risk ratio (RR, also called relative risk) is much more interpretable. Indeed, there seems to be a strong belief that readers will, more often than not, interpret odds ratios as risk ratios. This turns out to be reasonable when an event is rare, because the two measures are then nearly equal. However, when the event is more prevalent, the odds ratio will diverge from the risk ratio. (Here is a paper that discusses some of these issues in greater depth, in case you came here looking for more.)
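A short derivation may help show why the odds ratio tracks the risk ratio only for rare events. The notation is an assumption made here for illustration: $p_0$ is the risk in the comparison group and $p_1$ is the risk in the exposed group.

```latex
\[
\mathrm{RR} = \frac{p_1}{p_0},
\qquad
\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
            = \mathrm{RR} \times \frac{1-p_0}{1-p_1}
\]
```

When both risks are small, the factor $(1-p_0)/(1-p_1)$ is close to 1. For example, with $p_0 = 0.01$ and $p_1 = 0.02$, the RR is 2 and the OR is about 2.02. With $p_0 = 0.30$ and $p_1 = 0.60$, the RR is still 2, but the OR is 3.5.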
simstudy: another way to generate data from a non-standard density
One of my goals for the simstudy package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDist and addDataDist, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download the latest development version.) This post provides a simple illustration of the new functionality.
simstudy 0.8.0: customized distributions
Over the past few years, a number of folks have asked if simstudy accommodates customized distributions. There’s been interest in truncated, zero-inflated, or even more standard distributions that haven’t been implemented in simstudy. While I’ve come up with approaches for some of the specific cases, I was never able to develop a general solution that could provide broader flexibility.
This shortcoming changes with the latest version of simstudy, now available on CRAN. Custom distributions can now be specified in defData and defDataAdd by setting the argument dist to “custom”. To introduce the new option, I am providing a couple of examples.
Finding logistic models to generate data with desired risk ratio, risk difference and AUC profiles
About two years ago, someone inquired whether simstudy had the functionality to generate data from a logistic model with a specific AUC. It did not, but now it does, thanks to a paper by Peter Austin that describes a nice algorithm to accomplish this. The paper actually describes a series of related algorithms for generating coefficients that target specific prevalence rates, risk ratios, and risk differences, in addition to the AUC. simstudy has a new function logisticCoefs that implements all of these. (The Austin paper also describes an additional algorithm focused on survival outcome data and hazard ratios, but that has not been implemented in simstudy.) This post describes the new function and provides some simple examples.
simstudy 0.6.0 released: more flexible correlation patterns
The new version (0.6.0) of simstudy is available for download from CRAN. In addition to some important bug fixes, I’ve added new functionality that should make data generation with correlated data a little more flexible. In the previous post, I described enhancements to the function genCorMat. As part of this release announcement, I’m describing blockExchangeMat and blockDecayMat, two new functions that can be used to generate correlation matrices when there is a temporal element to the data generation.
Flexible correlation generation: an update to genCorMat in simstudy
I’ve been slowly working on some updates to simstudy, focusing mostly on the functionality to generate correlation matrices (which can be used to simulate correlated data). Here, I’m briefly describing the function genCorMat, which has been updated to facilitate the generation of correlation matrices for clusters of different sizes and with potentially different correlation coefficients.
I’ll briefly describe what the existing function can currently do, and then give an idea about what the enhancements will provide.
simstudy updated to version 0.5.0
A new version of simstudy is available on CRAN. There are two major enhancements and several new features. In the “major” category, I would include (1) changes to survival data generation that accommodate hazard ratios that can change over time, as well as competing risks, and (2) the addition of functions to allow users to sample from existing data sets with replacement to generate “synthetic” data with real-life distribution properties. Other less monumental, but important, changes were made: updates to functions genFormula and genMarkov, and two added utility functions, survGetParams and survParamPlot. (I did describe the survival data generation functions in two recent posts, here and here.)
simstudy update: ordinal data generation that violates proportionality
Version 0.4.0 of simstudy is now available on CRAN and GitHub. This update includes two enhancements (and at least one major bug fix). genOrdCat now includes an argument to generate ordinal data without an assumption of cumulative proportional odds. And two new functions, defRepeat and defRepeatAdd, make it a bit easier to define multiple variables that share the same distribution assumptions.
Ordinal data
In simstudy, it is relatively easy to specify multinomial distributions that characterize categorical data. Order becomes relevant when the categories take on meanings related to strength of opinion or agreement (as in a Likert-type response) or frequency. A motivating example could be when a response variable takes on four possible values: (1) strongly disagree, (2) disagree, (4) agree, (5) strongly agree. There is a natural order to the response possibilities.
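To make the "cumulative proportional odds" assumption concrete, here is a minimal Python sketch that is independent of simstudy and does not reproduce its implementation. The function names (shift_probs, cumulative_odds_ratios), the baseline probabilities, and the shift value are all hypothetical choices made for illustration. The sketch shifts every cumulative logit by the same constant, so the cumulative odds ratio between the two sets of probabilities is identical at each threshold; data that violate proportionality would show different ratios across thresholds.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-x))


def shift_probs(base_probs, shift):
    """Add the same shift to every cumulative logit and return
    the resulting category probabilities (hypothetical helper)."""
    cum = 0.0
    cum_probs = []
    for p in base_probs[:-1]:
        cum += p
        cum_probs.append(expit(logit(cum) + shift))
    cum_probs.append(1.0)  # the last cumulative probability is always 1

    probs = []
    prev = 0.0
    for c in cum_probs:
        probs.append(c - prev)
        prev = c
    return probs


def cumulative_odds_ratios(probs_a, probs_b):
    """Odds ratio of P(Y <= k) in b versus a, at each threshold k."""
    ratios = []
    cum_a = cum_b = 0.0
    for pa, pb in zip(probs_a[:-1], probs_b[:-1]):
        cum_a += pa
        cum_b += pb
        ratios.append((cum_b / (1 - cum_b)) / (cum_a / (1 - cum_a)))
    return ratios


# Assumed baseline probabilities for four ordered response categories
base = [0.1, 0.2, 0.3, 0.4]
shifted = shift_probs(base, shift=0.7)
ratios = cumulative_odds_ratios(base, shifted)

print([round(p, 3) for p in shifted])
print([round(r, 3) for r in ratios])  # the same value at every threshold
```

Each ratio equals exp(0.7), the exponentiated shift, which is exactly what proportionality means on the cumulative odds scale.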
simstudy update: adding flexibility to data generation
A new version of simstudy (0.3.0) is now available on CRAN and on the package website. Along with some less exciting bug fixes, we have added capabilities to a few existing features: double-dot variable reference, treatment assignment, and categorical data definition. These simple additions should make the data generation process a little smoother and more flexible.
Using non-scalar double-dot variable reference
Double-dot notation was introduced in the last version of simstudy to allow data definitions to be more dynamic. Previously, a double-dot variable could only be a scalar value; with the current version, double-dot notation is also array-friendly.