Defining the data

The key to simulating data in simstudy is the creation of a series of data definition tables that look like this:

varname  formula        variance  dist         link
nr       7              0         nonrandom    identity
x1       10;20          0         uniform      identity
y1       nr + x1 * 2    8         normal       identity
y2       nr - 0.2 * x1  0         poisson      log
xCat     0.3;0.2;0.5    0         categorical  identity
g1       5+xCat         1         gamma        log
a1       -3 + xCat      0         binary       logit

These definition tables can be generated in two ways. One option is to use any external editor that allows the creation of csv files, which can then be read in with a call to defRead. The alternative is to make repeated calls to the function defData. Here, we illustrate the R code that builds this definition table internally:

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", 
    link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, 
    link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", 
    link = "logit")

The first call to defData, made without specifying a definition name (in this example the definition name is def), creates a new data.table with a single row. An additional row is added to the table def each time the function defData is called. Each of these calls defines a new field in the data set that will be generated. In this example, the first data field is named ‘nr’ and is defined as a constant with a value of 7. In each call to defData the user defines a variable name, a distribution (the default is ‘normal’), a mean formula (if applicable), a variance parameter (if applicable), and a link function for the mean (the default is ‘identity’).

The possible distributions include normal, gamma, poisson, zero-truncated poisson, binary, uniform, categorical, and deterministic/non-random. For all of these distributions, key parameters defining the distribution are entered in the formula, variance, and link fields.

In the case of the normal and gamma distributions, the formula specifies the mean. The formula can be a scalar value (number) or a string that represents a function of previously defined variables in the data set definition (or, as we will see later, in a previously generated data set). In the example, the mean of y1, a normally distributed value, is declared as a linear function of nr and x1, and the mean of g1 is a function of the category defined by xCat. The variance field is defined only for normal and gamma random variables, and can only be defined as a scalar value. In the case of gamma random variables, the value entered in the variance field is really a dispersion value \(d\), where the actual variance will be \(d \times mean^2\).
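To make the dispersion parameterization concrete, the following base-R sketch draws gamma values with a given mean and dispersion \(d\) by converting them to the shape and rate arguments of rgamma. The conversion (shape = 1/d, rate = 1/(d × mean)) follows from the stated variance \(d \times mean^2\). Whether simstudy performs exactly this conversion internally is an assumption here, and the names mu, d, shape, rate, and draws are illustrative only.

```r
# Sketch: gamma draws with mean `mu` and dispersion `d`, so that
# variance = d * mu^2. Base R only; not simstudy's internal code.
set.seed(1)

mu <- exp(5 + 2)   # mean of g1 when xCat = 2, using the log link
d  <- 1            # value entered in the variance field

shape <- 1 / d          # gamma shape implied by the dispersion
rate  <- 1 / (d * mu)   # gamma rate, chosen so that shape / rate == mu

draws <- rgamma(100000, shape = shape, rate = rate)

mean(draws)   # should be close to mu
var(draws)    # should be close to d * mu^2
```

With a dispersion of 1 the standard deviation equals the mean, which is consistent with the very wide spread of g1 values in the generated data.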

In the case of the poisson, zero-truncated poisson, and binary distributions, the formula also specifies the mean. The variance is not a valid parameter in these cases, but the link field is. The default link is ‘identity’, but a ‘log’ link is available for the poisson distributions and a ‘logit’ link is available for binary outcomes. In this example, y2 is defined as a poisson random variable with a mean that is a function of nr and x1 on the log scale. For binary variables, which take a value of 0 or 1, the formula represents the probability (with the ‘identity’ link) or the log odds (with the ‘logit’ link) of the variable having a value of 1. In the example, a1 has been defined as a binary random variable with log odds that are a function of xCat.
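The effect of the link functions on the example definitions can be worked out by hand. The base-R sketch below applies the inverse of each link to the formulas for y2 and a1. It computes means and probabilities only and does not simulate data; the object names y2_mean and a1_prob are illustrative.

```r
# Sketch: what the link functions imply for the example definitions.
# Base R only; the results are means/probabilities, not simulated data.
nr <- 7
x1 <- c(10, 20)                  # endpoints of the uniform range of x1

# y2: poisson with a log link, so mean = exp(formula)
y2_mean <- exp(nr - 0.2 * x1)    # means at x1 = 10 and x1 = 20

# a1: binary with a logit link, so P(a1 = 1) = inverse logit of formula
xCat <- 1:3
a1_prob <- plogis(-3 + xCat)     # probabilities for xCat = 1, 2, 3

y2_mean
a1_prob
```

The mean of y2 therefore ranges from about 148 (at x1 = 10) down to about 20 (at x1 = 20), and the probability that a1 equals 1 rises with xCat, reaching 0.5 when xCat is 3.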

Variables defined with a uniform, categorical, or deterministic/non-random distribution are specified using the formula only. The variance and link fields are not used in these cases.

For a uniformly distributed variable, the formula is a string with the format “a;b”, where a and b are scalars or functions of previously defined variables. The uniform distribution has two parameters, the minimum and the maximum; here, a represents the minimum and b represents the maximum.
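As a rough illustration of the “a;b” format, the base-R sketch below draws from a uniform distribution first with scalar bounds, as in the definition of x1, and then with bounds that depend on a previously generated variable. It is not simstudy's internal code, and the variable x2 and its bounds are hypothetical.

```r
# Sketch: uniform draws on [a, b] in base R (not simstudy's internal code).
set.seed(1)
n <- 1000

# Scalar bounds, corresponding to the formula "10;20"
x1 <- runif(n, min = 10, max = 20)

# Bounds that vary by row; `x2` and its bounds are hypothetical
x2 <- runif(n, min = x1, max = x1 + 5)

range(x1)
```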

For a categorical variable with \(k\) categories, the formula is a string of probabilities that sum to 1: “\(p_1 ; p_2 ; ... ; p_k\)”. \(p_1\) is the probability of the random variable falling in category 1, \(p_2\) is the probability of category 2, and so on. The probabilities can be specified as functions of other previously defined variables. In the example, xCat has three possible values with probabilities 0.3, 0.2, and 0.5, respectively.
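The categorical definition can be mimicked in base R with sample, which gives a quick way to check that a probability string behaves as intended. This is a sketch of the idea rather than simstudy's implementation, and the names probs and props are illustrative.

```r
# Sketch: categorical draws with probabilities 0.3, 0.2, 0.5 in base R
# (an illustration of the idea, not simstudy's internal code).
set.seed(1)

probs <- c(0.3, 0.2, 0.5)
stopifnot(isTRUE(all.equal(sum(probs), 1)))   # probabilities must sum to 1

# Categories are labelled 1, 2, ..., k in the order the probabilities appear
xCat <- sample(seq_along(probs), size = 100000, replace = TRUE, prob = probs)

props <- prop.table(table(xCat))   # observed share of each category
props
```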

Non-random variables are defined by the formula alone. Since these variables are deterministic, the variance is not relevant. They can be functions of previously defined variables or a scalar, as we see in the example for the variable nr.

Generating the data

After the data set definitions have been created, a new data set with \(n\) observations can be created with a call to function genData. In this example, 1,000 observations are generated using the data set definitions in def, and then stored in the object dt:

dt <- genData(1000, def)
dt
##       idnum nr       x1       y1 y2 xCat         g1 a1
##    1:     1  7 16.95595 44.33474 39    3 3930.82048  1
##    2:     2  7 16.78838 37.02146 41    3 1015.17080  1
##    3:     3  7 18.76275 41.93826 19    1  866.30426  0
##    4:     4  7 18.23129 44.34943 24    1 1153.82489  0
##    5:     5  7 15.01042 35.21291 52    2   87.33281  1
##   ---                                                 
##  996:   996  7 18.26391 44.77100 41    3 4026.05132  1
##  997:   997  7 18.20590 37.42132 31    3 4345.89164  1
##  998:   998  7 13.04172 31.67119 85    1  922.29056  0
##  999:   999  7 15.34133 44.17389 49    3   33.91809  1
## 1000:  1000  7 14.09879 32.04555 70    3  445.62692  1

New data can be added to an existing data set with a call to the function addColumns. The new data definitions are created with a call to defDataAdd and then included as an argument in the call to addColumns:

addef <- defDataAdd(varname = "zExtra", dist = "normal", formula = "3 + y1", 
    variance = 2)

dt <- addColumns(addef, dt)
dt
##       idnum nr       x1       y1 y2 xCat         g1 a1   zExtra
##    1:     1  7 16.95595 44.33474 39    3 3930.82048  1 49.29833
##    2:     2  7 16.78838 37.02146 41    3 1015.17080  1 40.53881
##    3:     3  7 18.76275 41.93826 19    1  866.30426  0 47.78248
##    4:     4  7 18.23129 44.34943 24    1 1153.82489  0 47.95174
##    5:     5  7 15.01042 35.21291 52    2   87.33281  1 38.83490
##   ---                                                          
##  996:   996  7 18.26391 44.77100 41    3 4026.05132  1 48.05672
##  997:   997  7 18.20590 37.42132 31    3 4345.89164  1 42.41864
##  998:   998  7 13.04172 31.67119 85    1  922.29056  0 35.34204
##  999:   999  7 15.34133 44.17389 49    3   33.91809  1 46.29664
## 1000:  1000  7 14.09879 32.04555 70    3  445.62692  1 32.31429