## Defining the data

The key to simulating data in `simstudy`

is the creation of series of data definition tables that look like this:

varname | formula | variance | dist | link |
---|---|---|---|---|

nr | 7 | 0 | nonrandom | identity |

x1 | 10;20 | 0 | uniform | identity |

y1 | nr + x1 * 2 | 8 | normal | identity |

y2 | nr - 0.2 * x1 | 0 | poisson | log |

xCat | 0.3;0.2;0.5 | 0 | categorical | identity |

g1 | 5+xCat | 1 | gamma | log |

a1 | -3 + xCat | 0 | binary | logit |

These *definition* tables can be generated two ways. One option is to to use any external editor that allows the creation of `csv`

files, which can be read in with a call to `defRead`

. An alternative is to make repeated calls to the function `defData`

. Here, we illustrate the R code that builds this definition table internally:

```
def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1",
link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1,
link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat",
link = "logit")
```

The first call to `defData`

without specifying a definition name (in this example the definition name is *def*) creates a **new** data.table with a single row. An additional row is added to the table `def`

each time the function `defData`

is called. Each of these calls is the definition of a new field in the data set that will be generated. In this example, the first data field is named ‘nr’, defined as a constant with a value to be 7. In each call to `defData`

the user defines a variable name, a distribution (the default is ‘normal’), a mean formula (if applicable), a variance parameter (if applicable), and a link function for the mean (defaults to ‘identity’).

The possible distributions include **normal**, **gamma**, **poisson**, **zero-truncated poisson**, **binary**, **uniform**, **categorical**, and **deterministic/non-random**. For all of these distributions, key parameters defining the distribution are entered in the `formula`

, `variance`

, and `link`

fields.

In the case of the **normal** and **gamma** distributions, the formula specifies the mean. The formula can be a scalar value (number) or a string that represents a function of previously defined variables in the data set definition (or, as we will see later, in a previously generated data set). In the example, the mean of `y1`

, a normally distributed value, is declared as a linear function of `nr`

and `x1`

, and the mean of `g1`

is a function of the category defined by `xCat`

. The `variance`

field is defined only for normal and gamma random variables, and can only be defined as a scalar value. In the case of gamma random variables, the value entered in variance field is really a dispersion value \(d\), where the actual variance will be \(d \times mean^2\).

In the case of the **poisson**, **zero-truncated poisson**, and **binary** distributions, the formula also specifies the mean. The variance is not a valid parameter in these cases, but the `link`

field is. The default link is ‘identity’ but a ‘log’ link is available for the poisson distributions and a “logit” link is available for the binary outcomes. In this example, `y2`

is defined as poisson random variable with a mean that is function of `nr`

and `x1`

on the log scale. For binary variables, which take a value of 0 or 1, the formula represents probability (with the ‘identity’ link) or log odds (with the ‘logit’ link) of the variable having a value of 1. In the example, `a1`

has been defined as a binary random variable with a log odds that is a function of `xCat`

.

Variables defined with a **uniform**, **categorical**, or **deterministic/non-random** distribution are specified using the formula only. The `variance`

and `link`

fields are not used in these cases.

For a uniformly distributed variable, The formula is a string with the format “a;b”, where *a* and *b* are scalars or functions of previously defined variables. The uniform distribution has two parameters - the minimum and the maximum. In this case, *a* represents the minimum and *b* represents the maximum.

For a categorical variable with \(k\) categories, the formula is a string of probabilities that sum to 1: “\(p_1 ; p_2 ; ... ; p_k\)”. \(p_1\) is the probability of the random variable falling category 1, \(p_2\) is the probability of category 2, etc. The probabilities can be specified as functions of other variables previously defined. In the example, `xCat`

has three possibilities with probabilities 0.3, 0.2, and 0.5, respectively.

Non-random variables are defined by the formula. Since these variables are deterministic, variance is not relevant. They can be functions of previously defined variables or a scalar, as we see in the sample for variable defined as `nr`

.

## Generating the data

After the data set definitions have been created, a new data set with \(n\) observations can be created with a call to function ** genData**. In this example, 1,000 observations are generated using the data set definitions in

**, and then stored in the object**

`def`

**:**

`dt`

```
dt <- genData(1000, def)
dt
```

```
## idnum nr x1 y1 y2 xCat g1 a1
## 1: 1 7 16.95595 44.33474 39 3 3930.82048 1
## 2: 2 7 16.78838 37.02146 41 3 1015.17080 1
## 3: 3 7 18.76275 41.93826 19 1 866.30426 0
## 4: 4 7 18.23129 44.34943 24 1 1153.82489 0
## 5: 5 7 15.01042 35.21291 52 2 87.33281 1
## ---
## 996: 996 7 18.26391 44.77100 41 3 4026.05132 1
## 997: 997 7 18.20590 37.42132 31 3 4345.89164 1
## 998: 998 7 13.04172 31.67119 85 1 922.29056 0
## 999: 999 7 15.34133 44.17389 49 3 33.91809 1
## 1000: 1000 7 14.09879 32.04555 70 3 445.62692 1
```

New data can be added to an existing data set with a call to function ** addColumns**. The new data definitions are created with a call to

**and then included as an argument in the call to**

`defData`

**:**

`addColumns`

```
addef <- defDataAdd(varname = "zExtra", dist = "normal", formula = "3 + y1",
variance = 2)
dt <- addColumns(addef, dt)
dt
```

```
## idnum nr x1 y1 y2 xCat g1 a1 zExtra
## 1: 1 7 16.95595 44.33474 39 3 3930.82048 1 49.29833
## 2: 2 7 16.78838 37.02146 41 3 1015.17080 1 40.53881
## 3: 3 7 18.76275 41.93826 19 1 866.30426 0 47.78248
## 4: 4 7 18.23129 44.34943 24 1 1153.82489 0 47.95174
## 5: 5 7 15.01042 35.21291 52 2 87.33281 1 38.83490
## ---
## 996: 996 7 18.26391 44.77100 41 3 4026.05132 1 48.05672
## 997: 997 7 18.20590 37.42132 31 3 4345.89164 1 42.41864
## 998: 998 7 13.04172 31.67119 85 1 922.29056 0 35.34204
## 999: 999 7 15.34133 44.17389 49 3 33.91809 1 46.29664
## 1000: 1000 7 14.09879 32.04555 70 3 445.62692 1 32.31429
```