Really, the only problem with the `simstudy`

package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 - see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as you are willing to provide some extra code.

I am going to illustrate this with two examples, first by implementing a truncated normal distribution, and second by implementing the flexible non-linear data generating algorithm that I described last time.

Before we get going, here are the necessary libraries:

```
library(simstudy)
library(data.table)
library(msm)
library(ggplot2)
library(mgcv)
```

#### General concept

In the data definition step, it is possible to specify any valid `R`

function in the *formula* argument. If *dist* is specified as “nonrandom”, then `simstudy`

will generate data based on that function. (Yes, the specification as “nonrandom” is a bit awkward since we are defining a stochastic data generating process in this case; in future versions I plan to allow *dist* to be specified as “custom” to make this less dissonant.)

In this example, I want to be able to generate data from a truncated normal distribution. There is an existing function `rtnorm`

in the `msm`

package that I can take advantage here. What I have done is essentially create a wrapper function that makes a single draw from the truncated distribution with a specified mean, standard deviation, and pair of truncation bounds:

```
trunc_norm <- function(mean, sd, lower, upper) {
rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper)
}
```

Now that `trunc_norm`

has been created, I am free to use this is in a data definition statement. And even more important, the call to `trunc_norm`

can depend on other variables; in this case, I have created binary variable *x* that will determine the upper and lower bounds of the distribution. When \(x=0\), the \(N(0, 3.5^2)\) distribution is truncated at -5 and 5, and when \(x=1\), the distribution is truncated at -8 and 8.

```
defI <- defData(varname = "x", formula = 0.5, dist = "binary")
defI <- defData(defI, varname = "y",
formula = "trunc_norm(mean = 0, sd = 3.5,
lower = -5 + -3*x, upper = 5 + 3*x)",
dist = "nonrandom")
```

The generated data appear to have the properties that we would expect:

`dd <- genData(1000, defI)`

#### Application to non-linear data generation

Last time, I described an approach to generate a variable \(y\) that has a non-linear response with respect to an input variable \(x\). At the end of that post, I created two functions, one of which can be referred to in the `defData`

statement to generate the data. (I plan on implementing these functions in `simstudy`

, but I was eager to get the concept out there in case any one has some suggestions or could use this feature right away.)

In the first step, I need to generate a smooth function by specifying a few points. I do this by calling `getNLfunction`

. (If you want the code for this, let me know, but I actually provided most of it last week.) The variable `nlf`

is an object that contains the function:

```
dpoints <- data.table(x = c(20, 30, 53, 65, 80), y = c(15, 44, 60, 55, 35))
nlf <- getNLfunction(dpoints)
```

The function `genNL`

makes predictions based on the `nlf`

object and adds a little Gaussian noise. We use the same approach as we did above for the truncated normal to generate different responses \(y\) based on the level of \(x\):

```
def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y",
formula = "genNL(nf = ..nlf, x, sd = 10)", dist = "nonrandom")
dd <- genData(300, def)
```

And if we introduce much less noise, we get much closer to the original underlying function specified by our points:

```
def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y",
formula = "genNL(nf = ..nlf, x, sd = 0.5)", dist = "nonrandom")
dd <- genData(300, def)
```