# Flexible simulation in simstudy with customized distribution functions

Really, the only problem with the simstudy package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 - see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as you are willing to provide some extra code.

I am going to illustrate this with two examples, first by implementing a truncated normal distribution, and second by implementing the flexible non-linear data generating algorithm that I described last time.

Before we get going, here are the necessary libraries:

library(simstudy)
library(data.table)
library(msm)
library(ggplot2)
library(mgcv)

#### General concept

In the data definition step, it is possible to specify any valid R function in the formula argument. If dist is specified as “nonrandom”, then simstudy will generate data based on that function. (Yes, the specification as “nonrandom” is a bit awkward since we are defining a stochastic data generating process in this case; in future versions I plan to allow dist to be specified as “custom” to make this less dissonant.)

In this example, I want to be able to generate data from a truncated normal distribution. There is an existing function rtnorm in the msm package that I can take advantage here. What I have done is essentially create a wrapper function that makes a single draw from the truncated distribution with a specified mean, standard deviation, and pair of truncation bounds:

trunc_norm <- function(mean, sd, lower, upper) {
rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper)
}

Now that trunc_norm has been created, I am free to use this is in a data definition statement. And even more important, the call to trunc_norm can depend on other variables; in this case, I have created binary variable x that will determine the upper and lower bounds of the distribution. When $$x=0$$, the $$N(0, 3.5^2)$$ distribution is truncated at -5 and 5, and when $$x=1$$, the distribution is truncated at -8 and 8.

defI <- defData(varname = "x", formula = 0.5, dist = "binary")
defI <- defData(defI, varname = "y",
formula = "trunc_norm(mean = 0, sd = 3.5,
lower = -5 + -3*x, upper = 5 + 3*x)",
dist = "nonrandom")

The generated data appear to have the properties that we would expect:

dd <- genData(1000, defI)

#### Application to non-linear data generation

Last time, I described an approach to generate a variable $$y$$ that has a non-linear response with respect to an input variable $$x$$. At the end of that post, I created two functions, one of which can be referred to in the defData statement to generate the data. (I plan on implementing these functions in simstudy, but I was eager to get the concept out there in case any one has some suggestions or could use this feature right away.)

In the first step, I need to generate a smooth function by specifying a few points. I do this by calling getNLfunction. (If you want the code for this, let me know, but I actually provided most of it last week.) The variable nlf is an object that contains the function:

dpoints <- data.table(x = c(20, 30, 53, 65, 80), y = c(15, 44, 60, 55, 35))
nlf <- getNLfunction(dpoints)

The function genNL makes predictions based on the nlf object and adds a little Gaussian noise. We use the same approach as we did above for the truncated normal to generate different responses $$y$$ based on the level of $$x$$:

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y",
formula = "genNL(nf = ..nlf, x, sd = 10)", dist = "nonrandom")

dd <- genData(300, def)

And if we introduce much less noise, we get much closer to the original underlying function specified by our points:

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y",
formula = "genNL(nf = ..nlf, x, sd = 0.5)", dist = "nonrandom")

dd <- genData(300, def)