One of my goals for the `simstudy`

package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, `genDataDist`

and `addDataDist`

, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download latest development version.) This post provides a simple illustration of the new functionality.

Here are the libraries needed, in case you want to follow along:

```
library(simstudy)
library(data.table)
library(ggplot2)
set.seed(1234)
```

The target density is simply defined by specifying a vector that is intended to loosely represent a data distribution. We start by specifying the vector (which can be of any length):

```
base_data <-
c(1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10)
```

We can look at the density to make sure this is the distribution we are interested in drawing our data from:

```
emp_density <- density(base_data, n = 10000)
den_curve <- data.table(x = emp_density$x, y = emp_density$y)
ggplot(data = den_curve, aes(x = x, y = y)) +
geom_line(linewidth = 1) +
scale_y_continuous(name = "density\n", limits = c(0, 0.11),
breaks = seq(0, .10, by = .02)) +
scale_color_manual(values = colors) +
theme(panel.grid = element_blank())
```

Actually drawing samples from this distribution is a simple call to `genDataDensity`

. The key argument is the data distribution as as represented by the vector of integers:

`dx <- genDataDensity(10000, dataDist = base_data, varname = "x1")`

Here’s a look at the sampled data and their relationship to the target density:

```
ggplot(data = dx, aes(x=x1)) +
geom_histogram(aes(y = after_stat(count / sum(count))),
binwidth = 1, fill = "grey", color = "black", alpha = .2) +
geom_line(data = den_curve, aes(x = x, y = y),
color = "black", linewidth = 2) +
scale_y_continuous(name = "density\n", limits = c(0, 0.11),
breaks = seq(0, .10, by = .02)) +
scale_x_continuous(limits = c(-6, 15), breaks = seq(-5, 10, by = 5)) +
theme(panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 10))
```

Just to show that this was not a fluke, here are three additional target distributions, specified with three different vectors:

```
base_data <- list(
c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10),
c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10),
c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10)
)
```

We can generate data from each of the distributions and then confirm that each one adequately fits its target distribution:

```
dx1 <- genDataDensity(10000, dataDist = base_data[[1]], varname = "x1")
dx2 <- genDataDensity(10000, dataDist = base_data[[2]], varname = "x1")
dx3 <- genDataDensity(10000, dataDist = base_data[[3]], varname = "x1")
```

### Addendum: code to generate multiple distribution plot

Here is a little more detail in case someone might find it useful to have the code that generates the “facet” plot. In the plot with the single distribution, I specified the histogram with this command:

`geom_histogram(aes(y = after_stat(count / sum(count))), ...)`

When I tried to apply this to the “facet” plot, the denominator of that plot (`sum(count)`

) was not calculated for each subgroup (i.e., dataset), but was the total across all datasets. As a result, the dataset-specific proportions were underestimated; we can see that here:

```
dx <- rbindlist(list(dx1, dx2, dx3), idcol = TRUE)
ggplot(data = dx, aes(x=x1)) +
geom_histogram(
aes(y = after_stat(count / sum(count)), fill = factor(.id), color = factor(.id)),
binwidth = 1, alpha = .2) +
geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) +
xlab("\nx1") +
ylab("density\n") +
scale_fill_manual(values = colors) +
scale_color_manual(values = colors) +
theme(panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 10),
legend.position = "none") +
facet_grid(~ .id)
```

I looked around for a way to address this, but couldn’t find anything that obviously addressed this shortcoming (though I am convinced it must be possible, and I just couldn’t locate the solution). I considered using `ggarrange`

or something similar, but was not satisfied with the results. Instead, it turned out to be faster just to calculate the proportions myself. This is the process I used:

First, I created a dataset with the bins (using a bin size of 1):

```
cuts <- seq(dx[,floor(min(x1))], dx[,ceiling(max(x1))], by = 1)
dcuts <- data.table(bin = 1:length(cuts), binlab = cuts)
dcuts
```

```
## bin binlab
## <int> <num>
## 1: 1 -3
## 2: 2 -2
## 3: 3 -1
## 4: 4 0
## 5: 5 1
## 6: 6 2
## 7: 7 3
## 8: 8 4
## 9: 9 5
## 10: 10 6
## 11: 11 7
## 12: 12 8
## 13: 13 9
## 14: 14 10
## 15: 15 11
## 16: 16 12
## 17: 17 13
## 18: 18 14
```

Then, I allocated each observation to a bin using the `cut`

function:

```
dx[, bin := cut(x1, breaks = cuts, labels = FALSE)]
dx <- merge(dx, dcuts, by = "bin")
dx
```

```
## Key: <bin>
## bin .id id x1 binlab
## <int> <int> <int> <num> <num>
## 1: 1 1 1251 -2.097413 -3
## 2: 1 1 2215 -2.580587 -3
## 3: 1 1 2404 -2.042049 -3
## 4: 1 1 3228 -2.078958 -3
## 5: 1 1 5039 -2.055471 -3
## ---
## 29996: 17 3 7690 13.290347 13
## 29997: 17 3 8360 13.083991 13
## 29998: 17 3 8860 13.149421 13
## 29999: 17 3 9214 13.043727 13
## 30000: 17 3 9743 13.199752 13
```

Finally, I calculated the distribution-specific proportions (showing only the second distribution):

```
dp <- dx[, .N, keyby = .(.id, binlab)]
dp[, p := N/sum(N), keyby = .id]
dp[.id == 2]
```

```
## Key: <.id>
## .id binlab N p
## <int> <num> <int> <num>
## 1: 2 -3 3 0.0003
## 2: 2 -2 38 0.0038
## 3: 2 -1 130 0.0130
## 4: 2 0 340 0.0340
## 5: 2 1 619 0.0619
## 6: 2 2 938 0.0938
## 7: 2 3 1161 0.1161
## 8: 2 4 1155 0.1155
## 9: 2 5 1035 0.1035
## 10: 2 6 882 0.0882
## 11: 2 7 828 0.0828
## 12: 2 8 861 0.0861
## 13: 2 9 822 0.0822
## 14: 2 10 641 0.0641
## 15: 2 11 384 0.0384
## 16: 2 12 140 0.0140
## 17: 2 13 23 0.0023
```

And now the facet plot will work just fine. Here is the code and the plot (again).

```
ggplot(data = dp, aes(x = binlab, y = p)) +
geom_bar(aes(fill = factor(.id), color = factor(.id)), stat = "identity", alpha = .4) +
geom_line(data = dens, aes(x = x, y = y, color = factor(.id)),
linewidth = 2) +
xlab("\nx1") +
ylab("density\n") +
scale_fill_manual(values = colors) +
scale_color_manual(values = colors) +
theme(panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 10),
legend.position = "none") +
facet_grid(~ .id)
```

### Addendum follow-up

Well, that was quick. Andrea provided code on *Disqus* - which for some reason is no longer publishing on my site, and if anyone has thoughts about that issue, feel free to contact me :) - that does exactly what I was trying to do without any pre-plotting data transformations. The trick is to use the *density* stat available in `geom_histogram`

, This actually looks better, because it lines up more precisely with the density curve.

```
ggplot(data = dx, aes(x=x1)) +
geom_histogram(
aes(y = after_stat(density), fill = factor(.id), color = factor(.id)),
binwidth = 1, alpha = .2) +
geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) +
xlab("\nx1") +
ylab("density\n") +
scale_fill_manual(values = colors) +
scale_color_manual(values = colors) +
theme(panel.grid = element_blank(),
plot.title = element_text(face = "bold", size = 10),
legend.position = "none") +
facet_grid(~ .id)
```