# Generating binary data by specifying the relative risk, with simulations

The most traditional approach for analyzing binary outcome data is logistic regression, where the estimated parameters are interpreted as log odds ratios or, if exponentiated, as odds ratios (ORs). No one other than statisticians (and maybe not even statisticians) finds the odds ratio to be a very intuitive statistic, and many feel that a risk difference or risk ratio/relative risks (RRs) are much more interpretable. Indeed, there seems to be a strong belief that readers will, more often than not, interpret odds ratios as risk ratios. This turns out to be reasonable when an event is rare. However, when the event is more prevalent, the odds ratio will diverge from the risk ratio. (Here is a paper that discusses some of these issues in greater depth, in case you came here looking for more.)

I was playing around with ORs and RRs using simstudy and realized that up until now, one could not specify a binary data generating process using an assumption about the underlying RR. (Well, you actually could, but it required the extra step of explicitly creating probability parameters using a RR assumption.) I’ve rectified that in the latest development version, by including a “log” link option for the binary distribution (and for binomial data generation more broadly). Here’s some simulation code to show this in action.

### Simulation

In this data set, the treatment indicator is $$A$$. The control group ($$A = 0$$) will have a $$20\%$$ underlying probability of an outcome. The risk ratio is 1.8, so that the underlying probability of an outcome in the treatment group ($$A = 1$$) is $$1.8 \times 0.20 = 36\%$$.

library(simstudy)
library(data.table)

def <-
defData(varname = "A", formula = "1;1", dist = "trtAssign") |>
defData(
varname = "y",
formula = "log(0.2) + A * log(1.8)",
dist = "binary",
)

The data generation bears this out:

set.seed(123)
dd <- genData(1000, def)
## Key: <id>
##          id     A     y
##       <int> <int> <int>
##    1:     1     0     0
##    2:     2     0     0
##    3:     3     0     0
##    4:     4     1     0
##    5:     5     0     0
##   ---
##  996:   996     0     1
##  997:   997     0     0
##  998:   998     0     0
##  999:   999     1     0
## 1000:  1000     0     0
dd[, .(obs_p = mean(y)), keyby = A]
## Key: <A>
##        A obs_p
##    <int> <num>
## 1:     0 0.210
## 2:     1 0.366

### Added bonus: estimating the RR using regression

Under the traditional approach, we might estimate a logistic regression model:

summary(glm(y ~ A, family = binomial, data = dd))
##
## Call:
## glm(formula = y ~ A, family = binomial, data = dd)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.3249     0.1098 -12.067  < 2e-16 ***
## A             0.7755     0.1438   5.393 6.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1200.7  on 999  degrees of freedom
## Residual deviance: 1170.7  on 998  degrees of freedom
## AIC: 1174.7
##
## Number of Fisher Scoring iterations: 4

The parameter of interest is $$e^{0.7755} = 2.1716$$, which is a bit higher than the $$RR = 1.8$$ used to generate the data. The reason, of course, is that it is the OR.

We can estimate the RR by using a log-binomial regression model. Like logistic regression, this is implemented using the glm function with the binomial “family”, but with a “log” link instead of a “logistic” link:

fit_logbin = glm(y ~ A, family = binomial(link="log"), data = dd)
summary(fit_logbin)
##
## Call:
## glm(formula = y ~ A, family = binomial(link = "log"), data = dd)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.56065    0.08674  -17.99  < 2e-16 ***
## A            0.55553    0.10483    5.30 1.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1200.7  on 999  degrees of freedom
## Residual deviance: 1170.7  on 998  degrees of freedom
## AIC: 1174.7
##
## Number of Fisher Scoring iterations: 6

Here, the estimated parameter is the log(RR), so the estimated $$RR =e^{0.5555} = 1.7428$$, much closer to the true value.

The paper I’ve referenced suggests that binomial regression with a log link does not always reliably converge, so they suggest a modified Poisson regression as a better approach. The modification arises because the standard errors estimated from the Poisson model are too conservative (i.e. too high). We know that the binary data (Bernoulli distribution) variation is proportional to $$p(1-p)$$, whereas the variance for the Poisson model is proportional to $$p$$. Since $$0 < p < 1$$, $$p > p(1-p)$$. This makes sense as the binary outcome data is limited to $$0$$ and $$1$$, but the Poisson data can include values greater than $$1$$.

We can see this by generating data from each distribution:

rb <- rbinom(10000, 1, 0.20)
rp <- rpois(10000, 0.20)
##      dist   avg   var   min   max
##    <char> <num> <num> <num> <int>
## 1:  binom   0.2  0.16     0     1
## 2:   pois   0.2  0.19     0     3

Now, let’s estimate a Poisson regression model. Since “log” is the default link for Poisson regression for function glm, we should get a risk ratio estimate similar to the log-binomial regression above, and in fact we do:

fit_pois <- glm(y ~  A, family = poisson, data = dd)
summary(fit_pois)
##
## Call:
## glm(formula = y ~ A, family = poisson, data = dd)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.56065    0.09758 -15.994  < 2e-16 ***
## A            0.55553    0.12242   4.538 5.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 717.00  on 999  degrees of freedom
## Residual deviance: 695.61  on 998  degrees of freedom
## AIC: 1275.6
##
## Number of Fisher Scoring iterations: 5

The only issue is that we see that the standard error is inflated somewhat, the log-binomial model standard error the treatment parameter was $$0.10483$$, whereas the estimate is $$0.12242$$ in the Poisson model. If we estimate a robust standard error using the function vcovHC in the sandwich package, the standard error is in now line with the log-binomial estimate. (I could simulate many data sets to see what the empirical variation of the estimate is, but I’ll leave that to you as an exercise.)

library(sandwich)

data.table(
estimate = summary(fit_pois)$coef["A", "Estimate"], pois.SE = summary(fit_pois)$coef["A", "Std. Error"],
logbin.SE = summary(fit_logbin)\$coef["A", "Std. Error"],
robust.SE = sqrt(diag(vcovHC(fit_pois, type = "HC3")))["A"]
)
##     estimate  pois.SE logbin.SE robust.SE
##        <num>    <num>     <num>     <num>
## 1: 0.5555258 0.122418  0.104825 0.1050351

And just to doubly make sure that the robust standard errors are reasonable, we can estimate standard errors using a simple bootstrap:

bootstrap_both <- function(dx) {

selected.rows <- dx[, sample(id, replace = TRUE), keyby = A][, V1]
ds <- dx[selected.rows]

fit_logbin = glm(y ~ A, family = binomial(link="log"), data = ds)
fit_pois <- glm(y ~  A, family = poisson, data = ds)

data.table(bs.logbin = coef(fit_logbin)["A"], bs.pois = coef(fit_pois)["A"])

}

bs.res <- rbindlist(lapply(1:1500, function(x) bootstrap_both(dd)))
bs.res[, .(bs.se_logbin = sd(bs.logbin), bs.se_pois = sd(bs.pois))]
##    bs.se_logbin bs.se_pois
##           <num>      <num>
## 1:    0.1043538  0.1043538

It does look like the robust standard errors for the Poisson regression model are indeed pretty robust, and that in this simple case at least, there is no difference between the Poisson and log-binomial models.

References:

Guangyong Zou. “A Modified Poisson Regression Approach to Prospective Studies with Binary Data.” American Journal of Epidemiology. Volume 159, Issue 7, 1 April 2004, Pages 702–706.