Creating a nice looking Table 1 with standardized mean differences -

I’m in the middle of a perfect storm, winding down three randomized clinical trials (RCTs), with patient recruitment long finished and data collection all wrapped up. This means a lot of data analysis, presentation prep, and paper writing (and not so much blogging). One common (and not so glamorous) thread cutting across all of these RCTs is the need to generate a Table 1, the comparison of baseline characteristics that convinces readers that randomization worked its magic (i.e., that study groups are indeed “comparable”). My primary goal here is to provide some R code to automate the generation of this table, but not before highlighting some issues related to checking for balance and pointing you to a couple of really interesting papers.

Table 1 and balance

In Table 1, we report summary statistics by intervention arm. The mean and standard deviation (or median and interquartile range) of continuous measures and percentages for categorical measures are provided for a selected set of baseline subject characteristics, such as age, sex, and baseline health status. Typically, these tables will include a statistic intended to provide an objective measure of balance (or lack thereof). Any indication of imbalance for a particular characteristic across groups, might suggest that the final estimate of the treatment effect should be adjusted, lest there be any residual confounding bias of the effect estimate.

There are two key (and related) questions here. First, what should that “objective” measure of imbalance be, and second should we even be checking for imbalance in the first place? Traditionally, Table 1 has included a set of p-values resulting from a series of between-group comparisons, one for each baseline measure. This paper, written by Douglas Altman in 1985 (back when I was still an undergraduate!), points out a slew of issues with using the p-value in this context. I’ve always been concerned that studies with large sample sizes may have small p-values for small differences (i.e., differences that we should not be worried about), but Altman is actually most concerned that using p-values in the context of smaller studies can mislead us into ignoring important differences. He states that “unfortunately, … use of significance tests may be unhelpful. It is the strength of the association rather than the significance level (which also depends upon sample size) which is of importance.”

A second paper, written ten years later by Stephen Senn, argued even more strongly that it may actually be counter-productive to attempt a “formal” assessment of balance. Using the p-value, you are implicitly conducting a hypothesis test, with the null hypothesis that the study has used randomization to allocate subjects to the groups. But this is indeed what you have done, so there is no question there. (Of course, it is possible that someone has cheated.) Senn recommends instead that rather than conducting these tests, “the practical statistician will do well to establish beforehand a limited list of covariates deemed useful and fit them regardless. Such a strategy will usually lead to a gain in power, has no adverse effect on unconditional size and controls conditional size with respect to the covariates identified.”

Given all of this, why should we do anything beyond reporting the group means and percentages and let the readers decide about comparability? It is hardly compelling to say this, but I think most journals will demand some formal comparison (though, to be honest, I haven’t attempted to submit a table without). And, if there needs to be a comparison, I would move away from the p-value given its shortcomings alluded to here, and use a standardized mean difference (SMD). In the case of a continuous measure, this is the difference in group means divided by the pooled standard deviation (and is defined differently for categorical measures). The SMD quantifies the difference on a scale that is comparable across measures so that the reader can identify where the largest imbalances are, and make a judgement about comparability.

Which is a long-winded way of getting to the point of the this post: how can we easily generate a nice looking Table 1 with SMDs in R?

Simulating data

First step is to generate some data for an RCT. Here are the R packages I will be using:

library(simstudy)
library(table1)
library(smd)
library(flextable)
library(officer)
library(smd)

The data set will have four “baseline” measures, two numerical and two categorical; missingness will be generated for two of the measures. Three of the variables are actually derived from the same categorical variable in order to compare the SMD for a categorical variable treated numerically as well as with missing data.

Here is the data generation process:

def <-  
  defData(varname = "rx", formula = "1;1", dist = "trtAssign") |>
  defData(varname = "x", formula = 0, variance = 10) |>
  defData(varname = "v1", formula = ".5;.3;.2", dist = "categorical")

dm <- 
  defMiss(varname = "x", formula = .10) |>
  defMiss(varname = "f2_v1", formula = '.05 + .05*(frx == "Control")')

set.seed(8312)

dd <- genData(1000, def)
dd <- genFactor(dd, "rx", labels = c("Control", "Treatment"), replace = TRUE)
dd <- genFactor(dd, "v1", prefix = "f1_")
dd <- genFactor(dd, "v1", prefix = "f2_", labels = c("red", "blue", "green"))

missMat <- genMiss(dd, dm, idvars = "id")
dobs <- genObs(dd, missMat, idvars = "id")

dobs

##         id          x v1       frx f1_v1 f2_v1
##    1:    1  3.3059744  1   Control     1   red
##    2:    2 -2.7291981  3 Treatment     3 green
##    3:    3         NA  1 Treatment     1   red
##    4:    4  3.1638764  1   Control     1   red
##    5:    5  5.2252358  1 Treatment     1   red
##   ---                                         
##  996:  996         NA  1   Control     1   red
##  997:  997 -1.8891992  2 Treatment     2  blue
##  998:  998  0.2994518  2 Treatment     2  blue
##  999:  999 -0.8043489  2 Treatment     2  blue
## 1000: 1000  1.5822111  1 Treatment     1  <NA>

Calculating the SMD

Standardized mean differences can be calculated using the smd package, which uses the methods described in this paper by Yang and Dalton. The standardized mean difference for a numeric measure is

$d = \frac{ \left( \bar{x}_1 - \bar{x}_2 \right) } {\text{se}_p },$

where $\bar{x}_1$ and $\bar{x}_2$ are the means for each group. $\text{se}_p$ is the pooled standard deviation:

$\text{se}_p = \sqrt{\frac{s^2_1 + s^2_2}{2}},$

where $s^2_1$ and $s^2_2$ are the group-specific variances. Here is the SMD for the continuous measure x:

with(dobs, smd(x = x, g = frx, na.rm = TRUE))

##        term   estimate
## 1 Treatment 0.06275175

For categorical measures, the SMD is the multivariate Mahalanobis distance between the group-specific proportion vectors: $\{p_{11}, \dots,p_{1k}\}$ and $\{p_{21}, \dots,p_{2k}\}$ . Here is the SMD for the categorical measure f1_v1:

with(dobs, smd(x = f1_v1, g = frx, na.rm = TRUE))

##        term   estimate
## 1 Treatment 0.02629001

Creating Table 1

We are creating Table 1 with package table1. (See here for a nice vignette.) The package does not explicitly calculate the SMD, but allows us to customize the table creation with a user-defined function, which is shown below. An alternative package,tableone, does have an SMD option built in. However, missing data reporting and integration with the flextable package, two very important features, are not built into tableone; in contrast, table1 provides both capabilities.

Here is the relatively simple code used to generate the table:

mysmd <- function(x, ...) {
  
  # Construct vectors of data y, and groups (strata) g
  
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  
  abs(round(smd::smd(y, g, na.rm = TRUE)[2], 3))
  
}

tab_1 <- table1(
  ~ x + v1 + f1_v1 + f2_v1 | frx, 
  data = dobs, overall = FALSE, 
  render.continuous=c(.="Mean (SD)"),
  extra.col = list(`SMD`= mysmd)
)

And here is the table that is generated by table1:

	Control (N=500)	Treatment (N=500)	SMD
x			0.063
Mean (SD)	0.204 (3.19)	0.00807 (3.06)
Missing	59 (11.8%)	52 (10.4%)
v1			0.005
Mean (SD)	1.76 (0.802)	1.76 (0.808)
f1_v1			0.026
1	234 (46.8%)	238 (47.6%)
2	151 (30.2%)	145 (29.0%)
3	115 (23.0%)	117 (23.4%)
f2_v1			0.042
red	216 (43.2%)	223 (44.6%)
blue	139 (27.8%)	137 (27.4%)
green	103 (20.6%)	114 (22.8%)
Missing	42 (8.4%)	26 (5.2%)

This is pretty nice as it is, but we might want embellish a bit by using the capabilities of flextable, another package I’ve become enamored with lately. The table1 object can be turned directly into a flextable using function t1flex. And once we have transformed the table type, the possibilities are almost endless. One really nice thing about a flextable is that it can be output to a Word file, to a PowerPoint file, html, or other useful formats. Taking this approach, Table 1 generation (or really any table generation) can be fully automated, obviating any need for manual table creation and eliminating at least one possible source of human error.

For example, here is code that approximates a JAMA-style table:

set_flextable_defaults(
  font.family = "Calibri", 
  font.size = 11
)

header <- "Table 1"
footer <- "Values are No. (%) unless otherwise noted. SD = standard deviation"

tab_1f <- t1flex(tab_1) |> 
  add_header_lines(header) |>
  add_footer_lines(footer) |>
  bold(i = 1, part = "header") |> 
  hline_top(part = "header", border = fp_border(color = "red", width = 3)) |> 
  hline(i = 1, part = "header", border = fp_border(width = 0.25)) |>
  hline_top(part = "body", border = fp_border(width = 0.25)) |> 
  hline_bottom(part = "body", border = fp_border(width = 0.25)) |> 
  hline_bottom(part = "footer", border = fp_border(width = 0.25)) |> 
  border_inner_h(part = "body", border = fp_border(width = 0.25, style = "dotted")) |> 
  autofit(part = "body") |>
  bg(part = "body", bg = "#f5f5f5") |>
  align(part = "all", align = "center") |> 
  align(part = "header", j=1, i=2, align = "left")  |>
  align(part = "footer", align = "left") |>
  merge_v(j = 1) |>
  valign(j = 1, valign = "top") |>
  align(j = 1, align = "left")

And, the final product:

References:

Altman, Douglas G. “Comparability of randomised groups.” Journal of the Royal Statistical Society Series D: The Statistician 34, no. 1 (1985): 125-136.

Senn, Stephen. “Testing for baseline balance in clinical trials.” Statistics in medicine 13, no. 17 (1994): 1715-1726.

Yang, D. and Dalton, J.E., 2012, April. A unified approach to measuring the effect size between two groups using SAS. In SAS global forum (Vol. 335, pp. 1-6).

Rich B (2023). table1: Tables of Descriptive Statistics in HTML. R package version 1.4.3.

Gohel D, Skintzos P (2023). flextable: Functions for Tabular Reporting.