Implementing a one-step GEE algorithm for very large cluster sizes in R -

Very large data sets can present estimation problems for some statistical models, particularly ones that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models that are used when individual observations are correlated within groups can have severe computation challenges when the cluster sizes get too large. GEE are often used when repeated measures for an individual are collected over time; the individual is considered the cluster in this analysis. Estimation in this case is not really an issue because the cluster sizes are typically relatively small. However, if there are groups of individuals, we also need to account for correlation. Unfortunately, if these group/cluster sizes are too large - perhaps bigger than 1000 - traditional GEE estimation techniques just may not be feasible.

An approach to GEE that is feasible has been described by Lipsitz et al and implemented using a SAS macro (which is available on the journal’s website). I am not much of a SAS user, so I searched for an implementation in R. Since I didn’t come across anything, I went ahead and implemented it myself. I am undecided about creating an R package for this, but in the meantime I thought I would compare it to a standard package in R and provide a link to the code if you’d like to implement it yourself. (And, if it does already exist in an R package, definitely let me know, because I certainly don’t want to duplicate anything.)

The one-step GEE algorithm

Traditional GEE models (such as those fit with R packages gee and geepack) allow for flexibility in specifying the within-cluster correlation structure (we generally still assume that individuals in different clusters are uncorrelated). For example, one could assume that the correlation across individuals is constant within a cluster. We call this exchangeable or compound symmetry correlation, and the intra-cluster correlation (ICC) is the measure of that correlation. Alternatively, if measurements are collected over time, we might assume that measurements taken closer together are more highly correlated; this is called auto-regressive correlation.

The proposed algorithm that is implemented here is called the one-step GEE, and is operating under the assumption of exchangeable correlation. To provide a little more detail on the algorithm, but to keep it simple, let me quote directly from the paper’s abstract:

We propose a one-step GEE estimator that (1) matches the asymptotic efficiency of the fully iterated GEE; (2) uses a simpler formula to estimate the [intra-cluster correlation] ICC that avoids summing over all pairs; and (3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this article is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.

The rest of the way, I will simulate data and fit the models using the traditional estimation approach as well as the one-step approach.

Comparing standard GEE with one-step GEE

To start, I am simulating a simple data set with 100 clusters that average 100 individuals per cluster.

library(simstudy)
library(data.table)
library(geepack)

First, the definitions used in the data generation. Each individual has three covariates, and the probability is a function of two of them:

d1 <- defData(varname = "n", formula = 100, dist = "noZeroPoisson")

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "p", formula = "-0.7 + 0.7*x1 - 0.4*x2", 
        dist = "nonrandom", link="logit")

And then the data generation - the final step creates the within-site correlated outcomes with an ICC of 0.15:

set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dc <- addColumns(d2, dc)

dd <- addCorGen(dc, idvar = "site", param1 = "p",
        rho = 0.15, corstr = "cs", dist = "binary", cnames = "y", method = "ep")

Here’s a few records from the data set, which has just under 10,000 observations across the 100 clusters:

dd

##       site   n   id          x1          x2          x3         p y
##    1:    1  88    1 -0.57111723  0.21022042 -0.31201547 0.2343570 0
##    2:    1  88    2 -0.18406857  0.68915628 -0.72801775 0.2488957 0
##    3:    1  88    3 -0.35066169 -0.10884256  0.28141278 0.2886548 0
##    4:    1  88    4 -0.32095917 -0.03676544 -0.33847489 0.2870070 0
##    5:    1  88    5 -0.05132678  0.07095837 -0.09308927 0.3177108 1
##   ---                                                              
## 9784:  100 106 9784 -0.26912777 -0.22893853  0.16303770 0.3107074 1
## 9785:  100 106 9785  0.16399013  0.26200677 -0.03187061 0.3340309 0
## 9786:  100 106 9786  0.14450843 -0.24399797 -0.28108422 0.3772482 1
## 9787:  100 106 9787  0.29307929  0.03889844 -0.37495296 0.3750989 0
## 9788:  100 106 9788 -0.06246070 -0.41041326  0.43236084 0.3590345 1

We can fit a regular GEE model here, since the cluster sizes are relatively small:

system.time(geefit <- geese(y ~ x1 + x2 + x3, id = site, data = dd, 
  family = binomial, corstr = "exchangeable"))

##    user  system elapsed 
##   2.245   0.103   1.903

summary(geefit)

## 
## Call:
## geese(formula = y ~ x1 + x2 + x3, id = site, data = dd, family = binomial, 
##     corstr = "exchangeable")
## 
## Mean Model:
##  Mean Link:                 logit 
##  Variance to Mean Relation: binomial 
## 
##  Coefficients:
##               estimate     san.se        wald            p
## (Intercept) -0.7275165 0.08459309  73.9632404 0.000000e+00
## x1           0.7396758 0.06584516 126.1929384 0.000000e+00
## x2          -0.3191633 0.06073004  27.6196887 1.476680e-07
## x3          -0.0477172 0.06113708   0.6091727 4.350995e-01
## 
## Scale Model:
##  Scale Link:                identity 
## 
##  Estimated Scale Parameters:
##              estimate     san.se     wald p
## (Intercept) 0.9954265 0.03464657 825.4633 0
## 
## Correlation Model:
##  Correlation Structure:     exchangeable 
##  Correlation Link:          identity 
## 
##  Estimated Correlation Parameters:
##        estimate     san.se     wald            p
## alpha 0.1441719 0.02168204 44.21411 2.943523e-11
## 
## Returned Error Value:    0 
## Number of clusters:   100   Maximum cluster size: 125

The one-step GEE function (which I’ve called gee1step) runs quite a bit faster than the standard GEE model (more than 10 times faster), but the results are virtually identical.

system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   0.138   0.018   0.090

fit1

## $estimates
##                   est     se.err          z      p.value
## Intercept -0.72743563 0.08549337 -8.5086793 1.759257e-17
## x1         0.73943837 0.06704127 11.0296000 2.750859e-28
## x2        -0.31903588 0.06136514 -5.1989760 2.003894e-07
## x3        -0.04758544 0.06165700 -0.7717768 4.402466e-01
## 
## $rho
## [1] 0.1440159
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 97.88
## 
## $clusters$min_size
## [1] 77
## 
## $clusters$max_size
## [1] 125
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"

The one-step algorithm with very large cluster sizes

Obviously, in the previous example, gee1step is unnecessary because geese handled the data set just fine. But, in the next example, with an average of 10,000 observations per cluster, geese will not run - at least not on my MacBook Pro. gee1step does just fine. I’m generating the data slightly differently here since simstudy doesn’t do well with extremely large correlation matrices. I’m using a random effect instead to induce correlation:

vicc <- iccRE(0.15, dist = "binary")

d1 <- defData(varname = "n", formula = 10000, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "b", formula = 0, variance = vicc)

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1,  dist = "normal")
d2 <- defDataAdd(d2, varname = "y", formula = "-0.7 + 0.7*x1 - 0.4*x2 + b", 
        dist = "binary", link="logit")

### generate data

set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dd <- addColumns(d2, dc)

Now, we have almost one million observations:

dd

##         site     n          b     id          x1            x2          x3 y
##      1:    1  9879 -1.3761022      1 -0.11929302  0.4136057653 -0.16119383 1
##      2:    1  9879 -1.3761022      2  0.03086998  0.3085905336  0.46055431 0
##      3:    1  9879 -1.3761022      3  0.51821656  0.2047086274  0.16721059 0
##      4:    1  9879 -1.3761022      4 -0.27688665  0.0030742246 -1.08113944 0
##      5:    1  9879 -1.3761022      5  0.03850389 -0.0991678903 -0.09447011 0
##     ---                                                                     
## 997872:  100 10065  0.2648166 997872  0.46022770  0.1588510133 -0.11851768 0
## 997873:  100 10065  0.2648166 997873  0.10696608  0.0002424567 -0.10632926 1
## 997874:  100 10065  0.2648166 997874  0.32317258 -0.1626614787  0.37855380 1
## 997875:  100 10065  0.2648166 997875 -0.17992641 -0.0333636043 -0.20060539 0
## 997876:  100 10065  0.2648166 997876  0.65942651  0.1960137135 -0.05687481 1

Despite the very large cluster sizes, the one-step algorithm still runs very fast. In addition to what is shown here, I have conducted experiments with repeated data sets to confirm that the coefficient estimates are unbiased and the standard error estimates are correct.

system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   2.371   0.339   1.813

fit1

## $estimates
##                    est      se.err          z      p.value
## Intercept -0.569908988 0.069360637  -8.216605 2.093444e-16
## x1         0.615260991 0.010232501  60.128117 0.000000e+00
## x2        -0.349871546 0.008694526 -40.240439 0.000000e+00
## x3         0.008132196 0.006470784   1.256756 2.088421e-01
## 
## $rho
## [1] 0.1016716
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 9978.76
## 
## $clusters$min_size
## [1] 9766
## 
## $clusters$max_size
## [1] 10242
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"

And please, if someone thinks it would be valuable for me to create a package for this, let me know. It would certainly help motivate me :).

UPDATE: I actually went ahead and created the most bare bone of packages, gee1step. The package can be downloaded from GitHub (not CRAN) by using the command devtools::install_github("kgoldfeld/gee1step"). I welcome anyone who wants to help me improve it so that it can go up on CRAN.

Reference:

Lipsitz, Stuart, Garrett Fitzmaurice, Debajyoti Sinha, Nathanael Hevelone, Jim Hu, and Louis L. Nguyen. “One-step generalized estimating equations with large cluster sizes.” Journal of Computational and Graphical Statistics 26, no. 3 (2017): 734-737.