Disclaimer: The views and opinions expressed in this document are those of the author and do not necessarily reflect the official policy or position of any of the affiliations above.

Contents

Introduction

Suppose we are interested in assessing how age and sex could affect the presence of a given disease in a given population. To do that, suppose we take a random sample of individuals from the population of interest resulting in a dataset including individuals from different cities. To emulate that, we can develop a simulation study.

In this document we simulate data to reproduce the problem. Then, we fit a model to the simulated data and finally we compare the estimated effects with the true effects that were used to simulate the data.

First, we consider the case of just one city and fit a logistic regression model (which is a model included in the family of General Linear Models, GLM). Then we generalise to the case of several city and we model the data using a logistic regression model with random effects (which is a model included in the family of General Linear Mixed Models, GLMM).

Data from a single population

Simulation setting

We simulate a sample of size \(n\) under the following conditions:

  • Sex (\(S\)), from a Bernoulli distribution with probablity of being a male equal to probmale

  • Age (\(A\)), from a normal distribution with approximate range equal to agerange

  • Outcome (\(Y\)), from a Bernoulli distribution with probability of being affected by the disease, \(P(Y) = \pi\), which depends on \(S\) and \(A\) through the following logistic model:

\[\begin{equation} \label{eq:glm} \log(\text{odds}(Y\, |\, S, A)) = \log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_S\,S + \beta_A\,A. \end{equation}\]

That model includes three coefficents (\(\beta_0\), \(\beta_S\) and \(\beta_A\)), so we need three “conditions” in order to set the values of these coefficients. For instance:

  • The OR of \(Y\) associated to sex (i.e. when changing from female to male) is equal to orsex
  • The OR of \(Y\) associated to age, when increasing the age deltaage years, is equal to orage
  • The pravalence of \(Y\) among women aged age0 is equal to pi0

Exercise: Prove the following relationships

\[\begin{equation} \label{eq:orglm} \left\{ \begin{array}{l} \beta_S = \log(\textbf{orsex})\\ \beta_A = \frac{\log(\textbf{orage})}{\textbf{deltaage}}\\ \beta_0 = \log\left(\frac{\textbf{pi0}}{1 - \textbf{pi0}}\right) - \beta_A\,\textbf{age0} \end{array} \right. \end{equation}\]

The following function simbin implements such simulation setting as well as the fitting model to the simulate data. The function returns the values of the OR (true and estimated) and the fitted model.

simbin <- function(n,             # sample size
                   probmale,      # probability of male
                   orsex,         # OR for sex
                   orage,         # OR for age
                   deltaage,      # increase in age associated to orage  
                   pi0,           # prevalence for women age0 years old
                   age0,          # age associated to pi0  
                   agerange,      # approx. age range
                   seed = NULL) {
  # set the seed if any:
  if (!is.null(seed))
    set.seed(seed)
  
  # model:
  # logodds(y) = L = b0 + b1 * sex + b2 * age
  # simulate sex
  sex <- sample(x = c("male", "female"),
                size = n,
                prob = c(probmale, 1 - probmale),
                replace = TRUE)
  sex <- as.factor(sex)
  # simulate age
  age <- round(rnorm(n = n,
                     mean = mean(agerange),
                     sd = diff(agerange) / (2 * qnorm(0.99))))
  # get betasex
  betasex <- log(orsex) 
  # get betaage
  betaage <- log(orage) / deltaage
  # get beta0
  beta0 <- log(pi0 / (1 - pi0)) - age0 * betaage
  # get linear predictor L
  L <- beta0 + betasex * (sex == "male") + betaage * age
  prob <- 1 / (1 + exp(-L))
  data <- data.frame(sex, age, prob)
  # get y
  data$y <- Rlab::rbern(n = n, prob = prob)
  # fit model
  mod <- glm(y ~ sex + age, data = data, family = binomial)
  # OR for sex (true and estimated)
  ORsex <- c(orsex, exp(coef(mod)["sexmale"]))
  # OR for age (true and estimated)
  ORage <- c(orage, exp(deltaage * coef(mod)["age"]))
  OR <- rbind(ORsex, ORage)
  rownames(OR) <- c("sex", "age")
  colnames(OR) <- c("true", "estimate")
  res <- list(or = OR, model = mod)
  return(res)
}

Now, we can set the following values for the simulations (you can change them to explore new results):

myn <- 2000
myprobmale <- 0.5
myorsex <- 1.05
myorage <- 1.1
mydeltaage <- 15
mypi0 <- 0.25
myage0 <- 50
myagerange <- c(48, 80)

Back to top

Results

Let’s see the result of a single simulation:

sim1 <- simbin(n = myn,
               probmale = myprobmale,
               orsex = myorsex,
               orage = myorage,
               deltaage = mydeltaage,
               pi0 = mypi0,
               age0 = myage0,
               agerange = myagerange,
               seed = 666)
class(sim1)
## [1] "list"
names(sim1)
## [1] "or"    "model"
sim1
## $or
##     true estimate
## sex 1.05 1.126056
## age 1.10 1.161995
## 
## $model
## 
## Call:  glm(formula = y ~ sex + age, family = binomial, data = data)
## 
## Coefficients:
## (Intercept)      sexmale          age  
##    -1.75104      0.11872      0.01001  
## 
## Degrees of Freedom: 1999 Total (i.e. Null);  1997 Residual
## Null Deviance:       2292 
## Residual Deviance: 2289  AIC: 2295

In the previous output, we can see that the OR estimates are not exactly equal to the true values. To check if it is due just to error sampling, we can replicate the simulations a lot of times and then look at the distribution of the extimates:

set.seed(666)
nsim <- 500   # number of simulations
sim1rep <- replicate(n = nsim,
                     expr = simbin(n = myn,
                                   probmale = myprobmale,
                                   orsex = myorsex,
                                   orage = myorage,
                                   deltaage = mydeltaage,
                                   pi0 = mypi0,
                                   age0 = myage0,
                                   agerange = myagerange,
                                   seed = NULL)$or[, "estimate"]  # keep only estimates
)

# put it as a data.frame
sim1rep <- as.data.frame(t(sim1rep))

We can see the estimates of the first simulations

head(sim1rep)
##        sex       age
## 1 1.126056 1.1619949
## 2 1.069678 0.9976579
## 3 1.149329 1.0582224
## 4 1.166316 1.0445091
## 5 1.046072 1.2460847
## 6 1.100619 1.2505807

We can visualize how them are distributed

np <- dim(sim1rep)[2]
par(las = 1, mfrow = c(1, np))
for (i in 1:np)
  plot(density(sim1rep[, i]), main = paste("OR associated to", names(sim1rep)[i]), xlab = "Estimate")

Now we can calculate mean values and empirical 95% confidence interval for the estimates and compare with the true values:

sumsim1rep <- apply(sim1rep, 2, FUN = function(x) c(estimate = mean(x), quantile(x, probs = c(2.5, 97.5) / 100)))
sumsim1rep <- t(sumsim1rep)
sumsim1rep <- cbind(true = c(myorsex, myorage), sumsim1rep)

Results for 500 simulations:

round(sumsim1rep, 2)
##     true estimate 2.5% 97.5%
## sex 1.05     1.07 0.88  1.30
## age 1.10     1.11 0.88  1.36

Back to top

Data clustered by several populations

Introduction

Suppose now we have the same modeling problem but data now are gathered in several cities (i.e. multi-city study). Now, we should take into account that is reasonable to assume that two individuals could be more correlated if they belong to the same city. For instance, sociocultural, economic or lifestyle characteristics of individuals could be more correlated for two individuals who beleng to the same city than for two individuals who belong to different city. This is related with the concepts of within-city variability (variability among individuals from the same city) and between-city variability (variability among cities “mean individuals”). This situation can be modeled using GLMM, which can include fixed effects (in this example, the effects of sex and age) and random effects (in this example, the effect of the city). Roughly, a random effect can be seen as a perturbation of a fixed effect, with mean zero and a (unknown) variance. For instance, if we assume a random effect of the city, we are assuming that each city’s specific characteristics have an effect in the prevalence of \(Y\) that could be one (o more) of these:

  • A perturbation of the prevalence for a given profile of sex and age. Then we say that variable city has a random effect on the intercept (i.e. on \(\beta_0\)):

\[ \beta_0 = \beta_{00} + \beta_{0i}, \]

where \(\beta_{00}\) is the common intercept for all individuals and \(\beta_{0i} \sim \mathcal{N}(0, \sigma_0)\) is the specific perturbation of \(\beta_0\) for city \(i\). Note that, since \(\mathbb{E}(\beta_{0i}) = 0\), \(\mathbb{E}(\beta_0) = \mathbb{E}(\beta_{00})\).

  • A perturbation of the effect of sex. Then we say that variable city has a random effect on sex (i.e. on \(\beta_S\)):

\[ \beta_S = \beta_{S0} + \beta_{Si}, \] where \(\beta_{S0}\) is the common (poblational) effect of sex and \(\beta_{Si} \sim \mathcal{N}(0, \sigma_S)\) is the specific perturbation of such effect for city \(i\). Note that, since \(\mathbb{E}(\beta_{Si}) = 0\), \(\mathbb{E}(\beta_S) = \mathbb{E}(\beta_{S0})\).

  • A perturbation of the effect of age. Then we say that variable city has a random effect on sex (i.e. on \(\beta_A\))

\[ \beta_A = \beta_{A0} + \beta_{Ai}, \] where \(\beta_{A0}\) is the common (poblational) effect of age and \(\beta_{Ai} \sim \mathcal{N}(0, \sigma_A)\) is the specific perturbation of such effect for city \(i\). Note that, since \(\mathbb{E}(\beta_{Ai}) = 0\), \(\mathbb{E}(\beta_A) = \mathbb{E}(\beta_{A0})\).

Hence, the interpretation of \(\beta_0\), \(\beta_S\) and \(\beta_A\) is essentially the same than in the model without random effects.

In this example we assume (then in most cases in real studies) that the random effect on the city only exists in the intercept. That is to say that de effects of sex and age on the outcome are the same for all individuals regardless of the city they beleng to, and that the effect of the city is just a perturbation on the basal prevalence of the outcome. Hence, denoting cities by \(C_i, i = 1, 2, \dots, I\), where \(I\) is the number of cities, the model would be:

\[\begin{equation} \label{eq:glmmlogistic} \log(\text{odds}(Y\, |\, S, A, C)) = \log\left(\frac{\pi}{1 - \pi}\right) = (\beta_{00} + \beta_{0i}) + \beta_S\,S + \beta_A\,A, \quad \beta_{0i} \sim \mathcal{N}(0, \sigma_0). \end{equation}\]

In R, the model can be fitted using the following syntax:

library(lme4)
mod <- glmer(Y ~ S + A + (1 | C), family = binomial, data = data)

The estimates of \(\beta\) coefficients associated to the fixed effects (in this case, \(\beta_S\) and \(\beta_A\)), including the intercept coefficient \(\beta_{00}\), can be obtained with:

fixef(mod)

A 95% confidence interval for such coefficients can be obtained with:

confint(mod, parm = "beta_")

The estimates of \(\beta\) coefficients associated to the random effects (in this case, \(\beta_{0i}\)) are available at:

ranef(mod)

The estimates of the covariance matrix of the random effects (in this case, a single parameter, \(\sigma_0\)) are available at:

VarCorr(mod)

To learn about the syntax to specify the random effects structure, see Table 2 in https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

Back to top

Simulation setting

We proceed as in Case 1 but now we need to simulate also the random effect of the city, so we now add the variable city (\(C\)) and the parameters cities (the labels for the city names) and the proportion of observations in each city, probcities. The implementation is made in the function simbinr, which is analogous to simbin.

simbinr <- function(n,                 # see simbin
                    probmale,          # see simbin
                    orsex,             # see simbin
                    orage,             # see simbin
                    deltaage,          # see simbin  
                    pi0,               # see simbin
                    age0,              # see simbin  
                    agerange,          # see simbin
                    cities,            # city labels
                    probcities = NULL, # city sampling probs (NULL = equal probs)
                    sdranef,           # sd of the random effect of city in intercept
                    seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)
  
  # model:
  # logodds(y) = L = b0 + b1 * sex + b2 * age
  # simulate sex
  sex <- sample(x = c("male", "female"),
                size = n,
                prob = c(probmale, 1 - probmale),
                replace = TRUE)
  sex <- as.factor(sex)
  # simulate age
  age <- round(rnorm(n = n,
                     mean = mean(agerange),
                     sd = diff(agerange) / (2 * qnorm(0.99)))
               )
  # simulate cities:
  ncities <- length(cities)
  if (is.null(probcities))
    probcities <- rep(1, ncities)
  city <- sample(x = cities,
                 size = n,
                 prob = probcities,
                 replace = TRUE)
  city <- as.factor(city)
  # get betasex
  betasex <- log(orsex) 
  # get betaage
  betaage <- log(orage) / deltaage
  # get beta0
  beta0 <- log(pi0 / (1 - pi0)) - age0 * betaage
 
  data <- data.frame(sex, age, city)
  # simulate error by city
  errorcity <- rnorm(n = ncities, mean = 0, sd = sdranef)
  cityaux <- data.frame(city = cities, error = errorcity)
  data <- merge(data, cityaux)
  # get linear predictor L
  L <- beta0 + betasex * (data$sex == "male") + betaage * data$age + data$error
  prob <- 1 / (1 + exp(-L))
  # get y
  data$y <- Rlab::rbern(n = n, prob = prob)
  # fit model
  mod <- glmer(y ~ sex + age + (1 | city), data = data, family = binomial)
  # OR for sex (true and estimated)
  ORsex <- c(sextrue = orsex, sexestimate = exp(fixef(mod)["sexmale"]))
  # OR for age (true and estimated)
  ORage <- c(agetrue = orage, ageestimate = exp(deltaage * fixef(mod)["age"]))
  # estimated sd of the random effect
  sdranef <- c(sdtrue = sdranef, sdestimate = sqrt(as.numeric(VarCorr(mod))))
  # estimates
  estimates <- rbind(ORsex, ORage, sdranef)
  rownames(estimates) <- c("ORsex", "ORage", "sd0")
  colnames(estimates) <- c("true", "estimate")
  
  # random effects (city-specific intercepts):
  beta0city <- ranef(mod)
  res <- list(estimates = estimates, beta0i = beta0city, model = mod)
  return(res)
}

Now, we can set the following values for the extra parameters (you can change them to explore new results):

(mycities <- LETTERS[1:5])
## [1] "A" "B" "C" "D" "E"
myprobcities <- c(1, 2, 2, 3, 3)
mysdranef <- 2    

Back to top

Results

Let’s see the result of a single simulation:

sim2 <- simbinr(n = myn,
                probmale = myprobmale,
                orsex = myorsex,
                orage = myorage,
                deltaage = mydeltaage,
                pi0 = mypi0,
                age0 = myage0,
                agerange = myagerange,
                cities = mycities,
                probcities = myprobcities,
                sdranef = mysdranef, 
                seed = 666)
names(sim2)
## [1] "estimates" "beta0i"    "model"
sim2
## $estimates
##       true  estimate
## ORsex 1.05 0.9735237
## ORage 1.10 1.0148832
## sd0   2.00 0.7182938
## 
## $beta0i
## $city
##   (Intercept)
## A   0.4851682
## B   1.1394871
## C  -0.4396544
## D  -0.4002923
## E  -0.7826384
## 
## with conditional variances for "city" 
## 
## $model
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: y ~ sex + age + (1 | city)
##    Data: data
##       AIC       BIC    logLik  deviance  df.resid 
##  2488.633  2511.037 -1240.317  2480.633      1996 
## Random effects:
##  Groups Name        Std.Dev.
##  city   (Intercept) 0.7183  
## Number of obs: 2000, groups:  city, 5
## Fixed Effects:
## (Intercept)      sexmale          age  
##  -0.3366909   -0.0268331    0.0009849

The fitted model is:

mod2 <- sim2$model

mod2
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: y ~ sex + age + (1 | city)
##    Data: data
##       AIC       BIC    logLik  deviance  df.resid 
##  2488.633  2511.037 -1240.317  2480.633      1996 
## Random effects:
##  Groups Name        Std.Dev.
##  city   (Intercept) 0.7183  
## Number of obs: 2000, groups:  city, 5
## Fixed Effects:
## (Intercept)      sexmale          age  
##  -0.3366909   -0.0268331    0.0009849

summary(mod2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: y ~ sex + age + (1 | city)
##    Data: data
## 
##      AIC      BIC   logLik deviance df.resid 
##   2488.6   2511.0  -1240.3   2480.6     1996 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.5532 -0.7037 -0.5865  0.9033  1.7289 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  city   (Intercept) 0.5159   0.7183  
## Number of obs: 2000, groups:  city, 5
## 
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3366909  0.5715833  -0.589    0.556
## sexmale     -0.0268331  0.0970946  -0.276    0.782
## age          0.0009849  0.0072471   0.136    0.892
## 
## Correlation of Fixed Effects:
##         (Intr) sexmal
## sexmale -0.108       
## age     -0.817  0.026

The estimates of \(\beta_S\) and \(\beta_A\) (and \(\beta_{00}\)) are:

(betas <- fixef(mod2))
##   (Intercept)       sexmale           age 
## -0.3366908645 -0.0268331088  0.0009849001

A 95% confidence interval for such coefficients are:

(cibetas <- confint(mod2, parm = "beta_"))
## Computing profile confidence intervals ...
##                   2.5 %     97.5 %
## (Intercept) -1.48702858 0.81627829
## sexmale     -0.21729923 0.16355296
## age         -0.01323015 0.01520207

We can merge both point estimates and confidence intervals. Also, we can drop the results regarding the intercept of the model because that information is not of interest:

betas <- cbind(betas, cibetas)
betas <- betas[!rownames(betas) == "(Intercept)", ]
colnames(betas)[colnames(betas) == "betas"] <- "estimate"
betas
##              estimate       2.5 %     97.5 %
## sexmale -0.0268331088 -0.21729923 0.16355296
## age      0.0009849001 -0.01323015 0.01520207

Hence, given that \(\exp(\beta) = \text{OR}\), we can easily get the OR estimates (and 95% CI):

(OR <- exp(betas))
##          estimate     2.5 %   97.5 %
## sexmale 0.9735237 0.8046891 1.177688
## age     1.0009854 0.9868570 1.015318

According to the previous results, the estimated OR of the disease associate to sex, adjusted by age, is 0.97 (95% CI: 0.80, 1.18)). In other words, among individuals of the same age, the odds of the disease is a 2.65% lower among males than among women (95% CI: (-17.77% , 19.53%)). Regarding age, the OR associated to age in the previous table is for a 1-year increase. Hence, the OR estimated for a 15 years increase in age are can be computed as \(\text{OR} = \exp(15\beta)\), which is 1.01 (95% CI: 0.82, 1.26). In other words, among individuals of the same gender, a 15 years increase in age is associated to a 1.49% increase in the odds of the disease (95% CI: (-18.00% , 25.61%)).

The estimates of \(\beta\) coefficients associated to the random effects (i.e. the city-specific perturbations of the intercept, \(\beta_{0i}\)) are:

ranef(mod2)
## $city
##   (Intercept)
## A   0.4851682
## B   1.1394871
## C  -0.4396544
## D  -0.4002923
## E  -0.7826384
## 
## with conditional variances for "city"

The standard deviation of previous random effects is:

sd(unlist(ranef(mod2)))
## [1] 0.790549

The estimates of the covariance matrix of the random effects (in this case, a single parameter, \(\sigma_0\)) are:

(varranef <- VarCorr(mod2))
##  Groups Name        Std.Dev.
##  city   (Intercept) 0.71829

And \(\sigma_0\) is available at:

sqrt(as.numeric(varranef))
## [1] 0.7182938

In previous results, we can see that the OR and \(\sigma_0\) estimates are not exactly equal to the true values. To check that that is due to error sampling, we can replicate the simulations a lot of times and look at the distribution of the values of the estimates:

set.seed(666)
sim2rep <- replicate(n = nsim,
                     expr = simbinr(n = myn,
                                    probmale = myprobmale,
                                    orsex = myorsex,
                                    orage = myorage,
                                    deltaage = mydeltaage,
                                    pi0 = mypi0,
                                    age0 = myage0,
                                    agerange = myagerange,
                                    cities = mycities,
                                    probcities = myprobcities,
                                    sdranef = mysdranef, 
                                    seed = NULL)$estimates[, "estimate"] # only estimates
)

# put it as a data.frame
sim2rep <- as.data.frame(t(sim2rep))

We can see the estimates of the first simulations

head(sim2rep)
##       ORsex     ORage       sd0
## 1 0.9735237 1.0148832 0.7182936
## 2 0.9625590 0.9904537 1.7627971
## 3 0.9935033 0.9430369 1.8580695
## 4 1.2327877 0.9794527 2.2845385
## 5 1.0510110 1.4279600 1.3986295
## 6 0.9715692 1.0822376 1.6022167

We can visualize how them are distributed

np <- dim(sim2rep)[2]
par(las = 1, mfrow = c(1, np))
for (i in 1:np)
  plot(density(sim2rep[, i]), main = paste("OR associated to", names(sim2rep)[i]), xlab = "Estimate")

Now we can calculate mean values and empirical 95% confidence interval for the estimates and compare with the true values:

sumsim2rep <- apply(sim2rep, 2, FUN = function(x) c(estimate = mean(x), quantile(x, probs = c(2.5, 97.5) / 100)))
sumsim2rep <- t(sumsim2rep)
sumsim2rep <- cbind(true = c(myorsex, myorage, mysdranef), sumsim2rep)

Results for 500 simulations:

round(sumsim2rep, 2)
##       true estimate 2.5% 97.5%
## ORsex 1.05     1.05 0.84  1.31
## ORage 1.10     1.11 0.82  1.43
## sd0   2.00     1.70 0.61  3.12

Back to top

Modeling counts and rates

Similarly than we can extend the logistic regression model to the logistic regression mixed models to include random effects when modeling a binary outcome, we can extend the Poisson regression model to the Poisson mixed effects model to include random effects when modeling count or rate outcomes. For instance, if we are interested in modeling the effect of both sex and age on a count outcome \(Z\) (assuming \(Z \sim \mathcal{Pois}(\lambda)\)), including a random effect of city on the intercept, we could fit the following Poisson regression mixed effects model (assuming the usual logarithmic link function):

\[\begin{equation} \label{eq:glmmpoisson} \log(\lambda |\, S, A, C)) = (\beta_{00} + \beta_{0i}) + \beta_S\,S + \beta_A\,A, \quad \beta_{0i} \sim \mathcal{N}(0, \sigma_0). \end{equation}\]

In R:

library(lme4)
mod <- glmer(Y ~ S + A + (1 | C), family = poisson, data = data)

Further reading

  • A reading on modeling clustered categorical data with mixed effects models can be found in chapter 13 of this book and in chapter 10 of a more recent edition of the same book.

  • A reading on analysis of multicentre epidemiological studies, contrasting fixed or random effects modelling and and meta-analysis can be found here.

  • A nice introduction to mixed models with R can be found here.