Suppose we are interested in assessing how age and sex could affect the presence of a given disease in a given population. To do that, suppose we take a random sample of individuals from the population of interest resulting in a dataset including individuals from different cities. To emulate that, we can develop a simulation study.
In this document we simulate data to reproduce the problem. Then, we fit a model to the simulated data and finally we compare the estimated effects with the true effects that were used to simulate the data.
First, we consider the case of just one city and fit a logistic regression model (which is a model included in the family of General Linear Models, GLM). Then we generalise to the case of several city and we model the data using a logistic regression model with random effects (which is a model included in the family of General Linear Mixed Models, GLMM).
We simulate a sample of size \(n\) under the following conditions:
Sex (\(S\)), from a Bernoulli distribution with probablity of being a male equal to probmale
Age (\(A\)), from a normal distribution with approximate range equal to agerange
Outcome (\(Y\)), from a Bernoulli distribution with probability of being affected by the disease, \(P(Y) = \pi\), which depends on \(S\) and \(A\) through the following logistic model:
\[\begin{equation} \label{eq:glm} \log(\text{odds}(Y\, |\, S, A)) = \log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_S\,S + \beta_A\,A. \end{equation}\]
That model includes three coefficents (\(\beta_0\), \(\beta_S\) and \(\beta_A\)), so we need three “conditions” in order to set the values of these coefficients. For instance:
orsex
deltaage
years, is equal to orage
age0
is equal to pi0
Exercise: Prove the following relationships
\[\begin{equation} \label{eq:orglm} \left\{ \begin{array}{l} \beta_S = \log(\textbf{orsex})\\ \beta_A = \frac{\log(\textbf{orage})}{\textbf{deltaage}}\\ \beta_0 = \log\left(\frac{\textbf{pi0}}{1 - \textbf{pi0}}\right) - \beta_A\,\textbf{age0} \end{array} \right. \end{equation}\]
The following function simbin
implements such simulation setting as well as the fitting model to the simulate data. The function returns the values of the OR (true and estimated) and the fitted model.
simbin <- function(n, # sample size
probmale, # probability of male
orsex, # OR for sex
orage, # OR for age
deltaage, # increase in age associated to orage
pi0, # prevalence for women age0 years old
age0, # age associated to pi0
agerange, # approx. age range
seed = NULL) {
# set the seed if any:
if (!is.null(seed))
set.seed(seed)
# model:
# logodds(y) = L = b0 + b1 * sex + b2 * age
# simulate sex
sex <- sample(x = c("male", "female"),
size = n,
prob = c(probmale, 1 - probmale),
replace = TRUE)
sex <- as.factor(sex)
# simulate age
age <- round(rnorm(n = n,
mean = mean(agerange),
sd = diff(agerange) / (2 * qnorm(0.99))))
# get betasex
betasex <- log(orsex)
# get betaage
betaage <- log(orage) / deltaage
# get beta0
beta0 <- log(pi0 / (1 - pi0)) - age0 * betaage
# get linear predictor L
L <- beta0 + betasex * (sex == "male") + betaage * age
prob <- 1 / (1 + exp(-L))
data <- data.frame(sex, age, prob)
# get y
data$y <- Rlab::rbern(n = n, prob = prob)
# fit model
mod <- glm(y ~ sex + age, data = data, family = binomial)
# OR for sex (true and estimated)
ORsex <- c(orsex, exp(coef(mod)["sexmale"]))
# OR for age (true and estimated)
ORage <- c(orage, exp(deltaage * coef(mod)["age"]))
OR <- rbind(ORsex, ORage)
rownames(OR) <- c("sex", "age")
colnames(OR) <- c("true", "estimate")
res <- list(or = OR, model = mod)
return(res)
}
Now, we can set the following values for the simulations (you can change them to explore new results):
Let’s see the result of a single simulation:
sim1 <- simbin(n = myn,
probmale = myprobmale,
orsex = myorsex,
orage = myorage,
deltaage = mydeltaage,
pi0 = mypi0,
age0 = myage0,
agerange = myagerange,
seed = 666)
class(sim1)
## [1] "list"
names(sim1)
## [1] "or" "model"
sim1
## $or
## true estimate
## sex 1.05 1.126056
## age 1.10 1.161995
##
## $model
##
## Call: glm(formula = y ~ sex + age, family = binomial, data = data)
##
## Coefficients:
## (Intercept) sexmale age
## -1.75104 0.11872 0.01001
##
## Degrees of Freedom: 1999 Total (i.e. Null); 1997 Residual
## Null Deviance: 2292
## Residual Deviance: 2289 AIC: 2295
In the previous output, we can see that the OR estimates are not exactly equal to the true values. To check if it is due just to error sampling, we can replicate the simulations a lot of times and then look at the distribution of the extimates:
set.seed(666)
nsim <- 500 # number of simulations
sim1rep <- replicate(n = nsim,
expr = simbin(n = myn,
probmale = myprobmale,
orsex = myorsex,
orage = myorage,
deltaage = mydeltaage,
pi0 = mypi0,
age0 = myage0,
agerange = myagerange,
seed = NULL)$or[, "estimate"] # keep only estimates
)
# put it as a data.frame
sim1rep <- as.data.frame(t(sim1rep))
We can see the estimates of the first simulations
head(sim1rep)
## sex age
## 1 1.126056 1.1619949
## 2 1.069678 0.9976579
## 3 1.149329 1.0582224
## 4 1.166316 1.0445091
## 5 1.046072 1.2460847
## 6 1.100619 1.2505807
We can visualize how them are distributed
np <- dim(sim1rep)[2]
par(las = 1, mfrow = c(1, np))
for (i in 1:np)
plot(density(sim1rep[, i]), main = paste("OR associated to", names(sim1rep)[i]), xlab = "Estimate")
Now we can calculate mean values and empirical 95% confidence interval for the estimates and compare with the true values:
sumsim1rep <- apply(sim1rep, 2, FUN = function(x) c(estimate = mean(x), quantile(x, probs = c(2.5, 97.5) / 100)))
sumsim1rep <- t(sumsim1rep)
sumsim1rep <- cbind(true = c(myorsex, myorage), sumsim1rep)
Results for 500 simulations:
Suppose now we have the same modeling problem but data now are gathered in several cities (i.e. multi-city study). Now, we should take into account that is reasonable to assume that two individuals could be more correlated if they belong to the same city. For instance, sociocultural, economic or lifestyle characteristics of individuals could be more correlated for two individuals who beleng to the same city than for two individuals who belong to different city. This is related with the concepts of within-city variability (variability among individuals from the same city) and between-city variability (variability among cities “mean individuals”). This situation can be modeled using GLMM, which can include fixed effects (in this example, the effects of sex and age) and random effects (in this example, the effect of the city). Roughly, a random effect can be seen as a perturbation of a fixed effect, with mean zero and a (unknown) variance. For instance, if we assume a random effect of the city, we are assuming that each city’s specific characteristics have an effect in the prevalence of \(Y\) that could be one (o more) of these:
\[ \beta_0 = \beta_{00} + \beta_{0i}, \]
where \(\beta_{00}\) is the common intercept for all individuals and \(\beta_{0i} \sim \mathcal{N}(0, \sigma_0)\) is the specific perturbation of \(\beta_0\) for city \(i\). Note that, since \(\mathbb{E}(\beta_{0i}) = 0\), \(\mathbb{E}(\beta_0) = \mathbb{E}(\beta_{00})\).
\[ \beta_S = \beta_{S0} + \beta_{Si}, \] where \(\beta_{S0}\) is the common (poblational) effect of sex and \(\beta_{Si} \sim \mathcal{N}(0, \sigma_S)\) is the specific perturbation of such effect for city \(i\). Note that, since \(\mathbb{E}(\beta_{Si}) = 0\), \(\mathbb{E}(\beta_S) = \mathbb{E}(\beta_{S0})\).
\[ \beta_A = \beta_{A0} + \beta_{Ai}, \] where \(\beta_{A0}\) is the common (poblational) effect of age and \(\beta_{Ai} \sim \mathcal{N}(0, \sigma_A)\) is the specific perturbation of such effect for city \(i\). Note that, since \(\mathbb{E}(\beta_{Ai}) = 0\), \(\mathbb{E}(\beta_A) = \mathbb{E}(\beta_{A0})\).
Hence, the interpretation of \(\beta_0\), \(\beta_S\) and \(\beta_A\) is essentially the same than in the model without random effects.
In this example we assume (then in most cases in real studies) that the random effect on the city only exists in the intercept. That is to say that de effects of sex and age on the outcome are the same for all individuals regardless of the city they beleng to, and that the effect of the city is just a perturbation on the basal prevalence of the outcome. Hence, denoting cities by \(C_i, i = 1, 2, \dots, I\), where \(I\) is the number of cities, the model would be:
\[\begin{equation} \label{eq:glmmlogistic} \log(\text{odds}(Y\, |\, S, A, C)) = \log\left(\frac{\pi}{1 - \pi}\right) = (\beta_{00} + \beta_{0i}) + \beta_S\,S + \beta_A\,A, \quad \beta_{0i} \sim \mathcal{N}(0, \sigma_0). \end{equation}\]
In R, the model can be fitted using the following syntax:
The estimates of \(\beta\) coefficients associated to the fixed effects (in this case, \(\beta_S\) and \(\beta_A\)), including the intercept coefficient \(\beta_{00}\), can be obtained with:
A 95% confidence interval for such coefficients can be obtained with:
The estimates of \(\beta\) coefficients associated to the random effects (in this case, \(\beta_{0i}\)) are available at:
The estimates of the covariance matrix of the random effects (in this case, a single parameter, \(\sigma_0\)) are available at:
To learn about the syntax to specify the random effects structure, see Table 2 in https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf
We proceed as in Case 1 but now we need to simulate also the random effect of the city, so we now add the variable city (\(C\)) and the parameters cities
(the labels for the city names) and the proportion of observations in each city, probcities
. The implementation is made in the function simbinr
, which is analogous to simbin
.
simbinr <- function(n, # see simbin
probmale, # see simbin
orsex, # see simbin
orage, # see simbin
deltaage, # see simbin
pi0, # see simbin
age0, # see simbin
agerange, # see simbin
cities, # city labels
probcities = NULL, # city sampling probs (NULL = equal probs)
sdranef, # sd of the random effect of city in intercept
seed = NULL) {
if (!is.null(seed))
set.seed(seed)
# model:
# logodds(y) = L = b0 + b1 * sex + b2 * age
# simulate sex
sex <- sample(x = c("male", "female"),
size = n,
prob = c(probmale, 1 - probmale),
replace = TRUE)
sex <- as.factor(sex)
# simulate age
age <- round(rnorm(n = n,
mean = mean(agerange),
sd = diff(agerange) / (2 * qnorm(0.99)))
)
# simulate cities:
ncities <- length(cities)
if (is.null(probcities))
probcities <- rep(1, ncities)
city <- sample(x = cities,
size = n,
prob = probcities,
replace = TRUE)
city <- as.factor(city)
# get betasex
betasex <- log(orsex)
# get betaage
betaage <- log(orage) / deltaage
# get beta0
beta0 <- log(pi0 / (1 - pi0)) - age0 * betaage
data <- data.frame(sex, age, city)
# simulate error by city
errorcity <- rnorm(n = ncities, mean = 0, sd = sdranef)
cityaux <- data.frame(city = cities, error = errorcity)
data <- merge(data, cityaux)
# get linear predictor L
L <- beta0 + betasex * (data$sex == "male") + betaage * data$age + data$error
prob <- 1 / (1 + exp(-L))
# get y
data$y <- Rlab::rbern(n = n, prob = prob)
# fit model
mod <- glmer(y ~ sex + age + (1 | city), data = data, family = binomial)
# OR for sex (true and estimated)
ORsex <- c(sextrue = orsex, sexestimate = exp(fixef(mod)["sexmale"]))
# OR for age (true and estimated)
ORage <- c(agetrue = orage, ageestimate = exp(deltaage * fixef(mod)["age"]))
# estimated sd of the random effect
sdranef <- c(sdtrue = sdranef, sdestimate = sqrt(as.numeric(VarCorr(mod))))
# estimates
estimates <- rbind(ORsex, ORage, sdranef)
rownames(estimates) <- c("ORsex", "ORage", "sd0")
colnames(estimates) <- c("true", "estimate")
# random effects (city-specific intercepts):
beta0city <- ranef(mod)
res <- list(estimates = estimates, beta0i = beta0city, model = mod)
return(res)
}
Now, we can set the following values for the extra parameters (you can change them to explore new results):
Let’s see the result of a single simulation:
sim2 <- simbinr(n = myn,
probmale = myprobmale,
orsex = myorsex,
orage = myorage,
deltaage = mydeltaage,
pi0 = mypi0,
age0 = myage0,
agerange = myagerange,
cities = mycities,
probcities = myprobcities,
sdranef = mysdranef,
seed = 666)
names(sim2)
## [1] "estimates" "beta0i" "model"
sim2
## $estimates
## true estimate
## ORsex 1.05 0.9735237
## ORage 1.10 1.0148832
## sd0 2.00 0.7182938
##
## $beta0i
## $city
## (Intercept)
## A 0.4851682
## B 1.1394871
## C -0.4396544
## D -0.4002923
## E -0.7826384
##
## with conditional variances for "city"
##
## $model
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: y ~ sex + age + (1 | city)
## Data: data
## AIC BIC logLik deviance df.resid
## 2488.633 2511.037 -1240.317 2480.633 1996
## Random effects:
## Groups Name Std.Dev.
## city (Intercept) 0.7183
## Number of obs: 2000, groups: city, 5
## Fixed Effects:
## (Intercept) sexmale age
## -0.3366909 -0.0268331 0.0009849
The fitted model is:
mod2 <- sim2$model
mod2
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: y ~ sex + age + (1 | city)
## Data: data
## AIC BIC logLik deviance df.resid
## 2488.633 2511.037 -1240.317 2480.633 1996
## Random effects:
## Groups Name Std.Dev.
## city (Intercept) 0.7183
## Number of obs: 2000, groups: city, 5
## Fixed Effects:
## (Intercept) sexmale age
## -0.3366909 -0.0268331 0.0009849
summary(mod2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: y ~ sex + age + (1 | city)
## Data: data
##
## AIC BIC logLik deviance df.resid
## 2488.6 2511.0 -1240.3 2480.6 1996
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5532 -0.7037 -0.5865 0.9033 1.7289
##
## Random effects:
## Groups Name Variance Std.Dev.
## city (Intercept) 0.5159 0.7183
## Number of obs: 2000, groups: city, 5
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3366909 0.5715833 -0.589 0.556
## sexmale -0.0268331 0.0970946 -0.276 0.782
## age 0.0009849 0.0072471 0.136 0.892
##
## Correlation of Fixed Effects:
## (Intr) sexmal
## sexmale -0.108
## age -0.817 0.026
The estimates of \(\beta_S\) and \(\beta_A\) (and \(\beta_{00}\)) are:
A 95% confidence interval for such coefficients are:
(cibetas <- confint(mod2, parm = "beta_"))
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## (Intercept) -1.48702858 0.81627829
## sexmale -0.21729923 0.16355296
## age -0.01323015 0.01520207
We can merge both point estimates and confidence intervals. Also, we can drop the results regarding the intercept of the model because that information is not of interest:
betas <- cbind(betas, cibetas)
betas <- betas[!rownames(betas) == "(Intercept)", ]
colnames(betas)[colnames(betas) == "betas"] <- "estimate"
betas
## estimate 2.5 % 97.5 %
## sexmale -0.0268331088 -0.21729923 0.16355296
## age 0.0009849001 -0.01323015 0.01520207
Hence, given that \(\exp(\beta) = \text{OR}\), we can easily get the OR estimates (and 95% CI):
(OR <- exp(betas))
## estimate 2.5 % 97.5 %
## sexmale 0.9735237 0.8046891 1.177688
## age 1.0009854 0.9868570 1.015318
According to the previous results, the estimated OR of the disease associate to sex, adjusted by age, is 0.97 (95% CI: 0.80, 1.18)). In other words, among individuals of the same age, the odds of the disease is a 2.65% lower among males than among women (95% CI: (-17.77% , 19.53%)). Regarding age, the OR associated to age in the previous table is for a 1-year increase. Hence, the OR estimated for a 15 years increase in age are can be computed as \(\text{OR} = \exp(15\beta)\), which is 1.01 (95% CI: 0.82, 1.26). In other words, among individuals of the same gender, a 15 years increase in age is associated to a 1.49% increase in the odds of the disease (95% CI: (-18.00% , 25.61%)).
The estimates of \(\beta\) coefficients associated to the random effects (i.e. the city-specific perturbations of the intercept, \(\beta_{0i}\)) are:
ranef(mod2)
## $city
## (Intercept)
## A 0.4851682
## B 1.1394871
## C -0.4396544
## D -0.4002923
## E -0.7826384
##
## with conditional variances for "city"
The standard deviation of previous random effects is:
The estimates of the covariance matrix of the random effects (in this case, a single parameter, \(\sigma_0\)) are:
And \(\sigma_0\) is available at:
In previous results, we can see that the OR and \(\sigma_0\) estimates are not exactly equal to the true values. To check that that is due to error sampling, we can replicate the simulations a lot of times and look at the distribution of the values of the estimates:
set.seed(666)
sim2rep <- replicate(n = nsim,
expr = simbinr(n = myn,
probmale = myprobmale,
orsex = myorsex,
orage = myorage,
deltaage = mydeltaage,
pi0 = mypi0,
age0 = myage0,
agerange = myagerange,
cities = mycities,
probcities = myprobcities,
sdranef = mysdranef,
seed = NULL)$estimates[, "estimate"] # only estimates
)
# put it as a data.frame
sim2rep <- as.data.frame(t(sim2rep))
We can see the estimates of the first simulations
head(sim2rep)
## ORsex ORage sd0
## 1 0.9735237 1.0148832 0.7182936
## 2 0.9625590 0.9904537 1.7627971
## 3 0.9935033 0.9430369 1.8580695
## 4 1.2327877 0.9794527 2.2845385
## 5 1.0510110 1.4279600 1.3986295
## 6 0.9715692 1.0822376 1.6022167
We can visualize how them are distributed
np <- dim(sim2rep)[2]
par(las = 1, mfrow = c(1, np))
for (i in 1:np)
plot(density(sim2rep[, i]), main = paste("OR associated to", names(sim2rep)[i]), xlab = "Estimate")
Now we can calculate mean values and empirical 95% confidence interval for the estimates and compare with the true values:
sumsim2rep <- apply(sim2rep, 2, FUN = function(x) c(estimate = mean(x), quantile(x, probs = c(2.5, 97.5) / 100)))
sumsim2rep <- t(sumsim2rep)
sumsim2rep <- cbind(true = c(myorsex, myorage, mysdranef), sumsim2rep)
Results for 500 simulations:
Similarly than we can extend the logistic regression model to the logistic regression mixed models to include random effects when modeling a binary outcome, we can extend the Poisson regression model to the Poisson mixed effects model to include random effects when modeling count or rate outcomes. For instance, if we are interested in modeling the effect of both sex and age on a count outcome \(Z\) (assuming \(Z \sim \mathcal{Pois}(\lambda)\)), including a random effect of city on the intercept, we could fit the following Poisson regression mixed effects model (assuming the usual logarithmic link function):
\[\begin{equation} \label{eq:glmmpoisson} \log(\lambda |\, S, A, C)) = (\beta_{00} + \beta_{0i}) + \beta_S\,S + \beta_A\,A, \quad \beta_{0i} \sim \mathcal{N}(0, \sigma_0). \end{equation}\]
In R:
A reading on modeling clustered categorical data with mixed effects models can be found in chapter 13 of this book and in chapter 10 of a more recent edition of the same book.
A reading on analysis of multicentre epidemiological studies, contrasting fixed or random effects modelling and and meta-analysis can be found here.
A nice introduction to mixed models with R can be found here.