joint distribution of clustered count outcomes with extra zeros. Two random effects models were formulated. The first model assumed a shared random effects term between the logistic model of the conditional probability of perfect zeros and the conditional mean of the imperfect state. The second formulation relaxed the shared random effects assumption by relating the conditional probability of perfect zeros and the conditional mean of the imperfect state to two correlated random effects variables. Under the conditional independence assumption and the missing at random assumption, a direct optimization of the marginal likelihood and an EM algorithm were proposed to fit the proposed models.

**3. Statistical models for intra-oral caries data**

Although many dental studies provide detailed tooth-level data on caries activity, most analyses still rely on aggregated scores such as the DMF index. These scores summarize, at the mouth level, caries information that is typically recorded for each individual at the tooth or tooth-surface level. They have been instrumental in evaluating and comparing the risk of dental caries among population groups. Despite these advances in understanding the etiology of dental caries, fundamental questions about the spatial distribution of caries in the mouth remain unanswered. The intra-oral spatial distribution of dental caries can help determine whether the disease develops symmetrically in the mouth, and whether different types of teeth (incisors, canines and molars) and tooth surfaces (facial, lingual, occlusal, mesial, distal and incisal) are equally susceptible to dental caries. It is well recognized that the morphology of pit-and-fissure surfaces makes them more susceptible to decay than smooth surfaces. It is therefore no surprise that the posterior molar and premolar teeth, which have pit-and-fissure surfaces, are more susceptible to decay than the anterior teeth.

The analysis of intra-oral data poses a number of difficulties because of the inherent spatial association of teeth and tooth surfaces in the mouth. It is well known, for example, that multiple outcomes recorded on the same unit necessitate methods for correlated data. This section reviews some of the statistical techniques commonly used to analyze such data. The focus is on parametric models, namely the class of generalized linear mixed effects models and the class of generalized estimating equation models. These regression models take into account the unique spatial structure of teeth and tooth surfaces in the mouth.

#### i. Generalized linear mixed effects models

Generalized linear mixed-effects models constitute the broader class of mixed-effects models for correlated continuous, binary, multinomial and count data (Breslow and Clayton, 1993). They are likelihood-based and are often formulated as hierarchical models. At the first stage, a conditional distribution of the response given the random effects is specified, usually assumed to be a member of the exponential family. At the second stage, a prior distribution (typically normal) is imposed on the random effects. The conditional expectations (given the random effects) are made of two components, a fixed effects term and a random effects term. The fixed effects term represents covariate effects that do not change with the subject. Random effects represent subject-specific coefficients viewed as deviations from the fixed effects (average) coefficients. Most importantly, they account for the within-mouth correlation under the conditional independence assumption.

In dental caries research, data collected at the tooth level or tooth-surface level are typically binary outcomes representing the presence or absence of decay. For such data, a logistic regression model with random effects is typically used. In this class of models, fixed-effects regression parameters have a subject-specific interpretation, conditional on the random effects (Verbeke and Molenberghs, 2000). That is, they have a direct and meaningful interpretation only for covariates that change within the cluster (the subject's mouth), such as the location of a tooth or a tooth surface in the mouth. The probabilities of tooth and tooth-surface decay are conditional on the random effects and can be used to capture changes occurring within a particular subject's mouth. To assess changes across all subjects' mouths, the modeler must integrate the random effects out of the quantities of interest. Because generalized linear mixed effects models are likelihood-based, they can be highly sensitive to misspecification of the assumed distributions. However, they are known to remain valid under less restrictive missing-data mechanisms, such as data missing at random (Little and Rubin, 1987).
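The step of integrating out the random effects can be illustrated numerically. The sketch below assumes the simplest case, a logistic model with a single normally distributed random intercept per mouth, and approximates the integral with a trapezoid rule. The function names (`expit`, `marginal_prob`, `logit`) and the parameter values (`beta0`, `beta1`, `sigma`) are hypothetical choices for illustration, not estimates from any study.

```python
import math

def expit(t):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def marginal_prob(eta, sigma, n_grid=2001, width=8.0):
    """Population-averaged probability of decay for linear predictor eta,
    obtained by integrating expit(eta + b) over b ~ N(0, sigma^2).
    The integral is approximated by a trapezoid rule on [-width*sigma, width*sigma]."""
    if sigma == 0.0:
        return expit(eta)  # no random effect: conditional = marginal
    lo = -width * sigma
    h = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * expit(eta + b) * density
    return total * h

# Hypothetical subject-specific parameters (assumed for illustration only):
# beta0 = log-odds of decay for an anterior tooth in a "typical" mouth (b = 0),
# beta1 = subject-specific log odds ratio, molar versus anterior tooth,
# sigma = standard deviation of the mouth-level random intercept.
beta0, beta1, sigma = -2.0, 1.5, 1.5

p_anterior = marginal_prob(beta0, sigma)
p_molar = marginal_prob(beta0 + beta1, sigma)
marginal_log_or = logit(p_molar) - logit(p_anterior)

print("conditional P(decay | b = 0), anterior:", expit(beta0))
print("marginal P(decay), anterior:", p_anterior)
print("subject-specific log odds ratio:", beta1)
print("population-averaged log odds ratio:", marginal_log_or)
```

Under these assumptions the population-averaged log odds ratio is smaller in magnitude than the subject-specific coefficient, which is why the two kinds of parameters should not be compared directly. With more than one random effect the integral becomes multidimensional, and a quadrature or Monte Carlo scheme would normally replace the simple grid used here.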

#### ii. Generalized estimating equations models

Although a variety of standard likelihood-based models are available for outcomes that are approximately normal, models for discrete outcomes (such as binary outcomes) generally require a different methodology. Liang and Zeger (1986) proposed the generalized estimating equations (GEE) approach, an extension of generalized linear models to correlated data. The basic idea of this family of models is to specify a function that links the linear predictor to the mean response, and to estimate the parameters from a set of estimating functions built with any working correlation model. A sandwich estimator, which corrects for any misspecification of the working correlation model, is then used to compute the standard errors of the parameter estimates. GEE-based models are very popular as an all-round technique for analyzing correlated data when the exact likelihood is difficult to specify. One of the strong points of this methodology is that the full joint distribution of the data does not need to be specified to guarantee consistent and asymptotically normal parameter estimates. Instead, only a working correlation model for the clustered observations is required for estimation. GEE regression parameter estimates have a population-averaged interpretation, analogous to those obtained from a cross-sectional data analysis. This property makes GEE-based models desirable in population-based studies, where the focus is on average effects and the within-subject association is treated as a nuisance term.
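The estimation steps described above can be sketched in a few lines. The example below is a minimal illustration, not the original Liang and Zeger implementation: it assumes a logit link, a single binary covariate, and an independence working correlation, so the estimating equations reduce to those of ordinary logistic regression, while the standard errors come from a cluster-level sandwich estimator. The function names and the toy data are hypothetical.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def gee_pieces(clusters, beta):
    """Estimating function, 'bread' and 'meat' for logit(mu) = b0 + b1 * x
    under an independence working correlation."""
    score = [0.0, 0.0]
    bread = [[0.0, 0.0], [0.0, 0.0]]
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for cluster in clusters:
        u = [0.0, 0.0]  # this cluster's contribution to the estimating function
        for x, y in cluster:
            mu = expit(beta[0] + beta[1] * x)
            row = (1.0, float(x))
            for a in range(2):
                u[a] += row[a] * (y - mu)
                for b in range(2):
                    bread[a][b] += row[a] * row[b] * mu * (1.0 - mu)
        for a in range(2):
            score[a] += u[a]
            for b in range(2):
                meat[a][b] += u[a] * u[b]  # outer product summed over clusters
    return score, bread, meat

def fit_gee_independence(clusters, n_iter=25):
    """Solve the estimating equations by Newton steps, then return the
    estimates and their sandwich (robust) standard errors."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        score, bread, _ = gee_pieces(clusters, beta)
        binv = inv2(bread)
        beta = [beta[a] + binv[a][0] * score[0] + binv[a][1] * score[1]
                for a in range(2)]
    _, bread, meat = gee_pieces(clusters, beta)
    binv = inv2(bread)
    cov = matmul2(matmul2(binv, meat), binv)  # bread^-1 * meat * bread^-1
    return beta, [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]

# Made-up toy data: five mouths, four teeth each, as (x, y) pairs with
# x = 1 for a molar, 0 for an anterior tooth, and y = 1 if decayed.
mouths = [
    [(0, 0), (0, 0), (1, 1), (1, 1)],
    [(0, 0), (0, 0), (1, 0), (1, 1)],
    [(0, 0), (0, 1), (1, 1), (1, 1)],
    [(0, 0), (0, 0), (1, 0), (1, 0)],
    [(0, 1), (0, 0), (1, 1), (1, 0)],
]
beta, se = fit_gee_independence(mouths)
print("population-averaged estimates:", beta)
print("sandwich standard errors:", se)
```

The independence structure is chosen here only to keep the sketch short. An exchangeable or other working correlation would add a step that estimates the correlation parameter between Newton updates, and in practice an established GEE routine from a statistical package would be used instead of hand-written code.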

The GEE approach has several advantages over a likelihood-based model. It is computationally tractable in applications where fully parametric approaches are computationally very demanding, if not infeasible. It is also less sensitive to distributional misspecification than full likelihood-based models. A major limitation of GEE-based models, at least in their original 1986 formulation, is that they require a more stringent missing-data mechanism (data missing completely at random) to produce valid inferences.
