**Generalized Additive Models in Environmental Health: A Literature Review**

Jalila Jbilou\* and Salaheddine El Adlouni *Université de Moncton, Moncton, Canada* 

### **1. Introduction**

84 Novel Approaches and Their Applications in Risk Assessment


Time series regression models are especially suitable in epidemiology for evaluating the short-term effects of time-varying exposures. Typically, a single population is assessed with reference to the change over time in the rate of a health outcome and the corresponding changes in the exposure factors during the same period. In time series regression, the dependent and independent variables are measured over time, and the purpose is to model the relationship between these variables through regression methods. Various applications of these models have been reported in the literature, exploring the relationship between mortality and air pollution (Katsouyanni et al. 2009; Wong et al. 2010; Balakrishnan et al. 2011); hospital admissions and air pollution (Peng et al. 2008; Zanobetti and Schwartz 2009; Lall et al. 2011); pollution plumes and breast cancer (Vieira et al. 2005); diet and cancer (Harnack et al. 1997); and mortality and drinking water (Braga et al. 2001). Different time series methods have been used in these studies, including linear models (Hatzakis et al. 1986), log-linear models (Mackenbach et al. 1992), Poisson regression models (Schwartz et al. 2004), and Generalized Additive Models (Dominici 2002; Wood 2006). Generalized Additive Models fit a smooth relationship between two or more variables and are useful for complex relationships that are not easily fitted by standard linear or non-linear models.

The present chapter reviews the Generalized Additive Model (GAM), a class of statistical models commonly used in time series regression, in particular because they can allow for serial correlation, which makes them potentially useful for environmental epidemiology.

### **2. Generalized additive models**

The classic multiple linear regression model has the form:

$$Y = X\beta + \varepsilon \tag{1}$$

where $Y$ is the response variable, $X$ is the $n \times p$ matrix of the $p$ independent variables $X_1, \ldots, X_p$, $\beta$ is the vector of parameters, and $\varepsilon$ is the vector of errors, normally distributed with mean 0 and variance $\sigma^2$. Consequently, $Y$ is also normally distributed, with $E(Y) = X\beta$ and covariance matrix $\sigma^2 I$ ($I$ is the identity matrix).

<sup>\*</sup> Corresponding Author

Linear models are central in applied statistics, mainly because of their simple structure and ease of interpretation. However, they have limitations and are inadequate when the assumption of normality of the response variable is no longer justified. The linear model is therefore extended to the Generalized Linear Model (GLM), which covers a large class of response distributions belonging to the exponential family. The distribution of $Y$ is related to the linear combination of the covariates, $\eta = X\beta$, via the link function $g(\cdot)$, such that $\eta = g(\mu) = g(E(Y))$.

To introduce more flexibility in the dependence structure between the response variable and the covariates, Generalized Additive Models (GAM), an extension of the GLM, replace the linear dependence functions with more flexible non-linear functions (Hastie and Tibshirani, 1990). The dependences are generally represented by non-parametric smoothing functions. Statistical inference consists of estimating the non-linear functions $f_j(X_j)$, $j = 1, \ldots, p$, one for each explanatory variable $X_j$. This allows the specific form of the effect of each explanatory variable on the dependent variable $Y$ to be identified.

In practice, the objective is to model the dependence between the response variable $Y$ and the explanatory variables $X_1, \ldots, X_p$ for three main purposes: description, inference and prediction. The goal is to find an explicit form of the effect $f_j(X_j)$ of each variable $X_j$ on the variability of $Y$. The Generalized Additive Model (GAM) can be summarized by the following three components:

1. The random component: $Y$ follows a distribution of the exponential family, with mean and variance given, respectively, by $E(Y) = \mu$ and $\operatorname{var}(Y) = \sigma^2$.
2. The systematic component: the explanatory variables $X_1, \ldots, X_p$ compose the regressor, defined by

$$\eta = a + \sum_{j=1}^{p} f_j\left(X_j\right) \tag{2}$$

3. The link function $g(\cdot)$ is such that $\eta = g(\mu) = g(E(Y))$, which implies that $E(Y) = g^{-1}(\eta)$.

The exponential family of distributions includes the Normal, Binomial, Gamma, Poisson, Geometric, Negative Binomial and Exponential distributions.

The non-linear functions $f_j(\cdot)$ are usually represented by non-parametric dependence functions based on smoothing. Smoothing consists of constructing a smooth (typically polynomial) function that summarizes the trends in the data. Types of smoothing used to express non-linear relations between $Y$ and the covariates $X_j$, $j = 1, \ldots, p$, in GAMs include scatterplot smoothing, parametric regression, moving averages, kernel smoothing and spline smoothing. A spline is a combination of polynomial pieces. The knots are the points that mark the transition between the polynomial pieces (Eilers and Marx, 1996). The constraints that join the pieces are defined by the number of derivatives required to be continuous at the knots. The most popular choice of spline is the natural cubic spline: a piecewise polynomial of degree 3 whose second derivative is zero at the boundaries. It offers less flexibility at the boundaries, but this is an advantage because the fit given by a regression spline has a large variance near the boundaries (Hastie and Tibshirani, 1990). A B-spline basis is independent of the response variable $Y$ and depends only on the following information: (i) the range of the explanatory variable; (ii) the number and position of the knots; and (iii) the degree of the B-spline. The properties of a B-spline of degree $q$ include the following:

- It is formed of $q + 1$ polynomial pieces, each of degree $q$;
- At the points where the pieces join, the derivatives up to order $q - 1$ are continuous.
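As a minimal sketch of the B-spline properties described above, the following pure-Python code evaluates a B-spline basis with the Cox-de Boor recursion. The function names `bspline_basis` and `equally_spaced_knots`, the choice of equally spaced knots extended beyond the range of the variable, and the example values are assumptions made for illustration; they are not taken from the studies reviewed here.

```python
def equally_spaced_knots(x_min, x_max, n_segments, degree):
    """Equally spaced knots, extended `degree` steps beyond each end of the range."""
    step = (x_max - x_min) / n_segments
    return [x_min + step * i for i in range(-degree, n_segments + degree + 1)]


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function of the given degree at the point x.

    Cox-de Boor recursion. Returns len(knots) - degree - 1 values.
    """
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open knot interval [t_i, t_{i+1}).
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    for q in range(1, degree + 1):
        new_basis = []
        for i in range(n_intervals - q):
            left_den = knots[i + q] - knots[i]
            right_den = knots[i + q + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + q + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            new_basis.append(left + right)
        basis = new_basis
    return basis


# Hypothetical example: cubic basis (q = 3) with 5 segments on [0, 1].
knots = equally_spaced_knots(0.0, 1.0, n_segments=5, degree=3)
values = bspline_basis(0.37, knots, degree=3)
print(len(values), sum(values), sum(1 for v in values if v > 0))
```

The basis depends only on the range, the knots and the degree, never on the response. At a point strictly between two knots, exactly $q + 1$ basis functions are non-zero, because each B-spline spans $q + 1$ adjacent knot intervals, and inside the range the basis values sum to one.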


One of the main advantages of the generalized additive model (GAM) is its great flexibility in representing the relations between the dependent variable and the explanatory variables. Berger et al. (2004) illustrate the advantages of the GAM in describing the relation between the use of fluoroquinolone antibiotics and the resistance of *Staphylococcus aureus* bacteria collected from adult patients hospitalized for at least 48 hours. The dependent variable $Y(t)$ was the monthly number of cases in which the bacteria collected from an infected patient were resistant to fluoroquinolone, and the explanatory variables $X_m(t)$, $m = 1, \ldots, p$, were monthly indicators of the daily doses of antibiotics administered. The variable $Y(t)$ follows a Poisson distribution $P(\lambda)$, where the parameter $\lambda$ is the average number of cases per month and is a function of the covariates. The link function is the logarithm and the regressor has the form

$$\eta(t) = a + \sum_{m=1}^{p} f_m(X_m(t))$$

in which $f_m(\cdot)$ is a spline function. The results showed a significant relation between the use of fluoroquinolone and the resistance of the bacteria.

GAMs are also used in the prognostic analysis of diseases. For example, Gehrmann et al. (2003) studied multiple sclerosis in order to identify the variables that have significant effects on the sustained progression of the disease, to determine the intensity and form of these effects, and to estimate the survival curves. The use of Generalized Additive Models showed that, among the available explanatory variables, only the initial level of severity and the number of relapses during the twelve months preceding the study had significant effects on the hazard rate. The hazard rate $h(t)$ describes the instantaneous risk of the event (here, death) at time $t$, given that the patient has survived up to time $t$.
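For reference, the hazard rate can be written in the standard survival-analysis form below. The notation $T$ for the random time to the event and $S(t)$ for the survival function is an assumption introduced here for illustration and is not taken from the studies cited above.

$$h(t) = \lim_{\Delta t \to 0^{+}} \frac{P\left(t \le T < t + \Delta t \mid T \ge t\right)}{\Delta t}, \qquad S(t) = P(T > t) = \exp\left(-\int_{0}^{t} h(u)\, du\right)$$

Under this definition $h(t)$ is a rate rather than a probability, so it can exceed one, and the survival curve follows directly from the estimated hazard.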

In a study of the failure rate $h(t)$ of patients with breast cancer (Hastie et al., 1992), a GAM was used to identify, among the prognostic factors, those with significant non-linear relations with $h(t)$. The prognostic factors were: the presence or absence of tumor necrosis, the size of the tumor, the number of samples examined, the patient's age, the body mass index, and the number of days between the surgical intervention and the beginning of the study. Among these variables, the non-significant relation was identified with age and body mass index. The authors stated that non-linear modeling had the advantage, first, of guarding against model misspecification, which would lead to incorrect conclusions about the effectiveness of a treatment, and second, of providing information on the relation between the prognostic factors and the risk of disease that standard models (linear regression, normal distribution) do not provide.

GAMs are also used to analyze the impact of climate and environmental variables on public health. In Quebec, a study of the impact of climate variables on mortality was conducted by Doyon et al. (2006). The daily number of deaths was modeled by Poisson regression with a logarithmic link function, and the explanatory climate variables selected were humidity, a heat threshold and functions of the average daily temperature. A similar project carried out on European cities with diverse climatic conditions reached the same conclusion: a significant relation exists between mortality and temperature in several European cities (Michelozzi et al. 2007). The numbers of deaths and of hospital admissions were classified by age group (15-64 years, 65-74 years, 75 years and above) and by cause (all causes except external causes, cardiovascular diseases, cerebrovascular diseases, respiratory diseases, influenza). The climate variables considered were temperature, dew point, wind speed, wind direction, pressure, total cloud cover, solar radiation, precipitation and visibility. The pollution variables were SO2, TSP (black smoke), PM10, NO2 and CO. The analysis was done separately for the warm season (April-September) and the cold season (October-March), which allows different model structures to be used for each season (Terzi and Cengiz, 2009). Recently, Bayentin et al. (2010) used a GAM to study the association between climate variables and circulatory diseases. The short-term effect of climate conditions on the incidence of ischemic heart disease (IHD) over the 1989-2006 period was examined for Quebec's 18 health regions, controlling for seasonality and socio-demographic conditions.

… estimation problem in both cases to that of a GLM, with all its advantages related to the linear dependence functions. Despite the simplicity of the penalized GLM, in the case of smoothing splines (Hastie and Tibshirani, 1990) the problem of the large system of equations remains. In the case of regression splines, each spline function is a sum of B-spline basis functions. B-splines are easy to construct, but the problem of the optimal choice of the number and position of the knots arises (Hastie and Tibshirani, 1990). Eilers and Marx (1996) showed that this problem can be avoided by combining B-splines with a difference penalty. The penalty is applied directly to the coefficients in order to control the roughness of the spline functions. A criterion can be employed to choose the number of knots and the value of the penalty parameter.

When P-splines are considered, the GAM has the form $f_j(X_j) = B_j A_j$, with a response variable whose distribution belongs to the exponential family. Here $B_j$, $j = 1, \ldots, p$, is the B-spline matrix (with $n_j$ knots) of dimension $N \times n_j$, and $A_j$ is the $n_j$-vector of B-spline coefficients, so that $B_j A_j$ represents the part of the variability of $Y$ explained by $X_j$. The model can be rewritten as follows:

$$g(E(Y)) = a + \sum_{j=1}^{p} f_j\left(X_j\right) = BA \tag{3}$$

where $B = [\mathbf{1} \mid B_1 \mid B_2 \mid \cdots \mid B_p]$ and $A = (a, A_1', \ldots, A_p')'$. We are left with a GLM, and the parameters $A$ are estimated by maximizing the penalized log-likelihood. This is done by iterating the penalized GLM Fisher scoring step below until the desired convergence criterion is met:

$$\hat{A}_{t+1} = \left(B' \hat{W}_t B + P\right)^{-1} B' \hat{W}_t \hat{z}_t \tag{4}$$

where

$$\hat{W} = \operatorname{diag}\left\{\frac{\left[h'(\hat{\eta}_i)\right]^2}{\operatorname{Var}(Y_i)}\right\}, \qquad \hat{z}_i = \hat{\eta}_i + \frac{y_i - h(\hat{\eta}_i)}{h'(\hat{\eta}_i)},$$

and $P = \operatorname{blockdiag}(0, \lambda_1 P_1, \ldots, \lambda_p P_p)$. $P$ is the component that summarizes the penalty on the B-spline coefficients of the $p$ covariates, and $h$ is the inverse of the link function $g$.

The approach assumes that the effect function $f_j$ of a covariate $X_j$ can be approximated by a polynomial spline written as a linear combination of B-spline basis functions. The crucial problem with such regression splines is the choice of the number and position of the knots. A small number of knots may lead to a function space that is not flexible enough to capture the variability of the data, whereas a large number of knots may lead to serious overfitting. Similarly, the position of the knots may have a strong influence on the estimation. A remedy can be based on a roughness penalty approach, as proposed by Eilers and Marx (1996).
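The penalized Fisher scoring step of Eq. (4) can be illustrated with a short NumPy sketch for a Poisson response with a logarithmic link, for which $h(\eta) = \exp(\eta)$, $\hat{W} = \operatorname{diag}(\hat{\mu}_i)$ and $\hat{z}_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)/\hat{\mu}_i$. Several simplifying assumptions are made here and are not part of the reviewed work: a single covariate, simulated data, equally spaced knots, a second-order difference penalty with a fixed penalty parameter `lam`, and no separate intercept column (the B-spline columns sum to one, so the constant is already in the basis and the system stays invertible). The names `bspline_design` and `fit_pspline_poisson` are hypothetical.

```python
import numpy as np


def bspline_design(x, x_min, x_max, n_segments, degree=3):
    """B-spline design matrix on equally spaced knots (Cox-de Boor recursion)."""
    step = (x_max - x_min) / n_segments
    knots = x_min + step * np.arange(-degree, n_segments + degree + 1)
    x = np.asarray(x, dtype=float)[:, None]
    # Degree 0: indicators of the half-open knot intervals.
    B = ((knots[:-1] <= x) & (x < knots[1:])).astype(float)
    for q in range(1, degree + 1):
        left = (x - knots[:-q - 1]) / (knots[q:-1] - knots[:-q - 1])
        right = (knots[q + 1:] - x) / (knots[q + 1:] - knots[1:-q])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B


def fit_pspline_poisson(x, y, x_min, x_max, n_segments=10, degree=3,
                        diff_order=2, lam=1.0, max_iter=50, tol=1e-8):
    """Poisson P-spline fit with log link by penalized Fisher scoring."""
    B = bspline_design(x, x_min, x_max, n_segments, degree)
    n_basis = B.shape[1]
    D = np.diff(np.eye(n_basis), n=diff_order, axis=0)  # difference matrix
    P = lam * D.T @ D                                   # roughness penalty
    # Start from the constant fit log(mean(y)); rows of B sum to one.
    A = np.full(n_basis, np.log(y.mean()))
    for _ in range(max_iter):
        eta = B @ A
        mu = np.exp(eta)              # h(eta) for the log link
        W = mu                        # [h'(eta)]^2 / Var(Y) = mu for Poisson
        z = eta + (y - mu) / mu       # working response
        A_new = np.linalg.solve(B.T @ (W[:, None] * B) + P, B.T @ (W * z))
        converged = np.max(np.abs(A_new - A)) < tol
        A = A_new
        if converged:
            break
    return A, B


# Simulated example (assumed data, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=300)
y = rng.poisson(np.exp(1.0 + np.sin(2.0 * np.pi * x)))
A_hat, B = fit_pspline_poisson(x, y, 0.0, 1.0, lam=1.0)
mu_hat = np.exp(B @ A_hat)
print(A_hat.round(2))
```

In this sketch the penalty parameter is fixed by hand. In practice it would be selected by a criterion, as noted above, and a larger `lam` gives a smoother fitted curve.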
