with the age and the body mass index. The authors stated that the non-linear modeling had the advantage, firstly, of guarding against misspecification of the model, which would lead to incorrect conclusions regarding the effectiveness of a treatment, and, secondly, of providing information on the relation between the prognostic factors and the risk of disease that standard (linear regression, normal distribution) models do not provide.

GAM models are also employed in analyses of the impact of climate and environmental variables on public health. In Quebec, a study of the impact of climate variables on mortality was conducted by Doyon et al. (2006). The number of daily deaths was modeled by Poisson regression with a logarithmic link function, and the explicative climate variables selected were the humidity, the heat threshold and functions of the average daily temperatures. A similar project carried out on European cities characterized by diverse climatic conditions arrived at the same conclusion: a significant relation exists between mortality and temperature in several European cities (Michelozzi et al., 2007). The numbers of deaths and of hospital admissions were classified by age group (15-64 years, 65-74 years, 75 years and above) and by cause (all causes except death due to external causes, cardiovascular diseases, cerebrovascular diseases, respiratory diseases, influenza). The climate variables considered were temperature, dew point, wind speed, wind direction, pressure, total cloud cover, solar radiation, precipitation, and visibility. The pollution variables were SO2, TSP (black smoke), PM10, NO2, and CO. The analysis was done separately for the warm season (April-September) and the cold season (October-March). This provides flexibility for the analysis, allowing the use of different model structures for each season (Terzi and Cengiz, 2009). Recently, Bayentin et al. (2010) used the GAM model to study the association between climate variables and circulatory diseases. The short-term effect of climate conditions on the incidence of ischemic heart disease (IHD) over the 1989-2006 period was examined for Quebec's 18 health regions, with control for seasonality and socio-demographic conditions.

**3. Parameter estimation**

#### **3.1 Local scoring procedure algorithm**

The algorithm (presented in Appendix C.1) can be summarized as an iterative, weighted process which adjusts one function *fj*, *j =* 1*… p*, while keeping the other *p*-1 components in their current state. GAM models, with this iterative algorithm implemented in S-Plus, became a popular analytical tool in epidemiology, especially in studies on the effects of environmental variables on public health (Dominici et al., 2002). However, estimation by this algorithm presents problems of convergence and validity when the weighting matrix *W* (Appendix C.1) is not diagonal and the independence hypothesis is not respected. Even if increasing the number of iterations improves the estimates, the standard errors of the estimates remain difficult to evaluate and computing the model's effective dimension is statistically demanding (Wood, 2006). Many authors have suggested more direct approaches to remedy these problems.
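As an illustration of this cycling idea (not the exact algorithm of Appendix C.1, which is not reproduced here), a minimal backfitting loop for a Gaussian additive model can be sketched as follows; the cubic-polynomial smoother and all function names are our own assumptions:

```python
import numpy as np

def poly_smooth(x, r, degree=3):
    """A simple linear smoother S_j: cubic-polynomial fit of the partial residual r on x."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, r, rcond=None)
    return X @ beta

def backfit(X, y, n_iter=20):
    """Backfitting: re-estimate each f_j on its partial residuals while the
    other p-1 components are kept in their current state."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((p, n))
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - f.sum(axis=0) + f[j]   # remove the other components
            f[j] = poly_smooth(X[:, j], partial)
            f[j] -= f[j].mean()                          # centre for identifiability
    return alpha, f

# Toy additive data: y = sin(x1) + x2^2 + noise
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([rng.uniform(-2, 2, n), rng.uniform(-1, 1, n)])
truth = np.sin(X[:, 0]) + X[:, 1] ** 2
y = truth + rng.normal(0, 0.1, n)
alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=0)
```

With smooth additive signals and a flexible enough per-coordinate smoother, a handful of cycles is typically sufficient for the fits to stabilize.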

#### **3.2 Simultaneous estimation**

The most effective way to estimate the parameters is to use a parametric GLM with a limited number of regression splines or smoothing splines. In both cases this reduces the parameter estimation problem to that of a GLM model, with all its advantages related to linear dependence functions. Despite the simplicity of the penalized GLM model, in the case of smoothing splines (Hastie and Tibshirani, 1990) the problem of the large system of equations remains. In the case of regression splines, each spline function is a linear combination of B-spline basis functions. This approach benefits from the ease of B-spline construction, but raises the problem of the optimal choice of the position and number of B-spline knots (Hastie and Tibshirani, 1990). Eilers and Marx (1996) have shown that this problem can be avoided by combining the B-splines with a difference penalty. The penalty is applied directly to the parameters in order to control the roughness of the spline functions, and a selection criterion can be employed to choose the number of knots and the value of the penalty parameter.

When P-splines are considered, the GAM has the form $g(E[Y]) = \sum_{j=1}^{p} f_j(X_j)$ with $f_j(X_j) = B_j A_j$, and the distribution of the response variable belongs to the exponential family. In this section, $B_j$, *j* = 1*… p*, is the B-spline matrix (with $n_j$ knots) of dimension $N \times n_j$, and $A_j$ is the $n_j$-vector of coefficients of the B-spline basis functions, so that $B_j A_j$ represents the part of the variability of *Y* explained by $X_j$. The model can be rewritten as follows:

$$g\left(E[Y]\right) = g(\mu) = BA\tag{3}$$

where $B = [\mathbf{1}, B_1, B_2, \dots, B_p]$ and $A = (a, A_1', \dots, A_p')'$. We are left with a GLM model, and the estimation of the parameters by maximization of the penalized log-likelihood is done by the penalized GLM Fisher scoring iteration below, repeated until the desired convergence criterion is reached.

$$
\hat{A}_{t+1} = \left(B'\hat{V}_t B + P\right)^{-1} B'\hat{V}_t \hat{z}_t \tag{4}
$$

where

88 Novel Approaches and Their Applications in Risk Assessment


$$\hat{V} = \text{diag}\left\{ \frac{\left[h'(\hat{\eta}_i)\right]^2}{Var(Y_i)} \right\}, \quad \hat{z}_i = \hat{\eta}_i + \frac{(y_i - \hat{\mu}_i)}{h'(\hat{\eta}_i)} \text{ and } P = \text{blockdiag}(0, \lambda_1 P_1, \dots, \lambda_p P_p).$$

*P* is the component which summarizes the penalty on the B-spline coefficients of the *p* covariates, and *h* is the inverse of the link function *g*.
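Formula (4) can be sketched numerically. The following is a minimal, hypothetical implementation for a single smooth and a Poisson response with log link, so that $h = \exp$, $\hat{V}$ reduces to $\text{diag}(\hat{\mu}_i)$ and $\hat{z}_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)/\hat{\mu}_i$; the B-spline basis is built by differencing truncated power functions (the Eilers-Marx construction), and all function names are our own:

```python
import numpy as np
from math import factorial

def tpower(x, t, p):
    """Truncated power function (x - t)_+^p."""
    return np.where(x > t, (x - t) ** p, 0.0)

def bspline_basis(x, xl, xr, nseg=10, deg=3):
    """B-spline basis on equally spaced knots, built by differencing
    truncated power functions (Eilers-Marx construction)."""
    dx = (xr - xl) / nseg
    knots = xl + dx * np.arange(-deg, nseg + deg + 1)
    P = tpower(x[:, None], knots[None, :], deg)
    D = np.diff(np.eye(knots.size), n=deg + 1, axis=0) / (factorial(deg) * dx ** deg)
    return ((-1) ** (deg + 1)) * P @ D.T

def fit_pspline_poisson(x, y, nseg=10, deg=3, lam=1.0, pord=2, tol=1e-8):
    """Penalized GLM Fisher scoring (formula (4)) for one P-spline smooth."""
    B = bspline_basis(x, x.min(), x.max(), nseg, deg)
    m = B.shape[1]
    Dp = np.diff(np.eye(m), n=pord, axis=0)
    Pen = lam * Dp.T @ Dp          # difference penalty; polynomials of degree
                                   # < pord (incl. the intercept) are unpenalized
    eta = np.log(y + 1.0)          # starting linear predictor
    for _ in range(100):
        mu = np.exp(eta)
        w = mu                     # [h'(eta)]^2 / Var(Y) = mu for Poisson
        z = eta + (y - mu) / mu    # working response z-hat
        a = np.linalg.solve(B.T @ (w[:, None] * B) + Pen, B.T @ (w * z))
        eta_new = B @ a
        if np.max(np.abs(eta_new - eta)) < tol:
            eta = eta_new
            break
        eta = eta_new
    return B, a, np.exp(eta)

# toy data: one smooth Poisson effect
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
mu_true = np.exp(1.0 + np.sin(2 * np.pi * x))
y = rng.poisson(mu_true).astype(float)
B, a, mu_hat = fit_pspline_poisson(x, y, lam=0.1)
```

The basis columns sum to one inside the domain (partition of unity), and with a moderate penalty the fitted means track the underlying smooth Poisson intensity.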

The approach assumes that the effect function $f_j$ of a covariate $X_j$ can be approximated by a polynomial spline written in terms of a linear combination of B-spline basis functions. The crucial problem with such regression splines is the choice of the number and the position of the knots. A small number of knots may lead to a function space which is not flexible enough to capture the variability of the data. A large number of knots may lead to serious overfitting. Similarly, the position of the knots may potentially have a strong influence on the estimation. A remedy can be based on a roughness penalty approach, as proposed by Eilers and Marx (1996).

Generalized Additive Models in Environmental Health: A Literature Review 91


Smoothing parameters are used to balance goodness-of-fit and smoothness, and a performance measure is used to find the optimal values of the penalties. The number and location of the knots are no longer crucial as long as a minimum number of knots is reached. In practice, this approach has difficulty reaching a solution when the number of smoothing functions in the model is high (Lang and Brezger, 2004). The P-spline approach is easy to implement and has the advantage of explicit formulas for the estimation matrix and the standard error estimates (Marx and Eilers, 1998). However, this simplicity is reduced if the knots are at unequal distances (Wood, 2006). Thus, despite the advantages of the P-spline approach in GAM models, the problem of estimating the parameters with the *penalized GLM Fisher scoring* algorithm remains important (Zhao et al., 2006; Wood, 2006; Binder and Tutz, 2006).

#### **3.3 Bayesian method**

The Bayesian approach is essentially based on the concept that the parameters to be estimated are not constants but random variables. Bayesian statistical inference is based on the posterior distributions of the parameters, which combine the prior information and the information observed from the sample. In the case of GAM models, we wish to estimate the parameter $a$ and the functions $f_1, \dots, f_p$. One of the advantages of the Bayesian approach over the penalized GLM Fisher scoring algorithm is that the uncertainty related to the variance of the components is taken into account through the posterior distribution of the parameters (Fahrmeir and Lang, 2001; Zhao et al., 2006). In practice, the analytical form of the posterior distribution is rarely available, and it is then difficult to extract its characteristics for risk assessment purposes. The Markov chain Monte Carlo (MCMC) procedure makes it possible to obtain all these characteristics by simulating samples from the posterior distribution, and thus to deduce parameter estimates, quantiles and the associated risk, as well as estimator uncertainty. More details on MCMC approaches and their convergence diagnostics are given in El Adlouni et al. (2006).

In the case of P-spline functions, the parameters $a$ of the GAM model, in equation (2), are random variables. The penalties based on finite differences of the B-spline coefficients are replaced by their stochastic equivalent, which corresponds to random walks of first or second order, defined by

$$a_{j,t} = a_{j,t-1} + u_{j,t} \quad \text{or} \quad a_{j,t} = 2a_{j,t-1} - a_{j,t-2} + u_{j,t} \tag{5}$$

with $u_{j,t} \sim N(0, \tau_j^2)$ and constant initial values $a_{j,1}$, $a_{j,2}$. The level of smoothing is thus controlled by the variance parameter $\tau_j^2$, which must also be estimated. Lang and Brezger (2004) suggest a prior distribution for the parameters $a_j$ of the form:

$$a_j \Big| \tau_j^2 \propto \frac{1}{\left(\tau_j^2\right)^{rk(K_j)/2}} \exp\left(-\frac{1}{2\tau_j^2} a_j' K_j a_j\right) \tag{6}$$

where $K_j$ is the penalty matrix and depends on the smoothing function $f_j$ and on the nature of the $X_j$ variable. The prior distribution of the parameter $\tau_j^2$ is an Inverse Gamma distribution *IG*(*cj*, *dj*), where *cj*, *dj* are the hyper-parameters, usually set from prior knowledge of the variables. It is nevertheless necessary to perform a sensitivity analysis on the choice of prior.
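The link between the stochastic formulation (5) and the difference penalty of Section 3.2 can be checked numerically: for a second-order random walk, the sum of squared innovations equals the quadratic form $a_j' K_j a_j$ with $K_j = D_2'D_2$, where $D_2$ is the second-difference matrix (and $rk(K_j) = n_j - 2$). A small sketch, with variable names of our own:

```python
import numpy as np

m = 12                                          # number of B-spline coefficients
rng = np.random.default_rng(42)
a = rng.normal(size=m)                          # an arbitrary coefficient vector

D2 = np.diff(np.eye(m), n=2, axis=0)            # second-difference matrix
K = D2.T @ D2                                   # penalty matrix K = D2'D2

quad_form = a @ K @ a                           # a'Ka appearing in the prior (6)
rw2_innovations = np.sum(np.diff(a, n=2) ** 2)  # sum of the u_t^2 in the RW2 of (5)

print(np.isclose(quad_form, rw2_innovations))   # → True
print(np.linalg.matrix_rank(K))                 # → 10, i.e. m - 2
```

The rank deficiency of $K$ (here $m-2$) reflects the fact that the random walk of order two leaves linear trends in the coefficients unpenalized.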

The posterior distribution of the model has the following form:

$$p(a, a_1, \tau_1^2, \dots, a_p, \tau_p^2 \mid y) \propto L(y \mid a, a_1, \dots, a_p) \prod_{j=1}^p \frac{1}{(\tau_j^2)^{rk(K_j)/2}} \exp\left(-\frac{1}{2\tau_j^2} a_j' K_j a_j \right) \prod_{j=1}^p (\tau_j^2)^{-c_j - 1} \exp\left(-\frac{d_j}{\tau_j^2}\right) \tag{7}$$

All the inference is based on the posterior distribution. The MCMC algorithm can be performed to estimate the empirical posterior distribution and the predictive distribution of the quantile to deduce the risk values.
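One concrete piece of such an MCMC sampler can be made explicit: combining the conditional prior (6) with the *IG*(*cj*, *dj*) hyperprior gives an inverse-gamma full conditional for $\tau_j^2$, namely $IG(c_j + rk(K_j)/2,\; d_j + a_j'K_ja_j/2)$, so this parameter can be updated by a conjugate Gibbs step. A hedged sketch of that single step (function names and hyper-parameter values are our assumptions; a full sampler would alternate it with updates of the $a_j$):

```python
import numpy as np

def sample_tau2(a, K, c=1.0, d=0.005, rng=None, size=1):
    """Gibbs draw of the smoothing variance tau_j^2 from its inverse-gamma
    full conditional IG(c + rk(K)/2, d + a'Ka/2). The defaults c=1, d=0.005
    are diffuse values often used in the P-spline literature."""
    rng = np.random.default_rng() if rng is None else rng
    shape = c + np.linalg.matrix_rank(K) / 2.0
    rate = d + 0.5 * (a @ K @ a)
    # if G ~ Gamma(shape, scale=1/rate), then 1/G ~ InverseGamma(shape, rate)
    return 1.0 / rng.gamma(shape, 1.0 / rate, size=size)

# sanity check against the analytical IG mean, rate / (shape - 1)
m = 10
D2 = np.diff(np.eye(m), n=2, axis=0)
K = D2.T @ D2
rng = np.random.default_rng(7)
a = rng.normal(size=m)
draws = sample_tau2(a, K, c=2.0, d=1.0, rng=rng, size=50000)
shape = 2.0 + (m - 2) / 2.0
rate = 1.0 + 0.5 * (a @ K @ a)
print(abs(draws.mean() - rate / (shape - 1)) / (rate / (shape - 1)) < 0.05)
```

Because this update is conjugate, it is exact and cheap; the non-conjugate coefficient blocks $a_j$ are the part of the sampler that typically requires Metropolis-Hastings steps.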

The assumptions of the Bayesian estimation model are completed by the following conditional independence assumptions:

a. For all explicative variables and *fj* parameters, the observations *Yi* are conditionally independent.
b. The prior distributions of the parameters are conditionally independent.
c. The prior distributions of the fixed effects and of the variances $\tau_j^2$, *j* = 1, …, *p*, are mutually independent.
#### **4. Performance measure**


In order to select the smoothing penalty and the number of knots that lead to the most adequate fit, some performance measures are used. The most commonly used performance measures are the Akaike information criterion (*AIC*) and the generalized cross-validation (*GCV*). They are based on the deviance statistic (or likelihood ratio statistic) which, for a count GAM model (the case of the Poisson distribution), is obtained by the following formula:

$$D(y; \hat{\mu}) = 2\sum_{i=1}^{n} \left[ y_i \ln(y_i / \hat{\mu}_i) - (y_i - \hat{\mu}_i) \right] \tag{8}$$

The Akaike information criterion, developed by Akaike (1973), measures the quality of the model fit to the observed data series. It is a function of the deviance *D*(*y*; *μ̂*) and is obtained by the following formula:

$$AIC = \frac{1}{n} [D(y; \hat{\mu}) + tr(R)\phi] \tag{9}$$

where *tr*(*R*) is the sum of the diagonal elements of the matrix *R* of the weighted additive-fit operator from the last iteration of the estimation process, and *φ* is the scale parameter.
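Formulas (8) and (9) can be checked on a toy count series; here $\hat{\mu}$ is taken from a hypothetical intercept-only fit ($\hat{\mu}_i = \bar{y}$), for which $tr(R) = 1$ and, in the Poisson case, $\varphi = 1$:

```python
import numpy as np

def poisson_deviance(y, mu):
    """D(y; mu) = 2*sum[ y*ln(y/mu) - (y - mu) ], with y*ln(y/mu) -> 0 when y = 0."""
    y = np.asarray(y, dtype=float)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

y = np.array([2, 0, 3, 1, 4])
mu_hat = np.full(y.size, y.mean())   # intercept-only fit: mu_i = 2.0

D = poisson_deviance(y, mu_hat)      # formula (8)
tr_R = 1.0                           # one effective parameter (the intercept)
phi = 1.0                            # Poisson scale parameter
AIC = (D + tr_R * phi) / y.size      # formula (9)

print(round(D, 4))                   # → 6.5917
```

The `np.where` guard implements the usual convention that the term $y_i \ln(y_i/\hat{\mu}_i)$ is zero when $y_i = 0$.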


The generalized cross-validation for the smoothing penalty is obtained by the following formula:

$$GCV(\boldsymbol{\lambda}) = \frac{1}{n} \sum\_{i=1}^{n} \left\{ \frac{y\_i - \hat{f}\_{\boldsymbol{\lambda}}(\mathbf{x}\_i)}{1 - tr(S\_{\boldsymbol{\lambda}}) / n} \right\}^2 \tag{10}$$

where $S_\lambda$ is the smoother matrix. For the GCV of the model, the corresponding criterion is based on:

$$GCV = \frac{\frac{1}{n}D(y; \hat{\mu})}{[1 - tr(R)/n]} \tag{11}$$

*R* is the weighted additive-fit operator of the last iteration in the estimation of the model.
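For any linear smoother, criterion (10) is direct to compute. Below, a hypothetical ridge-type smoother $S_\lambda = X(X'X + \lambda I)^{-1}X'$ on a polynomial basis illustrates scanning $\lambda$ by GCV; the basis, data and grid are our own choices, not from the chapter:

```python
import numpy as np

def gcv_score(x, y, lam, degree=7):
    """GCV(lambda) of formula (10) for the linear smoother S = X (X'X + lam*I)^-1 X'."""
    X = np.vander((x - x.mean()) / x.std(), degree + 1)   # standardized polynomial basis
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T)
    resid = y - S @ y
    n = y.size
    return np.mean((resid / (1.0 - np.trace(S) / n)) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 150)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

lambdas = 10.0 ** np.arange(-6, 4.0)          # grid of candidate penalties
scores = [gcv_score(x, y, lam) for lam in lambdas]
best = lambdas[int(np.argmin(scores))]
```

Here $tr(S_\lambda)$ plays the role of the effective dimension: it shrinks toward the number of unpenalized directions as $\lambda$ grows, and GCV trades that against the residual sum of squares.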

#### **5. Confounding variables, concurvity, and interaction**

#### **5.1 Confounding variables**

Confounding is potentially present in all observational studies. A confounding factor in the field of environmental health refers to a situation in which an association between an exposure (e.g., air pollution) and a health outcome (e.g., morbidity or mortality) is distorted because it is mixed with the effect of a third variable, the confounding variable (e.g., humidity). The confounding variable is related to both the exposure and the outcome. The distortion introduced by a confounder can lead to an overestimation (positive confounding, affecting the outcomes in the same direction as the exposure under study) or an underestimation (negative confounding, affecting the outcomes in the opposite direction of the exposure under study) of the association between exposure and outcome. Confounding variables can be controlled for by using one or more of a variety of techniques that eliminate the differential influence of the confounder. For example, if one group is mostly female and the other group is mostly male, then gender may have a differential effect on the outcome. As a result, we will not know whether the outcome is due to the treatment or to the effect of gender. If the comparison groups are the same on all extraneous variables at the start of the experiment, then differential influence is unlikely to occur. The control techniques are essentially attempts to make the groups similar or equivalent. Confounding variables are to be differentiated from intermediate or latent variables that are part of the causal pathway between the exposure and the outcome (Budtz-Jorgensen et al., 2007).

Peng et al. (2006) identified two types of confounding variables: those that are measured and already included in the model, and those that are not. As an adjustment for the latter, they propose including a non-linear function of current and future data in the model. In the study of the relation between air pollution and mortality, the non-measured confounders are the factors that influence mortality in the same way as the air pollution variables (Peng et al., 2006). These factors produce seasonal effects and long-term trends in mortality which distort the relation between mortality and air pollution (e.g., influenza epidemics and pulmonary infections). In these situations, the inclusion of the variable "time" helps to reduce the bias caused by these factors.

Other procedures for managing confounding effects are the sampling methods: specification and matching. Specification is the scheme that fixes the value of the potential confounder and excludes other values (e.g., including only non-smokers in the study). This method of sampling allows focusing solely on the subjects of interest but does not enable the generalization of results. Matching consists of grouping subjects with similar values of the confounding variable. It has the advantage of eliminating the influence of confounders with important effects and of improving the precision (strength) by balancing the number of cases and controls in each stratum.

#### **5.2 Concurvity**

92 Novel Approaches and Their Applications in Risk Assessment

The generalized cross-validation for the smoothing penalty is obtained by the following

1 <sup>ˆ</sup> <sup>1</sup> ( ) ( ) 1 ( )/ *<sup>n</sup> i i*

Where *S* is the smoother. For the GCV of the model, the corresponding criterion is based on :

*n tr S n* 

> <sup>1</sup> (; )ˆ [1 ( )/ ]

*tr R n* 

*i y fx GCV*

*<sup>n</sup> D y GCV*

*R* is the weighted additive-fit operator of the last iteration in the estimation of the model.

Confounding is potentially present in all observational studies. A confounding factor in the field of environmental health refers to a situation in which an association between an exposure (i.e. air pollution) and a health outcome (i.e. morbidity or mortality) is distorted because it is mixed with the effect of a third variable – the confounding variable (i.e. humidity). The confounding variable is related to both the exposure and the outcome. The distortion introduced by a confounder can lead to an overestimation (positive confounding, affecting the outcomes in the same direction as the exposure under study) or underestimation (negative confounding, affecting the outcomes in the opposite direction of the exposure under study) of the association between exposure and outcome. Confounding variables can be controlled for by using of one or more of a variety of techniques that eliminate the differential influence of the confounder. For example, if one group is mostly females and the other group is mostly males, then the gender may have a differentially effect on the outcome. As a result, we will not know whether the outcome is due to the treatment or due to the effect of gender. If the comparison groups are the same on all extraneous variables at the start of the experiment, then differential influence is unlikely to occur. The control techniques are essentially attempts to make the groups similar or equivalent. Confounding variables are to be differentiated from intermediating or latent variables that are part of the causal pathway between the exposure and the outcome (Budtz-

Peng et al. (2006) identified two types of confounding variables: those that are measured and are already included in the model, and those that are not. They propose as an adjustment to this problem the inclusion of a non-linear function of actual and future data in the model. In the study of the relation between air pollution and mortality, the nonmeasured confounders are the factors that influence the mortality in the same way as the air pollution variables (Peng et al., 2006). These factors produce seasonal effects and long-term tendencies on the mortality which deforms the relation between the mortality and the air pollution (i.e. Influenza epidemics and pulmonary infections). In these situations, the

inclusion of the variable "time" helps to reduce the bias caused by these factors.

**5. Confounding variables, concurvity, and interaction** 

2

(10)

(11)

formula:

**5.1 Confounding variables** 

Jorgensen et al., 2007).

The non-linear dependence that remains between the covariates is referred to as concurvity in GAM models, by analogy with collinearity in GLM models. Researchers (Ramsay et al., 2003) insist that a certain degree of concurvity exists in every epidemiological time series, especially when time is included in the model as a confounding variable. The main problem caused by concurvity in a GAM model is the presence of bias in the model, more specifically the overestimation of the parameters and the underestimation of the standard errors. The use of the asymptotically unbiased estimator of standard errors introduced by Hastie and Tibshirani (1990) and demonstrated by Dominici et al. (2003) does not solve the bias problem. The consequence is an inflation of type I errors in the significance tests, resulting in conclusions of the presence of a significant effect (Ramsay et al., 2003).

Several approaches have been proposed to control the problem of concurvity in time series. One method of variance estimation, based on the parametric bootstrap, also produced biased results in the simulations of Ramsay et al. (2003). These authors recommend instead the use of parametric models such as the GLM model with natural splines (Dominici et al., 2002). He (2004) suggests using a non-parametric GAM model to explore the data in a first level of analysis and, once the appropriate variables are retained, pursuing the analysis with a parametric GLM model with natural splines, all while keeping the same degree of smoothing.

Figueras et al. (2005) developed the conditional bootstrap method in order to control the effect of concurvity. In this type of bootstrap, B bootstrap replicates are generated. In each of these, the values of the independent variables are the same as those of the observed data; only the values of the response variable vary from replicate to replicate. The value assumed by the outcome in each observation is conditional (hence the technique's name) upon the values of the set of independent variables in that observation. The conditional bootstrap approach has been tested on simulated data and leads to good results.
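The conditional bootstrap can be sketched as follows: keep the design fixed, draw each replicate of the response from the fitted conditional distribution (here Poisson with mean $\hat{\mu}_i$), and refit. The simple log-linear Poisson fit below stands in for the GAM, and all function names and settings are hypothetical:

```python
import numpy as np

def fit_poisson_loglinear(X, y, n_iter=50):
    """IRLS for a Poisson GLM with log link: eta = X beta."""
    eta = np.log(y + 1.0)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(eta)
        z = eta + (y - mu) / mu
        beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
        eta = X @ beta
    return beta

rng = np.random.default_rng(11)
n = 400
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.5 + 0.5 * x)).astype(float)

beta_hat = fit_poisson_loglinear(X, y)
mu_hat = np.exp(X @ beta_hat)

# conditional bootstrap: X is held fixed, only y* varies across replicates
B = 200
boot = np.array([fit_poisson_loglinear(X, rng.poisson(mu_hat).astype(float))
                 for _ in range(B)])
ci_low, ci_high = np.percentile(boot[:, 1], [2.5, 97.5])
```

The percentile interval on the slope then reflects only the sampling variability of the response given the observed covariate values, which is the point of conditioning on the design.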

#### **5.3 Interactions in the GAM model**

The interaction within a statistical model denotes the effect of two or more variables which is not simply additive. In other words, the effect is due to the combination of two or more variables in the model. A consequence of the interaction between two variables is that the effect of a variable depends on the value observed for the other one. A form of interaction often found in the literature is effect modification. Effect modification happens when the statistical measure of the association between the explicative variable *X*1 and the response variable *Y* depends on the level of another variable *X*2, known as the effect modifier. Letting the strength of the relationship depend on the value of the effect modifier contributes to improving the model fit. In the field of environmental health, this allows us to identify the groups most vulnerable to a particular condition (Wood, 2006; Bates and Maechler, 2009).

#### **6. Conclusion**

This chapter presented the potential of the Generalized Additive Model (GAM) for environmental studies. Generalized additive models (GAMs) are a generalization of generalized linear models (GLMs) and constitute a powerful technique for capturing non-linear relationships between explanatory variables and a response variable. Selection of the best parameter estimation method and control for confounding variables and concurvity aim to reduce bias and improve the use of the GAM model. Moreover, when using the GAM model in environmental health, and for an adequate interpretation of the outputs, socio-economic and demographic parameters should be considered.

#### **7. Acknowledgement**

The authors would like to thank the CIHR Team on Familial Breast Cancer at Université Laval (QC), led by Dr Jacques Simard, and the Consortium national de formation en santé, Université de Moncton (NB), for the financial support they provided to prepare and publish this chapter.

#### **8. References**

Akaike H. (1973). Information theory as an extension of the maximum likelihood principle. Second International Symposium on Information Theory (B. N. Petrov and F. Csaki), pp. 267-281, Akademiai Kiado, Budapest.

Balakrishnan K., Ganguli B., Ghosh S., Sankar S., Thanasekaraan V., Rayudu V.N., Caussy H.; HEI Health Review Committee (2011). Short-term effects of air pollution on mortality: results from a time-series analysis in Chennai, India. *Res Rep Health Eff Inst.* Mar;(157):7-44.

Ballester F., Rodríguez P., Iñíguez C., Saez M., Daponte A., Galán I., Taracido M., Arribas F., Bellido J., Cirarda F.B., Cañada A., Guillén J.J., Guillén-Grima F., López E., Pérez-Hoyos S., Lertxundi A. and Toro S. (2006). Air pollution and cardiovascular admissions association in Spain: results within the EMECAS Project. Epidemiol. Community Health 60: 328-336.

Bates D. and M. Maechler (2009). lme4: Linear mixed-effects models using S4 classes. URL http://CRAN.R-project.org/package=lme4. R package version 0.999375-31.

Bayentin L., S. El Adlouni, T.B.M.J. Ouarda, P. Gosselin, B. Doyon and F. Chebana (2010). Spatial variability of climate effects on ischemic heart disease hospitalization rates for the period 1989-2006 in Quebec, Canada. International Journal of Health Geographics, 9:5. doi:10.1186/1476-072X-9-5.

Berger P., L. Pascal, C. Sartor, J. Delorme, P. Monge, C. P. Ragon, M. Charbit, R. Sambuc and M. Drancourt (2004). Generalized Additive Model demonstrates fluoroquinolone use/resistance relationships for Staphylococcus aureus. European Journal of Epidemiology, 19, pp. 453-460.

Binder H. and G. Tutz (2006). Fitting Generalized Additive Models: A comparison of Methods. Universität Freiburg i. Br., Nr. 93.

Braga A.L., Zanobetti A. and Schwartz J. (2001). The time course of weather related deaths. Epidemiology, 12, 662-667.
