**On Combining Family Data from Different Study Designs for Estimating Disease Risk Associated with Mutated Genes**

Yun-Hee Choi, *University of Western Ontario, Canada*

### **1. Introduction**

Genetic disorders caused primarily by abnormalities in genes or chromosomes are rare in the general population. The putative mutations that lead to a high risk of developing such diseases are even rarer. To study disease risks associated with mutated genes, association studies therefore commonly use families sampled under different study designs. Families recruited via affected individuals (probands) are expected to contain more affected individuals and mutation carriers than families randomly sampled from the general population, which increases statistical efficiency in estimating the disease risk. The disease risk associated with a mutated gene can be measured on a relative or an absolute scale. Because the event we consider is disease with its age at onset, the relative risk can be measured as the ratio of the hazards of developing disease in mutation carriers and in non-carriers, and the absolute risk as the cumulative risk of developing disease by a given age, which is also termed penetrance.

Several family-based study designs have been used for estimating the disease risk associated with a gene mutation when onset varies with age. Gong & Whittemore (2003) discussed two basic types of family-based sampling schemes: population-based and clinic-based designs. For population-based designs, families are ascertained for study inclusion based on affected family members who are randomly sampled from the disease population. The proband is usually genotyped to determine if s/he carries the disease risk gene, and additional genotype and phenotype data can then be collected from other family members. The kin-cohort design described by Wacholder et al. (1998) is an example of a population-based design: families are sampled through a volunteer (either affected or unaffected) who agrees to be genotyped and provides the disease history of her or his first-degree relatives through a questionnaire. A kin-cohort design need not be restricted to first-degree relatives or to genotyping only probands; it can easily be extended to case-family studies that include more extended family members and their genotype information. Case-control family studies have been widely used to analyse the ages of onset of disease in relation to genetic risk (Li et al., 1998; Shih & Chatterjee, 2000; Hsu & Gorfine, 2006), where case families are recruited via population-based cases and their matching control families are randomly sampled from the population.

For clinic-based designs, on the other hand, families are ascertained into the study based on having multiple affected family members in addition to the affected probands. Pedigrees with many cases are highly informative because they are more likely to carry the disease gene mutation, but typically have not been ascertained in any population-based manner. Such families are often identified from high-risk disease clinics and provide substantial information to estimate the disease risk (for example, Kopciuk et al., 2009). Multistage designs (Whittemore & Halpern, 1997; Siegmund et al., 1999) provide an alternative way to efficiently recruit high-risk families, often using disease family registries, where families are sampled from more informative groups via several stages. Studies based on these high-risk families can be effective for characterizing the prevalence and penetrance of mutated genes, but it is well known that, without proper ascertainment correction, statistical inference leads to biased estimates of population attributes such as allele frequency, disease risks, and penetrance of the mutated genes.

To allow population-based inference for estimating disease risks associated with mutated genes, family data can be analyzed using various likelihood-based methods (Thomas, 2004). In particular, ascertainment-corrected likelihood approaches have been developed by several authors (for example, Choi et al., 2008; Carayol & Bonaïti-Pellié, 2004; Kraft & Thomas, 2000; Le Bihan et al., 1995). Based on the survival approach, Le Bihan et al. (1995) formulated a prospective likelihood for modeling phenotypes as the age of onset and disease status given genotypes, and corrected the likelihood by the probability of families being ascertained for study. This approach is natural as it models phenotypes as a function of genotype and covariates, but the ascertainment scheme has to be clearly known and simple enough to make proper correction. On the other hand, the retrospective likelihood models genotypes conditioning on the phenotypes of all family members (Carayol & Bonaïti-Pellié, 2004; Kraft & Thomas, 2000; Schaid et al., 2010). Although this approach provides the most robust way to obtain consistent estimates of relative risk even when the ascertainment scheme is imprecisely defined or complex, it carries the computational burden of summing over the possible genotypes of all family members and loses efficiency as a result of the conditioning. Choi et al. (2008) adapted the retrospective likelihood by conditioning only on the phenotypes of individuals who were involved in the ascertainment criteria; for families sampled from the population-based designs, only probands were used to correct for the ascertainment, whereas for families from the clinic-based designs, the probands and their parents and sibs were used for ascertainment correction. Moreover, Schaid et al. (2010) adopted a composite likelihood approach, building the retrospective likelihood from all possible pairs of individuals in families to reduce the computational burden.

The main objectives of this article are first, to examine the effects of misspecification of study designs when more appropriate study designs have been ignored or incorrectly specified in the analysis; second, to provide simple and easy to apply adjustment schemes for estimating disease risks by combining family data from different study designs; and third, to develop an Expectation-Maximization algorithm to infer missing genotypes in the estimation of disease risks. We start by describing ascertainment-corrected likelihood methods that take the study design into account and propose a likelihood-based approach to estimating the disease risks for combined family data collected under different study designs. The performance of these ascertainment-corrected likelihood methods is evaluated in terms of bias and efficiency. The effect of design misspecification is examined for estimating the disease risks associated with mutated genes. The bias and efficiency involved in estimating the two disease risks are compared when only probands for the families from the clinic-based study are adjusted for, and when the probands and other affected family members for the families from the population-based design are used for ascertainment correction. For the combined family data, the two design correction methods (population-based and clinic-based) are each applied and compared with our proposed combined likelihood method in terms of their accuracies and efficiencies for disease risk estimation.

This chapter includes the following sections. Section 2 introduces two family-based study designs, the population-based and clinic-based designs, and their ascertainment-corrected likelihood methods for modeling ages at onset for family members in disease risk estimation. We propose a likelihood-based approach for the combined family data obtained from different study designs. In Section 3, an Expectation-Maximization (EM) algorithm is incorporated to account for the missing genotype information, where the missing genetic covariates are inferred from their conditional expectation given the observed genotypes and phenotypes of other family members. Moreover, a robust variance estimator is proposed to account for the dependence of individuals within families. Using simulation studies in Section 4, we examine the effects of study design misspecification for estimating the disease risks and investigate the properties of our proposed likelihood approach for combined families from different study designs. In Section 5, we illustrate our proposed approaches through an application to family data obtained from the combination of two studies of Lynch Syndrome: Newfoundland data from the clinic-based design and Ontario data from the population-based design. Final remarks and possible extensions of this work follow in Section 6.

### **2. Methods**

**2.1 Defining disease risks**

For diseases caused by mutated genes, the phenotype of interest varies in age at onset, i.e., time to an event such as death or disease diagnosis. We denote the age at onset by $T$ and the affection status at age of examination by $\delta$. Then, the phenotype is given by $D = (T, \delta)$. Under Cox's proportional hazards model, the hazard function for individual $i$ conditional on mutation gene $G$ and other risk factors $X$ is assumed to take the form

$$h(t_i \mid g_i, x_i) = h_0(t_i) \exp(\beta_g g_i + \beta_x x_i),$$

where $h_0(t)$ is a baseline hazard function and $\beta_g$ and $\beta_x$ are unknown regression parameters. Based on this model, we consider two types of disease risk associated with a mutated gene: relative risk and absolute risk, the latter of which is also called penetrance.

(1) The relative risk in survival analysis is defined by the hazards ratio for an individual with a mutated gene compared to an individual free from the mutation, that is,

$$\text{Relative Risk} = \exp(\beta_g).$$

(2) The penetrance function for the disease susceptibility gene is defined as the age-specific cumulative risk function conditional on the disease susceptibility gene $G$ and other relevant covariates $X$,

$$\text{Penetrance} = P(T < t \mid G, X).$$
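As a minimal sketch of the two risk measures defined in Section 2.1, the following Python snippet evaluates the relative risk $\exp(\beta_g)$ and the penetrance $P(T < t \mid G, X)$ under a proportional hazards model. Under that model the penetrance can be written as $1 - \exp\{-H_0(t)\exp(\beta_g g + \beta_x x)\}$, where $H_0(t)$ is the cumulative baseline hazard. The Weibull form of $H_0(t)$, the function names, and all parameter values are assumptions chosen purely for illustration; they are not taken from the chapter.

```python
import math


def cumulative_baseline_hazard(t, lam, rho):
    """Assumed Weibull cumulative baseline hazard H0(t) = (lam * t) ** rho."""
    return (lam * t) ** rho


def relative_risk(beta_g):
    """Hazard ratio of a mutation carrier (g = 1) versus a non-carrier (g = 0)."""
    return math.exp(beta_g)


def penetrance(t, g, x, beta_g, beta_x, lam, rho):
    """Cumulative risk of disease by age t given genotype g and covariate x.

    Under proportional hazards, P(T < t | g, x) = 1 - exp(-H0(t) * exp(lp)),
    with linear predictor lp = beta_g * g + beta_x * x.
    """
    linear_predictor = beta_g * g + beta_x * x
    h0 = cumulative_baseline_hazard(t, lam, rho)
    return 1.0 - math.exp(-h0 * math.exp(linear_predictor))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    params = dict(beta_g=2.0, beta_x=0.5, lam=0.01, rho=3.0)
    print("relative risk:", relative_risk(params["beta_g"]))
    for age in (30, 50, 70):
        carrier = penetrance(age, g=1, x=0, **params)
        non_carrier = penetrance(age, g=0, x=0, **params)
        print(f"age {age}: carrier {carrier:.3f}, non-carrier {non_carrier:.3f}")
```

The relative risk is constant in age, whereas the penetrance depends on the baseline hazard, which is why the two measures are reported separately.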
