**4.3 Simulation results**

Results of the simulation studies are described based on the empirical summary measures of bias and standard error obtained from the maximum likelihood estimates.

**4.3.1 The effect of design misspecification**

We first assessed bias and precision in disease risk estimation (relative risk and penetrance) for the retrospective likelihood with correct design adjustment for family data from different study designs. The results are summarized in Table 2.

With the correct design adjustment, the estimates of both the log relative risk and penetrance appeared unbiased; the absolute values of bias were less than 0.05 under both high and low penetrance models regardless of the study design, and the magnitude of the bias was much smaller than the standard errors. In the log relative risk estimation, the clinic-based designs were more precise (smaller standard errors) than the population-based designs. The population-based designs provided more accurate and precise estimates of the log relative risk for high penetrance than for low penetrance, whereas the clinic-based designs performed better for low penetrance. In the penetrance estimation, however, all designs provided more precise estimates (smaller standard errors) for high penetrance than for low penetrance.

We then examined the effect of design misspecification on the bias and precision of the log relative risk and penetrance estimates obtained from the retrospective likelihoods. The clinic-based ascertainment correction was applied to the family data under the population-based designs, and the population-based ascertainment correction to the family data under the clinic-based designs. The clinic-based design with the population-based correction produced relatively large bias in both disease risk estimates, whereas the bias in the population-based design with the clinic-based ascertainment correction was not notably large. In particular, under the POP+ design (with affected and mutation carrier probands), the clinic-based retrospective likelihood yielded estimates at least as accurate as those from the probands-only adjustment (the correct design), although their standard errors were larger under the misspecified design.

Table 2. Effects of the design misspecification: bias and precision in disease risk estimation based on retrospective likelihoods with correct and incorrect design adjustments; standard errors are in parentheses.

**4.3.2 The likelihood methods for combined family data from different study designs**

We evaluated the accuracy and precision of the disease risk (log relative risk and penetrance) estimates based on the three retrospective likelihoods for combined data. Simulation results based on the combined data from CLI+ and POP+ designs are summarized in Table 3, and those from combining CLI and POP families in Table 4.

### *Combined data from POP+ and CLI+ designs*

In the log relative risk estimation, as expected, the population-based likelihood for the combined data yielded overestimates because the ascertainment correction was based only on probands, which is not sufficient for the families from clinic-based designs. In contrast, the clinic-based retrospective likelihood provided slightly negative but less biased estimates of the log relative risk, with slightly larger standard errors. Although the population-based likelihood provided the smallest standard errors, it was subject to positive bias. Moreover, the log relative risk estimates were better (less bias and higher precision) for low penetrance than for high penetrance. Our proposed likelihood was almost as efficient as the population-based likelihood and as accurate as the clinic-based likelihood, regardless of the mixing rates we considered. In particular, the combined likelihood appeared to perform better for relative risk estimation when more CLI+ families were included in the sample.

In the penetrance estimation, we observed patterns similar to those in the log relative risk estimation. The population-based likelihood produced substantial bias with small standard errors, whereas the clinic-based likelihood yielded less bias with large standard errors. Our proposed likelihood method, however, offered the least bias and improved precision compared to the clinic-based likelihood. In addition, the penetrance was more precisely estimated with the combined likelihood when fewer CLI+ families were recruited (20% CLI+ families).

### *Combined data from POP and CLI designs*

The patterns of bias and precision of the three likelihood methods were clearer with the combined data from POP and CLI designs, as shown in Table 4. In the log relative risk estimation, our proposed likelihood yielded both the most accurate and the most precise estimates, and its estimates were more precise when 50% CLI families were included. Similarly, in penetrance estimation, the population-based likelihood provided heavily biased estimates, whereas the combined likelihood performed well in terms of both bias and precision. With fewer CLI families (20%) in the data, more precise penetrance estimates were obtained.
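The bias and standard error entries discussed above are empirical summaries of the maximum likelihood estimates over simulated data sets. The following is a minimal sketch of how such summaries could be computed. It assumes the empirical standard error is the sample standard deviation of the estimates across replicates; the function name and the replicate values are hypothetical and are not taken from the simulation study.

```python
import statistics

def summarize_estimates(estimates, true_value):
    """Return (empirical bias, empirical standard error) of replicate estimates.

    Assumption: the standard error is taken as the sample standard deviation
    of the estimates across simulation replicates.
    """
    bias = statistics.mean(estimates) - true_value
    std_error = statistics.stdev(estimates)
    return bias, std_error

# Hypothetical log relative risk estimates from five simulated data sets,
# generated under a true value of beta = 2.4 (the high penetrance setting).
replicate_estimates = [2.30, 2.55, 2.45, 2.40, 2.35]
bias, std_error = summarize_estimates(replicate_estimates, true_value=2.4)
print(f"bias = {bias:.3f}, se = {std_error:.3f}")
```

Reporting the two quantities side by side, as in the tables, makes the trade-off between accuracy and precision visible for each likelihood.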

**Log relative risk (*β*) estimation**

| POP+ vs. CLI+ | High penetrance (*β* = 2.4), 50-50 | High, 70-30 | High, 80-20 | Low penetrance (*β* = 1.8), 50-50 | Low, 70-30 | Low, 80-20 |
|---|---|---|---|---|---|---|
| POP+ corrected likelihood | 0.279 (0.132) | 0.196 (0.142) | 0.145 (0.149) | 0.326 (0.123) | 0.240 (0.139) | 0.191 (0.145) |
| CLI+ corrected likelihood | -0.024 (0.140) | -0.024 (0.154) | -0.026 (0.163) | -0.004 (0.124) | -0.010 (0.141) | -0.002 (0.150) |
| Combined likelihood | -0.025 (0.134) | -0.026 (0.143) | -0.028 (0.149) | -0.005 (0.123) | -0.011 (0.140) | -0.005 (0.147) |

**Penetrance estimation**

| POP+ vs. CLI+ | High penetrance (70%), 50-50 | High, 70-30 | High, 80-20 | Low penetrance (48%), 50-50 | Low, 70-30 | Low, 80-20 |
|---|---|---|---|---|---|---|
| POP+ corrected likelihood | 0.209 (0.009) | 0.151 (0.015) | 0.113 (0.017) | 0.348 (0.012) | 0.247 (0.017) | 0.182 (0.020) |
| CLI+ corrected likelihood | 0.019 (0.060) | 0.016 (0.067) | 0.015 (0.067) | 0.032 (0.079) | 0.021 (0.083) | 0.024 (0.085) |
| Combined likelihood | -0.008 (0.031) | -0.011 (0.029) | -0.012 (0.027) | 0.002 (0.033) | -0.002 (0.030) | -0.002 (0.028) |

Table 3. Bias and precision in disease risk estimation based on three retrospective likelihood approaches for combined data from different family based designs (POP+ and CLI+) with affected and mutation carrier probands; standard errors are in parentheses.

**Log relative risk (*β*) estimation**

| POP vs. CLI | High penetrance (*β* = 2.4), 50-50 | High, 70-30 | High, 80-20 | Low penetrance (*β* = 1.8), 50-50 | Low, 70-30 | Low, 80-20 |
|---|---|---|---|---|---|---|
| POP corrected likelihood | 0.911 (0.089) | 0.644 (0.079) | 0.485 (0.078) | 0.700 (0.094) | 0.609 (0.089) | 0.506 (0.089) |
| CLI corrected likelihood | 0.044 (0.079) | 0.035 (0.086) | 0.038 (0.094) | 0.020 (0.063) | 0.024 (0.077) | 0.029 (0.087) |
| Combined likelihood | 0.014 (0.072) | -0.009 (0.076) | -0.017 (0.080) | 0.003 (0.058) | 0.000 (0.068) | -0.002 (0.076) |

**Penetrance estimation**

| POP vs. CLI | High penetrance (70%), 50-50 | High, 70-30 | High, 80-20 | Low penetrance (48%), 50-50 | Low, 70-30 | Low, 80-20 |
|---|---|---|---|---|---|---|
| POP corrected likelihood | 0.269 (0.005) | 0.237 (0.010) | 0.203 (0.013) | 0.467 (0.007) | 0.413 (0.012) | 0.353 (0.018) |
| CLI corrected likelihood | 0.043 (0.055) | 0.041 (0.055) | 0.042 (0.056) | 0.059 (0.081) | 0.060 (0.082) | 0.062 (0.082) |
| Combined likelihood | 0.006 (0.042) | -0.002 (0.038) | -0.005 (0.036) | 0.015 (0.047) | 0.008 (0.043) | 0.006 (0.042) |

Table 4. Bias and precision in disease risk estimation based on three retrospective likelihood approaches for combined data from different family based designs (POP and CLI) with random affected probands; standard errors are in parentheses.

**5. Application to Lynch Syndrome families**

Lynch Syndrome, also referred to as hereditary non-polyposis colorectal cancer, is an autosomal dominant condition that predisposes carriers to colorectal cancer (CRC). Several DNA mismatch repair (MMR) genes responsible for the majority of Lynch Syndrome cancers have been identified, predominantly MLH1 and MSH2. For the study of CRC, Lynch Syndrome families that share a founder mutation in an MMR gene were sampled from Newfoundland and Ontario. The Newfoundland data consist of 315 phenotyped individuals (74 affected and 241 not affected) from 12 very large families identified using high-risk criteria. Of them, 261 were genotyped (162 carriers, 99 non-carriers) and 54 were not genotyped. Each family had a carrier proband and other affected relatives, which corresponds to the study design CLI+. The Ontario data were identified through the Ontario Familial Colorectal Cancer Registry (Cotterchio et al., 2000) and consist of 506 phenotyped individuals (126 affected and 380 not affected) from 32 families with MMR mutation carrier probands, which corresponds to the study design POP+. Of them, 154 individuals were genotyped (92 carriers, 62 non-carriers) and 352 were not genotyped.

The three likelihood methods (POP+ corrected, CLI+ corrected and combined likelihoods) were applied to the combined Lynch Syndrome families identified from Newfoundland (CLI+) and Ontario (POP+). A Weibull model was used to assess the effects of MMR gene mutation status and gender on the age at onset of colorectal cancer. The EM algorithm was implemented to infer missing genotypes. The results of fitting these Lynch Syndrome families using the different likelihood methods are presented in Table 5, and the age-specific penetrance estimates based on the combined likelihood are graphically illustrated in Figure 1.
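To illustrate how a Weibull model turns covariate effects into age-specific penetrance, the sketch below assumes a proportional hazards parametrization with cumulative hazard $(\lambda t)^{\rho} \exp(\beta_g g + \beta_s s)$, where $g$ indicates carrier status and $s$ indicates female gender. This parametrization, the function name, and the baseline values of $\lambda$ and $\rho$ are assumptions made for illustration only; they are not the fitted values from this analysis, so the printed penetrances are not expected to reproduce the reported estimates.

```python
import math

def weibull_penetrance(age, lam, rho, beta_gene, beta_sex, carrier, female):
    """Cumulative risk by `age` under an assumed Weibull proportional hazards model."""
    linear_predictor = beta_gene * carrier + beta_sex * female
    cumulative_hazard = (lam * age) ** rho * math.exp(linear_predictor)
    return 1.0 - math.exp(-cumulative_hazard)

# Hypothetical baseline scale and shape, NOT estimates from the chapter.
LAM, RHO = 0.01, 2.5
# Regression coefficients as reported for the combined likelihood.
BETA_GENE, BETA_SEX = 1.13, -0.51

male_carrier = weibull_penetrance(70, LAM, RHO, BETA_GENE, BETA_SEX, carrier=1, female=0)
female_carrier = weibull_penetrance(70, LAM, RHO, BETA_GENE, BETA_SEX, carrier=1, female=1)
male_noncarrier = weibull_penetrance(70, LAM, RHO, BETA_GENE, BETA_SEX, carrier=0, female=0)

print(f"male carrier:     {male_carrier:.2f}")
print(f"female carrier:   {female_carrier:.2f}")
print(f"male non-carrier: {male_noncarrier:.2f}")
```

Under this form, a positive coefficient for carrier status raises the cumulative risk at every age, and a negative coefficient for female gender lowers it, which matches the direction of the effects reported here.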

In the analysis based on the combined likelihood, the *β* parameters for the genetic and gender effects were estimated to be 1.13 (robust standard error, se = 0.18) and -0.51 (se = 0.17), respectively. These correspond to a hazard ratio for colorectal cancer of 3.10 (se = 0.55) for MMR mutation carriers relative to non-carriers and a hazard ratio of 0.60 (se = 0.11) for females relative to males.
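The hazard ratios follow from exponentiating the estimated coefficients. The short sketch below reproduces that step and adds a first-order delta-method standard error, $\mathrm{se}(e^{\hat\beta}) \approx e^{\hat\beta}\,\mathrm{se}(\hat\beta)$. The delta method is an assumption here: the text does not state how the standard errors of the hazard ratios were obtained, so the approximate values need not match the reported robust standard errors exactly.

```python
import math

def hazard_ratio(beta, se_beta):
    """Return (hazard ratio, delta-method standard error) for a log hazard ratio."""
    hr = math.exp(beta)
    return hr, hr * se_beta  # first-order approximation of the standard error

hr_gene, se_gene = hazard_ratio(1.13, 0.18)    # carriers vs. non-carriers
hr_sex, se_sex = hazard_ratio(-0.51, 0.17)     # females vs. males

print(f"carrier HR = {hr_gene:.2f} (approx. se {se_gene:.2f})")
print(f"female HR  = {hr_sex:.2f} (approx. se {se_sex:.2f})")
```

The point estimates round to the reported 3.10 and 0.60.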

These relative risks indicate that MMR mutation carriers had approximately three times the hazard of developing colorectal cancer compared with non-carriers, whereas females had a hazard rate about 40% lower than that of males (hazard ratio 0.60). Very little difference was observed between the relative risk estimates obtained by the CLI+ corrected likelihood and the combined likelihood, although precision was slightly better with the combined likelihood.

Using the combined likelihood, the estimated penetrance of colorectal cancer by age 70 was 61% (se=4.15) among male carriers and 43% (se=4.1) among female carriers. These estimates were comparable with those obtained using the POP+ and CLI+ corrected retrospective likelihoods. Penetrances were overestimated (62% for male and 48% for female carriers) with higher precision (se=4.08 for male, 4.06 for female) under POP+ correction but slightly underestimated (59% for male carriers and 41% for female carriers) with lower precision (se=4.27 for male and 4.18 for female) under CLI+ correction, as seen in our simulation study.

**Log relative risk estimation in terms of hazards ratio**

Combined likelihood 1.14 (0.18) -0.51 (0.17)

**Age-specific penetrance estimation among mutation carriers**

Table 5. Disease risk estimates and their corresponding robust standard errors in parentheses using different likelihood methods for the Lynch Syndrome families from Newfoundland and Ontario.

Fig. 1. (a) Estimated cumulative risk of developing colorectal cancer for carriers of any MMR gene mutation for the Lynch Syndrome families from Newfoundland and Ontario. (b) Same as (a) for non-carriers.

**6. Conclusion**

In genetic epidemiology, family studies have been widely used for identifying genes responsible for traits and characterizing their risks in the population. They rely on various family-based designs to sample families, depending on the objectives of the study or its budget. To make population-based inferences, the study design should be properly taken into account, especially when the sampling is not random, as is often the case with the sampling of families.

In this study, for estimating disease risks (relative risk and penetrance), we have proposed the use of a retrospective likelihood to take the sampling process of families into account, and investigated the effect of sampling design misspecification on disease risk estimation. Our study showed that misspecification of the study design led to bias: risks were overestimated when the study design adjustment was less than it should be (i.e., the clinic-based designs were analyzed with the correction by probands only), and underestimated with overcorrection by multiple affected family members. However, the magnitudes of bias and precision varied depending on the study design and the size of the penetrance. We found that undercorrection created more bias, although it provided smaller standard errors. This implies that, if the study design is not known, conditioning on more individuals would be safer for obtaining accurate estimates, at the price of a loss of precision. The POP+ design with clinic-based correction in fact provided unbiased estimates of relative risk and penetrance. In general, the population-based designs performed better under high penetrance for estimating both disease risks, but the clinic-based designs performed differently: penetrance was more efficiently estimated under high penetrance, whereas relative risk was more efficiently estimated under low penetrance.

In addition, we have proposed the combined likelihood for families sampled under different study designs, and the effect of design misspecification was also investigated for combined data. Our proposed likelihood is applicable even when the study designs of the combined data are not clearly known, since we can divide families into two categories: high-risk families with at least three affected individuals, and low-risk families otherwise. Our proposed combined retrospective likelihood method yielded accurate and precise estimates of both disease risks. By comparison, the clinic-based likelihood applied to the combined data provided unbiased estimates, but less efficiently than the combined likelihood. It is noteworthy that the EM algorithm we developed for inferring missing genotypes is a novel way to impute the missing genotypes using the observed genotypic and phenotypic information from other family members.
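The rule stated in the conclusion for combined data of unknown design, treating families with at least three affected individuals as high risk and all others as low risk, can be sketched as follows. The `Family` container, the function name, and the example pedigrees are hypothetical; only the threshold of three affected individuals comes from the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Family:
    family_id: str
    affected: List[bool]  # one flag per phenotyped family member

def risk_category(family: Family, min_affected: int = 3) -> str:
    """Label a family high risk if it has at least `min_affected` affected members."""
    n_affected = sum(family.affected)
    return "high_risk" if n_affected >= min_affected else "low_risk"

# Hypothetical families used only to exercise the rule.
families = [
    Family("F1", [True, True, True, False, False]),
    Family("F2", [True, False, False, False]),
]
categories = {f.family_id: risk_category(f) for f in families}
print(categories)
```

Presumably, each category would then receive the ascertainment correction appropriate to it within the combined likelihood (clinic-based for high-risk families, probands-only for low-risk families); that mapping is an interpretation of the text rather than a detail it spells out.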
