Fig. 1. (a) Estimated cumulative risk of developing colorectal cancer for carriers of any MMR gene mutation for the Lynch Syndrome families from Newfoundland and Ontario. (b) Same as (a) for non-carriers. Panels: (a) MMR mutation carriers, (b) MMR mutation non-carriers. Each panel plots penetrance (0.0 to 1.0) against age at onset (20 to 80), with separate curves for males and females.

Table 5. Disease risk estimates and their corresponding robust standard errors in parentheses using different likelihood methods for the Lynch Syndrome families from Newfoundland and Ontario

**Log relative risk estimation in terms of hazards ratio**

| Likelihood method | MMR mutation | Gender |
|---|---|---|
| POP+ corrected likelihood | 1.07 (0.17) | -0.42 (0.16) |
| CLI+ corrected likelihood | 1.15 (0.18) | -0.52 (0.18) |
| Combined likelihood | 1.14 (0.18) | -0.51 (0.17) |

**Age-specific penetrance estimation among mutation carriers**

| Likelihood method | Male | Female |
|---|---|---|
| POP+ corrected likelihood | 62.4% (4.08) | 47.5% (4.06) |
| CLI+ corrected likelihood | 58.9% (4.27) | 40.9% (4.18) |
| Combined likelihood | 60.7% (4.15) | 42.8% (4.10) |

These relative risks indicate that MMR mutation carriers were approximately three times more likely to develop colorectal cancer than non-carriers, and that females had a hazard rate about one third lower than that of males. There was very little difference observed between the relative risk estimates obtained by the CLI+ corrected

**6. Conclusion**

In genetic epidemiology, family studies are widely used to identify genes responsible for traits and to characterize the associated risks in the population. Families are sampled under a variety of family-based designs, chosen according to the objectives and budget of the study. To make population-based inferences, the study design must be properly taken into account, especially when sampling is not random, as is often the case when families are sampled.

In this study, we proposed the use of a retrospective likelihood that takes the sampling process of families into account when estimating two disease risks, relative risk and penetrance, and we investigated the effect of sampling design misspecification on their estimation. Our study showed that misspecification of the study design leads to bias: risks were overestimated when the design adjustment was less than it should be (i.e., clinic-based designs analyzed with correction for probands only), and underestimated when the data were overcorrected by conditioning on multiple affected family members. However, the magnitude of the bias and the precision varied with the study design and the size of the penetrance. We found that undercorrection created more bias, although it gave smaller standard errors. This implies that, when the study design is not known, conditioning on more individuals is the safer way to obtain accurate estimates, at the price of some loss of precision. Indeed, the POP+ design with clinic-based correction provided unbiased estimates of relative risk and penetrance. In general, the population-based designs performed better under high penetrance for both disease risks, whereas the clinic-based designs behaved differently: penetrance was estimated more efficiently under high penetrance, but relative risk was estimated more efficiently under low penetrance.

In addition, we proposed a combined likelihood for families sampled under different study designs, and investigated the effect of design misspecification for the combined data. The proposed likelihood is applicable even when the study designs of the combined data are not clearly known, since families can be divided into two categories: high risk families, with at least three affected individuals, and low risk families otherwise. The combined retrospective likelihood yielded accurate and precise estimates of both disease risks. By comparison, the clinic-based likelihood applied to the combined data provided unbiased estimates, but less efficiently than the combined likelihood. It is noteworthy that the EM algorithm we developed for inferring missing genotypes is a novel way to impute them using the observed genotypic and phenotypic information from other family members.
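The two-category rule for combined data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Family` record, the function names, and the choice to count every affected member (proband included) are hypothetical, and the mapping from category to correction type is our reading of the text rather than a documented part of the method.

```python
from dataclasses import dataclass

# Threshold stated in the text: at least three affected individuals.
HIGH_RISK_MIN_AFFECTED = 3

# Assumption (not confirmed by the source): high risk families receive the
# clinic-based (CLI+) correction, low risk families the population-based one.
ASSUMED_CORRECTION = {"high": "CLI+", "low": "POP+"}


@dataclass
class Family:
    """Hypothetical minimal family record used only for this sketch."""
    family_id: str
    n_affected: int  # assumed to count all affected members, proband included


def risk_category(family: Family) -> str:
    """Return 'high' for families with >= 3 affected members, else 'low'."""
    return "high" if family.n_affected >= HIGH_RISK_MIN_AFFECTED else "low"


def split_by_risk(families):
    """Group family identifiers by risk category, preserving input order."""
    groups = {"high": [], "low": []}
    for fam in families:
        groups[risk_category(fam)].append(fam.family_id)
    return groups


if __name__ == "__main__":
    demo = [Family("F1", 4), Family("F2", 1), Family("F3", 3)]
    for category, ids in split_by_risk(demo).items():
        print(category, ASSUMED_CORRECTION[category], ids)
```

Each group would then contribute to the combined likelihood with its own ascertainment correction; the likelihood terms themselves are not reproduced here.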

Dempster, A.P.; Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm, *Journal of the Royal Statistical Society, Series B*, Vol. 39: 1–38, ISSN 1369-7412.

Gong, G. & Whittemore, A.S. (2003). Optimal designs for estimating penetrance of rare mutations of a disease-susceptibility gene, *Genetic Epidemiology*, Vol. 24: 173–180, ISSN 1098-2272.

Green, J.; O'Driscoll, M.; Barnes, A.; Maher, E.R.; Bridge, P.; Shields, K. & Parfrey, P.S. (2002). Impact of gender and parent of origin on the phenotypic expression of hereditary nonpolyposis colorectal cancer in a large Newfoundland kindred with a common MSH2 mutation, *Diseases of the Colon and Rectum*, Vol. 45: 1223–1232, ISSN 1530-0358.

Heagerty, P.J. (1999). Marginally specified logistic-normal models for longitudinal binary data, *Biometrics*, Vol. 55: 688–698, ISSN 1541-0420.

Hsu, L. & Gorfine, M. (2006). Multivariate survival analysis for case-control family studies, *Biostatistics*, Vol. 7: 387–398, ISSN 1468-4357.

Kopciuk, K.A.; Choi, Y.-H.; Parkhomenko, E.; Parfrey, P.; McLaughlin, J.; Green, J. & Briollais, L. (2009). Penetrance of HNPCC-related cancers in a retrospective cohort of 12 large Newfoundland families carrying a MSH2 founder mutation: an evaluation using modified segregation models, *Hereditary Cancer in Clinical Practice*, Vol. 7: 16, ISSN 1897-4287.

Kraft, P. & Thomas, D.C. (2000). Bias and efficiency in family-based gene-characterization studies: conditional, prospective, retrospective, and joint likelihoods, *The American Journal of Human Genetics*, Vol. 66: 1119–1131, ISSN 1537-6605.

Lawless, J.F. (2003). *Statistical Models and Methods for Lifetime Data*, (Second Ed.), John Wiley and Sons Inc., ISBN 9780471372158, Hoboken.

Le Bihan, C.; Moutou, C.; Brugières, L.; Feunteun, J. & Bonaïti-Pellié, C. (1995). ARCAD: a method for estimating age-dependent disease risk associated with mutation carrier status from family data, *Genetic Epidemiology*, Vol. 12: 13–25, ISSN 1098-2272.

Li, H.; Yang, P. & Schwartz, A.G. (1998). Analysis of age of onset data from case-control family studies, *Biometrics*, Vol. 54: 1030–1039, ISSN 1541-0420.

Oakes, D. (1999). Direct calculation of the information matrix via the EM algorithm, *Journal of the Royal Statistical Society, Series B*, Vol. 61: 479–482, ISSN 1369-7412.

Pfeiffer, R.M.; Pee, D. & Landi, M.T. (2008). On combining family and case-control studies, *Genetic Epidemiology*, Vol. 32: 638–646, ISSN 1098-2272.

Schaid, D.J.; McDonnell, S.K.; Riska, S.M.; Carlson, E.E. & Thibodeau, S.N. (2010). Estimation of genotype relative risks from pedigree data by retrospective likelihoods, *Genetic Epidemiology*, Vol. 34: 287–298, ISSN 1098-2272.

Shih, J.H. & Chatterjee, N. (2000). Analysis of survival data from case-control family studies, *Biometrics*, Vol. 58: 502–509, ISSN 1541-0420.

Siegmund, K.D.; Whittemore, A.S. & Thomas, D.C. (1999). Multistage sampling for disease family registries, *Journal of the National Cancer Institute Monographs*, Vol. 26: 43–48, ISSN 1745-6614.

Thomas, D.C. (2004). *Statistical Methods in Genetic Epidemiology*, Oxford University Press, ISBN-13 978-0195159394, New York.

Wacholder, S.; Hartge, P.; Struewing, J.P.; Pee, D.; McAdams, M.; Brody, L. & Tucker, M. (1998). The kin-cohort study for estimating penetrance, *American Journal of Epidemiology*, Vol. 148: 623–630, ISSN 1476-6256.

In practice, it might be difficult to collect families with a mutation-carrier proband. However, with the emergence of large international consortia such as the Breast and Colon Cancer Family Registries, planning studies that use the POP+ and CLI+ designs is now quite feasible. The 200 families used for the CLI+ design in our simulation study therefore seem to be a reasonable sample size, although including more families would clearly improve efficiency.

There are potential limitations to our study. First, we modeled the penetrance function with the Weibull distribution, chosen because it allows flexible modeling of the baseline hazard, which may be constant, increasing, or decreasing. This choice still leaves room for model misspecification. Kopciuk et al. (2009) employed the generalized log-Burr model for more flexible modeling, as it includes the Weibull and log-logistic models as special cases (Lawless, 2003); the Weibull hazard is monotonic, whereas the log-logistic hazard need not be. The baseline hazard can also be modeled semiparametrically using a step function while assuming proportional hazards. Second, between-family heterogeneity in allele frequencies and baseline hazards can lead to bias in parameter estimates based on homogeneous models. A random effect model would allow us to take between-family heterogeneity into account while avoiding a large number of family-specific parameters. Finally, familial correlation is a common feature of family data, arising from unobserved genetic or environmental risk factors shared within families. We did not explicitly model within-family dependencies; instead, we used a robust variance estimator. However, ignoring familial correlation can lead to biased estimates of the model parameters, and hence to biased disease risks (Choi et al., 2008). In related work, several authors have adopted mixed effect models for binary outcomes in family studies (Heagerty, 1999; Pfeiffer et al., 2008; Zheng et al., 2010). Shared frailty models would allow us to model time-to-onset data from families while explicitly modeling familial correlation. We are planning to develop such frailty models in the context of various family designs.
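To make the remark about hazard shapes concrete, the following Python sketch compares a Weibull hazard with a log-logistic hazard on a grid of ages, and computes a cumulative risk under a Weibull proportional hazards model. It is a minimal illustration, not the fitted model: the function names are hypothetical, the parameterization (rate `lam`, shape `rho`) is one common convention that the chapter may not share, and the values `lam = 0.02` and `rho = 3.0` are arbitrary placeholders rather than estimates from this study. The only number taken from the text is the log relative risk 1.07 (POP+ estimate for MMR mutation in Table 5).

```python
import math


def weibull_hazard(t, lam, rho):
    """Weibull hazard h(t) = lam * rho * (lam * t)**(rho - 1)."""
    return lam * rho * (lam * t) ** (rho - 1)


def loglogistic_hazard(t, lam, rho):
    """Log-logistic hazard: the Weibull form divided by 1 + (lam * t)**rho."""
    return lam * rho * (lam * t) ** (rho - 1) / (1.0 + (lam * t) ** rho)


def weibull_penetrance(t, lam, rho, log_rr=0.0):
    """Cumulative risk by time t under proportional hazards:
    1 - exp(-H0(t) * exp(log_rr)), with Weibull H0(t) = (lam * t)**rho."""
    cumulative_hazard = (lam * t) ** rho
    return 1.0 - math.exp(-cumulative_hazard * math.exp(log_rr))


if __name__ == "__main__":
    lam, rho = 0.02, 3.0       # placeholder values, not study estimates
    log_rr_carrier = 1.07      # Table 5, POP+ corrected likelihood
    for age in range(10, 121, 10):
        print(
            age,
            round(weibull_hazard(age, lam, rho), 4),
            round(loglogistic_hazard(age, lam, rho), 4),
            round(weibull_penetrance(age, lam, rho, log_rr_carrier), 3),
        )
```

With a shape parameter above one, the Weibull hazard keeps rising with age, while the log-logistic hazard rises to a peak and then declines. This is the kind of non-monotone behavior that a more general family such as the log-Burr model can accommodate.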
