**2.1 Defining disease risks**

For diseases caused by mutated genes, the phenotype of interest varies in age at onset, i.e., time to an event such as death or disease diagnosis. We denote the age at onset by *T*, the affection status at age of examination by *δ*. Then, the phenotype is given by *D* = (*T*, *δ*). Under the Cox's proportional hazards model, the hazard function for individual *i* conditional on mutation gene *G* and other risk factors *X* is assumed to take the form

$$h(t\_l | \mathcal{g}\_{l'} x\_l) = h\_o(t) \exp(\beta\_{\mathcal{S}} G + \beta\_{\mathcal{X}} X)\_{\prime}$$

where *ho*(*t*) is a baseline hazard function and *β<sup>g</sup>* and *β<sup>x</sup>* are unknown regression parameters.

Based on this model, we consider two types of disease risk associated with a mutated gene—relative risk and absolute risk, the latter is also called penetrance.

(1) the relative risk in survival analysis is defined by the hazards ratio for an individual with a mutated gene compared to an individual free from the mutation, that is

$$Relative\ Risk = \exp(\beta\_{\mathcal{S}}) \.$$

(2) the penetrance function for the disease susceptibility gene is defined as the age-specific cumulative risk function conditional on the disease susceptibility gene *G* and other relevant covariates *X*,

$$Penetrance = P(T < t | G, X) \dots$$

**2.3 Likelihood approaches for family-based study designs**

for Estimating Disease Risk Associated with Mutated Genes

**2.3.1 Ascertainment-corrected retrospective likelihood**

The likelihood contribution *Lf* for a single family *f* can be written as

genotypes and covariates. Thus, we can express the numerator as

*P*(*Gf*) =

allele frequency, *P*(*gi*|*gmi*

, *gfi*

genotypes (*gmi*

examination, *ap*, i.e.,

*nf* ∏ *i*=1

, *gfi*

) of individual *i*.

place by assuming the proband is a mutation carrier.

and

*Lf* = *P*(*Gf* |*Df* , *Xf* , *Af*)

<sup>∝</sup> *<sup>P</sup>*(*Df* <sup>|</sup>*Xf* , *Gf*)*P*(*Gf*)

ascertainment, and 0 otherwise, and so is independent of the parameter of interest.

*P*(*Df* |*Xf* , *Gf*) =

, *gfi*

*P*(*gi*|*gmi*

*<sup>P</sup>*(*Df* , *Af* |*Xf*) = ∑

This section describes the likelihood-based approaches for modeling ages at onset and genetic covariates using family data via population-based and clinic-based study designs. We propose a combined likelihood approach for family data arising from the two different study designs.

<sup>383</sup> On Combining Family Data from Different Study Designs

The retrospective likelihood corrects for the ascertainment by conditioning on the phenotypes. Define *D* = (*d*1,..., *dn*) as a vector of phenotypes, *G* = (*g*1,..., *gn*) as a vector of genotypes, *X* = (*x*1,..., *xn*) a vector of covariates other than genotypes, and *A* the ascertainment event.

> <sup>=</sup> *<sup>P</sup>*(*Af* <sup>|</sup>*Df* , *Xf* , *Gf*)*P*(*Df* <sup>|</sup>*Xf* , *Gf*)*P*(*Gf*) *P*(*Df* , *Af* |*Xf*)

where we assume that *P*(*Af* |*Df* , *Xf* , *Gf*) is equal to 1 if the vector *Df* qualifies for

We further assume that individuals' phenotypes are independent conditionally given their

Here *P*(*gi*) is based on Hardy-Weinberg Equilibrium (HWE) and depends on the population

The denominator is the correction term used to account for the study designs. In the population-based study, the ascertainment correction is based on the proband's phenotype in that it equals the probability of the proband, *p*, being affected before his/her age at

*g*

where the sum is over all possible genotypes of the proband. For POP+ design, the sum takes

In the clinic-based study, the denominator is based on the phenotypes of four individuals, two parents and two sibs, who involved in their family's ascertainment process. It can be

*nf* ∏ *i*=1

*P*(*gi*), if individual *i* is a founder,

*P*(*di*|*xi*, *gi*) ,

*P*(*Tp < ap*|*g*)*P*(*g*),

), if individual *i* is a nonfounder.

) is the Mendelian transmission probability given parents'

*<sup>P</sup>*(*Df* , *Af* <sup>|</sup>*Xf*) , (1)


#### Table 1. Family-based study designs

The disease risks can be estimated by maximizing a likelihood function with proper ascertainment adjustment of the families. In a crude analysis of family data, their ascertainments are often corrected by simply excluding the probands from the analysis to prevent overestimating the risk. However, more prudent approaches such as likelihood methods would not simply drop out the probands because they include other important information about the disease risks. Rather they would adjust for the sampling process, allowing their contributions to the likelihood. To accommodate study designs and the ascertainment process properly, both the design and the ascertainment criteria should be known clearly. However, such designs or criteria in many cases are unclear or too complex to allow adjustment at the analysis stage. Moreover, family data could come from different sources where families were recruited using different designs or ascertainment criteria.

#### **2.2 Family-based study designs**

We consider the two most commonly used family-based study designs—population-based designs and clinic-based designs. The population-based study design uses the affected cases (probands) to sample their families while the clinic-based study design is based on the probands with a high family history of disease risk. Thus, the clinic-based families likely include more disease cases and mutation carriers compared to the families from population-based designs.

The ascertainment criteria for the population-based study are based on the affected probands who are randomly sampled from the diseased population; for example, cancer registries. To increase the power to study the effect of the mutated gene of interest, one can apply stringent criteria to recruit the probands to be not only affected but also be a mutation-carrier. Similarly, the clinic-based study designs can have two variants: one with random probands with multiple case family members and the other with carrier probands with multiple case family members. Such families can be recruited from cancer registries or cancer clinics.

Table 1 summarizes the four study designs and their sampling criteria used to ascertain families. Population-based designs correspond to ascertainment criteria POP and POP+. They are similar to a kin-cohort design but are more like case-family designs that include extended family members and their genotype information. Ascertainment criteria CLI and CLI+ correspond to clinic-based designs which have multiple disease occurrences among family members. Important to note is that ascertainment criteria for the POP+ and CLI+ designs include families who have at least one member (proband) who carries the mutated gene of interest.

#### **2.3 Likelihood approaches for family-based study designs**

This section describes the likelihood-based approaches for modeling ages at onset and genetic covariates using family data via population-based and clinic-based study designs. We propose a combined likelihood approach for family data arising from the two different study designs.

#### **2.3.1 Ascertainment-corrected retrospective likelihood**

The retrospective likelihood corrects for the ascertainment by conditioning on the phenotypes. Define *D* = (*d*1,..., *dn*) as a vector of phenotypes, *G* = (*g*1,..., *gn*) as a vector of genotypes, *X* = (*x*1,..., *xn*) a vector of covariates other than genotypes, and *A* the ascertainment event. The likelihood contribution *Lf* for a single family *f* can be written as

$$\begin{split} \mathcal{L}\_{f} &= \mathcal{P}(\mathbf{G}\_{f}|\mathcal{D}\_{f}, \mathbf{X}\_{f}, A\_{f}) \\ &= \frac{\mathcal{P}(A\_{f}|\mathcal{D}\_{f}, \mathbf{X}\_{f}, \mathbf{G}\_{f})\mathcal{P}(\mathcal{D}\_{f}|\mathbf{X}\_{f}, \mathbf{G}\_{f})\mathcal{P}(\mathbf{G}\_{f})}{\mathcal{P}(\mathcal{D}\_{f}, A\_{f}|\mathbf{X}\_{f})} \\ &\propto \frac{\mathcal{P}(\mathcal{D}\_{f}|\mathbf{X}\_{f}, \mathbf{G}\_{f})\mathcal{P}(\mathbf{G}\_{f})}{\mathcal{P}(\mathcal{D}\_{f}, A\_{f}|\mathbf{X}\_{f})} . \end{split} \tag{1}$$

where we assume that *P*(*Af* |*Df* , *Xf* , *Gf*) is equal to 1 if the vector *Df* qualifies for ascertainment, and 0 otherwise, and so is independent of the parameter of interest.

We further assume that individuals' phenotypes are independent conditionally given their genotypes and covariates. Thus, we can express the numerator as

$$P(D\_f|X\_{f'},G\_f) = \prod\_{i=1}^{n\_f} P(d\_i|\mathcal{X}\_{i\prime}g\_i) \dots$$

and

4 Will-be-set-by-IN-TECH

CLI Proband is affected and at least one parent and one sib are affected CLI+ Proband is affected mutation-carrier and at least one parent and

The disease risks can be estimated by maximizing a likelihood function with proper ascertainment adjustment of the families. In a crude analysis of family data, their ascertainments are often corrected by simply excluding the probands from the analysis to prevent overestimating the risk. However, more prudent approaches such as likelihood methods would not simply drop out the probands because they include other important information about the disease risks. Rather they would adjust for the sampling process, allowing their contributions to the likelihood. To accommodate study designs and the ascertainment process properly, both the design and the ascertainment criteria should be known clearly. However, such designs or criteria in many cases are unclear or too complex to allow adjustment at the analysis stage. Moreover, family data could come from different sources where families were recruited using different designs or ascertainment criteria.

We consider the two most commonly used family-based study designs—population-based designs and clinic-based designs. The population-based study design uses the affected cases (probands) to sample their families while the clinic-based study design is based on the probands with a high family history of disease risk. Thus, the clinic-based families likely include more disease cases and mutation carriers compared to the families from

The ascertainment criteria for the population-based study are based on the affected probands who are randomly sampled from the diseased population; for example, cancer registries. To increase the power to study the effect of the mutated gene of interest, one can apply stringent criteria to recruit the probands to be not only affected but also be a mutation-carrier. Similarly, the clinic-based study designs can have two variants: one with random probands with multiple case family members and the other with carrier probands with multiple case family members. Such families can be recruited from cancer registries or cancer clinics.

Table 1 summarizes the four study designs and their sampling criteria used to ascertain families. Population-based designs correspond to ascertainment criteria POP and POP+. They are similar to a kin-cohort design but are more like case-family designs that include extended family members and their genotype information. Ascertainment criteria CLI and CLI+ correspond to clinic-based designs which have multiple disease occurrences among family members. Important to note is that ascertainment criteria for the POP+ and CLI+ designs include families who have at least one member (proband) who carries the mutated

Design Ascertainment Criteria POP Proband is affected

Table 1. Family-based study designs

**2.2 Family-based study designs**

population-based designs.

gene of interest.

POP+ Proband is affected and mutation-carrier

one sibling are affected

$$P(G\_f) = \prod\_{i=1}^{n\_f} \begin{cases} P(\mathcal{g}\_i)\_{\prime} & \text{if individual } i \text{ is a bounded,} \\ P(\mathcal{g}\_i | \mathcal{g}\_{m\_{i'}} \mathcal{g}\_{f\_i})\_{\prime} & \text{if individual } i \text{ is a nonfourander.} \end{cases}$$

Here *P*(*gi*) is based on Hardy-Weinberg Equilibrium (HWE) and depends on the population allele frequency, *P*(*gi*|*gmi* , *gfi* ) is the Mendelian transmission probability given parents' genotypes (*gmi* , *gfi* ) of individual *i*.

The denominator is the correction term used to account for the study designs. In the population-based study, the ascertainment correction is based on the proband's phenotype in that it equals the probability of the proband, *p*, being affected before his/her age at examination, *ap*, i.e.,

$$P(D\_{f'}A\_f|X\_f) = \sum\_{\mathcal{S}} P(T\_p < a\_p|\mathcal{g})P(\mathcal{g})\_f$$

where the sum is over all possible genotypes of the proband. For POP+ design, the sum takes place by assuming the proband is a mutation carrier.

In the clinic-based study, the denominator is based on the phenotypes of four individuals, two parents and two sibs, who involved in their family's ascertainment process. It can be

and baseline hazard parameters. In our situation, the expectation of the complete data (*Df* , *Xf* , *Gf*), *f* = 1, . . . , *n*, is taken with respect to the conditional distribution of missing genotypes *Gf m* given observed data (*Df* , *Xf* , *Gf o*) and current estimates of *θ*. Then the parameter estimates are updated by maximizing the likelihood function using the estimate of missing data in the expectation step. These two steps iterate until convergence to obtain the

<sup>385</sup> On Combining Family Data from Different Study Designs

The conditional expectation of the log-likelihood function (*θ*|*D*, *G*, *X*) of the complete data (*D*, *G*, *X*) given the observed phenotypes *Do* and genotypes *Go*, or *Q* function for the *kth*

missing genotype *Gi* given their observed phenotype *Di*, covariates *Xi*, and the observed mutation status *Go* of other family members, especially if the proband's genotype *Gp* is

*<sup>P</sup>θ*(*k*) (*Di*|*Xi*, *Gi* <sup>=</sup> <sup>1</sup>)*P*(*Gi* <sup>=</sup> <sup>1</sup>|*Gp* <sup>=</sup> <sup>1</sup>) + *<sup>P</sup>θ*(*k*) (*Di*|*Xi*, *Gi* <sup>=</sup> <sup>0</sup>)*P*(*Gi* <sup>=</sup> <sup>0</sup>|*Gp* <sup>=</sup> <sup>1</sup>) .

Here *P*(*Gi*|*Gp* = 1) is the conditional probability of the mutation carrier status for family member *i*, using the family proband's known mutation status. Based on Mendelian transmission probabilities, we can express these as simple constants under an assumed genetic model. Under the model assumptions given above, the phenotype probabilities conditional

*<sup>P</sup>*(*Di*|*Xi*, *Gi*) = *<sup>S</sup>*(*ti*; *Xi*, *Gi*)*h*(*ti*; *Xi*, *Gi*)*δ<sup>i</sup>* . In the *M* step of the algorithm, we take the partial derivatives of *Q* with respect to *θ* and set

We illustrate the use of robust variance estimators (sandwich estimators) to account for within-family dependencies for disease risk estimates. In the presence of missing genotypes, the variance estimators are modified accordingly upon the use of the EM algorithm (Louis,

Let *U*(*θ*) and *B*(*θ*) denote the score vector and the negative of the associated matrix of second derivatives for the complete data, respectively, and *U*∗(*θ*) and *B*∗(*θ*) be the corresponding vector and matrix for the incomplete data. Then, the observed information matrix can be

where *go* and *do* denote the vectors of observed genotypes and phenotypes from data. At the maximum likelihood estimate of *θ*, because of the convergence of the EM algorithm, *U*∗

*Io*(*θ*) = *E<sup>θ</sup>* [*B*(*θ*)|*go*, *do*] − *E<sup>θ</sup>* [*U*(*θ*)*U*�(*θ*)|*go*, *do*] + *U*∗(*θ*)*U*∗�(*θ*), (3)

and the corresponding survival function *S* depending on his/her affection status *δ<sup>i</sup>* as

*th* individual in family *f* , we can then obtain the conditional expectation of their

) = *Eθ*(*k*) [(*θ*|*D*, *G*, *X*)|*Do*, *Go*] . (2)

*th* individual can be expressed in terms of the hazard function *h*

MLEs, where the algorithm is guaranteed to increase the likelihood at each iteration.

*<sup>Q</sup>*(*θ*|*θ*(*k*)

for Estimating Disease Risk Associated with Mutated Genes

*Eθ*(*k*) [*Gi*|*Di*, *Xi*, *Gp* = 1] = *Pθ*(*k*) (*Gi*|*Di*, *Xi*, *Gp* = 1)

<sup>=</sup> *<sup>P</sup>θ*(*k*) (*Di*|*Xi*, *Gi*)*P*(*Gi*|*Gp* <sup>=</sup> <sup>1</sup>)

iteration is given by:

For the *i*

conditioned as:

on genotype status for the *i*

to zero, that will maximize *Q*.

1982).

expressed as

**3.2 Robust variance estimator for the EM algorithm**

expressed as

$$\begin{split} P(D\_{f'}A\_{f}|\mathbf{X}\_{f}) &= \sum\_{\mathbf{G}\_{\omega}} P(T\_{f} < a\_{f}|\mathbf{x}\_{f'}\mathbf{g}\_{\omega\prime})^{\delta\_{f}} P(T\_{f} \ge a\_{f}|\mathbf{x}\_{f'}\mathbf{g}\_{\omega\prime})^{1-\delta\_{f}} \times \\ P(T\_{m} < a\_{m}|\mathbf{x}\_{m}, \mathbf{g}\_{\omega\prime})^{\delta\_{m}} P(T\_{m} \ge a\_{m}|\mathbf{x}\_{m}, \mathbf{g}\_{\omega\prime})^{1-\delta\_{m}} P(\mathbf{g}\_{\omega\prime\_{f'}}\mathbf{g}\_{\omega\prime\_{n}}|\mathbf{g}\_{\omega\prime\_{p}}) \times \\ P(T\_{s} < a\_{s}|\mathbf{x}\_{s}, \mathbf{g}\_{\omega\prime\_{s}}) P(\mathbf{g}\_{\omega\prime\_{s}}|\mathbf{g}\_{\omega\prime\_{p}}) P(T\_{p} < a\_{p}|\mathbf{x}\_{p}, \mathbf{g}\_{\omega\prime\_{p}}) P(\mathbf{g}\_{\omega\prime\_{p}}) \end{split}$$

where indices *f* , *m*,*s*, *p* represent father, mother, sib and proband, respectively, *δ* indicates the affection status, and *G<sup>ω</sup>* = (*gω<sup>f</sup>* , *gω<sup>m</sup>* , *gω<sup>p</sup>* , *gω<sup>s</sup>* ) includes all possible genotypes of the four individuals in the ascertainment set. For CLI+ design, the sum in the denominator is taken over all possible genotypes, provided that the proband carries a mutated allele of the major gene. The conditional probabilities *P*(*gω<sup>f</sup>* , *gω<sup>m</sup>* |*gω<sup>p</sup>* = 1) and *P*(*gω<sup>s</sup>* |*gω<sup>p</sup>* = 1) are obtained based on the HWE and Mendelian transmission probabilities using Bayes theorem.

#### **2.3.2 Combined population- and and clinic-based study designs**

Consider a study where the families are sampled via different study designs, say *np* families from a population-based design and *nc* families from a clinic-based design. When their study designs are known, we can construct the likelihoods based on their study designs. Let *Lp* and *Lc* be the likelihood functions based on the population-based design and clinic-based design, respectively. We propose the combined likelihood for the families from the two designs as

$$L\_{comb}(\theta|D,G,X) = L\_p(\theta|D^p,G^p,X^p)L\_{\mathfrak{c}}(\theta|D^{\mathfrak{c}},G^{\mathfrak{c}},X^{\mathfrak{c}}),$$

where the superscripts *p* and *c* denote the population- and clinic-based study designs, respectively, and the likelihoods *Lp* and *Lc* are obtained using the retrospective likelihood approach in expression (1). Therefore, the combined likelihood using the retrospective likelihood approach can accommodate both population- and clinic-based designs.

Even when the sampling schemes are not clearly defined, we can still employ this combined likelihood approach by dividing the families into two groups—high risk and low risk families, according to the number of cases observed in the family. For example, a family would be classified as a high risk family if it includes at least three cases among family members, otherwise it would be classified as a low risk family.

#### **3. Missing genotypes**

In practice, family data often include some missing information, particularly, missing genotypes. In the presence of missing genotype information, we estimate the disease risks associated with a known gene mutation in the families. Suppose data include genetic covariates that consist of observed genotypes and missing genotypes and phenotypes as time-of-onset responses with no missing. To infer the unobserved genotypes in the family, we implement an expectation-maximazation (EM) algorithm (Dempster et al., 1977) and estimate the parameters in the likelihood. The EM algorithm is an iterative procedure that computes the maximum likelihood estimates (MLEs) in the presence of missing data.

#### **3.1 Expectation-maximazation algorithm**

Suppose a genetic covariate *Gf* in family *f* consists of observed genotypes *Gf o* and missing genotypes *Gf m* and the vector of unknown parameters *θ* includes both regression parameters 6 Will-be-set-by-IN-TECH

*<sup>P</sup>*(*Tf <sup>&</sup>lt; <sup>a</sup> <sup>f</sup>* <sup>|</sup>*xf* , *<sup>g</sup>ω<sup>f</sup>* )*δ<sup>f</sup> <sup>P</sup>*(*Tf* <sup>≥</sup> *<sup>a</sup> <sup>f</sup>* <sup>|</sup>*xf* , *<sup>g</sup>ω<sup>f</sup>* )1−*δ<sup>f</sup>* <sup>×</sup>

*P*(*Ts < as*|*xs*, *gω<sup>s</sup>* )*P*(*gω<sup>s</sup>* |*gω<sup>p</sup>* )*P*(*Tp < ap*|*xp*, *gω<sup>p</sup>* )*P*(*gω<sup>p</sup>* ),

where indices *f* , *m*,*s*, *p* represent father, mother, sib and proband, respectively, *δ* indicates the affection status, and *G<sup>ω</sup>* = (*gω<sup>f</sup>* , *gω<sup>m</sup>* , *gω<sup>p</sup>* , *gω<sup>s</sup>* ) includes all possible genotypes of the four individuals in the ascertainment set. For CLI+ design, the sum in the denominator is taken over all possible genotypes, provided that the proband carries a mutated allele of the major gene. The conditional probabilities *P*(*gω<sup>f</sup>* , *gω<sup>m</sup>* |*gω<sup>p</sup>* = 1) and *P*(*gω<sup>s</sup>* |*gω<sup>p</sup>* = 1) are obtained

Consider a study where the families are sampled via different study designs, say *np* families from a population-based design and *nc* families from a clinic-based design. When their study designs are known, we can construct the likelihoods based on their study designs. Let *Lp* and *Lc* be the likelihood functions based on the population-based design and clinic-based design, respectively. We propose the combined likelihood for the families from the two designs as

where the superscripts *p* and *c* denote the population- and clinic-based study designs, respectively, and the likelihoods *Lp* and *Lc* are obtained using the retrospective likelihood approach in expression (1). Therefore, the combined likelihood using the retrospective

Even when the sampling schemes are not clearly defined, we can still employ this combined likelihood approach by dividing the families into two groups—high risk and low risk families, according to the number of cases observed in the family. For example, a family would be classified as a high risk family if it includes at least three cases among family members,

In practice, family data often include some missing information, particularly, missing genotypes. In the presence of missing genotype information, we estimate the disease risks associated with a known gene mutation in the families. Suppose data include genetic covariates that consist of observed genotypes and missing genotypes and phenotypes as time-of-onset responses with no missing. To infer the unobserved genotypes in the family, we implement an expectation-maximazation (EM) algorithm (Dempster et al., 1977) and estimate the parameters in the likelihood. The EM algorithm is an iterative procedure that computes

Suppose a genetic covariate *Gf* in family *f* consists of observed genotypes *Gf o* and missing genotypes *Gf m* and the vector of unknown parameters *θ* includes both regression parameters

based on the HWE and Mendelian transmission probabilities using Bayes theorem.

*Lcomb*(*θ*|*D*, *<sup>G</sup>*, *<sup>X</sup>*) = *Lp*(*θ*|*Dp*, *<sup>G</sup>p*, *<sup>X</sup>p*)*Lc*(*θ*|*D<sup>c</sup>*

likelihood approach can accommodate both population- and clinic-based designs.

the maximum likelihood estimates (MLEs) in the presence of missing data.

**2.3.2 Combined population- and and clinic-based study designs**

otherwise it would be classified as a low risk family.

**3.1 Expectation-maximazation algorithm**

**3. Missing genotypes**

*<sup>P</sup>*(*Tm <sup>&</sup>lt; am*|*xm*, *<sup>g</sup>ω<sup>m</sup>* )*δ<sup>m</sup> <sup>P</sup>*(*Tm* <sup>≥</sup> *am*|*xm*, *<sup>g</sup>ω<sup>m</sup>* )1−*δ<sup>m</sup> <sup>P</sup>*(*gω<sup>f</sup>* , *<sup>g</sup>ω<sup>m</sup>* <sup>|</sup>*gω<sup>p</sup>* ) <sup>×</sup>

, *G<sup>c</sup>* , *X<sup>c</sup>* ),

expressed as

*<sup>P</sup>*(*Df* , *Af* |*Xf*) = ∑

*G<sup>ω</sup>*

and baseline hazard parameters. In our situation, the expectation of the complete data (*Df* , *Xf* , *Gf*), *f* = 1, . . . , *n*, is taken with respect to the conditional distribution of missing genotypes *Gf m* given observed data (*Df* , *Xf* , *Gf o*) and current estimates of *θ*. Then the parameter estimates are updated by maximizing the likelihood function using the estimate of missing data in the expectation step. These two steps iterate until convergence to obtain the MLEs, where the algorithm is guaranteed to increase the likelihood at each iteration.

The conditional expectation of the log-likelihood function (*θ*|*D*, *G*, *X*) of the complete data (*D*, *G*, *X*) given the observed phenotypes *Do* and genotypes *Go*, or *Q* function for the *kth* iteration is given by:

$$Q(\theta|\theta^{(k)}) = E\_{\theta^{(k)}}\left[\ell(\theta|D\_{\prime}G\_{\prime}X)|D\_{0\prime}G\_{0}\right].\tag{2}$$

For the *i th* individual in family *f* , we can then obtain the conditional expectation of their missing genotype *Gi* given their observed phenotype *Di*, covariates *Xi*, and the observed mutation status *Go* of other family members, especially if the proband's genotype *Gp* is conditioned as:

$$\begin{split} &E\_{\theta^{(k)}}\left[\mathcal{G}\_{\mathrm{i}}|D\_{\mathrm{i}\prime}X\_{\mathrm{i}\prime}\mathcal{G}\_{\mathrm{p}}=1\right] = P\_{\theta^{(k)}}\left(\mathcal{G}\_{\mathrm{i}}|D\_{\mathrm{i}\prime}X\_{\mathrm{i}\prime}\mathcal{G}\_{\mathrm{p}}=1\right) \\ &= \frac{P\_{\theta^{(k)}}\left(D\_{\mathrm{i}}|X\_{\mathrm{i}\prime}\mathcal{G}\_{\mathrm{i}}\right)P\left(\mathcal{G}\_{\mathrm{i}}|\mathcal{G}\_{\mathrm{p}}=1\right)}{P\_{\theta^{(k)}}\left(D\_{\mathrm{i}}|X\_{\mathrm{i}\prime}\mathcal{G}\_{\mathrm{i}}=1\right)P\left(\mathcal{G}\_{\mathrm{i}}=1\right) + P\_{\theta^{(k)}}\left(D\_{\mathrm{i}}|X\_{\mathrm{i}\prime}\mathcal{G}\_{\mathrm{i}}=0\right)P\left(\mathcal{G}\_{\mathrm{i}}=0|\mathcal{G}\_{\mathrm{p}}=1\right)}. \end{split}$$

Here *P*(*Gi*|*Gp* = 1) is the conditional probability of the mutation carrier status for family member *i*, using the family proband's known mutation status. Based on Mendelian transmission probabilities, we can express these as simple constants under an assumed genetic model. Under the model assumptions given above, the phenotype probabilities conditional on genotype status for the *i th* individual can be expressed in terms of the hazard function *h* and the corresponding survival function *S* depending on his/her affection status *δ<sup>i</sup>* as

$$P(D\_i|X\_{i\prime}G\_i) = S(t\_{i\prime}X\_{i\prime}G\_i)h(t\_{i\prime}X\_{i\prime}G\_i)^{\delta\_{i\prime}} \dots$$

In the *M* step of the algorithm, we take the partial derivatives of *Q* with respect to *θ* and set to zero, that will maximize *Q*.

#### **3.2 Robust variance estimator for the EM algorithm**

We illustrate the use of robust variance estimators (sandwich estimators) to account for within-family dependencies for disease risk estimates. In the presence of missing genotypes, the variance estimators are modified accordingly upon the use of the EM algorithm (Louis, 1982).

Let *U*(*θ*) and *B*(*θ*) denote the score vector and the negative of the associated matrix of second derivatives for the complete data, respectively, and *U*∗(*θ*) and *B*∗(*θ*) be the corresponding vector and matrix for the incomplete data. Then, the observed information matrix can be expressed as

$$I\_o(\theta) = E\_\theta[B(\theta)|g\_{o\prime}d\_o] - E\_\theta[\mathcal{U}(\theta)\mathcal{U}^\top(\theta)|\mathcal{g}\_{o\prime}d\_o] + \mathcal{U}^\*(\theta)\mathcal{U}^{\*\top}(\theta),\tag{3}$$

where *go* and *do* denote the vectors of observed genotypes and phenotypes from data. At the maximum likelihood estimate of *θ*, because of the convergence of the EM algorithm, *U*∗

model,

<sup>20</sup>)}*ρ*−1.

affected before his(her) age at examination, *ap*,

for Estimating Disease Risk Associated with Mutated Genes

**4.2 Simulation study designs**

by age 70 among carriers.

formula. Once we simulated the age at examination and genotype information for all family members, then the time-to-onset of individual *i* was simulated from the proportional hazards

<sup>387</sup> On Combining Family Data from Different Study Designs

*h*(*ti*|*gi*) = *h*0(*ti*) exp(*βxi*), where *xi* indicates a carrier status of disease mutation gene for subject *i* and the baseline hazard is assumed to follow the Weibull distribution which has a form, *h*0(*t*) = *λρ*{*λ*(*t* −

The proband's age at onset was generated conditioning on the fact that the proband was

*Tp* ∼ *T*|*T < ap* . For the rest of family members, their times to onset were generated unconditionally. We also assumed the minimum age at onset was 20 years of age and the maximum age for followup was 90 years of age. Finally, the affection status *δ<sup>i</sup>* for the *i*th individual was determined by comparing the age at onset *Ti* and age at examination *ai*; *δ<sup>i</sup>* = 1 if *Ti < ai* and 0 otherwise.

Data were simulated under different configurations. We assumed Weibull baseline hazard functions with scale (*λ*) and shape (*ρ*) parameters equal to 0.01 and 3.2, respectively. This leads to a cumulative risk of 10% among mutation non-carriers by age 70. Two penetrances were considered: high and low penetrances corresponding to the log relative risk of a major gene (*β*) given by 2.4 and 1.8, respectively. The high penetrance represents a lifetime risk of 70% by age 70 among carriers of a major gene, which assumes a rare gene with the allele frequency 0.02 under the dominant model. The low penetrance provides a lifetime risk of 48%

We designed the simulation studies, first to investigate the effect of design misspecification, and second, to examine the properties of our proposed likelihood for combined family data from different study designs in the estimation of disease risks associated with a mutated gene. **(1)** To study the effect of design misspecification, the study designs POP, POP+, CLI, and CLI+ were used to generate family data. For each design, two retrospective likelihood methods were applied to fit the data—one using correct adjustment of the study design and the other using a design with misspecified correction; for example, population-based ascertainment correction was used for the families under CLI+ design and clinic-based ascertainment correction was used for the families under POP+ design as for the misspecified design. We

**(2)** To investigate potential bias and efficiency loss in disease risk estimation for the proposed likelihood approach for combined family data from population-based and clinic-based designs. We considered the combined families either from POP+ and CLI+ designs or POP and CLI designs with three mixing ratios between two designs—50-50, 70-30 and 80-20. For example, with the total 400 families sampled, the ratio 50-50 corresponds to equal numbers of families from POP+ and CLI+ designs, the 70-30 sampling corresponds to 280 POP+ families and 120 CLI+ families and the ratio 80-20 to 320 POP+ and 80 CLI+ families. The same numbers were examined for combining POP and CLI families. For each simulation

simulated 500 random samples of 200 families for each simulation configuration.

configuration, 500 random samples were simulated.

is zero. Thus, the observed information matrix can be obtained as the first two terms on the right hand side of (3) that arise from the complete data log-likelihood analysis. The first term is evaluated as

$$E\_{\theta}[B(\theta)|g\_{\theta'}d\_{\theta}] = E\_{\theta}\left[-\frac{\partial^2 \ell(\theta)}{\partial \theta \partial \theta^{\top}}|g\_{\theta'}d\_{\theta}\right].$$

Oakes (1999) explicitly expressed the information matrix in terms of derivatives of the *<sup>Q</sup>*(*θ*|*θ*(*k*)) function in equation (2) invoked by the EM algorithm, as given by

$$I\_o(\theta) = \frac{\partial^2 \ell}{\partial \theta \partial \theta^\top} = \left\{ \frac{\partial^2 Q(\theta | \theta^{(k)})}{\partial \theta \partial \theta^\top} + \frac{\partial^2 Q(\theta | \theta^{(k)})}{\partial \theta \partial \theta^{(k)} \top} \right\}\_{\theta = \theta^{(k)}}\tag{4}$$

where represents the observed data log-likelihood and the second term is viewed as the 'missing information.'

To account for familial correlation, as our model assumes the independence of the individuals in the family, we obtain the robust variance estimator for the ascertainment corrected likelihood with missing genotypes in a 'sandwich' form (White, 1982),

$$\text{Var}(\hat{\theta}) = I\_o(\theta)^{-1} \left\{ \sum\_{f=1}^n \mathcal{U}\_f^\*(\theta) \mathcal{U}\_f^\*(\theta)^\top \right\} I\_o(\theta)^{-1} \tag{5}$$

where *U*∗ *<sup>f</sup>* (*θ*) is the conditional expectation of the complete data score vector for family *f* given the observed data. Thus, the robust variance of ˆ *θ* can be estimated by replacing *θ* by ˆ *θ* in equation (5).

#### **4. Simulation study**

We carried out simulation studies to investigate the properties of our proposed likelihood methods and the effect of design misspecification. The simulation study aims to (1) assess bias and efficiency in disease risk estimation (relative risk and penetrance) for the retrospective likelihood-based approaches for family data from different study designs, (2) investigate potential bias and efficiency loss in risk estimation when the study designs are misspecified, and (3) evaluate the first two aims using combined data from two different study designs.

#### **4.1 Family data generation**

The simulation of family data is based on the method developed by Gauderman (1995) and further extended by Choi et al. (2008). We generated families of three generations: two parents and their two offspring, one of whom is the proband (affected individual from whom the family is selected). Each offspring has a spouse and their children ranged in number from two to five. At the first stage, all family members' ages at examination were obtained using a normal distribution with mean age 65 for the first generation and 45 for the second generation, with variance fixed at 2.5 years for both generations. It resulted in an average of 20 years difference between the parents and offspring. At the next stage, the proband's genotype of a major gene was determined conditioning on the proband's affection status by her/his age at examination, assuming Hardy-Weinberg equilibrium (HWE) with the fixed population allele frequency. Given the proband's genotypes, the genotypes of the other family members were then determined using HWE and Mendelian transmission probabilities calculated with Bayes' formula. Once we simulated the age at examination and genotype information for all family members, then the time-to-onset of individual *i* was simulated from the proportional hazards model,

$$h(t\_i|\mathcal{g}\_i) = h\_0(t\_i) \exp(\beta x\_i)\_{\prime i}$$

where *xi* indicates a carrier status of disease mutation gene for subject *i* and the baseline hazard is assumed to follow the Weibull distribution which has a form, *h*0(*t*) = *λρ*{*λ*(*t* − <sup>20</sup>)}*ρ*−1.

The proband's age at onset was generated conditioning on the fact that the proband was affected before his(her) age at examination, *ap*,

$$T\_p \sim T|T < a\_p \, .$$

For the rest of family members, their times to onset were generated unconditionally. We also assumed the minimum age at onset was 20 years of age and the maximum age for followup was 90 years of age. Finally, the affection status *δ<sup>i</sup>* for the *i*th individual was determined by comparing the age at onset *Ti* and age at examination *ai*; *δ<sup>i</sup>* = 1 if *Ti < ai* and 0 otherwise.

### **4.2 Simulation study designs**

8 Will-be-set-by-IN-TECH

is zero. Thus, the observed information matrix can be obtained as the first two terms on the right hand side of (3) that arise from the complete data log-likelihood analysis. The first term

Oakes (1999) explicitly expressed the information matrix in terms of derivatives of the

*<sup>∂</sup>*2*Q*(*θ*|*θ*(*k*)) *∂θ∂θ*� <sup>+</sup>

where represents the observed data log-likelihood and the second term is viewed as the

To account for familial correlation, as our model assumes the independence of the individuals in the family, we obtain the robust variance estimator for the ascertainment corrected

We carried out simulation studies to investigate the properties of our proposed likelihood methods and the effect of design misspecification. The simulation study aims to (1) assess bias and efficiency in disease risk estimation (relative risk and penetrance) for the retrospective likelihood-based approaches for family data from different study designs, (2) investigate potential bias and efficiency loss in risk estimation when the study designs are misspecified, and (3) evaluate the first two aims using combined data from two different study designs.

The simulation of family data is based on the method developed by Gauderman (1995) and further extended by Choi et al. (2008). We generated families of three generations: two parents and their two offspring, one of whom is the proband (affected individual from whom the family is selected). Each offspring has a spouse and their children ranged in number from two to five. At the first stage, all family members' ages at examination were obtained using a normal distribution with mean age 65 for the first generation and 45 for the second generation, with variance fixed at 2.5 years for both generations. It resulted in an average of 20 years difference between the parents and offspring. At the next stage, the proband's genotype of a major gene was determined conditioning on the proband's affection status by her/his age at examination, assuming Hardy-Weinberg equilibrium (HWE) with the fixed population allele frequency. Given the proband's genotypes, the genotypes of the other family members were then determined using HWE and Mendelian transmission probabilities calculated with Bayes'

*n* ∑ *f*=1 *U*∗ *<sup>f</sup>* (*θ*)*U*<sup>∗</sup>

�

−*∂*<sup>2</sup>(*θ*) *∂θ∂θ*� <sup>|</sup>*go*, *do*

� .

�

*θ*=*θ*(*k*)

, (4)

, (5)

*θ*

*θ* can be estimated by replacing *θ* by ˆ

*<sup>∂</sup>*2*Q*(*θ*|*θ*(*k*)) *∂θ∂θ*(*k*)�

*<sup>f</sup>* (*θ*)�

*<sup>f</sup>* (*θ*) is the conditional expectation of the complete data score vector for family *f*

⎫ ⎬ ⎭ *Io*(*θ*) −1

*E<sup>θ</sup>* [*B*(*θ*)|*go*, *do*] = *E<sup>θ</sup>*

*<sup>Q</sup>*(*θ*|*θ*(*k*)) function in equation (2) invoked by the EM algorithm, as given by

�

*Io*(*θ*) = *<sup>∂</sup>*<sup>2</sup>

Var(ˆ

given the observed data. Thus, the robust variance of ˆ

*∂θ∂θ*� <sup>=</sup>

likelihood with missing genotypes in a 'sandwich' form (White, 1982),

−1 ⎧ ⎨ ⎩

*θ*) = *Io*(*θ*)

is evaluated as

'missing information.'

where *U*∗

in equation (5).

**4. Simulation study**

**4.1 Family data generation**

Data were simulated under different configurations. We assumed Weibull baseline hazard functions with scale (*λ*) and shape (*ρ*) parameters equal to 0.01 and 3.2, respectively. This leads to a cumulative risk of 10% among mutation non-carriers by age 70. Two penetrances were considered: high and low penetrances corresponding to the log relative risk of a major gene (*β*) given by 2.4 and 1.8, respectively. The high penetrance represents a lifetime risk of 70% by age 70 among carriers of a major gene, which assumes a rare gene with the allele frequency 0.02 under the dominant model. The low penetrance provides a lifetime risk of 48% by age 70 among carriers.

We designed the simulation studies, first to investigate the effect of design misspecification, and second, to examine the properties of our proposed likelihood for combined family data from different study designs in the estimation of disease risks associated with a mutated gene.

**(1)** To study the effect of design misspecification, the study designs POP, POP+, CLI, and CLI+ were used to generate family data. For each design, two retrospective likelihood methods were applied to fit the data—one using correct adjustment of the study design and the other using a design with misspecified correction; for example, population-based ascertainment correction was used for the families under CLI+ design and clinic-based ascertainment correction was used for the families under POP+ design as for the misspecified design. We simulated 500 random samples of 200 families for each simulation configuration.

**(2)** To investigate potential bias and efficiency loss in disease risk estimation for the proposed likelihood approach for combined family data from population-based and clinic-based designs. We considered the combined families either from POP+ and CLI+ designs or POP and CLI designs with three mixing ratios between two designs—50-50, 70-30 and 80-20. For example, with the total 400 families sampled, the ratio 50-50 corresponds to equal numbers of families from POP+ and CLI+ designs, the 70-30 sampling corresponds to 280 POP+ families and 120 CLI+ families and the ratio 80-20 to 320 POP+ and 80 CLI+ families. The same numbers were examined for combining POP and CLI families. For each simulation configuration, 500 random samples were simulated.

**Log relative risk (***β***) estimation**

for Estimating Disease Risk Associated with Mutated Genes

**Penetrance estimation**

errors are in parenthesis.

(20% CLI+ families).

*Combined data from POP and CLI designs*

High Penetrance (*β* = 2.4) Low Penetrance (*β* = 1.8) POP POP+ CLI CLI+ POP POP+ CLI CLI+

High Penetrance (70%) Low Penetrance (48%) POP POP+ CLI CLI+ POP POP+ CLI CLI+

Correct -0.002 0.010 0.044 -0.015 -0.006 0.017 0.017 -0.003 Design (0.129) (0.236) (0.095) (0.171) (0.153) (0.256) (0.066) (0.136)

Correct 0.013 0.015 0.028 0.020 0.008 0.012 0.049 0.037 Design (0.049) (0.033) (0.078) (0.084) (0.057) (0.040) (0.115) (0.103)

Misspecified 0.034 0.009 0.290 0.282 0.056 0.014 0.501 0.480 Design (0.087) (0.098) (0.003) (0.004) (0.140) (0.135) (0.003) (0.006)

Table 2. Effects of the design misspecification: bias and precision in disease risk estimation based on retrospective likelihoods with correct and incorrect design adjustments; standard

population-based likelihood and as accurate as the clinic-based likelihood, regardless of the mixing rates we considered. Especially, the combined likelihood appeared to perform better

In the penetrance estimation, we observed similar patterns as in the log relative risk estimation. The population-based likelihood provided substantially large bias with small standard errors, whereas the clinic-based likelihood yielded less bias with large standard errors. However, our proposed likelihood method offered the least bias and improved precision compared to the clinic-based likelihood. In addition, the penetrance was more precisely estimated with the combined likelihood when fewer CLI+ families were recruited

The patterns of bias and precision of the three likelihood methods were more clear with the combined data from POP and CLI designs, as shown in Table 4. In the log relative risk estimation, our proposed likelihood yielded both the most accurate and precise estimates. It also provided more precise estimates when 50% CLI families were included. Similarly, in penetrance estimation, the population-based likelihood provided heavily biased estimates; however, the combined likelihood performed well in terms of both bias and precision. With

fewer CLI families (20%) in the data, more precise estimates were obtained.

for relative risk estimation when more CLI+ families were included in the sample.

Misspecified 0.033 -0.022 1.456 0.444 0.041 0.009 0.665 0.475 Design (0.159) (0.265) (0.201) (0.165) ( 0.185) (0.272) (0.151) (0.144)

<sup>389</sup> On Combining Family Data from Different Study Designs
