**1. Introduction**

Next-generation sequencing (NGS) technologies can be integrated into medical diagnostics in several ways, which differ in the number and type of sequenced regions. Targeted tests sequence particular disease-specific genes, whereas whole-exome sequencing (WES), which covers all ~20,000 protein-coding genes, and whole-genome sequencing (WGS), which covers the entire genome, are non-targeted approaches. These approaches precisely detect the genetic variation of a patient relative to a healthy population or a reference genome. However, sequencing-based diagnostic methods generate large amounts of genomic data: approximately 60,000–100,000 single nucleotide variations (SNVs) and small insertions and deletions (indels) can be detected in each patient's genome by WES [1]. Translating this many genomic variants into useful clinical information is a crucial task. Although several methods have been introduced to narrow the large number of candidate variants and genes down to the clinically causative ones, the process remains challenging.

#### **Figure 1.**

*A general workflow for WES data analysis. Six main steps, quality assessment & preprocessing, alignment, post-alignment processing, variant calling, variant annotation, and variant prioritization integrated with evolutionary approaches, are shown.*

Disease-related genes are non-randomly distributed in the genome, and the majority of them were already present in the eukaryotic ancestor [2]. Mendelian disease genes, which underlie single-gene disorders, tend to have a more ancient evolutionary origin [3]. Because disease-related genes, like other genes, have evolved under natural selection, evolutionary approaches can provide powerful insight not only for understanding human genetic diseases but also for detecting the genomic variants that cause them.

Here, we briefly describe the analysis workflow from raw data to genomic variants as the first step in translating sequence data into a clinical outcome. We primarily focus on WES analysis because most variants responsible for Mendelian disorders disrupt protein-coding regions [4]. We then describe how evolutionary principles are integrated into the prioritization of detected variants. The framework of the chapter is shown in **Figure 1.**

#### **2. From raw data to genomic variations**

The common file format for storing data produced by sequencers is FASTQ [5]. The FASTQ format stores both the nucleotide sequence and its corresponding Phred quality scores [6, 7]. The Phred score, which is derived from the base-calling error probability, indicates the quality of each nucleotide within a read. In a FASTQ file, each read is represented by four lines. The first line begins with "@", followed by a sequence identifier and an optional description. The second line consists of the raw sequence letters: A, T, G, C, and N (unknown). The third line starts with a "+" character, which marks the end of the sequence and can optionally be followed by the same sequence identifier. The fourth line contains the quality scores for the sequence in the second line.

Here, we give an overview of the data analysis workflow from a FASTQ file to annotated genomic variants.

*Integrating Evolutionary Genetics to Medical Genomics: Evolutionary Approaches to Investigate…*

*DOI: http://dx.doi.org/10.5772/intechopen.92738*
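As a minimal sketch of the four-line FASTQ record layout and the Phred scale described above, the following Python snippet reads records and converts scores to error probabilities. The Phred+33 ASCII offset, the function names, and the example read are assumptions made for illustration; they are not taken from any particular tool.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str   # text after the leading "@"
    sequence: str     # raw bases: A, T, G, C, or N
    qualities: list   # one Phred score per base


def parse_fastq(handle, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    The offset of 33 (Phred+33 encoding) is an assumption of this sketch.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(symbol) - offset for symbol in quality]
        yield FastqRecord(header[1:], sequence, scores)


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical single-read example.
example = io.StringIO("@read1 demo\nGATTN\n+\nIIII#\n")
records = list(parse_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.qualities)
```

On this scale, a score of 20 corresponds to an error probability of 1 in 100 and a score of 30 to 1 in 1000.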

#### **2.1 Quality assessment and preprocessing**

*Methods in Molecular Medicine*


Although NGS platforms can generate massive numbers of reads in parallel in a single run, read quality may be imperfect for reasons such as failures in experimental processing or technical machine errors. Because these errors affect downstream analysis, the quality of the raw FASTQ data should be assessed as the first step of the workflow.

A number of tools have been developed to evaluate raw FASTQ data. These tools generally take FASTQ files as input and generate summary statistics and graphs that give a quick overview of raw read quality. The most commonly used is FastQC [8], developed by Simon Andrews at the Babraham Institute; other available tools include FQStat [9], Quack [10], SeqAssist [11], and QC-Chain [12]. Depending on the results of the quality check, preprocessing may be necessary before alignment.

The standard preprocessing step consists of trimming low-quality bases and removing adapter sequences from the ends of the reads. Depending on the library preparation protocol, adapter sequences can be ligated to the 3′ and 5′ ends of the fragments. Adapter remnants in the reads must be removed correctly because they can otherwise lead to missed alignments or wrong genotype calls in downstream analyses. Many tools with different implementation principles have been developed for preprocessing. Ktrim [13], PE-Trimmer [14], SeqPurge [15], AdapterRemoval [16], PEAT [17], Skewer [18], Trimmomatic [19], QcReads [20], AlienTrimmer [21], and Btrim [22] can be used for adapter and quality trimming, depending on the study design. In addition, some tools, such as FastqCleaner [23], FastProNGS [24], EasyQC [25], fastp [26], TrimGalore, FASTX-Toolkit, afterQC, ClinQC, NGS QC Toolkit, PRINSEQ, and fastQ\_brew, carry out both quality check and preprocessing functions.
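The two preprocessing operations described above can be illustrated with a short Python sketch. It is a simplified illustration, not the algorithm of any tool listed here: it assumes exact adapter matching at the 3′ end, a fixed Phred threshold of 20, and an example adapter sequence, all of which are choices made for this sketch.

```python
def trim_adapter(sequence, qualities, adapter, min_overlap=3):
    """Remove a 3' adapter, whether it is complete or cut off by the read end."""
    cut = sequence.find(adapter)  # complete adapter inside the read
    if cut == -1:
        # Otherwise look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, qualities  # no adapter found
    return sequence[:cut], qualities[:cut]


def trim_low_quality_tail(sequence, qualities, threshold=20):
    """Drop bases from the 3' end while their Phred score is below the threshold."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return sequence[:end], qualities[:end]


# Hypothetical read: 8 genomic bases followed by the first 7 adapter bases.
adapter = "AGATCGGAAGAGC"  # example adapter sequence, an assumption
read = "ACGTACGT" + "AGATCGG"
quals = [30] * 6 + [10, 8] + [30] * 7

no_adapter_seq, no_adapter_quals = trim_adapter(read, quals, adapter)
trimmed_seq, trimmed_quals = trim_low_quality_tail(no_adapter_seq, no_adapter_quals)
print(trimmed_seq, trimmed_quals)
```

Real trimmers additionally tolerate mismatches in the adapter, handle paired-end reads, and often use sliding-window quality criteria, so this sketch only conveys the basic idea.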

#### **2.2 Alignment of reads**

After quality checking and preprocessing, the processed reads must be aligned to a reference genome. Both GRCh37 (hg19) and GRCh38 (hg38) are widely used as the human reference. Optimal alignment to a reference sequence is not an easy computational task: it requires an algorithm that is fast and that tolerates the imperfect matches introduced by genomic variation. Several tools have been developed to align short reads. They mainly use the Burrows-Wheeler transform (BWT), the Smith-Waterman (SW) dynamic programming algorithm, or a combination of both. Bowtie2 [27] and BWA [28], which are based on the BWT, are widely used for short-read alignment. Novoalign [29], MOSAIK [30], and SHRiMP2 [31] implement the SW algorithm. Comprehensive reviews and benchmark studies of these methods and their differences can be found in the literature [32, 33].
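To make the Burrows-Wheeler transform mentioned above more concrete, here is a naive Python sketch that builds the transform by sorting all rotations of a text and then inverts it. This is only a teaching illustration under the assumption of a "$" terminator that sorts before all bases; production aligners rely on compressed index structures built on the transform rather than on explicit rotation tables.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(symbol + row for symbol, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


reference = "GATTACA"  # hypothetical toy reference
transformed = bwt(reference)
print(transformed, inverse_bwt(transformed))
```

The transform groups identical characters that share the same following context, which is the property that makes fast substring search over a large reference feasible.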

The output of the alignment step is a Sequence Alignment Map (SAM) file, which contains the mapped reads. A Binary Alignment Map (BAM) file is the binary version of a SAM file. BAM and SAM files carry the same information, organized into a header section and an alignment section. The header section provides information such as the reference sequences, read groups, sequencing platform details, and the processing steps applied to the reads. The alignment section gives the genomic position of each read together with relevant descriptive information.
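The header and alignment sections described above can be separated with a few lines of Python. The sketch below assumes plain-text SAM input with tab-separated fields; the eleven mandatory columns and the flag bits follow the SAM specification, while the class name, helper names, and the example records are hypothetical.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamAlignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[6], int(fields[7]),
                        int(fields[8]), fields[9], fields[10], tags)


def read_sam(lines):
    """Split SAM text into header lines (starting with '@') and alignments."""
    headers, alignments = [], []
    for line in lines:
        if line.startswith("@"):
            headers.append(line.rstrip("\n"))
        elif line.strip():
            alignments.append(parse_sam_line(line))
    return headers, alignments


def is_reverse(alignment):
    return bool(alignment.flag & 0x10)   # read mapped to the reverse strand


def is_duplicate(alignment):
    return bool(alignment.flag & 0x400)  # read marked as a duplicate


# Hypothetical SAM content.
sam_text = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:sample1\tPL:ILLUMINA",
    "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tGATTA\tIIIII\tNM:i:0\tRG:Z:sample1",
]
headers, alignments = read_sam(sam_text)
print(len(headers), alignments[0].rname, alignments[0].pos)
```

In practice, BAM files are binary and compressed, so they are read through dedicated libraries or converted to SAM text first.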

SAMtools [34] and the Integrative Genomics Viewer (IGV) [35] are commonly used to view BAM/SAM files, for example to confirm detected variants by inspecting the underlying alignments.

#### **2.3 Post-alignment processing**

Processing of the aligned reads is recommended to improve the quality of downstream variant calling. This step generally consists of marking duplicate reads and base quality score recalibration (BQSR) to minimize technical biases.

During library preparation, DNA fragments from a particular genomic region are amplified by PCR to provide enough material for sequencing. As a result, some reads derive from copies of the same fragment; they share the same sequence and the same alignment position, which can bias variant detection. These duplicates should be marked or removed to eliminate PCR-introduced bias. MarkDuplicates, available in Picard [36], and SAMtools [34] are widely used tools that detect duplicate reads based on their identical 5′ positions on the genome.
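The idea of detecting duplicates by shared 5′ position can be sketched as follows. This is a simplified illustration and not the implementation of Picard or SAMtools: reads are represented as hypothetical dictionaries, single-end data is assumed, and the read with the highest summed base quality in each group is kept as the representative.

```python
from collections import defaultdict


def five_prime_position(read):
    """5' end of the read on the reference: start if forward, end if reverse."""
    return read["end"] if read["reverse"] else read["start"]


def mark_duplicates(reads):
    """Flag all but the best read in each group sharing chromosome, 5' end, and strand."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], five_prime_position(read), read["reverse"])
        groups[key].append(read)
    for group in groups.values():
        # Keep the read with the highest summed base quality as representative.
        best = max(group, key=lambda read: sum(read["qualities"]))
        for read in group:
            read["duplicate"] = read is not best
    return reads


# Hypothetical aligned reads.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 149,
     "reverse": False, "qualities": [30] * 50},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 149,
     "reverse": False, "qualities": [20] * 50},
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 149,
     "reverse": True, "qualities": [30] * 50},
]
marked = mark_duplicates(reads)
print([(read["name"], read["duplicate"]) for read in marked])
```

Marking rather than deleting keeps the reads in the file, so downstream tools can decide whether to ignore them.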

In addition to duplicate marking, base quality is an important factor in variant detection. As mentioned above, each base in a read has a Phred quality score generated by the sequencing machine. However, the machine can produce systematically biased scores. BQSR models these errors empirically and recalibrates the base quality scores using a machine learning approach, which substantially reduces technical bias. A key point is that known variant sites must be excluded from the error model, because mismatches at these sites are likely to be true genomic variation and should not be counted as sequencing errors. The most widely used tool for recalibrating base qualities is BaseRecalibrator, available in the Genome Analysis Toolkit (GATK) [37].
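The core idea behind recalibration, comparing reported quality scores with empirically observed mismatch rates while skipping known variant sites, can be sketched in a few lines of Python. This is a deliberately reduced illustration and not the GATK model: it bins only by the reported score, whereas real recalibration uses additional covariates, and the pseudocount smoothing and the input format are assumptions of this sketch.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Estimate an empirical Phred score for each reported score.

    observations: iterable of (chrom, pos, reported_q, is_mismatch) tuples,
    one per aligned base. known_sites: set of (chrom, pos) pairs holding
    known variants, which are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for chrom, pos, reported_q, is_mismatch in observations:
        if (chrom, pos) in known_sites:
            continue  # a mismatch here is probably real variation
        counts[reported_q][1] += 1
        if is_mismatch:
            counts[reported_q][0] += 1
    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Pseudocounts (+1, +2) avoid a zero error rate in small bins.
        error_rate = (mismatches + 1) / (total + 2)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table


# Hypothetical data: 998 bases reported as Q30, 9 of which mismatch,
# plus one mismatch at a known variant site that must be ignored.
observations = [("chr1", position, 30, position < 9) for position in range(998)]
observations.append(("chr1", 5000, 30, True))
known_sites = {("chr1", 5000)}

recalibration_table = empirical_qualities(observations, known_sites)
print(recalibration_table)
```

In this toy data set the bases reported as Q30 behave like Q20 bases, so their scores would be lowered accordingly.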

#### **2.4 Variant calling**

After the post-alignment processing step, variant analysis can start from an analysis-ready BAM file. In the variant calling step, the differences between the reference genome and the genome of interest are identified. For variant calling, variants can be categorized as germline or somatic. Germline variants are inherited variations present in the germ cells. Somatic variants are present only in somatic cells and can be specific to a tissue. In this chapter, we focus on the identification of germline SNVs and indels. Several tools based on different algorithms have been developed to call germline short variants. Tools such as HaplotypeCaller, available in GATK [38], SAMtools [34], FreeBayes [39], and Platypus [40] are based on Bayesian approaches. VarScan [41] relies on a heuristic approach to identify variants, while SNVer [42] uses a frequentist approach. The performance of different tools has been evaluated in recent studies [43–45]. Regardless of the algorithm, most of these tools produce an analysis-ready Variant Call Format (VCF) file. A VCF file is a text file that contains header lines and data lines. The header lines begin with "##". The first header line always states the VCF format version; it is followed by lines defining the name, length, value type, and description of each item that appears in the relevant fields of the data lines.
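The VCF layout described above can be illustrated with a minimal Python reader. It is a sketch that handles only the first eight fixed columns and the INFO/FORMAT definitions; the function names and the example record are hypothetical, and a real analysis would normally use an established VCF library.

```python
import re

META_PATTERN = re.compile(r"^##(INFO|FORMAT)=<(.+)>$")


def parse_meta_line(line):
    """Parse an INFO/FORMAT definition into its ID, Number, Type, Description."""
    match = META_PATTERN.match(line.strip())
    if not match:
        return None
    kind, body = match.groups()
    # Quoted values may contain commas, so match them before bare values.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body)
    entry = {key: value.strip('"') for key, value in pairs}
    entry["kind"] = kind
    return entry


def parse_data_line(line):
    chrom, pos, variant_id, ref, alt, qual, flt, info = line.split("\t")[:8]
    info_fields = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, _, value = item.partition("=")
        info_fields[key] = value if value else True  # flags carry no value
    return {"chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
            "alt": alt.split(","), "qual": None if qual == "." else float(qual),
            "filter": flt, "info": info_fields}


def read_vcf(lines):
    version, meta, variants = None, [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##fileformat="):
            version = line.split("=", 1)[1]   # first header line: format version
        elif line.startswith("##"):
            entry = parse_meta_line(line)
            if entry:
                meta.append(entry)
        elif line.startswith("#") or not line:
            continue                          # column-name line or blank line
        else:
            variants.append(parse_data_line(line))
    return version, meta, variants


# Hypothetical VCF content.
vcf_text = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth, all samples">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;DB",
]
version, meta, variants = read_vcf(vcf_text)
print(version, meta[0]["ID"], variants[0]["alt"], variants[0]["info"])
```

Note that the line naming the columns starts with a single "#", which the sketch skips.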


#### **2.5 Variant annotation and prioritization**


After variants are detected, biologically important features such as gene symbols, genomic positions, amino acid changes, and variant consequences are added to the VCF file in the annotation step. Beyond this basic annotation, several tools can integrate annotations from many other sources, including the minor allele frequency (MAF) of known variants in public databases and pathogenicity predictions. Numerous variant annotation tools implement different methods; the most widely used are ANNOVAR [46], VEP [47], SnpEff [48], GEMINI [49], VarAFT [50], AnnTools [51], SVA [52], and NGS-SNP [53]. These tools enable the filtering and prioritization of potential disease-causing mutations. Prioritizing the clinically causative mutation among a vast number of annotated variants is the most challenging part of the analysis and is not fully automated. In the next section, we discuss how evolutionary approaches can be used to prioritize genomic variants.

## **3. Utilizing evolutionary information in variant prioritization**

We have described the process of obtaining annotated variants from raw FASTQ data. Experimentally evaluating every variant at a genomic scale would be impractical, but evolution provides a valuable set of natural experiments. Integrating evolutionary approaches into the prioritization step has the potential to distinguish the variant responsible for a particular disease among all annotated variants. Indeed, the association between disease and evolution has been attributed to natural selection [54, 55]. During evolution, variants in highly conserved genomic regions are exposed to natural selection because of their negative impact on fitness, which makes these conserved genes intolerant to variation [56]. In contrast, in faster-evolving regions of the genome, many variants have been tolerated over evolutionary time and have accumulated in the population at high MAF. Mendelian disease genes, however, tend to be more intolerant to variation than other genes [57]. These genes are also more conserved across species, which allows the phenotypes of different mutant genes to be compared at a multispecies level [58].

In this section, we discuss the role of evolutionary approaches in variant prioritization. The first prioritization method filters variants using allele frequencies from population databases. We then introduce several pathogenicity prediction tools for interpreting the remaining variants, especially those of uncertain significance. Next, we describe how gene intolerance information is used to infer variant pathogenicity. Finally, we list commonly used model organism databases that allow mutant gene phenotypes to be compared across species.

#### **3.1 Population databases**

During human evolution, existing and novel variants have been evaluated in terms of their biological impact. Population databases record the outcomes, providing researchers with extensive catalogs of the genomic variation of thousands of individuals. At the end of the 1990s, the establishment of dbSNP made it possible to record genotype-phenotype associations in variant databases [59]. More recently, large-scale resources such as gnomAD and the 1000 Genomes Project database, which actively collect genomic data from various populations, have become available. The population-level MAF found in these databases is one of the primary guides for interpreting variant pathogenicity, because causative variants for most Mendelian disorders have deleterious effects on reproductive fitness. Causative alleles are therefore generally absent from these databases or present at low frequencies. In any global population database, a variant with MAF >5% can be considered benign, except for well-known founder alleles [60].

Population databases can therefore be used for variant filtration, which is often done according to one of three approaches. The first, called discrete filtering, assumes that a disease-causing variant should not be found in these databases at all [61, 62]. This approach can be useful for very rare Mendelian disorders, but it can be problematic in some cases: as the number of genomes in the databases grows, excluding every observed allele regardless of its MAF can eliminate truly pathogenic alleles that are present in the general population at low frequencies. The analysis of autosomal recessive disorders is especially affected by this risk. The second, called the 1% approach, uses allele frequency thresholds that depend on the inheritance model: the MAF threshold can be set at 1% for autosomal recessive variants, whereas a cutoff of 0.1% can be useful for autosomal dominant variants [62]. The third, called the quantile-based approach, also employs frequency thresholds, but the thresholds are variable and depend on disease prevalence, mode of inheritance, database size, and database characteristics [63].
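The discrete and threshold-based filters described above can be expressed compactly in Python. The sketch below uses the cutoffs quoted in the text (5% benign, 1% recessive, 0.1% dominant); the variant records, field names, and function names are hypothetical, and the quantile-based approach is not shown because its thresholds depend on disease-specific parameters.

```python
BENIGN_MAF = 0.05                      # above this, generally considered benign
MAF_THRESHOLDS = {"recessive": 0.01,   # 1% for autosomal recessive variants
                  "dominant": 0.001}   # 0.1% for autosomal dominant variants


def passes_discrete_filter(variant):
    """Discrete filtering: keep only variants absent from the database."""
    return variant["maf"] is None


def passes_frequency_filter(variant, inheritance):
    """Threshold filtering: keep variants rarer than the model-specific cutoff."""
    maf = variant["maf"]
    if maf is None:
        return True  # never observed in the population database
    return maf < MAF_THRESHOLDS[inheritance]


def is_common_benign(variant):
    """Flag variants above the 5% cutoff (founder alleles need manual review)."""
    return variant["maf"] is not None and variant["maf"] > BENIGN_MAF


# Hypothetical annotated variants; maf=None means absent from the database.
variants = [
    {"name": "v1", "maf": None},
    {"name": "v2", "maf": 0.0005},
    {"name": "v3", "maf": 0.004},
    {"name": "v4", "maf": 0.2},
]

discrete = [v["name"] for v in variants if passes_discrete_filter(v)]
dominant = [v["name"] for v in variants if passes_frequency_filter(v, "dominant")]
recessive = [v["name"] for v in variants if passes_frequency_filter(v, "recessive")]
benign = [v["name"] for v in variants if is_common_benign(v)]
print(discrete, dominant, recessive, benign)
```

The example shows how discrete filtering would discard the rare variants v2 and v3, which the frequency-based filters retain under the appropriate inheritance model.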

Depending on the case, different approaches can be employed using population databases that differ in scope and data collection. Here, we summarize the widely used population databases.

### *3.1.1 1000 Genomes Project (1KGP) database*

The 1KGP database provides a comprehensive set of human genetic variation from a diverse set of individuals from multiple populations. The database includes the reconstructed genomes of 2504 individuals from 26 populations, obtained by combining low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. It contains over 88 million variants, consisting of around 84.7 million SNPs, 3.6 million indels, and 60,000 structural variants [64, 65].

### *3.1.2 The Genome Aggregation Database (gnomAD)*

gnomAD is an extensive collection of exome and genome sequencing data from several large-scale sequencing projects. The first release of gnomAD is also known as the Exome Aggregation Consortium (ExAC) dataset. The gnomAD v2 short variant release contains 125,748 exomes and 15,708 whole genomes mapped to the GRCh37/hg19 reference sequence. In contrast, the v3 short variant release contains 71,702 whole genomes mapped to the GRCh38 reference sequence, including most of the whole genomes from the v2 release. Therefore, gnomAD v2 provides higher power for the analysis of coding regions, while v3 offers a valuable resource for the analysis of non-coding regions. For the analysis of structural variants, the gnomAD SV v2.1 data set provides access to a total of 10,847 genomes aligned against the GRCh37 reference sequence [66].

### *3.1.3 Database of short genetic variations (dbSNP) and the database of genomic structural variations (dbVar)*

The National Center for Biotechnology Information (NCBI) maintains dbSNP and dbVar databases which together contain almost 2 billion submitted human

**137**

*Integrating Evolutionary Genetics to Medical Genomics: Evolutionary Approaches to Investigate…*


variants. dbSNP presents its reference variants as rs identifiers, whereas dbVar has no reference structural variant database because current technology cannot detect the precise breakpoints in the genome. Other contents of the dataset include population frequency, geographic origin of the population, population-specific genotype and allele frequencies, and population-specific heterozygosity estimates. Besides serving as human population databases, dbSNP and dbVar also contain genomic variations from a variety of organisms, which can be a valuable resource for evolutionary studies [67, 68].

### *3.1.4 ClinVar*


ClinVar is a public database that archives genetic variants of any type together with interpretations of their clinical significance for reported conditions. Unlike dbSNP and dbVar, which are also maintained by NCBI, ClinVar focuses only on medically relevant variants. Although ClinVar reviews variant submissions for validation, the clinical significance of each variant is reported directly by the submitters. For each variant, ClinVar displays either the consensus or any conflict among the interpretations from different submitters. Under the strict comparison approach, submissions of pathogenic and likely pathogenic for the same variant are treated as conflicting. Under the more relaxed approach, variants are categorized as pathogenic/likely pathogenic, benign/likely benign, or uncertain significance [69].
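The difference between the strict and relaxed comparison approaches can be illustrated with a minimal Python sketch. It is not ClinVar's implementation: the function name, the `RELAXED_GROUPS` mapping, and the conflict label are assumptions made for illustration only.

```python
from typing import Iterable

# Hypothetical grouping used only for this sketch: the relaxed approach merges
# "pathogenic" with "likely pathogenic" and "benign" with "likely benign".
RELAXED_GROUPS = {
    "pathogenic": "pathogenic/likely pathogenic",
    "likely pathogenic": "pathogenic/likely pathogenic",
    "benign": "benign/likely benign",
    "likely benign": "benign/likely benign",
    "uncertain significance": "uncertain significance",
}
CONFLICT = "conflicting interpretations"  # illustrative label


def aggregate_interpretations(submissions: Iterable[str], strict: bool = False) -> str:
    """Summarize the interpretations submitted for a single variant."""
    terms = {s.strip().lower() for s in submissions}
    if not terms:
        raise ValueError("at least one submission is required")
    # Strict: every distinct term counts. Relaxed: terms are merged into groups first.
    labels = terms if strict else {RELAXED_GROUPS.get(t, t) for t in terms}
    return labels.pop() if len(labels) == 1 else CONFLICT
```

With the submissions `["Pathogenic", "Likely pathogenic"]`, the strict mode reports a conflict, whereas the relaxed mode returns the merged pathogenic/likely pathogenic category.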

### *3.1.5 Database of chromosomal imbalance and phenotype in humans using Ensembl resources (DECIPHER)*

DECIPHER provides a catalog of common copy-number changes in healthy populations as well as chromosome rearrangements of patients and the associated phenotype records, submitted by clinical researchers with informed consent [70]. DECIPHER can therefore serve as a valuable platform during variant prioritization. Users can check both the healthy population data and the previously submitted clinical records within DECIPHER to better understand the effect of a variant of interest and to identify novel and potentially pathogenic variants.

## **3.2 Pathogenicity prediction tools**

Even after population MAF-based filtering, individuals generally carry many variants that are not present in databases. Most of these variants cannot be classified definitively as benign or pathogenic according to the criteria proposed by clinical guidelines such as those of the American College of Medical Genetics and Genomics (ACMG) [60]. Such alterations are termed variants of uncertain significance (VUS). Further filtering approaches must be used to reduce the number of VUS. For this purpose, numerous pathogenicity prediction tools based on different principles have been developed to evaluate variant effects. The ACMG and the European Society of Human Genetics (ESHG) [71] guidelines also recommend these in silico methods to interpret variant pathogenicity.

The first methods were proposed to predict computationally whether an amino acid substitution disturbs protein function. These methods, now part of the PolyPhen algorithm [72], evaluate mutations using the physical properties of the amino acid change together with a multispecies alignment. Many methods have been derived from this idea and are based on different principles. Evolutionary conservation is among the most useful features for such predictions. Some methods, such as SIFT [73], PROVEAN [74] and PANTHER [75], rely on sequence conservation. For example, SIFT, the most widely used algorithm, performs a PSI-BLAST search and compares the alignments of related sequences to check whether the variant is tolerated from an evolutionary perspective. Beyond sequence conservation, another group of methods uses machine learning algorithms to take into account several features, such as amino acid physicochemical properties, the context of the variant position, and protein structural features. CADD [76], MutationTaster2 [77], PolyPhen-2 [72], DANN [78] and VEST3 [79] are well-known examples of such tools.

The predicted impact of a variant may differ from one tool to another. This discrepancy has prompted the development of meta-predictors, which combine the results of existing tools through approaches such as logistic regression, decision trees, random forests, and support vector machines to reach their own decisions. MetaSVM and MetaLR [80], M-CAP [81] and REVEL [82] are well-known examples of meta-predictors.
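As an illustration of the logistic-regression variant of this idea, the following Python sketch combines the scores of individual tools into one probability-like value. It does not reproduce any published meta-predictor: the function name, the tool names, the weights, and the bias are hypothetical, and the component scores are assumed to be rescaled so that higher values mean more damaging. In practice, the weights would be fitted on labeled variants.

```python
import math
from typing import Mapping


def meta_score(scores: Mapping[str, float], weights: Mapping[str, float], bias: float) -> float:
    """Combine per-tool scores with a logistic model (illustrative only)."""
    # Weighted sum of the component scores, followed by the logistic function.
    z = bias + sum(weights[tool] * scores[tool] for tool in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and scores, not taken from any real tool.
example_weights = {"tool_a": 2.0, "tool_b": 1.5}
example_scores = {"tool_a": 0.9, "tool_b": 0.7}
combined = meta_score(example_scores, example_weights, bias=-1.5)
```

A single combined value is easier to threshold than several partly contradictory predictions, which is the practical motivation for meta-predictors.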

Below, several useful tools are described without a performance comparison. Benchmark studies that have extensively examined the accuracy of these tools can be found in the literature [83–85].

### *3.2.1 MutationTaster2*

MutationTaster2 uses a naive Bayes classifier to predict the functional consequences of variants in both exonic and intronic regions, incorporating a scoring system for the evolutionary conservation around DNA variants. MutationTaster2 also uses information from several variant databases, including 1KGP and ClinVar. The tool automatically predicts a variant as neutral if it is found more than four times in the homozygous state in these databases, and as disease-causing if it is reported as pathogenic in ClinVar, in which case it lists the associated disease phenotypes [77].
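The two automatic rules described above can be sketched in Python as follows. This is not MutationTaster2 code: the function name and arguments are hypothetical, and the order in which the two rules are checked is an assumption, since the text does not say which takes precedence.

```python
from typing import Optional

# From the text: "more than four times in the homozygous state".
HOMOZYGOUS_LIMIT = 4


def automatic_call(homozygous_count: int, clinvar_pathogenic: bool) -> Optional[str]:
    """Return an automatic prediction, or None if the classifier must decide."""
    # Assumption: a ClinVar pathogenic report is checked first.
    if clinvar_pathogenic:
        return "disease-causing"
    if homozygous_count > HOMOZYGOUS_LIMIT:
        return "neutral"
    # Neither rule applies, so the variant goes to the naive Bayes classifier.
    return None
```

Variants that match neither rule return `None`, which marks them as cases for the classifier itself.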

### *3.2.2 Combined annotation-dependent depletion (CADD)*

CADD combines 63 genomic features derived from evolutionary constraint, surrounding sequence context, and functional predictions to evaluate SNVs and short indels. The tool integrates all of these features into a single CADD score using a machine learning approach trained on a binary distinction between simulated variants and variants that have become fixed in human populations since the split between humans and chimpanzees. The resulting C scores correlate with the pathogenicity of a variant and with disease severity [76].

### *3.2.3 The Mendelian clinically applicable pathogenicity (M-CAP)*

M-CAP uses a supervised learning classifier to interpret genomic variants and focuses especially on coding mutations in Mendelian diseases. As a meta-predictor, it uses nine existing tools: SIFT, PolyPhen-2, CADD, MutationTaster, MutationAssessor [86], FATHMM [87], LRT [88], MetaLR and MetaSVM. It also combines information on base-pair, amino acid, genomic region, and gene conservation from RVIS [89], PhyloP [90], PhastCons [91], SIPHY [92], GERP [93], PAM250 and BLOSUM62 [94]. Additionally, M-CAP uses multiple-sequence alignments of 99 primate, mammalian, and vertebrate genomes to the human genome as a new feature [81].

### *3.2.4 PrimateAI*

PrimateAI [95] is a deep neural network trained on a comprehensive dataset that includes around 380,000 common missense variants from humans and six non-human primate species. PrimateAI treats common missense mutations from other primate species as non-pathogenic in humans and thereby enables the identification of pathogenic variants. PrimateAI has shown 88% accuracy in identifying disease-causing variants and has allowed the discovery of 14 novel candidate genes related to intellectual disability. PrimateAI also incorporates protein structure information, as it learns to predict secondary structure and solvent accessibility from amino acid sequences. The tool returns a score that is interpreted according to the mode of inheritance. In genes with dominant modes of inheritance, a score >0.8 indicates a likely pathogenic variant, a score <0.6 a likely benign variant, and a score of 0.6–0.8 an intermediate one. In genes with recessive modes of inheritance, a score >0.7 indicates a likely pathogenic variant and a score <0.5 a likely benign one.
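The score thresholds above translate directly into a small lookup, sketched here in Python. The function name and the string labels are hypothetical. The text does not name the band between 0.5 and 0.7 for recessive genes, so labeling it "intermediate" is an assumption made by analogy with the dominant case.

```python
def classify_primateai(score: float, inheritance: str) -> str:
    """Map a PrimateAI score to a category using the thresholds quoted in the text."""
    if inheritance == "dominant":
        if score > 0.8:
            return "likely pathogenic"
        if score < 0.6:
            return "likely benign"
        return "intermediate"
    if inheritance == "recessive":
        if score > 0.7:
            return "likely pathogenic"
        if score < 0.5:
            return "likely benign"
        # Assumption: the text leaves this band unnamed for recessive genes.
        return "intermediate"
    raise ValueError("inheritance must be 'dominant' or 'recessive'")
```

The same score can therefore be read differently depending on the gene: a score of 0.75 is intermediate under the dominant thresholds but likely pathogenic under the recessive ones.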

## **3.3 Genic intolerance**


Genic intolerance is a gene-level assessment that has the potential to prioritize genomic variants. It was developed as a scoring system that calculates the tolerance of genes to functional genetic variation on a genome-wide scale and ranks them, using the 6503 whole exomes available in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project [89]. The system predicts the amount of common functional variation expected in a gene, given the apparently neutral variation found in that gene, and compares it with the observed amount. The deviation from this prediction defines the intolerance score, namely the Residual Variation Intolerance Score (RVIS). Genes with a positive RVIS have more common functional variation than expected, whereas genes with a negative RVIS have less. A negative RVIS therefore indicates that the gene is intolerant. The scoring system also shows that genes that cause Mendelian diseases are significantly more intolerant to functional variation than genes that do not cause any known disease.
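The residual idea behind RVIS can be sketched in Python with an ordinary least-squares fit across genes. This is a simplified illustration rather than the published procedure: the function name and inputs are hypothetical, raw residuals are returned without the standardization a real score would apply, and the per-gene counts in the usage lines are invented for demonstration.

```python
from statistics import mean
from typing import List, Sequence


def intolerance_residuals(neutral_counts: Sequence[float],
                          functional_counts: Sequence[float]) -> List[float]:
    """Residuals of common functional variation regressed on neutral variation."""
    if len(neutral_counts) != len(functional_counts):
        raise ValueError("both inputs need one value per gene")
    x_bar, y_bar = mean(neutral_counts), mean(functional_counts)
    sxx = sum((x - x_bar) ** 2 for x in neutral_counts)
    if sxx == 0:
        raise ValueError("neutral counts must vary between genes")
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(neutral_counts, functional_counts))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Observed minus expected: negative means less functional variation than predicted.
    return [y - (intercept + slope * x)
            for x, y in zip(neutral_counts, functional_counts)]


# Invented counts for five genes; the last gene carries far fewer common
# functional variants than the trend predicts, so its residual is negative.
residuals = intolerance_residuals([1, 2, 3, 4, 5], [1, 2, 3, 4, 0])
```

In this toy input, the last gene receives the most negative residual and would be ranked as the most intolerant, matching the interpretation of a negative RVIS given above.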

## **3.4 Model organism databases**

The evolutionary conservation of many biological processes among species allows the use of several model organisms to study human diseases. Although not all human genes are conserved in invertebrate models such as worms and fruit flies, vertebrate models such as zebrafish and mouse provide valuable resources to study such genes. When evaluating the function of a conserved gene in model organisms, it is critical to keep in mind that orthologous genes usually cause different phenotypes in different species, even though their gene products have a similar molecular function. The model organism databases listed below provide information on the molecular function of query genes and thus serve as a valuable resource during the variant prioritization process.

### *3.4.1 Mouse genome informatics (MGI)*

MGI is the primary database that integrates genetic, genomic, and biological data for the laboratory mouse. The Mouse Genome Database (MGD) and the Mouse Gene Expression Database (GXD) are the two largest contributors to MGI, and both serve as valuable resources for studies of human disease. MGD provides curated phenotypes and functional annotations for mouse genes and alleles, while GXD contains mouse gene expression data with an emphasis on endogenous gene expression during mouse development [96, 97]. The Human-Mouse Disease Connection tool within MGI is another important feature that facilitates the exploration of gene-phenotype-disease relationships between human and mouse. When a list of human genes is searched on MGI, the algorithm finds the matching mouse genes and their homologs and displays both the human and the mouse phenotypes associated with the genes of interest. MGI is updated weekly with new annotations from the literature.

### *3.4.2 International Mouse Phenotyping Consortium (IMPC)*

IMPC aims to establish a comprehensive dataset of the mouse genome and phenome by knocking out each gene individually and characterizing the resulting physical and chemical changes, thus providing a foundation for the functional analysis of human genetic variation [98]. The project also aims to generate putative human pathogenic variants in both coding and non-coding regions of the mouse genome.

IMPC uses an algorithm developed to detect phenotypic similarities between IMPC mouse strains and more than 7000 rare diseases. The algorithm evaluates a diverse set of phenotyping parameters, including neurological, behavioral, metabolic, cardiovascular, pulmonary, reproductive, respiratory, sensory, musculoskeletal, and immunological parameters, and provides a quantitative measure of how well a mouse model recapitulates disease features.

So far, over 3000 genes have been cataloged, revealing models for 360 diseases, with 90% of the annotated phenotypes being novel [99]. By 2021, IMPC plans to analyze more than 9000 mouse genes to facilitate the prioritization and validation of variants obtained from clinical sequencing efforts.

### *3.4.3 Rat Genome Database (RGD)*

RGD provides genetic, genomic, phenotypic, and disease-related data for the laboratory rat, *Rattus norvegicus*. Rats have been one of the most commonly used model organisms in human disease research. RGD catalogs rat data and also serves as a comparative data analysis platform across species such as rat, mouse, and human by validating orthologous relationships. The database currently contains more than 1300 rat strains with disease/phenotype annotations [100]. RGD offers several tools that facilitate the analysis of data in a disease-related context. PhenoMiner is one such tool; it standardizes the phenotype data obtained from different rat studies by using a variety of ontologies developed at RGD [101]. Users begin a search by selecting one of the PhenoMiner search categories, which include rat strains, experimental conditions, clinical measurements, and measurement methods. The algorithm then filters the data according to the selected conditions and displays the results.

### *3.4.4 FlyBase*

FlyBase is the central resource for integrated Drosophila genetic and genomic data, including but not limited to sequence-level gene models, mutant phenotypes, mutant lesions and chromosome aberrations, as well as gene expression patterns [102]. The fruit fly, *Drosophila melanogaster*, is a member of the genus *Drosophila* that is widely used as a model for human disease research.

FlyBase presents data in different ways to facilitate Drosophila translational research; the two main approaches are gene-centric and disease-centric. The Gene Report displays information on individual genes and lists the mutant alleles of the gene and the expression pattern of its products. The Human Disease Model Report provides background information on a specific disease and summarizes the experimental data and results from previous fruit fly studies.



FlyBase also incorporates orthology prediction tools such as OrthoDB and DIOPT, which were developed to identify orthologs of fly genes in multiple organisms [103, 104]. Integrating the results of these tools into the Gene Reports allows users to identify orthologs in up to 5000 species. The predicted orthologs are a valuable resource for predicting human disease gene variants, as FlyBase also indicates whether the human ortholog functionally complements the fly mutant when transferred into the Drosophila genome.

### *3.4.5 WormBase*


### *3.4.5 WormBase*

WormBase serves as the main database for genetic, genomic, and biological information on *C. elegans* and related nematodes. *C. elegans* is a widely used model for human disease variant research, as over 40% of human genes have a *C. elegans* ortholog. WormBase catalogs the available mutant strains for each gene as well as related nematode studies. The WS273 release of WormBase contains over 160,000 gene summaries for 10 nematode species. The gene summaries also include the diseases and phenotypes associated with human orthologs to aid the detection of human disease-causing variants [105].

### *3.4.6 Zebrafish Information Network (ZFIN)*

ZFIN is the main database that provides genetic, genomic, and phenotypic data from zebrafish studies [106]. The zebrafish, *Danio rerio*, is a model organism extensively used in biomedical research, especially for developmental and genomic studies. Powerful approaches are available to model human diseases using zebrafish. Genetic manipulation of zebrafish orthologs of human disease genes is a common strategy to model genetic disorders such as Duchenne muscular dystrophy [107] and Rett syndrome [108]. Another disease-modeling strategy is to generate transgenic zebrafish lines that express human genes. This approach allows the role of a potentially disease-causing variant in disease pathology to be tested. For example, a transgenic zebrafish model confirmed the pathogenicity of two novel XPNPEP3 gene mutations predicted in the clinic to cause ciliopathy [109]. Users can easily search ZFIN to find information on disease models, including the transgenic lines and mutant phenotypes related to their query.
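The idea shared by these model-organism resources, namely using the phenotypes of an ortholog's mutant as evidence for a candidate human gene, can be illustrated with a small sketch. It is not the algorithm used by IMPC, RGD, FlyBase, WormBase, or ZFIN. It assumes that patient and model-organism phenotypes have already been mapped to a shared vocabulary, and it uses a plain Jaccard overlap as a stand-in similarity measure. The gene symbols, organisms, and phenotype terms are hypothetical placeholders, as are the names `ModelEvidence`, `jaccard`, and `rank_candidates`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEvidence:
    """Phenotypes observed in a model-organism mutant of a human gene's ortholog."""
    gene: str              # human gene symbol (hypothetical placeholder)
    organism: str          # model organism in which the ortholog was studied
    phenotypes: frozenset  # phenotype terms, assumed mapped to a shared vocabulary


def jaccard(a, b):
    """Jaccard overlap of two term sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(patient_terms, evidence):
    """Rank genes by their best-matching model; returns (gene, score, organism) tuples."""
    best = {}
    for ev in evidence:
        score = jaccard(patient_terms, ev.phenotypes)
        # Keep only the highest-scoring model organism for each gene.
        if ev.gene not in best or score > best[ev.gene][0]:
            best[ev.gene] = (score, ev.organism)
    rows = [(gene, score, organism) for gene, (score, organism) in best.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical example data, for illustration only.
PATIENT = {"muscle weakness", "cardiomyopathy", "scoliosis"}
EVIDENCE = [
    ModelEvidence("GENE_A", "zebrafish", frozenset({"muscle weakness", "cardiomyopathy"})),
    ModelEvidence("GENE_A", "mouse", frozenset({"muscle weakness"})),
    ModelEvidence("GENE_B", "rat", frozenset({"seizures", "scoliosis"})),
    ModelEvidence("GENE_C", "fruit fly", frozenset({"reduced lifespan"})),
]

ranked = rank_candidates(PATIENT, EVIDENCE)
for gene, score, organism in ranked:
    print(f"{gene}\t{score:.2f}\t{organism}")
```

In practice, a similarity measure that accounts for the ontology structure and for term specificity would likely be preferable to a flat set overlap. The sketch only shows where model-organism evidence enters the ranking of candidate genes.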
