## *2.2.2. Addition of PhyloP46way Conservation Score Database for hg19 Assembly (Homo sapiens)*

SeqAnt 2012: Recent Developments in Next-Generation Sequencing Annotation 87

The phyloP evolutionary conservation score is a new data type in SeqAnt 2.0. Binary databases, including phyloP scores from a 46-way alignment of vertebrate species to the human genome, were added to complement the PhastCons evolutionary conservation scores already included in the application. phyloP scores measure, at each site, how strongly the observed substitution pattern departs from neutral evolution. The absolute value of a phyloP score is the negative log p-value under the null hypothesis of neutral evolution at the annotated site [18]. Conserved sites tend to have positive scores, whereas sites believed to be fast evolving have negative scores. Most scores for the 46-way alignment from the UCSC Genome Browser fall between approximately -3 and +3. Unlike PhastCons, which takes flanking bases into account when scoring a given variant site, phyloP scores each base on its own by comparing it with the aligned bases from other species [18]. Variants in highly conserved regions often suggest a change that could have functional implications. The PhyloP46way dataset in the upgraded SeqAnt web application is the most recent phyloP track on the UCSC Genome Browser, released in December 2009.
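As a rough illustration of how a single phyloP value can be read, the sketch below converts a score into the p-value it encodes and a direction of departure from neutrality. It assumes the scores are base-10 negative log p-values; the function name `interpret_phylop` is hypothetical and is not part of SeqAnt.

```python
from typing import Tuple


def interpret_phylop(score: float) -> Tuple[str, float]:
    """Return (direction, p_value) for a single phyloP score.

    Assumption: |score| = -log10(p) under the null hypothesis of
    neutral evolution; the sign gives the direction of the departure.
    """
    p_value = 10.0 ** (-abs(score))
    if score > 0:
        direction = "conserved"
    elif score < 0:
        direction = "fast-evolving"
    else:
        direction = "neutral"
    return direction, p_value


if __name__ == "__main__":
    for s in (3.0, -1.5, 0.0):
        print(s, interpret_phylop(s))
```

Under this reading, a score near +3 or -3 corresponds to a p-value of about 0.001 for conservation or acceleration, respectively.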

## *2.2.3. Addition of Full Genome Data Set by Chromosome of Zebrafish (danRer6 Assembly)*

We selected zebrafish (*Danio rerio*) as the next species to be incorporated into the SeqAnt database because of its emergence as a model organism for a wide range of scientific studies, from behavioral genetics to drug modeling studies and integrative physiology [19,20]. SeqAnt 2.0 now includes binary files for the zebrafish genome sequence. We derived binary databases for the first four data types from flat table files on the UCSC Genome Browser website. Flat table files for the phyloP evolutionary conservation score were not available, so that data type is not included. The reference genome binaries use the danRer6 assembly, released in December 2008, with the datasets organized by chromosome. The RefGene annotation and dbSNP variation data are relative to the danRer6 assembly. PhastCons evolutionary conservation scores were derived from a multiple alignment of seven species to zebrafish. Including zebrafish in SeqAnt 2.0 should prove valuable for researchers who work with this species.

#### **2.3. SeqAnt 2.0: output directory structure and files**

Significant changes to the number and types of output files are reflected in a new output directory structure in SeqAnt 2.0 (Figure 5). All output from SeqAnt is written to a Results directory named after the original SeqAnt input file with the suffix '\_Annotation\_Files'. This directory holds three subdirectories (All\_Variations, BED\_Annotation, Unique\_Variations), which are described in detail below, and three other files of interest to a user. The first is a \*.summary.txt file that summarizes all the variants annotated by SeqAnt. The second is a Compound.Replacement file that lists the variants, genes, and sample identifiers for those loci with two or more replacement variants. This list includes variants that could be compound recessive in a given individual; because the phase of the variants is not determined, any such case would have to be validated by other means. The file may be useful when looking for genes that harbor variants fitting a recessive loss-of-function model. The third is a \*.log file that records the major events that occur when SeqAnt processes a dataset.
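The selection behind the Compound.Replacement file can be sketched as a simple grouping step. This is a minimal sketch and not SeqAnt's actual implementation: it assumes the grouping is done per sample and per gene, and the function name and the `(sample_id, gene_name, variant_id)` record layout are hypothetical.

```python
from collections import defaultdict


def compound_replacement_candidates(records):
    """Group replacement variants by (sample, gene) and keep groups of two or more.

    records: iterable of (sample_id, gene_name, variant_id) tuples, one per
    exonic replacement variant observed in a sample (hypothetical layout).
    Phase is unknown, so every hit is only a candidate for compound
    recessive inheritance and needs independent validation.
    """
    by_sample_gene = defaultdict(set)
    for sample_id, gene_name, variant_id in records:
        by_sample_gene[(sample_id, gene_name)].add(variant_id)
    return {
        key: sorted(variants)
        for key, variants in by_sample_gene.items()
        if len(variants) >= 2
    }


if __name__ == "__main__":
    demo = [
        ("S1", "GENE_A", "chr1:100"),
        ("S1", "GENE_A", "chr1:250"),
        ("S1", "GENE_B", "chr2:500"),
        ("S2", "GENE_A", "chr1:100"),
    ]
    print(compound_replacement_candidates(demo))
```

Only sample S1 carries two distinct replacement variants in the same gene in this toy input, so it is the only candidate returned.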

#### *2.3.1. All\_variations directory*

86 Bioinformatics

This directory contains the complete variant annotation files obtained from annotating input files with SeqAnt 2.0 (Figure 5). SeqAnt annotates two main types of genetic variation: single nucleotide variants (SNPs) and insertions/deletions (INDELs). Each annotated SNP site belongs to one of five functional classes: exonic.replacement, exonic.silent, untranslated region (UTR), intronic, or intergenic. Each annotated INDEL belongs to one of four functional classes: exonic, UTR, intronic, or intergenic. Overall, nine files, corresponding to the five SNP classes and the four INDEL classes, contain the variants and their associated annotation information. These annotation files include all possible splice variants impacted by a given variant site. Thus, a given variant site may be listed multiple times in one of the nine output files.

#### *2.3.2. BED\_annotation directory*

This directory contains files in BED format (http://genome.ucsc.edu/FAQ/FAQformat) that can be visualized on the UCSC Genome Browser or in another viewer able to process this format. There are ten files in this directory. Nine of the files include the variants and annotation information as described above; the tenth file (\*.ucsc.bed) combines the annotation information from all nine files into a single BED file, so that the entire genomic region can be visualized at once. These files can be uploaded to the UCSC browser as custom tracks. They can also be viewed in other software packages that process BED files, such as the Integrative Genomics Viewer (Version 2.1) [21].
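For readers preparing their own tracks, the sketch below shows how a single-base variant maps onto the first four BED columns. BED uses a 0-based start and an exclusive end, so a 1-based position `p` becomes the interval `[p - 1, p)`. The function name is hypothetical, the sketch assumes the input position is 1-based, and the BED files written by SeqAnt may carry additional columns.

```python
def snp_to_bed_line(chrom: str, position: int, name: str) -> str:
    """Format one single-base variant as a four-column BED line.

    position is assumed to be 1-based; BED expects a 0-based start
    and an end coordinate that is not included in the interval.
    """
    if position < 1:
        raise ValueError("position must be a 1-based coordinate (>= 1)")
    chrom_start = position - 1
    chrom_end = position
    return "\t".join([chrom, str(chrom_start), str(chrom_end), name])


if __name__ == "__main__":
    print(snp_to_bed_line("chr1", 100, "exonic.replacement"))
```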

#### *2.3.3. Unique\_variations directory*

In contrast to the All\_Variations directory, the Unique\_Variations directory holds nine files in which each SNP or INDEL carries a single variant annotation. Thus, each variant is listed just once, regardless of the number of different splice variants it is predicted to impact. These files allow the user to quickly determine the total number of variants for any specific functional class.
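The relationship between the two directories can be sketched as a de-duplication step over per-transcript rows. This is an illustration only: how SeqAnt chooses which transcript annotation to keep is not stated here, so the sketch simply keeps the first row seen for each variant. The dictionary keys reuse field names from Table 1, and the function name is hypothetical.

```python
def unique_variants(rows):
    """Collapse per-transcript rows to one row per variant.

    rows: iterable of dicts with the keys Chromosome, Position,
    Reference_Base and Input_Base. The first row encountered for
    each variant is kept (an assumption, not SeqAnt's documented rule).
    """
    seen = set()
    unique = []
    for row in rows:
        key = (
            row["Chromosome"],
            row["Position"],
            row["Reference_Base"],
            row["Input_Base"],
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    all_rows = [
        {"Chromosome": "chr1", "Position": 100, "Reference_Base": "A",
         "Input_Base": "G", "RefSeq_ID": "transcript_1"},
        {"Chromosome": "chr1", "Position": 100, "Reference_Base": "A",
         "Input_Base": "G", "RefSeq_ID": "transcript_2"},
    ]
    # Two transcript-level rows collapse to a single variant.
    print(len(unique_variants(all_rows)))
```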

#### **2.4. SeqAnt 2.0 - Output files**

#### *2.4.1. Redesign of Result Columns for Annotation Files*

We introduced a number of changes to the annotation fields contained within the SeqAnt output files. First, we rearranged the order of columns in the output files to aid users in evaluating their results. Second, we added new feature columns to the output files. These are field 10, which gives the transcript change that occurs for a coding sequence variant; field 14, which gives the concomitant amino acid change for a coding sequence variant; and fields 21 to 23, which report the phyloP conservation scores for each annotated variant position. The annotation information provided by SeqAnt 2.0 is summarized in Table 1, and an example output file is shown in Figure 6.

**Table 1.** Annotation information output by SeqAnt 2.0

| Field ID | Annotation Field | Description |
|---|---|---|
| 1 | Variation\_Type | Type of variant |
| 2 | Functional Class | Annotated functional category for variant site |
| 3 | Chromosome | Chromosome containing variant site |
| 4 | Position | Absolute position of variant site on a chromosome |
| 5 | Gene\_Name | Name of locus containing variant site |
| 6 | RefSeq\_ID | Ref\_Seq ID from UCSC track |
| 7 | Gene\_Strand | Orientation of locus |
| 8 | Reference\_Base | Reference allele at variant site |
| 9 | Input\_Base | Minor allele at variant site |
| 10 | Transcript Change | Nucleotide base change on transcript |
| 11 | Original\_Amino\_Acid | Reference amino acid at variant site |
| 12 | Amino\_Acid\_Number | Position of amino acid on peptide chain |
| 13 | Modified\_Amino\_Acid | Modified amino acid due to variant site |
| 14 | Amino\_Acid\_Change | Amino acid change on peptide chain |
| 15 | dbSNP\_IDs | dbSNP ID if variant site has been reported |
| 16 | Het\_Rates | dbSNP heterozygosity of reported variant site |
| 17 | Orientation | dbSNP orientation of reported variant site |
| 18 | PhastCons\_placentals | Placental PhastCons score for variant site (46way) |
| 19 | PhastCons\_primates | Primate PhastCons score for variant site (46way) |
| 20 | PhastCons\_vertebrate | Vertebrate PhastCons score for variant site (46way) |
| 21 | PhyloP\_placental | Placental phyloP score for variant site (46way) |
| 22 | PhyloP\_primates | Primate phyloP score for variant site (46way) |
| 23 | PhyloP\_vertebrate | Vertebrate phyloP score for variant site (46way) |

**Figure 6. Snapshot of Exonic Replacement Annotation Output File.** The top half shows the data for fields 1 - 13. The bottom half of the figure shows the data from fields 14 - 23. The last four columns report the number of homozygous and heterozygous SNPs and associated sample IDs.
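A downstream script might map one row of an annotation file onto the 23 fields of Table 1 roughly as follows. This is a sketch under stated assumptions: the delimiter is assumed to be a tab (the file format is not specified here), the function name is hypothetical, and any trailing columns, such as the genotype counts and sample IDs mentioned in the Figure 6 caption, are returned unparsed.

```python
FIELD_NAMES = [
    "Variation_Type", "Functional Class", "Chromosome", "Position",
    "Gene_Name", "RefSeq_ID", "Gene_Strand", "Reference_Base",
    "Input_Base", "Transcript Change", "Original_Amino_Acid",
    "Amino_Acid_Number", "Modified_Amino_Acid", "Amino_Acid_Change",
    "dbSNP_IDs", "Het_Rates", "Orientation", "PhastCons_placentals",
    "PhastCons_primates", "PhastCons_vertebrate", "PhyloP_placental",
    "PhyloP_primates", "PhyloP_vertebrate",
]


def parse_annotation_row(line, delimiter="\t"):
    """Split one row into (fields, extras).

    fields maps the 23 Table 1 names to their string values; extras holds
    any remaining columns untouched. The tab delimiter is an assumption.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) < len(FIELD_NAMES):
        raise ValueError(
            "expected at least %d columns, got %d"
            % (len(FIELD_NAMES), len(values))
        )
    fields = dict(zip(FIELD_NAMES, values))
    extras = values[len(FIELD_NAMES):]
    return fields, extras


if __name__ == "__main__":
    # Placeholder values v1..v27 stand in for a real row.
    row = "\t".join("v%d" % i for i in range(1, 28))
    fields, extras = parse_annotation_row(row)
    print(fields["PhyloP_vertebrate"], len(extras))
```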
