**4. An application of SeqAnt 2.0: Sequencing the** *AFF2* **locus and X chromosome exome in patients with autism**

SeqAnt 2012: Recent Developments in Next-Generation Sequencing Annotation 93

| Functional Class | Total SNPs | SNPs in dbSNP | Novel SNPs | Novel SNPs at Conserved Sites |
|---|---|---|---|---|
| Replacement | 5 | 0 | 5 | 5 |
| Silent | 8 | 4 | 4 | 4 |
| UTR | 33 | 20 | 13 | 1 |
| Intron | 223 | 129 | 94 | 6 |
| Total | 269 | 153 | 116 | 16 |

**Table 4.** Functional annotation of single nucleotide polymorphisms at the *AFF2* locus identified by next-generation sequencing of 202 males with autism

| Functional Class | Total Indels | Indels in dbSNP | Novel Indels | Novel Indels at Conserved Sites |
|---|---|---|---|---|
| Exonic | 0 | 0 | 0 | 0 |
| UTR | 2 | 0 | 2 | 1 |
| Intron | 15 | 7 | 8 | 1 |
| Total | 17 | 7 | 10 | 2 |

**Table 5.** Functional annotation of indels at the *AFF2* locus identified by next-generation sequencing of 202 males with autism

With improvements in methods of targeted enrichment and next-generation sequencing, the targeted sequencing of all genes on a specific chromosome has become feasible. Targeted sequencing of specific genes or genomic regions is a common experimental design that benefits from the use of SeqAnt [25]. Here we performed an experiment that combined targeted sequencing with chromosomal exome sequencing. We selected 127 males from the Autism Genetic Resource Exchange (AGRE) multiplex collection and 75 males from the Simons Foundation Autism Research Initiative (SFARI) Simplex Collection (SSC), New York, NY, USA, for target DNA amplification and DNA sequencing. From the AGRE collection, we chose multiplex families with two or more male affected sib-pairs who shared >99% of 76 genotyped SNPs in the *AFF2* genomic region [22]. One male was chosen at random if both siblings were equally affected; otherwise, the male with autism was chosen over boys with a diagnosis of not quite autism (NQA) or broad spectrum. From the SSC collection, we chose 75 boys from different families who had a diagnosis of ASD [26], were described as autistic, and were not reported to have any other syndromes.

For the AGRE samples, we prepared target DNA by long PCR (LPCR) amplification of the *AFF2* genomic region, followed by sequencing on an Illumina Genome Analyzer (IGA). For the SSC samples, we prepared target DNA for Illumina sequencing by using RainDance Technology's (RDT) microdroplet-based technology to enrich for the human X chromosome exome, as described previously [25]. Following enrichment, we performed 70-bp single-end multiplex sequencing on an IGA. Nearly 20 GB of sequence was generated for the AGRE samples, and ~55 GB for the SSC samples. The *AFF2* reference sequence used for the AGRE samples consisted of 10 discontiguous fragments covering 84.8 kb, and the SSC reference sequence consisted of the entire human X chromosome exome, which spanned 5748 discontiguous fragments covering 4.7 Mb. Raw base-calling data generated with the IGA were mapped, and variants were called, using PEMapper (Cutler DJ et al, personal communication). For the AGRE samples, 99% of the bases had more than 8X coverage, and median depth of coverage ranged from 388 to 1548. For the SSC samples, between 83% and 97% of the targeted reference bases had more than 8X coverage, and median depth of coverage ranged from 20 to 607. We identified a total of 286 sites of variation: 269 single nucleotide polymorphisms (SNPs) and 17 insertions or deletions (INDELs). Overall levels of variation were similar between the two datasets (θw per site [23]; AGRE: 6.0 × 10⁻⁴, SSC: 6.7 × 10⁻⁴), with an excess of rare variants, as evidenced by a negative value of the Tajima's D test statistic for both sets of samples ([24]; AGRE: -1.46, SSC: -1.41).
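The two summary statistics quoted above can be illustrated with a short sketch. It uses the standard textbook definitions of Watterson's estimator and Tajima's D; the function names and inputs are hypothetical, and it is not the code used in the study. For X-linked loci in males, each individual contributes one chromosome, so `n` here is assumed to be the number of sequenced chromosomes.

```python
import math


def harmonic_sums(n: int) -> tuple[float, float]:
    """Return (a1, a2): the sums of 1/i and 1/i^2 for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    return a1, a2


def watterson_theta_per_site(num_segregating: int, n: int, num_sites: int) -> float:
    """Watterson's estimator per site: S / (a1 * L)."""
    a1, _ = harmonic_sums(n)
    return num_segregating / (a1 * num_sites)


def tajimas_d(mean_pairwise_diff: float, num_segregating: int, n: int) -> float:
    """Tajima's D from the mean pairwise difference (pi) and S segregating sites.

    A negative value indicates an excess of rare variants relative to the
    neutral expectation.
    """
    a1, a2 = harmonic_sums(n)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        raise ValueError("variance is not positive; need more samples or sites")
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)
```

When the mean pairwise difference equals S / a1, the two estimates of diversity agree and D is zero; a surplus of low-frequency variants lowers the pairwise difference and pushes D below zero.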

We used SeqAnt to annotate the variants found at the *AFF2* locus in the total sample of 202 males with a diagnosis of autism (Mondal et al, in revision). We sought to test the hypothesis that rare variants at the *AFF2* locus can act as autism susceptibility alleles. Annotating our variants with other web-based tools, such as the UCSC Genome Browser or the Ensembl Genome Browser, would have been time-consuming and laborious. SeqAnt rapidly annotated these SNPs and INDELs into different functional classes and reported whether each variant had already been cataloged in the dbSNP database (Tables 4, 5). SeqAnt also reported the PhastCons and phyloP conservation scores, which help determine whether a variant might cause a deleterious change in protein structure or function, since variants at well-conserved sites are more likely to cause such changes. Using this feature of SeqAnt, we could easily identify our list of candidate variants that were both rare and likely to cause a damaging change.
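The candidate selection described above (rare, not in dbSNP, at a conserved site) can be sketched as a simple filter over annotated records. The record fields and the conservation cutoffs below are illustrative assumptions, not SeqAnt's output format or thresholds used in the study; only the 5% frequency boundary for rare variants comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedVariant:
    """Hypothetical annotated variant record (field names are assumptions)."""
    site: str
    functional_class: str      # e.g. "Replacement", "Silent", "UTR", "Intron"
    in_dbsnp: bool
    minor_allele_freq: float   # frequency in the sequenced sample
    phastcons: float
    phylop: float


def candidate_variants(variants, max_freq=0.05, min_phastcons=0.9, min_phylop=2.0):
    """Keep novel, rare variants that lie at conserved sites.

    min_phastcons and min_phylop are placeholder cutoffs chosen for
    illustration only.
    """
    return [
        v for v in variants
        if not v.in_dbsnp
        and v.minor_allele_freq < max_freq
        and (v.phastcons >= min_phastcons or v.phylop >= min_phylop)
    ]
```

Treating a site as conserved when either score passes its cutoff is one possible design choice; requiring both scores to pass would give a stricter list.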

92 Bioinformatics




As expected, almost all common variation (>5% frequency in our population) was already contained in dbSNP, whereas most rare variants (<5%) were not cataloged there (Figure 8). Among our cases, we found five singleton nonsynonymous variants (2.5% of the total cases sequenced). This level of variation was significantly higher than that seen in a set of 5400 controls. Furthermore, we used SeqAnt to rapidly annotate 1006 X chromosome genes that had been sequenced in the 75 SSC samples, and ultimately showed that the excess of mutations at *AFF2* was unusual compared to other X chromosome loci. Thus, the ability to rapidly annotate the sequence variants discovered by sequencing the entire X chromosome exome had a major impact on our ability to assess the role of *AFF2* as an autism susceptibility locus. Finally, SeqAnt helped us identify three rare noncoding UTR sequence variants, one of which was at an evolutionarily conserved site. Subsequent functional testing suggested that the variant at the conserved site influences the level of *AFF2* expression. Thus, for this experiment, SeqAnt allowed us to rapidly focus on the sites of greatest interest for both statistical analyses and direct functional testing.
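The text does not state which statistical test was used for the case-control comparison, nor how many controls carried such variants. As one plausible illustration, a one-sided Fisher's exact test on carrier counts can be written with the standard library alone. The function name and arguments are hypothetical.

```python
from math import comb


def fisher_exact_greater(case_hits: int, case_total: int,
                         control_hits: int, control_total: int) -> float:
    """One-sided Fisher's exact test p-value.

    Returns the probability, under the hypergeometric null, of seeing
    case_hits or more carriers among the cases given the table margins.
    """
    total = case_total + control_total
    hits = case_hits + control_hits
    denominator = comb(total, case_total)
    max_case_hits = min(hits, case_total)
    tail = sum(
        comb(hits, k) * comb(total - hits, case_total - k)
        for k in range(case_hits, max_case_hits + 1)
    )
    return tail / denominator
```

For the study's data, `case_hits` and `case_total` would be 5 and 202 and `control_total` would be 5400; the control carrier count is not given in the text and would have to be supplied.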


used SeqAnt as a part of this methodology for rapid annotation of variations obtained from mutant, parental, and background strains in a single experiment. By using SeqAnt, we first annotated all the variants into different functional classes. Next, by comparing variants identified in mutant offspring to those found in dbSNP, the unmutagenized background strains, and parental lines, we could immediately distinguish the induced putative causative mutations from preexisting variations or experimental artifacts (Table 6).

| Mutant Line | Functional Class | Total Homozygous Variants | In dbSNP | In Background Strains, Not in dbSNP | Remaining Variants | Replacement Variants Within Mapped Region |
|---|---|---|---|---|---|---|
| AB5 | Replacement | 96 | 80 | 13 | 3 | 1 |
| AB5 | Silent | 157 | 143 | 12 | 2 | - |
| AB5 | UTR | 331 | 191 | 135 | 5 | - |
| AB5 | Intronic | 106 | 87 | 17 | 2 | - |
| AB5 | Intergenic | 54 | 50 | 4 | 0 | - |
| M2 | Replacement | 43 | 8 | 31 | 4 | 2 |
| M2 | Silent | 19 | 11 | 7 | 1 | - |
| M2 | UTR | 73 | 16 | 55 | 2 | - |
| M2 | Intronic | 46 | 18 | 20 | 8 | - |
| M2 | Intergenic | 40 | 4 | 36 | 0 | - |
| X5 | Replacement | 128 | 59 | 63 | 6 | 2 |
| X5 | Silent | 192 | 128 | 63 | 1 | - |
| X5 | UTR | 387 | 231 | 152 | 4 | - |
| X5 | Intronic | 205 | 116 | 86 | 3 | - |
| X5 | Intergenic | 89 | 34 | 55 | 0 | - |
| Y1 | Replacement | 17 | 1 | 14 | 2 | 1 |
| Y1 | Silent | 5 | 0 | 4 | 1 | - |
| Y1 | UTR | 14 | 2 | 11 | 1 | - |
| Y1 | Intronic | 34 | 0 | 31 | 3 | - |
| Y1 | Intergenic | 7 | 0 | 7 | 0 | - |

**Table 6.** Results of filtering homozygous variant sites for each mouse mutant line sequenced.
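The successive filtering summarized in Table 6 amounts to set subtraction: start from the homozygous variants in a mutant line, set aside those in dbSNP, then those shared with background or parental strains, and keep the remainder as putative induced mutations. A minimal sketch follows, assuming each variant is represented by a hashable key such as `(chromosome, position, ref, alt)`; the function and key layout are hypothetical and not part of SeqAnt.

```python
def partition_variants(mutant, dbsnp, background, parental=frozenset()):
    """Split a mutant line's variants into the Table 6 style categories.

    All arguments are sets of hashable variant keys. Background and
    parental variants are pooled, which is an assumption of this sketch.
    """
    in_dbsnp = mutant & dbsnp
    not_in_dbsnp = mutant - dbsnp
    preexisting = not_in_dbsnp & (set(background) | set(parental))
    remaining = not_in_dbsnp - preexisting
    return {
        "total": len(mutant),
        "in_dbsnp": in_dbsnp,
        "in_background_not_dbsnp": preexisting,
        "remaining": remaining,
    }
```

By construction the three categories are disjoint and their sizes sum to the total, which matches the arithmetic of each row in Table 6 (for example, 96 = 80 + 13 + 3 for AB5 replacement variants).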

**Figure 8. Summary of SNV and indel variation discovered at the** *AFF2* **locus in males with ASD.** The frequency of SNVs and indels (minor alleles) in cases is plotted against their level of evolutionary conservation. Most common variation has already been discovered and exists in public databases (blue; circles and diamonds). Most of the rare variation at *AFF2* was discovered in our study and not contained in public databases (red; circles and diamonds).
