**3. An application of SeqAnt 2.0: Targeted next-generation sequencing of**  *NLGN3* **and** *NLGN4X* **in humans**

SeqAnt 2012: Recent Developments in Next-Generation Sequencing Annotation 91

| Functional Class | Total Indels | Indels in dbSNP | Novel Indels | Novel Indels at Evolutionarily Conserved Sites |
|---|---|---|---|---|
| Coding | 0 | 0 | 0 | 0 |
| UTR | 1 | 0 | 1 | 1 |
| Intron | 25 | 7 | 18 | 0 |
| Intergenic | 6 | 1 | 5 | 0 |
| Total | 32 | 8 | 24 | 1 |

**Table 3.** Functional annotation of INDELs at the *NLGN3* and *NLGN4X* loci identified by next-generation sequencing of 144 males with autism.


Targeted sequencing of specific genes or genomic regions is a common experimental design that can benefit from the use of SeqAnt. Here we describe such a study. We sequenced the *NLGN3* and *NLGN4X* loci in a sample of 144 males with a diagnosis of autism. All patient samples were obtained from the multiplex Autism Genetic Resource Exchange (AGRE) [22]. Raw base-calling data generated with an Illumina Genome Analyzer (IGA) were used as input for mapping and alignment. The total amount of sequence generated was 7.04 GB. Paired-end reads were mapped and variants were called using PEMapper (Cutler DJ et al., personal communication). In total, 99.7% of target bases had at least 8X coverage, with a median depth of coverage of 452. We identified a total of 208 sites of variation: 176 single nucleotide polymorphisms and 32 insertions or deletions. Overall levels of variation were estimated at 5.8 × 10⁻⁴ (Θw per site [23]), which matched our expectation for loci on the human X chromosome. We also observed an excess of rare variants, as evidenced by a negative value of Tajima's D test statistic (-0.27 [24]).
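The two summary statistics quoted above follow standard textbook formulas, which can be sketched as below. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the length of the targeted region is not stated in the text and is therefore left as a parameter, and the sample size would be 144 only under the assumption that each male contributes a single X chromosome.

```python
import math


def harmonic_terms(n_chromosomes):
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n_chromosomes))
    a2 = sum(1.0 / i ** 2 for i in range(1, n_chromosomes))
    return a1, a2


def watterson_theta_per_site(segregating_sites, n_chromosomes, n_sites):
    """Watterson's estimator, theta_W = S / a1, scaled by the number of sites.

    n_sites (length of the sequenced target) is an assumption supplied by
    the caller; the text does not report it.
    """
    a1, _ = harmonic_terms(n_chromosomes)
    return segregating_sites / a1 / n_sites


def tajimas_d(pi, segregating_sites, n_chromosomes):
    """Tajima's D from mean pairwise differences (pi) and segregating sites (S).

    pi and S must both be totals over the same region (not per site).
    """
    n = n_chromosomes
    s = segregating_sites
    a1, a2 = harmonic_terms(n)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Toy inputs for illustration only; these are NOT values from the study.
    print(tajimas_d(pi=3.888889, segregating_sites=16, n_chromosomes=10))
```

A negative D arises whenever the mean pairwise difference falls below S / a1, which is the pattern expected when rare variants are in excess.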

Single nucleotide variants (SNVs) and small insertions and deletions (INDELs) were annotated using SeqAnt [5]. Of the 176 SNPs, 68 (39%) had not been reported before (31 in *NLGN3* and 37 in *NLGN4X*; Table 2). Of the 32 INDELs, 24 (75%) had not been reported before (5 in *NLGN3* and 19 in *NLGN4X*; Table 3). As summarized in Figure 7, almost all common variation (>5% frequency in our sample) is contained in dbSNP, whereas most rare variants (<5%) have not been cataloged there.
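The percentages quoted above can be checked directly against the Total rows of Tables 2 and 3. The short sketch below does this arithmetic; the dictionary layout and function name are illustrative choices, not part of SeqAnt.

```python
# Counts taken from the "Total" rows of Table 2 (SNPs) and Table 3 (INDELs).
COUNTS = {
    "SNPs": {"total": 176, "in_dbsnp": 108, "novel": 68},
    "INDELs": {"total": 32, "in_dbsnp": 8, "novel": 24},
}


def novel_percent(counts):
    """Percentage of variants not previously reported in dbSNP."""
    # Sanity check: known and novel variants must add up to the total.
    if counts["in_dbsnp"] + counts["novel"] != counts["total"]:
        raise ValueError("in_dbsnp + novel does not equal total")
    return 100.0 * counts["novel"] / counts["total"]


for kind, counts in COUNTS.items():
    print(f"{kind}: {novel_percent(counts):.0f}% novel")
# SNPs: 39% novel
# INDELs: 75% novel
```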



Using SeqAnt to annotate our sequence data allowed us to quickly draw four main conclusions. First, most common variation is already contained in dbSNP, while much of the rare variation had not been cataloged there. Second, we did not see any novel replacement variants at either *NLGN3* or *NLGN4X*, suggesting that mutations at these loci are rare causes of autism. Third, we identified novel UTR variants at highly evolutionarily conserved sites, which could contribute to autism susceptibility. We focused on this set of variants for direct functional testing. Finally, we identified novel intronic variants at evolutionarily conserved sites that appear to be located in transcription factor binding sites. These variants are being followed up to determine whether they have a regulatory role that affects the expression of *NLGN3* or *NLGN4X*. In summary, SeqAnt 2.0 allowed us to rapidly annotate all the sites of variation in our sample and focus attention on those variants most likely to be autism susceptibility alleles.
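The prioritization described above (novel variants, in UTR or intronic sequence, at evolutionarily conserved sites) amounts to a simple filter over annotated variants. The sketch below is one way to express it, under stated assumptions: the `Variant` fields, the conservation scale, and the 0.9 threshold are all hypothetical and do not reflect SeqAnt's actual output format, and the example records are invented purely to exercise the filter.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str               # hypothetical identifier
    functional_class: str   # e.g. "Replacement", "Silent", "UTR", "Intron", "Intergenic"
    in_dbsnp: bool          # True if already cataloged in dbSNP
    conservation: float     # assumed conservation score in [0, 1]


def prioritize(variants, min_conservation=0.9, classes=("UTR", "Intron")):
    """Keep novel variants in the chosen classes at conserved sites.

    min_conservation is an illustrative cutoff, not a value from the study.
    """
    return [
        v for v in variants
        if not v.in_dbsnp
        and v.functional_class in classes
        and v.conservation >= min_conservation
    ]


if __name__ == "__main__":
    # Invented records for illustration only; not data from the study.
    toy = [
        Variant("var_a", "UTR", in_dbsnp=False, conservation=0.97),
        Variant("var_b", "Intron", in_dbsnp=True, conservation=0.99),
        Variant("var_c", "Intron", in_dbsnp=False, conservation=0.40),
        Variant("var_d", "Intergenic", in_dbsnp=False, conservation=0.95),
    ]
    for v in prioritize(toy):
        print(v.name)
```

Keeping the class list and cutoff as parameters makes it easy to rerun the same filter with a stricter or looser definition of a conserved site.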

90 Bioinformatics


| Functional Class | Total SNPs | SNPs in dbSNP | Novel SNPs | Novel SNPs at Evolutionarily Conserved Sites |
|---|---|---|---|---|
| Replacement | 1 | 1 | 0 | 0 |
| Silent | 3 | 3 | 0 | 0 |
| UTR | 18 | 10 | 8 | 2 |
| Intron | 134 | 78 | 56 | 9 |
| Intergenic | 20 | 16 | 4 | 0 |
| Total | 176 | 108 | 68 | 11 |

**Table 2.** Functional annotation of SNPs at the *NLGN3* and *NLGN4X* loci identified by next-generation sequencing of 144 males with autism.


**Figure 7. Summary of SNV and indel variation discovered at the** *NLGN3* **and** *NLGN4X* **loci in males with ASD.** The frequency of SNVs and INDELs (minor alleles) in cases is plotted against their level of evolutionary conservation. Most common variation has already been discovered and exists in public databases (blue; circles and diamonds); most of the rare variation at both loci was discovered in our study and is not contained in public databases (red; circles and diamonds).
