#### *3.3.2. Genome sequencing and assembly*

The Illumina HiSeq-2000 platform was chosen for *Xiphophorus* genome sequencing. Sequencing libraries with different insert sizes (300 bp, 500 bp, 3 kb, and 8 kb) were prepared. Libraries with different insert sizes yield paired-end reads that span different lengths of the genome, which allows gap sizes to be estimated at higher levels of assembly. Over 700 million (*X. couchianus*) and 360 million (*X. hellerii*) 100 bp paired-end short sequence reads were obtained from the sequencer.

The genomes of *X. couchianus* and *X. hellerii* were constructed in three stages: contig, scaffold, and chromosome. The contigs were assembled *de novo* to maximally capture any sequences that are not present in *X. maculatus*, while the scaffolds and chromosomes were assembled using the *X. maculatus* genome as a reference to guide assembly.

The first-stage contig assembly was carried out with ALLPATHS using only the Illumina sequencing reads. This step generated contig-level assemblies with N50 values of 60 kb and 30 kb for *X. couchianus* and *X. hellerii*, respectively.
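For readers unfamiliar with the N50 statistic quoted here, the following minimal sketch shows how it is commonly computed: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the total assembly size. The function name and the toy lengths are illustrative assumptions, not values from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical): total = 300, half = 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half, so this prints 70
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the sequence is correct.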

These contigs were further grouped into scaffolds using the *X. maculatus* scaffold assembly as a reference. The *de novo* assembled contigs of *X. couchianus* and *X. hellerii*, as well as the sequencing reads, were aligned to the *X. maculatus* genome scaffold assembly using the multi-phase aligner SRprism (ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/srprism/). The sequence gaps between consecutive contigs were filled with long-insert paired-end Illumina reads that bridge the ends of the contigs flanking each gap. Scaffolding of contigs and gap filling increased the N50s of the two assemblies to 1.8 Mb and 1.6 Mb, respectively.
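The way read pairs from a long-insert library constrain the size of a gap can be illustrated with a small sketch. It assumes that one mate maps to the contig on the left of the gap and the other to the contig on the right, and that the library insert size is known. The coordinate convention, the function name, and the example numbers are hypothetical; this is not the procedure used by the assembly software in this study.

```python
from statistics import median


def estimate_gap(pairs, insert_size, left_contig_len):
    """Estimate the gap between two adjacent contigs from bridging read pairs.

    pairs: list of (left_start, right_end) tuples, where
        left_start is the 0-based start of the forward mate on the left contig,
        right_end is the end (exclusive) of the reverse mate on the right contig.
    insert_size: nominal fragment length of the library, in bp.
    left_contig_len: length of the left contig, in bp.
    """
    estimates = []
    for left_start, right_end in pairs:
        covered_left = left_contig_len - left_start   # fragment bases inside left contig
        covered_right = right_end                     # fragment bases inside right contig
        estimates.append(insert_size - covered_left - covered_right)
    # The median is less sensitive than the mean to a few mis-mapped pairs.
    return median(estimates)


# Three hypothetical pairs from a 3 kb library bridging a 10 kb left contig.
pairs = [(9000, 1500), (8800, 1200), (9500, 2100)]
print(estimate_gap(pairs, insert_size=3000, left_contig_len=10000))  # prints 500
```

A negative estimate would suggest that the two contigs overlap instead of being separated by a gap.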

The chromosome-level genomes were constructed by aligning the *de novo* assembled contigs to the *X. maculatus* chromosome assembly using Nucmer3.0 from the MUMmer 3 package (http://mummer.sourceforge.net). For each species, the contig sequences and the locations of their alignments on the *X. maculatus* chromosomes were recorded. A customized Perl script then used these sequences and the alignment information to organize the contigs into chromosomes.
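The customized Perl script is not reproduced in this chapter. The sketch below, written in Python, only illustrates the general idea of reference-guided ordering under several assumptions: each contig is placed by its longest alignment, contigs are sorted by reference start position, reverse-strand contigs are reverse-complemented, and neighbouring contigs are joined by a fixed run of `N`. The record layout, the function names, and the gap length are all hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_chromosomes(contigs, alignments, gap="N" * 100):
    """Order contigs along reference chromosomes.

    contigs: dict mapping contig id -> sequence.
    alignments: list of (contig_id, chrom, ref_start, aln_len, strand),
        with strand given as "+" or "-".
    gap: placeholder inserted between neighbouring contigs (assumed fixed).
    """
    # Keep only the longest alignment for each contig.
    best = {}
    for contig_id, chrom, ref_start, aln_len, strand in alignments:
        if contig_id not in best or aln_len > best[contig_id][2]:
            best[contig_id] = (chrom, ref_start, aln_len, strand)

    # Group the placed contigs by reference chromosome.
    by_chrom = defaultdict(list)
    for contig_id, (chrom, ref_start, _, strand) in best.items():
        by_chrom[chrom].append((ref_start, contig_id, strand))

    # Sort by reference position and concatenate in the reference orientation.
    chromosomes = {}
    for chrom, placed in by_chrom.items():
        parts = []
        for _, contig_id, strand in sorted(placed):
            seq = contigs[contig_id]
            parts.append(seq if strand == "+" else reverse_complement(seq))
        chromosomes[chrom] = gap.join(parts)
    return chromosomes


# Hypothetical toy input.
toy_contigs = {"c1": "AAAA", "c2": "CCGA", "c3": "ACGT"}
toy_alignments = [
    ("c1", "chr1", 500, 4, "+"),
    ("c2", "chr1", 10, 4, "-"),
    ("c2", "chr2", 5, 2, "+"),   # shorter secondary hit, ignored
    ("c3", "chr2", 1, 4, "+"),
]
print(build_chromosomes(toy_contigs, toy_alignments, gap="NN"))
```

Contigs without any alignment to the reference are left out of the chromosomes in this sketch; a real pipeline would need to report them separately, since they may carry species-specific sequence.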

#### *3.3.3. Genome annotation*

To annotate the newly assembled *X. couchianus* and *X. hellerii* genomes, two methods were used, the Rapid Annotation Transfer Tool (RATT) and *de novo* transcriptome assembly, and their results were compared.

Transcript sequences and associated functional annotations can be transferred between closely related species. A modified gene annotation method, RATT, was applied using the *X. maculatus* genome and gene models as a reference to quickly transfer the genome annotation [27]. Since the *X. maculatus* genome was already available, using RATT to transfer annotation minimizes the computational and human resources required for genome annotation. The genomic scaffold sequences of *X. couchianus* and *X. hellerii* were each used as the query and aligned to the well-annotated *X. maculatus* genome using Nucmer3.0 with the parameters implemented by RATT for annotation transfer. To avoid frame shifts, synteny between each query species and the reference was established and insertions/deletions were identified. *X. maculatus* gene models were then transferred to both query species and corrected. Of the 20,482 gene models annotated in the reference *X. maculatus* genome, 20,300 and 20,325 were transferred to *X. couchianus* and *X. hellerii*, respectively (Table 4).
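The transfer counts reported above can be expressed as fractions of the reference gene models. The short sketch below does this arithmetic; the function name is an illustrative assumption, and the only inputs are the three counts given in the text.

```python
def transfer_rate(transferred, reference_total):
    """Fraction of reference gene models transferred to a query genome."""
    return transferred / reference_total


REFERENCE_MODELS = 20482
transferred_models = {"X. couchianus": 20300, "X. hellerii": 20325}

for species, count in transferred_models.items():
    rate = transfer_rate(count, REFERENCE_MODELS)
    missing = REFERENCE_MODELS - count
    print(f"{species}: {rate:.2%} transferred, {missing} not transferred")
# X. couchianus: 99.11% transferred, 182 not transferred
# X. hellerii: 99.23% transferred, 157 not transferred
```

In both species, fewer than 1% of the reference gene models failed to transfer.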

To provide a comparison for the RATT annotation transfer, the *X. couchianus* and *X. hellerii* genomes were also annotated by a different method that uses *de novo* assembled transcriptomes. This method is independent of a reference genome. Briefly, RNA samples from one-month-old whole fish of *X. hellerii* and *X. couchianus* and from a collection of tissues of mature individuals of each species were sequenced on the Illumina GAIIx platform as 60 bp paired-end reads and on the HiSeq-2000 platform as 100 bp paired-end reads. *De novo* transcript assembly and reporting of putative transcripts were performed using Velvet v1.1.05 and Oases v0.1.22 [28, 29]. The transcriptome assembly resulted in 110,604 and 242,675 transcripts for *X. couchianus* and *X. hellerii*, respectively.


**Table 4.** Comparisons between reference-based annotation and *de novo*-based annotation

When the two annotation methods are compared in terms of transcriptome quality, the *de novo* method produced much larger transcriptomes, both in number of transcripts and in final assembly size (Table 4). Many of the transcripts produced this way are unverified or redundant splice isoforms of the same gene. In contrast, the transcriptomes produced by RATT gene model transfer are similar to the reference [27]. In addition, both methods produced comparable N50s; however, the reference-based method gave a longer average transcript length, suggesting that it is superior in this respect.

In conclusion, *de novo* assembly of a species' transcriptome and its use in biological inference studies are appropriate when a reference genome is not available, assuming tissue diversity is adequately captured. Nonetheless, reference-based gene model transfer is a reliable, economical, and efficient means to annotate closely related species.
