## **3. Analysis of the repetitive component of the genome**

### **3.1. Assembly of olive repetitive sequences**

Some of the biggest technical challenges in sequencing eukaryotic genomes are caused by repetitive DNA (Alkan et al., 2011): that is, sequences that are similar or identical to sequences elsewhere in the genome.

The first step in characterizing and sequencing a large genome should be a genome survey, which provides important information about common repeat sequences. Next-generation sequencing (NGS) data are particularly well suited to identifying sequences present in many copies per genome, by assembling reads according to their sequence.

Olive Tree Genomic 137


| Split | Nr. of subpackages | Subpackage coverage | Nr. of assembled supercontigs | Mean length (nt) | Mean nr. of mapped reads | Average coverage | N50 (nt) |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.309× | 44336 | 235.6 | 19.61 | 6.07 | 243 |
| 1 | 4 | 0.327× | 78983 | 200.5 | 19.96 | 6.01 | 201 |
| 2 | 8 | 0.163× | 50698 | 204.2 | 31.72 | 9.56 | 204 |
| 3 | 32 | 0.041× | 22749 | 252.3 | 68.58 | 14.35 | 265 |
| 4 | 244 | 0.005× | 14748 | 240.6 | 218.57 | 74.42 | 258 |
| 5 | 489 | 0.003× | 11819 | 223.6 | 212.77 | 67.99 | 239 |

**Table 1.** Characteristics of supercontig sets obtained by CLC Bio Workbench and CAP3 assembly after different splitting of Illumina reads.

The olive genome is largely uncharacterized, despite the growing importance of this tree as an oil crop. Among repeated sequences, the best characterized are tandem repeats belonging to four families, isolated from genomic libraries and, in some instances, localized on olive chromosomes by cytological hybridization (Katsiotis et al., 1998; Minelli et al., 2000; Lorite et al., 2001; Contento et al., 2002). Putative retrotransposon fragments have also been isolated and sequenced (Stergiou et al., 2002; Natali et al., 2007), but a comprehensive picture of the RE landscape in the olive genome is still lacking.

We performed an in-depth analysis of the repetitive component of the olive genome using NGS techniques (454 and Illumina). We used around 25 million Illumina paired-end reads of 75 nt, corresponding to 1.8 billion nt and 1.3× coverage, and around 8 million 454 single reads with a mean read length of 407 nt, corresponding to a total of 3.3 billion nt and 2.3× coverage.
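The read counts, read lengths and coverages quoted above can be cross-checked with simple arithmetic. The sketch below is only illustrative: the genome size is not stated in the text, so the value printed here is merely what the reported figures imply, and the function names are hypothetical.

```python
def sequenced_bases(n_reads: int, mean_read_len: float) -> float:
    """Total nucleotides sequenced = number of reads x mean read length."""
    return n_reads * mean_read_len


def implied_genome_size(total_bases: float, coverage: float) -> float:
    """Genome size implied by a reported coverage (coverage = bases / genome size)."""
    return total_bases / coverage


# Approximate read counts and lengths as reported in the text.
illumina_bases = sequenced_bases(25_000_000, 75)   # close to the reported 1.8 billion nt
roche454_bases = sequenced_bases(8_000_000, 407)   # close to the reported 3.3 billion nt

# Genome size implied by the reported totals and coverages (an inference, not a stated value).
illumina_genome = implied_genome_size(1.8e9, 1.3)
roche454_genome = implied_genome_size(3.3e9, 2.3)

print(f"Illumina: {illumina_bases:.3g} nt sequenced, implied genome {illumina_genome:.3g} nt")
print(f"454:      {roche454_bases:.3g} nt sequenced, implied genome {roche454_genome:.3g} nt")
```

Both datasets imply a genome size of roughly 1.4 billion nt, which suggests the reported coverages are mutually consistent.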

Although large, this amount of sequencing data is not sufficient for whole-genome assembly, but it enables representative sampling of elements present in the genome in multiple copies. Moreover, the proportion of individual sequences in the reads reflects their genomic abundance, thus providing a simple and reliable means of quantifying repetitive elements (Macas et al., 2007).
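The quantification principle described above (read proportion reflects genomic abundance) can be sketched as follows. All numbers in the example are hypothetical and chosen only for illustration; the genome size and element length in particular are assumptions, not values from this study.

```python
def genome_proportion(reads_in_repeat: int, total_reads: int) -> float:
    """Fraction of the genome occupied by a repeat, estimated as its share of the reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads_in_repeat / total_reads


def estimated_copy_number(proportion: float, genome_size_nt: float,
                          element_length_nt: float) -> float:
    """Copies per genome = nucleotides occupied by the repeat / length of one element."""
    return proportion * genome_size_nt / element_length_nt


# Hypothetical example: 250,000 of 25 million reads match one repeat family.
proportion = genome_proportion(250_000, 25_000_000)
# Assumed genome size (1.4e9 nt) and assumed element length (5,000 nt).
copies = estimated_copy_number(proportion, 1.4e9, 5_000)

print(f"genome proportion: {proportion:.2%}, estimated copies: {copies:.0f}")
```

This estimate assumes that reads are sampled uniformly along the genome and that reads are of similar length; biases in library preparation would distort it.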

In our experiments, we performed de novo repeat identification and reconstruction by direct assembly of the reads. Because of the relatively low genome coverage of the sequencing, most of the resulting contigs do not represent specific genomic loci; instead, they are probably composed of reads derived from multiple copies of repetitive elements, and thus represent consensus sequences of genomic repeats (Novak et al., 2010). Even though the exact form of such a consensus does not necessarily occur in the genome, this representation of repetitive elements has been shown to be sufficiently accurate to enable amplification of full-length repetitive elements by PCR (Swaminathan et al., 2007).

We assembled Illumina and 454 sequence reads by overlapping DNA sequence fragments, using CLC-BIO and CAP3 as assemblers. Despite recent progress, a major challenge remains when reads map to multiple locations (so-called multi-reads). The occurrence of multi-reads depends strongly on read length: they are most common in the Illumina sequence packages and less common in the 454 sequence packages, in which read length is rapidly approaching that achieved by classical Sanger sequencing, though at a higher cost than Illumina.

Sequencing coverage strongly affects the ability to recover repeated sequences. The higher the coverage, the more likely it is that multi-reads cannot be resolved and are discarded. For example, it has been demonstrated in pea that very low genome coverage (0.008×) is sufficient to recover repetitive sequences present in at least 1000 copies (Macas et al., 2007). Hence, we decided to carry out the final assembly of the Illumina reads after splitting the read dataset into subpackages of different genome coverage.
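The reasoning behind the pea example can be made explicit with a small calculation. As a rough sketch, assuming reads are sampled uniformly and all copies of a repeat family collapse onto a single consensus, the expected read depth on that consensus is the genome coverage multiplied by the copy number. The function name is hypothetical, and no assembler-specific depth threshold is implied.

```python
def expected_consensus_depth(genome_coverage: float, copy_number: int) -> float:
    """Expected read depth on one collapsed consensus of a repeat family.

    Assumes uniform sampling and that all copies pile up on a single consensus.
    """
    return genome_coverage * copy_number


# Pea example from the text: 0.008x genome coverage, repeat present in 1000 copies.
pea_depth = expected_consensus_depth(0.008, 1000)
print(f"0.008x coverage, 1000 copies: about {pea_depth:.0f}x depth on the consensus")

# The same 1000-copy repeat at a higher genome coverage piles up far more reads.
for coverage in (0.008, 0.1, 1.3):
    depth = expected_consensus_depth(coverage, 1000)
    print(f"{coverage}x genome coverage: about {depth:.0f}x consensus depth")
```

Under these assumptions, a repeat that is comfortably covered at very low genome coverage becomes extremely deep at full coverage, which is the situation in which unresolved multi-reads are discarded.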

136 Olive Germplasm – The Olive Cultivation, Table Olive and Olive Oil Industry in Italy


In a first assembly, we assembled the complete pool of Illumina reads using CLC-BIO and subsequently the CAP3 assembler. In other experiments, the pool of Illumina reads was split into 8, 16, 32, 250, or 500 subpackages, which were assembled separately (indicated as split 1, 2, 3, 4, and 5, respectively); for each split, the resulting contigs were in turn assembled using CAP3, yielding 210,063 supercontigs.
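The text does not say how reads were assigned to subpackages. One plausible way, shown here purely as a sketch, is to shuffle the reads and deal them out in equal-sized groups. The function name, the random assignment and the fixed seed are all assumptions; for paired-end data each list item would need to be a read pair so that mates stay together.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subpackages(reads: Sequence[T], n_subpackages: int,
                           seed: int = 0) -> List[List[T]]:
    """Randomly partition reads into n_subpackages groups of near-equal size."""
    if n_subpackages < 1:
        raise ValueError("n_subpackages must be at least 1")
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle, input left untouched
    # Deal the shuffled reads out round-robin: group i gets items i, i+n, i+2n, ...
    return [shuffled[i::n_subpackages] for i in range(n_subpackages)]


# Toy example with placeholder read identifiers.
reads = [f"read_{i}" for i in range(1000)]
subpackages = split_into_subpackages(reads, 8)
print([len(package) for package in subpackages])
```

Each subpackage would then be assembled on its own, so that each assembly sees only a fraction of the total genome coverage.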

All 75-nt Illumina reads were then mapped to all supercontigs (Table 1). The finer splits, i.e. those with more subpackages and hence lower coverage per subpackage, recover the most redundant supercontigs; these are not found in the coarser splits, where their coverage is too high and multi-reads therefore occur. Because redundancy differed among the subpackages, we decided to use all supercontigs in the final assembly.


**Table 1.** Characteristics of supercontig sets obtained by CLC Bio Workbench and CAP3 assembly after different splitting of Illumina reads.
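The subpackage coverages reported in Table 1 can be checked against a simple relationship: the coverage of the unsplit dataset divided by the number of subpackages. The sketch below uses the subpackage counts and coverages from Table 1 and assumes equal-sized subpackages; the names are illustrative only.

```python
TOTAL_COVERAGE = 1.309  # coverage of the unsplit Illumina dataset (split 0 in Table 1)

# split -> (number of subpackages, subpackage coverage reported in Table 1)
TABLE_1 = {
    1: (4, 0.327),
    2: (8, 0.163),
    3: (32, 0.041),
    4: (244, 0.005),
    5: (489, 0.003),
}


def subpackage_coverage(total_coverage: float, n_subpackages: int) -> float:
    """Coverage of one subpackage, assuming reads are divided equally."""
    return total_coverage / n_subpackages


# Absolute difference between the computed and the reported coverage for each split.
deviations = {
    split: abs(subpackage_coverage(TOTAL_COVERAGE, n) - reported)
    for split, (n, reported) in TABLE_1.items()
}

for split, (n, reported) in TABLE_1.items():
    computed = subpackage_coverage(TOTAL_COVERAGE, n)
    print(f"split {split}: {n} subpackages, computed {computed:.4f}x, reported {reported}x")
```

The computed and reported values agree to within 0.001×, which is the rounding precision of the table.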

We did not subdivide the 454 sequence reads, reasoning that their greater length compared with Illumina reads would allow highly repeated sequences to be recovered as well. In longer reads, multi-reads are naturally less frequent.

All Illumina- and 454-derived supercontigs and contigs longer than 80 nt were masked against an in-house database of chloroplast and mitochondrial sequences using RepeatMasker, and organellar sequences were removed. A final assembly was then performed with CAP3 across all datasets, i.e. six Illumina datasets (splits 0-5) and one 454 dataset. The resulting whole-genome dataset included 238,914 supercontigs, with a mean length of 667.9 nt and an N50 of 1,331 nt.
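For readers unfamiliar with the N50 statistic, the sketch below shows the usual definition: the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The contig lengths are toy values, not data from this study, and the length filter merely mirrors the 80 nt threshold mentioned above.

```python
from typing import Iterable, List


def filter_by_length(lengths: Iterable[int], min_length: int = 80) -> List[int]:
    """Keep only contigs strictly longer than min_length nucleotides."""
    return [length for length in lengths if length > min_length]


def n50(lengths: Iterable[int]) -> int:
    """Contig length at which the running total (longest first) reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for completeness


# Toy contig lengths; the 60 nt contig is dropped by the length filter.
toy_lengths = filter_by_length([60, 100, 200, 300, 400, 500])
toy_mean = sum(toy_lengths) / len(toy_lengths)
print(f"contigs kept: {len(toy_lengths)}, mean length: {toy_mean:.1f}, N50: {n50(toy_lengths)}")
```

Because N50 is weighted towards long contigs, it is typically larger than the mean length when the length distribution is skewed, as in the final assembly described above.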
