microsatellites and SCAR markers on a Frantoio x Kalamata progeny (Wu et al., 2004) and, more recently, a new map has been derived from SSR, AFLP, ISSR, RAPD and SCAR markers scored on 140 F1 progeny from a cross between the cultivars Picholine Marocaine and Picholine du Languedoc (El Aabidine et al., 2010). In any case, no QTLs of agronomical interest for olive breeding have been detected.

**2. Genome sequencing and assembly**

The Italian project OLEA is an initiative, supported mainly by the Italian Ministry of Agricultural, Food and Forestry Policies, dedicated to the development of genomic resources for olive. It aims to identify, isolate and determine the function of genes associated with both vegetative and reproductive phenotypes. Knowledge of the structural genetic basis is therefore the first step towards identifying the relevant differences in the control of expression of the same sets of genes among different genotypes. The development of new molecular tools through structural and functional genomics, together with proteomics, metabolomics, mapping and genotyping, will make it possible to advance molecular breeding of olive, to bring out the under-exploited natural diversity present in the *Olea* complex and in olive germplasm, to dissect the molecular mechanisms underlying traits related to high-value compounds and those involved in plant-environment interactions, and to establish a platform for rapid and cost-effective transfer of knowledge and technologies.

The olive genome is being sequenced from the cultivar Leccino using a combination of Next Generation Sequencing (NGS) technologies and a combination of assembly approaches. The Whole Genome Shotgun (WGS) approach is being pursued using Illumina and 454 sequencing with a combination of long single reads, paired-end reads and mate pairs, until a coverage of at least 40 genome equivalents is reached. The assembly is being performed using the ABySS and CLC assemblers.

In parallel, a BAC pooling approach is being used to sequence random pools of 384 BACs with Illumina paired-end reads. BACs amounting to approximately 3-4 genome equivalents will be sequenced, with each BAC clone sequenced at an average coverage of 50X. The BAC approach has two advantages: on the one hand, each BAC pool is much smaller than the total genome, which reduces the assembly complexity; on the other hand, within each BAC pool we should not face the problem of sequence heterozygosity between the maternally and paternally derived genomes, which strongly affects WGS approaches. The advantage of the WGS approach is its much more complete and homogeneous coverage of the entire genome. The two assemblies, the WGS assembly and the pooled BAC assembly, will therefore be combined using a proprietary algorithm (GAM) to produce a consensus assembly. The consensus assembly will finally be anchored to the genetic map through the use of high-throughput genotyping technologies.

As of today we have produced all of the data needed for the Whole Genome Shotgun component. We have produced approximately 90 Gbp of Illumina sequence data, corresponding to a nominal coverage of 60X of the olive genome. The Illumina sequences were obtained from two paired-end libraries with 500-600 bp inserts, sequenced on the Illumina Genome Analyser IIx to produce 150 bp reads for a total coverage of 43X (65 Gbp), and from one paired-end library with 1000 bp inserts, sequenced on the Illumina HiSeq2000 system to produce 100 bp reads for the remaining 17X coverage (25 Gbp). In addition, two mate-pair libraries with 3 Kbp inserts were constructed and sequenced on the HiSeq2000 to produce 100 bp reads and reach a coverage of 4 genome equivalents (6 Gbp).

We have also produced approximately 18 Gbp of Roche-454 sequence data, corresponding to approximately 12X coverage. Of these, 12 Gbp were obtained as long single reads, of which approximately one third were 400 bp reads (FLX TITANIUM technology) and two thirds were 700 bp reads (FLX XL PLUS technology). Additionally, 6.2 Gbp of sequence data were obtained as paired-end reads from 3 libraries with 3 Kbp inserts (3.8 Gbp) and 10 libraries with 8 Kbp inserts (4.4 Gbp).
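The nominal coverage figures quoted above follow from dividing the sequence yield by the genome size. The sketch below reproduces that arithmetic. The genome size of 1.5 Gbp is an assumption, inferred only from the statement that 90 Gbp corresponds to 60X; the function and variable names are illustrative.

```python
# Assumption: a genome size of about 1.5 Gbp, inferred from "90 Gbp = 60X".
# The text does not state the genome size explicitly.
ASSUMED_GENOME_SIZE_GBP = 1.5


def nominal_coverage(yield_gbp, genome_size_gbp=ASSUMED_GENOME_SIZE_GBP):
    """Return nominal coverage (genome equivalents) for a given yield in Gbp."""
    return yield_gbp / genome_size_gbp


# Yields in Gbp as reported in the text.
datasets = {
    "Illumina paired-end, 500-600 bp inserts (GA IIx)": 65.0,
    "Illumina paired-end, 1000 bp inserts (HiSeq2000)": 25.0,
    "Illumina mate pairs, 3 Kbp inserts (HiSeq2000)": 6.0,
    "Roche-454, all reads": 18.0,
}

coverage = {name: nominal_coverage(gbp) for name, gbp in datasets.items()}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}X")
```

Under this assumed genome size the values round to the 43X, 17X, 4X and 12X reported in the text.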

The 454 single reads and the Illumina paired-end reads are being used in a traditional WGS assembly. The Illumina mate-pair and the 454 paired-end sequences, i.e. all the sequences obtained from larger inserts, will be used to scaffold the contigs assembled from the shorter-insert reads into larger assemblies and to try to overcome the assembly problems posed by repetitive elements. Since many of the transposable elements in plant genomes are larger than 3 Kbp, the larger inserts are going to be of crucial importance.
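The reasoning about insert sizes can be made concrete with a simplified geometric check: a read pair can bridge a repeat only if its two reads land in unique sequence on either side of it. The sketch below is not part of the project's pipeline. The function name, the 100 bp anchor requirement and the 5 Kbp repeat length are all hypothetical values chosen for illustration; the insert sizes are those of the libraries described in the text.

```python
def can_bridge_repeat(insert_size_bp, repeat_length_bp, anchor_bp=100):
    """Return True if a read pair can span a repeat of the given length.

    Simplifying assumption: each read needs at least `anchor_bp` of unique
    sequence outside the repeat, so the fragment must be at least as long as
    the repeat plus two anchors.
    """
    return insert_size_bp >= repeat_length_bp + 2 * anchor_bp


# Insert sizes (bp) of the libraries described in the text.
library_inserts = {
    "Illumina paired-end 500-600 bp": 500,
    "Illumina paired-end 1000 bp": 1000,
    "Illumina mate pairs / 454 paired-end 3 Kbp": 3000,
    "454 paired-end 8 Kbp": 8000,
}

# Hypothetical transposable element of 5 Kbp, chosen only for illustration.
HYPOTHETICAL_REPEAT_BP = 5000

bridging = {
    name: can_bridge_repeat(size, HYPOTHETICAL_REPEAT_BP)
    for name, size in library_inserts.items()
}

for name, ok in bridging.items():
    print(f"{name}: {'bridges' if ok else 'does not bridge'} a 5 Kbp repeat")
```

With these illustrative numbers only the 8 Kbp library bridges the repeat, which is consistent with the emphasis placed on the larger inserts.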

We have performed a number of assemblies to test different strategies and to obtain a first rough draft of the olive genome. We tested assemblies using the Illumina data only, as well as using Illumina and 454 data together. All data sets were first filtered for low-quality sequences and for chloroplast DNA contamination and then assembled using the CLCBio assembler. Using only the Illumina data (53X coverage after filtering), we obtained an assembly with a total size of 1.1 Gbp and an N50 of 1.7 Kbp. Scaffolding this assembly with the mate-pair and paired-end information using the SSPACE tool increased the N50 to 2.3 Kbp. The addition of an initial set of 454 data (3.5 genome equivalents after filtering, single reads only) increased the total assembly size to 1.5 Gbp and the N50 of contigs and scaffolds to 2.8 and 3.7 Kbp, respectively. We expect that adding the remaining 454 sequences from the large-insert libraries (3 and 8 Kbp inserts) will greatly improve the assembly by considerably increasing the N50 of the scaffolds.

However, because of the problems posed by the high level of sequence heterozygosity in the genome of cultivar Leccino, we consider the sequencing of BAC pools a necessary component of our strategy for obtaining a satisfactory assembly. The main obstacle here is the difficulty of obtaining BAC libraries with large inserts (>100 Kbp) from cultivar Leccino. Should this not prove feasible, we will resort to using a fosmid library (40 Kbp inserts).
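The N50 values reported for the test assemblies use the standard assembly statistic: the length of the contig or scaffold at which the cumulative length, taken from the longest sequence downwards, first reaches half of the total assembly size. A minimal sketch of that calculation follows. The function name and the toy lengths are illustrative and are not taken from the olive assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the summed
    length of all sequences.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths in Kbp): five contigs, 30 Kbp in total.
contigs = [10, 8, 6, 4, 2]
print("contig N50:", n50(contigs))

# Joining the 6 and 4 Kbp contigs into one scaffold leaves the total
# unchanged but raises the N50.
scaffolds = [10, 10, 8, 2]
print("scaffold N50:", n50(scaffolds))
```

The toy example shows why scaffolding raises N50 without adding sequence: joining contigs shifts more of the total length into longer pieces.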
