56 Bioinformatics

sequencing of the human genome revealed a controversial number of interrupted genes (25,000-32,000) with their regulatory sequences [1, 2], representing about 2% of the genome. These genes are immersed in a giant sea of different types of non-coding sequences, which make up around 98% of the genome. The non-coding regions are characterized by many kinds of repetitive DNA sequences; almost 10.6% of the human genome consists of Alu sequences, a type of SINE (short interspersed element) [3]. Alu elements are not randomly distributed throughout the genome but are biased toward gene-rich regions [5]. Although they can act as insertional mutagens, the vast majority appear to be genetically inert [6]. LINEs, MIR, MER, LTRs, DNA transposons and introns are other kinds of non-coding sequences, which together make up about 86% of the genome. In addition, some of these sequences overlap one another, for example the CpG islands (CGI), which complicates analysis of the genomic landscape. In turn, each chromosome is characterized by some particular properties of structure and function.

**3. Human lentiviral integration**

The two closely related human lentiviruses HIV-1 and HIV-2 are responsible for the 21st century AIDS pandemic [7-9]. Most current therapeutic approaches use combinations of antiviral drugs that inhibit the activities of viral enzymes such as reverse transcriptase, protease and integrase; nevertheless, none of them has succeeded in controlling infection [10-12]. One option to overcome this problem is to explore new therapies based on the study of the integration dynamics of human lentiviruses, because this would help to explain the alterations of cellular homeostasis that occur when a cell is infected [13]. Additionally, analysis of the integration process is important in HIV-induced disease and in lentivirus-based gene therapy [14].

Integration is a crucial step in the retroviral life cycle, permitting the incorporation of viral cDNA into the host genome [15-17]. cDNA integration is mediated by the virally encoded integrase enzyme and other viral and cellular proteins in a molecular complex called the preintegration complex (PIC) [18]. One cellular factor involved in HIV targeting is the lens epithelium-derived growth factor (LEDGF) [19, 20], which binds to both HIV-1 integrase and chromatin, tethering the viral integration machinery to chromatin [21]. HIV-1 integration has been extensively studied using a wide array of molecular biology, biochemistry and structural biology approaches [22]. However, it is critical to directly identify the viral distribution within the human genome in order to understand, at the genomic level, the relationship between the composition and topology of chromatin and target site selection.

As shown by previous studies, target site selection for integration is not entirely random [23-26]; there are pronounced favored and disfavored chromosomal regions, which differ among retroviruses [27]. The favored regions of host genomes are characterized by a high frequency of integration events, are known as "hotspots", and are distributed along the genome of the host cell [28, 29]. In HIV-1, most proviruses are located in transcriptionally active regions, not only in exons and introns but also in sequences around transcription start sites [30, 31].

**4. Data mining and statistical analyses**

A total of 352 human genome sequences flanking the 5' LTR of human lentiviruses (176 sequences of HIV-1 [27] and 176 of HIV-2 [33]) were obtained from GenBank (NCBI) under accession numbers CL529260 to CL529766 (HIV-1) and DQ632388 to DQ632563 (HIV-2). Using the BLAST algorithm (NCBI; *http://blast.ncbi.nlm.nih.gov/Blast.cgi*), the sequences were aligned to the draft human genome (hg18), and those that met the following criteria were considered authentic integration sites: (i) contained the terminal 3' end of the HIV-1 or HIV-2 LTR; (ii) had matching genomic DNA within five bp of the end of the viral LTR; (iii) had at least 95% homology to human genomic sequence across the entire sequenced region; (iv) matched a single human genetic locus with at least 95% homology across the entire sequenced region; and (v) had a minimum size of 50 bp.
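The five acceptance criteria listed above can be read as a simple filter over the alignment results. The following is a minimal sketch of that filter, not the pipeline actually used in the study: the record type `CandidateSite`, its field names and the function `is_authentic` are hypothetical, and it assumes that the per-sequence values (presence of the LTR end, offset of the genomic match, percent identity, number of matching loci, length) have already been extracted from the BLAST output.

```python
from dataclasses import dataclass


@dataclass
class CandidateSite:
    """Hypothetical summary of one flanking sequence after alignment."""
    has_ltr_end: bool         # (i) contains the terminal 3' end of the LTR
    genomic_offset_bp: int    # (ii) distance from LTR end to first matching genomic base
    percent_identity: float   # (iii) identity to human sequence over the whole read
    loci_at_threshold: int    # (iv) number of loci matched at or above the threshold
    length_bp: int            # (v) length of the sequenced region


def is_authentic(site: CandidateSite,
                 max_offset_bp: int = 5,
                 min_identity: float = 95.0,
                 min_length_bp: int = 50) -> bool:
    """Return True only if all five criteria from the text are satisfied."""
    return (
        site.has_ltr_end
        and site.genomic_offset_bp <= max_offset_bp
        and site.percent_identity >= min_identity
        and site.loci_at_threshold == 1
        and site.length_bp >= min_length_bp
    )


# Illustrative records only; the values are invented for the example.
good = CandidateSite(True, 2, 98.5, 1, 120)
repeat_hit = CandidateSite(True, 2, 98.5, 3, 120)   # matches several loci
too_short = CandidateSite(True, 0, 99.0, 1, 40)     # below 50 bp
kept = [s for s in (good, repeat_hit, too_short) if is_authentic(s)]
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the accepted set is to, for example, the 95% cutoff.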

BLAST (NCBI) and the BLAT algorithm of the Genome Browser (University of California, Santa Cruz, Human Genome Project) (*http://www.genome.ucsc.edu/*) were used to obtain information about protein-coding genes (RefSeq), transcripts, CpG islands and repetitive elements. Additional genomic information, including molecular process and molecular function, was obtained from Gene Ontology (GO) (*http://www.geneontology.org/index.shtml*), GeneCards (*http://www.genecards.org/cgi-bin/carddisp.pl*) and Entrez Gene (*http://www.ncbi.nlm.nih.gov/ncbi/geneentrez*). The chromosomal localization of the HIV-1 and HIV-2 proviruses was identified using the G-banding pattern of each chromosome, as proposed by the Paris Conference (1971) [35], updated to 850-band resolution. As the highest number of HIV-1 and HIV-2 proviruses was recorded on chromosome 17, an extensive characterization of its chromatin structure was performed using the genomic information available in several tracks of the Genome Browser: CpG islands and the distribution of their methylation; methylation of histone H3 at lysines 4 and 27, from the ENCODE histone modification ChIP-seq data of the University of Washington; nucleosome occupancy probabilities from A375 cells (Washington University); and DNase I hypersensitivity (ENCODE, University of Washington) in GM12878 cells.

All statistical analyses were performed using STATISTICA 7 [35]. The Mann-Whitney test (Wilcoxon rank-sum) was used to establish differences between HIV-1 and HIV-2 chromosomal integration. Differences in function, molecular process and cell localization were analyzed using the t-test for independent samples. The Kolmogorov-Smirnov test was used to assess the normality of the data. To avoid an erroneous significance level in multiple comparisons, a Bonferroni correction was applied. To test for significant associations among CpG numbers, gene numbers and integrations, multiple regression analyses were performed. CpG numbers and gene numbers per Mbp for each chromosome were determined from the NCBI and Ensembl databases (2010 update).
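As an illustration of two of the procedures named above, the sketch below computes a Mann-Whitney U statistic with a two-sided p-value and applies a Bonferroni adjustment to a set of p-values. It is a minimal standard-library version written for this text, not the STATISTICA implementation used in the study: the function names are hypothetical, the p-value relies on the normal approximation without tie or continuity correction (so it is rough for small samples), and the per-chromosome counts are invented for the example.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using mid-ranks and the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0      # average of ranks i+1 .. j for tied values
        for k in range(i, j):
            ranks[k] = mid_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_two_sided


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented integration counts per chromosome, for illustration only.
hiv1_counts = [12, 9, 7, 11, 5, 8]
hiv2_counts = [10, 6, 9, 13, 4, 7]
u_stat, p_val = mann_whitney_u(hiv1_counts, hiv2_counts)
adjusted = bonferroni([p_val, 0.01, 0.04])
```

The Bonferroni step is deliberately conservative: with m comparisons, a raw p-value must fall below 0.05/m to remain significant at the 0.05 level.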

Systemic Approach to the Genome Integration Process of Human Lentivirus 59

occurred close to chromatin regions containing protein-coding genes (p>0.05, Student's t-test). In a 100-kb stretch of chromatin that harbored both HIV-1 and HIV-2 proviruses, no differences were observed among the gene functional categories (p>0.05, Bonferroni's correction). According to molecular function, 46% of HIV-1 integrations and 57% of HIV-2 integrations were associated with molecular binding, while 19% and 18%, respectively, occurred in regions coding for genes associated with enzymatic function (figure 2a). An exploration of biological process revealed preferential integration in a collection of genes involved in metabolism and gene expression for both HIV-1 (36%) and HIV-2 (37%) (p>0.05, Bonferroni's correction) (figure 2b).

**Figure 1.** Chromosomal loci where 352 HIV-1 and HIV-2 cDNA have integrated into the human genome. Localization of chromosomal sequences matching both lentiviruses is indicated in the graphics. Upper for each chromosome. Blue lines identify HIV-2 integrations and red lines identify HIV-1 integrations.

**5.2. Distribution of the repetitive elements flanking integration sites**

A low number of repetitive elements, including SINEs, LINEs and LTRs, were identified in association with proviruses within 100 kb of flanking host chromatin. In general, there were no differences in the distribution of repetitive element categories (SINEs, LINEs and LTRs) between HIV-1 and HIV-2 integrations (p>0.05, χ² test). Our results showed that
