210 Fourier Transform Applications

technique based on these bases' association is used. We assign a binary indicator to each of the 64 codons (Table 1):

$$S[n] = \{U\_{cod\_i}[n],\ i = 1 \ldots N\_s\} \tag{13}$$

where:

$$U\_{cod\_i}[n] = \begin{cases} 1 & \text{if the codon } cod\_i \text{ starts at position } n \\ 0 & \text{else} \end{cases} \tag{14}$$

is the binary indicator of the codon $cod\_i$ and $N\_s$ is the sequence's length. The indicator takes the value 1 at location n when the first character of the codon sits at n, that is, when the codon starts there, and 0 otherwise. Consider the sequence *SDNA* and the binary indicator $U\_{TCG}[n]$ of the codon TCG:

*SDNA* = 'AAT**CGCG**ACACTCATT**CG**G'

$U\_{TCG}[n]$ = 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0

Fig. 3. DNA flexibility around histones is enhanced by dinucleotides such as 'AA', 'TT', 'TA'

| Base | Associated codons |
|---|---|
| A | AAA, AAT, AAC, AAG, ATA, ATT, ATC, ATG, ACA, ACT, ACC, ACG, AGA, AGT, AGC, AGG |
| T | TAA, TAT, TAC, TAG, TTA, TTT, TTC, TTG, TCA, TCT, TCC, TCG, TGA, TGT, TGC, TGG |
| C | CAA, CAT, CAC, CAG, CTA, CTT, CTC, CTG, CGT, CGC, CGG, CCA, CCT, CCC, CCG, CGA |
| G | GAA, GAT, GAC, GAG, GTA, GTT, GTC, GTG, GCA, GCT, GCC, GCG, GGA, GGT, GGC, GGG |

Table 1. Codons associated with each base

#### **3.1.2 Pnuc: the structural coding techniques**

The second coding technique is Pnuc, which is based on the local bending and flexibility properties of the double helix; it is deduced experimentally from nucleosome positioning. Given the pairing of the two strands (A-T and C-G) along the helix, each base pair defines a plane and a direction within that plane. The double helix can then be described as a stack of these planes (Fig. 4). If the planes are assumed to be parallel, passing from one plane to the next requires a translation and a rotation of 34.3° of the direction of the base-pair link.

Fig. 4. A description of the double helix showing the stacking of the base-pair planes

In reality the planes are not parallel and the axis of the double helix is curved. Consider the interaction between a protein (a histone) and a DNA sequence: this interaction is strongest when the contact area between the two objects is largest. To increase this area, the DNA segment must wrap around the protein as much as possible, which gives two properties:

If the DNA segment is not wrapped around the protein, it is in its equilibrium position and the curvature is static.

The strand must be flexible enough to allow the additional curvature around the protein.

These two properties give rise to the nucleosome, which imposes a strong curvature on the strand.

Each trinucleotide is replaced by its numerical value given in the Pnuc table (Table 2). The sequence *SDNA* is thus replaced by the numerical sequence *CPNUC*:

*SDNA* = 'AAT**CGCG**ACACTCATT**CG**G'

*CPNUC* = 0.7 5.3 8.3 7.5 7.5 6.0 5.4 5.2 6.5 5.8 5.4 5.4 6.7 0.7 3.0 8.3 4.7
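As a minimal sketch of this substitution, the Python snippet below maps every overlapping trinucleotide of a sequence to its Pnuc value. The dictionary holds the Pnuc table entries expressed in percent (table value × 100), which is the scale the worked example appears to use; the entry for AAT/ATT is taken as 0.7, as in that example. The names `PNUC_PERCENT`, `reverse_complement` and `pnuc_signal` are illustrative and do not come from the original work. Note that the table gives 8.3 for CGA, whereas the sixth value printed in the example above is 6.0; the sketch follows the table.

```python
# Pnuc values in percent (assumed scale: table value x 100).
# Each entry also covers the reverse-complement trinucleotide.
PNUC_PERCENT = {
    "AAA": 0.0, "AAC": 3.7, "AAG": 5.2, "AAT": 0.7,
    "ACA": 5.2, "ACC": 5.4, "ACG": 5.4, "ACT": 5.8,
    "AGA": 3.3, "AGC": 7.5, "AGG": 5.4, "ATA": 2.8,
    "ATC": 5.3, "ATG": 6.7, "CAA": 3.3, "CAC": 6.5,
    "CAG": 4.2, "CCA": 5.4, "CCC": 6.0, "CCG": 4.7,
    "CGA": 8.3, "CGC": 7.5, "CTA": 2.2, "CTC": 5.4,
    "GAA": 3.0, "GAC": 5.4, "GCA": 6.0, "GCC": 10.0,
    "GGA": 3.8, "GTA": 3.7, "TAA": 2.0, "TCA": 5.4,
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word (upper-case A, C, G, T)."""
    return word.translate(_COMPLEMENT)[::-1]


def pnuc_signal(seq):
    """Replace each overlapping trinucleotide of seq by its Pnuc value."""
    signal = []
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri not in PNUC_PERCENT:
            tri = reverse_complement(tri)  # look up the paired entry
        signal.append(PNUC_PERCENT[tri])
    return signal


print(pnuc_signal("AATCGCGACACTCATTCGG"))
```

Because each table entry covers a trinucleotide and its reverse complement, 32 entries are enough to code all 64 trinucleotides.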

Spectral Analysis of Global Behaviour of C. Elegans Chromosomes 213


The signal generated from this coding for a part of a chromosome is given in Fig. 5. For clarity, the signal is multiplied by 10. Fig. 6 illustrates the STFT method applied to this signal. First, subfigure a shows a mean spectrum over distinct windows of length $5 \times 10^5$. The spectrum obtained needs smoothing, so for subfigure b a Blackman smoothing window is applied to each signal part before calculating the mean spectrum of equation 7. In subfigures c and d, equation 8 is used with the following parameters: Blackman window, M = $5 \times 10^5$, N = $5 \times 10^4$ with 50% overlap for subfigure c, and N = $5 \times 10^3$ with 10% overlap for subfigure d. The figure shows that averaging and smoothing are very effective in obtaining the cleanest signal (subfigure d). In this signal, the periodicity 10 is enhanced, which indicates that it is a characteristic of helix flexibility.

Fig. 5. Pnuc signal of 2000 base pairs of chromosome 1 of C. Elegans genome

| Trinucleotide | PNUC | Trinucleotide | PNUC |
|---|---|---|---|
| AAA/TTT | 0.0 | CAG/CTG | 0.042 |
| AAC/GTT | 0.037 | CCA/TGG | 0.054 |
| AAG/CTT | 0.052 | CCC/GGG | 0.060 |
| AAT/ATT | 0.007 | CCG/CGG | 0.047 |
| ACA/TGT | 0.052 | CGA/TCG | 0.083 |
| ACC/GGT | 0.054 | CGC/GCG | 0.075 |
| ACG/CGT | 0.054 | CTA/TAG | 0.022 |
| ACT/AGT | 0.058 | CTC/GAG | 0.054 |
| AGA/TCT | 0.033 | GAA/TTC | 0.030 |
| AGC/GCT | 0.075 | GAC/GTC | 0.054 |
| AGG/CCT | 0.054 | GCA/TGC | 0.060 |
| ATA/TAT | 0.028 | GCC/GGC | 0.100 |
| ATC/GAT | 0.053 | GGA/TCC | 0.038 |
| ATG/CAT | 0.067 | GTA/TAC | 0.037 |
| CAA/TTG | 0.033 | TAA/TTA | 0.020 |
| CAC/GTG | 0.065 | TCA/TGA | 0.054 |

Table 2. The PNuc table

Fig. 6. Illustration of the smoothed mean spectrum applied on Pnuc signal of 2000 base pairs of chromosome 1 of C. Elegans genome

#### **3.1.3 Fcgr: the two-dimensional coding techniques**

The third technique is derived from Chaos Game Representation (CGR) images, which can form a global signature of bio-sequences (Almeida et al, 2001; Cenac et al, 2004; Deshavanne et al, 1999, 2000; Joseph & Sasikumar, 2006; Oliver et al 1993; Fiser et al, 1994). The CGR paradigm is a holistic way of representing DNA; it produces a unique scatter picture for each sequence. In 1990, H. Joel Jeffrey used this representation for the first time to study the "non-randomness" of genomic sequences (Jeffrey, 1990). The CGR is an iterative algorithm for drawing fractal images to any desired scale. It maps nucleotide sequences into the [0,1]x[0,1] square. The four letters A, C, G and T are placed at the corners. The binary CGR vertices are assigned to the four nucleotides as:

$$l\_A = (0,0), l\_C = (0,1), l\_G = (1,1), l\_T = (1,0) \tag{15}$$

To derive the scatter picture, the CGR construction algorithm consists of three steps. First, the four letters A, C, G and T are placed at the corners of a unit square. Second, the first point is plotted halfway between the center of the square and the corner corresponding to the first nucleotide of the sequence. Third, each new point is marked halfway between the previous point and the corner corresponding to the next nucleotide read from the sequence (Almeida et al, 2001; Joseph & Sasikumar, 2006). A generated CGR image can be viewed as an image of distributed dots. Subdividing the unit square n times yields $2^n \times 2^n$ square cells of equal size. The number of points counted in each cell is the number of occurrences of a particular pattern of length n.

For illustration, consider a DNA sequence $S = \{s\_1, s\_2, \ldots, s\_N\}$ of N nucleotides; the CGR point along this sequence is defined by equation 16. For a random sequence, the result is a square uniformly filled with dots.

$$X\_{n+1} = \frac{1}{2} \left( X\_n + l\_{s\_{n+1}} \right) \tag{16}$$

The first point $X\_0$ is usually placed at the center of the square and thus has the coordinates (0.5, 0.5). Then, each next point $X\_{n+1}$ is placed halfway between the previously plotted point $X\_n$ and the vertex corresponding to the letter $s\_{n+1}$ of the sequence. Fig. 7 illustrates the construction of the CGR trajectory for the sequence "ATCGG".

Fig. 7. An illustration of CGR trajectory for sequence ″ATCGG″

To derive the CGR plot, the following steps are taken. First, place $X\_0$ at the center of the square and the four letters at the corners as described before (subfigure 1). From the center to vertex A, mark midpoint 1 (address A) (subfigure 2). From 1 to T, mark midpoint 2 (address AT) (subfigure 3). From 2 to C, mark midpoint 3 (address ATC) (subfigure 4). From 3 to G, mark midpoint 4 (address ATCG) (subfigure 5). From 4 to G, mark midpoint 5 (address ATCGG) (subfigure 6).
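The construction above can be sketched in a few lines of Python. The snippet iterates equation 16 with the corner assignment of equation 15, starting from the center of the square; the names `CORNERS` and `cgr_points` are illustrative choices, not taken from the original work.

```python
# Corner coordinates as assigned in equation 15.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR points X_1..X_N of seq (the start point X_0 is not included)."""
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        # Equation 16: move halfway towards the corner of the current letter.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


for address, point in zip(["A", "AT", "ATC", "ATCG", "ATCGG"], cgr_points("ATCGG")):
    print(address, point)
# A (0.25, 0.25)
# AT (0.625, 0.125)
# ATC (0.3125, 0.5625)
# ATCG (0.65625, 0.78125)
# ATCGG (0.828125, 0.890625)
```

Each printed point is the midpoint reached after reading the corresponding address, which matches the five midpoints described for the sequence "ATCGG".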


Fig. 8. Chaos Game Representation of the C. Elegans's gene F56F11.4

By identifying local patterns displayed in the CGR square, it is possible to identify corresponding features of DNA sequences (Yu et al, 2008). The fractal nature of this kind of DNA representation can be observed in Fig. 8. The dots clustering in the lower corners indicate a slightly high concentration of A and T. It is known that CGR patterns depict base composition. If we divide the CGR space with a grid of order k (i.e. $2^k \times 2^k$ cells) and count the points falling in each cell, the frequency of occurrence of the words of length k can be estimated; the frequency matrix extracted in this way is called the FCGR (Frequency Chaos Game Representation) (Almeida et al, 2001; Deshavanne et al, 2000; Jeffrey, 1990).
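A minimal sketch of this grid counting is given below, assuming the corner assignment of equation 15 and a start point at the center of the square. The function name `fcgr` and the row/column convention (row 0 at the bottom, y = 0) are illustrative assumptions. Points are counted only from the k-th letter onwards, since earlier points do not yet encode a full word of length k.

```python
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def fcgr(seq, k):
    """Return the 2^k x 2^k frequency matrix of the CGR points of seq.

    matrix[row][col] is the fraction of k-letter words whose CGR point
    falls in the cell with y-index row and x-index col.
    """
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5
    for n, base in enumerate(seq, start=1):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        if n >= k:  # the point now encodes the last k letters
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            counts[row][col] += 1
    total = len(seq) - k + 1  # number of k-letter words in the sequence
    return [[c / total for c in row] for row in counts]


for row in fcgr("ATCGG", 1):
    print(row)
# [0.2, 0.2]   bottom row: A, T
# [0.2, 0.4]   top row:    C, G
```

For very long runs of a single letter, floating-point rounding can push a point onto a cell boundary; counting the words directly from the sequence avoids this and gives the same matrix.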

The FCGR was first investigated by Deschavanne (Deshavanne et al, 1999) and later by Almeida (Almeida et al, 2001). To show the frequencies of the k-tuples, a color scheme normalized to the distribution of the frequency of occurrence of the associated patterns is used (Joseph & Sasikumar, 2006; Oliver et al, 1993; Tavassoly, 2007a; Tavassoly, 2007b; Makula, 2009; Goldman, 1993; Cénac, 2006; Tino, 1999; reference 44). A grayscale mapping may also be used. In Fig. 9, the dinucleotide and trinucleotide frequency matrices (k = {2, 3}) are obtained for the gene F56F11.4 of C. elegans. Thus, $2^2 \times 2^2 = 16$ cells are needed for motifs of length 2 and $2^3 \times 2^3 = 64$ cells for motifs of length 3. The darker pixels represent the most frequently used words, while the lightest ones represent the least used words. CGRs have been used to display the behavior of sub-patterns within the same input sequence and to depict oligomer composition. They form the basis for similarity and self-similarity algorithms that differ from the traditional alignment of nucleotides.
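To make the cell count concrete, the following sketch computes which cell of the $2^k \times 2^k$ grid a given word occupies, directly from its letters and without plotting any point. It assumes the corner assignment of equation 15 and the same illustrative convention as before (row index along y, column index along x); the names `BITS` and `kmer_cell` are hypothetical.

```python
from itertools import product

# x and y bits of each corner, following equation 15.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_cell(word):
    """Return (row, col) of the FCGR cell that holds the word.

    The last letter contributes the most significant bit, because the most
    recent half-step of the chaos game dominates the position of the point.
    """
    row = col = 0
    for weight, base in enumerate(word):  # first letter has weight 2**0
        bx, by = BITS[base]
        col += bx << weight
        row += by << weight
    return row, col


print(kmer_cell("AT"))   # (0, 2)
print(len({kmer_cell("".join(w)) for w in product("ACGT", repeat=2)}))  # 16
print(len({kmer_cell("".join(w)) for w in product("ACGT", repeat=3)}))  # 64
```

Every word of length k lands in its own cell, which is why 16 cells are enough for dinucleotides and 64 for trinucleotides.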

The FCGR alone cannot follow the evolution of the frequencies from the beginning to the end of a given sequence. We therefore propose to generate signals from the FCGR. We generate the nth-order FCGR for the whole sequence and replace each word of length n read along the sequence by the frequency of that word in the FCGRn matrix.

Fig. 9. The FCGR2 (k=2) and the FCGR3 (k=3) for the gene F56F11.4 of C. Elegans

Fig. 10. FCGR representation for k=2, 3 and 6 and the associated FCGR signals: (a) FCGR2, (b) FCGR2-signal, (c) FCGR3, (d) FCGR3-signal, (e) FCGR6, (f) FCGR6-signal

Generating signals from FCGRs is a good way to capture such variability. To this end, a new 1D graphical representation of DNA sequences is introduced, which provides useful insights into the local and global characteristics of genomic sequences. This DNA coding algorithm consists of computing the kth-order FCGR for the whole sequence and then assigning to each word of length k in the sequence the value of its frequency. This allows us to follow the evolution of the frequencies along a given sequence. The resulting plot is called the FCGRk-signal.

Consider the sequence *SDNA* = 'TTTAAAAGCTCGCGCTAAAA'.

The sequence is scanned with a sliding window of length k. The resulting frames of length k are called k-mers. For example, for k = {2, 3, 6}, we obtain 2-mers(SDNA), 3-mers(SDNA) and 6-mers(SDNA).

Fk(S) is defined as the set of frequencies of the k-substrings that appear in the sequence S. These frequencies are read from the corresponding FCGRk matrices. It follows that:

F2(SDNA)= {0.1579, 0.1579, 0.1579, 0.3684, 0.3684, 0.3684, 0.1053, 0.2105, 0.1579, 0.1053, 0.1579, 0.2105, 0.1579, 0.2105, 0.1579, 0.1579, 0.3684, 0.3684, 0.3684}

F3(SDNA)= {0.1111, 0.1111, 0.1667, 0.2778, 0.2778, 0.1111, 0.1111, 0.1667, 0.1111, 0.1111, 0.1667, 0.1111, 0.1667, 0.1667, 0.1111, 0.1667, 0.2778, 0.2778}

F6(SDNA)= {0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333, 0.1333}
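The FCGRk-signal can be sketched in Python as follows. One point is an inference rather than something the text states: the values listed above are consistent with (count + 1) divided by the number of sliding windows, so the hypothetical parameter `offset` defaults to 1 to reproduce them; `offset=0` gives the plain relative frequency. The function name `fcgr_signal` is also an illustrative choice.

```python
from collections import Counter


def fcgr_signal(seq, k, offset=1):
    """Replace each k-letter word of seq by its frequency in the whole sequence.

    offset=1 is an assumption inferred from the listed values; it is not
    stated in the text. Use offset=0 for plain relative frequencies.
    """
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(words)
    total = len(words)
    return [(counts[w] + offset) / total for w in words]


S_DNA = "TTTAAAAGCTCGCGCTAAAA"
for k in (2, 3, 6):
    print(k, [round(v, 4) for v in fcgr_signal(S_DNA, k)])
```

With the default offset, the first values printed for k = 2 are 0.1579, 0.1579, 0.1579, 0.3684, in agreement with F2(SDNA), and all fifteen values for k = 6 equal 0.1333.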

Fig. 10 illustrates the FCGR images for k = 2, 3 and 6 and the corresponding signals for the considered sequence.

Fig. 10 also shows the slightly high concentration of AA and AAA motifs in FCGR2 and FCGR3, which appear as the high-rise blocks in the corresponding signals.

A spectral analysis is then applied to the signals obtained, to detect the global frequency behaviour in the spectrum of each C. elegans chromosome.



#### **3.2 The Fourier analysis method steps**

Short-time analysis is used to locate specific regions in a DNA sequence. To this end, the mean of a smoothed Discrete Fourier Transform is computed over a sliding window along the DNA sequence, so as to follow the evolution of the peaks at specific frequency points. The steps of the Fourier analysis algorithm are:

The converted DNA sequence x[n] is divided into frames of length M with an overlap Δm. Each of these frames is in turn divided into frames of length N by multiplication with a sliding analysis window w[n]:

$$x\_w[n, i] = x[n]\, w[n - i\Delta n] \tag{17}$$

where i is the window index and Δn the overlap. The weighting w[n] is assumed to be non-zero in the interval [0, N-1]. The frame length N is chosen so that, on the one hand, the parameters to be measured remain constant within the frame and, on the other hand, there are enough samples of x[n] within the frame to guarantee a reliable determination of the frequency parameters. The choice of the windowing function influences the values of the short-term parameters; the shorter the window, the greater its influence (Mallat, 1999). We choose the frame lengths N and M as powers of two so that the Fast Fourier Transform algorithm can be applied.

Each weighted block $x\_w[n, i]$ of the frame is transformed into the spectral domain using the Discrete Fourier Transform (DFT), in order to extract the spectral parameters $X\_w[k]$, where k is the frequency index ([0, N-1]). The DFT of each frame (within one of the M sequence parts) is expressed as follows:

$$X\_w^i[k] = \sum\_{n=0}^{N-1} x\_w[n, i] e^{-j\frac{2\pi}{N}nk} \tag{18}$$

We then average these DFTs to obtain one mean DFT for each frame (1: M). The mean DFT is expressed as:

$$Xm\_w^j[k] = \frac{1}{N} \sum\_{i=0}^{N-1} X\_w^i[k] \tag{19}$$

where *i* is the index of the frame among the N frames ([1...N]), *k* is the frequency index and *j* is the index of the frame among the M frames ([1: M]). We form the matrix

$$\text{MAT}(j, k) = Xm\_w^j[k] \tag{20}$$

With these values, the matrix represents joint time-frequency information, known as a 2D or 3D DNA spectrogram. This 2D or 3D representation gives the spectrogram amplitude for a specific periodicity index at a specific nucleotide position in the chromosome.
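The steps of equations 17 to 20 can be sketched as follows, under several stated assumptions: Δm and Δn are used as hop sizes (the shift between successive frames), the window is a Blackman window, the magnitudes of the DFTs are averaged rather than the complex values, and a naive DFT is used instead of an FFT so that the example needs only the standard library. The function names are illustrative and are not taken from the original work.

```python
import cmath
import math


def blackman(N):
    """Blackman window of length N (N > 1)."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / (N - 1))
            + 0.08 * math.cos(4 * math.pi * n / (N - 1)) for n in range(N)]


def dft_magnitude(frame):
    """Magnitude of the DFT of one frame (equation 18), computed naively."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * n * k / N)
                    for n in range(N)))
            for k in range(N)]


def mean_spectrogram(x, M, dm, N, dn):
    """Return MAT[j][k]: mean windowed spectrum of each M-frame (equations 17-20).

    dm and dn are hop sizes, which is an assumption about the meaning of
    the overlaps given in the text.
    """
    w = blackman(N)
    mat = []
    for start in range(0, len(x) - M + 1, dm):       # M-frames, index j
        part = x[start:start + M]
        spectra = []
        for s in range(0, M - N + 1, dn):            # N-frames, index i
            frame = [part[s + n] * w[n] for n in range(N)]   # equation 17
            spectra.append(dft_magnitude(frame))             # equation 18
        # Equation 19 (on magnitudes): average over the N-frames.
        mat.append([sum(col) / len(spectra) for col in zip(*spectra)])
    return mat


# Toy signal with period 3: the peak is expected at index k = N / 3.
x = [1.0 if n % 3 == 0 else 0.0 for n in range(96)]
mat = mean_spectrogram(x, M=48, dm=24, N=24, dn=12)
print(len(mat), len(mat[0]))
print(max(range(4, 13), key=lambda k: mat[0][k]))
```

On this toy signal the largest value away from the zero-frequency lobe sits at index 8, that is N/3, which is how a period-3 component shows up in each row of the matrix.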

### **4. Results**

The method has been applied to the C. elegans genome. The chromosomes were divided into parts of 1 Mbp. The M frames have a length of 1024 bp and an overlap Δm = 256; the N frames within each M frame have a length of 256 with Δn = 128. Fig. 11 presents some examples of the spectra obtained with each of the three coding techniques used. In this figure, we show in particular the periodicities 3 and 10, whose strength depends closely on the coding.
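When reading such spectra, it helps to recall that a periodicity P corresponds to the frequency 1/P and therefore to a DFT index close to N/P. The short sketch below, with hypothetical helper names, applies this to the frame length N = 256 quoted above.

```python
def period_to_bin(period, n_fft):
    """Nearest DFT index for a given periodicity (frequency = 1 / period)."""
    return round(n_fft / period)


def bin_to_period(k, n_fft):
    """Periodicity represented by DFT index k (k > 0)."""
    return n_fft / k


N = 256
for period in (3, 6.5, 9, 10, 10.5):
    k = period_to_bin(period, N)
    print(period, k, round(bin_to_period(k, N), 2))
# 3 85 3.01
# 6.5 39 6.56
# 9 28 9.14
# 10 26 9.85
# 10.5 24 10.67
```

With N = 256, periodicities 10 and 10.5 fall only two indices apart, so peaks attributed to one or the other are separated by the frequency resolution of the frame rather than resolved exactly.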

Fig. 11. Examples of spectra and spectrograms generated with the mean-value technique based on the smoothed Discrete Fourier Transform applied on a sliding window along parts of the DNA sequence of the C. elegans genome. Two coding methods are used: (a) linear coding technique (binary indicator); (b) structural coding technique (PNUC)

To highlight the frequencies characteristic of an organism, the tests were carried out with various codings over various segment sizes and window widths. Table 3 gives the percentage of trinucleotides that contribute to highlighting the characteristic frequencies 1/3, 1/6.5, 1/9 and 1/10. The table shows that C. elegans is rich in periodicities and that these periodicities are brought out by more than three quarters of the codings. For periodicity 3 the rate reaches about 97 %, followed by periodicity 10 with about 90 % and periodicity 9 with about 85 %. Periodicity 6.5 is also strongly marked in this organism: about 70 % of the codes contribute to bringing it out. It reflects the frequent occurrence of groups of 6 bases at periodicity 6. The majority of these groups are polyA, generally associated with genes.

Spectral Analysis of Global Behaviour of C. Elegans Chromosomes 221

The spectrogram 3D adds a third element to the representation 2D. In addition to the localization of the periodicities in the segment, we visualize power associated with each peak. We can distinguish between the peaks which we can find in all the segments for a given periodicity: 10 and 3 and the peaks which are present in certain segments and which

The Fig. 12 is divided on 4 subfigures. Each one add to the 2D spectrograms the power values and locations of the periodicities Enhanced: it represents the 3-D spectrograms. Subfigures (a) and (b) are related to chromosomes2 of C. Elegans when the subfigures (c)

This figure shows that for the binary indicator 'AA', 'TT', 'AAA' and 'TTT', the peaks around the frequency 1/10.5 are very pronounced. The variation of the degree view angle demonstrates that the peaks are locally spread in the chromosome part. In the literrature, it has been demonstrated both with the biochemical and signal processing studies, that the periodicity 10.5 related to the nucleosomes is varying. That's why, these figures shows in one hand that there is peaks around this periodicity and in the other hand the peaks are

Fig. 11 presents some spectra obtained with linear coding based on binary indicators. Each indicator enhances a specific periodicity: the ttt binary indicator enhances periodicity 10, whereas the tta and tgg indicators pick up periodicity 3. The 3D spectrograms give more detail on how the power is spread around these periodicities; the peaks at these frequency locations have different power values (Fig. 12).
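As a minimal sketch of the binary-indicator coding discussed here, the following Python builds the indicator of a word and evaluates the DFT power at a single period. The helper names `binary_indicator` and `power_at_period` and the toy sequence are illustrative assumptions, not the implementation used for Fig. 11.

```python
import cmath

def binary_indicator(seq, word):
    """U_word[n] = 1 if `word` starts at position n of `seq`, else 0."""
    seq, word = seq.upper(), word.upper()
    k = len(word)
    return [1 if seq[n:n + k] == word else 0 for n in range(len(seq) - k + 1)]

def power_at_period(signal, period):
    """Power |X(f)|^2 of the DFT evaluated at the single frequency f = 1/period."""
    f = 1.0 / period
    x = sum(s * cmath.exp(-2j * cmath.pi * f * n) for n, s in enumerate(signal))
    return abs(x) ** 2

# Example sequence from the coding section: 'TCG' starts at positions 2 and 15 (0-based).
u_tcg = binary_indicator("AATCGCGACACTCATTCGG", "TCG")
positions = [n for n, v in enumerate(u_tcg) if v]   # -> [2, 15]

# Hypothetical toy sequence in which 'TTA' recurs every 3 bases.
u_tta = binary_indicator("TTA" * 30, "TTA")
p3 = power_at_period(u_tta, 3)     # strong: the indicator has period 3
p10 = power_at_period(u_tta, 10)   # close to zero for this toy signal
```

Evaluating the power at one frequency keeps the sketch short; a full spectrum would repeat the same sum over all DFT bins.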

| Period | Ch1 | Ch2 | Ch3 | Ch4 | Ch10 |
|---|---|---|---|---|---|
| P=3 | 96.8% | 96.8% | 96.8% | 96.8% | 96.8% |
| P=6.5 | 64% | 60.9% | 81.25% | 73.43% | 71.9% |
| P=9 | 85.9% | 89% | 89.1% | 82.81% | 79.7% |
| P=10 | 70.3% | 90.6% | 90.5% | 90.6% | 84.3% |

Table 3. The proportion of contribution of the trinucleotides in the highlighting of the various characteristic frequencies

Fig. 12. 3-D spectrograms for binary indicators coding

The 3D spectrogram adds a third element to the 2D representation. In addition to the localization of the periodicities in the segment, we visualize the power associated with each peak. We can distinguish between the peaks found in all segments for a given periodicity (10 and 3) and the peaks that are present only in certain segments and were eliminated by averaging.
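The effect of averaging can be illustrated with a small sketch, assuming the spectrogram is a set of per-window power spectra and the mean spectrum is their bin-wise average. The function names, window length and toy signal below are hypothetical choices made for illustration only.

```python
import cmath

def power_spectrum(x):
    """Power |X[k]|^2 for k = 0..N//2, using a direct O(N^2) DFT (fine for short windows)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def spectrogram(signal, window, step):
    """One power spectrum per window: rows are segments, columns are frequency bins."""
    return [power_spectrum(signal[s:s + window])
            for s in range(0, len(signal) - window + 1, step)]

def mean_spectrum(spec):
    """Bin-wise average of the rows; peaks present in few segments are attenuated."""
    rows = len(spec)
    return [sum(col) / rows for col in zip(*spec)]

# Toy signal (assumption): a period-3 pattern everywhere, a period-10 pattern
# only between positions 30 and 59.
sig = [(1 if n % 3 == 0 else 0) + (1 if 30 <= n < 60 and n % 10 == 0 else 0)
       for n in range(120)]
spec = spectrogram(sig, window=30, step=30)   # 4 segments of 30 samples
avg = mean_spectrum(spec)
# With a 30-sample window, period 3 falls in bin 10 and period 10 in bin 3.
```

In this toy case the period-3 bin keeps its full power in the average, while the period-10 bin, present in a single segment, is divided by the number of segments. The spectrogram rows preserve that local information.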

Fig. 12 is divided into four subfigures. Each one adds to the 2D spectrogram the power values and the locations of the enhanced periodicities, giving the 3-D spectrogram. Subfigures (a) and (b) relate to chromosome 2 of C. elegans, whereas subfigures (c) and (d) concern chromosome 3.

This figure shows that for the binary indicators 'AA', 'TT', 'AAA' and 'TTT', the peaks around the frequency 1/10.5 are very pronounced. Varying the viewing angle shows that the peaks are locally spread within the chromosome part. The literature has demonstrated, through both biochemical and signal processing studies, that the 10.5 periodicity related to the nucleosomes varies. Accordingly, these figures show on the one hand that there are peaks around this periodicity and on the other hand that the peaks are spread over specific regions of the chromosome.

Fig. 13 shows the spectrograms obtained after Pnuc coding. The analysis breaks the chromosome, which is 15.2 Mbp long, into 15 parts of 1 Mbp. The periodicity is localized at the ends. In fact, the periodicity peaks are missing or have very weak power in the stretch from 6 Mbp to 12 Mbp; the periodicity is not localized at the centromere but around the ends. We find it from position 13 Mbp to the end. In the parts where it exists it is not continuous; it is localized in specific intervals.

In Fig. 14, the mean-value technique based on the smoothed Discrete Fourier Transform was applied along parts 6, 9 and 13 of chromosome 1 of C. elegans. From the 1D, 2D and 3D plots, we observe that coding with FCGR-2 reveals the presence of both the 10.5 and 3 periodicities. Around each of these periodicities, the peaks are spread with different values depending on the part; each part has its own specificity. In part 9 (subfigure a), periodicities 3 and 10 barely emerge from the frequency behavior, with peaks of modest values. In part 6 (subfigure b) these periodicities behave in the same way; the specificity of this part is the presence of horizontal peaks around location 750. Part 13 (subfigure c), in contrast, is rich in periodicities 10 and 12 and poor in periodicity 3.
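The text does not detail the smoothing step. One common reading, taken here as an assumption, is a moving average over neighboring frequency bins of the power spectrum. The function `smooth` and its `width` parameter are hypothetical.

```python
def smooth(values, width):
    """Centered moving average over `width` bins (odd); edges use the available neighbors."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Toy power spectrum (assumption): an isolated peak of height 9 in bin 2.
raw = [0, 0, 9, 0, 0]
smoothed = smooth(raw, 3)   # -> [0.0, 3.0, 3.0, 3.0, 0.0]
```

Smoothing trades peak height for stability: an isolated spike is spread over its neighbors, so only peaks supported by several adjacent bins remain prominent.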

For coding with FCGR-3 (Fig. 15), the very pronounced peaks correspond to the 10.5 periodicity; just to the left, other peaks appear around the frequency 0.11, which corresponds to periodicity 9; to the right, a few peaks occur around the frequency 1/12. The 3 periodicity disappears in the majority of the parts, and when it does appear it is present only in a few areas with very low amplitudes. In Fig. 15, we can distinguish the frequency behavior of the three parts represented. Periodicity 10 is more pronounced in part 16 (subfigure b) than in part 9 (subfigure a) and part 10 (subfigure c).
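The correspondence between the periodicities and the frequencies quoted here is simply f = 1/P. A short sketch, with hypothetical helper names and an assumed DFT length of 1024 points, makes the conversion and the nearest DFT bin explicit.

```python
def period_to_frequency(period):
    """Normalized frequency (cycles per base) of a given periodicity in base pairs."""
    return 1.0 / period

def nearest_bin(period, n_fft):
    """Index k of the DFT bin, of frequency k / n_fft, closest to 1 / period."""
    return round(n_fft / period)

# Periodicities discussed in the text and their frequencies, rounded to 3 decimals.
table = {p: round(period_to_frequency(p), 3) for p in (3, 9, 10.5, 12)}
# -> {3: 0.333, 9: 0.111, 10.5: 0.095, 12: 0.083}

k_nucleosome = nearest_bin(10.5, 1024)   # bin closest to 1/10.5 for a 1024-point DFT
```

Because the frequencies of periodicities 9, 10.5 and 12 are closely spaced (0.111, 0.095, 0.083), a long enough window is needed for them to fall into distinct bins.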

As for the hexamer coding (FCGR-6), we find that it enhances the frequency 1/10.5; in rare zones the frequency 1/12 is observed (Fig. 16). This coding technique clearly enhances periodicity 10 and its neighbors, in contrast to periodicity 3. The three parts show different aspects of the distribution of the periodicities. In part 9 (subfigure a), the peaks are spread over a "large" frequency band around the periodicity. The band narrows for part 16 (subfigure b), where it is located at two frequencies, and the power is then grouped at a single frequency for part 12 (subfigure c).

Fig. 13. Distribution of periodicity 10 for coding pnuc along chromosome 2

a- Fourier analysis of part 9 of chromosome 1

b- Fourier analysis of part 6 of chromosome 1

c- Fourier analysis of part 13 of chromosome 1

Fig. 14. Examples of spectra and spectrograms of chromosome parts with FCGR-2 signal coding

a- Fourier analysis of part 9 of chromosome 1

b- Fourier analysis of part 16 of chromosome 1

c- Fourier analysis of part 10 of chromosome 1

Fig. 15. Examples of spectra and spectrograms of chromosome parts with FCGR-3 signal coding

a- Fourier analysis of part 9 of chromosome 1

b- Fourier analysis of part 16 of chromosome 1

c- Fourier analysis of part 12 of chromosome 1

Fig. 16. Examples of spectra and spectrograms of chromosome parts with FCGR-6 signal coding


A peak around the frequency 1/4 near position 2500 corresponds to a satellite (Fig. 16, subfigure a). This frequency derives from repetitions of certain dinucleotides in the area. The spectrogram reveals the presence of a satellite with multiple frequencies; this appears clearly in the 3D graph as horizontally aligned peaks, colored in red, at the higher frequencies.
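A localized peak of this kind can be found by tracking the power at one frequency along the sequence. The sketch below assumes a binary-indicator coding of a dinucleotide and a non-overlapping sliding window; the function names, the toy sequence and the window size are hypothetical and chosen only to make the example easy to follow.

```python
import cmath

def binary_indicator(seq, word):
    """1 where `word` starts in `seq`, else 0."""
    k = len(word)
    return [1 if seq[n:n + k] == word else 0 for n in range(len(seq) - k + 1)]

def sliding_power(signal, period, window, step):
    """List of (window start, power at frequency 1/period) along the signal."""
    f = 1.0 / period
    result = []
    for start in range(0, len(signal) - window + 1, step):
        x = sum(signal[start + t] * cmath.exp(-2j * cmath.pi * f * t)
                for t in range(window))
        result.append((start, abs(x) ** 2))
    return result

# Toy sequence (assumption): a tandem repeat in which 'AT' recurs every 4 bases,
# flanked by a background without that dinucleotide.
toy = "G" * 200 + "ATGC" * 25 + "G" * 200
profile = sliding_power(binary_indicator(toy, "AT"), period=4, window=100, step=100)
best_start, best_power = max(profile, key=lambda item: item[1])
# The window starting at 200 carries all the power at frequency 1/4.
```

A dinucleotide that recurs every four bases produces a peak at frequency 1/4 only in the windows covering the repeat, which matches the horizontally localized pattern described for the spectrogram.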
