**3. Genomic sequence analysis based on Short-Time Fourier Transform**

To localize frequencies more precisely in time, Gabor proposed a local Fourier analysis using windows. The technique segments the signal by multiplying it by a sliding window of fixed length (Mallat, 1999). Each segment is analyzed independently with a classical Fourier transform to bring out its frequency behavior. Together, these transforms form the short-time Fourier transform, which specifies where the frequencies occur in time.

After the coding process is applied, the numerical signal is obtained from the succession of bases as follows:

$$\mathbf{x}\begin{bmatrix} n \end{bmatrix} = \begin{Bmatrix} \mathbf{x}(i) \end{Bmatrix}, i \in \begin{bmatrix} 1 \dots \mathbf{N} \end{bmatrix} \tag{1}$$

The classical discrete Fourier transform of this numerical sequence is expressed as:

$$X\left[k\right] = \sum\_{n=0}^{N-1} x\left(n\right) e^{-j\frac{2\pi}{N}nk} \tag{2}$$
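As a minimal sketch, equation (2) can be evaluated directly as written. The function name `dft` and the period-3 toy signal are illustrative assumptions, not part of the original method; a production analysis would normally use an FFT routine instead of this quadratic-time sum.

```python
import cmath

def dft(x):
    """Direct evaluation of equation (2): X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]

# Hypothetical toy signal that repeats every 3 samples (N = 12)
x = [1, 0, 0] * 4
X = dft(x)

# Magnitudes are rounded to hide floating-point noise near zero
magnitudes = [round(abs(value), 6) for value in X]
print(magnitudes)
```

For this toy input the energy concentrates at k = 0, 4 and 8, that is at multiples of N/3, which is how a periodicity in the coded sequence shows up in the spectrum.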

In order to locate the signal frequencies in time, the analysis is applied to parts of the sequence obtained by multiplication with a sliding analysis window.

Spectral Analysis of Global Behaviour of C. Elegans Chromosomes 209



For this purpose, the numerical signal x[n] is divided into frames of length N. The expression becomes:

$$x\_{\omega}\left[n, i\right] = x\left[n\right] \, \omega\left[i - \Delta n\right] \tag{3}$$

When based on the binary indicator 'A', the equation becomes:

$$x\_{A\omega}\left[n, i\right] = U\_A\left[n\right] \, \omega\left[i - \Delta n\right] \tag{4}$$

where i is the window index and Δn is the sliding step. The window length must be chosen so that each frame contains enough samples to give good frequency resolution. A Fourier transform is applied to each block x<sub>ω</sub>[n, i] to determine X<sub>ω</sub>[k], k ∈ [0 : N-1], where k is the frequency index. The Fourier transform associated with each frame is:

$$X\_{\omega}^{i}\left[k\right] = \sum\_{n=0}^{N-1} x\_{\omega}\left[n, i\right] e^{-j\frac{2\pi}{N}nk} \tag{5}$$

With binary indicator 'A' coding, the equation is:

$$X\_{A\omega}^{i}\left[k\right] = \sum\_{n=0}^{N-1} x\_{A\omega}\left[n, i\right] e^{-j\frac{2\pi}{N}nk} \tag{6}$$
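The framing and per-frame transform can be sketched as follows. Several choices here are assumptions rather than statements from the text: frame i is taken to start at sample i·Δn, the window is a Hamming window (the text does not name the window), and the function names `hamming`, `frames`, `dft` and `stft` are hypothetical. The input is the indicator U<sub>A</sub>[n] of the example sequence used in Section 3.1.1.

```python
import cmath
import math

def hamming(N):
    """Hamming window of length N (an assumed choice of analysis window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frames(x, window, step):
    """Windowed frames: frame i covers samples i*step .. i*step + N - 1."""
    N = len(window)
    return [
        [x[start + n] * window[n] for n in range(N)]
        for start in range(0, len(x) - N + 1, step)
    ]

def dft(frame):
    """Discrete Fourier transform of one frame, evaluated directly."""
    N = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]

def stft(x, window, step):
    """One spectrum per frame: the list of X^i[k] for i = 0, 1, ..."""
    return [dft(frame) for frame in frames(x, window, step)]

# U_A[n] for the example sequence 'AATCGCGACACTCATTCGG'
u_a = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
spectra = stft(u_a, hamming(8), step=4)
print(len(spectra), len(spectra[0]))  # 3 8
```

With 19 samples, a window of 8 and a step of 4, three complete frames fit; trailing samples that do not fill a frame are dropped in this sketch.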

Many representations can be obtained from this expression. Since the sequence corresponds to a chromosome, the first analysis studies the global frequency behavior. To enhance the frequencies, we use a mean smoothed spectrum. The principle consists in calculating the mean of the spectra obtained for the individual frames:

$$\overline{X}\_{\omega}^{j}\left[k\right] = \frac{1}{N}\sum\_{i=0}^{N-1}X\_{\omega}^{i}\left[k\right] \tag{7}$$

Chromosomes generally contain more than 10 Mbp, so the obtained spectrum needs further smoothing. A second mean, taken over the mean spectra, is applied. The converted DNA sequence x[n] is divided into frames of length M with an overlap Δm. Each of these frames is in turn divided into N frames by multiplication with a sliding analysis window ω[n]. On each part, a mean smoothed spectrum is generated. Finally, the mean of the spectra over all the parts is calculated. The final expression of the spectrum is:

$$\overline{X}\left[k\right] = \frac{1}{M} \sum\_{j=0}^{M-1} \overline{X}\_{\omega}^{j}\left[k\right] \tag{8}$$
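The two averaging stages can be illustrated with a short sketch. It assumes the quantities being averaged are magnitude spectra (the text does not say whether magnitudes, powers or complex values are used), and the names `mean_spectrum` and `smoothed_spectrum` as well as the numbers are hypothetical.

```python
def mean_spectrum(spectra):
    """Element-wise mean over a list of equal-length spectra."""
    count = len(spectra)
    return [sum(s[k] for s in spectra) / count for k in range(len(spectra[0]))]

def smoothed_spectrum(segments):
    """segments[j][i][k]: magnitude at bin k of window i in segment j."""
    per_segment = [mean_spectrum(windows) for windows in segments]  # first mean, eq. (7)
    return mean_spectrum(per_segment)                               # second mean, eq. (8)

# Hypothetical magnitudes: 2 segments, 2 windows each, 3 frequency bins
segments = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    [[0.0, 4.0, 2.0], [2.0, 0.0, 2.0]],
]
print(smoothed_spectrum(segments))  # [1.5, 2.0, 2.0]
```

Averaging in two stages keeps memory use bounded for long chromosomes, since only one mean spectrum per segment has to be retained before the final mean.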

#### **3.1 The chromosomes coding techniques**

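This analysis aims to study the global frequency behaviour of the chromosome. For this purpose, it is particularly important to enhance the signals generated by the protein-coding regions and the nucleosome regions. For this reason, three types of coding techniques are considered: binary indicators (Section 3.1.1); Pnuc, a structural coding based on the deformability of the double helix (Section 3.1.2); and a third coding that emerged from the field of physics known as 'chaotic dynamical systems'.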


#### **3.1.1 Binary indicator's techniques**

Linear coding assigns a binary value to each unit of every indicator, where the indicators belong to {'A', 'T', 'C', 'G', 'TT', 'TA', 'GC', 'AAA', …, 'GGG'}. The associated marker takes the value 1 or 0 at location n, the position of the group's first character, depending on whether or not the corresponding character group starts at location n.

Base's binary indicator:

$$S\left[n\right] = \sum\_{b \in B} U\_b\left[n\right] \tag{9}$$

Where:

208 Fourier Transform Applications


$$U\_b\left[n\right] = \begin{cases} 1 & \text{if base } b \text{ is in position } n \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

is the binary indicator of base *b*, with B = {A, T, C, G}.

Consider the sequence *S<sub>DNA</sub>* and its associated base binary indicator *U<sub>A</sub>*[*n*]:

*S<sub>DNA</sub>* = '**AA**TCGCG**A**C**A**CTC**A**TTCGG'

*U<sub>A</sub>*[*n*] = 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
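The single-base indicator of equation (10) amounts to a position-by-position comparison. The sketch below uses a hypothetical helper name, `base_indicator`, and reproduces the example sequence from the text.

```python
def base_indicator(sequence, base):
    """U_b[n] of equation (10): 1 where `base` occupies position n, else 0."""
    return [1 if symbol == base else 0 for symbol in sequence]

s_dna = "AATCGCGACACTCATTCGG"
u_a = base_indicator(s_dna, "A")
print(u_a)
# [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# One indicator per base of B = {A, T, C, G}
indicators = {b: base_indicator(s_dna, b) for b in "ATCG"}
```

Each of the four resulting signals has the same length as the sequence and can be passed separately to the spectral analysis.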

Two-base binary indicator:

$$S\left[n\right] = \sum\_{bb \in B} U\_{bb}\left[n\right] \tag{11}$$

where

$$U\_{bb}\left[n\right] = \begin{cases} 1 & \text{if the dinucleotide } bb \text{ starts at position } n \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

is the binary indicator of the dinucleotide *bb*, with

B = {AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG}

Consider the sequence *S<sub>DNA</sub>* and its associated dinucleotide binary indicator *U<sub>CG</sub>*[*n*]:

*S<sub>DNA</sub>* = 'AAT**CGCG**ACACTCATT**CG**G'

*U<sub>CG</sub>*[*n*] = 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0
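The dinucleotide indicator of equation (12) can be sketched with a sliding comparison over the sequence. The helper name `word_indicator` is hypothetical; overlapping occurrences are counted, which is what the worked example implies.

```python
def word_indicator(sequence, word):
    """1 at every position n where `word` starts, 0 elsewhere (overlaps allowed)."""
    k = len(word)
    return [
        1 if sequence[n:n + k] == word else 0
        for n in range(len(sequence) - k + 1)
    ]

s_dna = "AATCGCGACACTCATTCGG"
u_cg = word_indicator(s_dna, "CG")
print(u_cg)
# [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```

The output has one value fewer than the sequence because a two-letter word cannot start at the last position. The same function works for three-letter words, so it also covers indicators such as 'AAA' … 'GGG'.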

Some dinucleotides, such as 'AA', 'TT' and 'TA', enhance the flexibility of DNA around histones to form nucleosomes (Fig. 3).

Codon's binary indicator: the association of three bases, called a triplet or codon, plays a fundamental role in the production of amino acids. For this reason, a coding technique based on this association of bases is used. We assign a binary indicator to each of the 64 codons (Table 1):

$$S\left[n\right] = \left\{ U\_{cod}\left[i\right] \, \middle| \, i = 1 \dots N\_s \right\} \tag{13}$$

where:

$$U\_{cod}\left[n\right] = \begin{cases} 1 & \text{if the codon } cod \text{ starts at position } n \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

is the binary indicator of the codon *cod* and *N<sub>s</sub>* is the sequence length. This marker takes the value 1 or 0 at location n, the position of the codon's first character, depending on whether the codon starts at location n. Consider the sequence *S<sub>DNA</sub>* and its associated codon binary indicator *U<sub>TCG</sub>*[*n*]:

*S<sub>DNA</sub>* = 'AA**TCG**CGACACTCAT**TCG**G'

*U<sub>TCG</sub>*[*n*] = 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

#### **3.1.2 Pnuc: the structural coding techniques**

The second coding technique is Pnuc, which is based on the local bending and flexibility properties of the double helix; it is deduced experimentally from nucleosome positioning. Considering the pairing of the two strands (A-T and C-G) along the helix, each base pair defines a plane and a direction in this plane. A description of the double helix shows the overlapping of these planes (Fig. 4). When the planes are considered parallel, passing from one plane to the next requires a translation and a rotation of 34.3° of the orientation of the connection in the plane.

Fig. 4. A description of the double helix shows the overlapping of the planes

In practice the planes are not parallel, and the axis of the double helix presents a curvature. Consider the interaction between a protein (a histone) and a DNA sequence: this interaction is stronger when the contact area between the two objects is largest. To increase this surface, the DNA segment must be rolled up around the protein as much as possible. This leads to two properties. First, if the DNA segment is not rolled up around the protein, it is in an equilibrium position and the curvature is static. Second, the helix must be flexible enough to allow the additional curvature around the protein. These two properties generate the nucleosome, which produces an excessive curvature of the helix.

Each trinucleotide is replaced by its numerical value given by the Pnuc table. The sequence *S<sub>DNA</sub>* is then replaced by the numerical sequence *C<sub>PNUC</sub>*:

*S<sub>DNA</sub>* = 'AATCGCGACACTCATTCGG'

*C<sub>PNUC</sub>* = 0.7 5.3 8.3 7.5 7.5 6.0 5.4 5.2 6.5 5.8 5.4 5.4 6.7 0.7 3.0 8.3 4.7
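The Pnuc substitution of Section 3.1.2 is a table lookup over overlapping trinucleotides. In the sketch below, `PNUC_PARTIAL` is not the full Pnuc table: it contains only the values that can be read off the worked example in the text, and the names `PNUC_PARTIAL` and `pnuc_code` are hypothetical. A step of one base between trinucleotides is assumed, which matches the 17 values given for the 19-base example.

```python
# Partial table: only the trinucleotide values recoverable from the worked
# example; the complete Pnuc table has 64 entries and is not reproduced here.
PNUC_PARTIAL = {
    "AAT": 0.7, "ATC": 5.3, "TCG": 8.3, "CGC": 7.5, "GCG": 7.5, "CGA": 6.0,
    "GAC": 5.4, "ACA": 5.2, "CAC": 6.5, "ACT": 5.8, "CTC": 5.4, "TCA": 5.4,
    "CAT": 6.7, "ATT": 0.7, "TTC": 3.0, "CGG": 4.7,
}

def pnuc_code(sequence, table):
    """Replace each overlapping trinucleotide by its table value.

    Raises KeyError if a trinucleotide is missing from `table`.
    """
    return [table[sequence[n:n + 3]] for n in range(len(sequence) - 2)]

c_pnuc = pnuc_code("AATCGCGACACTCATTCGG", PNUC_PARTIAL)
print(c_pnuc)
# [0.7, 5.3, 8.3, 7.5, 7.5, 6.0, 5.4, 5.2, 6.5, 5.8, 5.4, 5.4, 6.7, 0.7, 3.0, 8.3, 4.7]
```

Unlike the binary indicators, this coding yields a single real-valued signal per sequence, which can be fed to the same windowed spectral analysis.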



