**1. Introduction**

Many DNA biotechnological applications, such as PCR or cDNA expression profiling, depend on thermodynamic parameters that are sequence dependent; the strand melting temperature is one example. In general, physical properties of DNA or RNA sequences can be calculated in a very simple form with algorithms based on nearest-neighbor (NN) models, whose core characteristic is to provide linear representations for experimental measurements on nucleotide chains, always in terms of pairwise (dimer) sequence contributions.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

However, NN dimer parameters cannot be assigned from experiments by solving a set of simultaneous linear equations. This has been known since the beginning of the development of these models in the context of polynucleotide thermodynamic studies [1]. In fact, once intrinsic composition closure constraints are taken into account, the number of degrees of freedom of the model is effectively reduced.

Dimer occurrence relations are well known and allow sequence properties to be decomposed into dimer contributions. For this reason, many authors have preferred dimers as fundamental units, since they provide the most straightforward decomposition scheme [2–6]. Although dimer set values fit easily into the theoretical NN model approximation, the dimer composition is overstated: the dimer set size, which is 16 for a single chain and 10 for double chains [2–7], is greater than the number of degrees of freedom of the problem. As a corollary, so-far-unknown constraints must link the full dimer set properties in some hidden way to restore the unity of the full set. Alternative approaches have considered decompositions into irreducible, and hence smaller, sets of short sequences or dimer combinations [8–11]. Because such renderings are arbitrary, comparing sets from different laboratories and interpreting set values physically become difficult tasks. The extraction of simpler and more direct dimer contributions from such sets has remained an ill-posed problem with no unique solution, yet the dimer formulation is still embraced by a large community of biochemists [2–6]. To retain the dimer set formulation, different authors have adopted ad hoc regularization hypotheses, such as the singular value decomposition method [4, 12].
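The dimer bookkeeping discussed above can be illustrated with a minimal Python sketch. It enumerates the 16 single-strand dimers, folds them into the 10 duplex classes obtained by identifying each dimer with its reverse complement, and checks one simple closure constraint on a circular chain. The example sequence and the function names are illustrative assumptions, and the circular closure is used only because it makes the constraint exact; for a linear chain the two counts differ by end terms.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the Watson-Crick reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def duplex_class(dimer: str) -> str:
    """Canonical label shared by a dimer and its reverse complement."""
    return min(dimer, reverse_complement(dimer))


def dimer_counts(seq: str, circular: bool = False) -> Counter:
    """Count nearest-neighbor dimers along one strand, read 5' to 3'."""
    chain = seq + seq[0] if circular else seq
    return Counter(chain[i:i + 2] for i in range(len(chain) - 1))


# The 16 single-strand dimers collapse to 10 duplex classes.
all_dimers = ["".join(p) for p in product("ACGT", repeat=2)]
duplex_classes = sorted({duplex_class(d) for d in all_dimers})

# Closure constraint on a circular chain: for every base X, the number of
# dimers starting with X equals the number of dimers ending with X.
example = "ATGCGCTTAGGC"  # arbitrary illustrative sequence, not experimental data
counts = dimer_counts(example, circular=True)
starts, ends = Counter(), Counter()
for dimer, n in counts.items():
    starts[dimer[0]] += n
    ends[dimer[1]] += n

print(len(all_dimers), len(duplex_classes), starts == ends)
```

Because relations of this kind tie the dimer counts together, the counts are not independent variables, which is one way to see why dimer parameters cannot all be fixed by solving linear equations.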

In this review, among other objectives, we present an approach to this problem based on the analysis of how the intrinsic intermolecular symmetries of nucleotides contribute to the structure of NN sets, as proposed by Licinio and Guerra [13]. To this end, we first introduce a general quantum mechanics statement giving the physical properties of a sequence of heterogeneous molecules, treated as subsystems that can assume any of a given complete set of molecular states. The four-nucleotide set has a corresponding four-state representation. At this point, the number of degrees of freedom is carefully chosen in order to project the representation into a three-dimensional molecular class space. Conveniently, the three independent molecular classes are readily associated with the main biochemical classification of nucleotides into purine–pyrimidine, amino–keto, and strong–weak bases. The representation of the four-nucleotide set as a tetrahedron in three-dimensional space is at the heart of the approach [13]. This representation has been used to generate DNA walks for sequence composition analysis or display, and the corresponding proper space metrics have recently been used for phylogenetic sequence comparisons [14]. We then contract the original quantum mechanics statement into an irreducible formulation using the four-nucleotide tetrahedron representation. This symmetrical molecular decomposition is found to provide the right number of fundamental properties (free parameters), which is 8 for DNA double strands. We shall refer to these fundamental properties as a symmetrical set of irreducible tensorial parameters. Next, we relate this decomposition to the dimer set formulation. The comparison uncovers useful and so far hidden self-consistency relations among dimers.

184 Nucleic Acids - From Basic Aspects to Laboratory Tools

However, an important point still needs to be clarified. Many publications present datasets that include experimental values for duplex oligonucleotides, for which end effects were believed to be important [2–6]. Nevertheless, such initiation and termination parameters seem to be very sensitive to the modeling and have changed considerably even within the same research group [3–6]. Indeed, Xia et al. had already argued that data from melting experiments of RNA duplexes are of insufficient accuracy to distinguish end effects [15]. With this motivation, as a second step in the development of the approach presented in this review, we proposed to extend the irreducible model in order to investigate how it would accommodate end effects. Guerra and Licinio performed this extension and calculated the irreducible parameters for free energy, entropy, and enthalpy, together with the respective end contributions [16]. A detailed algorithm for performing such calculations is described later. At this point, however, it is necessary to anticipate some conclusions. Guerra and Licinio obtained values for the end effects with relatively large errors. In addition, specifically for free energy, they could not distinguish between weak and strong terminal base pairs. In the light of these findings, a simple statistical mechanics approach applied to the melting transition shows that the treatment of end effects within the NN approach is naive, even heuristic. In fact, since the end effects were initially (and wrongly) identified with the nucleation free energies, they should depend on the mean global composition of the chain. However, an only slightly more detailed statistical mechanics approach shows that, in addition to the eight (polymeric) irreducible parameters for free energy already mentioned, there are two further parameters related to the initiation of the double helix (associated with the two possible base pairings). That is, within the NN approach, there are 10 parameters in terms of which the free energy of any DNA oligomer is expanded [17].

Before proceeding to the forthcoming sections, we note that all the theoretical results we obtained were applied to the analysis of DNA free energy. Initially, end contributions are introduced into the model, as will be presented later in this chapter. A simple statistical mechanics approach is then applied to the problem, yielding a second set of parameters that this time includes the initiation parameters. A self-consistent set has thus been fit to free energy data from 108 short duplex oligomer sequences available in the literature. We will show that, with both models (the first based on end effects and the second on double helix initiation parameters), the more compact and symmetrical self-consistent set provides at least as good a model for oligomer free energy as standard NN dimer models. The far-reaching strength of the theoretical modeling framework that we propose for DNA or RNA sequences resides in its compactness and symmetry. As will be discussed later in this review, one immediate and practical consequence of using the tetrahedral model is the disclosure of the initially hidden dimer self-consistency relations.

## **2. A quantum mechanics formulation for sequence properties**

Complexity in biological phenomena represents an enormous challenge and a rich field for the application and development of physical methods. To unfold simple biopolymer phenomena, we start from a biochemically meaningful representation of nucleotides in terms of molecular classes and rely on the tools provided by quantum mechanics. Here, we use the quantum mechanics formulation based on the matrix representation. What is needed from the start is a base set for the description of the states of the system, which, for us, is a DNA or RNA sequence. The ensemble of sequence states is given by the allowable sequence compositions alone. We want to describe and isolate gross composition states. Inner electronic states or molecular conformation contributions, which would require a much finer level of quantum description, are intrinsically averaged out at this level. State transitions are of course forbidden if one neglects mutations. The sequence state is given in terms of its molecular constitution, and the nucleotide set representation conditions the sequence representation.

The quantum mechanics expectation for any observable is given in terms of the corresponding operator *Θ* and system state |*Ψ*⟩ as ⟨*Ψ*|*Θ*|*Ψ*⟩, in Dirac's notation. The state of a system comprising *N* particles or molecules is usually expressed as the tensorial product of their component states |*b*(*i*)⟩, (1 ≤ *i* ≤ *N*):

$$\left| \Psi \right\rangle = \left| b(1) \right\rangle \otimes \left| b(2) \right\rangle \otimes \dots \otimes \left| b(N) \right\rangle = \left| b(1); b(2); \dots; b(N) \right\rangle \tag{1}$$

For *d*-dimensional component states, this would lead a priori to the specification of (*Nd*)² operator matrix elements *Θ*<sub>*μ*(*i*)*ν*(*j*)</sub>. If the interaction range is limited, however, many off-diagonal matrix elements become null, and a reduced formulation can be sought. Considering only sequential NN interactions, the expectation can thus be written simply as

$$\mathbf{E} = \sum_{i} \left\langle b(i); b(i+1) \middle| \Theta \middle| b(i); b(i+1) \right\rangle \tag{2}$$

Here, submatrix elements pertaining to the same component at position *i* (diagonal or self-matrices, *Θ*<sub>*μ*(*i*)*ν*(*j*)</sub> with *j* = *i*), which are internal to the sequence (*i* ≠ 1, *N*), should be halved because they are counted twice in this formulation (see Fig. 1). We expect that further reduction of this development can be obtained by considering the implicit symmetries of the Hermitian *Θ* matrix and its invariants under orthonormal base representations.
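The halving rule for internal self-matrices can be checked numerically with a small Python sketch. It rests on several assumptions that are not stated in the text: the sequence state is treated as the concatenation of the *N* component vectors, the matrix is real and symmetric with the same self block and the same nearest-neighbor block at every position, and all numerical values are hypothetical illustration values rather than fitted parameters. Under these assumptions, the pairwise sum with halved internal self terms reproduces the quadratic form of the full block-tridiagonal matrix.

```python
def quad(u, m, v):
    """Bilinear form u^T m v for plain Python lists."""
    return sum(u[a] * m[a][b] * v[b] for a in range(len(u)) for b in range(len(v)))


def full_expectation(states, self_block, pair_block):
    """Quadratic form of the whole block-tridiagonal matrix."""
    n = len(states)
    diagonal = sum(quad(b, self_block, b) for b in states)
    # An off-diagonal block and its transpose give the same scalar,
    # hence the factor 2 for each neighboring pair.
    coupling = sum(2 * quad(states[i], pair_block, states[i + 1]) for i in range(n - 1))
    return diagonal + coupling


def nn_expectation(states, self_block, pair_block):
    """Sum over neighboring pairs with internal self terms halved (cf. Eq. 2)."""
    n = len(states)
    if n < 2:
        raise ValueError("at least two components are needed to form a pair")
    weight = [1.0 if i in (0, n - 1) else 0.5 for i in range(n)]
    total = 0.0
    for i in range(n - 1):
        left, right = states[i], states[i + 1]
        total += weight[i] * quad(left, self_block, left)
        total += weight[i + 1] * quad(right, self_block, right)
        total += 2 * quad(left, pair_block, right)
    return total


def nonzero_blocks(n):
    """n diagonal blocks plus 2(n - 1) nearest-neighbor blocks, i.e. 3n - 2."""
    return n + 2 * (n - 1)


# Hypothetical illustration values (d = 3, n = 6); not fitted parameters.
states = [[1, 1, 1], [-1, -1, 1], [-1, 1, -1], [1, -1, -1], [1, 1, 1], [-1, 1, -1]]
self_block = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
pair_block = [[0.5, 0.1, 0.0], [0.0, 0.2, 0.0], [0.3, 0.0, 0.1]]

e_full = full_expectation(states, self_block, pair_block)
e_pairs = nn_expectation(states, self_block, pair_block)
print(abs(e_full - e_pairs) < 1e-9, nonzero_blocks(6))
```

The helper `nonzero_blocks` simply counts the submatrices that survive when only nearest-neighbor interactions are kept, which for six components gives the 3*n*–2 = 16 submatrices quoted for Fig. 1.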

## **3. Nucleotide class-state representation**

The most straightforward representation for a four-nucleotide set is, obviously, a four-dimensional vector. This "independent-nucleotide" representation has been implicitly adopted by many authors and leads to 4 × 4 matrices or 16 parameter sets when considering nucleotide pairwise properties [11]. This representation, however, already overstates the nucleotide composition problem from the beginning. The set representation should be more concisely established in a three-dimensional space. Thus, a complete and symmetrical representation for the usual DNA (or RNA) four-nucleotide set can be given within a tetrahedral decomposition scheme into a three-dimensional orthonormal base set |*x*⟩, |*y*⟩, |*z*⟩. The pure nucleotide states |*b*(*i*)⟩ are given as follows [14]:

$$\left| A \right\rangle = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}; \quad \left| T \right\rangle = \begin{pmatrix} -1 \\ -1 \\ 1 \end{pmatrix}; \quad \left| C \right\rangle = \begin{pmatrix} -1 \\ 1 \\ -1 \end{pmatrix}; \quad \left| G \right\rangle = \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix} \tag{3}$$

**Figure 1.** Structure of an expectation matrix for a sequence of *n* = 6 identical components (molecules in arbitrary states). The components have *d* degrees of freedom represented through *d* orthogonal base states, which result in 3*n*–2 = 16 submatrices of size *d*². In this case, only nearest-neighbor interactions are considered. The matrix above, corresponding to the quantum mechanics formulation of Eq. 1, is Hermitian and periodic, allowing for a more synthetic representation. One periodic module of four submatrices implicit in Eq. 2 has been distinguished by a dashed line. Observe that internal submatrices in the diagonal are counted twice according to the formulation of Eq. 2 [13].

The nucleotides themselves are represented as a nonorthogonal (tetrahedral) vector set of modulus √3 (Fig. 2). The four nucleotide states are not independent and can be expressed in terms of three independent abstract nucleotide class states. In this decomposition, the *z*-component discriminates weak (two bridges, AT) from strong (three bridges, CG) hydrogen bonding in Watson–Crick (WC) pairing; the *x*-component discriminates purine (double ring, AG) from pyrimidine (single ring, CT) nucleotide sizes; and the *y*-component discriminates amino (nitrogen containing, AC) from keto (oxygen containing, GT) nucleotide radicals.

In quantum mechanics language, an |*x*⟩ base state, for example, is a ring number or purine–pyrimidine class state, whereas |*A*⟩ = |*x*⟩ + |*y*⟩ + |*z*⟩ is an adenine molecular state decomposed in terms of proper nucleotide class subspaces. Any pure nucleotide state can thus be represented in terms of molecular class states.
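The properties of the tetrahedral set stated above can be verified directly from the components in Eq. 3. The following Python sketch is only an illustration: the dictionary layout, the class labels, and the helper names are assumptions introduced here. It checks that each vector has squared modulus 3, that distinct vectors are not orthogonal (their scalar product is always −1), that the four vectors sum to zero, and that the sign of each component gives the class membership.

```python
# Eq. 3: tetrahedral nucleotide vectors in the (x, y, z) class basis.
NUCLEOTIDE = {
    "A": (1, 1, 1),
    "T": (-1, -1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
}

# (positive sign, negative sign) labels for each axis; naming is illustrative.
CLASS_LABELS = (
    ("purine", "pyrimidine"),  # x: ring number
    ("amino", "keto"),         # y: nucleotide radical
    ("weak", "strong"),        # z: WC hydrogen bonding
)


def dot(u, v):
    """Scalar product of two 3-component vectors."""
    return sum(a * b for a, b in zip(u, v))


def classes(base):
    """Read the three class memberships from the sign of each component."""
    return tuple(
        positive if component > 0 else negative
        for component, (positive, negative) in zip(NUCLEOTIDE[base], CLASS_LABELS)
    )


squared_moduli = {b: dot(v, v) for b, v in NUCLEOTIDE.items()}
cross_products = {
    (p, q): dot(NUCLEOTIDE[p], NUCLEOTIDE[q])
    for p in NUCLEOTIDE for q in NUCLEOTIDE if p < q
}
vector_sum = tuple(sum(v[k] for v in NUCLEOTIDE.values()) for k in range(3))

for base in "ATCG":
    print(base, classes(base))
```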

**Figure 2.** Orthonormal *x*–*y*–*z* base set and tetrahedral DNA-nucleotide set representation. Each of the three axes distinguishes a specific molecular class feature. Purines are distinguished from pyrimidines through the *x*-coordinate. Amino is distinguished from keto through the *y*-coordinate. Finally, weak WC hydrogen-bridge binding is distinguished from stronger binding through the *z*-coordinate [14].

Each possible nucleotide pair shares one fundamental molecular structural characteristic, forming a group within a given class that is distinguished from the complementary pair, which forms the other group in the same class. This is apparent in Eq. 3, which reflects the intrinsic cubic symmetry of the tetrahedron. From here on, we construct our approach on a complete nucleotide representation, on the basis of which the properties associated with each molecule are decomposed in terms of three differential affinity groups or classes. The choice of a tetrahedral set is thus natural and convenient for its intrinsic orthogonality and symmetry properties, which are related to common molecular group classifications. Its main advantage, however, is that it fulfills the need for a three-dimensional bijective representation of a four-set composition.
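As a hedged illustration of the bijective character of the representation, the Python sketch below builds a DNA walk by accumulating the vectors of Eq. 3 along a sequence and then recovers the four nucleotide counts from the walk endpoint and the sequence length. The inversion formulas follow from solving the four linear relations between counts, length, and endpoint. The example sequence and the function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import accumulate

# Eq. 3: tetrahedral nucleotide vectors in the (x, y, z) class basis.
NUCLEOTIDE = {
    "A": (1, 1, 1),
    "T": (-1, -1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
}


def add(u, v):
    """Component-wise sum of two 3-component vectors."""
    return tuple(a + b for a, b in zip(u, v))


def dna_walk(seq):
    """Cumulative positions of the walk, starting at the origin."""
    steps = (NUCLEOTIDE[b] for b in seq)
    return list(accumulate(steps, add, initial=(0, 0, 0)))


def composition_from_endpoint(length, endpoint):
    """Invert the linear map from nucleotide counts to (length, x, y, z)."""
    x, y, z = endpoint
    return {
        "A": (length + x + y + z) // 4,
        "T": (length - x - y + z) // 4,
        "C": (length - x + y - z) // 4,
        "G": (length + x - y - z) // 4,
    }


example = "ATGCGCTTAGGC"  # arbitrary illustrative sequence
walk = dna_walk(example)
recovered = composition_from_endpoint(len(example), walk[-1])
print(walk[-1], recovered)
```

Dividing the endpoint by the sequence length gives a mean vector whose three components fix the three independent composition fractions, which is the sense in which three dimensions suffice for a four-set composition.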
