#### *3.1.2.2 Hierarchical cluster analysis (HCA) technique*

Together with PCA, this technique has become an important tool in pattern recognition [67]. Its purpose is to display the data in a two-dimensional space in a way that emphasizes their natural clusters and patterns. The results are presented as dendrograms. In the HCA technique, the distances between objects or variables are calculated and converted into a similarity index, which ranges from zero (no similarity and a large distance between objects) to one (identical objects).
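The mapping from distance to similarity can be illustrated with a minimal sketch. It assumes the commonly used definition S_ij = 1 − d_ij/d_max with Euclidean distances; the chapter does not state the exact formula, and the function names and descriptor values below are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two objects given as descriptor tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_matrix(objects):
    """Return S[i][j] = 1 - d_ij / d_max for every pair of objects (assumed definition)."""
    n = len(objects)
    dist = [[euclidean(objects[i], objects[j]) for j in range(n)] for i in range(n)]
    d_max = max(dist[i][j] for i, j in combinations(range(n), 2))
    if d_max == 0:
        raise ValueError("all objects are identical; similarity is undefined")
    return [[1.0 - dist[i][j] / d_max for j in range(n)] for i in range(n)]


# Hypothetical descriptor values for four objects (not taken from the chapter).
objects = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0), (0.0, 0.0)]
S = similarity_matrix(objects)
```

Here objects 0 and 3 are identical and receive a similarity of one, while objects 0 and 2 form the most distant pair and receive a similarity of zero. A dendrogram is then built by repeatedly merging the most similar objects or clusters.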

#### *3.1.2.3 K-nearest neighbor (KNN) technique*

The KNN technique [67] classifies objects by comparing the distances among them. The multivariate Euclidean distances between every pair of objects with known class membership are calculated, and the K closest objects are used to build the model. The optimal K is determined by cross-validation applied to the training set objects. A test object is classified according to its multivariate distance to the K nearest objects in the training set. This technique makes no assumption about the size and shape of the training set classes.
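A minimal sketch of this procedure follows. It assumes a majority vote among the K nearest training objects and leave-one-out cross-validation for choosing K; both are common choices, not details confirmed by the chapter. The function names, class labels, and data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Multivariate Euclidean distance between two objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, x, k):
    """Assign x to the majority class among its k nearest training objects."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out cross-validation accuracy on the training set for a given k."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


def best_k(X, y, candidates=(1, 3, 5)):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(X, y, k))


# Hypothetical two-descriptor training set with two well-separated classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = ["less active"] * 4 + ["more active"] * 4
predicted = knn_predict(X, y, (5.5, 5.5), k=3)
```

Because the vote depends only on which training objects are nearest, nothing is assumed about the shape or size of the classes.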

#### *3.1.2.4 Stepwise discriminant analysis (SDA) technique*

This technique separates objects from distinct populations and allocates new objects to previously defined populations. It uses a stepwise procedure in which, at each step, the most powerful variable is entered into the discriminant function. The SDA technique relies on the F-test for the significance of variables: at each step, it selects a variable based on its significance, and after several steps, the most significant variables are extracted from the set in question [20, 68].
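The role of the F-test can be sketched for the first step only. The example below ranks variables by their univariate one-way ANOVA F statistic; a full stepwise procedure would recompute a partial F for the remaining variables after each entry, which is not shown here. The function names and data are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F for a single variable; groups is a list of lists of values."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_variables(X, y):
    """Rank column indices by decreasing F, i.e. by discriminating power."""
    classes = sorted(set(y))
    scores = {}
    for j in range(len(X[0])):
        groups = [[row[j] for row, label in zip(X, y) if label == c] for c in classes]
        scores[j] = f_statistic(groups)
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical data: variable 0 separates the classes, variable 1 does not.
X = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (7.0, 4.0), (8.0, 5.0), (9.0, 3.0)]
y = ["less active"] * 3 + ["more active"] * 3
order, scores = rank_variables(X, y)
```

In this toy set, variable 0 has a large F because its class means differ strongly relative to the within-class scatter, so it would be the first candidate to enter the discriminant function.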

*Molecular Electrostatic Potential and Chemometric Techniques as Tools to Design Bioactive… DOI: http://dx.doi.org/10.5772/intechopen.89113*

#### *3.1.2.5 Soft independent modeling of class analogy (SIMCA) technique*

The SIMCA technique develops principal component models for each training set category. Its main objective is the reliable classification of new samples. When a prediction is made with the SIMCA technique, new samples that are insufficiently close to the PC space of a class are considered nonmembers of that class. The technique requires that each training sample be pre-assigned to one of *Q* different categories, where *Q* is typically greater than one. It provides three possible prediction outcomes: the sample fits only one pre-defined category, the sample does not fit any of the pre-defined categories, or the sample fits more than one pre-defined category [67].
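The three possible outcomes can be expressed as a small decision rule. The sketch below assumes that a distance from the sample to each class model and a critical distance per class are already available; how these are computed from the PC models is not shown, and all names and numbers are hypothetical.

```python
def simca_outcome(distances, critical):
    """Return the accepting classes and the outcome type for one sample.

    distances: {class name: distance of the sample to that class's PC model}
    critical:  {class name: largest distance still accepted as membership}
    """
    members = [c for c in distances if distances[c] <= critical[c]]
    if len(members) == 1:
        outcome = "fits one category"
    elif not members:
        outcome = "fits no category"
    else:
        outcome = "fits more than one category"
    return members, outcome


# Hypothetical critical distances for two classes.
critical = {"more active": 1.5, "less active": 1.2}

one = simca_outcome({"more active": 0.8, "less active": 3.0}, critical)
none = simca_outcome({"more active": 2.4, "less active": 3.0}, critical)
both = simca_outcome({"more active": 0.8, "less active": 1.0}, critical)
```

Unlike a method that always assigns the nearest class, this rule can leave a sample unassigned or flag it as ambiguous.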

#### *3.1.3 Computers, software, compounds, and molecular descriptors*

For the present chapter, we performed the molecular calculations with the Gaussian 98 program package [69] on an AMD PHENOM 955 X4 2.2 GHz processor with 4 GB of RAM. The MEP was computed from the electronic density, and the maps were displayed using the MOLEKEL software [70], while the PR models were built on a PC Pentium machine with the Pirouette program [71].

**Figure 1** shows the 2D structure of the 5-nitrofuran-2-aldoxime molecule [72] used in the selection of the method/basis set (see Section 3.1.3.1). **Figures 2** and **3** display the 2D structures of the nitrofuran compounds from the training [73–75] and prediction sets, respectively. In this work, the nitrofuran molecules were defined as more active against *T. cruzi* when the in vitro growth rate inhibition (GR) of *T. cruzi* was ≥ 75, and as less active when it was < 75.
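The activity classes used as the dependent variable follow directly from this cutoff. A minimal sketch, with a hypothetical function name and example values:

```python
GR_THRESHOLD = 75  # growth rate inhibition cutoff stated in the text


def activity_class(gr):
    """Label a compound from its in vitro growth rate inhibition (GR) of T. cruzi."""
    return "more active" if gr >= GR_THRESHOLD else "less active"


# Hypothetical GR values, not taken from the chapter.
labels = [activity_class(gr) for gr in (92.0, 75.0, 40.5)]
```

A value exactly at the cutoff is labeled more active, since the definition uses ≥ 75.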

In general, the structure–activity relationship shows that, for compounds **1–6**, increasing the carbon chain length improves the activity against *T. cruzi*. The comparison between compounds **3** and **2** shows that substituting the N atom with O increases the activity. We can also notice that increasing the number of unsaturations and returning the nitrogen to the chain leads to a decrease in biological activity (**7**, **8**). Still in relation to compound **1**, increasing the number of unsaturations, returning the O atom, and increasing the carbon chain length (**9–12**) substantially increase the activity against *T. cruzi*. On the other hand, in compounds **13** and **14**, returning to one unsaturation in the main chain and introducing electron-withdrawing groups and more electronegative atoms decreases the antichagasic activity. The same trend can be verified for compounds **16**, **17**, and **19–22**.

The molecular descriptors were obtained for the most stable conformation of each compound. These descriptors were computed to give information about the influence of electronic, steric, hydrophilic, and hydrophobic features on the antitrypanosomal activity of the studied nitrofurans. The atomic charges in this work were derived from the electrostatic potential obtained with the HF/6-31G method/basis set.

**Figure 1.** *2D molecular structure for 5-nitrofuran-2-aldoxime.*

*Cheminformatics and Its Applications*

where $K$ is the number of nuclei with charges $Z_j$ located at positions $\vec{R}_j$, and $\rho(\vec{r})$ is the electronic charge density. The first term on the right side of Eq. (1) represents the contribution of the nuclei, which is positive; the second term brings in the effect of the electrons, which is negative. In the investigation of the reactive sites of the nitrofuran compounds, the MEP was evaluated with the HF/6-31G method.

#### *3.1.2 PR techniques*

In this section, we briefly present the PR techniques used in this chapter. A deeper and more detailed description of these techniques can be found elsewhere [47–66].

#### *3.1.2.1 Principal component analysis (PCA) technique*

When handling large multivariate data sets, it is essential to find and reduce unknown data trends using exploratory tools. The main idea of the PCA technique is to reduce the dimensionality of a data set consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. This is achieved by transforming the original variables into a new set of variables, the PCs, which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables. As the final result, the PCA technique selects a small number of variables (molecular properties) considered better related to the dependent property or feature [67], in this study, the biological activity against *T. cruzi*.
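The dimensionality reduction performed by PCA can be illustrated with a short sketch that extracts only the first PC by power iteration on the covariance matrix. This is a didactic simplification, not the algorithm used by the Pirouette program; the function names and the two-descriptor data are hypothetical.

```python
import math


def covariance(X):
    """Mean-center the data and return (centered data, covariance matrix)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return centered, cov


def first_pc(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, eigenvalue


# Hypothetical data: two strongly correlated descriptors for five objects.
X = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
centered, cov = covariance(X)
loadings, eigenvalue = first_pc(cov)

# Fraction of the total variance retained by the first PC.
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))

# Scores: coordinates of each object along the first PC.
scores = [sum(r[j] * loadings[j] for j in range(len(loadings))) for r in centered]
```

Because the two descriptors are almost perfectly correlated, a single PC retains nearly all of the variance, so the five objects can be described by one score each instead of two original variables.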

**Figure 2.** *2D molecular structure for nitrofurans (training set).*

As implemented in the Gaussian program package, the charges derived from the electrostatic potential are obtained by fitting a set of atomic point charges so that they reproduce, as closely as possible, the quantum molecular electrostatic potential at a set of points defined around the molecule [76, 77]. These charges have the advantage of being, in general, physically more satisfactory than Mulliken charges [78], especially with regard to biological activity.
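The fitting idea can be written as a least-squares problem. The expression below is a generic sketch in atomic units, not necessarily the exact scheme used in the chapter; the symbols $\Delta$, $V_{\mathrm{QM}}$, $M$, $N$, $q_i$, $\vec{R}_i$, $\vec{r}_k$, and $Q_{\mathrm{total}}$ are introduced here for illustration.

$$
\Delta(q_1,\ldots,q_N) = \sum_{k=1}^{M}\left[ V_{\mathrm{QM}}(\vec{r}_k) - \sum_{i=1}^{N}\frac{q_i}{\left|\vec{r}_k-\vec{R}_i\right|} \right]^2 ,
\qquad \sum_{i=1}^{N} q_i = Q_{\mathrm{total}}
$$

Here $V_{\mathrm{QM}}(\vec{r}_k)$ is the quantum mechanical electrostatic potential at grid point $\vec{r}_k$, $q_i$ is the point charge placed on atom $i$ at position $\vec{R}_i$, and the charges are chosen to minimize $\Delta$ while their sum is constrained to the total molecular charge.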

The quantum–chemical descriptors obtained with the Gaussian 98 program package [69] were the total energy of the molecule (TE), the highest occupied molecular orbital (HOMO) energy, the energy of the orbital one level below the HOMO (HOMO–1), the lowest unoccupied molecular orbital (LUMO) energy, the energy of the orbital one level above the LUMO (LUMO+1), the difference between the HOMO and LUMO energies (gap energy), the total dipole moment (μ), Mulliken's electronegativity (χ), the atomic charge on the Nth atom (QN), molecular hardness (HD), and molecular softness (MS).
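Several of these descriptors follow from the frontier orbital energies alone. The sketch below assumes the usual Koopmans-type approximations (ionization potential ≈ −E_HOMO, electron affinity ≈ −E_LUMO) and defines softness as the reciprocal of hardness; the chapter does not state its formulas, and some authors use 1/(2·hardness) instead. The function name and orbital energies are hypothetical.

```python
def frontier_descriptors(e_homo, e_lumo):
    """Descriptors derived from HOMO and LUMO energies (same energy unit for both)."""
    gap = e_homo - e_lumo                # HOMO energy minus LUMO energy, as worded in the text
    chi = -(e_homo + e_lumo) / 2.0       # Mulliken electronegativity (assumed approximation)
    hardness = (e_lumo - e_homo) / 2.0   # molecular hardness, HD (assumed approximation)
    softness = 1.0 / hardness            # molecular softness, MS (assumed convention)
    return {"gap": gap, "chi": chi, "HD": hardness, "MS": softness}


# Hypothetical orbital energies in hartree, not taken from the chapter.
descriptors = frontier_descriptors(e_homo=-0.40, e_lumo=-0.10)
```

With this sign convention the gap is negative for any molecule whose LUMO lies above its HOMO, while hardness and softness are positive.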

The physicochemical descriptors obtained with the ChemPlus module [79] were total surface area (TSA), molecular volume (VOL), molecular refractivity (MR), and molecular hydration energy (MHE).

Molecular holistic (MH) descriptors were included to represent different sources of chemical information in terms of molecular size, symmetry, and distribution of atoms in the molecules. We also included topological indices, connectivity indices, geometric descriptors, 3D-MoRSE descriptors, and the Moriguchi octanol–water partition coefficient (MlogP). These descriptors were calculated with the Dragon software [80].

#### *3.1.3.1 Theoretical approach and basis set used in the molecular calculations*

In the calculations with the nitrofuran compounds (**Figure 1**), quantum–chemical approaches were used [81–87]. We used Becke's three-parameter hybrid method [81] with the Lee-Yang-Parr (LYP) correlation functional [82] (B3LYP), Becke's 1988 functional with LYP (BLYP) [83], the Hartree-Fock (HF) method [84], the Austin model 1 (AM1) method [85], the Parametric Method Number 3 (PM3) [86], and standard basis sets [87] available in the Gaussian program package. For 5-nitrofuran-2-aldoxime, geometry optimization was carried out with the B3LYP/6-21G, B3LYP/6-21G\*, B3LYP/6-31G, B3LYP/6-31G\*, BLYP/6-21G, BLYP/6-21G\*, BLYP/6-31G, BLYP/6-31G\*, HF/6-21G, HF/6-21G\*, HF/6-31G, and HF/6-31G\* approaches [81–84] and basis sets [87] and with the AM1 and PM3 approaches [85, 86]. The calculations were performed to find the approach and basis set that would present the best compromise between

**Figure 3.** *2D molecular structures for nitrofurans (prediction set).*
