
*Chemoinformatic Approach: The Case of Natural Products of Panama*


*DOI: http://dx.doi.org/10.5772/intechopen.87779*


*Cheminformatics and Its Applications*


**Figures 5 and 6.** *3D visualizations of the chemical spaces of natural products and DBK DBs, and of natural products and TCMDC DBs.*

**Figure 7.** *3D visualization of the chemical spaces of natural products, OSM and St. Jude.*

**Figure 8.** *3D visualization of the chemical spaces of all databases.*

**2.5 Molecular scaffolds: content and diversity**

*2.5.1 Scaffold content*

Murcko scaffolds were calculated with the programs Molecular Equivalence Indices (MEQI) [50, 51] and DataWarrior [69]. MEQI was used to obtain the codes of the most frequent chemotypes in the databases [23, 45, 52–55]. The distribution and diversity of the molecular scaffolds in the data sets were analyzed using cyclic system retrieval (CSR) curves [42]. These curves are obtained by plotting the fraction of scaffolds against the fraction of compounds that contain those cyclic systems [43, 44]. Taking the F50 values as reference, **Table 5** indicates that the MMV DB (0.491) was the most diverse in scaffold content, compared with the data sets from GSK (0.183), NPs (0.168), and GNF (0.161). The CSR curves in **Figure 12** further confirm the relative scaffold variety of the eight databases. The area under the curve (AUC) values associated with the CSR curves are also reported in **Table 5**. MMV showed the greatest variety in scaffold content, with an AUC of 0.507 (an AUC close to 0.5 corresponds to high scaffold diversity). In contrast, OSM, NPs, GNF, GSK, St. Jude, and CHEMBL were less diverse, with AUC values of 0.745, 0.712, 0.705, 0.698, 0.655, and 0.607, respectively. The CSR curves provide information on the diversity of the most frequent scaffolds in all databases.
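To make the CSR metrics concrete, the following is a minimal sketch of how a CSR curve, its AUC, and F50 could be derived from a list of scaffold identifiers (one per compound). It is not the MEQI or DataWarrior implementation; the function names, the trapezoidal integration, and the non-interpolated F50 are assumptions made for illustration.

```python
from collections import Counter


def csr_curve(scaffold_ids):
    """Return CSR points (fraction of scaffolds, fraction of compounds recovered).

    Scaffolds are ranked from most to least populated (assumed convention).
    """
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds = len(counts)
    n_compounds = sum(counts)
    points = [(0.0, 0.0)]
    recovered = 0
    for rank, count in enumerate(counts, start=1):
        recovered += count
        points.append((rank / n_scaffolds, recovered / n_compounds))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def f50(points):
    """Smallest fraction of scaffolds that recovers at least 50% of the compounds."""
    for x, y in points:
        if y >= 0.5:
            return x
    return 1.0


# Hypothetical toy data: 10 compounds spread over 4 scaffolds.
toy = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
curve = csr_curve(toy)
print(auc(curve), f50(curve))
```

In this sketch, a data set in which every compound has its own scaffold gives the diagonal curve with an AUC of 0.5, whereas a data set dominated by a few scaffolds pushes the AUC toward 1 and F50 toward 0.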

**Figure 9.** *Curve for cumulative frequency distribution (CFD) based on ECFP-4.*

**Figure 10.** *Curve for cumulative frequency distribution based on MACCS keys.*

**Figure 11.** *Curve for cumulative frequency distribution based on PubChem.*

*2.5.2 Shannon entropy (SE) and scaled Shannon entropy (SSE)*

The Shannon entropy (SE) has been adapted to measure scaffold diversity based on the **N** most populated scaffolds [70]. The scaled Shannon entropy (SSE) is a normalized value that describes how the compounds are distributed among the most common chemotypes present in a database. Thus, an SSE close to 1 indicates higher scaffold diversity, whereas an SSE close to 0 indicates lower diversity. In this study, we calculated the SSE for values ranging from **N** = 10 to **N** = 40.
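As an illustration of the entropy measures, the sketch below assumes the commonly used definitions: SE = -Σ p_i log2 p_i over the N most populated scaffolds, where p_i is the fraction of compounds (among those N scaffolds) that carry scaffold i, and SSE = SE / log2 N. The function names and the toy counts are hypothetical; the source does not state how the authors computed these values.

```python
import math


def shannon_entropy(counts):
    """SE = -sum(p_i * log2(p_i)), with p_i the relative population of scaffold i."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def scaled_shannon_entropy(scaffold_counts, n):
    """SSE over the n most populated scaffolds: SE divided by its maximum, log2(n)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    top = sorted(scaffold_counts, reverse=True)[:n]
    if len(top) < n:
        raise ValueError("fewer than n scaffolds available")
    return shannon_entropy(top) / math.log2(n)


# Hypothetical scaffold populations (number of compounds per scaffold).
even = [10] * 10                # compounds spread evenly over 10 scaffolds
skewed = [97, 1, 1, 1]          # one scaffold dominates
print(scaled_shannon_entropy(even, 10))
print(scaled_shannon_entropy(skewed, 4))
```

Under these assumptions, an even distribution yields an SSE of 1, while a data set dominated by a single scaffold yields a value near 0, which matches the interpretation given in the text.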

**Figure 12.** *Cyclic system retrieval curves for all databases evaluated in this study.*

**Table 2.** *The statistical values of the similarity of the Tanimoto coefficient with ECFP-4.*

**Similarity ECFP-4/Tanimoto coefficient**

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| GSK | 0.07813 | 0.25682 | 0.33333 | 0.37009 | 0.45581 | 0.92683 |
| NPs | 0.00000 | 0.34426 | 0.43636 | 0.44673 | 0.54545 | 1.00000 |
| OSM | 0.00000 | 0.34483 | 0.43636 | 0.44693 | 0.54545 | 1.00000 |
| MMV | 0.00000 | 0.34483 | 0.43636 | 0.44677 | 0.54412 | 1.00000 |
| ST JUDE | 0.00000 | 0.33333 | 0.41250 | 0.42313 | 0.50000 | 1.00000 |
| GNF | 0.00000 | 0.31746 | 0.39437 | 0.39999 | 0.47619 | 1.00000 |

**Table 3.** *The statistical values of the similarity of the Tanimoto coefficient with MACCS keys.*

**Similarity MACCS keys/Tanimoto coefficient**

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| GSK | 0.01724 | 0.05789 | 0.08844 | 0.11490 | 0.12245 | 0.82353 |
| NPs | 0.00000 | 0.07826 | 0.09910 | 0.10565 | 0.12389 | 1.00000 |
| OSM | 0.00000 | 0.07826 | 0.09917 | 0.10607 | 0.12397 | 1.00000 |
| MMV | 0.00000 | 0.07826 | 0.09924 | 0.10615 | 0.12403 | 1.00000 |
| ST JUDE | 0.00000 | 0.08197 | 0.10345 | 0.10980 | 0.12857 | 1.00000 |
| GNF | 0.00000 | 0.08209 | 0.10345 | 0.10772 | 0.12739 | 1.00000 |

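The similarity statistics reported in the tables (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) can be illustrated with a short sketch. It assumes that each fingerprint is represented as the set of its "on" bit positions and that all pairwise Tanimoto coefficients within a database are summarized; the quartile convention (`method="inclusive"`) and the toy fingerprints are assumptions, since the source does not state how the statistics were computed.

```python
import statistics
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


def pairwise_similarities(fingerprints):
    """Tanimoto coefficient for every unordered pair of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


def six_number_summary(values):
    """Min., quartiles, mean, and max. of a list of similarity values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical fingerprints given as sets of on-bit indices.
fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}, {7, 8}]
print(six_number_summary(pairwise_similarities(fps)))
```

A minimum of 0 arises when at least one pair shares no bits, and a maximum of 1 when at least one pair has identical fingerprints, which is consistent with the 0.00000 and 1.00000 entries that appear for most databases.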

**Figure 13** shows a histogram of the distribution of the 40 most populated scaffolds in the NPAs, together with the corresponding chemotype codes. Comparison of the NPA scaffolds identified the 68MBD chemotype among the most active compounds in this database.


**Table 4.** *The statistical values of the similarity of the Tanimoto coefficient with PubChem.*

**Table 5.** *Summary of the scaffold diversity of the eight databases analyzed in this work.*

*M = number of molecules in the DB, N = number of chemotypes or substructures, FN/M = chemotype diversity fraction, NSING = number of singletons, FNSING/M = fraction of singletons among total molecules, FNSING/N = fraction of singletons among total chemotypes, AUC = area under the curve, F50 = fraction of chemotypes required to recover 50% of the molecules.*
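The count-based quantities defined for Table 5 can be sketched from a list of scaffold identifiers, one per molecule. This is an illustrative reading of the definitions, not the authors' code: it assumes that FN/M is N divided by M and that a singleton is a chemotype represented by exactly one molecule; the function name and the toy data are hypothetical.

```python
from collections import Counter


def scaffold_summary(scaffold_ids):
    """Count-based scaffold diversity metrics for one database."""
    counts = Counter(scaffold_ids)
    m = len(scaffold_ids)                                   # M: molecules
    n = len(counts)                                         # N: distinct chemotypes
    n_sing = sum(1 for c in counts.values() if c == 1)      # NSING: singletons (assumed definition)
    return {
        "M": m,
        "N": n,
        "FN/M": n / m,             # assumed to be N / M
        "NSING": n_sing,
        "FNSING/M": n_sing / m,
        "FNSING/N": n_sing / n,
    }


# Hypothetical toy database: 10 molecules, 4 chemotypes, 2 of them singletons.
toy = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
print(scaffold_summary(toy))
```

Higher values of these fractions indicate that the molecules are spread over more chemotypes, which complements the AUC and F50 values derived from the CSR curves.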
