**3. Gene signature databases related to drugs**

Gene signature databases of drugs and compounds are fundamental resources because they define the search space for drug repositioning. Researchers have long sought to enlarge the library of gene signatures for drugs and compounds. For example, they have profiled many bioactive compounds and ligands, such as growth factors and cytokines, which are not drugs but have known functions [8–10]. Many drug-related data resources exist, and their data come mainly from two sources. The first is public data scattered across repositories such as GEO. These data require manual curation by professional researchers before they form a usable dataset for drug repositioning, and there is a trend toward more advanced curation of GEO metadata [34]. The second is large projects, such as CMap, that aim to create a reference dataset of gene signatures for drug development.

NCBI GEO [35], EMBL-EBI ArrayExpress [36] and NGDC Gene Expression Nebulas [37] store massive amounts of omics data, including many transcriptome datasets for drugs and other compounds. However, researchers must search, collect and tidy these data before using them for drug repositioning. Fortunately, several groups have already collected multigene expression signatures related to drugs.
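To make the "tidy" step more concrete, the following is a minimal sketch of how a drug signature might be derived once treated and control samples have been collected. It assumes log2-scale expression values and ranks genes by the difference of group means. The function name `derive_signature`, the gene labels and all values are hypothetical, and real pipelines would normally use a proper differential expression method rather than a simple mean difference.

```python
from statistics import mean


def derive_signature(treated, control, top_n=1):
    """Return (up_genes, down_genes) ranked by mean log2 difference.

    treated / control: dict mapping gene symbol -> list of log2 expression
    values. Only genes present in both groups are considered.
    """
    diffs = {
        gene: mean(treated[gene]) - mean(control[gene])
        for gene in treated
        if gene in control
    }
    # Most up-regulated genes first.
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    up_genes = ranked[:top_n]
    # Most down-regulated genes first.
    down_genes = ranked[-top_n:][::-1]
    return up_genes, down_genes


# Hypothetical log2 expression values for four placeholder genes.
treated = {
    "GENE_A": [8.0, 8.2],
    "GENE_B": [5.0, 5.2],
    "GENE_C": [6.0, 6.0],
    "GENE_D": [3.0, 3.2],
}
control = {
    "GENE_A": [6.0, 6.2],
    "GENE_B": [5.0, 5.0],
    "GENE_C": [7.5, 7.5],
    "GENE_D": [4.0, 4.0],
}

up, down = derive_signature(treated, control, top_n=1)
print(up, down)  # ['GENE_A'] ['GENE_C']
```

The resulting pair of up- and down-regulated gene lists is the general shape of signature that the curated collections described next provide ready-made.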

CREEDS (CRowd Extracted Expression of Differential Signatures) extracted and analyzed the signatures of 875 drugs and 828 diseases from GEO via a crowdsourcing project set up as a massive open online course on Coursera [38]. The dataset can be downloaded from the website, https://maayanlab.cloud/CREEDS/.

#### *Gene Signature-Based Drug Repositioning DOI: http://dx.doi.org/10.5772/intechopen.101377*

HERB (http://herb.ac.cn) is a high-throughput experiment database of traditional Chinese medicine. It covers 7263 herbs and 49,258 ingredients and draws on 472 high-throughput GEO datasets, providing a complementary and valuable drug resource [39].

CMap version 1 (https://portals.broadinstitute.org/cmap/) consists of 6100 Affymetrix-based gene signatures of 1309 compounds applied to five cell lines (including PC3, MCF7 and HL60) at varying doses (mainly 10 μM). Notably, the original article published in *Science* described 164 distinct perturbagens, including approved drugs and nondrug bioactive compounds [8]. This dataset has stimulated the rapid development of drug repositioning, as indicated by its high citation count (more than 1800 citations), which points to the great value and success of a large-scale community Connectivity Map project.

CMap version 2 (https://clue.io/cmap), part of the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) program, includes 1.3 million L1000 profiles and 25,200 unique perturbations across a variety of cell lines [9]. The L1000 technology was chosen to reduce cost. Based on a comprehensive comparison, its developers argued that about 1000 landmark genes can recover 82% of the information in the full transcriptome [9]. As expected, the updated dataset has also driven the continued development of drug repositioning. It should be noted, however, that the consistency between the two versions of CMap is limited, with low recall [40]. Drug repositioning based on CMap should therefore draw on other evidence to filter out false positives.
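As an illustration of what "low recall" between two reference datasets can mean, the sketch below computes the fraction of compounds retrieved from one version that are also retrieved from the other for the same query. This is only one plausible way to quantify consistency. The text does not state how [40] defined recall, and the hit lists here are hypothetical, chosen purely for the arithmetic.

```python
def recall(reference_hits, candidate_hits):
    """Fraction of reference hits that also appear among candidate hits."""
    reference = set(reference_hits)
    if not reference:
        raise ValueError("reference_hits must not be empty")
    return len(reference & set(candidate_hits)) / len(reference)


# Hypothetical top-ranked compounds returned for the same query signature
# by two versions of a reference dataset (not real query results).
cmap1_hits = ["vorinostat", "trichostatin A", "sirolimus", "wortmannin"]
cmap2_hits = ["vorinostat", "sirolimus", "geldanamycin"]

print(recall(cmap1_hits, cmap2_hits))  # 0.5
```

In this toy case only two of the four version 1 hits reappear in version 2, so half of the candidates would be lost or contradicted when switching datasets, which is why additional evidence is useful for filtering.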

In summary, the availability of large numbers of drug gene signatures provides the big-data basis that makes gene signature-based drug repositioning possible. Meanwhile, researchers are still developing new transcriptome technologies that would allow large-scale profiling of millions of drugs across different cell lines and doses at a relatively low cost. In addition, as the cost of conventional RNA-seq falls, it may soon become feasible to use RNA-seq directly.
