anticipate changes, especially those initiated in their immediate vicinity. Research laboratories and universities that are organized according to the established standards of disciplinary departments can understand an organization's environment. Furthermore, such maps are important to policy analysts and funding agencies. Since research funding should be based on quantitative and qualitative scientific metrics, they usually perform several analyses on the map, combining statistical analysis with careful examination by human experts. However, conventional approaches to understanding research activities focus on what authors have told us about past accomplishments, through inter-citation and co-citation analysis of published research articles. Thus, ongoing projects and recently published articles that do not yet have enough citations have not been analyzed.

Therefore, we propose to analyze them with a content-based method that uses natural language processing (NLP) techniques. Recently, word/paragraph embedding has been proposed for finding relationships between unstructured descriptions. Such embedding techniques represent words and paragraphs as real-valued vectors of several hundred dimensions, and the distances between descriptions are calculated from the similarities between their vectors. Thus, we constructed a new mapping tool that represents recent scientific trends, where nodes represent research projects or articles that are linked according to the distances given by their content similarity. Moreover, we drew a map from approximately 300,000 IEEE articles and National Science Foundation (NSF) projects, and from its chronological changes we obtained some findings regarding the formation processes of research areas.

The remainder of this chapter is organized as follows. Section 2 discusses related work, and Section 3 describes our proposed method for calculating content similarity and its evaluation. Section 4 then introduces our tool, Mapping Science, and in Section 5 we confirm on the map the formation process of research areas such as the Internet of Things. Finally, conclusions and suggestions for future work are provided in Section 6.

**2. Related work**

Maps of Science (http://mapofscience.com/) is a well-known website. Katy et al. also provide the Sci2Tool visualization tools [2] and maps of journals and documents [3]. In Japan, the National Institute of Science and Technology Policy (NISTEP) provides Science Map (http://www.nistep.go.jp/wp/wp-content/uploads/ScienceMapWebEdition2014.html). In such studies, the similarity between journals and articles is calculated from the cosine and/or Jaccard similarity of inter-citation and co-citation counts. These maps promote interdisciplinary research collaboration, but citation analysis cannot be applied to ongoing projects or recently published articles, although project descriptions will eventually include articles among their research results.

Funding agencies and publishers generally have their own classification systems. Projects/articles can have more than one code; thus, interdisciplinary projects can be found by searching for multi-labeled projects. However, even if two projects/articles are assigned the same category, their similarity may not be evident. Moreover, funding agencies and publishers use different categories, and there is no comprehensive scheme for characterizing projects or articles, so they cannot be compared across agencies or publishers. For example, comparing articles under the Association for Computing Machinery classification (https://www.acm.org/publications/class-2012) with those under the Springer Nature classification requires a taxonomy exchange.
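Citation-based maps typically score the relatedness of two articles by the overlap of their citation sets. A minimal sketch of the cosine and Jaccard similarities over such sets (the articles and reference IDs below are invented for illustration):

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of two citation sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: set, b: set) -> float:
    """Cosine similarity of two citation sets viewed as binary vectors."""
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

# Hypothetical articles, each represented by the IDs of references it cites
art1 = {"ref1", "ref2", "ref3", "ref4"}
art2 = {"ref2", "ref3", "ref4", "ref5"}

print(jaccard(art1, art2))  # 3 shared / 5 total = 0.6
print(cosine(art1, art2))   # 3 / sqrt(4 * 4) = 0.75
```

The same measures apply to co-citation, where the sets instead contain the articles that cite a given pair.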

Therefore, several content-based methods have been proposed in the related literature. Previous studies have examined automatic topic classification using probabilistic latent semantic analysis (pLSA) [4] and latent Dirichlet allocation (LDA) [5]. One approach uses LDA to find the five most probable words for a topic, and each document is viewed as a mixture of topics [6]. This approach can classify documents across different agencies and publishers; however, the similarity between projects/articles cannot be computed directly. In this regard, the National Institutes of Health (NIH) Visual Browser [7, 8] (http://nihmaps.org/index.php) computed the similarities between projects as mixtures of classification probabilities over topics based on pLSA, using the average symmetric Kullback-Leibler divergence function [9]. However, this similarity is a combination of probabilities; that is, it is not derived from sentence context. Other studies are likewise based on the similarity between the sets of words (bag-of-words) included in documents, such as term frequency-inverse document frequency (TF-IDF), and do not consider sentence context.
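The topic-mixture similarity used by the NIH Visual Browser can be sketched as an average symmetric Kullback-Leibler divergence between two topic distributions. A minimal illustration (the project topic mixtures below are invented; the actual browser derives them from pLSA):

```python
from math import log2

def sym_kl(p, q):
    """Average symmetric KL divergence between two topic distributions."""
    kl = lambda a, b: sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical topic mixtures for two projects (each sums to 1)
proj_a = [0.7, 0.2, 0.1]
proj_b = [0.6, 0.3, 0.1]

print(sym_kl(proj_a, proj_b))  # small value -> similar topic mixtures
```

Lower values indicate more similar mixtures; the divergence is zero only when the two distributions are identical.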

By contrast, word/paragraph vectors, which are distributed representations of words and paragraphs, are attracting attention in NLP. Assuming that context determines the meaning of a word [10], words appearing in similar contexts are considered to have similar meanings. In its basic form, a word vector is represented as a row of a matrix whose elements are the co-occurrence frequencies between a word *w* with a certain usage frequency in the corpus and the words within a fixed window size *c* of *w*. A popular representation of word vectors is word2vec [11, 12]. Word2vec creates word vectors using a two-layered neural network trained with a skip-gram model and negative sampling. Specifically, word vectors are obtained by maximizing the objective function *L* in Eq. (1), where *T* is the number of words with a certain usage frequency in the corpus. Word2vec clusters words with similar meanings in a vector space.

$$L = \frac{1}{T} \sum\_{t=1}^{T} \sum\_{-c \le j \le c,\, j \ne 0} \log p(w\_{t+j} \mid w\_t) \tag{1}$$
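The skip-gram model behind word2vec is trained on (center, context) word pairs drawn from a window of size *c* around each word. A minimal sketch of how such pairs are enumerated (toy sentence; negative sampling and the neural network itself are omitted):

```python
def skipgram_pairs(tokens, c=2):
    """Enumerate (center, context) pairs within a window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(max(0, t - c), min(len(tokens), t + c + 1)):
            if j != t:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sent = ["word", "vectors", "cluster", "similar", "meanings"]
print(skipgram_pairs(sent, c=1))
# [('word', 'vectors'), ('vectors', 'word'), ('vectors', 'cluster'), ...]
```

Each pair contributes one log-probability term to the objective in Eq. (1); the model adjusts the vectors so that context words become predictable from their centers.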

In addition, Le and Mikolov [13] proposed the paragraph vector, which learns fixed-length feature representations from variable-length pieces of text such as sentences, paragraphs, and documents using a two-layered neural network. A paragraph vector is treated as another word in a paragraph and is shared across all contexts generated from the same paragraph, but not across paragraphs. The contexts are of fixed length and are sampled from a sliding window over the paragraph. The paragraph vectors are computed by fixing the word vectors and training the new paragraph vector until convergence, as shown in Eq. (2).

$$L = \sum\_{t=1}^{T} \log p(w\_t \mid w\_{t-c}, \dots, w\_{t+c}, d\_i) \tag{2}$$

where *d<sub>i</sub>* is the vector for a paragraph *i* that includes *w<sub>t</sub>*. Whereas word vectors are shared across paragraphs, paragraph vectors are unique to their paragraphs and represent the topics of the paragraphs. By considering word order, paragraph vectors also address the weaknesses of the bag-of-words models used in LDA and pLSA. Therefore, paragraph vectors are considered more accurate representations of the context of the content. We can then input the resulting vectors into analyses using machine learning and clustering techniques to find similar articles in different academic subjects, as well as relationships between projects from different agencies. Thus, in this study, we converted the natural-language sentences in project descriptions and article abstracts into paragraph vectors.
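Once descriptions are embedded as paragraph vectors, finding the most similar project or article reduces to a nearest-neighbor search under cosine similarity. A minimal sketch with invented 3-dimensional vectors (real paragraph vectors have several hundred dimensions, and the document names are hypothetical):

```python
from math import sqrt

def cos_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical paragraph vectors (toy 3-d; real ones are far larger)
docs = {
    "nsf_project": [0.9, 0.1, 0.2],
    "ieee_article_a": [0.8, 0.2, 0.3],
    "ieee_article_b": [0.1, 0.9, 0.4],
}
query = docs["nsf_project"]
nearest = max((d for d in docs if d != "nsf_project"),
              key=lambda d: cos_sim(query, docs[d]))
print(nearest)  # ieee_article_a
```

Because the vectors live in one shared space, the same search works across agencies and publishers without any taxonomy exchange.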

also exists in W3C Resource Description Framework (RDF, https://www.w3.org/RDF/) format with the semantic relationships SKOS:broader, SKOS:narrower, and SKOS:related. A broader or narrower relationship essentially represents an *is-a* subsumption relationship, but sometimes denotes a *part-of* relationship in geography, body organ terminology, and other academic disciplines. The JST thesaurus is publicly accessible from Web APIs on the J-GLOBAL website (http://jglobal.jst.go.jp/en/), along with the visualization tool Thesaurus Map (http://thesaurus-map.jst.go.jp/jisho/fullIF/index.html). We then calculate the information entropy of each concept in the JST thesaurus from the dataset. Shannon's entropy in information theory is an estimate of event informativeness, and we used it to measure the semantic diversity of a concept [17]. After creating clusters according to the degree of entropy, we unify all word vectors in the same cluster into a cluster vector and construct paragraph vectors based on the cluster vectors. The overall flow is shown in **Figure 1**.

**Figure 1.** Construction of paragraph vectors based on cluster vectors.

Hereafter, a "word" is a word in the dataset, a "term" is a term in a thesaurus, and terms are classified into hypernyms, hyponyms, and their synonyms. A "concept" is defined as a combination of a hypernym and one or more hyponyms one level below the hypernym, indicated as a red box in **Figure 2**. Given that a thesaurus consists of terms *T<sub>i</sub>*, we calculated the entropy of a concept *C* by considering the appearance frequencies of a hypernym *T*<sub>0</sub> and its hyponyms *T*<sub>1</sub> *…* *T<sub>n</sub>* as event probabilities. The frequencies of the synonyms *S<sub>i0</sub>…S<sub>im</sub>* of term *T<sub>i</sub>* were summarized to the corresponding concept (the synonyms *S<sub>ij</sub>* include the descriptors of the terms *T<sub>i</sub>* themselves).

$$H(C) = -\sum\_{i=0}^{n} \left( \sum\_{j=0}^{m} p(S\_{ij} \mid C) \cdot \log\_2 \sum\_{j=0}^{m} p(S\_{ij} \mid C) \right) \tag{3}$$

In Eq. (3), *p*(*S<sub>ij</sub>*|*C*) is the probability of a synonym *S<sub>ij</sub>* given a concept and terms *T<sub>i</sub>*. For each concept in the thesaurus, we calculated the entropy *H*(*C*) in the dataset. As the probabilities of events become equal, *H*(*C*) increases. If only particular events occur, *H*(*C*) is reduced because of low informativeness. Thus, the proposed entropy of a concept increases when a hypernym

Mapping Science Based on Research Content Similarity http://dx.doi.org/10.5772/intechopen.77067
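The concept entropy of Eq. (3) can be sketched as follows, assuming synonym frequencies have been counted from the dataset (the counts below are invented): each term's synonym frequencies are summed into a term probability, and *H*(*C*) is the entropy over terms.

```python
from math import log2

def concept_entropy(term_syn_counts):
    """Entropy H(C) of a concept, per Eq. (3).

    term_syn_counts: list of lists; term_syn_counts[i][j] is the
    dataset frequency of synonym S_ij of term T_i (T_0 = hypernym).
    """
    total = sum(sum(row) for row in term_syn_counts)
    h = 0.0
    for row in term_syn_counts:
        p_term = sum(row) / total  # inner sum: sum_j p(S_ij | C)
        if p_term > 0:
            h -= p_term * log2(p_term)
    return h

# Hypernym and two hyponyms appearing evenly -> high entropy
print(concept_entropy([[2, 2], [4], [4]]))  # log2(3), about 1.585
# Only the hypernym appears -> zero entropy (low informativeness)
print(concept_entropy([[12], [0], [0]]))    # 0.0
```

The even-frequency case maximizes *H*(*C*) at log<sub>2</sub>(*n* + 1), matching the observation that entropy increases as event probabilities become equal.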
