
**Chapter 11**

**Mapping Science Based on Research Content Similarity**

Takahiro Kawamura, Katsutaro Watanabe, Naoya Matsumoto and Shusaku Egami

Additional information is available at the end of the chapter

DOI: 10.5772/intechopen.77067

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **Abstract**

Maps of science, which represent the structure of science, help us understand science and technology development. Research in scientometrics has therefore developed techniques for analyzing research activities and for measuring their relationships; however, navigating the recent scientific landscape is still challenging, since conventional inter-citation and co-citation analyses are difficult to apply to recently published articles and ongoing projects. Therefore, to characterize what is being attempted in the current scientific landscape, this chapter proposes a content-based method of locating research articles/projects in a multi-dimensional space using word/paragraph embedding. Specifically, to address an *unclustered* problem, we introduce cluster vectors based on the information entropies of technical concepts. The experimental results showed that our method formed a clustered map from approximately 300,000 IEEE articles and NSF projects from 2012 to 2016. Finally, we confirmed that the formation of specific research areas can be captured as changes in the network structure.

**Keywords:** map of science, content-based, paragraph vector, information entropy, clustering

#### **1. Introduction**

In 1965, Price [1] proposed studying science using scientific methods. Since then, research in scientometrics has developed techniques for analyzing research activities and for measuring their relationships, and has constructed maps of science, one of the major topics in scientometrics, which provide a bird's-eye view of the scientific landscape. Maps of science have been useful tools for understanding the structure of science, the spread of disciplines, and their interconnections. By knowing such information, science and technology enterprises can anticipate changes, especially those initiated in their immediate vicinity. Research laboratories and universities that are organized according to the established standards of disciplinary departments can understand an organization's environment. Furthermore, such maps are important to policy analysts and funding agencies. Since research funding should be based on quantitative and qualitative scientific metrics, they usually perform several analyses on the map with statistical analysis and careful examination by human experts. However, conventional approaches to understanding research activities focus on what authors told us about past accomplishments through inter-citation and co-citation analysis of published research articles. Thus, ongoing projects and recently published articles that do not yet have enough citations have not been analyzed.

articles with Association for Computing Machinery classification (https://www.acm.org/publications/class-2012) with Springer Nature classification requires taxonomy exchanges.

Therefore, several content-based methods have been proposed in the related literature. Previous studies have examined automatic topic classification using probabilistic latent semantic analysis (pLSA) [4] and latent Dirichlet allocation (LDA) [5]. One study uses LDA to find the five most probable words for a topic, and each document is viewed as a mixture of topics [6]. This approach can classify documents across different agencies and publishers; however, the similarity between projects/articles cannot be computed directly. In this regard, the National Institutes of Health (NIH) Visual Browser [7, 8] (http://nihmaps.org/index.php) computed the similarities between projects as mixtures of classification probabilities over topics based on pLSA, using the average symmetric Kullback-Leibler divergence function [9]. However, this similarity is a combination of probabilities; that is, it is not derived from sentence context. Other studies are likewise based on the similarity between the sets of words (bag-of-words) included in documents, as in term frequency-inverse document frequency (TF-IDF), without considering the sentence context.
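The average symmetric Kullback-Leibler divergence mentioned above can be illustrated with a short sketch. The topic mixtures below are made up for illustration; only the divergence formula itself follows the formulation cited in the text [9]:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average symmetric KL divergence, usable as a distance
    between two documents' topic mixtures."""
    return 0.5 * (kl(p, q) + kl(q, p))

# hypothetical pLSA topic mixtures for two documents over three topics
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(symmetric_kl(doc_a, doc_a))  # 0.0 — identical mixtures
print(symmetric_kl(doc_a, doc_b))  # larger for dissimilar mixtures
```

Symmetrizing matters because KL divergence alone is not a distance: D(p‖q) ≠ D(q‖p) in general.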

By contrast, a word/paragraph vector, which is a distributed representation of words and paragraphs, is attracting attention in NLP. Assuming that context determines the meaning of a word [10], words appearing in similar contexts are considered to have similar meanings. In the basic form, a word vector is represented as a matrix whose elements are the co-occurrence frequencies between a word *w* with a certain usage frequency in the corpus and the words within a fixed window size *c* from *w*. A popular representation of word vectors is word2vec [11, 12]. Word2vec creates word vectors using a two-layered neural network obtained by a skip-gram model with negative sampling. Specifically, word vectors are obtained by calculating the maximum likelihood of the objective function *L* in Eq. (1), where *T* is the number of words with a certain usage frequency in the corpus. Word2vec clusters words with similar meanings in a vector space.

$$L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p\left(w_{t+j} \mid w_t\right) \tag{1}$$

In addition, Le and Mikolov [13] proposed a paragraph vector that learns fixed-length feature representations using a two-layered neural network from variable-length pieces of text such as sentences, paragraphs, and documents. A paragraph vector is considered another word in a paragraph and is shared across all contexts generated from the same paragraph, but not across paragraphs. The contexts are fixed length and sampled from a sliding window over the paragraph. The paragraph vectors are computed by fixing the word vectors and training the new paragraph vector until convergence, as shown in Eq. (2).

$$L = \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}, \ldots, w_{t+c}, d_i\right) \tag{2}$$

where $d_i$ is a vector for a paragraph $i$ that includes $w_t$. Whereas word vectors are shared across paragraphs, paragraph vectors are unique among paragraphs and represent the topics of the paragraphs. By considering word order, paragraph vectors also address the weaknesses of bag-of-words models in LDA and pLSA. Therefore, paragraph vectors are considered more accurate representations of the context of the content. We can then input the resulting vectors
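The count-based "basic form" of a word vector described before Eq. (1) — co-occurrence frequencies within a window of size *c* — can be sketched in plain Python. The toy corpus and window size are my own illustrative choices, not data from the chapter:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(tokens, c=2):
    """Count-based word vectors: row w holds how often each vocabulary
    word occurs within a window of size c around occurrences of w."""
    vocab = sorted(set(tokens))
    counts = {w: Counter() for w in vocab}
    for t, w in enumerate(tokens):
        for j in range(max(0, t - c), min(len(tokens), t + c + 1)):
            if j != t:
                counts[w][tokens[j]] += 1
    # densify each row into a fixed-order list over the vocabulary
    return {w: [counts[w][x] for x in vocab] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

tokens = ("deep networks learn features . "
          "neural networks learn representations . "
          "funding supports research projects .").split()
vecs = cooccurrence_vectors(tokens, c=2)
# words used in similar contexts end up with similar vectors
print(cosine(vecs["features"], vecs["representations"]))  # 0.75
print(cosine(vecs["features"], vecs["funding"]))          # 0.25
```

Word2vec's learned vectors play the same role as these count rows but are dense, low-dimensional, and trained rather than counted.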

Therefore, we propose to analyze them with a content-based method based on natural language processing (NLP) techniques. Recently, word/paragraph embedding has been proposed for finding relationships between unstructured descriptions. Such embedding techniques represent words and paragraphs as real-valued vectors of several hundred dimensions, and the distances between descriptions are calculated from the similarities between their vectors. Thus, we constructed a new mapping tool that represents recent scientific trends, in which nodes represent research projects or articles linked by certain distances of content similarity. Moreover, we drew a map from approximately 300,000 IEEE articles and National Science Foundation (NSF) projects, and then, from its chronological changes, we obtained some findings regarding the formation processes of research areas.
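How such a map might be assembled from embedding vectors can be sketched as follows; the document IDs, three-dimensional vectors, and threshold are hypothetical stand-ins for the several-hundred-dimensional paragraph vectors described above:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_map(doc_vecs, threshold=0.8):
    """Edges link documents whose embedding similarity exceeds the
    threshold; nodes are article/project IDs (hypothetical here)."""
    ids = sorted(doc_vecs)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            s = cosine(doc_vecs[a], doc_vecs[b])
            if s >= threshold:
                edges.append((a, b, round(s, 3)))
    return edges

# toy "paragraph vectors" for two articles and one project
docs = {
    "ieee:0001": [0.9, 0.1, 0.0],
    "ieee:0002": [0.0, 0.1, 0.9],
    "nsf:0001":  [0.8, 0.2, 0.1],
}
print(build_map(docs, threshold=0.8))  # links ieee:0001 and nsf:0001 only
```

At the scale of 300,000 documents, the pairwise loop would in practice be replaced by an approximate nearest-neighbor search.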

The remainder of this chapter is organized as follows. Section 2 discusses related work, and Section 3 describes our proposed method for calculating content similarity, together with its evaluations. Then, Section 4 introduces our tool, Mapping Science, and Section 5 confirms on the map the formation process of research areas such as the Internet of Things. Finally, conclusions and suggestions for future work are provided in Section 6.
