into the analysis using machine learning and clustering techniques for finding similar articles in different academic subjects as well as the relationships between projects from different agencies. Thus, we tried to convert the natural sentences in project descriptions and article abstracts to paragraph vectors in this study.

**3. Paragraph embedding using information entropy**

This section introduces our proposed paragraph embedding method using entropy and then evaluates whether the similarity of the resulting vectors accurately represents the content similarity of documents.

**3.1. Proposal of the paragraph embedding method**

Before introducing the proposed method, we present a problem in applying paragraph vectors to research project descriptions. We implemented the paragraph embedding technique using the Deep Learning Library for Java (https://deeplearning4j.org) and constructed paragraph vectors for approximately 30,000 NSF projects described in the next section. Although a more systematic procedure is desirable, the hyperparameters were set empirically this time as follows: 500 dimensions were used for the 66,830 words that appeared more than 5 times; the window size *c* was 10; and the learning rate and minimum learning rate were 0.025 and 0.0001, respectively, with an adaptive gradient algorithm. The learning model is a distributed memory model with hierarchical softmax.
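
As an illustration, the following is a minimal sketch of such a training configuration using the deeplearning4j `ParagraphVectors` builder. The input file and label prefix are placeholders, and the builder methods reflect the deeplearning4j 1.0.x API, which may differ across library versions.

```java
import org.deeplearning4j.models.embeddings.learning.impl.sequence.DM;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class ProjectVectors {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: one project description per line.
        SentenceIterator iter = new BasicLineIterator("nsf_projects.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .layerSize(500)                 // 500-dimensional vectors
                .minWordFrequency(5)            // drop rarely occurring words
                .windowSize(10)                 // context window c = 10
                .learningRate(0.025)            // initial learning rate
                .minLearningRate(0.0001)        // floor for learning-rate decay
                .sequenceLearningAlgorithm(new DM<VocabWord>()) // distributed memory model
                .useHierarchicSoftmax(true)     // hierarchical softmax output
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .labelsSource(new LabelsSource("PROJECT_")) // one label per document
                .build();

        vectors.fit(); // train the paragraph vectors
    }
}
```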

However, the result showed that the projects are scattered and not clustered by any subject or discipline in the vector space. Most projects are only weakly connected to a small number of other projects. Thus, it is difficult to grasp trends or to compare the result with an ordinary classification system. Closely observing the vector space reveals some of the reasons for this *unclustered* problem: words with nearly the same meaning have slightly different word vectors, and shared but unimportant words are treated as the commonality of paragraphs. In fact, Le and Mikolov reported a classification accuracy of less than 50% on a multi-category task [13].

Therefore, to address this problem, we introduce information entropy [14] for clustering word vectors before constructing paragraph vectors. The fact that synonyms tend to gather in a word vector space indicates that the semantics of a word spread spatially to a certain distance. This observation is also suggested in the related literature [15]. Therefore, to unify the word vectors of words with almost the same meaning, while excluding trivial common words, we generated clusters of word vectors based on the semantic diversity of each concept in a thesaurus.

We first extracted 19,685 hypernyms (broader terms) with one or more hyponyms (narrower terms) from the Japan Science and Technology Agency (JST) science and technology thesaurus [16]. The JST thesaurus primarily consists of keywords that have been frequently indexed in the 36 million articles accumulated by the JST since 1975. Currently, this thesaurus is updated every year and includes 276,179 terms with English and Japanese notations in 14 categories ranging from bioscience to computer science and civil engineering. Based on the World Wide Web Consortium (W3C) Simple Knowledge Organization System (SKOS), the JST thesaurus is also available in W3C Resource Description Framework (RDF, https://www.w3.org/RDF/) format with the semantic relationships skos:broader, skos:narrower, and skos:related. A broader or narrower relationship essentially represents an *is-a* subsumption relationship but sometimes denotes a *part-of* relationship in geography, body organ terminology, and other academic disciplines. The JST thesaurus is publicly accessible through Web APIs on the J-GLOBAL website (http://jglobal.jst.go.jp/en/), along with the visualization tool Thesaurus Map (http://thesaurus-map.jst.go.jp/jisho/fullIF/index.html).

We then calculate the information entropy of each concept in the JST thesaurus from the dataset. Shannon's entropy in information theory is an estimate of event informativeness; we use this entropy to measure the semantic diversity of a concept [17]. After creating clusters according to the degree of entropy, we unify all word vectors in the same cluster into a cluster vector and construct paragraph vectors based on the cluster vectors. The overall flow is shown in **Figure 1**.

**Figure 1.** Construction of paragraph vectors based on cluster vectors.

Hereafter, the "word" is a word in the dataset, the "term" is a term in a thesaurus, and terms are classified into hypernyms, hyponyms, and their synonyms. The "concept" is defined as a combination of a hypernym and one or more hyponyms one level below the hypernym indicated as a red box in **Figure 2**. Given that a thesaurus consists of terms *Ti* , we calculated the entropy of a concept *C* by considering the appearance frequencies of a hypernym *T*<sup>0</sup> and its hyponyms *T*<sup>1</sup> *… Tn* as an event probability. The frequencies of synonyms *Si0…Sim* of term *Ti* was summarized to a corresponding concept (synonyms *Sij* include descriptors of terms *Ti* themselves).

$$H(C) = -\sum_{i=0}^{n} \left( \sum_{j=0}^{m} p(S_{ij} \mid C) \cdot \log_2 \sum_{j=0}^{m} p(S_{ij} \mid C) \right) \tag{3}$$

In Eq. (3), *p*(*S*<sub>ij</sub>|*C*) is the probability of a synonym *S*<sub>ij</sub> given a concept *C* and its terms *T*<sub>i</sub>. For each concept in the thesaurus, we calculated the entropy *H*(*C*) over the dataset. As the probabilities of the events become more equal, *H*(*C*) increases; if only particular events occur, *H*(*C*) decreases because of low informativeness. Thus, the proposed entropy of a concept increases when the hypernym and hyponyms that constitute the concept each appear with a certain frequency in the dataset. Therefore, the degree of entropy indicates the semantic diversity of a concept.
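
As a concrete illustration of Eq. (3), the sketch below computes the entropy of a single concept from raw corpus frequencies. The `Concept` structure and the frequency counts are simplified placeholders; in our setting, the counts come from the tokenized dataset and the JST thesaurus.

```java
public class ConceptEntropy {

    /** A concept: a hypernym plus its direct hyponyms; each term T_i
     *  carries the corpus frequencies of its synonyms S_i0 ... S_im. */
    public static class Concept {
        // freq[i][j] = frequency of synonym S_ij of term T_i in the dataset
        long[][] freq;
    }

    /** Entropy H(C) per Eq. (3): synonym frequencies are summed per term,
     *  then normalized over the whole concept. */
    public static double entropy(Concept c) {
        long total = 0;
        for (long[] term : c.freq)
            for (long f : term) total += f;
        if (total == 0) return 0.0;

        double h = 0.0;
        for (long[] term : c.freq) {
            long termCount = 0;                     // sum_j freq(S_ij)
            for (long f : term) termCount += f;
            if (termCount == 0) continue;           // unseen term contributes nothing
            double p = (double) termCount / total;  // sum_j p(S_ij | C)
            h -= p * (Math.log(p) / Math.log(2));   // - p * log2 p
        }
        return h;
    }
}
```
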
Then, assuming that the degree of entropy and the spatial size of a concept in the word vector space are proportional to a certain extent, we split the word vector space into clusters. In fact, our preliminary experiment indicated that the entropy of a concept has a high correlation (*R* = 0.602) with the maximum Euclidean distance between the hyponyms of the concept in the vector space, at least while the entropy is rather high. Specifically, we refined clusters by repeatedly subdividing them until a defined criterion was satisfied. In our method, we set the determination condition as shown in Eq. (4).

$$Cl(w_k) = \begin{cases} Cl(w_i) & \left( \frac{H(C(w_i))}{H(C(w_j))} > \frac{\| w_k - w_i \|}{\| w_k - w_j \|} \right) \\ Cl(w_j) & \text{(otherwise)} \end{cases} \tag{4}$$

This condition states that the word vectors *w*<sub>0</sub> … *w*<sub>T</sub> are subdivided into two clusters proportionally to the ratio of the two highest concept entropies *H*(*C*(*w*<sub>i</sub>)) and *H*(*C*(*w*<sub>j</sub>)), which are selected from all entropies of the concepts in a cluster (the initial cluster is the whole vector space). *C*(*w*<sub>i</sub>) and *C*(*w*<sub>j</sub>) denote the concepts *C* to which the words *w*<sub>i</sub> and *w*<sub>j</sub> belong, respectively. The words *w*<sub>i</sub> and *w*<sub>j</sub> are words whose lemmatized forms are identical to terms or synonyms in the thesaurus; note, however, that entropies are not calculated in Eq. (3) for words that have no counterpart in the thesaurus. *Cl*(*w*) denotes the cluster to which the vector of a word *w* should be assigned.
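
The sketch below illustrates one way to implement this recursive bisection. The words *w*<sub>i</sub> and *w*<sub>j</sub> are chosen as the thesaurus words with the two highest concept entropies in the current cluster, and the stopping thresholds (entropy below 0.25 or fewer than 10 elements) anticipate the values given in the next paragraph; the data structures are simplified illustrations, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EntropyBisection {

    /** A word with its (e.g., 500-dimensional) vector. */
    public static class Word {
        String lemma;
        double[] vec;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    /** Recursively split a cluster per Eq. (4) until the highest concept
     *  entropy drops below 0.25 or fewer than 10 vectors remain. */
    public static void split(List<Word> cluster,
                             Map<String, Double> conceptEntropy, // H(C(w)) by lemma
                             List<List<Word>> result) {
        // Find the two thesaurus words with the highest concept entropies.
        Word wi = null, wj = null;
        for (Word w : cluster) {
            Double h = conceptEntropy.get(w.lemma);
            if (h == null) continue; // word not covered by the thesaurus
            if (wi == null || h > conceptEntropy.get(wi.lemma)) { wj = wi; wi = w; }
            else if (wj == null || h > conceptEntropy.get(wj.lemma)) { wj = w; }
        }
        // Stop: entropy too low, cluster too small, or no split candidates.
        if (wi == null || wj == null || cluster.size() < 10
                || conceptEntropy.get(wi.lemma) < 0.25) {
            result.add(cluster);
            return;
        }
        double ratio = conceptEntropy.get(wi.lemma) / conceptEntropy.get(wj.lemma);
        List<Word> a = new ArrayList<>(), b = new ArrayList<>();
        for (Word wk : cluster) {
            // Assign w_k to Cl(w_i) when the entropy ratio exceeds the
            // distance ratio (a zero distance to w_j would need a guard).
            if (ratio > dist(wk.vec, wi.vec) / dist(wk.vec, wj.vec)) a.add(wk);
            else b.add(wk);
        }
        if (a.isEmpty() || b.isEmpty()) { result.add(cluster); return; }
        split(a, conceptEntropy, result);
        split(b, conceptEntropy, result);
    }
}
```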

The vector space is subdivided until the entropy becomes lower than 0.25 (the top 1.5% of entropies) or the number of elements in a cluster is lower than 10. These parameters were also determined empirically through experiments. After generating 1260 clusters from the 66,830 word vectors, we took the centroid of all vectors in a cluster as the cluster vector. Then, we constructed paragraph vectors using the cluster vectors rather than the word vectors, as shown in Eq. (5), which is an extension of Eq. (2). In effect, each cluster vector represents the concept that has the highest entropy among all concepts included in the cluster.

$$L = \sum_{t=1}^{T} \log p(Cl(w_t) \mid Cl(w_{t-c}), \ldots, Cl(w_{t+c}), d_i) \tag{5}$$

**3.2. Evaluation of paragraph vectors**

Next, we evaluate the resulting vectors on the map constructed from the following dataset. In this article, the dataset includes the titles and abstracts of 266,772 IEEE conference articles published from 2012 to 2016 (2,290,743 sentences in total) and the titles and descriptions of 34,192 NSF projects from 2012 to 2016 (730,563 sentences in total). Note that IEEE journal, transaction, symposium, and workshop articles are not included, and the NSF project domains are limited to Computer and Information Science and Engineering, Mathematical and Physical Sciences, and Engineering, in accordance with the IEEE articles. All words in the sentences were tokenized and lemmatized with Stanford CoreNLP before creating the vector space.
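
A minimal sketch of this preprocessing step with Stanford CoreNLP is shown below; the pipeline configuration is standard, while the lowercasing step is our own assumption.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class Lemmatizer {
    public static List<String> lemmas(String text) {
        Properties props = new Properties();
        // Lemmatization requires tokenization, sentence splitting, and POS tagging.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        // In practice, build the pipeline once and reuse it across documents.
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument(text);
        pipeline.annotate(doc);

        List<String> out = new ArrayList<>();
        for (CoreLabel tok : doc.tokens()) {
            out.add(tok.lemma().toLowerCase()); // lowercasing is an assumption
        }
        return out;
    }
}
```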

In terms of the *unclustered* problem, we confirmed that the proposed method successfully formed several clusters compared with the original paragraph embedding method. For a quantitative comparison, **Figure 3** shows the relationship between the cosine similarities and the number of edges, and the relationship between the degree centrality and the number of nodes (i.e., projects) for cosine similarities of >0.35. As a result, we confirmed that edges with higher cosine similarity and nodes with higher degrees increase. This is because high-entropy concepts, which are significant in scientific and technological contexts, serve as the shared elements between paragraph vectors while scientifically unimportant words are excluded, so the paragraph vectors can form meaningful groups. At the same time, new or unknown synonyms and closely related words that are not defined in the thesaurus can be unified into a cluster vector if they fall within the same cluster. Taking the centroid vector as the representative vector of a cluster separates the cluster vectors from one another as much as possible, forming clear differences in the vector space.

**Figure 3.** Comparison between paragraph vectors and those with entropy clustering.
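
For reference, the sketch below shows how such a comparison can be assembled: pairwise cosine similarities above the 0.35 threshold from the text become edges of an undirected graph, and the degree of each node is tallied. The array layout and method names are a simplified illustration rather than the actual implementation.

```java
public class SimilarityGraph {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            na += a[d] * a[d];
            nb += b[d] * b[d];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Count, for every project, how many neighbors exceed the threshold. */
    public static int[] degrees(double[][] vectors, double threshold) {
        int n = vectors.length;
        int[] degree = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (cosine(vectors[i], vectors[j]) > threshold) {
                    degree[i]++;   // an undirected edge contributes to both ends
                    degree[j]++;
                }
            }
        }
        return degree;
    }
}
```

For a plot like Figure 3, `degrees(vectors, 0.35)` would then be tallied into a histogram of node degrees, and the edge counts binned by cosine similarity.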

In terms of the accuracy of the content similarities, evaluation is difficult since, to the best of our knowledge, there is no gold standard for evaluating the similarity of scientific and technological documents. Therefore, we first evaluated the degree of the similarities based on a sampling method. We randomly extracted 100 pairs of projects with a cosine similarity of >0.5 (similarities of less than 0.5 are not considered in the map layout) so that the distribution of the sample resembles the entire distribution. Each pair has two project titles and descriptions, and a cosine value classified into three levels: weak (0.5 ≤ cos. < 0.67), middle (0.67 ≤ cos. < 0.84), and strong (0.84 ≤ cos.). Some examples of project pairs and their cosine values are shown in **Table 1**. Then, three members of our organization, a funding agency in Japan, evaluated the similarity of each pair. The members were given prior explanations of the intended use of the map and some examples of evaluation. The members received the same data, and their backgrounds are bioscience, psychology, and computer science. As a result, we confirmed that 78% of the similarities matched the majority vote of the members' opinions. Misjudged examples include pairs rated as not related consisting of two projects that share the same acronym with different meanings, and pairs rated as stronger consisting of two projects that have only a few common words, where those words, however, denote recent technologies attracting attention. We expect that such words will eventually have higher entropies, so that the project similarities will be estimated to be stronger. We also plan to replace acronyms in project descriptions with their full forms before constructing vectors. By contrast, the accuracy of the similarities of the original paragraph embedding method was 21%. The evaluation results were determined to be in "fair" agreement (Fleiss' kappa *κ* = 0.29) (**Table 2**).

**Table 1.** Example of sampled projects/articles.

**Table 2.** Evaluation of similarity based on sampling (%).
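
For completeness, the following compact sketch computes Fleiss' kappa from a ratings matrix in which each row is a project pair and each column counts how many of the three raters chose that similarity level; the matrix layout is our illustrative assumption.

```java
public class FleissKappa {
    /**
     * ratings[i][c] = number of raters who assigned item i to category c.
     * Every row must sum to the same number of raters (here, 3).
     */
    public static double kappa(int[][] ratings) {
        int items = ratings.length;
        int cats = ratings[0].length;
        int raters = 0;
        for (int c = 0; c < cats; c++) raters += ratings[0][c];

        // p[c]: overall proportion of assignments to category c.
        double[] p = new double[cats];
        for (int[] row : ratings)
            for (int c = 0; c < cats; c++) p[c] += row[c];
        for (int c = 0; c < cats; c++) p[c] /= (double) items * raters;

        // pBar: mean per-item agreement.
        double pBar = 0;
        for (int[] row : ratings) {
            double pi = 0;
            for (int c = 0; c < cats; c++) pi += row[c] * (row[c] - 1);
            pBar += pi / ((double) raters * (raters - 1));
        }
        pBar /= items;

        // pe: expected agreement by chance.
        double pe = 0;
        for (int c = 0; c < cats; c++) pe += p[c] * p[c];

        return (pBar - pe) / (1 - pe);
    }
}
```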

Moreover, we evaluated the accuracy of the content similarities using artificial data in which part of a document is randomly replaced with material from other projects/articles. We replaced 10, 20, …, 100% of a project description or an article abstract with sentences randomly selected from the others. Then, we measured the cosine similarity between the vector generated from the artificial project/article and the vector of the original project/article. The projects/articles were randomly selected from all projects/articles, and we evaluated 1000 pairs of an original project/article and its artificial counterpart. The relationship between the replacement ratios and the cosine similarities is shown in **Figure 4**. As a result, we confirmed that there is a clear correlation between the content similarities of projects/articles and their cosine similarities, with *R*<sup>2</sup> = 0.89. The paragraph vectors without the entropy clustering showed the same trend, but the vectors with the entropy clustering had higher similarities on average. This result matches the relationship between the cosine similarities and the number of edges shown in **Figure 3**.

**Figure 4.** Cosine similarities of artificial data with partial replacement.
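
The sketch below outlines this perturbation test. It assumes a trained deeplearning4j `ParagraphVectors` model (as in the earlier sketch) whose `inferVector` method embeds unseen text, and ND4J's cosine similarity helper; the sentence lists are placeholders.

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ReplacementTest {

    /** Replace the given fraction of sentences with random foreign ones. */
    static List<String> perturb(List<String> sentences, List<String> pool,
                                double ratio, Random rnd) {
        List<String> out = new ArrayList<>(sentences);
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < out.size(); i++) idx.add(i);
        Collections.shuffle(idx, rnd);
        int k = (int) Math.round(ratio * out.size());
        for (int i = 0; i < k; i++) {
            out.set(idx.get(i), pool.get(rnd.nextInt(pool.size())));
        }
        return out;
    }

    /** Cosine similarity between the original and artificial documents. */
    static double similarity(ParagraphVectors model,
                             List<String> original, List<String> artificial) {
        INDArray a = model.inferVector(String.join(" ", original));
        INDArray b = model.inferVector(String.join(" ", artificial));
        return Transforms.cosineSim(a, b);
    }
}
```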

**4. Mapping Science**

This section describes our content-based map of science, Mapping Science [18, 19]. After introducing its interface, we describe our method for clustering and laying out the articles and projects on the map and the analytical functions provided.

**4.1. Interfaces**

**Figure 5** shows the three main views of Mapping Science: a portfolio view, a clustered view, and analytic views.

In the portfolio view, five research areas (Information, Mathematics and Physics, Communication, Electronics and Mechatronics, and Power and Energy), into which the entire dataset has been divided by full-text search with predefined queries, are shown. The size of the circles corresponds to the number of articles and projects in each area.
