2. Semantic similarity representation scheme

By far the most widely used text representation scheme in natural language processing is the vector space model (VSM), in which a text or document is represented as a point in a high-dimensional input space whose dimensions each correspond to a unique word [19]. That is, a document d<sub>j</sub> is represented as a vector x<sub>j</sub> = (w<sub>1j</sub>, w<sub>2j</sub>, w<sub>3j</sub>, …), where w<sub>ij</sub> is a weight that reflects the importance of word<sub>i</sub> in document d<sub>j</sub> and depends on how frequently word<sub>i</sub> occurs in d<sub>j</sub>. The semantic similarity between two documents is then measured on their corresponding vectors, most commonly as the cosine of the angle between them.
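As a concrete illustration of this scheme, the short Python sketch below represents two toy documents as raw term-frequency vectors over a shared vocabulary and compares them with the cosine measure. Raw term frequency is only one common weighting choice, and the documents and helper names here are illustrative rather than taken from the chapter.

```python
import math
from collections import Counter

def vsm_vector(doc_tokens, vocabulary):
    """One dimension per unique word; weights here are raw term frequencies."""
    counts = Counter(doc_tokens)
    return [counts[word] for word in vocabulary]

def cosine_similarity(v1, v2):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

d1 = "the cat sat on the mat".split()
d2 = "the cat lay on the mat".split()
vocab = sorted(set(d1) | set(d2))  # one dimension per unique word
print(round(cosine_similarity(vsm_vector(d1, vocab), vsm_vector(d2, vocab)), 3))  # 0.875
```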



The VSM has been effective in information retrieval (IR) because it can exploit much of the semantic information expressed in larger textual collections: long documents tend to share many words with one another and are therefore judged similar by well-known vector space similarity measures such as the cosine. At the sentence level (text fragments), however, this is not the case, since two sentences may carry the same meaning (i.e., be semantically similar) while containing no words in common. For instance, consider the sentences "Some places in the country are now in torrent crisis" and "The current flood disaster affects the particular states." These two sentences have essentially the same meaning, yet the only word they share is *the*, which carries no semantic information (i.e., a stop word). Word co-occurrence may be rare or even absent in sentences because the flexibility of natural language allows humans to express the same meaning with sentences that differ greatly in structure and length [50]. Therefore, we need a sentence-level text representation scheme that captures as much of the semantic information in sentences as possible, enabling a more effective similarity method to be used.
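This failure of purely lexical matching can be checked directly on the example pair above; the snippet below (with a small, purely illustrative stop-word list) confirms that the two sentences share only *the* and no content words at all.

```python
s1 = "Some places in the country are now in torrent crisis".lower().split()
s2 = "The current flood disaster affects the particular states".lower().split()

stop_words = {"the", "in", "are", "now", "some"}  # illustrative stop list

print(set(s1) & set(s2))                                 # {'the'}: a stop word only
print((set(s1) - stop_words) & (set(s2) - stop_words))   # set(): no content-word overlap
```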

2.1. Measuring sentence-level text similarity

To calculate the semantic similarity between two sentences, we use a sentence similarity method based on the sets of synonym expansions of the words appearing in the compared sentences [10]. To show how this measure works, suppose that Sentence<sub>1</sub> and Sentence<sub>2</sub> are the two sentences being compared, W<sub>1</sub> and W<sub>2</sub> are the sets of sense-assigned words appearing in Sentence<sub>1</sub> and Sentence<sub>2</sub>, respectively, sentence<sub>1</sub> and sentence<sub>2</sub> are the sets of synonym expansions of the words in W<sub>1</sub> and W<sub>2</sub>, and U = W<sub>1</sub> ∪ W<sub>2</sub>. Two semantic vectors, v<sub>1</sub> and v<sub>2</sub>, are then created from sentence<sub>1</sub> and sentence<sub>2</sub>.

Let word<sub>j</sub> be the corresponding sense-assigned word from U and v<sub>ij</sub> be the j-th element of v<sub>i</sub>. There are then two instances to take into account, depending on whether word<sub>j</sub> appears in sentence<sub>i</sub> or not:

Instance 1: If word<sub>j</sub> exists in sentence<sub>i</sub>, then v<sub>ij</sub> is set to 1, reflecting the maximal semantic similarity of identical words in WordNet.

Instance 2: If word<sub>j</sub> does not exist in sentence<sub>i</sub>, then the semantic similarity between word<sub>j</sub> and each word in sentence<sub>i</sub> is computed using a WordNet-based word-to-word similarity measure (e.g., the Jiang and Conrath (J&C) measure) [51], and v<sub>ij</sub> is set to the highest of these scores. Both instances are sketched in code below.
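The following sketch implements the two instances using NLTK's WordNet interface. It simplifies in ways the source does not prescribe: proper word-sense assignment is replaced by taking each word's first noun synset, the J&C scores (which are unbounded above) are capped at 1, and the Brown information-content file shipped with NLTK supplies the statistics the J&C measure needs.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC statistics required by J&C

def word_similarity(w1, w2):
    """Instance 1: identical words score 1. Instance 2: J&C similarity
    between the first noun senses, capped at 1; 0 if it cannot be computed."""
    if w1 == w2:
        return 1.0
    syn1, syn2 = wn.synsets(w1, pos=wn.NOUN), wn.synsets(w2, pos=wn.NOUN)
    if not syn1 or not syn2:
        return 0.0
    try:
        return min(syn1[0].jcn_similarity(syn2[0], brown_ic), 1.0)
    except Exception:  # e.g., senses in disjoint hierarchies
        return 0.0

def semantic_vector(sentence_words, joint_words):
    """v_ij = 1 if word_j occurs in the sentence; otherwise the highest
    word-to-word score between word_j and the words of the sentence."""
    return [1.0 if w in sentence_words
            else max((word_similarity(w, u) for u in sentence_words), default=0.0)
            for w in joint_words]
```

With v<sub>1</sub> and v<sub>2</sub> built this way over U, the cosine of Eq. (1) below completes the measure.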

Once the vectors v<sub>1</sub> and v<sub>2</sub> have been constructed, the semantic similarity between the two sentences is determined as the cosine between them:

$$\text{Similarity}(\text{Sentence}_1, \text{Sentence}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lvert \mathbf{v}_1 \rvert \, \lvert \mathbf{v}_2 \rvert} \tag{1}$$

2.2. Sentence-level clustering algorithm

In this section, we first describe the proposed variant of the original k-means clustering algorithm, which we call the centroid-based lexical clustering (CBLC) algorithm. We then describe how a cluster centroid is constructed and defined. The remaining subsections discuss how the semantic similarity between sentences and cluster centroids is calculated, along with related technical issues such as the empirical settings and the space and time complexity.

2.3. Centroid-based lexical clustering

Algorithm 1. Centroid-Based Lexical Clustering (CBLC).

Input: Sentences to be clustered S = {S<sub>i</sub> | i = 1 to TN}; number of classes k

Output: Membership values Π = {π<sub>i</sub><sup>j</sup> | i = 1..TN, j = 1..k}, where π<sub>i</sub><sup>j</sup> is the membership value of sentence i to cluster j

1. // Randomly distribute the sentences into k classes
2. for i = 1 to TN
3. if i ≤ k
4. j += 1
5. π<sub>i</sub><sup>j</sup> = S<sub>i</sub> // Sentence<sub>i</sub>
6. else
7. j = 1
8. π<sub>i</sub><sup>j</sup> = S<sub>i</sub> // Sentence<sub>i</sub>
9. end
10. repeat until there is no move (until convergence)
11. // Define or determine the centroid for each class (cluster)
12. for j = 1 to k
13. M<sup>j</sup> = union-set {all possible synonyms occurring in cluster j} // the U set
14. end
15. // Compute the similarity between each sentence S<sub>i</sub> and each cluster centroid
16. for j = 1 to k
17. similarity(M<sup>j</sup>, S<sub>m</sub>) // S<sub>m</sub> are the sentences related to cluster j, m = 1..n
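For concreteness, here is a compact, runnable sketch of Algorithm 1. The similarity() placeholder uses plain Jaccard overlap so the sketch is self-contained; the chapter's actual measure is the WordNet-based one of Section 2.1. The move step (reassigning each sentence to its most similar centroid until nothing changes) is implied by line 10 of the pseudocode rather than spelled out, and the iteration cap is our own safeguard against oscillation.

```python
def similarity(centroid, sentence_words):
    """Placeholder for the semantic measure of Eq. (1): Jaccard overlap."""
    union = centroid | sentence_words
    return len(centroid & sentence_words) / len(union) if union else 0.0

def cblc(sentences, k, max_iters=100):
    tokens = [set(s.lower().split()) for s in sentences]
    # Lines 1-9: seed the first k sentences one per class; the remainder
    # start in class 0, as in the pseudocode's else branch.
    assign = [i if i < k else 0 for i in range(len(sentences))]
    for _ in range(max_iters):
        # Lines 11-14: centroid M^j = the union set of all words (standing in
        # for the synonym expansions) occurring in cluster j.
        centroids = [set().union(*[t for t, c in zip(tokens, assign) if c == j])
                     for j in range(k)]
        # Lines 15-17 plus the implied move step: reassign each sentence to
        # its most similar centroid; stop when no sentence moves (line 10).
        new_assign = [max(range(k), key=lambda j: similarity(centroids[j], t))
                      for t in tokens]
        if new_assign == assign:
            break
        assign = new_assign
    return assign  # hard memberships: pi_i^j = 1 iff assign[i] == j

if __name__ == "__main__":
    docs = ["heavy flood hits the coastal states",
            "the match ended in a draw",
            "the flood disaster affects several states"]
    print(cblc(docs, k=2))  # -> [0, 1, 0]: the two flood sentences cluster together
```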
