#### 2.1. Measuring sentence-level text similarity

In the standard vector space model (VSM), a text such as a document is represented as a vector (e.g., of tf-idf scores). The similarity between two texts is measured using the corresponding vectors, and a usually applied measure is the cosine of the angle between the two vectors.

The VSM has been effective in information retrieval (IR) tasks because it is able to exploit much of the semantic information expressed in larger-sized textual units. This is because large textual collections or documents are likely to share many words with each other and can thus be judged similar by well-known vector space similarity measures such as the cosine measure. In the case of sentence-level text (text fragments), however, this does not hold: two sentences may carry the same meaning (i.e., be semantically similar) while sharing no words at all. For instance, consider the sentences "Some places in the country are now in torrent crisis" and "The current flood disaster affects the particular states." Obviously, these two sentences have the same meaning, yet the only word they have in common is "the", which carries no semantic information (i.e., it is a stop word). The reason why word co-occurrence may be rare or even absent in sentences is the flexibility of natural language, which allows humans to express the same meaning with sentences that differ greatly in structure and length [50]. Therefore, we need a sentence-level text representation scheme that is better able to capture all the possible semantic information of sentences, thus enabling a more effective similarity method to be used.

To calculate the semantic similarity between two sentences, we use a sentence similarity method based on the sets of synonym expansions of the words appearing in the compared sentences [10]. To demonstrate how this measure works, suppose that Sentence1 and Sentence2 are the two sentences being compared, W1 and W2 are the sets of sense-assigned words appearing in Sentence1 and Sentence2, respectively, sentence1 and sentence2 are the sets of synonym expansions of the words in W1 and W2, and U = W1 ∪ W2. Then, a semantic vector vi is constructed over U for each sentencei. Let wordj be the corresponding sense-assigned word from U and vij be the jth element of vi. In this case, there are two instances to take into account, depending on whether wordj appears in sentencei or not:

Instance 1: If wordj exists in sentencei, then set vij equal to 1; this is based on the semantic similarity of the same words in WordNet.

Instance 2: If wordj does not exist in sentencei, then compute the semantic similarity between wordj and each word of sentencei by using one of the WordNet-based word-to-word similarity measures (i.e., the J&C measure) [51]. The final similarity score assigned to vij is the highest of these scores between wordj and the words of sentencei.

Once the vectors v1 and v2 have been created according to sentence1 and sentence2, the semantic similarity between the two sentences can be determined using the cosine similarity measure between the two constructed vectors as

$$\mathrm{Similarity}(\mathit{Sentence}_1, \mathit{Sentence}_2) = \frac{v_1 \cdot v_2}{\lvert v_1 \rvert \, \lvert v_2 \rvert} \tag{1}$$
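To make the construction concrete, the following is a minimal Python sketch of this measure under stated assumptions: it uses NLTK's WordNet (run `nltk.download('wordnet')` once), it skips word-sense disambiguation (all senses are considered), and WordNet path similarity stands in for the J&C measure. The names `word_sim`, `semantic_vector`, and `sentence_similarity` are illustrative, not the authors' implementation.

```python
# Minimal sketch of the synonym-expansion similarity measure (assumptions:
# NLTK WordNet, no sense disambiguation, path similarity in place of J&C).
import math
from nltk.corpus import wordnet as wn

def word_sim(w1: str, w2: str) -> float:
    """WordNet-based word-to-word similarity (highest over all sense pairs)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return max((a.path_similarity(b) or 0.0) for a in s1 for b in s2)

def semantic_vector(words: set, union: list) -> list:
    # Instance 1: v_ij = 1 if word_j occurs in sentence_i.
    # Instance 2: otherwise the highest word-to-word score against sentence_i.
    return [1.0 if u in words else max((word_sim(u, w) for w in words), default=0.0)
            for u in union]

def sentence_similarity(words1: set, words2: set) -> float:
    union = sorted(words1 | words2)                # U = W1 ∪ W2
    v1 = semantic_vector(words1, union)
    v2 = semantic_vector(words2, union)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    return dot / norm if norm else 0.0             # Eq. (1)
```

For the flood example above, the synonym expansions would let "flood" and "torrent" contribute nonzero vector entries, so the cosine would be nonzero even though the sentences share no content words.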

#### 2.2. Sentence-level clustering algorithm

In this section, we first describe the proposed new version of the k-means clustering algorithm, which we call the centroid-based lexical-clustering (CBLC) algorithm. We then describe how a cluster centroid can be constructed and defined. The remaining subsections discuss the calculation of the semantic similarity between sentences and cluster centroids, as well as related technical issues such as empirical settings and space and time complexity.

#### 2.3. Centroid-based lexical clustering

Given k sets (i.e., clusters), first partition all the data points (i.e., sentences) randomly among the k sets (i.e., initialization), each with a centroid (mean) that serves as the representative of its cluster. An iterative process then rearranges these cluster centroids: each sentence is moved to the cluster whose centroid it is closest to (i.e., most semantically similar to), and the cluster centroids are redetermined from the sentences newly assigned to them. This iteration is repeated until the centroids no longer move (until convergence). The new proposed version of the original k-means clustering algorithm is as follows.

Algorithm 1. Centroid-Based Lexical Clustering (CBLC).

Input: Sentences to be clustered S = {Si | i = 1 … TN}; number of clusters k.

Output: Membership values {πij | i = 1 … TN, j = 1 … k}, where πij is the membership value of sentence i to cluster j.

10. repeat until there is no move (until convergence)
…
18. end
19. // Re-locate each sentence to the cluster centroid to which it is most similar.
20. re-locate(Si, Mj)
21. End
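Since Algorithm 1 is only partially reproduced above, the following Python sketch is an interpretation of the loop just described, not the authors' exact pseudocode. It reuses `sentence_similarity` from the sketch in Section 2.1 and represents each sentence by its synonym set.

```python
# Interpretive sketch of the CBLC loop (assumption: each sentence is given
# as a synonym set, and sentence_similarity is the Section 2.1 sketch).
import random

def cblc(sentences, k, max_iter=100):
    # Initialization: randomly partition the sentences into k clusters.
    assign = [random.randrange(k) for _ in sentences]
    for _ in range(max_iter):                      # repeat until convergence
        # Determine each centroid Mj as the union of its members' synonym sets.
        centroids = [set().union(*[s for s, a in zip(sentences, assign) if a == j])
                     for j in range(k)]
        moved = False
        for i, s in enumerate(sentences):
            # Re-locate the sentence to the most similar cluster centroid.
            best = max(range(k), key=lambda j: sentence_similarity(s, centroids[j]))
            if best != assign[i]:
                assign[i], moved = best, True
        if not moved:                              # no move: centroids are stable
            break
    return assign
```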


#### 2.4. Determining a clustering centroid

In the standard vector space model, a text such as a document is represented as a vector (i.e., its elements are tf-idf scores), and a cluster centroid can be determined by taking the vector average over all text fragments belonging to that cluster. This is very difficult with the text representation scheme discussed above, since the semantic vector of a sentence is not unique but depends on the sentence it is being compared with (through the union set U). However, just as a context may be constructed from two sentences, it is straightforward to apply this notion to defining the context over a collection of sentences. Since a cluster is just such a collection of text fragments, we can define the centroid of a cluster as the union set of all associated synonyms of the disambiguated words occurring in the sentences belonging to that cluster. Thus, if Sentence1, Sentence2, …, SentenceN are the sentences belonging to some cluster, the centroid of the cluster, which we denote as Mj, is just the union set {word1, word2, …, wordn}, where n is the number of distinct synonym words (sentencei) in Sentence1 ∪ Sentence2 ∪ … ∪ SentenceN. Figure 1 exemplifies the idea of determining a clustering centroid.

Figure 1. Clustering centroid, where sentencei (si) is a set of synonym words corresponding to Sentencei (Si).
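A one-function sketch of this definition (illustrative naming; sentences are again represented as synonym sets):

```python
def centroid(cluster_synonym_sets):
    """Mj = sentence1 ∪ sentence2 ∪ … ∪ sentenceN over the synonym sets
    of the sentences in the cluster (see Figure 1)."""
    Mj = set()
    for synonym_set in cluster_synonym_sets:
        Mj |= synonym_set
    return Mj
```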

#### 2.5. Calculating similarity between sentence and cluster centroid


When the CBLC algorithm calculates the semantic similarity between a sentence and a cluster, there are two cases to take into account: either the sentence belongs to the cluster or it does not. The second case is straightforward to implement: since cluster centroids are represented in the same way as sentences (as a union set of synonyms), the similarity between a sentence and a cluster centroid can be calculated with the sentence similarity measure described earlier, exactly as between two sentences. There is, however, a subtlety in the first case, which is not immediately apparent.

To demonstrate how this semantic similarity is calculated, assume that Sentence1 = {word1, word2, word3} and Sentence2 = {word4, word5} are not semantically similar. Comparing these sentences (S1 and S2), we obtain the semantic vectors v1 = {1,1,1,0,0} and v2 = {0,0,0,1,1}, which have a cosine value of zero, consistent with the fact that there is no semantic relation between them. Now suppose, however, that Sentence1 (S1) and Sentence2 (S2) are in the same cluster. If we create the cluster union set as described earlier (i.e., by taking the union of all synonym words appearing in all sentences of that cluster), we obtain Mj = {word1, word2, word3, word4, word5}. If we now calculate the semantic similarity between Mj and S1 using the cosine measure, we obtain the vectors vj = {1,1,1,1,1} and v1 = {1,1,1,0,0}, which have a similarity score of 0.77. The issue is clear: since S1 and S2 are not similar, their centroid does not carry any useful meaning, and we would not expect a similarity value this high. It arises because all of the words of S1 already exist in the cluster centroid Mj. We can solve this problem by defining the centroid using all sentences in the cluster except the sentence with which the cluster centroid is currently being compared. Therefore, assuming that we have a cluster containing sentences Sentence1 … SentenceN and we want the similarity between this cluster and a sentence SG appearing in it, we determine the cluster centroid using only the words appearing in Sentence1 ∪ Sentence2 ∪ … ∪ SentenceG−1 ∪ SentenceG+1 ∪ … ∪ SentenceN; that is, we omit SG in calculating the cluster centroid.
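A small sketch of this leave-one-out centroid, building on the earlier sketches (the helper name is hypothetical):

```python
def centroid_excluding(sentences, assign, j, g):
    """Leave-one-out centroid of cluster j: the union of the synonym sets
    of all member sentences except sentence g (S_G in the text)."""
    return set().union(*[s for i, (s, a) in enumerate(zip(sentences, assign))
                         if a == j and i != g])
```

With this variant, and assuming word1 … word5 are pairwise unrelated as in the example, the similarity of S1 is computed against Mj = {word4, word5} instead, giving a cosine of zero, as desired.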

#### 2.6. Space and time complexity of CBLC algorithm

It has been found that the proposed algorithm is no more expensive than the basic k-means [52] and spectral-clustering [18, 37] algorithms in terms of space complexity (i.e., the three algorithms require the storage of the same similarity scores). The time (i.e., computation) complexity of the new version of standard k-means, however, far exceeds that of the basic k-means and spectral-clustering algorithms. Most of the computational cost arises in the stage of calculating the similarity between each sentence and the corresponding centroid; this is due to the text representation used by the sentence similarity measure applied within this clustering algorithm. To illustrate this complexity, suppose that the operation time unit for calculating the semantic similarity between a sentence and a cluster centroid is SentSim, the operation time unit for recalculating cluster centroids is ReTime, the total number of sentences in the dataset is tn, the number of clusters is k, and the number of iterations of the proposed algorithm is LoopI. Essentially, the following two computations are required in each clustering iteration: (i) tn·k sentence-to-cluster-centroid similarity calculations and (ii) k cluster centroid relocations. As a result, the time complexity of the proposed version can be defined as O_CBLC = (SentSim·tn·k + ReTime·k)·LoopI. Since SentSim ≫ ReTime and tn ≫ k, the overall time complexity of the CBLC algorithm is O(tn), which means that the computational cost is proportional to the size of the dataset to be clustered.
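As a purely illustrative instantiation of this formula (using the Reuters subset of Section 3, so tn = 1833 and k = 10; these values are not tied to any timing reported in the chapter), each iteration performs tn·k = 18,330 sentence-to-centroid similarity calculations and k = 10 centroid recomputations, giving

$$O_{CBLC} = (18{,}330 \cdot SentSim + 10 \cdot ReTime) \cdot LoopI,$$

so the sentence-to-centroid similarity term dominates.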

#### 3.1. Datasets

Reuters-21578 is a commonly used dataset for text classification tasks. It contains more than 20,000 documents from over 600 classes. The experimental results presented in this chapter use only a subset containing 1833 text fragments, each labeled as belonging to one of 10 distinct classes. The total number of text fragments in each of the 10 classes is 354, 333, 258, 210, 155, 134, 113, 100, 90, and 70, respectively.


In the Aural Sonar dataset [53], two randomly selected people were asked to assign a similarity score between 1 and 5 to all pairs of signals returned from a broadband active sonar system. The two scores obtained from the participants were added to produce a 100 × 100 similarity matrix with values ranging from 2 to 10.

The Protein dataset [54, 57] consists of dissimilarity values for 226 samples over nine classes. We use the reduced set [57] of 213 proteins from four classes, which results from removing classes with fewer than seven samples.

The Voting dataset is a two-class classification task with around 435 samples (text fragments). Similarity scores in the form of a matrix were computed from the data in the categorical domain.

The SearchSnippets dataset consists of eight different predefined domains (i.e., classes); it was generated from web-search-transaction result activity.

The StackOverflow dataset consists of 3,370,528 samples collected during the period from July 31, 2012, to August 14, 2012 (https://www.kaggle.com). In this chapter, we randomly select 20,000 question titles from 20 different classes.

The Biomedical dataset is a challenge dataset published on BioASQ's official website; we randomly select 20,000 paper titles from 20 different MeSH major classes.

#### 3.2. Clustering evaluation criteria

Since complete clusters (i.e., all objects from a single class are assigned to a single cluster) and homogeneous clusters (i.e., each cluster contains only objects from a single class) are hardly ever achieved together, we aim to reach a satisfactory balance between these two objectives. Therefore, we apply five well-known clustering criteria to evaluate the performance of the proposed algorithm: Purity, Entropy, V-measure, Rand Index, and F-measure.

Entropy and Purity [58]. The entropy measure is used to show how the sentences of the different classes are partitioned within each cluster, and it is defined as the weighted average of the individual cluster entropies over all clusters, where C = {c1, c2, c3, …, cn} denotes the set of classes:

$$\mathrm{Entropy} = \sum_{j=1}^{|L|} \frac{|w_j|}{N} \cdot \frac{-1}{\log |C|} \sum_{i=1}^{|C|} \frac{|w_j \cap c_i|}{|w_j|} \log\!\left(\frac{|w_j \cap c_i|}{|w_j|}\right) \tag{2}$$

where L is the set of clusters, w_j is the set of sentences assigned to cluster j, and N is the total number of sentences. The purity of a cluster is the fraction of the cluster size that the largest class of sentences assigned to that cluster represents, that is,

$$\mathrm{Purity}(w_j) = \frac{1}{|w_j|} \max_i \, |w_j \cap c_i|$$
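A short sketch of these two criteria under the notation above (hypothetical helper names; clusters w_j and classes c_i are represented as sets of sentence ids, and clusters are assumed nonempty):

```python
# Sketch of Eq. (2) and the purity definition (illustrative, not the
# authors' evaluation code).
import math

def entropy(clusters, classes, n):
    total = 0.0
    for w in clusters:                     # w_j: one cluster
        h = 0.0
        for c in classes:                  # c_i: one class
            p = len(w & c) / len(w)        # |w_j ∩ c_i| / |w_j|
            if p > 0.0:
                h += p * math.log(p)
        total += (len(w) / n) * (-h / math.log(len(classes)))
    return total

def purity(clusters, classes, n):
    # Overall purity: size-weighted fraction of each cluster covered by
    # its largest class.
    return sum(max(len(w & c) for c in classes) for w in clusters) / n
```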

