relocate the cluster centroid. As a result, the time complexity of the proposed version can be defined as

$$O\_{CBLC} = (SentSim \cdot tn \cdot k + ReTime \cdot k) \cdot LoopI$$

Since SentSim ≫ ReTime and tn ≫ k, the overall time complexity of the CBLC algorithm is O(tn), which means that the computational complexity is proportional to the size of the dataset that needs to be clustered.
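To make the simplification explicit, the dominance argument can be sketched as follows; the grouping of terms is our reading of the formula, with SentSim, k, and LoopI treated as quantities bounded independently of the dataset size, which the chapter does not spell out:

$$O\_{CBLC} = (SentSim \cdot tn \cdot k + ReTime \cdot k) \cdot LoopI \approx SentSim \cdot tn \cdot k \cdot LoopI = O(tn)$$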

3. Experiments and results

This section presents the performance of the CBLC algorithm on seven benchmark datasets, and the results are compared with those of other well-known clustering algorithms: spectral clustering [18, 36], affinity propagation [35], the k-medoids algorithm [30, 31], STC-LE [39], and k-means (TF-IDF) [40]. We first describe the seven benchmark datasets, then discuss the cluster evaluation criteria, and finally report the experimental results (Figure 2).

3.1. Benchmark datasets

While the CBLC algorithm is obviously suited to tasks involving sentence clustering, it is applied here to standard datasets that are generic in nature: the Reuters-21,578 dataset [29], the Aural Sonar dataset [29, 53], the Protein dataset [29, 54], the Voting dataset [29, 55], SearchSnippets [38, 56], StackOverflow [38], and Biomedical [38].

Figure 2. CBLC algorithm performance on seven benchmark datasets.

3.2. Clustering evaluation criteria

Since a complete clustering (i.e., all objects from a single class are assigned to a single cluster) and a homogeneous clustering (i.e., each cluster contains only objects from a single class) are rarely achieved together, we aim to reach a satisfactory balance between these two goals. We therefore apply five well-known clustering criteria to evaluate the performance of the proposed algorithm: Purity, Entropy, V-measure, Rand Index, and F-measure.

Entropy and Purity [58]. The entropy measure captures how the sentences from the set of classes C = {c1, c2, c3, …, c|C|} are distributed within each cluster; it is defined as the weighted average, over all clusters w1, w2, …, w|L|, of the individual cluster entropies:

$$Entropy = \sum\_{j=1}^{|L|} \frac{|w\_j|}{N} \left( -\frac{1}{\log |C|} \sum\_{i=1}^{|C|} \frac{|w\_j \cap c\_i|}{|w\_j|} \log \frac{|w\_j \cap c\_i|}{|w\_j|} \right) \tag{2}$$
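As a concrete illustration, the following is a minimal sketch of Eq. (2) in Python, assuming the clustering is given as two integer label arrays (one cluster label and one gold class label per sentence); the function name and array representation are ours, not from the chapter.

```python
import numpy as np

def entropy_score(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    classes = np.unique(class_labels)
    N = len(class_labels)
    total = 0.0
    for j in np.unique(cluster_labels):
        w_j = class_labels[cluster_labels == j]            # class labels of sentences in cluster j
        p = np.array([np.mean(w_j == c) for c in classes])  # |w_j ∩ c_i| / |w_j|
        p = p[p > 0]                                        # 0 log 0 = 0 by convention
        cluster_entropy = -(p * np.log(p)).sum() / np.log(len(classes))
        total += (len(w_j) / N) * cluster_entropy           # weight by cluster size
    return total
```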

The purity of a cluster is the fraction of the cluster occupied by its largest class of sentences, that is,

$$P\_j = \frac{1}{|w\_j|} \max\_{i} \left( |w\_j \cap c\_i| \right) \tag{3}$$

Overall purity is the weighted sum of the individual cluster purities and is given by

$$Purity = \frac{1}{N} \sum\_{j=1}^{|L|} \left( |w\_j| \times P\_j \right) \tag{4}$$
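Similarly, here is a minimal sketch of Eqs. (3) and (4) under the same assumed label-array representation; since |w_j| × P_j = max_i |w_j ∩ c_i|, overall purity reduces to a sum of per-cluster maxima.

```python
import numpy as np

def purity_score(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    N = len(class_labels)
    total = 0
    for j in np.unique(cluster_labels):
        w_j = class_labels[cluster_labels == j]
        # |w_j| * P_j = max_i |w_j ∩ c_i|, from Eq. (3)
        total += max(np.sum(w_j == c) for c in np.unique(w_j))
    return total / N  # Eq. (4)
```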


While purity and entropy are useful for comparing clusterings with the same number of clusters, they are not reliable when comparing clusterings with different numbers of clusters. This is because entropy and purity only measure how the sets of sentences are partitioned within each cluster, that is, they capture homogeneity alone. The highest purity and lowest entropy scores are therefore usually obtained when the total number of clusters is very large, which in turn yields the lowest completeness. The next measure we use considers both completeness and homogeneity.

V-measure [59]. This measure is defined as the harmonic mean of homogeneity and completeness, that is, V = 2 \* homogeneity \* completeness / (homogeneity + completeness), where homogeneity = 1 – H(C|L)/H(C) and completeness = 1 – H(L|C)/H(L).

The entropy terms H(C), H(L), H(C|L), and H(L|C) in these definitions are computed as in Eq. (5):

$$\begin{aligned} H(C) &= -\sum\_{i=1}^{|C|} \frac{|c\_i|}{N} \log \frac{|c\_i|}{N}, \quad H(L) = -\sum\_{j=1}^{|L|} \frac{|w\_j|}{N} \log \frac{|w\_j|}{N} \\ H(C|L) &= -\sum\_{j=1}^{|L|} \sum\_{i=1}^{|C|} \frac{|w\_j \cap c\_i|}{N} \log \frac{|w\_j \cap c\_i|}{|w\_j|}, \quad \text{and} \quad H(L|C) = -\sum\_{i=1}^{|C|} \sum\_{j=1}^{|L|} \frac{|w\_j \cap c\_i|}{N} \log \frac{|w\_j \cap c\_i|}{|c\_i|} \end{aligned} \tag{5}$$
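The following is a minimal sketch of the V-measure computed from Eq. (5), again assuming the label-array representation used above; the helper names are illustrative.

```python
import numpy as np

def v_measure(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    N = len(class_labels)

    def H(labels):
        # marginal entropy, e.g. H(C) or H(L) in Eq. (5)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / N
        return -(p * np.log(p)).sum()

    def H_cond(a, b):
        # conditional entropy H(a | b), e.g. H(C|L) in Eq. (5)
        total = 0.0
        for v in np.unique(b):
            a_in_v = a[b == v]
            _, counts = np.unique(a_in_v, return_counts=True)
            total -= ((counts / N) * np.log(counts / len(a_in_v))).sum()
        return total

    homogeneity = 1.0 - H_cond(class_labels, cluster_labels) / H(class_labels)
    completeness = 1.0 - H_cond(cluster_labels, class_labels) / H(cluster_labels)
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```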

Rand Index and F-measure. These measures rely on a combinatorial approach that considers every possible pair of sentences. The Rand Index is defined as Rand Index = (TP + TN)/(TP + FP + FN + TN), where TP is the number of true positives (pairs of sentences in the same class and the same cluster), FP is the number of false positives (pairs in different classes but the same cluster), FN is the number of false negatives (pairs in the same class but different clusters), and TN is the number of true negatives (pairs in different classes and different clusters).

The F-measure is another method widely applied in the information retrieval domain; it is defined as the harmonic mean of Precision (P) and Recall (R), that is, F-measure = 2\*P\*R/(P + R), where P = TP/(TP + FP) and R = TP/(TP + FN).
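Finally, a minimal pair-counting sketch of the Rand Index and F-measure as defined above; it enumerates all O(n²) sentence pairs, which is acceptable for datasets of this size, and the function names are illustrative.

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    TP = FP = FN = TN = 0
    for a, b in combinations(range(len(class_labels)), 2):
        same_class = class_labels[a] == class_labels[b]
        same_cluster = cluster_labels[a] == cluster_labels[b]
        if same_class and same_cluster:
            TP += 1  # same class, same cluster
        elif same_cluster:
            FP += 1  # different classes, same cluster
        elif same_class:
            FN += 1  # same class, different clusters
        else:
            TN += 1  # different classes, different clusters
    return TP, FP, FN, TN

def rand_and_f(cluster_labels, class_labels):
    TP, FP, FN, TN = pair_counts(cluster_labels, class_labels)
    rand_index = (TP + TN) / (TP + FP + FN + TN)
    P, R = TP / (TP + FP), TP / (TP + FN)
    return rand_index, 2 * P * R / (P + R)
```

For instance, rand_and_f([0, 0, 1, 1], [1, 1, 0, 0]) returns (1.0, 1.0), since the clustering matches the classes up to relabeling.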

3.3. Results

Since the CBLC algorithm is generic in nature and can in principle be applied to any lexical semantic clustering domain, Figure 2 shows the results of applying it to the Reuters-21,578, Aural Sonar, Protein, Voting, SearchSnippets, StackOverflow, and Biomedical datasets, respectively, using the Purity, Entropy, V-measure, Rand Index, and F-measure evaluation measures.


The CBLC algorithm, however, requires an initial number of clusters to be specified before the algorithm starts. This number was varied from 7 to 12 for the Reuters-21,578, Aural Sonar, Protein, Voting, and SearchSnippets datasets, and from 17 to 23 for the StackOverflow and Biomedical datasets, since these ranges yielded proper clustering performance. Note that the values in the figure are averaged over 100 trials, and only the best performance according to each measure is presented.

Figures 3–9 compare the clustering performance of the CBLC algorithm with that of spectral clustering, affinity propagation, k-medoids, STC-LE, and k-means (TF-IDF) on the seven benchmark datasets, using the five cluster evaluation criteria described earlier. For the baseline (i.e., compared) methods, the value of each evaluation measure (purity, entropy, V-measure, Rand Index, and F-measure) was obtained by exploring cluster numbers ranging from 7 to 23 and selecting the setting with the best overall clustering quality. The reported empirical results for our proposed new version of standard k-means clustering and the other compared algorithms correspond to the best performance over 200 runs.

The empirical results demonstrate that the CBLC algorithm significantly outperforms the other baseline algorithms on all of the datasets used. In this experiment, however, we knew a priori what the real number of clusters was; in general, we would like the clustering algorithm to determine the actual number of clusters automatically, since this information is not normally available. Even when run with a high initial number of clusters, the CBLC algorithm was able to converge to a solution containing no more than seven clusters (e.g., in the case of the Reuters-21,578 dataset), and the figures again show that the evaluation of these clusterings is superior to that of the other baseline clustering algorithms.

Figure 3. CBLC algorithm and other compared algorithms performance on Reuters-21,578 dataset.

Figure 4. CBLC algorithm and other compared algorithms performance on Aural Sonar dataset.

Figure 5. CBLC algorithm and other compared algorithms performance on Protein dataset.

Figure 6. CBLC algorithm and other compared algorithms performance on Voting dataset.

Figure 7. CBLC algorithm and other compared algorithms performance on SearchSnippets dataset.

Figure 8. CBLC algorithm and other compared algorithms performance on StackOverflow dataset.

Figure 9. CBLC algorithm and other compared algorithms performance on Biomedical dataset.

4. Concluding remarks

This chapter has presented a new version of the k-means clustering method that is able to cluster small-sized text fragments. This new variation measures the semantic similarity between patterns (i.e., sentences) based on the idea of creating a synonym expansion set to be used in the compared semantic vectors. The sentences are represented in these vectors using semantic information derived from WordNet, which serves to identify the actual sense of a word based on the surrounding context. The experimental results have demonstrated that the method achieves satisfactory performance against the compared algorithms, namely spectral clustering, affinity propagation, k-medoids, STC-LE, and k-means (TF-IDF), as evaluated on several standard datasets.

A clear application domain for the algorithm is text-mining processing; however, the algorithm can also be used in more general text-processing settings such as text summarization. Like any clustering algorithm, the performance of CBLC ultimately depends on the text similarity values, and these values can be improved by defining a sentence-level text similarity measure that exploits more of the semantic information expressed in the compared sentences. Any such improvement would directly affect the overall sentence clustering performance.

Sentence-level text clustering is an exciting area of research within knowledge discovery and computational linguistics, and this chapter has proposed a new variation of k-means clustering that is capable of clustering sentences based on the semantic information they contain. We are interested in several of the new research directions that we have encountered in this area; what we are most excited about, however, is applying our proposed clustering technique to text-mining activities. This is because the concepts in human-written documents usually carry buried knowledge and information, whereas the technique developed in this work has so far been applied only to the clustering of text fragments. Therefore, one possible future work is to apply these ideas of sentence clustering to the development of complete techniques for sentiment analysis of people's opinions.

Author details

Khaled Abdalgader

Sohar University, Sohar, Oman

Address all correspondence to: komar@soharuni.edu.om
