**4. Experimental results with synthetic data**

Six different types of 2D synthetic datasets [29, 35] are used in this work: the snail, screw, ring, set3, set5, and set25 datasets. **Figures 4**–**6** show the plots of NG, GNG, and RGNG clustering with three of these 2D synthetic datasets (screw, set5, and snail) as examples. The number of neurons is selected randomly as *N* = 7, 10, and 12.

These figures alone cannot clearly differentiate between the methods. Hence, four parameters introduced in the previous section are used in this work to evaluate the performance of the clustering techniques: CR, PQ, MCN, and MSE. For the best comparison with RGNG, the MDL criterion is added to the NG and GNG techniques. The training results of these techniques with synthetic data are shown in **Table 1**, where the number of neurons is chosen randomly as *N* = 7, 10, and 12.

**Figure 4.** Clustering with screw synthetic dataset for *N* = 7, by running NG, GNG, and RGNG techniques.


| Parameter | Number of neurons | NG | GNG | RGNG |
|---|---|---|---|---|
| **CR** | *N* = 7 | 0.8718 | 0.9686 | 0.9929 |
| | *N* = 10 | 0.8514 | 0.9786 | 0.9843 |
| | *N* = 12 | 0.8010 | 0.9647 | 0.9759 |
| **MCN** | *N* = 7 | 9 | 8 | 7 |
| | *N* = 10 | 12 | 11 | 10 |
| | *N* = 12 | 15 | 14 | 12 |
| **PQ** | *N* = 7 | 0.8990 | 0.9465 | 0.9869 |
| | *N* = 10 | 0.8531 | 0.9288 | 0.9841 |
| | *N* = 12 | 0.8279 | 0.9043 | 0.9807 |
| **MSE** | *N* = 7 | 2.8032e+004 | 2.7608e+004 | 2.6493e+004 |
| | *N* = 10 | 2.7913e+004 | 2.7378e+004 | 2.6351e+004 |
| | *N* = 12 | 2.7703e+004 | 2.6940e+004 | 2.6188e+004 |


**Table 1.** Clustering results of synthetic data.


**Figure 5.** Clustering with set5 synthetic dataset for *N* = 10, by running NG, GNG, and RGNG techniques.


**Figure 6.** Clustering with snail synthetic dataset for *N* = 12, by running NG, GNG, and RGNG techniques.


According to the literature [29, 36], the clustering output results introduced in **Table 1** clarify that the RGNG approach is insensitive to different initializations and the presence of outliers. In these techniques, the number of neurons used is small, so the CR values registered in the table are not high. In all three clustering techniques, the number of neurons was equal to the actual cluster number. RGNG can effectively locate the actual number of clusters compared to the other two methods; NG and GNG fail with higher cluster numbers in the synthetic case.
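The exact definition of CR is given in the previous section; purely as an illustration of how such a rate can be computed for a prototype-based result, the sketch below assumes that CR is the fraction of samples whose nearest prototype is mapped, by majority vote, to their true cluster. The arrays `X`, `y_true`, and `prototypes` are hypothetical placeholders, not the datasets of Table 1.

```python
import numpy as np

def classification_rate(X, y_true, prototypes):
    """Assumed CR: nearest-prototype assignment followed by a majority-vote
    mapping of each prototype to a ground-truth cluster label."""
    # Squared Euclidean distance from every sample to every prototype.
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)          # index of the winning prototype

    correct = 0
    for p in range(len(prototypes)):
        members = y_true[assign == p]   # true labels captured by prototype p
        if members.size:
            # Count the samples that agree with the prototype's majority label.
            correct += int((members == np.bincount(members).argmax()).sum())
    return correct / len(X)

# Hypothetical usage with a toy 2D dataset and N = 7 prototypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y_true = rng.integers(0, 7, size=400)
prototypes = rng.normal(size=(7, 2))
print(classification_rate(X, y_true, prototypes))
```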

The registered values of the MCN show that the number of detected prototypes or clusters in the RGNG technique is lower than in the other two, which means that its ability to group the data into the actual number of clusters is better than that of the other two techniques. For example, when *N* is set to 10, the MCN value for RGNG is 10, which is less than the corresponding NG and GNG values. The MCN value for running RGNG is equal to the number of neurons, 10, and it behaves the same way for the other *N* values, while the MCN values for running NG and GNG deviated from the actual cluster number.

Regarding the PQ value, it is noticed that the RGNG approach possesses higher PQ values than the NG and GNG techniques. For example, when *N* is set to 12, the PQ value for RGNG is 0.9807, which is higher than the NG and GNG values. These high PQ values indicate that the RGNG technique has a better partitioning quality than the others and finds more representative clusters.

Moreover, the RGNG method can find all the natural clusters during the growing stage with the correct number of prototypes. Hence, its MSE values are lower, which indicates that the RGNG technique has better robustness. For example, when *N* is set to 7, the MSE value for RGNG is 2.6493e+004, which is lower than the NG and GNG values. The NG and GNG techniques may not detect all the actual clusters; hence, they yield higher MSE values.
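As a minimal sketch of how the MSE values in **Table 1** can be obtained, the snippet below takes MSE to be the mean squared Euclidean distance from each sample to its nearest prototype; this is an assumed reading of the measure defined in the previous section, and the data and prototype arrays are placeholders.

```python
import numpy as np

def mse(X, prototypes):
    """Assumed MSE: mean squared Euclidean distance from each sample to its
    nearest prototype."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Hypothetical comparison in the spirit of Table 1: a prototype set that
# covers all natural clusters should give the smaller MSE.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
protos_a = rng.normal(size=(7, 2))                   # stand-in for NG prototypes
protos_b = X[rng.choice(len(X), 7, replace=False)]   # stand-in for RGNG prototypes
print(mse(X, protos_a), mse(X, protos_b))
```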

The MDL value is one of the popular information-theoretic evaluation measures used as clustering validity indexes [37]. The MDL criterion provides the ability to find the optimal number of clusters and their center positions, which correspond to the smallest MDL value.
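The precise MDL index used in this work follows [37] and the previous section; the sketch below only illustrates the general idea of a two-part code length (model cost plus data cost) whose minimum over candidate prototype sets indicates the optimal cluster number. The constants and the Gaussian data-cost term are assumptions made for illustration, not the chapter's formula.

```python
import numpy as np

def mdl_score(X, prototypes, bits_per_param=16.0):
    """Schematic two-part MDL-style score: model cost + data cost.
    Not the exact index used in this chapter; an illustration only."""
    n, d = X.shape
    k = len(prototypes)
    model_bits = k * d * bits_per_param   # cost of encoding the k prototype vectors
    # Data cost: Gaussian code length of the residuals to the nearest prototype.
    resid2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    sigma2 = max(resid2.mean(), 1e-12)
    data_bits = 0.5 * n * d * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + data_bits

# The optimal cluster number is the candidate k whose trained prototype set
# yields the smallest score, matching the "smallest MDL value" rule above.
```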

The average MDL values during the growth stages are plotted versus the number of clusters or prototypes. **Figure 7** shows the curves for the NG and GNG techniques combined with the MDL criterion, as well as for the RGNG approach, on a synthetic dataset for different numbers of neurons, selected randomly as *N* = 7, 10, and 12. Each technique detected the cluster number corresponding to its smallest MDL value.

In RGNG, the smallest MDL value was recorded on average with respect to NG and GNG combined with the MDL principle. For example, in **Figure 7 (b)**, the smallest MDL value is 2.65, obtained from running RGNG when *N* is equal to 4, while for the same *N* = 4 a higher MDL value of 2.77 is recorded from running NG and GNG. From the presented figures, it is concluded that the proposed RGNG approach is insensitive to different initializations and the presence of outliers and can successfully find the actual number of clusters.

**Figure 7.** MDL values versus the number of clusters running the NG, GNG, and RGNG techniques on synthetic data, for: (a) *N* = 7; (b) *N* = 10; (c) *N* = 12.
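To reproduce the kind of comparison shown in **Figure 7**, the MDL value recorded at each candidate cluster number can be stored per technique and the minimiser of each curve reported. The numbers below are illustrative placeholders chosen only to echo the qualitative behaviour described above; they are not read off Figure 7.

```python
import numpy as np

# Placeholder MDL-versus-cluster-number curves for the three techniques
# (illustrative values only; not measurements from Figure 7).
clusters = np.arange(2, 11)
mdl_curves = {
    "NG + MDL":  np.array([3.10, 2.90, 2.77, 2.75, 2.79, 2.86, 2.93, 3.00, 3.08]),
    "GNG + MDL": np.array([3.05, 2.88, 2.77, 2.76, 2.80, 2.87, 2.94, 3.01, 3.09]),
    "RGNG":      np.array([2.98, 2.80, 2.65, 2.71, 2.77, 2.84, 2.90, 2.97, 3.04]),
}

# The detected cluster number for each technique is the minimiser of its curve.
for name, curve in mdl_curves.items():
    k_best = int(clusters[curve.argmin()])
    print(f"{name}: smallest MDL = {curve.min():.2f} at {k_best} clusters")
```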

**Figure 9** shows some of the selected 2D synthetic datasets from the different datasets that were used in this work. Beside each plot, the information related to it is shown in the "info" window, on the left side of each plot.

**Figure 8.** Main window of the prototype-based clustering software package.

**3.** *Selection technique*: The user can select one of the clustering techniques NG, GNG, or RGNG. The RGNG technique is selected as an example for the training in **Figure 8** with Ring data and *N* = 18, which is selected randomly. The size of the selected "Ring" data is 400x2 double. The selected data is plotted on sketch1 inside the main clustering window of **Figure 8**.

Before clicking on the "Apply NG," "Apply GNG," or "Apply RGNG" button, the training parameters related to each technique must be defined. As explained in Section 3, the training parameters must be set carefully within the limited range. The number of neurons (*N*) as well as the other parameters related to the selected technique must be defined. Another example of using the RGNG technique, with the Set3 dataset, is shown in **Figure 10**. The RGNG training parameters are set to the typical values from the literature: *ε*<sub>bi</sub> = 0.1, *ε*<sub>bf</sub> = 0.01, *ε*<sub>ni</sub> = 0.005, *ε*<sub>nf</sub> = 0.0005, *α*<sub>max</sub> = 100, *k* = 1.3, *η* = 1 × 10<sup>−4</sup>; the number of neurons (*N*) is chosen randomly as 14. When the algorithm's training is
