4.2 Clustering

Clustering methodologies based on maximization of the probability and minimization of the potential can be defined by letting replicas of data points move in these directions. These methods are known as mean shift (MS) and quantum clustering (QC), respectively. A recent review of MS techniques has been presented in [11]. Analyzing the same data with the same width parameter σ leads to different clustering results for the two methods, as is expected from Figure 1.

To illustrate clustering based on these different methods, we consider the crab data set which is included in Ripley's textbook [12]. It consists of 200 instances belonging to four equally sized classes and is defined in a five-dimensional parameter space. Performing PCA and restricting ourselves to the 2D plane defined by PC2-PC3 leads to a challenging clustering problem which has been discussed by [13], when introducing support vector clustering (SVC), and by [5], when introducing QC. It has been used in other papers employing variations of QC, such as the recent study [14]. Here we show the results of [4], who applied three clustering methods to these data: Maximal Shape Clustering (MSC) which coincides with QC, Maximal Probability Clustering (MPC) which coincides with MS, and Maximal Entropy Clustering (MEC). The quality of all three methods may be judged by applying the Jaccard score

$$J = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$

where n11 is the number of pairs of points which belong together both in the same class (accepted as "ground truth") and in the same cluster, while n10 and n01 are the numbers of pairs which belong to the same class but different clusters, and vice versa. This test is performed in Figure 2a, demonstrating that QC wins the competition for a wide range of σ values. The expected asymptotic value is J = 98/398, befitting one cluster and four classes: with all 200 points in a single cluster, the 4 × 1225 = 4900 same-class pairs are the only ones counted in n11, n10 vanishes, and the denominator equals the total number of pairs, 19,900.
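The pair counting behind the Jaccard score can be sketched in a few lines. This is a minimal illustration, not the code used in [4]; the function name `jaccard_score` and the toy labels are assumptions made here. It reproduces the asymptotic value quoted above for four classes of 50 points placed in a single cluster.

```python
from itertools import combinations

def jaccard_score(classes, clusters):
    """Pair-counting Jaccard score between class labels and cluster labels."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            n11 += 1          # together in class and in cluster
        elif same_class:
            n10 += 1          # same class, different clusters
        elif same_cluster:
            n01 += 1          # same cluster, different classes
    return n11 / (n11 + n10 + n01)

# Four classes of 50 points each, all assigned to one cluster.
classes = [c for c in range(4) for _ in range(50)]
one_cluster = [0] * 200
print(jaccard_score(classes, one_cluster))  # 4900 / 19900 = 98/398, about 0.246
```

A perfect clustering, in which clusters coincide with classes, gives J = 1.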

#### Figure 2.

(a) The Jaccard score, which compares clustering results with expert classification, for three clustering methods over a range of σ values. (b) The number of clusters, for each method and value of σ. Reproduced from [4]. MSC, MWC, and MPC stand for maximal shape, weight, and probability clustering, respectively. MSC coincides with quantum clustering and MPC with mean shift.

#### Figure 3.

Topographic maps of probability, weight, and shape, for σ = 0.7. Reproduced from [4].

Another comparison is made in Figure 2b. This follows Roberts' criterion [9] that the preferable clustering method is the one which displays the most stable number of clusters with respect to variation of σ. This criterion is handy when the ground truth is unknown. QC also excels in this test, leading to a stable prediction of four clusters for a wide range of σ. This last figure also serves as a credibility test for Roberts' criterion.

In order to make the clustering results more intuitive, we display in Figure 3, also taken from [4], topographic maps of the different fields describing probability, weight, and shape, for σ = 0.7. The points in four different colors represent the four different classes. The topographic maps allow one to understand the clustering results, which represent the outcome of gradient ascent applied to replicas of data points which climb toward their nearest peak. Comparing the topographies of Figure 3 with the results for σ = 0.7 in Figure 2 leads to an understanding of why the three methods differ from each other.
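The picture of replicas climbing toward their nearest peak can be made concrete for the probability field, i.e., for mean shift. The following is a minimal sketch under stated assumptions: it uses the standard Gaussian mean-shift update (a kernel-weighted mean of the data), not the specific implementation of [4], and the names `mean_shift`, `label_by_peak`, and `merge_radius`, as well as the toy data, are hypothetical. QC would instead move replicas downhill on the potential V.

```python
import numpy as np

def mean_shift(data, sigma, n_iter=200, tol=1e-6):
    """Move a replica of every data point uphill on the Parzen probability."""
    replicas = data.copy()
    for _ in range(n_iter):
        # Squared distances between replicas (rows) and data points (columns).
        d2 = ((replicas[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
        k = np.exp(-d2 / (2.0 * sigma ** 2))
        # Gaussian mean-shift update: kernel-weighted mean of the data.
        new = (k @ data) / k.sum(axis=1, keepdims=True)
        shift = np.abs(new - replicas).max()
        replicas = new
        if shift < tol:
            break
    return replicas

def label_by_peak(replicas, merge_radius):
    """Give replicas that ended within merge_radius of the same peak one label."""
    labels = -np.ones(len(replicas), dtype=int)
    peaks = []
    for i, r in enumerate(replicas):
        for c, p in enumerate(peaks):
            if np.linalg.norm(r - p) < merge_radius:
                labels[i] = c
                break
        else:
            # No existing peak is close enough: this replica defines a new one.
            peaks.append(r)
            labels[i] = len(peaks) - 1
    return labels

# Toy data (an assumption for illustration): two well-separated 2D blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-3.0, 0.2, (30, 2)), rng.normal(3.0, 0.2, (30, 2))])
labels = label_by_peak(mean_shift(data, sigma=0.7), merge_radius=0.35)
print(len(set(labels)))
```

Repeating the run over a range of σ values and recording the number of distinct labels gives the kind of curve shown in Figure 2b.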

#### 4.3 Image analysis

A gray-scale image may be analyzed as a set of inputs associated with different pixels. In higher dimensional problems, such as 3D MRI data, the pixels are replaced by voxels. Both fit well into our analysis, which starts with Eq. (1), associating a probability distribution with every image. One may then wonder if the weight-shape decomposition of Eq. (6)

$$
\psi(\mathbf{x}) = W(\mathbf{x})S(\mathbf{x})
$$

can lead to any novel understanding of image analysis.

In any practical application, non-normalized probability and weight may have very large amplitudes, yet shape will be limited to values ≤1. Nonetheless it carries some important characteristics:


The first claim is trivial since S is limited to the range 0 ≤ S ≤ 1, and the Gaussian kernels are integrable. The second property is a result of Eq. (7) which shows that the potential is related to the second derivative of the probability. It has led to an interesting result in [4], demonstrating that line caricatures of images can be produced by thresholded shape drawings.

To demonstrate the third point, we display in Figure 4 the results of an analysis of a T2 MRI of the brain of a macaque monkey [15]. Following the general procedure outlined above, and limiting ourselves to large relative values (thresholded distributions) of probability and shape, we find that the latter peaks in cortical regions, whereas the former peaks in internal regions of the brain. Thus, a simple thresholding procedure allows one to easily segment the MR image, for the purpose of further analysis of the cortex by applying QC to the data in the large S domain. In Figure 5, we follow these conclusions [15] with a display of QC clusters projected onto the surface of the brain, leading to its parcellation into cortical components which are derived purely by computational image analysis.

#### Figure 4.

Thresholded shape (red) and thresholded probability (blue) dominate different regions within the same MR image of a macaque brain, projected on its y-z plane. This analysis used σ = 3 in voxel units. Data outside the brain are due to artifacts and noise in the MR image. These results are due to [15], and they indicate that large shape components dominate cortical regions of the T2 MRI brain image.

#### 4.4 Convolutional representation of V

When one analyzes data in a regular underlying structure, such as pixels m of an image I(m), the translational invariance of the Gaussian kernel allows one to use a convolutional description such as

$$\psi[\mathbf{m}] = \sum_{\mathbf{n}} I[\mathbf{n}] K[\mathbf{m}-\mathbf{n}] = I \ast K[\mathbf{m}] \tag{15}$$

with K being a discrete representation of the kernel. This leads [4] to the following result for the potential

$$V[\mathbf{m}] = \frac{I \ast L[\mathbf{m}]}{I \ast K[\mathbf{m}]} \tag{16}$$

where L = −K log K. Such 3D kernels were applied to brain MR images [15] leading to the results displayed in Figures 4 and 5.
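Eqs. (15) and (16) translate directly into two convolutions and a pointwise division. The sketch below is a 2D illustration only, written with plain NumPy; the source applies 3D kernels, and the helper names (`gaussian_kernels`, `convolve_same`, `potential`), the kernel truncation `radius`, and the zero padding at the image border are all assumptions made here.

```python
import numpy as np

def gaussian_kernels(sigma, radius):
    """Discrete Gaussian K and the companion kernel L = -K log K on a 2D grid."""
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    r2 = xx ** 2 + yy ** 2
    K = np.exp(-r2 / (2.0 * sigma ** 2))
    # log K = -r2 / (2 sigma^2), hence -K log K = K * r2 / (2 sigma^2).
    L = K * r2 / (2.0 * sigma ** 2)
    return K, L

def convolve_same(image, kernel):
    """Zero-padded, same-size convolution built from shifted copies of the image."""
    r = kernel.shape[0] // 2
    padded = np.pad(image, r)
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Adds I[m - d] * kernel[d] for the offset d = (dy, dx).
            out += kernel[dy + r, dx + r] * padded[r - dy:r - dy + h, r - dx:r - dx + w]
    return out

def potential(image, sigma, radius):
    """Return psi = I*K (Eq. 15) and V = (I*L)/(I*K) (Eq. 16)."""
    K, L = gaussian_kernels(sigma, radius)
    psi = convolve_same(image, K)
    num = convolve_same(image, L)
    # V is left as NaN where psi vanishes (no intensity within kernel reach).
    V = np.full(image.shape, np.nan)
    np.divide(num, psi, out=V, where=psi > 0)
    return psi, V

# A single bright pixel: V reduces to r^2 / (2 sigma^2) around it.
image = np.zeros((9, 9))
image[4, 4] = 1.0
psi, V = potential(image, sigma=1.0, radius=3)
print(V[4, 4], V[4, 5])  # 0.0 0.5
```

For large images an FFT-based or library convolution would be the practical choice; the explicit loop over kernel offsets is used here only to keep the sketch dependency-free.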

Noting that Eq. (15) is reminiscent of a convolutional layer in a deep network [16], we hypothesize that it can be useful to incorporate intermediate layers with nonlinear filters such as Eq. (16), as additional non-trained pooling filters in deep networks.

#### 4.5 Computational remarks

The clustering methodology which has been employed in the different examples shown above is the simplest flavor of gradient descent (or ascent). It calculates the relevant fields on the basis of the data points and continues with straightforward dynamics applied to replicas of data points, seeking the extrema of the fields. Various alternatives to this basic application exist. The most important one is hierarchical clustering, which offers conceptual simplicity and reduces computational cost. Such methodologies were described and discussed in [11] and in [4].

#### Figure 5.

Characteristic results of QC cortical clusters as mapped onto the surface of the brain and projected onto the x-y plane. This figure displays a map of the largest clusters of shape, each described by a different color. These results are due to Fisher [15].

#### Novel Formulation of Parzen Data Analysis DOI: http://dx.doi.org/10.5772/intechopen.83781

Computational complexity is an obvious issue when working with large data sets. Thus, 3D MRI data may easily comprise 1 M points, whereas their manipulation within a system like MATLAB may well be limited to handling only 40 K points at a time [15]. One way to overcome such issues is to consider performing the analysis within extended voxels, for example, voxels containing three pixels in each direction in a 3D image problem. Within each new voxel, one may simply sum the intensity of points, leading to a new presentation of the data in the form of Eq. (1) on the smaller extended voxel space. Clearly one has to make sure that such an approximation does not harm interesting features of the data.
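The extended-voxel reduction described above amounts to summing intensities over non-overlapping blocks. A minimal NumPy sketch follows; the function name `extended_voxels` and the choice to crop any remainder at the volume border are assumptions made for illustration.

```python
import numpy as np

def extended_voxels(volume, factor=3):
    """Sum intensities inside non-overlapping factor^3 blocks of a 3D volume."""
    # Crop so that every axis is a multiple of `factor` (an illustrative choice).
    nz, ny, nx = (s - s % factor for s in volume.shape)
    v = volume[:nz, :ny, :nx]
    # Split each axis into (block index, position inside block), then sum inside blocks.
    v = v.reshape(nz // factor, factor, ny // factor, factor, nx // factor, factor)
    return v.sum(axis=(1, 3, 5))

volume = np.arange(6 * 6 * 6, dtype=float).reshape(6, 6, 6)
coarse = extended_voxels(volume, factor=3)
print(coarse.shape, coarse.sum() == volume.sum())  # (2, 2, 2) True
```

With three pixels per direction the number of points drops by a factor of 27, so a volume of about 1 M voxels becomes roughly 37 K extended voxels.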

When analyzing other big data, no prior dimensional representation may be required. For the sake of noise reduction and computational complexity, it is advantageous to first apply relevant dimensional reduction, as provided, for example, by singular value decomposition (SVD) and principal component analysis (PCA). It is also important to make sure that the different axes are of similar scale, as shown in the example of Figure 3. When the data is still large, one may apply the trick of extended voxels described above. For very large data, one may also separate the data into several components, as is customary in supervised learning, to make sure that conclusions are not affected by the random choice of a subset of the data.
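The preprocessing step mentioned above, dimensional reduction followed by bringing the axes to a similar scale, can be sketched with an SVD-based PCA. This is an illustrative sketch only: the function name `pca_reduce`, the choice of rescaling each retained component to unit standard deviation, and the random test data are assumptions, not the procedure of [4] or [15].

```python
import numpy as np

def pca_reduce(data, n_components, rescale=True):
    """Project data onto its leading principal components via SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    if rescale:
        # Bring all retained axes to unit standard deviation (similar scale).
        scores = scores / scores.std(axis=0)
    return scores

# Hypothetical data: 200 points in five dimensions with very different scales.
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])
scores = pca_reduce(data, n_components=3)
print(scores.shape)  # (200, 3)
```

A single width parameter σ is then meaningful along every retained axis. Selecting a specific plane, such as PC2-PC3 in the crab example, corresponds to taking the matching columns of `scores`.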

### 5. Conclusions

In the past (see, e.g., [3]), Parzen analysis has not considered the potential field V, which plays an important part in the understanding of different features of the data. In particular, V is sensitive to small changes in the Parzen probability by being related to its second derivative. It is also the basis of quantum clustering whose advantages have been demonstrated here as well as in many other investigations in the literature. The discovery of the weight-shape decomposition of the Parzen probability has led to a focus on shape and on the potential and allows for a meaningful comparative discussion of the different features which may be extracted from data.

Here we have defined a set of fields in data space which hopefully will turn out to serve as useful tools in future data analyses. They seem to be adequately applicable to image analysis, and we expect them to be particularly useful in biomedical and technical image analyses in three dimensions. When analyzing other data, where no visual display constraints exist, noise reduction and computational complexity call for preprocessing by dimensional reduction. Further reduction of computational complexity may be tried by employing extended voxels, which our technique can easily accommodate.

In summary, this extended Parzen method replaces any set of discrete data by a continuous set of fields in data space, with interrelations in scale space. It allows for investigating data properties in terms of these fields and their extrema.

#### Acknowledgements

This work has been partially supported by the Blavatnik Cyber Center of Tel Aviv University. I thank Itay Fisher for his help with numerical data analysis.

#### Conflict of interest

The author declares no conflict of interest.

Pattern Recognition - Selected Methods and Applications
