Abstract

Parzen analysis associates a Gaussian kernel with each data point, thus obtaining a density function which may be viewed as a possible artificial generator of the data. This probability function can be decomposed into the product of two components, weight (W) and shape (S), which represent different aspects of the data. We demonstrate how this naturally leads to a formalism of fields in data space, interconnected through relations in one-dimensional scale space corresponding to the common Gaussian width. We discuss the connection of this formalism to different clustering procedures, such as quantum clustering (QC) and mean shift (MS). We illustrate with various examples the importance of these concepts in the analysis of natural data as well as in image analysis in two or three dimensions.

Keywords: Parzen probability, weight-shape decomposition, quantum clustering, mean shift, image analysis

### 1. Introduction

Unsupervised machine learning has produced a wealth of clustering methods over the past few decades [1]. One of the important early ideas is the Parzen window distribution [2]. Introduced in 1962 as a kernel density estimate of the distribution function underlying measured data, it still serves as the basis of clustering algorithms in pattern recognition [1, 3]. Recently it has been shown [4] that the Parzen probability function can be decomposed into two components, weight and shape, which represent different aspects of the data. Weight, as its name implies, describes the semi-global strength of the distribution, whereas shape represents local properties which come to light once the bias of the weight is removed. Moreover, −log(shape) coincides with a potential function V, which was previously introduced in quantum clustering (QC) [5]. The cluster centers in QC correspond to minima of V. An alternative method, mean shift [6, 7], views the maxima of the probability function as the appropriate candidates for cluster centers. These two different points of view can now be studied and compared within a unified formalism [4].

Here we discuss the novel connections of the Parzen distribution to its potential and show how both can be used for the analysis of data points, leading to alternative clustering possibilities and extracting interesting features from the data. A particularly interesting set of applications appears in image analysis. Scale-space image analysis [8] has developed from the Parzen kernel methodology discussed in [6].

Now it turns out that insights from the potential term, or the shape component, allow for novel applications which are relevant to medical and technical imaging.

### 2. The Parzen probability distribution and its potential function

Data analysis often involves dimensionality reduction and noise removal, as well as other tools, which eventually lead one to consider a set of preprocessed data points located in a d-dimensional Euclidean space, $\mathbf{x}\_i \in \mathbb{R}^d$, with possible positive attributes (e.g., intensities) $I\_i$. For this set, we define the non-normalized Parzen window function with Gaussian kernels as

$$\psi\_{\sigma}(\mathbf{x}) = \sum\_{i} I\_{i} e^{-\frac{(\mathbf{x} - \mathbf{x}\_{i})^{2}}{2\sigma^{2}}} \tag{1}$$
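As a concrete illustration, the following is a minimal numpy sketch of Eq. (1); the function name and the toy data set are our own illustrative constructions, not part of [4]:

```python
import numpy as np

def parzen_psi(x, data, intensities, sigma):
    """Non-normalized Parzen window function psi_sigma(x) of Eq. (1):
    a sum of Gaussian kernels of common width sigma, one per data
    point, weighted by the positive intensities I_i."""
    sq = np.sum((data - x) ** 2, axis=1)      # |x - x_i|^2 for all i
    return np.sum(intensities * np.exp(-sq / (2.0 * sigma**2)))

# Toy one-dimensional data set reused by the sketches below (illustrative).
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 1))    # 50 points, d = 1
I = np.ones(50)                   # unit intensities I_i
sigma = 0.7
print(parzen_psi(np.array([0.0]), pts, I, sigma))
```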

Following [4], we introduce a relative probability weight function representing the influence of the kernel at data point $\mathbf{x}\_i$ on any arbitrary point $\mathbf{x}$:

$$p\_i(\mathbf{x}) = \frac{e^{-\frac{(\mathbf{x}-\mathbf{x}\_i)^2}{2\sigma^2}}}{\psi(\mathbf{x})} \tag{2}$$

It obeys $\sum\_i I\_i p\_i(\mathbf{x}) = 1$ and allows for the definition of two new scalar functions over data space, the potential and entropy fields

$$V(\mathbf{x}) = \sum\_{i} I\_{i} \frac{\left(\mathbf{x} - \mathbf{x}\_{i}\right)^{2}}{2\sigma^{2}} p\_{i}(\mathbf{x}) \tag{3}$$

$$H(\mathbf{x}) = -\sum\_{i} I\_{i} p\_{i}(\mathbf{x}) \log p\_{i}(\mathbf{x}) \tag{4}$$

Their difference is related to the Parzen probability function

$$V(\mathbf{x}) = H(\mathbf{x}) - \log \psi(\mathbf{x}) \tag{5}$$
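A sketch of Eqs. (2)-(4) at a single query point, reusing the toy data introduced above; the final assert verifies the relation of Eq. (5) numerically (the function name is ours):

```python
def parzen_fields(x, data, intensities, sigma):
    """Relative weights p_i(x) (Eq. 2), potential V(x) (Eq. 3) and
    entropy H(x) (Eq. 4) at a single query point x. Direct evaluation
    is adequate for well-scaled data; a log-sum-exp formulation would
    be more robust in general."""
    sq = np.sum((data - x) ** 2, axis=1)
    kern = np.exp(-sq / (2.0 * sigma**2))
    psi = np.sum(intensities * kern)
    p = kern / psi                                       # Eq. (2)
    V = np.sum(intensities * sq / (2.0 * sigma**2) * p)  # Eq. (3)
    H = -np.sum(intensities * p * np.log(p))             # Eq. (4)
    return p, V, H, psi

_, V, H, psi = parzen_fields(np.array([0.1]), pts, I, sigma)
assert np.isclose(V, H - np.log(psi))   # Eq. (5)
```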

Eq. (5) can be rewritten as

$$\psi(\mathbf{x}) = \mathcal{W}(\mathbf{x})\,\mathcal{S}(\mathbf{x}) \tag{6}$$

using [4] the concepts of weight and shape: $\mathcal{W}(\mathbf{x}) = e^{H(\mathbf{x})}$ and $\mathcal{S}(\mathbf{x}) = e^{-V(\mathbf{x})}$.
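A quick numerical check of the decomposition (6), again on our toy data:

```python
# Weight W = e^H and shape S = e^{-V}; their product recovers psi, Eq. (6).
_, V, H, psi = parzen_fields(np.array([0.3]), pts, I, sigma)
W, S = np.exp(H), np.exp(-V)
assert np.isclose(psi, W * S)
```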

Since $V(\mathbf{x}) \geq 0$, it follows that $\mathcal{S}(\mathbf{x}) \leq 1$. Moreover, both $\mathcal{W}$ and $\mathcal{S}$ are nonnegative. $\mathcal{S}$ is integrable over $\mathbf{x}$ and, as such, can also serve as a distribution. From definitions (1) and (3), one can derive [4] the Schrödinger equation

$$-\frac{\sigma^2}{2}\nabla^2\psi(\mathbf{x}) + V(\mathbf{x})\psi(\mathbf{x}) = \frac{d}{2}\psi(\mathbf{x}), \tag{7}$$

which has been the cornerstone of the QC algorithm [5].
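Eq. (7) can be checked numerically; the sketch below does so in d = 1 with a central-difference Laplacian, on the toy data from above (step size and test points are arbitrary choices of ours):

```python
# Verify -sigma^2/2 * psi'' + V * psi = (d/2) * psi for d = 1, Eq. (7).
h = 1e-4
for x0 in (-1.0, 0.0, 0.5):
    _, V, _, psi = parzen_fields(np.array([x0]), pts, I, sigma)
    psi_p = parzen_psi(np.array([x0 + h]), pts, I, sigma)
    psi_m = parzen_psi(np.array([x0 - h]), pts, I, sigma)
    lap = (psi_p - 2.0 * psi + psi_m) / h**2     # finite-difference psi''
    assert np.isclose(-0.5 * sigma**2 * lap + V * psi,
                      0.5 * psi, rtol=1e-4)      # right side: (d/2) psi
```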

### 3. Interplay of scale and data space dependence

All the scalar fields over data space, introduced in the previous section, depend on the parameter σ, the scale of all Gaussian kernels. This dependence leads to further interesting relations between the Parzen probability function and its potential. Thus, from the definitions (1) and (3), it follows that


$$\frac{\sigma}{2} \frac{\partial}{\partial \sigma} \log \psi\_{\sigma}(\mathbf{x}) = V\_{\sigma}(\mathbf{x}) \tag{8}$$

where we keep the index σ, which has been suppressed in the previous section. This relation displays a direct connection between the two scalar functions defining the probability and the potential. We now introduce a vector field $\mathbf{D}\_\sigma$, defined by

$$-\nabla \log \psi\_{\sigma}(\mathbf{x}) = \mathbf{D}\_{\sigma} \tag{9}$$

and vanishes when the probability reaches its extrema in data space. Interestingly, it is also related to the gradient of the potential function through

$$\frac{-\sigma}{2} \frac{\partial}{\partial \sigma} \mathbf{D}\_{\sigma} = \nabla V\_{\sigma} \tag{10}$$

Hence we conclude that the potential reaches its extrema when $\mathbf{D}\_\sigma$ remains stationary with respect to variations of σ.

$\mathbf{D}\_\sigma$ may be expressed, in analogy with Eq. (3), as

$$\mathbf{D}(\mathbf{x}) = \sum\_{i} I\_{i} \ \frac{\mathbf{x} - \mathbf{x}\_{i}}{\sigma^{2}} p\_{i}(\mathbf{x}). \tag{11}$$
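In code, Eq. (11) and the defining relation (9) can be cross-checked against a finite-difference gradient of log ψ; the helper name drift_field is ours, and the toy data from above is reused:

```python
def drift_field(x, data, intensities, sigma):
    """Vector field D_sigma(x) of Eq. (11)."""
    diff = x - data                           # (x - x_i) for every i
    sq = np.sum(diff ** 2, axis=1)
    kern = np.exp(-sq / (2.0 * sigma**2))
    p = kern / np.sum(intensities * kern)     # Eq. (2)
    return np.sum((intensities * p)[:, None] * diff, axis=0) / sigma**2

# Cross-check against Eq. (9): D = -grad log psi (central difference in x).
x0, h = np.array([0.3]), 1e-5
num_grad = (np.log(parzen_psi(x0 + h, pts, I, sigma))
            - np.log(parzen_psi(x0 - h, pts, I, sigma))) / (2.0 * h)
assert np.allclose(drift_field(x0, pts, I, sigma), -num_grad)
```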

Its square $U = \mathbf{D}^2$ serves as an indicator function whose stationarity

$$\frac{\sigma}{2} \frac{\partial}{\partial \sigma} U\_{\sigma}(\mathbf{x}) = 2 \nabla \log \psi\_{\sigma}(\mathbf{x}) \cdot \nabla V\_{\sigma}(\mathbf{x}) = 0 \tag{12}$$

implies the existence of extrema of either the probability or the potential. Since $U = \mathbf{D}^2$ is nonnegative, $U = 0$ is a minimum in σ; it corresponds to extrema of ψ, which are associated with $\mathbf{D} = 0$. Other values of $U$ obeying Eq. (12) are associated with extrema of $V$, which occur whenever $\frac{\partial}{\partial \sigma}\mathbf{D}\_{\sigma} = 0$. Eq. (12) may thus be viewed as a statement concerning a set of points of interest in the data: all extrema of either the probability or the potential. In analogy with statistics, one may also view this equation as an inference method, finding the parameter σ which leads to points of interest at given values of $\mathbf{x}$.
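The scale relation (8) and the stationarity condition (12) lend themselves to a simple numerical experiment, again on the toy data; the grid of σ values is an arbitrary choice of ours:

```python
# Check Eq. (8): (sigma/2) d(log psi)/d(sigma) = V, central difference.
x0, h = np.array([0.2]), 1e-5
num = (np.log(parzen_psi(x0, pts, I, sigma + h))
       - np.log(parzen_psi(x0, pts, I, sigma - h))) / (2.0 * h)
_, V, _, _ = parzen_fields(x0, pts, I, sigma)
assert np.isclose(0.5 * sigma * num, V)

# Scan sigma at fixed x for stationary points of U = D^2 (Eq. 12);
# these flag extrema of psi (D = 0) or of V (dD/dsigma = 0).
sigmas = np.linspace(0.2, 3.0, 300)
U = np.array([np.sum(drift_field(x0, pts, I, s) ** 2) for s in sigmas])
dU = np.gradient(U, sigmas)
print("stationary scales:", sigmas[np.where(np.diff(np.sign(dU)) != 0)[0]])
```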

Although all extrema may be regarded as points of interest, some are of more interest than others: extrema that remain fixed in $\mathbf{x}$ over a range of scales that is large compared with the ranges of other points of interest. This criterion, introduced by Roberts [9], allows searching for scales that correspond to natural properties of the data and thus supports the search for good clustering of the data [4, 5, 9].
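One crude reading of this criterion, sketched under our toy setup: locate the minima of V on a grid at several scales and keep the locations that persist over a wide σ range (grid and scale values are illustrative):

```python
# Grid minima of V for several sigma values; locations that recur over
# a wide range of scales are stable points of interest in Roberts' sense.
grid = np.linspace(-3.0, 3.0, 301)
for s in (0.3, 0.5, 0.8, 1.2):
    Vg = np.array([parzen_fields(np.array([g]), pts, I, s)[1] for g in grid])
    interior = (Vg[1:-1] < Vg[:-2]) & (Vg[1:-1] < Vg[2:])   # local minima
    print(f"sigma = {s}: V minima near {np.round(grid[1:-1][interior], 2)}")
```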

Finally, we wish to point out that ψ is not a properly normalized distribution function. A proper probability function, whose integral is 1, is defined by

$$P = \frac{1}{N} \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{d}{2}} \psi(\mathbf{x}) \tag{13}$$

where $N = \sum\_i I\_i$. We note that ψ and V obey a joint integration constraint [10]

$$\frac{1}{N} \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{d}{2}} \int d\mathbf{x} \psi(\mathbf{x}) V(\mathbf{x}) = \frac{d}{2}.\tag{14}$$

This may be interpreted as a constraint on the expectation value of the potential function in data space.
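Eq. (14) is easily verified for the toy data by quadrature in d = 1 (grid width and spacing are our illustrative choices):

```python
# Check (1/N) (2*pi*sigma^2)^(-d/2) * integral(psi * V) = d/2 for d = 1.
grid = np.linspace(pts.min() - 6 * sigma, pts.max() + 6 * sigma, 4001)
dx = grid[1] - grid[0]
vals = [parzen_fields(np.array([g]), pts, I, sigma) for g in grid]
integrand = np.array([V * psi for _, V, _, psi in vals])
lhs = integrand.sum() * dx / (I.sum() * np.sqrt(2.0 * np.pi * sigma**2))
assert np.isclose(lhs, 0.5, rtol=1e-3)   # d/2 with d = 1
```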


Examples of the behavior of log ψ and of V are shown in Figure 1 for a data set of 9000 observed galaxies (with redshift in the domain 0.47 ± 0.005), regarded as points in spherical angles θ and φ within some limited range. Whereas for σ = 2 (in units of angular degrees) the two fields exhibit many extrema, clear differences appear at larger σ; for example, at σ = 10, log ψ has one maximum while V displays several minima. This figure is taken from [10], a paper which contains a detailed and expanded formulation of the analysis presented in this section.

#### Figure 1.

(a) Loci of 9000 galaxies, downloaded from the Sloan Digital Sky Server DR12, within some limited range of spherical angles. Reproduced from [10]. (b) log ψ (top) and V (bottom) displayed over the data plane of (a), using σ = 2 in spherical angle units. Reproduced from [10]. (c) Surfaces of V and log ψ for increased values of the Gaussian width. Reproduced from [10].
