**1. Introduction**

Clustering methods have long been a mainstay of statistics and machine learning [1–3], and have experienced a surge in importance with the advent of Big Data Analytics [4, 5]. A highly successful strategy in practical applications has been to seek out clustering methods that are effective in particular settings, based on the finding that different classes of problems respond best to specific classes of clustering methods. This finding motivates the work of this paper, which introduces a new class of tree-based clustering methods with the ability to modify the kinds of clusters produced by changing the value of a single parameter. Moreover, we show that all members of the class can be generated without duplication by a process that adaptively determines each new parameter value from the information produced by executing the class member that precedes it.

We are motivated to use a tree-based algorithm by its applications in genome analysis [6–8], image segmentation [9, 10], statistics [11], and microaggregation [12]. The most common forms of tree-based clustering methods in the literature [8, 13–15] begin with a minimum spanning tree and then successively delete edges according to various criteria. Our approach, however, offers greater flexibility than these commonly applied methods, since the clusters it produces include those that cannot be obtained by removing edges of a minimum spanning tree.

We start with any selected value W = Wo ≥ 0 and, after obtaining a collection of clusters C(W) for a given W, systematically modify W so that over successive iterations all possible cluster collections C(W) for W ≥ Wo are generated without duplication. The complete range of cluster collections results from choosing Wo = 0 (or Wo = 1 in the multiplicative version).

We introduce special techniques for accelerating the execution of our basic approach by exploiting its underlying properties, and then introduce a closely related clustering algorithm that replaces an "edge-based" focus with a complementary "node-based" focus. We unify these two classes of approaches by identifying a third class that marries their complementary features, and which provides additional variation by means of a weight that permits the contribution of these complementary approaches to be varied along a continuum. We conclude by demonstrating how the procedures for accelerating the first method can be expressed in a more general form to accelerate the execution of the combined procedure as well.

The ability to generate a family of clustering methods from each of the three basic clustering designs by varying a single parameter (and the weight employed by the third method) invites empirical research to determine parameter ranges that are effective for specific types of clustering applications, opening the possibility of producing clusters exhibiting features different from those customarily obtained.

**3. Algorithm to generate the cluster collections C(W)**

In overview, we index the edges of E in ascending cost order so that c(e(1)) ≤ c(e(2)) ≤ … ≤ c(e(|E|)), and identify the nodes of edge e(s) by writing e(s) = (p(s), q(s)). We start with each cluster N<sup>k</sup> consisting of just the node k, that is, each cluster is a degenerate single node tree given by N<sup>k</sup> = {k}, k ∈ K, for K = N = {1, …, n}. The associated set E<sup>k</sup> of edges in the tree corresponding to N<sup>k</sup> is empty (E<sup>k</sup> = ∅). As the algorithm progresses, the composition of the clusters will change and the index set K of clusters will change accordingly.

In addition, we keep a cost value denoted by MinCost(k) for each k ∈ K which identifies the cost of the minimum cost edge e ∈ E<sup>k</sup>. To begin, since no cluster yet contains an edge, we define MinCost(k) = Large, a large positive number, for all k ∈ K. (We will not have to examine the set E<sup>k</sup> to identify MinCost(k) = Min(c(e): e ∈ E<sup>k</sup>) because the structure of the algorithm will ensure that MinCost(k) equals the cost of the first edge added to E<sup>k</sup>. In general, while we describe the composition of E<sup>k</sup> and the manner in which it changes, the organization of the algorithm assures that it is unnecessary to keep track of E<sup>k</sup>, since the sets N<sup>k</sup>, for k ∈ K, will identify the elements in the clusters produced.)
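To make this setup concrete, the following is a minimal Python sketch of the initialization just described, assuming a simple edge-list representation of G(N, E); the function and variable names are ours, not part of the chapter.

```python
# Minimal sketch of the initialization described above (all names are ours).

def init_clusters(n, edges):
    """Initialize the structures for G(N, E) with N = {1, ..., n}.

    edges -- list of (p, q, cost) tuples, one per edge of E
    """
    # Index the edges e(1), ..., e(|E|) in ascending cost order.
    edges = sorted(edges, key=lambda e: e[2])

    Large = float("inf")             # stands in for the "large positive number"
    K = set(range(1, n + 1))         # index set of the current clusters
    N_k = {k: {k} for k in K}        # N^k = {k}: degenerate single node trees
    MinCost = {k: Large for k in K}  # no cluster contains an edge yet

    return edges, K, N_k, MinCost
```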

We also maintain a list L(i) for each i ∈ N that names the cluster that node i belongs to. Hence, initially, L(i) = i since i ∈ N<sup>i</sup> = {i} for all i ∈ N. The redundancy provided by this list enables updates to be performed efficiently. Subsequently, L(i) is modified as node i becomes the member of a new cluster N<sup>k</sup>. As this is done, the list K will come to have "holes" in it, i.e., will not consist of consecutive indexes. (At the end of the algorithm we can rename the cluster indexes, if desired, so that K = {1, 2, …, ko} where ko = |K|.)
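The optional renaming step admits a one-pass implementation. The sketch below is ours; it assumes the structures described above, with `N_k` the dictionary of clusters, `K` the (possibly non-consecutive) index set, and `L` mapping each node to its cluster index.

```python
# Hypothetical renaming pass, run once after the algorithm terminates,
# to compact the surviving cluster indexes so that K = {1, ..., ko}.

def rename_clusters(K, N_k, L):
    new_index = {k: j for j, k in enumerate(sorted(K), start=1)}  # old -> new
    K_new = set(new_index.values())                  # K = {1, 2, ..., ko}
    N_k_new = {new_index[k]: N_k[k] for k in K}      # relabel each cluster N^k
    L_new = {i: new_index[k] for i, k in L.items()}  # redirect each node's label
    return K_new, N_k_new, L_new
```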

Finally, during the process of generating the cluster collection C(W) for the current W value, we will identify a value Wnext so that the process may then be repeated for W := Wnext to generate a new collection of clusters. As previously noted, by starting with W = Wo = 0 (or W = Wo = 1 in the multiplicative version), and then successively identifying Wnext each time a cluster collection C(W) is generated, we can ultimately generate all possible collections C(W) without duplication. The process terminates when W becomes large enough that C(W) consists of a min cost spanning tree over each connected component of G. (A simple condition for identifying this termination point is given below.)
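To illustrate this outer iteration (and only the iteration; the body of the C(W) Algorithm is stated below), here is a hedged Python sketch of the parameter sweep. The callable `generate_C_W` is a placeholder of our own for one execution of the C(W) Algorithm, assumed to return the cluster collection, the value Wnext, and a termination flag.

```python
# Hypothetical driver for the parameter sweep described above.
# generate_C_W stands in for one run of the C(W) Algorithm; it is assumed
# to return (clusters, W_next, terminated), where terminated is True once
# C(W) is a min cost spanning tree on each connected component of G.

def sweep_all_collections(generate_C_W, edges, n, Wo=1.0):
    W = Wo  # Wo = 1 for the multiplicative version, Wo = 0 otherwise
    collections = []
    while True:
        clusters, W_next, terminated = generate_C_W(edges, n, W)
        collections.append((W, clusters))  # each C(W) is generated exactly once
        if terminated:
            break
        W = W_next  # repeat the process for W := Wnext
    return collections
```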

Building on these observations, we now state the full form of our algorithm.

**C(W) Algorithm (Multiplicative Version)**

*Inputs*: The graph G(N, E), cost vector c(e), e ∈ E, initial Wo value for W.

