Our use of a tree-based algorithm is motivated by the applications of such methods in genome analysis [6–8], image segmentation [9, 10], statistics [11] and microaggregation [12]. The most common tree-based clustering methods in the literature [8, 13–15] begin with a minimum spanning tree and then successively delete edges according to various criteria. Our approach, however, offers greater flexibility than these commonly applied methods because the clusters it produces include some that cannot be obtained by removing edges of a minimum spanning tree.

We introduce special techniques for accelerating the execution of our basic approach by exploiting its underlying properties, and then introduce a closely related clustering algorithm that replaces an "edge-based" focus with a complementary "node-based" focus. We unify these two classes of approaches by identifying a third class that marries their complementary features, providing additional variation by means of a weight that permits the contribution of the two approaches to be varied along a continuum. We conclude by demonstrating how the procedures for accelerating the first method can be expressed in a more general form to accelerate the execution of the combined procedure as well.

The ability to generate a family of clustering methods from each of the three basic clustering designs by varying a single parameter (and the weight employed by the third method) invites empirical research to determine parameter ranges that are effective for specific types of clustering applications, opening the possibility of producing clusters whose features differ from those customarily obtained.

**2. Cluster problem formulation**

The clustering problem in our treatment is formulated by reference to a graph G = (N, E), where N = {1, …, n} is a set of nodes (cluster elements) and E is a set of edges (pairwise connections between elements) given by E ⊂ N × N = {(p,q): p,q ∈ N}. The notation (p,q) is understood to represent an unordered pair (hence (p,q) = (q,p), equivalently represented by the set notation {p,q}). Each edge e = (p,q) ∈ E has an associated cost (or length) denoted by c(e) (= c(p,q)). It is not necessary to assume that G is complete or connected, nor do we require that the costs c(e) be nonnegative.

The goal is to partition N into sets (clusters) N<sup>k</sup>, k ∈ K = {1, …, ko}, where the value ko is automatically determined by the clustering process. We also identify an associated set of edges E<sup>k</sup> ⊂ {(p,q): p,q ∈ N<sup>k</sup>}, where the subgraph (N<sup>k</sup>, E<sup>k</sup>) of G constitutes a min cost spanning tree over the nodes of N<sup>k</sup>. In contrast to those tree-based clustering approaches that begin with a min cost spanning tree over all of G and selectively delete particular edges, our algorithm produces subgraphs (N<sup>k</sup>, E<sup>k</sup>), k ∈ K, that may not be possible to obtain by deleting edges from such a tree.

The class of clustering methods we describe is based on specifying the value of a parameter W, which uniquely determines the outcome of each clustering method within the class. W is expressed as an additive threshold for selecting edges (and hence nodes) to be added to a current construction (collection of subgraphs); we observe that W can equally be expressed as a multiplicative threshold when the costs are nonnegative, and the two approaches are equivalent in this instance.
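As a concrete illustration of this formulation, the following is a minimal Python sketch of the problem data (ours, not the chapter's; the node and cost values are invented). Edges are stored as unordered pairs so that (p,q) = (q,p):

```python
# Minimal sketch of the problem data for G = (N, E) with edge costs c(e).
# Values are invented for illustration; costs may be negative, and the
# graph need not be complete or connected.
n = 5
N = set(range(1, n + 1))       # nodes (cluster elements) 1, ..., n

c = {
    frozenset({1, 2}): 3.0,
    frozenset({2, 3}): -1.5,   # negative costs are permitted
    frozenset({4, 5}): 2.0,    # {4, 5} forms a separate component
}
E = set(c)                     # E is a subset of N x N, with (p,q) = (q,p)
```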

**3. Algorithm to generate the cluster collections C(W)**

In overview, we index the edges of E in ascending cost order so that c(e(1)) ≤ c(e(2)) ≤ … ≤ c(e(|E|)), and identify the nodes of edge e(s) by writing e(s) = (p(s), q(s)). We start with each cluster N<sup>k</sup> consisting of just the node k; that is, each cluster is a degenerate single-node tree given by

$$N^{k} = \{ k \}, \; k \in K \text{ for } K = N = \{ 1, \ldots, n \}$$

The associated set E<sup>k</sup> of edges in the tree corresponding to N<sup>k</sup> is empty (E<sup>k</sup> = ∅). As the algorithm progresses, the composition of the clusters will change and the index set K of clusters will change accordingly.
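In code form, this initialization might look as follows. This is a sketch under assumed names (N_k, E_k are our stand-ins for N<sup>k</sup>, E<sup>k</sup>), with the toy graph from the previous sketch repeated so the fragment stands alone:

```python
# Toy graph repeated from the earlier sketch.
N = {1, 2, 3, 4, 5}
c = {frozenset({1, 2}): 3.0, frozenset({2, 3}): -1.5, frozenset({4, 5}): 2.0}
E = set(c)

# Edges indexed in ascending cost order: c(e(1)) <= c(e(2)) <= ... <= c(e(|E|)).
edges = sorted(E, key=lambda e: c[e])

# Each cluster N^k starts as the degenerate single-node tree {k}, with E^k empty.
N_k = {k: {k} for k in N}      # N^k = {k}, k in K = N
E_k = {k: set() for k in N}    # E^k = empty set
K = set(N)                     # current index set of clusters
```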

In addition, we keep a cost value denoted by MinCost(k) for each k ∈ K, which identifies the cost of the minimum cost edge e ∈ E<sup>k</sup>. To begin, since no cluster yet contains an edge, we define MinCost(k) = Large, a large positive number, for all k ∈ K. (We will not have to examine the set E<sup>k</sup> to identify MinCost(k) = Min(c(e): e ∈ E<sup>k</sup>) because the structure of the algorithm will ensure that MinCost(k) equals the cost of the first edge added to E<sup>k</sup>. In general, while we describe the composition of E<sup>k</sup> and the manner in which it changes, the organization of the algorithm makes it unnecessary to keep track of E<sup>k</sup>, since the sets N<sup>k</sup>, for k ∈ K, identify the elements of the clusters produced.)
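A sketch of this bookkeeping, using math.inf as an assumed stand-in for "Large":

```python
import math

# MinCost(k) = Large for all k in K, since no cluster yet contains an edge.
# The algorithm later overwrites MinCost(k) with the cost of the first edge
# added to E^k, so the set E^k itself never needs to be scanned.
K = {1, 2, 3, 4, 5}
MinCost = {k: math.inf for k in K}
```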

We also maintain a list L(i) for each i ∈ N that names the cluster that node i belongs to. Hence, initially, L(i) = i since i ∈ N<sup>i</sup> = {i} for all i ∈ N. The redundancy provided by this list enables updates to be performed efficiently. Subsequently, L(i) is modified as node i becomes a member of a new cluster N<sup>k</sup>. As this is done, the index set K will come to have "holes" in it, i.e., it will not consist of consecutive indexes. (At the end of the algorithm we can renumber the cluster indexes, if desired, so that K = {1, 2, …, ko} where ko = |K|.)
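The following sketch shows the membership list L(i) and the kind of relabeling performed when one cluster absorbs another; merge_clusters is a hypothetical helper of ours, not code from the chapter:

```python
N = {1, 2, 3, 4, 5}
L = {i: i for i in N}          # initially L(i) = i, since i is in N^i = {i}
N_k = {k: {k} for k in N}
K = set(N)

def merge_clusters(k1, k2):
    """Absorb cluster N^k2 into N^k1; index k2 becomes a 'hole' in K."""
    for i in N_k[k2]:
        L[i] = k1              # node i now belongs to cluster k1
    N_k[k1] |= N_k.pop(k2)
    K.discard(k2)

merge_clusters(1, 2)           # now L[2] == 1 and K == {1, 3, 4, 5}
```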

Finally, during the process of generating the cluster collection C(W) for the current W value, we will identify a value Wnext so that the process may then be repeated for W := Wnext to generate a new collection of clusters. As previously noted, by starting with W = Wo = 0 (or W = Wo = 1 in the multiplicative version) and then successively identifying Wnext each time a cluster collection C(W) is generated, we can ultimately generate all possible collections C(W) without duplication. The process terminates when W becomes large enough that C(W) consists of a min cost spanning tree over each connected component of G. (A simple condition for identifying this termination point is identified below.)
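A skeleton of this outer sweep, under our assumptions: generate_cluster_collection is a hypothetical stand-in for the algorithm stated below, assumed to return the collection C(W), the next threshold Wnext, and a flag signaling the termination condition mentioned in the text.

```python
def sweep_all_collections(generate_cluster_collection, W0=0):
    """Generate every distinct collection C(W), starting from W = Wo = 0."""
    W = W0                     # use W0 = 1 for the multiplicative version
    collections = []
    while True:
        C_W, W_next, done = generate_cluster_collection(W)
        collections.append((W, C_W))
        if done:               # C(W) is a min cost spanning tree over
            break              # each connected component of G
        W = W_next             # repeat the process for W := Wnext
    return collections
```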

Building on these observations, we now state the full form of our algorithm.
