A Flexible ILP Formulation for Hierarchical Clustering

Download A Flexible ILP Formulation for Hierarchical Clustering

Preview text

A Flexible ILP Formulation for Hierarchical Clustering
Sean Gilpina, Ian Davidsona
aComputer Science Department University of California Davis
One Shields Road Davis, California, 95616, United States of America

Hierarchical clustering is a popular approach in a number of fields with many well known algorithms. However, all existing work to our knowledge implements a greedy heuristic algorithm with no explicit objective function. In this work we formalize hierarchical clustering as an integer linear programming (ILP) problem with a natural objective function and the dendrogram properties enforced as linear constraints. Our experimental work shows that even for small data sets finding the global optimum produces more accurate results. Formalizing hierarchical clustering as an ILP with constraints has several advantages beyond finding the global optima. Relaxing the dendrogram constraints such as transitivity can produce novel problem variations such as finding hierarchies with overlapping clusterings. It is also possible to add constraints to encode guidance such as must–link, cannot–link, must–link–before etc. Finally, though exact solvers exists for ILP we show that a simple randomized algorithm and a linear programming (LP) relaxation can be used to provide approximate solutions faster. Keywords: hierarchical clustering, constraints, integer linear programming

1. Introduction
A recent survey [1] comparing non-hierarchical and hierarchical clustering algorithms showed that in published scientific articles, hierarchical algorithms are used far
Email addresses: [email protected] (Sean Gilpin), [email protected] (Ian Davidson)

Preprint submitted to Artificial Intelligence

November 13, 2014

more than non-hierarchical clustering. However, most research and papers focus on non-hierarchical (partitional) clustering. Applications of hierarchical clustering typically can be divided into those that build large trees so that, for instance, a user can navigate a large collection of documents, and those that build trees to represent a scientific process, such as phylogenetic trees (evolutionary trees). We can further differentiate these works by noting that the data collected for the first type of application is easy to collect and hence voluminous, whilst the later application typically takes much time to collect and are typically small.
Our own earlier work [2] showed how to scale hierarchical clustering using hashing to huge data sets but in this paper the focus is the latter category of applications involving a small number of instances taking a long time to collect and which must be thoroughly analyzed. This means that spending hours for an algorithm to run is not uncalled for since a precise answer is worth the wait. Colloquially, going back to our examples, no one will complain if a dendrogram places documents in a good but non-optimal ordering but will if species are shown to evolve in the wrong order.
Hierarchical clustering algorithms remain relatively under-studied with most algorithms being relatively straight-forward greedy algorithms implemented in procedural language. For example, in the above mentioned survey two thirds of the implementations mentioned were simple agglomerative algorithms (see Figure 1) that start with each instance in a cluster by itself and then the closest two clusters are merged at each and every level. Even more advanced methods published in the database literature such as CLARANS and DBSCAN [3] still use this base approach but have more complex distance measures or methods to build the tree iteratively. Whereas non-hierarchical clustering algorithms have moved to elegant linear algebra formulations such as spectral clustering [4] hierarchical clustering has not.
Contributions. In this work we provide the first formalization of agglomerative clustering as an ILP problem (for a general introduction to the topic see [5] and for an example of an application of ILP to the non-hierarchical clustering data mining problem see [6]). Formulating the problem as an ILP allows the use of high quality solvers (free and commercial) such as CPLEX and Gurobi (used in all our experiments) to find solutions and automatically take advantage of multi-core architectures. Formulating

the problem as an ILP has a number of additional benefits that we now briefly discuss and provide details of later.
• Explicit Global Objective Function Optimizing. As mentioned most existing work greedily determines the best join at each and every level of the hierarchy. At no time is it possible to reset or revisit an earlier join. This is adequate when a “near enough” dendrogram is required, such as building a tree to organize song lists. Finding the global optima is more important when the data represents a physical phenomenon. This is discussed in the Section Hierarchical Clustering as an ILP, and we show this produces better quantitative results for scientific data sets such as language evolution in Table 3 and for hierarchical organization of fMRI scans in Table 4 .
• Novel Problem Variations with Relaxing Constraints. A side benefit of formalizing hierarchy learning is that the properties of a legal dendrogram are explicitly modeled as constraints to the optimization. We will show how novel problem variations can arise if some constraints are relaxed. In particular we show that relaxing the transitivity property allows for overlapping hierarchical clustering, that is, an instance can appear multiple times at the same level in the hierarchy and is akin to overlapping clustering. To our knowledge the problem of building dendrograms when an object appears multiple times is novel. This topic is explored in the Section Relaxed Problem Settings. In Figure 8 we show empirically that our method is capable of discovering overlapping hierarchies.
• Novel Forms of Guidance By Adding Constraints. The area of constrained clustering [7] has typically added constraints to procedural implementations. Here we show how we can add a number of typically used data mining constraints must–link, cannot–link, and must–link–before as linear constraints. This topic is explored in the Section Adding Guidance and in Figure 9 we show this can obtain even better empirical results.
• Approximation Schemes Finding the global optima to an intractable problem by definition must be time consuming. However, a large literature exists on general

methods to create approximate methods for ILPs. We provide two initial steps in that direction by exploiting a simple randomized algorithm that can create a factor two approximation scheme. We also explore using LP relaxations and randomized rounding. Figure 10 shows the empirical results of how well our LP relaxation and randomized rounding scheme compares to the optimal solution.
We organize the paper as follows. In Section 2 we describe our ILP formulation of hierarchical clustering. In Section 3 we describe an interesting variant of this formulation that allows us to learn overlapping hierarchical clustering by relaxing some of the constraints. Adding guidance that can encode domain knowledge or problem constraints in the form of additional ILP constraints is explored in Section 4. Section 5 describes two approximation algorithms to our overlapping hierarchical clustering algorithms. Finally in Section 6 we describe the questions we attempted to answer through empirical analysis and described the methods we devised to solve them as well as our results.
2. Hierarchical Clustering as an ILP
In this section, we discuss how to formulate hierarchical clustering as an ILP problem. We begin with a brief introduction of hierarchical clustering that can be skipped by the informed reader.
2.1. A Brief Introduction to Agglomerative Hierarchical Clustering Agglomerative hierarchical clustering algorithms take as input a distance function
and create a dendrogram, which is a tree structure which can be interpreted as a kblock set partition for each value of k between 1 and n, where n is the number of data points to cluster. These algorithms not only allow a user to choose a particular clustering granularity, but in many domains [8, 9, 10] clusters naturally form a hierarchy; that is, clusters are part of other clusters such as in the case of phylogenetic (evolutionary) trees. The popular agglomerative algorithms are easy to implement as they just begin with each point in its own cluster and progressively join two closest clusters to

Standard agglomerative clustering Input:
• Distance matrix for n points D ∈ n × n. • Cluster distance function d(ci, cj), e.g. single or complete linkage. Output: Block partitioning Dendrogramk, for each k, 1 ≤ k ≤ n. 1. ci = {xi}, 1 ≤ i ≤ n. Dendrogramn = {c1, c2, . . . , cn}. 2. for k = n − 1 down to 1 do
(a) /* Find a closest pair of clusters. */ Let (a, b) = argmin(i,j){d(ci, cj) : 1 ≤ i < j ≤ k + 1}.
(b) Obtain Dendrogramk from Dendrogramk+1 by merging cb into ca and then deleting cb.
Figure 1: Standard agglomerative clustering. Typical variations differ by the linkage/distance function used d(i, j) whose details can drastically change the algorithm’s behavior and efficient implementation. For example, the single linkage distance between two clusters is defined as the smallest distance between all pairs of points from the two different clusters.

reduce the number of clusters by 1 until k = 1. The basic agglomerative hierarchical clustering, which our method is compared to in this paper, is shown in Figure 1.
This algorithm is very popular and was first postulated in the 1950’s and 60’s in the numerical taxonomy literature [11]. Though useful they are unlike most modern machine learning methods since they were not derived to optimize an objective function. In comparison consider non-hierarchical clustering methods such as spectral clustering, k-means and mixture models. Each has an explicit object function: graph min-cut, vector quantization error, maximum likelihood respectively, and their algorithms are derived to optimize this function. Our purpose in this paper is to move hierarchical clustering forward by explicitly deriving an objective function and optimizing it. We have chosen an ILP formulation since our focus is on exact optimization.
2.2. Solving hierarchical clustering Using ILP allows us to formally define an objective function and then find globally
optimal solutions to hierarchical clustering problems. The two challenges to an ILP formulation are: i) ensuring that the resultant solution is a legal dendrogram; and ii) encoding a useful objective function. In the former, though we must model the dendrogram as a collection of variables, those variables are clearly dependent on each other and we must formalize this relationship. In the later, we must use these same variables to define a meaningful objective function.
2.3. Enforcing a Legal Dendrogram In this section, we describe how a function (referred to as merge or as M within the
ILP formulation) , representable using O(n2) integer variables (where n is the number of instances), can represent any dendrogram. The variables represent what level a pair of instances are first joined at and the constraints described in this section enforce that these variables obey this intended meaning. The merge function is described in Definition 1 as follows. Definition 1. merge function
merge : (Instances × Instances) → Levels (a, b) → first level a, b are in same cluster

In this work we will learn this function. For a particular instance pair a, b the intended meaning of merge(a, b) = is that instances a, b are in the same cluster at level (and higher) of the corresponding dendrogram, but not at level − 1 and lower. Therefore the domain of the merge function is all pairs of instances and the range is the integers between zero (tree bottom) and the maximum hierarchy level L (tree top).
The fact that any dendrogram from standard hierarchical clustering algorithms can be represented using this scheme is clear, but it is not the case that any such merge function represents a legal dendrogram. Consider a simple example with three instances a, b, c and with merge(a, b) = 1, merge(a, c) = 2 and merge(b, c) = 3. Such a merge function does not represent a legal dendrogram because it does not obey transitivity. To ensure we learn a legal dendrogram we must encode the following partition properties: reflexivity, symmetry, and transitivity (Definitions 2,3,4). Later we shall see how to enforce these requirements as linear inequalities in an ILP.
Definition 2. Reflexivity An instance is always in the same cluster as itself.
∀a [merge(a, a) = 0]
Definition 3. Symmetry If instance a is in the same cluster as instance b then instance b is also in the same cluster as instance a.
∀a, b [merge(a, b) = merge(b, a)]
Definition 4. Hierarchy and Transitivity If instance a is in the same cluster as instance b at level q and b is in the same cluster as c at level r, then a must be in the same cluster as c at level q or r (which-ever is largest).
∀a, b, c [max (merge(a, b), merge(b, c)) ≥ merge(a, c)]
2.4. Hierarchical Clustering Objective The objective for agglomerative hierarchical clustering is traditionally specified as
a greedy process where clusters that are closest together are merged at each level as shown in Figure 1. These objectives are called linkage functions (e.g. single [12], complete [13], UPGMA [14],WPGMA [15]) and effectively define a new distance function

Variable M(a, b)/merge(a, b) Oabc wabc L

Description The values M(a, b) can be interpreted as the first level that the two points are merged. A binary variable set to 1 (TRUE) if points a, b are merged before a, c. Specifies the reward/penalty of joining a, b over a, c. In our work set as: D(a, c) − D(a, b) so can be either positive (reward) or negative (penalty). The maximum number of levels in the hierarchy. Unlike standard agglomerative clustering where the number of levels equals the number of points, our method allows the number of levels to be specified, and to be explicitly modeled as part of the objective function. Each Z variable captures the relative ordering of of two pairs of points’ hierarchical distances (as measured in M).

Table 1: Description of variables used in ILP formulation (See Figures 3 and 4).

(d(i, j) in Figure 1) to decide which clusters are closest and should therefore be merged next. The intuition behind this process can be interpreted to mean that points that are close together should be merged together before points that are far apart. Our work formalizes this intuition into an ILP.
To formalize that intuition we created an objective function that favors hierarchies where closer points are joined before further-away points. For example if D(a, c) (distance between a and c ) is larger than D(a, b) then the points a and c should be merged after the points a and b are merged. This would require the merge function learnt to have merge(a, b) < merge(a, c). Though learning the merge function directly allows us to recover the dendrogram easily, it is difficult to define an objective around it. Hence, we introduce the binary variable Oabc which is set to TRUE (value one) if the instances a and b are merged before a and c are merged. In our formulation we must


constrain O and merge so they are consistent with each other. We need both sets of variables since O is useful for specifying the objective function but difficult to specify the properties of a hierarchy on, and merge is challenging to specify an objective function on but useful for specifying the properties of a hierarchy. To complete the objective function we introduce wabc a non-binary variable which indicates how close a and b are compared to a and c. In our experiments we set wabc = D(a, c) − D(a, b) which means it is set to 0 if a and b are equal distance to a and c and a positive (negative) number if they are closer (further). Figure 2 gives an example of how our objective works and Equations 1, 2, and 3 define it formally.

f (M) =



a,b,c∈I nstances

where: 

 1 : M(a, b) < M(a, c)

Oabc =


 0 : otherwise

wabc = D(a, c) − D(a, b)


It should be noted that in all our experiments O is optimized over and w is a constant that is derived from the distance function D. However, in practice, w could be provided by a human expert either partially or completely. In the former case we would optimize over the not specified values of w which can be considered as a semi-supervised setting in that for some triplets you don’t know their distance relationships but learn them. Using these variables allows an objective function shown formally in Equation 1 which captures the intuition of agglomerative hierarchical clustering (join closer points together before further points). However, because we have formulated it and allowed it to be solved as an ILP it will find the global optimum rather than converge to a local optimum. This is so since the ILP does not gradually build up the solution as greedy heuristic methods do.

2.5. ILP Transformation Now that we have presented a global objective function for hierarchical clustering
we will address how to find globally optimum legal hierarchies by translating the prob-


lem into an integer linear program (ILP). Figure 3 shows at a high level the optimization problem we are solving.
Figure 4 shows the ILP equivalent of the problem in Figure 3. To ensure that M is a legal merge function we need to ensure the resultant hierarchy is legal. This requires encoding the constraints to enforce the previously mentioned properties of reflexivity, symmetry, and transitivity properties (Definitions 2 - 4). Recall that M can be represented as square matrix of n × n variables indicating at what level a pair of points are joined. Then reflexivity and symmetry are both easily enforced by removing redundant variables from the matrix, in particular removing all diagonal variables for reflexivity, and all lower triangle variables for symmetry. Transitivity can be turned into a set of linear constraints by noticing that the previously mentioned inequality max(M(a, b), M(b, c)) ≥ M(a, c) is logically equivalent to:

(M(a, b) ≥ M(a, c)) ∨ (M(b, c) ≥ M(a, c))


In the final ILP (Figure 4) there is a variable Z introduced for each clause/inequality from Equation 4 (i.e. Zab≥ac and Zbc≥ac) and there are three constraints added to enforce that the disjunction of the inequalities is satisfied. Constraints are also added to ensure that the O variables and the variables in M are consistent.

3. Relaxed Problem Settings - Removing Constraints
A benefit of formalizing hierarchy building is that individual parts of the formulization can be relaxed. Here we explore two aspects of relaxation. Firstly, we can relax the formal definitions of a legal hierarchy to obtain novel problem settings. In particular we consider the effect of relaxing the transitivity constraint of our ILP formulation of hierarchical clustering. Relaxing transitivity allows a form of hierarchical clustering where instances can appear in multiple clusters at the same level (i.e. overlapping clustering). Secondly, we relax the integer constraints so that we can create approximation algorithms by relaxing the problem into an LP and using randomized rounding schemes (see Section 5).


Preparing to load PDF file. please wait...

0 of 0
A Flexible ILP Formulation for Hierarchical Clustering