Suppression Distance Computation for Set Covers and Hierarchies


Download Suppression Distance Computation for Set Covers and Hierarchies


Preview text

Suppression Distance Computation for Set Covers and Hierarchies
François Queyroi, Sergey Kirgizov
To cite this version:
François Queyroi, Sergey Kirgizov. Suppression Distance Computation for Set Covers and Hierarchies. 2014. ￿hal-00996090v1￿

HAL Id: hal-00996090 https://hal.archives-ouvertes.fr/hal-00996090v1
Preprint submitted on 26 May 2014 (v1), last revised 21 Apr 2015 (v3)

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Suppression Distance Computation for Set Covers and Hierarchies
Fran¸cois Queyroi∗, Sergey Kirgizov∗
Sorbonne Universit´es, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005 CNRS, UMR 7606, LIP6, F-75005, Paris, France

Abstract
We discuss the computation of the suppression distance between two set covers. It is the minimum number of element of a set that have to be removed so the renaming covers are equal. We show that this problem is N P-Hard by reduction to minimum vertex cover. We further investigate the subclass of hierarchies and prove this problem to be polynomial in this case as it gives birth to a class of perfect graphs. We also propose an algorithm based on recursive maximum assignments.
Keywords: set cover, hierarchy, distance, graphs, vertex cover

1. Introduction
Decomposing a set into patterns of interest is a central problem in data analysis. Evaluating the distance between decompositions is an important task in this context as it allows to study the behaviour of clustering algorithms or study the evolution of a set of patterns over time. The situation where the detected patterns do not overlap is called partitions. Measures based on edit distance [4, 9] or on mutual information [8] have been used to assess the distance between partitions. The situation where groups may overlaps received some attention recently [3]: we call these decomposition covers. They are particularly useful for the detection of communities in complex networks [6]. A subclass of covers are hierarchies in which patterns can

∗Corresponding author Email addresses: [email protected] (Franc¸ois Queyroi), [email protected] (Sergey Kirgizov)
Preprint submitted to Information Processing Letters

May 1, 2014

be recursively decomposed into smaller patterns with similar properties. The problem of distance definition between hierarchies is also of interest [10]. We focus in this paper on the computation of the suppression distance defined for partitions in [4] and generalized here for general covers and hierarchies.

We assume we have a set S of elements of finite cardinality n. A cover of the set S is a finite collection of subsets C = (C1, C2, . . . , Ck) where each Ci is a non-empty subset of S. We call C(S′) the sub-collection of groups that contain S′ ⊆ S. We denote by C[S′] the induced sub-cover of C by S i.e. the cover of S′ obtained after the removal of every elements of S \ S′.

Definition 1. (Suppression Distance) Let C1 and C2 be two covers of S. The suppression distance ds is defined as

ds(C1, C2) = min{|S′|, C1[S \ S′] = C2[S \ S′]}

(1)

S′⊆S

A set S′ such that C1[S \ S′] = C2[S \ S′] is called a suppression set.

Theorem 1. The function ds is a metric.

Proof. The non-negativity, identity of indiscernibles and symmetry properties are straightfoward for ds. Moreover, this distance respects the triangular inequality. Consider three covers C1, C2 and C3. Let Sij be such set that Ci[S \ Sij] = Cj[S \ Sij]. Observe that C1[S \ (S12 ∪ S23)] = C3[S \ (S12 ∪ S23)], so we have:

|S13| ≤ |S12 ∪ S23| ≤ |S12| + |S23| ds(C1, C3) ≤ ds(C1, C2) + ds(C2, C3)

2. Computation of the Suppression Distance
The evaluation of the distance between covers relates to hypergraph matching used in pattern recognition [2]. One can easily reduce the computation of the suppression distance to the problem of maximum commom subhypergraph [1]. Indeed the couple (S, C) is an hypergraph. If S′ ⊂ S is the maximum common induced sub-hypergraph of (S, C1) and (S, C2) then obviously we have ds(C1, C2) = |S| − |S′|. However, finding such subset is
2

N P-hard. We show here that the suppression distance computation can also be reduced to the minimum vertex cover problem. The differences between both covers can be encoded into a simple graph called the cover graph.

Definition 2. (Cover Graph) Let C1 and C2 be two covers of a set S. We call G(C1, C2) = (S, E) the cover graph of (C1, C2) with the edges E = {(s1, s2) ∈ S2, |C1({s1, s2})| = |C2({s1, s2})|}. This graph can contain self-
loops.

Two elements (or a singleton) of S are connected iff they (it) do (does) not appear the same amount of time together (alone) in both covers. A similar transformation has been proposed by Gusfield [4].

Lemma 1. Given G = (S, E) the cover graph of (C1, C2) and S′ ⊆ S, the induced subgraph G[S′] is the cover graph of (C1[S′], C2[S′]).

Proof. First, the G(C1[S′], C2[S′]) has the same vertex set as G[S′]. Second,

for

every

pair

of

elements

s1, s2

in

S′,

we

have

|C1({s1, s2})|

=

|

C

′ 1

(

{s

1

,

s2

}

)

|

and |C2({s1, s2})| = |C2′ ({s1, s2})|. Therefore, every edge in G(C1[S′], C2[S′])

is also an edge in G[S′] since G[S′] is an induced subgraph. Also, every non-

edge in G(C1[S′], C2[S′]) is an non-edge in G[S′]. Therefore the edge set of

both graphs are equal.

Theorem 2. ds(C1, C2) is equal to the size of the minimum vertex cover of G(C1, C2).

Proof. We first show that E = ∅ iff C1 = C2. If both covers are equal then each pair of vertices appears the same amount of time in both decomposition and the cover graph contains no edge. Now if a cover graph contains no edge then there exist a bijection between the groups of C1 and C2. If such bijection does not exist, it means there is at least one group in either C1 or C2 that has no equal counterpart in the other cover. Therefore, there is a pair of elements in S2 that does not appear together the same amount of time in C1 and C2. Those two elements should be connected in G which contradicts our hypothesis. Finding a set S′ such that C1[S \S′] and C2[S \S′] are equal is therefore equivalent to finding a S′ such that G[S \ S′] contain no edges (Lemma 1). If the S′ is the smallest among all other subsets with this property then it is a minimum vertex cover of G.

3

Theorem 3. For any simple graph G = (S, E), there exist two covers (C1, C2) of S such that G = G(C1, C2)
Proof. For each edge (u, v) ∈ E(G), add the set {u, v} to C1 and create two groups {u} and {v} in C2. Using this construction, the elements u and v are found together in one group in C1 and are not found together in C2. Therefore the cover graph G(C1, C2) will contain the edge (u, v). Moreover, the edge set of G(C1, C2) will correspond to the groups of C1. We conclude G = G(C1, C2).
Since any graph can be the cover graph of two set covers (Theorem 3), computing ds(C1, C2) is N P-hard by reduction to the Minimum Vertex Cover problem (Theorem 2).
3. The Case of Hierarchies
Hierarchies are a particular class of covers that contain partitions as special cases. We show in this section that ds can be computed in polynomial time in this context and provide a recursive algorithm.
Definition 3. (Hierarchy) A hierarchy H is a cover of a set S such that if there exist two groups H1, H2 ∈ H such that H1 ∩H2 = ∅ then either H1 ⊆ H2 or H2 ⊆ H1.
Since the inclusion between the sets define a weak ordering of the groups. The relations between the sets of H can be represented in a tree ordered fashion, the roots being the groups that are not include in any other group. Let Ni(H) denote the i-th level of H i.e. the groups sitting at depth i in this tree. Notice it is still well defined if H contains repeated groups. The depth of a hierarchy is the maximal depth of its groups. For G = G(H1, H2), we have a function p : E(G) → N which is the first level at which the couple (a, b) ∈ E belongs to a group of H1 but not H2 (or the opposite). We denote by Gi the subgraph of G formed by the edges {e ∈ E, p(e) ≥ i}. Notice we have G1 = G. An example of hierarchies and their cover graph can be found in Figure 1.
We will assume that each element of S appears the same number of times in (H1, H2). If it not the case, the cover graph G(H1, H2) contains self-loops. A subset of vertices belonging to all minimum vertex covers is therefore straightforward to find. Indeed, vertices with self-loops are part of all minimum vertex cover.
4

H1

N1

abc

def

N2 ab

c

N3 ab

c

G(H1, H2) c f
da e
b

H2 abcdef ac

N1 b N2

a

c

b N3

Figure 1: Example of two hierarchies H1, H2 of a set S = {a, b, c, d, e, f } and their cover graph G(H1, H2) (p(e) = 1 for gray edges and p(e) = 2 for black edges).
3.1. Existence of a polynomial-time solution We show here that the cover graph of hierarchies is a perfect graph. As
said before, we assume that each element of S belongs to the same number of groups in both hierarchies. One important consequence of this is the fact that H1, H2 have the same depth d.
Lemma 2. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, the elements of S′ all belong to the same groups of depth lower than i in both H1 and H2.
Proof. Assume Gi[S′] is a connected subgraph but there exist at least two non-overlapping subsets A and B of S′ that belong to different groups of depth lower than i in both H1 and H2. It means that either A and B belong to the same groups in H1 but not in H2 or both do not belong to the same groups in both hierarchies. Notice that both cases are impossible otherwise all edges would have value a value lower than i (first case) or there would be no edges between the two groups (second case). This contradicts our hypothesis since we assume A ∪ B is a connected component of Gi.
Lemma 3. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, if there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i then ∀w ∈ S′, we have (u, w) ∈ E with p(u, w) = j.
Proof. According to Lemma 2, the elements in S′ all belongs to the same groups of depth lower than i in both hierarchies. If there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i, it means there is a group at depth j in H1 that contain (u ∪ S′) and a group at depth j in H2 that contains S′ but not u. Therefore u should be connected to every elements of S′ with an edge of value j.
5

One important consequence of Lemma 3 is that two connected components of G2 either have no edges between them or form a complete bipartite subgraph in G.
Theorem 4. Let H1, H1 be two hierarchies of finite depth of a set S, computing ds(H1, H1) can de done in polynomial time.
Proof. Let Ψd denote the set of hierarchies of a set S of depth d, we show that the cover graph G (obtained after the removal of vertices with self-loops) of any pair in Ψd is a perfect graph by induction over d.
For d = 1, the class Ψ1 corresponds to the class of partitions of S and the graph G is therefore perfect (Theorem 3.4 of [4]).
Assume it true for any d we show it is also true for d + 1. Let G˜ be the graph obtained by contraction of edges of G2 in G. This graph is perfect. Indeed, according to Lemma 2, elements within the same connected components of G2 belongs to the same group at depth 1. The graph G˜ is therefore the cover graph of the two flat partitions (P˜1, P˜2) obtained by fusioning each group into a single new element in the partitions N1(H1) and N1(H2).
According to Lemma 3, the graph G can be recovered from G˜ by expanding each vertex u ∈ V (G˜) by its corresponding connected component of G2 and connecting each element to the vertices previously adjacent to u. Now, since every connected subgraphs of G2 is perfect as the cover graph of two hierarchies of depth d, the graph G is also perfect according to the Theorem 1 of [7]. A minimum vertex cover of G can be found in polynomial time.
3.2. A solution based on recursive maximum assignment It has been shown that, if P1 and P2 are partitions, the distance ds(P1, P2)
can be computed by solving a maximum assignment problem based on the size of intersections between all pairs of groups in P1 and P2 [4]. The computation of the intersections takes O(n) and the assignment is computed in O((|P1| + |P2|)3). The set of elements to be removed are the elements in the sets that are not covered by the maximum assignment.
Notice two hierarchies are equal if all their levels are equal i.e. if the set of couple {(Ni(H1), Ni(H2))}1≤i≤d are all pairwise equal. However, finding a suppression set S′ using a greedy “level-by-level” approach (either top-down or bottom-up) may not lead to a optimal solution. Consider the example given in Figure 1 where ds(H1, H2) = 3, a top-down approach may fail since
6

either {a, b, c} or {d, e, f } can be chosen at level 1 to be part of S′. But choosing {d, e, f } would lead to a distance of 4. Alternatively, consider the sub-hierarchies induced by the set {a, b, c}, a bottom-top approach may also fail since either a or b can belongs to S′ at the last level. Choosing b would lead to a distance of 2 whereas ds(H1[{a, b, c}], H2[{a, b, c}]) = 1.

We provide here an algorithm for computing a suppression set for two hierarchies and therefore the distance between them. The idea is to recursively compute a suppression set for two sub-hierarchies whose element belongs to the same groups at the current level.

Algorithm 1: suppressionSet(H1, H2)

Input: H1, H2 two hierarchies of a set S Output: S′ ⊆ S a suppression set

1 if H1 = H2 = ∅ then

2 return ∅

3 else

4 S′ ← ∅

5 for C ∈ maxCommonGroups(N1(H1), N1(H2)) do

6

S′ ← S′ ∪ suppressionSet(H1[C] − C, H2[C] − C)

7 end

8 return S′ ∪ f latSuppressionSet(N1(H1[S \ S′]), N1(H2[S \ S′]))

9 end

The function maxCommonGroups(P1, P2) returns the maximal subsets of vertices that are together in both partitions P1 and P2. The function flatSuppressionSet(P1, P2) returns a minimum suppression set of elements S′ such that ds(P1, P2) = |S′|. This function can used the Hungarian algorithm [5]. The following result is used to show that the set returned by a recursion call is a subset of one optimal solution.
Lemma 4. Let G = (S, E) be the cover graph of two hierarchies of S, any minimum vertex cover of Gi is a subset of a minimum vertex cover of Gi−1.
Proof. Let C be a minimum vertex cover of Gi, S′ be a maximal connected component of Gi. The set C′ = S′ ∩ C is a minimum vertex cover of Gi−1[S′]. According to Lemma 3, the edge cut (S′, S′′) forms a complete bipartite graph where S′′ is the set of vertices in (S \ S′) connected to S′. Therefore,
7

the minimum vertex cover of Gi−1 should contain either all S′ or all S′′ ∪ C′ i.e. C′ is a subset of the cover in both cases. Since it is true for the minimum cover of every maximal connected components of Gi, the set C is a subset a minimum vertex cover of Gi−1.
Theorem 5. For two hierarchies H1, H2 of a set S, Algorithm 1 always terminates and returns a suppression set S′ such that ds(H1, H2) = |S′|.
Proof. Termination: First the algorithm will terminate since the hierarchies are of finite depth and the recursive call is used on two sub hierarchies of lower depth (the “root” group is removed in both hierarchies in line 6). Moreover, the condition in line 1 is always met at some point since we assume elements of S appears the same number of times in both hierarchies.
Correctness: Observe the function maxCommonGroups(P1, P2) returns sets that correspond to either independent or maximal connected components of G2. Let C be one of these sets, the sub-hierarchies H1[C]−C and H2[C]−C correspond to the sub-hierarchies induced by C minus the first level. Assume that at the end of the loop 5–7, the set S′ is the union of the elements to be removed so that those sub-hierarchies are equal. According to Lemma 4, the set S′ is a subset of a minimum suppression set between (H1, H2) i.e. a minimum vertex cover of G2. A possible solution is therefore the union of S′ and a suppression set of H1[S \ S′] and H2[S \ S′]. The latter can be found only looking at the first level of both hierarchies. We can show the assumption on S′ to be true by induction since the Algorithm will return a minimum suppression set if H1, H2 are flat partitions.
We shall discuss the complexity of Algorithm 1. The function maxCommonGroups can be implemented in O(n log(n)). Notice the worst case scenario is achieved when H1 = H2 since the results of each recursive call will be the empty set. Assuming the first Θ(d) levels correspond to one repeated group S and the last levels correspond to Θ(n) repeated groups, the time complexity is O(n3 + dn log n).
References
[1] Bunke, H., 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18 (8), 689–694.
8

[2] Bunke, H., Dickinson, P., Kraetzl, M., Neuhaus, M., Stettler, M., 2008. Matching of hypergraphs—algorithms, applications, and experiments. In: Applied Pattern Recognition. Springer, pp. 131–154.
[3] Goldberg, M. K., Hayvanovych, M., Magdon-Ismail, M., 2010. Measuring similarity between sets of overlapping clusters. In: Social Computing (SocialCom), 2010 IEEE Second International Conference on. IEEE, pp. 303–308.
[4] Gusfield, D., 2002. Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters 82 (3), 159– 164.
[5] Kuhn, H. W., 1955. The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), 83–97.
[6] Lancichinetti, A., Fortunato, S., Kert´esz, J., 2009. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11 (3), 033015.
[7] Lov´asz, L., 1972. Normal hypergraphs and the perfect graph conjecture. Discrete Mathematics 2 (3), 253–267.
[8] Meila˘, M., 2003. Comparing clusterings by the variation of information. In: Learning theory and kernel machines. Springer, pp. 173–187.
[9] Porumbel, D. C., Hao, J. K., Kuntz, P., 2011. An efficient algorithm for computing the distance between close partitions. Discrete Applied Mathematics 159 (1), 53–59.
[10] Queyroi, F., Delest, M., F´edou, J.-M., Melan¸con, G., 2014. Assessing the quality of multilevel graph clustering. Data Mining and Knowledge Discovery, 1–22.
9

Preparing to load PDF file. please wait...

0 of 0
100%
Suppression Distance Computation for Set Covers and Hierarchies