Suppression Distance Computation for Set Covers and Hierarchies
Download Suppression Distance Computation for Set Covers and Hierarchies
Preview text
Suppression Distance Computation for Set Covers and Hierarchies
François Queyroi, Sergey Kirgizov
To cite this version:
François Queyroi, Sergey Kirgizov. Suppression Distance Computation for Set Covers and Hierarchies. 2014. hal00996090v1
HAL Id: hal00996090 https://hal.archivesouvertes.fr/hal00996090v1
Preprint submitted on 26 May 2014 (v1), last revised 21 Apr 2015 (v3)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Suppression Distance Computation for Set Covers and Hierarchies
Fran¸cois Queyroi∗, Sergey Kirgizov∗
Sorbonne Universit´es, UPMC Univ Paris 06, UMR 7606, LIP6, F75005 CNRS, UMR 7606, LIP6, F75005, Paris, France
Abstract
We discuss the computation of the suppression distance between two set covers. It is the minimum number of element of a set that have to be removed so the renaming covers are equal. We show that this problem is N PHard by reduction to minimum vertex cover. We further investigate the subclass of hierarchies and prove this problem to be polynomial in this case as it gives birth to a class of perfect graphs. We also propose an algorithm based on recursive maximum assignments.
Keywords: set cover, hierarchy, distance, graphs, vertex cover
1. Introduction
Decomposing a set into patterns of interest is a central problem in data analysis. Evaluating the distance between decompositions is an important task in this context as it allows to study the behaviour of clustering algorithms or study the evolution of a set of patterns over time. The situation where the detected patterns do not overlap is called partitions. Measures based on edit distance [4, 9] or on mutual information [8] have been used to assess the distance between partitions. The situation where groups may overlaps received some attention recently [3]: we call these decomposition covers. They are particularly useful for the detection of communities in complex networks [6]. A subclass of covers are hierarchies in which patterns can
∗Corresponding author Email addresses: [email protected] (Franc¸ois Queyroi), [email protected] (Sergey Kirgizov)
Preprint submitted to Information Processing Letters
May 1, 2014
be recursively decomposed into smaller patterns with similar properties. The problem of distance deﬁnition between hierarchies is also of interest [10]. We focus in this paper on the computation of the suppression distance deﬁned for partitions in [4] and generalized here for general covers and hierarchies.
We assume we have a set S of elements of ﬁnite cardinality n. A cover of the set S is a ﬁnite collection of subsets C = (C1, C2, . . . , Ck) where each Ci is a nonempty subset of S. We call C(S′) the subcollection of groups that contain S′ ⊆ S. We denote by C[S′] the induced subcover of C by S i.e. the cover of S′ obtained after the removal of every elements of S \ S′.
Deﬁnition 1. (Suppression Distance) Let C1 and C2 be two covers of S. The suppression distance ds is deﬁned as
ds(C1, C2) = min{S′, C1[S \ S′] = C2[S \ S′]}
(1)
S′⊆S
A set S′ such that C1[S \ S′] = C2[S \ S′] is called a suppression set.
Theorem 1. The function ds is a metric.
Proof. The nonnegativity, identity of indiscernibles and symmetry properties are straightfoward for ds. Moreover, this distance respects the triangular inequality. Consider three covers C1, C2 and C3. Let Sij be such set that Ci[S \ Sij] = Cj[S \ Sij]. Observe that C1[S \ (S12 ∪ S23)] = C3[S \ (S12 ∪ S23)], so we have:
S13 ≤ S12 ∪ S23 ≤ S12 + S23 ds(C1, C3) ≤ ds(C1, C2) + ds(C2, C3)
2. Computation of the Suppression Distance
The evaluation of the distance between covers relates to hypergraph matching used in pattern recognition [2]. One can easily reduce the computation of the suppression distance to the problem of maximum commom subhypergraph [1]. Indeed the couple (S, C) is an hypergraph. If S′ ⊂ S is the maximum common induced subhypergraph of (S, C1) and (S, C2) then obviously we have ds(C1, C2) = S − S′. However, ﬁnding such subset is
2
N Phard. We show here that the suppression distance computation can also be reduced to the minimum vertex cover problem. The diﬀerences between both covers can be encoded into a simple graph called the cover graph.
Deﬁnition 2. (Cover Graph) Let C1 and C2 be two covers of a set S. We call G(C1, C2) = (S, E) the cover graph of (C1, C2) with the edges E = {(s1, s2) ∈ S2, C1({s1, s2}) = C2({s1, s2})}. This graph can contain self
loops.
Two elements (or a singleton) of S are connected iﬀ they (it) do (does) not appear the same amount of time together (alone) in both covers. A similar transformation has been proposed by Gusﬁeld [4].
Lemma 1. Given G = (S, E) the cover graph of (C1, C2) and S′ ⊆ S, the induced subgraph G[S′] is the cover graph of (C1[S′], C2[S′]).
Proof. First, the G(C1[S′], C2[S′]) has the same vertex set as G[S′]. Second,
for
every
pair
of
elements
s1, s2
in
S′,
we
have
C1({s1, s2})
=

C
′ 1
(
{s
1
,
s2
}
)

and C2({s1, s2}) = C2′ ({s1, s2}). Therefore, every edge in G(C1[S′], C2[S′])
is also an edge in G[S′] since G[S′] is an induced subgraph. Also, every non
edge in G(C1[S′], C2[S′]) is an nonedge in G[S′]. Therefore the edge set of
both graphs are equal.
Theorem 2. ds(C1, C2) is equal to the size of the minimum vertex cover of G(C1, C2).
Proof. We ﬁrst show that E = ∅ iﬀ C1 = C2. If both covers are equal then each pair of vertices appears the same amount of time in both decomposition and the cover graph contains no edge. Now if a cover graph contains no edge then there exist a bijection between the groups of C1 and C2. If such bijection does not exist, it means there is at least one group in either C1 or C2 that has no equal counterpart in the other cover. Therefore, there is a pair of elements in S2 that does not appear together the same amount of time in C1 and C2. Those two elements should be connected in G which contradicts our hypothesis. Finding a set S′ such that C1[S \S′] and C2[S \S′] are equal is therefore equivalent to ﬁnding a S′ such that G[S \ S′] contain no edges (Lemma 1). If the S′ is the smallest among all other subsets with this property then it is a minimum vertex cover of G.
3
Theorem 3. For any simple graph G = (S, E), there exist two covers (C1, C2) of S such that G = G(C1, C2)
Proof. For each edge (u, v) ∈ E(G), add the set {u, v} to C1 and create two groups {u} and {v} in C2. Using this construction, the elements u and v are found together in one group in C1 and are not found together in C2. Therefore the cover graph G(C1, C2) will contain the edge (u, v). Moreover, the edge set of G(C1, C2) will correspond to the groups of C1. We conclude G = G(C1, C2).
Since any graph can be the cover graph of two set covers (Theorem 3), computing ds(C1, C2) is N Phard by reduction to the Minimum Vertex Cover problem (Theorem 2).
3. The Case of Hierarchies
Hierarchies are a particular class of covers that contain partitions as special cases. We show in this section that ds can be computed in polynomial time in this context and provide a recursive algorithm.
Deﬁnition 3. (Hierarchy) A hierarchy H is a cover of a set S such that if there exist two groups H1, H2 ∈ H such that H1 ∩H2 = ∅ then either H1 ⊆ H2 or H2 ⊆ H1.
Since the inclusion between the sets deﬁne a weak ordering of the groups. The relations between the sets of H can be represented in a tree ordered fashion, the roots being the groups that are not include in any other group. Let Ni(H) denote the ith level of H i.e. the groups sitting at depth i in this tree. Notice it is still well deﬁned if H contains repeated groups. The depth of a hierarchy is the maximal depth of its groups. For G = G(H1, H2), we have a function p : E(G) → N which is the ﬁrst level at which the couple (a, b) ∈ E belongs to a group of H1 but not H2 (or the opposite). We denote by Gi the subgraph of G formed by the edges {e ∈ E, p(e) ≥ i}. Notice we have G1 = G. An example of hierarchies and their cover graph can be found in Figure 1.
We will assume that each element of S appears the same number of times in (H1, H2). If it not the case, the cover graph G(H1, H2) contains selfloops. A subset of vertices belonging to all minimum vertex covers is therefore straightforward to ﬁnd. Indeed, vertices with selfloops are part of all minimum vertex cover.
4
H1
N1
abc
def
N2 ab
c
N3 ab
c
G(H1, H2) c f
da e
b
H2 abcdef ac
N1 b N2
a
c
b N3
Figure 1: Example of two hierarchies H1, H2 of a set S = {a, b, c, d, e, f } and their cover graph G(H1, H2) (p(e) = 1 for gray edges and p(e) = 2 for black edges).
3.1. Existence of a polynomialtime solution We show here that the cover graph of hierarchies is a perfect graph. As
said before, we assume that each element of S belongs to the same number of groups in both hierarchies. One important consequence of this is the fact that H1, H2 have the same depth d.
Lemma 2. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, the elements of S′ all belong to the same groups of depth lower than i in both H1 and H2.
Proof. Assume Gi[S′] is a connected subgraph but there exist at least two nonoverlapping subsets A and B of S′ that belong to diﬀerent groups of depth lower than i in both H1 and H2. It means that either A and B belong to the same groups in H1 but not in H2 or both do not belong to the same groups in both hierarchies. Notice that both cases are impossible otherwise all edges would have value a value lower than i (ﬁrst case) or there would be no edges between the two groups (second case). This contradicts our hypothesis since we assume A ∪ B is a connected component of Gi.
Lemma 3. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, if there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i then ∀w ∈ S′, we have (u, w) ∈ E with p(u, w) = j.
Proof. According to Lemma 2, the elements in S′ all belongs to the same groups of depth lower than i in both hierarchies. If there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i, it means there is a group at depth j in H1 that contain (u ∪ S′) and a group at depth j in H2 that contains S′ but not u. Therefore u should be connected to every elements of S′ with an edge of value j.
5
One important consequence of Lemma 3 is that two connected components of G2 either have no edges between them or form a complete bipartite subgraph in G.
Theorem 4. Let H1, H1 be two hierarchies of ﬁnite depth of a set S, computing ds(H1, H1) can de done in polynomial time.
Proof. Let Ψd denote the set of hierarchies of a set S of depth d, we show that the cover graph G (obtained after the removal of vertices with selfloops) of any pair in Ψd is a perfect graph by induction over d.
For d = 1, the class Ψ1 corresponds to the class of partitions of S and the graph G is therefore perfect (Theorem 3.4 of [4]).
Assume it true for any d we show it is also true for d + 1. Let G˜ be the graph obtained by contraction of edges of G2 in G. This graph is perfect. Indeed, according to Lemma 2, elements within the same connected components of G2 belongs to the same group at depth 1. The graph G˜ is therefore the cover graph of the two ﬂat partitions (P˜1, P˜2) obtained by fusioning each group into a single new element in the partitions N1(H1) and N1(H2).
According to Lemma 3, the graph G can be recovered from G˜ by expanding each vertex u ∈ V (G˜) by its corresponding connected component of G2 and connecting each element to the vertices previously adjacent to u. Now, since every connected subgraphs of G2 is perfect as the cover graph of two hierarchies of depth d, the graph G is also perfect according to the Theorem 1 of [7]. A minimum vertex cover of G can be found in polynomial time.
3.2. A solution based on recursive maximum assignment It has been shown that, if P1 and P2 are partitions, the distance ds(P1, P2)
can be computed by solving a maximum assignment problem based on the size of intersections between all pairs of groups in P1 and P2 [4]. The computation of the intersections takes O(n) and the assignment is computed in O((P1 + P2)3). The set of elements to be removed are the elements in the sets that are not covered by the maximum assignment.
Notice two hierarchies are equal if all their levels are equal i.e. if the set of couple {(Ni(H1), Ni(H2))}1≤i≤d are all pairwise equal. However, ﬁnding a suppression set S′ using a greedy “levelbylevel” approach (either topdown or bottomup) may not lead to a optimal solution. Consider the example given in Figure 1 where ds(H1, H2) = 3, a topdown approach may fail since
6
either {a, b, c} or {d, e, f } can be chosen at level 1 to be part of S′. But choosing {d, e, f } would lead to a distance of 4. Alternatively, consider the subhierarchies induced by the set {a, b, c}, a bottomtop approach may also fail since either a or b can belongs to S′ at the last level. Choosing b would lead to a distance of 2 whereas ds(H1[{a, b, c}], H2[{a, b, c}]) = 1.
We provide here an algorithm for computing a suppression set for two hierarchies and therefore the distance between them. The idea is to recursively compute a suppression set for two subhierarchies whose element belongs to the same groups at the current level.
Algorithm 1: suppressionSet(H1, H2)
Input: H1, H2 two hierarchies of a set S Output: S′ ⊆ S a suppression set
1 if H1 = H2 = ∅ then
2 return ∅
3 else
4 S′ ← ∅
5 for C ∈ maxCommonGroups(N1(H1), N1(H2)) do
6
S′ ← S′ ∪ suppressionSet(H1[C] − C, H2[C] − C)
7 end
8 return S′ ∪ f latSuppressionSet(N1(H1[S \ S′]), N1(H2[S \ S′]))
9 end
The function maxCommonGroups(P1, P2) returns the maximal subsets of vertices that are together in both partitions P1 and P2. The function ﬂatSuppressionSet(P1, P2) returns a minimum suppression set of elements S′ such that ds(P1, P2) = S′. This function can used the Hungarian algorithm [5]. The following result is used to show that the set returned by a recursion call is a subset of one optimal solution.
Lemma 4. Let G = (S, E) be the cover graph of two hierarchies of S, any minimum vertex cover of Gi is a subset of a minimum vertex cover of Gi−1.
Proof. Let C be a minimum vertex cover of Gi, S′ be a maximal connected component of Gi. The set C′ = S′ ∩ C is a minimum vertex cover of Gi−1[S′]. According to Lemma 3, the edge cut (S′, S′′) forms a complete bipartite graph where S′′ is the set of vertices in (S \ S′) connected to S′. Therefore,
7
the minimum vertex cover of Gi−1 should contain either all S′ or all S′′ ∪ C′ i.e. C′ is a subset of the cover in both cases. Since it is true for the minimum cover of every maximal connected components of Gi, the set C is a subset a minimum vertex cover of Gi−1.
Theorem 5. For two hierarchies H1, H2 of a set S, Algorithm 1 always terminates and returns a suppression set S′ such that ds(H1, H2) = S′.
Proof. Termination: First the algorithm will terminate since the hierarchies are of ﬁnite depth and the recursive call is used on two sub hierarchies of lower depth (the “root” group is removed in both hierarchies in line 6). Moreover, the condition in line 1 is always met at some point since we assume elements of S appears the same number of times in both hierarchies.
Correctness: Observe the function maxCommonGroups(P1, P2) returns sets that correspond to either independent or maximal connected components of G2. Let C be one of these sets, the subhierarchies H1[C]−C and H2[C]−C correspond to the subhierarchies induced by C minus the ﬁrst level. Assume that at the end of the loop 5–7, the set S′ is the union of the elements to be removed so that those subhierarchies are equal. According to Lemma 4, the set S′ is a subset of a minimum suppression set between (H1, H2) i.e. a minimum vertex cover of G2. A possible solution is therefore the union of S′ and a suppression set of H1[S \ S′] and H2[S \ S′]. The latter can be found only looking at the ﬁrst level of both hierarchies. We can show the assumption on S′ to be true by induction since the Algorithm will return a minimum suppression set if H1, H2 are ﬂat partitions.
We shall discuss the complexity of Algorithm 1. The function maxCommonGroups can be implemented in O(n log(n)). Notice the worst case scenario is achieved when H1 = H2 since the results of each recursive call will be the empty set. Assuming the ﬁrst Θ(d) levels correspond to one repeated group S and the last levels correspond to Θ(n) repeated groups, the time complexity is O(n3 + dn log n).
References
[1] Bunke, H., 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18 (8), 689–694.
8
[2] Bunke, H., Dickinson, P., Kraetzl, M., Neuhaus, M., Stettler, M., 2008. Matching of hypergraphs—algorithms, applications, and experiments. In: Applied Pattern Recognition. Springer, pp. 131–154.
[3] Goldberg, M. K., Hayvanovych, M., MagdonIsmail, M., 2010. Measuring similarity between sets of overlapping clusters. In: Social Computing (SocialCom), 2010 IEEE Second International Conference on. IEEE, pp. 303–308.
[4] Gusﬁeld, D., 2002. Partitiondistance: A problem and class of perfect graphs arising in clustering. Information Processing Letters 82 (3), 159– 164.
[5] Kuhn, H. W., 1955. The hungarian method for the assignment problem. Naval research logistics quarterly 2 (12), 83–97.
[6] Lancichinetti, A., Fortunato, S., Kert´esz, J., 2009. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11 (3), 033015.
[7] Lov´asz, L., 1972. Normal hypergraphs and the perfect graph conjecture. Discrete Mathematics 2 (3), 253–267.
[8] Meila˘, M., 2003. Comparing clusterings by the variation of information. In: Learning theory and kernel machines. Springer, pp. 173–187.
[9] Porumbel, D. C., Hao, J. K., Kuntz, P., 2011. An eﬃcient algorithm for computing the distance between close partitions. Discrete Applied Mathematics 159 (1), 53–59.
[10] Queyroi, F., Delest, M., F´edou, J.M., Melan¸con, G., 2014. Assessing the quality of multilevel graph clustering. Data Mining and Knowledge Discovery, 1–22.
9
François Queyroi, Sergey Kirgizov
To cite this version:
François Queyroi, Sergey Kirgizov. Suppression Distance Computation for Set Covers and Hierarchies. 2014. hal00996090v1
HAL Id: hal00996090 https://hal.archivesouvertes.fr/hal00996090v1
Preprint submitted on 26 May 2014 (v1), last revised 21 Apr 2015 (v3)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Suppression Distance Computation for Set Covers and Hierarchies
Fran¸cois Queyroi∗, Sergey Kirgizov∗
Sorbonne Universit´es, UPMC Univ Paris 06, UMR 7606, LIP6, F75005 CNRS, UMR 7606, LIP6, F75005, Paris, France
Abstract
We discuss the computation of the suppression distance between two set covers. It is the minimum number of element of a set that have to be removed so the renaming covers are equal. We show that this problem is N PHard by reduction to minimum vertex cover. We further investigate the subclass of hierarchies and prove this problem to be polynomial in this case as it gives birth to a class of perfect graphs. We also propose an algorithm based on recursive maximum assignments.
Keywords: set cover, hierarchy, distance, graphs, vertex cover
1. Introduction
Decomposing a set into patterns of interest is a central problem in data analysis. Evaluating the distance between decompositions is an important task in this context as it allows to study the behaviour of clustering algorithms or study the evolution of a set of patterns over time. The situation where the detected patterns do not overlap is called partitions. Measures based on edit distance [4, 9] or on mutual information [8] have been used to assess the distance between partitions. The situation where groups may overlaps received some attention recently [3]: we call these decomposition covers. They are particularly useful for the detection of communities in complex networks [6]. A subclass of covers are hierarchies in which patterns can
∗Corresponding author Email addresses: [email protected] (Franc¸ois Queyroi), [email protected] (Sergey Kirgizov)
Preprint submitted to Information Processing Letters
May 1, 2014
be recursively decomposed into smaller patterns with similar properties. The problem of distance deﬁnition between hierarchies is also of interest [10]. We focus in this paper on the computation of the suppression distance deﬁned for partitions in [4] and generalized here for general covers and hierarchies.
We assume we have a set S of elements of ﬁnite cardinality n. A cover of the set S is a ﬁnite collection of subsets C = (C1, C2, . . . , Ck) where each Ci is a nonempty subset of S. We call C(S′) the subcollection of groups that contain S′ ⊆ S. We denote by C[S′] the induced subcover of C by S i.e. the cover of S′ obtained after the removal of every elements of S \ S′.
Deﬁnition 1. (Suppression Distance) Let C1 and C2 be two covers of S. The suppression distance ds is deﬁned as
ds(C1, C2) = min{S′, C1[S \ S′] = C2[S \ S′]}
(1)
S′⊆S
A set S′ such that C1[S \ S′] = C2[S \ S′] is called a suppression set.
Theorem 1. The function ds is a metric.
Proof. The nonnegativity, identity of indiscernibles and symmetry properties are straightfoward for ds. Moreover, this distance respects the triangular inequality. Consider three covers C1, C2 and C3. Let Sij be such set that Ci[S \ Sij] = Cj[S \ Sij]. Observe that C1[S \ (S12 ∪ S23)] = C3[S \ (S12 ∪ S23)], so we have:
S13 ≤ S12 ∪ S23 ≤ S12 + S23 ds(C1, C3) ≤ ds(C1, C2) + ds(C2, C3)
2. Computation of the Suppression Distance
The evaluation of the distance between covers relates to hypergraph matching used in pattern recognition [2]. One can easily reduce the computation of the suppression distance to the problem of maximum commom subhypergraph [1]. Indeed the couple (S, C) is an hypergraph. If S′ ⊂ S is the maximum common induced subhypergraph of (S, C1) and (S, C2) then obviously we have ds(C1, C2) = S − S′. However, ﬁnding such subset is
2
N Phard. We show here that the suppression distance computation can also be reduced to the minimum vertex cover problem. The diﬀerences between both covers can be encoded into a simple graph called the cover graph.
Deﬁnition 2. (Cover Graph) Let C1 and C2 be two covers of a set S. We call G(C1, C2) = (S, E) the cover graph of (C1, C2) with the edges E = {(s1, s2) ∈ S2, C1({s1, s2}) = C2({s1, s2})}. This graph can contain self
loops.
Two elements (or a singleton) of S are connected iﬀ they (it) do (does) not appear the same amount of time together (alone) in both covers. A similar transformation has been proposed by Gusﬁeld [4].
Lemma 1. Given G = (S, E) the cover graph of (C1, C2) and S′ ⊆ S, the induced subgraph G[S′] is the cover graph of (C1[S′], C2[S′]).
Proof. First, the G(C1[S′], C2[S′]) has the same vertex set as G[S′]. Second,
for
every
pair
of
elements
s1, s2
in
S′,
we
have
C1({s1, s2})
=

C
′ 1
(
{s
1
,
s2
}
)

and C2({s1, s2}) = C2′ ({s1, s2}). Therefore, every edge in G(C1[S′], C2[S′])
is also an edge in G[S′] since G[S′] is an induced subgraph. Also, every non
edge in G(C1[S′], C2[S′]) is an nonedge in G[S′]. Therefore the edge set of
both graphs are equal.
Theorem 2. ds(C1, C2) is equal to the size of the minimum vertex cover of G(C1, C2).
Proof. We ﬁrst show that E = ∅ iﬀ C1 = C2. If both covers are equal then each pair of vertices appears the same amount of time in both decomposition and the cover graph contains no edge. Now if a cover graph contains no edge then there exist a bijection between the groups of C1 and C2. If such bijection does not exist, it means there is at least one group in either C1 or C2 that has no equal counterpart in the other cover. Therefore, there is a pair of elements in S2 that does not appear together the same amount of time in C1 and C2. Those two elements should be connected in G which contradicts our hypothesis. Finding a set S′ such that C1[S \S′] and C2[S \S′] are equal is therefore equivalent to ﬁnding a S′ such that G[S \ S′] contain no edges (Lemma 1). If the S′ is the smallest among all other subsets with this property then it is a minimum vertex cover of G.
3
Theorem 3. For any simple graph G = (S, E), there exist two covers (C1, C2) of S such that G = G(C1, C2)
Proof. For each edge (u, v) ∈ E(G), add the set {u, v} to C1 and create two groups {u} and {v} in C2. Using this construction, the elements u and v are found together in one group in C1 and are not found together in C2. Therefore the cover graph G(C1, C2) will contain the edge (u, v). Moreover, the edge set of G(C1, C2) will correspond to the groups of C1. We conclude G = G(C1, C2).
Since any graph can be the cover graph of two set covers (Theorem 3), computing ds(C1, C2) is N Phard by reduction to the Minimum Vertex Cover problem (Theorem 2).
3. The Case of Hierarchies
Hierarchies are a particular class of covers that contain partitions as special cases. We show in this section that ds can be computed in polynomial time in this context and provide a recursive algorithm.
Deﬁnition 3. (Hierarchy) A hierarchy H is a cover of a set S such that if there exist two groups H1, H2 ∈ H such that H1 ∩H2 = ∅ then either H1 ⊆ H2 or H2 ⊆ H1.
Since the inclusion between the sets deﬁne a weak ordering of the groups. The relations between the sets of H can be represented in a tree ordered fashion, the roots being the groups that are not include in any other group. Let Ni(H) denote the ith level of H i.e. the groups sitting at depth i in this tree. Notice it is still well deﬁned if H contains repeated groups. The depth of a hierarchy is the maximal depth of its groups. For G = G(H1, H2), we have a function p : E(G) → N which is the ﬁrst level at which the couple (a, b) ∈ E belongs to a group of H1 but not H2 (or the opposite). We denote by Gi the subgraph of G formed by the edges {e ∈ E, p(e) ≥ i}. Notice we have G1 = G. An example of hierarchies and their cover graph can be found in Figure 1.
We will assume that each element of S appears the same number of times in (H1, H2). If it not the case, the cover graph G(H1, H2) contains selfloops. A subset of vertices belonging to all minimum vertex covers is therefore straightforward to ﬁnd. Indeed, vertices with selfloops are part of all minimum vertex cover.
4
H1
N1
abc
def
N2 ab
c
N3 ab
c
G(H1, H2) c f
da e
b
H2 abcdef ac
N1 b N2
a
c
b N3
Figure 1: Example of two hierarchies H1, H2 of a set S = {a, b, c, d, e, f } and their cover graph G(H1, H2) (p(e) = 1 for gray edges and p(e) = 2 for black edges).
3.1. Existence of a polynomialtime solution We show here that the cover graph of hierarchies is a perfect graph. As
said before, we assume that each element of S belongs to the same number of groups in both hierarchies. One important consequence of this is the fact that H1, H2 have the same depth d.
Lemma 2. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, the elements of S′ all belong to the same groups of depth lower than i in both H1 and H2.
Proof. Assume Gi[S′] is a connected subgraph but there exist at least two nonoverlapping subsets A and B of S′ that belong to diﬀerent groups of depth lower than i in both H1 and H2. It means that either A and B belong to the same groups in H1 but not in H2 or both do not belong to the same groups in both hierarchies. Notice that both cases are impossible otherwise all edges would have value a value lower than i (ﬁrst case) or there would be no edges between the two groups (second case). This contradicts our hypothesis since we assume A ∪ B is a connected component of Gi.
Lemma 3. Let S′ ⊆ S and i > 0 such that Gi[S′] is connected, if there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i then ∀w ∈ S′, we have (u, w) ∈ E with p(u, w) = j.
Proof. According to Lemma 2, the elements in S′ all belongs to the same groups of depth lower than i in both hierarchies. If there exist u ∈ S′ such that (u, v) ∈ E and p(u, v) = j < i, it means there is a group at depth j in H1 that contain (u ∪ S′) and a group at depth j in H2 that contains S′ but not u. Therefore u should be connected to every elements of S′ with an edge of value j.
5
One important consequence of Lemma 3 is that two connected components of G2 either have no edges between them or form a complete bipartite subgraph in G.
Theorem 4. Let H1, H1 be two hierarchies of ﬁnite depth of a set S, computing ds(H1, H1) can de done in polynomial time.
Proof. Let Ψd denote the set of hierarchies of a set S of depth d, we show that the cover graph G (obtained after the removal of vertices with selfloops) of any pair in Ψd is a perfect graph by induction over d.
For d = 1, the class Ψ1 corresponds to the class of partitions of S and the graph G is therefore perfect (Theorem 3.4 of [4]).
Assume it true for any d we show it is also true for d + 1. Let G˜ be the graph obtained by contraction of edges of G2 in G. This graph is perfect. Indeed, according to Lemma 2, elements within the same connected components of G2 belongs to the same group at depth 1. The graph G˜ is therefore the cover graph of the two ﬂat partitions (P˜1, P˜2) obtained by fusioning each group into a single new element in the partitions N1(H1) and N1(H2).
According to Lemma 3, the graph G can be recovered from G˜ by expanding each vertex u ∈ V (G˜) by its corresponding connected component of G2 and connecting each element to the vertices previously adjacent to u. Now, since every connected subgraphs of G2 is perfect as the cover graph of two hierarchies of depth d, the graph G is also perfect according to the Theorem 1 of [7]. A minimum vertex cover of G can be found in polynomial time.
3.2. A solution based on recursive maximum assignment It has been shown that, if P1 and P2 are partitions, the distance ds(P1, P2)
can be computed by solving a maximum assignment problem based on the size of intersections between all pairs of groups in P1 and P2 [4]. The computation of the intersections takes O(n) and the assignment is computed in O((P1 + P2)3). The set of elements to be removed are the elements in the sets that are not covered by the maximum assignment.
Notice two hierarchies are equal if all their levels are equal i.e. if the set of couple {(Ni(H1), Ni(H2))}1≤i≤d are all pairwise equal. However, ﬁnding a suppression set S′ using a greedy “levelbylevel” approach (either topdown or bottomup) may not lead to a optimal solution. Consider the example given in Figure 1 where ds(H1, H2) = 3, a topdown approach may fail since
6
either {a, b, c} or {d, e, f } can be chosen at level 1 to be part of S′. But choosing {d, e, f } would lead to a distance of 4. Alternatively, consider the subhierarchies induced by the set {a, b, c}, a bottomtop approach may also fail since either a or b can belongs to S′ at the last level. Choosing b would lead to a distance of 2 whereas ds(H1[{a, b, c}], H2[{a, b, c}]) = 1.
We provide here an algorithm for computing a suppression set for two hierarchies and therefore the distance between them. The idea is to recursively compute a suppression set for two subhierarchies whose element belongs to the same groups at the current level.
Algorithm 1: suppressionSet(H1, H2)
Input: H1, H2 two hierarchies of a set S Output: S′ ⊆ S a suppression set
1 if H1 = H2 = ∅ then
2 return ∅
3 else
4 S′ ← ∅
5 for C ∈ maxCommonGroups(N1(H1), N1(H2)) do
6
S′ ← S′ ∪ suppressionSet(H1[C] − C, H2[C] − C)
7 end
8 return S′ ∪ f latSuppressionSet(N1(H1[S \ S′]), N1(H2[S \ S′]))
9 end
The function maxCommonGroups(P1, P2) returns the maximal subsets of vertices that are together in both partitions P1 and P2. The function ﬂatSuppressionSet(P1, P2) returns a minimum suppression set of elements S′ such that ds(P1, P2) = S′. This function can used the Hungarian algorithm [5]. The following result is used to show that the set returned by a recursion call is a subset of one optimal solution.
Lemma 4. Let G = (S, E) be the cover graph of two hierarchies of S, any minimum vertex cover of Gi is a subset of a minimum vertex cover of Gi−1.
Proof. Let C be a minimum vertex cover of Gi, S′ be a maximal connected component of Gi. The set C′ = S′ ∩ C is a minimum vertex cover of Gi−1[S′]. According to Lemma 3, the edge cut (S′, S′′) forms a complete bipartite graph where S′′ is the set of vertices in (S \ S′) connected to S′. Therefore,
7
the minimum vertex cover of Gi−1 should contain either all S′ or all S′′ ∪ C′ i.e. C′ is a subset of the cover in both cases. Since it is true for the minimum cover of every maximal connected components of Gi, the set C is a subset a minimum vertex cover of Gi−1.
Theorem 5. For two hierarchies H1, H2 of a set S, Algorithm 1 always terminates and returns a suppression set S′ such that ds(H1, H2) = S′.
Proof. Termination: First the algorithm will terminate since the hierarchies are of ﬁnite depth and the recursive call is used on two sub hierarchies of lower depth (the “root” group is removed in both hierarchies in line 6). Moreover, the condition in line 1 is always met at some point since we assume elements of S appears the same number of times in both hierarchies.
Correctness: Observe the function maxCommonGroups(P1, P2) returns sets that correspond to either independent or maximal connected components of G2. Let C be one of these sets, the subhierarchies H1[C]−C and H2[C]−C correspond to the subhierarchies induced by C minus the ﬁrst level. Assume that at the end of the loop 5–7, the set S′ is the union of the elements to be removed so that those subhierarchies are equal. According to Lemma 4, the set S′ is a subset of a minimum suppression set between (H1, H2) i.e. a minimum vertex cover of G2. A possible solution is therefore the union of S′ and a suppression set of H1[S \ S′] and H2[S \ S′]. The latter can be found only looking at the ﬁrst level of both hierarchies. We can show the assumption on S′ to be true by induction since the Algorithm will return a minimum suppression set if H1, H2 are ﬂat partitions.
We shall discuss the complexity of Algorithm 1. The function maxCommonGroups can be implemented in O(n log(n)). Notice the worst case scenario is achieved when H1 = H2 since the results of each recursive call will be the empty set. Assuming the ﬁrst Θ(d) levels correspond to one repeated group S and the last levels correspond to Θ(n) repeated groups, the time complexity is O(n3 + dn log n).
References
[1] Bunke, H., 1997. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18 (8), 689–694.
8
[2] Bunke, H., Dickinson, P., Kraetzl, M., Neuhaus, M., Stettler, M., 2008. Matching of hypergraphs—algorithms, applications, and experiments. In: Applied Pattern Recognition. Springer, pp. 131–154.
[3] Goldberg, M. K., Hayvanovych, M., MagdonIsmail, M., 2010. Measuring similarity between sets of overlapping clusters. In: Social Computing (SocialCom), 2010 IEEE Second International Conference on. IEEE, pp. 303–308.
[4] Gusﬁeld, D., 2002. Partitiondistance: A problem and class of perfect graphs arising in clustering. Information Processing Letters 82 (3), 159– 164.
[5] Kuhn, H. W., 1955. The hungarian method for the assignment problem. Naval research logistics quarterly 2 (12), 83–97.
[6] Lancichinetti, A., Fortunato, S., Kert´esz, J., 2009. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics 11 (3), 033015.
[7] Lov´asz, L., 1972. Normal hypergraphs and the perfect graph conjecture. Discrete Mathematics 2 (3), 253–267.
[8] Meila˘, M., 2003. Comparing clusterings by the variation of information. In: Learning theory and kernel machines. Springer, pp. 173–187.
[9] Porumbel, D. C., Hao, J. K., Kuntz, P., 2011. An eﬃcient algorithm for computing the distance between close partitions. Discrete Applied Mathematics 159 (1), 53–59.
[10] Queyroi, F., Delest, M., F´edou, J.M., Melan¸con, G., 2014. Assessing the quality of multilevel graph clustering. Data Mining and Knowledge Discovery, 1–22.
9
Categories
You my also like
On Modular Edge Irregularity Strength of Graphs
239.5 KB31.4K13.5KEven Modular Edge Irregularity Strength of Graphs
149.4 KB11.5K5.7KGraph Theory And Combinatorics Notes
1.5 MB42.7K9.8KFactor Domination Labeling
399.4 KB4.4K1.1KEDGE Magic Pyramidal Graphs
764.6 KB5K650Graphs Data Structures and Algorithms
1.7 MB92K27.6KLecture 11 Graphs, DFS, BFS
1.4 MB11.3K4.6KAlgorithmic Graph Theory
7.1 MB8.6K2.4KCountingtuplesrestrictedbycoprimality conditions arXiv:1403
160.4 KB10.4K4.5K