# Summarization Techniques for Visualization of Large

Download Summarization Techniques for Visualization of Large

## Preview text

Summarization Techniques for Visualization of Large Multidimensional Datasets

Technical Report TR-2005-35 Sarat M. Kocherlakota, Christopher G. Healey

Knowledge Discovery Lab Department of Computer Science, North Carolina State University

Raleigh, NC 27695-8207

Email: [email protected]

Abstract

One of the main issues confronting visualization, is how to effectively display large, high dimensional datasets within a limited display area, without overwhelming the user. In this report, we discuss a data summarization approach to tackle this problem. Summarization is the process by which data is reduced in a meaningful and intelligent fashion, to its important and relevant features. We survey several different techniques from within computer science, which can be used to extract various characteristics from raw data. Using summarization techniques intelligently within visualization systems, could potentially reduce the size and dimensionality of large, high dimensional data, highlight relevant and important features, and enhance comprehension.

1 Introduction

The area of visualization is primarily focused on representing raw data in the form of images, thereby providing users with the ability to visually analyze and explore large, complex datasets [16, 35]. Visualization techniques assist users in managing and displaying data in an intelligent and intuitive fashion. Visualization can be a great asset in the discovery of relationships and dependencies that may exist within the data.

Faced with the problem of handling larger and larger datasets, recent studies in visualization have emphasized increasing the information content of the displayed images. Datasets comprising millions (if not billions) of data elements, with each data element possibly having hundreds of attributes, have become commonplace across a variety of domains and applications. Issues related to dataset size and dimensionality were outlined, by the original NSF panel on visualization [28]. Current visualization techniques that deal with large multidimensional datasets have been effective in visualizing data containing up to at most a few million data values. Moreover, as we approach these limits, the images produced by current visualization techniques become cluttered, and visually difﬁcult, even impossible to interpret [13, 17]. This diminishes the user’s ability to visually analyze the data and recognize trends, clusters, outliers or gaps within the data [13]. The problem of effectively representing these large, multidimensional datasets without overwhelming the user’s ability to comprehend the resulting images continues to pose a serious challenge to the area of visualization [16].

To face this challenge, recent studies have proposed a tighter coupling between visual and non-visual methods for data analysis and exploration [25, 33, 34]. In particular, incorporating data summarization techniques within visualization systems may enhance our ability to manage and visualize very large datasets of high dimensionality. Data summarization techniques can reduce the size and complexity of large multidimensional datasets to more manageable proportions. They can also highlight the relevant aspects of the data more clearly, leading to more coherent visualizations, and also facilitating more accurate and efﬁcient visual analysis.

Summarization is performed using various techniques. These techniques are designed for the automated and unsupervised analysis and exploration of raw data, followed by the generation of effective summaries based on the analysis [2, 3, 22, 24]. Unfortunately, many of these techniques are more suited to univariate (or low dimensional) data, and may not be effective in dealing with datasets of high dimensionality [1]. Also, many data mining and machine learning techniques are complicated. Both their intermediate as well as their ﬁnal results may not be easily understood by non-expert users. In this scenario, directly visualizing the results produced by these automated methods would provide neither any insight into the data, nor any understanding of the summarization process itself. For data summarization to be effective during visualization, a system should support the user in understanding as well as participating in the process.

In this report, we will focus on methods that could be enhanced and applied to summarize data within the context of visualization. The techniques we discuss include association rule mining [3, 4], outlier detection [1, 23, 34], clustering [22, 24], data classiﬁcation [18, 27], data aggregation [8, 13], and the principal component analysis (PCA) technique, among others.

1

The remainder of the paper is organized as follows. In Section 2, we discuss techniques for association rule mining. Section 3 is devoted to outlier detection algorithms. In Section 4, we describe clustering and clustering techniques. In Section 5, we investigate classiﬁcation techniques. Section 6 focuses on data aggregation. In Section 7, we discuss dimensionality reduction, and, in Section 8, we end with summarization techniques designed for spatial data. Finally, some conclusions are drawn in Section 9.

2 Association rule mining

Association rule mining techniques are primarily focused on the discovery of patterns and dependencies in datasets. An association rule is an expression of the form X ⇒ Y , where X and Y are sets of items. In a multidimensional dataset, an item could denote a particular attribute value, and X and Y could denote combinations of individual attribute values. For a multidimensional dataset D representing n data attributes A = {A1, ..., An}, where each data element ei ∈ D is a combination of n individual data values, one for each attribute, the expression X ⇒ Y signiﬁes that, if ei contains X then ei probably also contains Y .

Association rule mining makes use of two measures: support and conﬁdence. The support of a set of items X, denoted by supp(X), is the fraction of data elements in D that contain X. The support of an association rule X ⇒ Y , denoted by supp(X ⇒ Y ), is the fraction of the data elements in D that contain the conjunction X ∪ Y . The conﬁdence of such a rule, denoted by conf(X ⇒ Y ), is the fraction of the data elements in D containing X, that also contain Y [3]. Mathematically, supp(X ⇒ Y ) = supp(X ∪ Y ), and conf(X ⇒ Y ) = supp(X ∪ Y )/supp(X) [19].

Based on these measures, the task of association rule mining is to ﬁnd all association rules within a dataset D that satisfy certain user-deﬁned thresholds for minimum support and minimum conﬁdence, denoted by min-sup and min-conf respectively [4]. The approach to ﬁnding such rules generally involves two major steps. In the ﬁrst step, all sets of attribute values that satisfy min-sup within D are found. Next, these sets are used to generate rules of the form X ⇒ Y that satisfy min-conf[3].

Hipp, Guntzer and Nahaeizadeh present a general survey for association rule mining algorithms in [19]. They note that the most popular among association rule mining techniques are Apriori [4] and Apriori-based algorithms.

2.1 Apriori

Apriori, ﬁrst introduced by Agrawal et al., is an algorithm that is used to ﬁnd sets of items within a dataset that satisfy a user-deﬁned minimum support threshold [4]. We begin with a discussion of the terminologies relevant to the algorithm [4]. A set of k items is referred to as a k-itemset. A k-itemset X that has supp(X) ≥ min-sup is referred to as a large k-itemset. The set of all large k-itemsets is denoted by Lk. A candidate k-itemset or a k-candidate itemset is used to denote a potentially large k-itemset, and the set of all such k-candidate itemsets is

2

denoted by Ck. Apriori works in the following manner. It ﬁrst counts the number of occurrences of each

individual item in D to ﬁnd all large 1-itemsets L1. Each subsequent pass k involves two phases. In the ﬁrst phase Lk−1, the set of (k − 1)-itemsets found large in the previous pass, is used to generate a set of k-candidate itemsets Ck. In the second phase, D is scanned to compute the support of each of the k-candidate itemsets in Ck. Those that are found large (i.e. those that have support greater than or equal to min-sup ) are used to form Lk, the set of large k-itemsets.

The k-candidate itemsets in Ck are generated from Lk−1 through two steps. In the ﬁrst step, referred to as the join step, k-candidate itemsets are generated by merging all (k − 1)-itemsets X, Y ∈ Lk−1, which share their ﬁrst (k − 2) items, i.e. X.item1 = Y.item1, X.item2 = Y.item2, ..., X.itemk−2 = Y.itemk−2, X.itemk−1 = Y.itemk−1 [4]. In the second step, referred to as the prune step, all itemsets c ∈ Ck for which some (k − 1) subset of c is not in Lk−1 are deleted.

To illustrate the algorithm let us consider an example [4]. Let L1, the set of all large 1itemsets in a dataset D generated during the ﬁrst phase of the algorithm, contain the individual items { {1}, {2}, {3}, {5} }. To construct C2 from L1, all 2-itemset combinations of items from L1 are generated, producing C2 = { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}, }. In the second phase of the algorithm, suppose the itemsets {1 5} and {3 5}, are removed from C2 as they were found to have low support in D. The remaining itemsets deﬁne L2 = { {1 2}, {1 3}, {2 3}, {2 5} }. In the ﬁrst phase of the pass k = 3 L2 is used to generated C3. During the join step, {1 2} is merged with {1 3} to generate the itemset {1 2 3}. Likewise, {2 3} is merged with {2 5}to generate {2 3 5}. C3 now contains { {1 2 3}, {2 3 5} }. During the prune step, there are three k − 1 = 2-itemsets for candidate { 2 3 5}: {2 3}, {2 5} and {3 5}. Although {2 3} and {2 5} are contained in Lk−1, {3 5} is not. So the candidate itemset {2 3 5} is removed from C3. The candidate itemset { {1 2 3} } is accepted since {1 2}, {2 3} and {1 3} are all in Lk−1. C3 now contains only { {1 2 3} }. The remaining itemset in C3, if found large in the second phase of the current pass, becomes L3.

Once combinations of items that satisfy min-sup in D are identiﬁed, they are used to generate association rules. For each large itemset Z, all rules having the form X ⇒ (Z − X) are generated, where X is any subset of Z, and supp(Z)/supp(X) ≥ min-conf . Initially, for every large itemset Z, all rules having only one item on the right hand side of the expression X ⇒ (Z − X) are generated and tested. Rules that satisfy the minimum conﬁdence requirement are further expanded to build additional rules.

Another algorithm introduced in [4] called AprioriTID extends the Apriori-based approach. In this algorithm the entire dataset D is not used for counting support at the end of each pass. Instead, AprioriTID uses only those items that were found large from the previous pass in an effort to improve the efﬁciency.

Research on association rule mining was initially targeted at transaction databases. Subsequent work has revealed that association rules can also be effective over a wider range of applications and datasets. Visualizing the patterns, rules and dependencies discovered using association rule mining could simultaneously highlight the important characteristics of the data,

3

as well as help reduce the amount of information to be displayed.

2.2 Drawbacks

Association rule mining algorithms do have certain drawbacks, such as:

1. Algorithmic complexity: Increase in size and dimensionality of the data reduces the efﬁciency of the algorithm. This means that the algorithm may be unsuited to very large multidimensional datasets [19].

2. Determining usefulness of generated rules: Association rule mining of large multidimensional datasets could result in the generation of thousands of association rules. Evaluating the relevance and utility of these rules can be a complex task by itself.

3. Counting infrequent items: Association rule mining algorithms scan the entire dataset to discover itemsets that satisfy minimum (user-speciﬁed) support and conﬁdence thresholds. This process also includes counting support for itemsets that have low support within the dataset. This can also limit the efﬁciency of association rule mining techniques [4].

4. Handling spatial datasets: Techniques for association rule mining were designed to discover relationships in transaction data, in which attributes assume binary values i.e. either 0 or 1. In general however, data attributes assume values that cover a broad, continuous range of real numbers [36]. Counting the support of each individual attribute value can pose serious problems to association rule mining algorithms.

In the next section we deal with the problem of outlier detection in large multidimensional datasets.

3 Outlier Detection

Outliers can be deﬁned as “data elements which are very different from the rest of the data based on some measure” [1]. Outliers can also be described as data elements that deviate from other observations by so much that they arouse the suspicion of being produced by a mechanism different than that which generated the other observations [15]. The detection and visualization of outliers within large multidimensional datasets could provide insight into abnormalities in the underlying data.

There are many different approaches to outlier detection. The distance-based approach by Knorr and Ng discovers outliers using a full-dimensional distance metric between any two data elements in a high dimensional information space [23]. A data element ei is considered an outlier with respect to parameters k and λ, if at least a fraction k of the elements in a dataset D lie at a distance greater than λ from ei [23, 1]. Unfortunately, this technique is sensitive to the parameter λ: a large value of λ could result in few outliers, while a small value of λ could

4

result in a large number of outliers. Also, the technique may not be effective for datasets that

exhibit a complex structure, for instance, clusters of varying densities [6].

Breunig et al. suggest identifying outliers based on the densities of local neighborhoods.

The basic idea is simple: if some element ei is farther away from its surrounding elements, when compared to the neighbors ej that lie around it, then ei is marked as a local outlier. In this technique, outliers are found by comparing densities of different sets of data elements.

Elements which are “outlying” relative to their local neighborhoods are called local outliers [6]. For each element ei, its k nearest neighbors are identiﬁed, denoted by Nk(ei). If an element ej ∈ Nk(ei) is within a user-deﬁned distance λ from ei, then ej is said to lie in the local neighborhood of ei, and the reachability distance of ei with respect to ej, denoted by d(ei, ej), is set to λ. If ej lies at a distance greater than λ, then d(ei, ej) is set to the actual distance between ei and ej. The local reachability density of ei, denoted by δlocal(ei), is then calculated as the inverse of the average of d(ei, ej) for all ej ∈ Nk(ei), that is:

1

1

δlocal(ei) = k

d(ei, ej)

(1)

∀ej ∈Nk(ei)

Finally, the local outlier factor of ei with respect to Nk(ei), denoted by LOF (ei), is computed

as the average of the ratios of the local reachability density δlocal(ej) of every ej ∈ Nk(ei) to

δlocal(ei), that is:

LOF (ei) = 1

δlocal (ej )

(2)

k ∀ej ∈Nk(ei) δlocal(ei)

The outlier factor denotes the degree to which ei is considered to be a local outlier. If the outlier factor exceeds a user-chosen threshold, ei is marked as a local outlier.

Both the distance-based approach [23] and the density-based approach [6] are more suited to low dimensional datasets. The sparse nature of high dimensional data makes it difﬁcult to determine the locality of a data element [1]. Also, to determine local densities of elements in high dimensional data, a meaningful concept of distance is necessary. Calculating full-dimensional distances between data elements in high dimensional space can also be computationally expensive.

3.1 Evolutionary Algorithm Technique

One approach to ﬁnding outliers in high dimensional data is by examining data under different low dimensional projections of the data attributes. This approach is based on the observation that the density of data varies when examined under different attribute subsets, and that the attributes may contribute differently to the behavior of each data element [1]. In most cases, a data element is dependent only on a small subset of relevant data attributes. Meaningful distances in high-dimensional information space can be determined more effectively by using fewer, more relevant data attributes [18].

The approach adopted by Aggarwal and Yu is based on the principle that outliers are patterns that have a very low presence within the data. Such patterns could be found efﬁciently,

5

by examining low dimensional projections of the data which are abnormally sparse, i.e. that have extremely low density [1]. The density measure of an attribute subset is referred to as its sparsity coefﬁcient. Examining every lower-dimensional attribute subset to determine its sparsity can be done using a brute force approach. Unfortunately, as the dimensionality of the data increases, the search space of the attribute subsets also grows exponentially, making brute force exploration inefﬁcient.

To avoid this, Aggarwal and Yu suggest using an evolutionary algorithm technique. According to this theory, the limited amount of resources available in nature to all species creates competition among the species, and only the ﬁttest of them survive. Fitter individuals mate with each other more often, producing still better individuals. In a similar fashion, the evolutionary algorithm technique combines promising sparse low dimensional attribute projections, and also allows sparser subsets to combine with each other more frequently, to produce highly sparse lower dimensional attribute projections.

Let m be the total number of elements in D. For the purpose of calculating the sparsity coefﬁcient, each of the n attributes of D is divided into φ ranges, each of which contain an equal number of elements fm, where f = φ1 . Next, consider selecting a subset of d attributes, d < n, then choosing one subrange for each of these d attributes. If the data in D is uniformly distributed across each attribute, we may compute the expected fraction of elements f d that will have attribute values that fall within the selected ranges of the d-attribute subset we have constructed.

The total number of elements from D expected to lie within the d-attribute subset is therefore m · f d, and the standard deviation of this number is m · f d · (1 − f d). In practice however, attribute values are rarely uniformly distributed and independent. This means that the actual number of elements k in D that lie within the attribute subset is not m · f d. This difference is used to calculate the sparsity coefﬁcient [1]:

S(D) = k − m · f d (3) m · f d · (1 − f d)

Each d-attribute subset is encoded as a string recording the combination of attributes present in the subset. Each position in the encoding represents a particular data attribute, and each value at the attribute’s position represents the speciﬁc subrange selected for that attribute. These encodings are also known as solutions.

The evolutionary algorithm starts with a user-selected d and φ, which are used to select an initial set of p solutions S. During the selection step, the sparsity coefﬁcient of each solution in S is calculated. The more negative the sparsity coefﬁcient, the higher a solution’s rank. Next, a crossover step selects pairs of solutions from S, with higher ranked solutions having a higher probability of being chosen. l-attribute subsets from each solution pair (l < d) are identiﬁed. Those l-attribute subsets that are sparser are recombined, to produce a new pair of d-attribute projections. The new solutions replace the old pair that generated them. Finally, a mutation step picks random positions in the solution strings in S and replaces the value with a number between 1 and φ, with a certain probability. These three steps of selection, crossover and mutation are repeated until the sparsity coefﬁcients of the solutions stabilize.

6

Each of the ﬁnal solutions in S represents a pattern of data elements with an abnormally sparse projection of the high dimensional dataset. Data elements that match these attribute patterns are designated as outliers.

Although this technique can ﬁnd high dimensional outliers efﬁciently, it is dependent on the parameters d and φ. Also, in order to make use of evolutionary search methods, a good understanding of the problem is important. Designing speciﬁc algorithms for selection, recombination and mutation which work effectively for a given problem is often a non-trivial task [1].

4 Clustering

Given a dataset D of m data elements, a clustering algorithm divides D into groups or clusters where the data elements in the same cluster are similar to one another relative to data elements in different clusters [12]. Clustering can also be described as a method for classifying data in an exclusive manner, where each data element belongs to exactly one subset or cluster [20]. Clustering algorithms are useful in detecting underlying structure within the data.

In this section we describe two methods for clustering, namely, k-means and self organizing maps.

4.1 k-means Clustering

One of the most widely used clustering methods is the k-means clustering algorithm (KMC). kmeans can be formally described as follows: “Given a set of m data elements in D comprising of n attributes, and an integer k, determine a set of k elements in D called centers, so as to minimize the distance from every element to its nearest center [21].” Each element is attached to its nearest center, thereby subdividing elements in D into k clusters. This approach of decomposing a dataset into disjoint clusters is also known as “partitional clustering” [22].

KMC ﬁrst initializes a set of k cluster centers G ∈ D, i = 1, ..., k. Cluster centers can be assigned, for instance, in a random fashion. Once the centers are initialized, the clustering algorithm assigns each of the remaining, unselected data elements to the center that it is most similar to, i.e. the center that is closest in value. If c(ei) denotes the index of the center closet to a data element ei, then the goal of k-means clustering is to minimize the mean-squared distance between each ei and its nearest cluster center gc(ei). This distance or distortion error [21] is given by[22]:

Ek =

ei − gc(ei) 2

i∈D

When all the data elements have been grouped, the positions of each cluster center is recomputed based on the distances between the data elements within each cluster. The ei which is closest to all the elements within the cluster is assigned as the new cluster center. Once all cluster centers have been recomputed in this fashion, the remaining ei are reassigned to the new centers.

7

This process is repeated several times. At the end of each iteration, the recomputed cluster centers start to resemble the actual cluster centers more closely. The algorithm terminates when convergence is achieved, i.e. the distortion error does not improve signiﬁcantly.

KMC has some drawbacks [22]. The initial assignment of the cluster centers can affect the efﬁciency of convergence to the true cluster centers. Choosing an appropriate value for k is also signiﬁcant to the performance of the algorithm. Ideally, k should be as close as possible to the actual number of clusters present within the dataset. Recomputing centers during each iteration of the algorithm affects the efﬁciency of the technique, especially for large datasets of high dimensionality. Also, before clustering algorithms can be applied to large multidimensional data, analyzing whether the data exhibits a tendency to cluster is important.

4.2 Self Organizing Maps

Self organizing maps (SOM), ﬁrst introduced by Kohonen, are used to organize unstructured data much like the k-means clustering approach [24]. SOM-based algorithms can generate clusters from raw data. They can also produce lower dimensional projections of high dimensional data.

The SOM algorithm works on the principle of competitive learning, an adaptive process by which neurons in a neural network become more and more sensitive to different input categories. A self organizing map generally consists of a two dimensional network of neurons arranged in a grid. Initially, each cell is assigned a reference vector (or reference value) in a random fashion. Every data element ei from the input dataset is assigned to the neuron with reference vector that best represents ei. Locating the closest reference vector in the SOM approach, is very similar to searching for the closest center in the k-means clustering approach. The closest reference vector is called the winning vector, since the the neurons on the grid compete to learn (adapt to) the input data element. The winning vector is then updated to represent ei more closely. The reference vectors around the winning vector are also adjusted, with the amount a reference vector learns from the new data element dependent on how close its vector is to that of the new element. This process is repeated several times over the entire dataset until the reference vectors represent the data elements of D as accurately as possible.

Once the neurons arrange themselves so that further iterations do not produce any signiﬁcant changes to their positions, the algorithm is terminated. At this stage, the reference vectors in the grid tend to topologically arrange themselves such that adjacent cells on the grid represent similar data elements in the information space [25]. This property of a SOM wherein similar data elements are grouped to nearby reference vectors on the grid, is also referred to as topology preservation [9]. A SOM can be used to identify similarities and differences within an information space, and can hence serve as a clustering tool [25]. It also has the capability to generalize, and “learns” the data in an unsupervised fashion [9]. As with clustering techniques in general, it does not require any prior knowledge about the data.

However, SOM techniques are often computationally intensive. Efﬁciency of this technique also depends on the initial arrangement of reference vectors. Since the reference vectors are assigned in a random fashion initially, SOM based techniques produce different results each

8

time the technique is applied. Generating multiple maps may be necessary to choose the most effective map. Finally, SOM based techniques are susceptible to missing data values within the input data, which is a problem that both clustering and projection based algorithms also suffer from [22].

5 Data Classiﬁcation

Classiﬁcation techniques (also referred to as classiﬁers) generally require a training dataset T for which the attribute values and data classes of each element are known in advance. Based on this information, classiﬁcation techniques predict the class of unclassiﬁed data elements ei ∈ D [5]. The performance measure of the classiﬁer, referred to as its classiﬁcation accuracy, is the percentage of elements for which the classiﬁer makes correct class predictions [5].

5.1 Rule Set Based Classiﬁcation

Rule set based classiﬁcation algorithms apply a set of classiﬁcation rules generated from a training dataset to predict the class of an unclassiﬁed data element. Classiﬁcation rules are of the form X ⇒ c, where X (the antecedent) contains a combination of attribute value pairs, and c (the consequent) denotes a corresponding class.

Traditionally, machine learning algorithms such as ID3 and C4.5 are used for class prediction [31]. In these methods, T is partitioned into smaller and smaller disjoint subsets using distinct values of the data attributes. During the partitioning process, the attribute with the highest probability of distributing T evenly is selected from among the set of all attributes. T is then subdivided into disjoint subsets based on ranges of values from the selected attribute. Each of these subsets is then recursively subdivided in a similar fashion. The partitioning is represented using a decision tree, where each node represents an attribute or an attribute subrange. The process ends when all attributes have been used and the leaves of the tree contain only individual classes. The class assigned to each leaf is the one with the highest probability among the elements belonging to the attribute subrange represented by the parent of the leaf. In cases, where each of the classes is equally probable, a class is chosen at random and assigned to the leaf. The decision tree constructed from T is then used to classify a new dataset. Each path from the root of the tree to a leaf represents a classiﬁcation rule. New unclassiﬁed data elements ei ∈ D are then matched against these classiﬁcation rules.

Rule-based classiﬁcation has certain drawbacks, however. The way the data is subdivided may not reﬂect its actual distribution [30]. Classiﬁers select a single attribute at a time to partition the subset of data, and do not examine attribute combinations for possible relevance. This can lead to a sub-optimal partitioning, especially in high dimensional data space, where not all attributes are of equal importance. Also, constructing decision trees for large multidimensional datasets can be computationally intensive. Decision tree classiﬁers will fail to classify data elements with missing attribute values, especially if the missing values involve the attribute that forms the root of the decision tree.

9

Technical Report TR-2005-35 Sarat M. Kocherlakota, Christopher G. Healey

Knowledge Discovery Lab Department of Computer Science, North Carolina State University

Raleigh, NC 27695-8207

Email: [email protected]

Abstract

One of the main issues confronting visualization, is how to effectively display large, high dimensional datasets within a limited display area, without overwhelming the user. In this report, we discuss a data summarization approach to tackle this problem. Summarization is the process by which data is reduced in a meaningful and intelligent fashion, to its important and relevant features. We survey several different techniques from within computer science, which can be used to extract various characteristics from raw data. Using summarization techniques intelligently within visualization systems, could potentially reduce the size and dimensionality of large, high dimensional data, highlight relevant and important features, and enhance comprehension.

1 Introduction

The area of visualization is primarily focused on representing raw data in the form of images, thereby providing users with the ability to visually analyze and explore large, complex datasets [16, 35]. Visualization techniques assist users in managing and displaying data in an intelligent and intuitive fashion. Visualization can be a great asset in the discovery of relationships and dependencies that may exist within the data.

Faced with the problem of handling larger and larger datasets, recent studies in visualization have emphasized increasing the information content of the displayed images. Datasets comprising millions (if not billions) of data elements, with each data element possibly having hundreds of attributes, have become commonplace across a variety of domains and applications. Issues related to dataset size and dimensionality were outlined, by the original NSF panel on visualization [28]. Current visualization techniques that deal with large multidimensional datasets have been effective in visualizing data containing up to at most a few million data values. Moreover, as we approach these limits, the images produced by current visualization techniques become cluttered, and visually difﬁcult, even impossible to interpret [13, 17]. This diminishes the user’s ability to visually analyze the data and recognize trends, clusters, outliers or gaps within the data [13]. The problem of effectively representing these large, multidimensional datasets without overwhelming the user’s ability to comprehend the resulting images continues to pose a serious challenge to the area of visualization [16].

To face this challenge, recent studies have proposed a tighter coupling between visual and non-visual methods for data analysis and exploration [25, 33, 34]. In particular, incorporating data summarization techniques within visualization systems may enhance our ability to manage and visualize very large datasets of high dimensionality. Data summarization techniques can reduce the size and complexity of large multidimensional datasets to more manageable proportions. They can also highlight the relevant aspects of the data more clearly, leading to more coherent visualizations, and also facilitating more accurate and efﬁcient visual analysis.

Summarization is performed using various techniques. These techniques are designed for the automated and unsupervised analysis and exploration of raw data, followed by the generation of effective summaries based on the analysis [2, 3, 22, 24]. Unfortunately, many of these techniques are more suited to univariate (or low dimensional) data, and may not be effective in dealing with datasets of high dimensionality [1]. Also, many data mining and machine learning techniques are complicated. Both their intermediate as well as their ﬁnal results may not be easily understood by non-expert users. In this scenario, directly visualizing the results produced by these automated methods would provide neither any insight into the data, nor any understanding of the summarization process itself. For data summarization to be effective during visualization, a system should support the user in understanding as well as participating in the process.

In this report, we will focus on methods that could be enhanced and applied to summarize data within the context of visualization. The techniques we discuss include association rule mining [3, 4], outlier detection [1, 23, 34], clustering [22, 24], data classiﬁcation [18, 27], data aggregation [8, 13], and the principal component analysis (PCA) technique, among others.

1

The remainder of the paper is organized as follows. In Section 2, we discuss techniques for association rule mining. Section 3 is devoted to outlier detection algorithms. In Section 4, we describe clustering and clustering techniques. In Section 5, we investigate classiﬁcation techniques. Section 6 focuses on data aggregation. In Section 7, we discuss dimensionality reduction, and, in Section 8, we end with summarization techniques designed for spatial data. Finally, some conclusions are drawn in Section 9.

2 Association rule mining

Association rule mining techniques are primarily focused on the discovery of patterns and dependencies in datasets. An association rule is an expression of the form X ⇒ Y , where X and Y are sets of items. In a multidimensional dataset, an item could denote a particular attribute value, and X and Y could denote combinations of individual attribute values. For a multidimensional dataset D representing n data attributes A = {A1, ..., An}, where each data element ei ∈ D is a combination of n individual data values, one for each attribute, the expression X ⇒ Y signiﬁes that, if ei contains X then ei probably also contains Y .

Association rule mining makes use of two measures: support and conﬁdence. The support of a set of items X, denoted by supp(X), is the fraction of data elements in D that contain X. The support of an association rule X ⇒ Y , denoted by supp(X ⇒ Y ), is the fraction of the data elements in D that contain the conjunction X ∪ Y . The conﬁdence of such a rule, denoted by conf(X ⇒ Y ), is the fraction of the data elements in D containing X, that also contain Y [3]. Mathematically, supp(X ⇒ Y ) = supp(X ∪ Y ), and conf(X ⇒ Y ) = supp(X ∪ Y )/supp(X) [19].

Based on these measures, the task of association rule mining is to ﬁnd all association rules within a dataset D that satisfy certain user-deﬁned thresholds for minimum support and minimum conﬁdence, denoted by min-sup and min-conf respectively [4]. The approach to ﬁnding such rules generally involves two major steps. In the ﬁrst step, all sets of attribute values that satisfy min-sup within D are found. Next, these sets are used to generate rules of the form X ⇒ Y that satisfy min-conf[3].

Hipp, Guntzer and Nahaeizadeh present a general survey for association rule mining algorithms in [19]. They note that the most popular among association rule mining techniques are Apriori [4] and Apriori-based algorithms.

2.1 Apriori

Apriori, ﬁrst introduced by Agrawal et al., is an algorithm that is used to ﬁnd sets of items within a dataset that satisfy a user-deﬁned minimum support threshold [4]. We begin with a discussion of the terminologies relevant to the algorithm [4]. A set of k items is referred to as a k-itemset. A k-itemset X that has supp(X) ≥ min-sup is referred to as a large k-itemset. The set of all large k-itemsets is denoted by Lk. A candidate k-itemset or a k-candidate itemset is used to denote a potentially large k-itemset, and the set of all such k-candidate itemsets is

2

denoted by Ck. Apriori works in the following manner. It ﬁrst counts the number of occurrences of each

individual item in D to ﬁnd all large 1-itemsets L1. Each subsequent pass k involves two phases. In the ﬁrst phase Lk−1, the set of (k − 1)-itemsets found large in the previous pass, is used to generate a set of k-candidate itemsets Ck. In the second phase, D is scanned to compute the support of each of the k-candidate itemsets in Ck. Those that are found large (i.e. those that have support greater than or equal to min-sup ) are used to form Lk, the set of large k-itemsets.

The k-candidate itemsets in Ck are generated from Lk−1 through two steps. In the ﬁrst step, referred to as the join step, k-candidate itemsets are generated by merging all (k − 1)-itemsets X, Y ∈ Lk−1, which share their ﬁrst (k − 2) items, i.e. X.item1 = Y.item1, X.item2 = Y.item2, ..., X.itemk−2 = Y.itemk−2, X.itemk−1 = Y.itemk−1 [4]. In the second step, referred to as the prune step, all itemsets c ∈ Ck for which some (k − 1) subset of c is not in Lk−1 are deleted.

To illustrate the algorithm let us consider an example [4]. Let L1, the set of all large 1itemsets in a dataset D generated during the ﬁrst phase of the algorithm, contain the individual items { {1}, {2}, {3}, {5} }. To construct C2 from L1, all 2-itemset combinations of items from L1 are generated, producing C2 = { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}, }. In the second phase of the algorithm, suppose the itemsets {1 5} and {3 5}, are removed from C2 as they were found to have low support in D. The remaining itemsets deﬁne L2 = { {1 2}, {1 3}, {2 3}, {2 5} }. In the ﬁrst phase of the pass k = 3 L2 is used to generated C3. During the join step, {1 2} is merged with {1 3} to generate the itemset {1 2 3}. Likewise, {2 3} is merged with {2 5}to generate {2 3 5}. C3 now contains { {1 2 3}, {2 3 5} }. During the prune step, there are three k − 1 = 2-itemsets for candidate { 2 3 5}: {2 3}, {2 5} and {3 5}. Although {2 3} and {2 5} are contained in Lk−1, {3 5} is not. So the candidate itemset {2 3 5} is removed from C3. The candidate itemset { {1 2 3} } is accepted since {1 2}, {2 3} and {1 3} are all in Lk−1. C3 now contains only { {1 2 3} }. The remaining itemset in C3, if found large in the second phase of the current pass, becomes L3.

Once combinations of items that satisfy min-sup in D are identiﬁed, they are used to generate association rules. For each large itemset Z, all rules having the form X ⇒ (Z − X) are generated, where X is any subset of Z, and supp(Z)/supp(X) ≥ min-conf . Initially, for every large itemset Z, all rules having only one item on the right hand side of the expression X ⇒ (Z − X) are generated and tested. Rules that satisfy the minimum conﬁdence requirement are further expanded to build additional rules.

Another algorithm introduced in [4] called AprioriTID extends the Apriori-based approach. In this algorithm the entire dataset D is not used for counting support at the end of each pass. Instead, AprioriTID uses only those items that were found large from the previous pass in an effort to improve the efﬁciency.

Research on association rule mining was initially targeted at transaction databases. Subsequent work has revealed that association rules can also be effective over a wider range of applications and datasets. Visualizing the patterns, rules and dependencies discovered using association rule mining could simultaneously highlight the important characteristics of the data,

3

as well as help reduce the amount of information to be displayed.

2.2 Drawbacks

Association rule mining algorithms do have certain drawbacks, such as:

1. Algorithmic complexity: Increase in size and dimensionality of the data reduces the efﬁciency of the algorithm. This means that the algorithm may be unsuited to very large multidimensional datasets [19].

2. Determining usefulness of generated rules: Association rule mining of large multidimensional datasets could result in the generation of thousands of association rules. Evaluating the relevance and utility of these rules can be a complex task by itself.

3. Counting infrequent items: Association rule mining algorithms scan the entire dataset to discover itemsets that satisfy minimum (user-speciﬁed) support and conﬁdence thresholds. This process also includes counting support for itemsets that have low support within the dataset. This can also limit the efﬁciency of association rule mining techniques [4].

4. Handling spatial datasets: Techniques for association rule mining were designed to discover relationships in transaction data, in which attributes assume binary values i.e. either 0 or 1. In general however, data attributes assume values that cover a broad, continuous range of real numbers [36]. Counting the support of each individual attribute value can pose serious problems to association rule mining algorithms.

In the next section we deal with the problem of outlier detection in large multidimensional datasets.

3 Outlier Detection

Outliers can be deﬁned as “data elements which are very different from the rest of the data based on some measure” [1]. Outliers can also be described as data elements that deviate from other observations by so much that they arouse the suspicion of being produced by a mechanism different than that which generated the other observations [15]. The detection and visualization of outliers within large multidimensional datasets could provide insight into abnormalities in the underlying data.

There are many different approaches to outlier detection. The distance-based approach by Knorr and Ng discovers outliers using a full-dimensional distance metric between any two data elements in a high dimensional information space [23]. A data element ei is considered an outlier with respect to parameters k and λ, if at least a fraction k of the elements in a dataset D lie at a distance greater than λ from ei [23, 1]. Unfortunately, this technique is sensitive to the parameter λ: a large value of λ could result in few outliers, while a small value of λ could

4

result in a large number of outliers. Also, the technique may not be effective for datasets that

exhibit a complex structure, for instance, clusters of varying densities [6].

Breunig et al. suggest identifying outliers based on the densities of local neighborhoods.

The basic idea is simple: if some element ei is farther away from its surrounding elements, when compared to the neighbors ej that lie around it, then ei is marked as a local outlier. In this technique, outliers are found by comparing densities of different sets of data elements.

Elements which are “outlying” relative to their local neighborhoods are called local outliers [6]. For each element ei, its k nearest neighbors are identiﬁed, denoted by Nk(ei). If an element ej ∈ Nk(ei) is within a user-deﬁned distance λ from ei, then ej is said to lie in the local neighborhood of ei, and the reachability distance of ei with respect to ej, denoted by d(ei, ej), is set to λ. If ej lies at a distance greater than λ, then d(ei, ej) is set to the actual distance between ei and ej. The local reachability density of ei, denoted by δlocal(ei), is then calculated as the inverse of the average of d(ei, ej) for all ej ∈ Nk(ei), that is:

1

1

δlocal(ei) = k

d(ei, ej)

(1)

∀ej ∈Nk(ei)

Finally, the local outlier factor of ei with respect to Nk(ei), denoted by LOF (ei), is computed

as the average of the ratios of the local reachability density δlocal(ej) of every ej ∈ Nk(ei) to

δlocal(ei), that is:

LOF (ei) = 1

δlocal (ej )

(2)

k ∀ej ∈Nk(ei) δlocal(ei)

The outlier factor denotes the degree to which ei is considered to be a local outlier. If the outlier factor exceeds a user-chosen threshold, ei is marked as a local outlier.

Both the distance-based approach [23] and the density-based approach [6] are more suited to low dimensional datasets. The sparse nature of high dimensional data makes it difﬁcult to determine the locality of a data element [1]. Also, to determine local densities of elements in high dimensional data, a meaningful concept of distance is necessary. Calculating full-dimensional distances between data elements in high dimensional space can also be computationally expensive.

3.1 Evolutionary Algorithm Technique

One approach to ﬁnding outliers in high dimensional data is by examining data under different low dimensional projections of the data attributes. This approach is based on the observation that the density of data varies when examined under different attribute subsets, and that the attributes may contribute differently to the behavior of each data element [1]. In most cases, a data element is dependent only on a small subset of relevant data attributes. Meaningful distances in high-dimensional information space can be determined more effectively by using fewer, more relevant data attributes [18].

The approach adopted by Aggarwal and Yu is based on the principle that outliers are patterns that have a very low presence within the data. Such patterns could be found efﬁciently,

5

by examining low dimensional projections of the data which are abnormally sparse, i.e. that have extremely low density [1]. The density measure of an attribute subset is referred to as its sparsity coefﬁcient. Examining every lower-dimensional attribute subset to determine its sparsity can be done using a brute force approach. Unfortunately, as the dimensionality of the data increases, the search space of the attribute subsets also grows exponentially, making brute force exploration inefﬁcient.

To avoid this, Aggarwal and Yu suggest using an evolutionary algorithm technique. According to this theory, the limited amount of resources available in nature to all species creates competition among the species, and only the ﬁttest of them survive. Fitter individuals mate with each other more often, producing still better individuals. In a similar fashion, the evolutionary algorithm technique combines promising sparse low dimensional attribute projections, and also allows sparser subsets to combine with each other more frequently, to produce highly sparse lower dimensional attribute projections.

Let m be the total number of elements in D. For the purpose of calculating the sparsity coefﬁcient, each of the n attributes of D is divided into φ ranges, each of which contain an equal number of elements fm, where f = φ1 . Next, consider selecting a subset of d attributes, d < n, then choosing one subrange for each of these d attributes. If the data in D is uniformly distributed across each attribute, we may compute the expected fraction of elements f d that will have attribute values that fall within the selected ranges of the d-attribute subset we have constructed.

The total number of elements from D expected to lie within the d-attribute subset is therefore m · f d, and the standard deviation of this number is m · f d · (1 − f d). In practice however, attribute values are rarely uniformly distributed and independent. This means that the actual number of elements k in D that lie within the attribute subset is not m · f d. This difference is used to calculate the sparsity coefﬁcient [1]:

S(D) = k − m · f d (3) m · f d · (1 − f d)

Each d-attribute subset is encoded as a string recording the combination of attributes present in the subset. Each position in the encoding represents a particular data attribute, and each value at the attribute’s position represents the speciﬁc subrange selected for that attribute. These encodings are also known as solutions.

The evolutionary algorithm starts with a user-selected d and φ, which are used to select an initial set of p solutions S. During the selection step, the sparsity coefﬁcient of each solution in S is calculated. The more negative the sparsity coefﬁcient, the higher a solution’s rank. Next, a crossover step selects pairs of solutions from S, with higher ranked solutions having a higher probability of being chosen. l-attribute subsets from each solution pair (l < d) are identiﬁed. Those l-attribute subsets that are sparser are recombined, to produce a new pair of d-attribute projections. The new solutions replace the old pair that generated them. Finally, a mutation step picks random positions in the solution strings in S and replaces the value with a number between 1 and φ, with a certain probability. These three steps of selection, crossover and mutation are repeated until the sparsity coefﬁcients of the solutions stabilize.

6

Each of the ﬁnal solutions in S represents a pattern of data elements with an abnormally sparse projection of the high dimensional dataset. Data elements that match these attribute patterns are designated as outliers.

Although this technique can ﬁnd high dimensional outliers efﬁciently, it is dependent on the parameters d and φ. Also, in order to make use of evolutionary search methods, a good understanding of the problem is important. Designing speciﬁc algorithms for selection, recombination and mutation which work effectively for a given problem is often a non-trivial task [1].

4 Clustering

Given a dataset D of m data elements, a clustering algorithm divides D into groups or clusters where the data elements in the same cluster are similar to one another relative to data elements in different clusters [12]. Clustering can also be described as a method for classifying data in an exclusive manner, where each data element belongs to exactly one subset or cluster [20]. Clustering algorithms are useful in detecting underlying structure within the data.

In this section we describe two methods for clustering, namely, k-means and self organizing maps.

4.1 k-means Clustering

One of the most widely used clustering methods is the k-means clustering algorithm (KMC). kmeans can be formally described as follows: “Given a set of m data elements in D comprising of n attributes, and an integer k, determine a set of k elements in D called centers, so as to minimize the distance from every element to its nearest center [21].” Each element is attached to its nearest center, thereby subdividing elements in D into k clusters. This approach of decomposing a dataset into disjoint clusters is also known as “partitional clustering” [22].

KMC ﬁrst initializes a set of k cluster centers G ∈ D, i = 1, ..., k. Cluster centers can be assigned, for instance, in a random fashion. Once the centers are initialized, the clustering algorithm assigns each of the remaining, unselected data elements to the center that it is most similar to, i.e. the center that is closest in value. If c(ei) denotes the index of the center closet to a data element ei, then the goal of k-means clustering is to minimize the mean-squared distance between each ei and its nearest cluster center gc(ei). This distance or distortion error [21] is given by[22]:

Ek =

ei − gc(ei) 2

i∈D

When all the data elements have been grouped, the positions of each cluster center is recomputed based on the distances between the data elements within each cluster. The ei which is closest to all the elements within the cluster is assigned as the new cluster center. Once all cluster centers have been recomputed in this fashion, the remaining ei are reassigned to the new centers.

7

This process is repeated several times. At the end of each iteration, the recomputed cluster centers start to resemble the actual cluster centers more closely. The algorithm terminates when convergence is achieved, i.e. the distortion error does not improve signiﬁcantly.

KMC has some drawbacks [22]. The initial assignment of the cluster centers can affect the efﬁciency of convergence to the true cluster centers. Choosing an appropriate value for k is also signiﬁcant to the performance of the algorithm. Ideally, k should be as close as possible to the actual number of clusters present within the dataset. Recomputing centers during each iteration of the algorithm affects the efﬁciency of the technique, especially for large datasets of high dimensionality. Also, before clustering algorithms can be applied to large multidimensional data, analyzing whether the data exhibits a tendency to cluster is important.

4.2 Self Organizing Maps

Self organizing maps (SOM), ﬁrst introduced by Kohonen, are used to organize unstructured data much like the k-means clustering approach [24]. SOM-based algorithms can generate clusters from raw data. They can also produce lower dimensional projections of high dimensional data.

The SOM algorithm works on the principle of competitive learning, an adaptive process by which neurons in a neural network become more and more sensitive to different input categories. A self organizing map generally consists of a two dimensional network of neurons arranged in a grid. Initially, each cell is assigned a reference vector (or reference value) in a random fashion. Every data element ei from the input dataset is assigned to the neuron with reference vector that best represents ei. Locating the closest reference vector in the SOM approach, is very similar to searching for the closest center in the k-means clustering approach. The closest reference vector is called the winning vector, since the the neurons on the grid compete to learn (adapt to) the input data element. The winning vector is then updated to represent ei more closely. The reference vectors around the winning vector are also adjusted, with the amount a reference vector learns from the new data element dependent on how close its vector is to that of the new element. This process is repeated several times over the entire dataset until the reference vectors represent the data elements of D as accurately as possible.

Once the neurons arrange themselves so that further iterations do not produce any signiﬁcant changes to their positions, the algorithm is terminated. At this stage, the reference vectors in the grid tend to topologically arrange themselves such that adjacent cells on the grid represent similar data elements in the information space [25]. This property of a SOM wherein similar data elements are grouped to nearby reference vectors on the grid, is also referred to as topology preservation [9]. A SOM can be used to identify similarities and differences within an information space, and can hence serve as a clustering tool [25]. It also has the capability to generalize, and “learns” the data in an unsupervised fashion [9]. As with clustering techniques in general, it does not require any prior knowledge about the data.

However, SOM techniques are often computationally intensive. Efﬁciency of this technique also depends on the initial arrangement of reference vectors. Since the reference vectors are assigned in a random fashion initially, SOM based techniques produce different results each

8

time the technique is applied. Generating multiple maps may be necessary to choose the most effective map. Finally, SOM based techniques are susceptible to missing data values within the input data, which is a problem that both clustering and projection based algorithms also suffer from [22].

5 Data Classiﬁcation

Classiﬁcation techniques (also referred to as classiﬁers) generally require a training dataset T for which the attribute values and data classes of each element are known in advance. Based on this information, classiﬁcation techniques predict the class of unclassiﬁed data elements ei ∈ D [5]. The performance measure of the classiﬁer, referred to as its classiﬁcation accuracy, is the percentage of elements for which the classiﬁer makes correct class predictions [5].

5.1 Rule Set Based Classiﬁcation

Rule set based classiﬁcation algorithms apply a set of classiﬁcation rules generated from a training dataset to predict the class of an unclassiﬁed data element. Classiﬁcation rules are of the form X ⇒ c, where X (the antecedent) contains a combination of attribute value pairs, and c (the consequent) denotes a corresponding class.

Traditionally, machine learning algorithms such as ID3 and C4.5 are used for class prediction [31]. In these methods, T is partitioned into smaller and smaller disjoint subsets using distinct values of the data attributes. During the partitioning process, the attribute with the highest probability of distributing T evenly is selected from among the set of all attributes. T is then subdivided into disjoint subsets based on ranges of values from the selected attribute. Each of these subsets is then recursively subdivided in a similar fashion. The partitioning is represented using a decision tree, where each node represents an attribute or an attribute subrange. The process ends when all attributes have been used and the leaves of the tree contain only individual classes. The class assigned to each leaf is the one with the highest probability among the elements belonging to the attribute subrange represented by the parent of the leaf. In cases, where each of the classes is equally probable, a class is chosen at random and assigned to the leaf. The decision tree constructed from T is then used to classify a new dataset. Each path from the root of the tree to a leaf represents a classiﬁcation rule. New unclassiﬁed data elements ei ∈ D are then matched against these classiﬁcation rules.

Rule-based classiﬁcation has certain drawbacks, however. The way the data is subdivided may not reﬂect its actual distribution [30]. Classiﬁers select a single attribute at a time to partition the subset of data, and do not examine attribute combinations for possible relevance. This can lead to a sub-optimal partitioning, especially in high dimensional data space, where not all attributes are of equal importance. Also, constructing decision trees for large multidimensional datasets can be computationally intensive. Decision tree classiﬁers will fail to classify data elements with missing attribute values, especially if the missing values involve the attribute that forms the root of the decision tree.

9

## Categories

## You my also like

### SumTitles: a Summarization Dataset with Low Extractiveness

272.5 KB63.4K17.8K### An Introduction to Methods for Analyzing Large Neuroimaging

78.9 KB1.9K720### The Geoscience Data Journal: collaboration between data

35.4 KB44.6K12.9K### Fides: Towards a Platform for Responsible Data Science

366.6 KB42.4K12.3K### Multivariate Data Analysis 6e Datasets Documentation

56.2 KB21.7K9.8K### Data Wrangling Report

233.3 KB34.3K12.7K### End User Guide: SAP Lumira Discovery

2.8 MB4K593### COBOL Bootcamp for Swedbank

307.7 KB19.3K4.4K### RGBD Datasets: Past, Present and Future

1.5 MB5.8K1.4K