Data Summarization
Alsayed Algergawy
Presented at the group journal club, 20 December 2018
Outline
1. Data summarization: a survey. Mohiuddin Ahmed. Knowledge and Information Systems, 2018, 1-25. https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
2. Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings. Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan. CoRR abs/1804.01503 (2018). https://arxiv.org/abs/1804.01503
Data summarization is the process of creating a concise, yet informative, version of the original data. The terms "concise" and "informative" are generic and depend on the application domain.
Summarization is not compression!
- Compression is a syntactic method for reducing the data; summarization, in contrast, uses the semantic content of the data.
- Compression makes the data non-intelligible, whereas summarization keeps the data intelligible for further data analysis and decision making.
Role of data summarization
- Intelligent analysis of data is a challenging task in many domains.
- In practice, datasets are large, and the time required for data analysis grows with data size.
- A summary of the large data is easier and faster to analyze.
Taxonomy of data summarization
Summarization of unstructured data
Text summarization combines the following processes:
- Extraction: finds the key phrases or sentences and produces a summary.
- Abstraction: produces the key information in a new way.
- Fusion: extracts important parts from the text and combines them coherently.
- Compression: discards irrelevant or unimportant text.
The frequency and position of a particular word are useful measures for identifying its significance; a simple frequency-based extractor is sketched below.
Machine learning (ML) approaches for text summarization started in the 1990s: Naive Bayes classifier, decision tree, hidden Markov model (HMM), artificial neural network (ANN), topic modeling.
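As an illustration of the extraction process, here is a minimal sketch of frequency-based extractive summarization. The tokenizer and scoring heuristic are illustrative choices of mine, not a method prescribed by the survey.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences by word frequency."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        # Total frequency of the sentence's words, normalized by
        # sentence length so long sentences are not favored.
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original ordering of the selected sentences.
    return ' '.join(s for s in sentences if s in ranked)
```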
Summarization of structured data
Statistical techniques
- Aggregation (defined for numerical values): estimates the statistical distribution of the data, which can be used to approximate patterns in the dataset.
- Sampling: a sample is a subset of the dataset. Sampling is a popular choice for reducing the input data in data mining and machine learning. Variants include simple random sampling, stratified random sampling, systematic sampling, cluster random sampling, and multi-stage random sampling.
Semantic-based techniques: linguistic summary, attribute-oriented induction, fascicles
Machine learning summarization: frequent itemsets, clustering
Summarization of structured data: sampling
- Stratified random sampling: the dataset is divided into non-overlapping subsets, called strata. The sampling scheme selects a random element from each stratum, producing a stratified sample.
- Systematic sampling: data instances are sampled from a specified starting point to the end of the dataset, at equal intervals. The interval is calculated as the dataset size divided by the sample size, rounded up: interval = ⌈|D| / |S|⌉. For example, if the first random instance is at location 2 (the starting point) and the interval is 5, then a sample of size 3 takes the instances at the 2nd, 7th, and 12th locations.
- Cluster random sampling: the whole dataset is organized into groups (clusters); groups are randomly selected according to the sampling rate, and all members of the selected groups are included. A sketch of two of these schemes follows.
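The sketch below implements systematic and stratified random sampling, assuming the dataset is a plain Python list (stratified sampling additionally assumes a labeling function that assigns each instance to a stratum).

```python
import math
import random

def systematic_sample(data, sample_size):
    # Interval is the dataset size divided by the sample size, rounded up.
    interval = math.ceil(len(data) / sample_size)
    start = random.randrange(interval)
    return data[start::interval][:sample_size]

def stratified_sample(data, stratum_of):
    # Partition the data into non-overlapping strata, then draw one
    # random element from each stratum.
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    return [random.choice(members) for members in strata.values()]

# With 15 instances and sample size 3 the interval is 5; if the random
# start happens to be 2, locations 2, 7, and 12 are drawn.
data = list(range(15))
print(systematic_sample(data, 3))
```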
Evaluation metrics
- Human-based evaluation
- Conciseness: the ratio of the input dataset size to the summarized dataset size, Conciseness = D / S
- Information loss: the fraction of distinct values of the original data that are missing from the summary, Information loss = L / T
- Interestingness: combines conciseness, information loss, and the coverage n of the summary
where:
- S: summarized dataset size
- D: input dataset size
- T: the number of distinct values present in the original data (D)
- L: the difference between the number of distinct values present in the summary and in the original data
- n: how many of the data instances in the original dataset are covered by the summary
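A toy computation of these metrics, using the formulas as reconstructed above; the data values are hypothetical, and only conciseness, information loss, and coverage are computed (the survey's exact interestingness formula combines such quantities and is not reproduced here).

```python
D = [1, 1, 2, 2, 3, 4, 5, 5]   # input dataset
S = [1, 2, 5]                  # its summary

conciseness = len(D) / len(S)              # D / S = 8 / 3
T = len(set(D))                            # distinct values in D: 5
L = len(set(D)) - len(set(S) & set(D))     # distinct values lost: 2
information_loss = L / T                   # 2 / 5 = 0.4
n = sum(1 for x in D if x in set(S))       # covered instances: 6
print(conciseness, information_loss, n / len(D))
```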
Outline
1. Data summarization: a survey. Mohiuddin Ahmed. Knowledge and Information Systems, 2018, 1-25. https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
2. Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings. Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan. CoRR abs/1804.01503 (2018). https://arxiv.org/abs/1804.01503. Code: https://github.com/NewKnowledge/duke
DUKE: Dataset Understanding via Knowledge-base Embeddings
- Objective: develop a method for summarizing the content of tabular datasets.
- Assumption: the dataset contains descriptive text in headers, columns, and/or some augmenting metadata.
- Methodology: employ a knowledge-base semantic embedding to generate the summary. The embedding is used to recommend a subject/type for each text segment; the recommendations are then aggregated into a small collection of super-types considered descriptive of the dataset, exploiting the hierarchy of types in a pre-specified ontology.
- Evaluation: using the February 2015 Wikipedia dump as the knowledge base and a corresponding DBpedia ontology as the set of types, a set of experiments is carried out on open data taken from several sources (OpenML, CKAN, and data.world).
Approach: definitions
- Distributional semantics: a concept that has recently been widely employed as a natural language processing (NLP) tool to embed various NLP concepts into vector spaces: "Words occurring in similar (linguistic) contexts tend to be semantically similar."
- Word2vec [1]: word2vec models use a large corpus of documents to build a vector space mapping words to points, where proximity implies semantic similarity. This makes it possible to compute distances between words in the dataset and the set of types in the ontology.
- Wiki2vec [2]: a wiki2vec model is a word2vec model trained on a corpus of Wikipedia KB documents. Training on this data ensures that the types of the DBpedia ontology are included in the model's vocabulary, and increases the likelihood that topics are discussed in context with their super-types.
[1] https://code.google.com/archive/p/word2vec/
[2] https://github.com/idio/wiki2vec
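A minimal sketch of querying such a model with gensim; the model file path and the example words are placeholders, assuming a pretrained wiki2vec model saved in word2vec format.

```python
from gensim.models import KeyedVectors

MODEL_PATH = "enwiki_word2vec.bin"  # hypothetical pretrained wiki2vec file
kv = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)

# Cosine similarity: semantically related words score higher.
print(kv.similarity("tree", "plant"))       # expected: relatively high
print(kv.similarity("tree", "automobile"))  # expected: relatively low

# Distance (1 - similarity) from a dataset word to a set of ontology types.
types = ["plant", "animal", "place", "device"]
distances = {t: kv.distance("oak", t) for t in types}
```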
Distributional semantics (illustrative figure)
Generating type recommendations
The method for summarizing a tabular dataset breaks down into three distinct steps:
1. Collect a set of types and an ontology to use for abstraction.
2. Extract the text data from the tabular dataset, embed it into a vector space, and calculate its distance to all the types in the ontology.
3. Aggregate the distance vectors of every keyword in the dataset into a single vector of distances.
Type ontology
- To generate an abstract term describing the dataset, an ontology of types is collected from which a descriptive term can be selected.
- To this end, the ontology provided by DBpedia [1] is used, which contains approximately 400 defined types, covering everything from sound to video game and historic place.
- DBpedia also defines parent-child relationships between the types, which can be used to build a complete hierarchy of types, e.g. tree is a sub-type of plant, which is a sub-type of eukaryote.
- Handling probabilistic data
[1] http://downloads.dbpedia.org/2015-10
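One simple way to represent such a hierarchy is a parent map, sketched below; it assumes the parent-child pairs have already been extracted from the ontology, and the three types shown are the slide's own example.

```python
# Child type -> parent type (a tiny excerpt; the real ontology
# contains roughly 400 types).
parent = {
    "tree": "plant",
    "plant": "eukaryote",
}

def ancestors(t):
    """Walk up the hierarchy from a type to the root."""
    chain = []
    while t in parent:
        t = parent[t]
        chain.append(t)
    return chain

print(ancestors("tree"))  # ['plant', 'eukaryote']
```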
Word embedding
- With the set of types collected, each word is extracted from the dataset, embedded in the wiki2vec vector space, and its distance to every type in the ontology is calculated.
- If a single cell in a column contains more than one word, the average of the corresponding embedded vectors is taken.
- This results in a collection of distance vectors representing all the text in the dataset. The vectors are collected according to their source within the dataset, i.e. words from the same column are gathered into a per-column matrix of distances.
- If column headers are provided, they are treated as an additional column of the dataset.
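A sketch of this step for one column, assuming `kv` is the wiki2vec KeyedVectors model loaded earlier and `types` the list of ontology type labels; function names are mine, not DUKE's.

```python
import numpy as np

def embed_cell(cell, kv):
    # Average the vectors of the words in a (possibly multi-word) cell.
    vecs = [kv[w] for w in cell.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else None

def column_distance_matrix(column, types, kv):
    type_vecs = np.array([kv[t] for t in types])
    rows = []
    for cell in column:
        v = embed_cell(cell, kv)
        if v is not None:
            # Cosine similarity of the cell to every type, as distance.
            rows.append(1.0 - kv.cosine_similarities(v, type_vecs))
    return np.array(rows)  # shape: (cells, types)
```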
Distance aggregation
- After the previous step we have a set of matrices containing the distances between every text segment in the dataset and the set of types. The goal of this step is to reduce them to a single vector of distances. To this end, three successive aggregations are used to compute this final vector.
- 1st aggregation: computed across the rows of each column matrix, producing a single vector of distances between the column and all types.
Distance aggregation (cont.)
- 2nd aggregation (tree aggregation): the distance vector of a column is revisited using the hierarchy of types described by DBpedia in order to update the score of each type. For instance, the score of means of transportation can be updated based on the scores of airplane, train, and automobile.
- 3rd aggregation: performed over the set of per-column distance vectors, producing a single vector of distances to every defined type.
Aggregation function selection
- To select the best function for each aggregation, a collection of datasets labelled with types from the ontology is used as a sort of 'training set'. For each labelled dataset and each combination of aggregation functions, the percentage of true labels found in the top three labels predicted by DUKE is computed.
- Using the mean for column aggregation, mean-max for tree aggregation, and the mean again for the final dataset aggregation step produces the best results; a sketch of this combination follows.
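The sketch below wires the three aggregations together with the best-performing functions. The inputs are hypothetical: `col_matrices` is a list of per-column (cells x types) distance matrices, and `children` maps a type's index to the indices of its child types. Interpreting "mean-max" as averaging a parent's score with its best child is my assumption; note that since these are distances, the "max" over similarities corresponds to a minimum over distances.

```python
import numpy as np

def aggregate(col_matrices, children):
    dataset_vecs = []
    for m in col_matrices:
        # 1st aggregation: mean over the rows of each column matrix.
        col = m.mean(axis=0)
        # 2nd aggregation (tree): pull a parent's distance toward the
        # mean of itself and its best (minimum-distance) child.
        for parent_idx, child_idxs in children.items():
            if child_idxs:
                best_child = min(col[i] for i in child_idxs)
                col[parent_idx] = np.mean([col[parent_idx], best_child])
        dataset_vecs.append(col)
    # 3rd aggregation: mean over all per-column vectors.
    return np.mean(dataset_vecs, axis=0)
```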
Summary
- Data summarization: unstructured data, structured data, evaluation
- DUKE: Dataset Understanding via Knowledge-base Embeddings