2. Outline
Data summarization: a survey
– 1Mohiuddin Ahmed, Knowledge and Information Systems, 2018, 1-25
Abstractive Tabular Dataset Summarization via Knowledge Base Semantic
Embeddings:
– 2Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca
Ruppel, Sandeep Verma, Jonathon Morgan: CoRR abs/1804.01503 (2018)
1https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
2https://arxiv.org/abs/1804.01503
3. Data summarization
Data summarization is the process of creating a concise, yet informative, version of the original data.
– The terms concise and informative are quite generic and depend on the application
domain
Summarization is not compression!
– Compression is a syntactic method for reducing the data
– In contrast, summarization uses the semantic content of the data
– Compression makes the data non-intelligible, whereas summarization keeps the data
intelligible for further data analysis and decision making
Role of data summarization
– Intelligent analysis of data is a challenging task in many domains. In reality, the
volume of datasets is quite high, and the time required to perform data analysis
increases with data size.
– A summary of a large dataset is easier and faster to analyze
5. Summarization of unstructured data
Text summarization combines the following processes:
– Extraction: Finds the key phrases or sentences and produces a summary,
– Abstraction: Produces the key information in a new way,
– Fusion: Extracts important parts from the text and combines them
coherently,
– Compression: Discards irrelevant or unimportant text.
The frequency and position of a particular word are useful measures for
identifying its significance
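As a minimal illustration of the frequency idea, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences; the scoring rule is an assumption for illustration, not a method prescribed by the survey.

```python
# Frequency-based extractive summarization sketch (illustrative scoring rule).
from collections import Counter
import re

def summarize(text, n_sentences=2):
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(s):
        # Average word frequency: frequent words signal significance.
        toks = re.findall(r'[a-z]+', s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original sentence order in the summary.
    return ' '.join(s for s in sentences if s in ranked)
```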
Machine Learning (ML) approaches for text summarization started in
the 1990s:
– Naive–Bayes classifier, Decision tree, Hidden Markov model (HMM), Artificial
neural network (ANN), Topic modeling
6. Summarization of structured data
Statistical techniques
– Aggregation (defined for numerical values): estimates the statistical distribution of
the data, which can be used to approximate patterns in the dataset
– Sampling:
A sample is a subset of the dataset
Sampling is a popular choice for reduction of input data in data mining and machine
learning techniques
Variants: simple random sampling, stratified random sampling, systematic
sampling, cluster random sampling, multi-stage random sampling
Semantic-based: linguistic summaries, attribute-oriented induction, fascicles
Machine learning summarization: frequent itemsets, clustering
7. Summarization of structured data
Sampling:
– Stratified random sampling:
The dataset is divided into non-overlapping subsets, called strata.
The sampling scheme selects a random element from each stratum and
produces a stratified sample.
– Systematic sampling:
data instances are sampled at equal intervals, beginning from a specified
starting point and continuing to the end of the dataset.
For example, if the first random instance’s location is 2 (starting point) and the
interval value is 5, then for a sample of size 3, the sample instances are from
the 2nd, 7th and 12th locations, respectively.
The interval is calculated as the size of the data divided by the size of the
sample, rounded up: interval = ⌈|data| / |sample|⌉.
– Cluster random sampling:
The whole dataset is organized into groups (clusters);
groups are randomly selected according to sampling rate, and all
members of the selected groups are selected.
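The systematic sampling scheme described above can be sketched as follows (the helper name and data are illustrative):

```python
# Systematic sampling sketch: interval k = ceil(|data| / sample_size),
# then take every k-th instance starting from a random position.
import math
import random

def systematic_sample(data, sample_size, start=None):
    k = math.ceil(len(data) / sample_size)   # sampling interval
    if start is None:
        start = random.randrange(k)          # random starting point in [0, k)
    return data[start::k][:sample_size]

# With 15 items, sample size 3 gives interval 5; starting at index 1
# (the 2nd location) the sample is taken from the 2nd, 7th and 12th locations,
# matching the example above.
```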
8. Evaluation metrics
Human-based
Conciseness
Information loss
Interestingness
where:
– S: summarized dataset size
– D: input dataset size
– T: the number of distinct values present in the original data (D)
– L: the difference between the number of distinct values present in the summary
and in the original data
– n: the number of data instances in the original dataset covered by the
summary
9. Outline
Data summarization: a survey
– 1Mohiuddin Ahmed, Knowledge and Information Systems, 2018, 1-25
Abstractive Tabular Dataset Summarization via Knowledge Base Semantic
Embeddings:
– 2Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca
Ruppel, Sandeep Verma, Jonathon Morgan: CoRR abs/1804.01503 (2018)
1https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
2https://arxiv.org/abs/1804.01503
https://github.com/NewKnowledge/duke
10. DUKE: Dataset Understanding via Knowledge-base Embeddings
Objective: to develop a method for summarizing the content of tabular
datasets
Assumption: the dataset contains descriptive text in headers, columns and/or
some augmenting metadata
Methodology:
– Employ a knowledge-base semantic embedding to generate the summary.
– Use the embedding to recommend a subject/type for each text segment.
– Aggregate the recommendations into a small collection of super-types considered
descriptive of the dataset by exploiting the hierarchy of types in a
pre-specified ontology.
Evaluation: using the February 2015 Wikipedia dump as the knowledge base and a
corresponding DBpedia ontology as the set of types, a set of experiments is
carried out on open data taken from several sources (OpenML, CKAN, and data.world)
11. Approach
Definitions:
Distributional semantics: a concept that has recently been widely employed in
natural language processing (NLP) to embed various NLP concepts into
vector spaces.
– “Words occurring in similar (linguistic) contexts tend to be semantically similar”
Word2vec1: Word2vec models utilize a large corpus of documents to build
a vector space mapping words to points in a space, where proximity implies
semantic similarity. This makes it possible to calculate distances between words
in the dataset and the set of types in the ontology.
Wiki2vec2: a wiki2vec model is a word2vec model trained on a
corpus of Wikipedia KB documents. Training on this data ensures that the list
of types in the DBpedia ontology is included in the vocabulary of the model,
and increases the likelihood that topics are discussed in context with their
super-types.
1https://code.google.com/archive/p/word2vec/
2https://github.com/idio/wiki2vec
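A toy illustration of "proximity implies semantic similarity": cosine distance between vectors. The hand-made 3-d vectors below stand in for a real wiki2vec model, where words and DBpedia types share one space; all values are made up.

```python
import numpy as np

# Hypothetical embeddings; in DUKE these come from a trained wiki2vec model.
toy_vecs = {
    "train":    np.array([0.9, 0.1, 0.0]),
    "banana":   np.array([0.0, 0.1, 0.9]),
    "MeanOfTransportation": np.array([0.85, 0.15, 0.05]),  # a DBpedia type
}

def cosine_distance(u, v):
    # 0 for identical directions, growing as vectors diverge.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# A word related to the type sits closer to it than an unrelated word.
d_train  = cosine_distance(toy_vecs["train"],  toy_vecs["MeanOfTransportation"])
d_banana = cosine_distance(toy_vecs["banana"], toy_vecs["MeanOfTransportation"])
```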
14. Generating Type Recommendations
The method for summarizing a tabular dataset can be broken
down into three distinct steps:
1. Collect a set of types and an ontology to use for abstraction
2. Extract any text data from the tabular dataset and embed it
into a vector space to calculate the distance to all the types in
our ontology
3. Aggregate the distance vectors for every keyword in the
dataset into a single vector of distances
15. Type Ontology
In order to generate an abstract term describing the dataset,
– an ontology of types from which a descriptive term can be selected is collected.
– To this end, an ontology provided by DBpedia1 is used, which
contains approximately 400 defined types, including everything
from sound to video game and historic place.
– DBpedia also contains defined parent-child relationships for the
types, which can be used to build a complete hierarchy of types, e.g.
tree is a sub-type of plant, which is a sub-type of eukaryote.
Handling probabilistic data
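The parent-child relations can be illustrated with a small hypothetical fragment of such a hierarchy, stored as a child-to-parent map; walking up the chain recovers a type's super-types:

```python
# Hypothetical fragment of DBpedia parent-child relations (child -> parent).
parents = {
    "Tree": "Plant",
    "Plant": "Eukaryote",
    "Automobile": "MeanOfTransportation",
}

def supertypes(t):
    # Walk up the hierarchy until a root type is reached.
    chain = []
    while t in parents:
        t = parents[t]
        chain.append(t)
    return chain
```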
1http://downloads.dbpedia.org/2015-10
16. Word Embedding
With the set of types collected:
– extract each word from the dataset,
– embed it in a wiki2vec vector space and calculate the distance
between that word and every type in the ontology.
– If a single cell in a column contains more than one word, take the
average of the corresponding embedded vectors.
– This results in a collection of distance vectors representing all text in
the dataset.
– Collect the vectors according to their source within the dataset, i.e.
words in the same column are collected into a matrix of distances
for each column.
– If column headers are provided, treat them as an additional column
in the dataset.
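The embedding steps above can be sketched with toy vectors; the words, types and dimensions are illustrative, not the real wiki2vec space:

```python
import numpy as np

# Hypothetical 2-d word vectors; a real model would supply these.
word_vecs = {
    "fire":  np.array([0.7, 0.3]),
    "truck": np.array([0.9, 0.1]),
}
# One row per ontology type, e.g. MeanOfTransportation and Plant.
type_vecs = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

def embed_cell(cell):
    # A multi-word cell is the average of its word vectors.
    vecs = [word_vecs[w] for w in cell.split() if w in word_vecs]
    return np.mean(vecs, axis=0)

def distances_to_types(vec):
    # Cosine distance from one embedded cell to every type.
    sims = type_vecs @ vec / (np.linalg.norm(type_vecs, axis=1) * np.linalg.norm(vec))
    return 1.0 - sims

cell_vec = embed_cell("fire truck")   # mean of the two word vectors
dists = distances_to_types(cell_vec)  # one distance per type
```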
17. Distance Aggregation
The previous step yields a set of matrices containing distances between every
text segment in the dataset and the set of types.
– The goal of this step is to reduce them to a single
vector of distances.
To this end, three successive aggregations are used to compute this
final vector.
The 1st aggregation is computed across the rows of each column
matrix in order to produce a single vector of distances between the
column and all types.
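This first aggregation can be sketched as a row-wise mean over a column's distance matrix (the numbers are illustrative, and the actual aggregation function is chosen empirically):

```python
import numpy as np

# One column's matrix: shape (n_cells, n_types), cosine distances per cell.
column_matrix = np.array([[0.2, 0.9],   # cell 1: distance to type A, type B
                          [0.4, 0.7],   # cell 2
                          [0.3, 0.8]])  # cell 3

# Mean across the rows -> one distance per type for the whole column.
column_vector = column_matrix.mean(axis=0)
```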
18. Distance Aggregation
The 2nd aggregation, called tree aggregation, revisits the vector of
distances for a column using the hierarchy of types
described by DBpedia in order to update the score for each type.
– For instance, the score for means of transportation can be updated based on
the scores for airplane, train, and automobile.
The 3rd aggregation is performed over the set of distance vectors computed
for each column, producing a single vector of distances to every defined
type.
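Tree aggregation can be sketched as updating a parent type's distance from its children's distances. The update rule below (average the parent with its best child) and all scores are assumptions for illustration; DUKE selects its aggregation functions empirically.

```python
# Tree aggregation sketch: pull a parent's distance toward its best child's.
def tree_aggregate(scores, children, parent):
    best_child = min(scores[c] for c in children[parent])  # smallest distance
    scores[parent] = (scores[parent] + best_child) / 2     # assumed update rule
    return scores

children = {"MeanOfTransportation": ["Airplane", "Train", "Automobile"]}
scores = {"MeanOfTransportation": 0.6,
          "Airplane": 0.2, "Train": 0.3, "Automobile": 0.5}
scores = tree_aggregate(scores, children, "MeanOfTransportation")
# The parent's distance drops toward that of its best-matching child.
```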
19. Aggregation Function Selection
To select the best function for each aggregation, a collection of datasets
labelled with types from the ontology is used as a sort of 'training set'.
Then, for each labelled dataset and each combination of aggregation
functions, the percentage of true labels found in the top
three labels predicted by DUKE is computed.
Using mean for column aggregation, mean-max tree aggregation, and then
mean for the final dataset aggregation step produces the best results.
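The selection criterion can be sketched as a top-3 hit rate over labelled datasets; the labels and ranked predictions below are made up for illustration:

```python
# Fraction of datasets whose true type appears in the top three predictions.
def top3_hit_rate(true_labels, ranked_predictions):
    hits = sum(t in preds[:3] for t, preds in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)

true_labels = ["Plant", "Automobile"]
ranked_predictions = [["Plant", "Tree", "Eukaryote"],   # hit
                      ["Airplane", "Train", "Boat"]]    # miss
rate = top3_hit_rate(true_labels, ranked_predictions)
```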
20. Summary
Data summarization
– Unstructured data
– Structured data
– Evaluation
DUKE: Dataset Understanding via Knowledge-base Embeddings