Data Summarization
Alsayed Algergawy
Presented at the group journal club
20 December, 2018
Outline
 Data summarization: a survey
– [1] Mohiuddin Ahmed, Knowledge and Information Systems, 2018, 1-25
 Abstractive Tabular Dataset Summarization via Knowledge Base Semantic
Embeddings
– [2] Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca
Ruppel, Sandeep Verma, Jonathon Morgan: CoRR abs/1804.01503 (2018)
[1] https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
[2] https://arxiv.org/abs/1804.01503
Data summarization
 Data summarization is the process of creating a concise, yet informative, version of the original data.
– The terms "concise" and "informative" are generic; their meaning depends on the application
domain
 Summarization is not compression
– Compression is a syntactic method for reducing the data
– In contrast, summarization uses the semantic content of the data
– Compression makes the data unintelligible, whereas summarization keeps the data
intelligible for further analysis and decision making
 Role of data summarization
– Intelligent analysis of data is a challenging task in many domains. In practice,
datasets are large, and the time required to analyze them
increases with data size.
– A summary of a large dataset is easier and faster to analyze
Taxonomy of data summarization
Summarization of unstructured data
 Text summarization combines the following processes:
– Extraction: finds the key phrases or sentences and produces a summary from them
– Abstraction: restates the key information in a new way
– Fusion: extracts important parts of the text and combines them
coherently
– Compression: discards irrelevant or unimportant text
 The frequency and position of a word are useful measures of
its significance
 Machine learning (ML) approaches to text summarization date from
the 1990s:
– Naive Bayes classifier, decision tree, hidden Markov model (HMM), artificial
neural network (ANN), topic modeling
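The extraction process and the word-frequency measure above can be illustrated with a minimal sketch. It is not taken from the survey: the stop-word list, the function names and the scoring rule (average word frequency per sentence) are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "on"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, k=1):
    """Return the k sentences with the highest average word frequency,
    in the order in which they appear in the text."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `summarize(text, k=2)` keeps the two sentences whose words are most frequent in the whole text. Word position, which the slide also mentions, is not used in this sketch.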
Summarization of structured data
 Statistical techniques
– Aggregation (defined for numerical values): estimates the statistical distribution of
the data, which can be used to approximate the patterns in the dataset
– Sampling:
 A sample is a subset of the dataset
 Sampling is a popular way to reduce the input data in data mining and machine
learning
 Variants: simple random sampling, stratified random sampling, systematic sampling, cluster
random sampling, multi-stage random sampling
 Semantic-based techniques: linguistic summaries, attribute-oriented induction, fascicles
 Machine learning techniques: frequent itemsets, clustering
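As a rough illustration of aggregation, the sketch below reduces one numerical column to a handful of descriptive statistics. The row layout (a list of dictionaries) and the function name `aggregate` are assumptions made for this example, not part of the survey.

```python
import statistics


def aggregate(rows, column):
    """Summarize one numerical column by a few descriptive statistics."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Population standard deviation: the rows are treated as the full dataset.
        "stdev": statistics.pstdev(values),
    }
```

The returned dictionary is far smaller than the column it describes, but it only makes sense for numerical attributes, as the slide notes.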
Summarization of structured data
 Sampling:
– Stratified random sampling:
 The dataset is divided into non-overlapping subsets, called strata.
 The sampling scheme selects a random element from each stratum to
produce a stratified sample.
– Systematic sampling:
 Data instances are sampled from the dataset at equal intervals, beginning from a specified
starting point and continuing to the end.
 For example, if the first instance's location is 2 (the starting point) and the
interval is 5, then a sample of size 3 consists of the instances at
the 2nd, 7th and 12th locations.
 The interval is calculated as $\lceil \text{size of data} / \text{size of sample} \rceil$, i.e. the ratio rounded up.
– Cluster random sampling:
 The whole dataset is organized into groups (clusters).
 Groups are randomly selected according to the sampling rate, and all
members of the selected groups are included in the sample.
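The three sampling schemes can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `stratum_of` and `cluster_of` key functions, and the reading of the interval as size of data divided by size of sample (rounded up) are all choices made for this example.

```python
import math
import random


def stratified_sample(data, stratum_of, rng):
    """Group items into strata and pick one random element from each stratum."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    return [rng.choice(members) for members in strata.values()]


def systematic_sample(data, sample_size, start):
    """Take items at equal intervals, beginning at the 1-based position `start`."""
    interval = math.ceil(len(data) / sample_size)
    return data[start - 1::interval][:sample_size]


def cluster_sample(data, cluster_of, rate, rng):
    """Pick a fraction `rate` of the clusters at random and keep all their members."""
    clusters = {}
    for item in data:
        clusters.setdefault(cluster_of(item), []).append(item)
    n_chosen = max(1, round(rate * len(clusters)))
    chosen = rng.sample(sorted(clusters), n_chosen)
    return [item for key in chosen for item in clusters[key]]
```

With 15 items, a sample size of 3 and a starting point of 2, `systematic_sample` uses an interval of 5 and returns the 2nd, 7th and 12th items, matching the example on the slide.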
Evaluation metrics
 Human-based
 Conciseness
 Information loss
 Interestingness
where:
– S: size of the summarized dataset
– D: size of the input dataset
– T: number of distinct values present in the original data (D)
– L: difference between the number of distinct values present in the summary
and in the original data
– n: number of data instances in the original dataset that are covered by the
summary
DUKE: Dataset Understanding via Knowledge-base
Embeddings
 Objective: to develop a method for summarizing the content of tabular
datasets
 Assumption: the dataset contains descriptive text in headers, columns and/or
some augmenting metadata
 Methodology:
– A knowledge base semantic embedding is used to generate the summary.
– The embedding is used to recommend a subject/type for each text segment.
– Recommendations are aggregated into a small collection of super types considered
to be descriptive of the dataset by exploiting the hierarchy of types in a
pre-specified ontology.
 Evaluation: using February 2015 Wikipedia as the knowledge base and a
corresponding DBpedia ontology as types, a set of experiments is carried out on
open data taken from several sources (OpenML, CKAN and data.world)
 Code: https://github.com/NewKnowledge/duke
Approach
Definitions:
 Distributional semantics: a concept that has recently been widely used in
natural language processing (NLP) to embed various NLP concepts into
vector spaces.
– "Words occurring in similar (linguistic) contexts tend to be semantically similar"
 Word2vec [1]: word2vec models use a large corpus of documents to build
a vector space that maps words to points, where proximity implies
semantic similarity. This makes it possible to calculate distances between words in the
dataset and the set of types in the ontology.
 Wiki2vec [2]: a wiki2vec model is a form of word2vec model trained on a
corpus of Wikipedia KB documents. Training on this data ensures that the
types in the DBpedia ontology are included in the vocabulary of the model,
and increases the likelihood that topics are discussed in context with their
supertypes.
[1] https://code.google.com/archive/p/word2vec/
[2] https://github.com/idio/wiki2vec
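The idea that proximity in the vector space implies semantic similarity can be illustrated with cosine similarity. The sketch below does not load a real word2vec or wiki2vec model: the three-dimensional vectors are invented toy values, and the names `EMBEDDINGS` and `nearest` are hypothetical.

```python
import math

# Toy vectors invented for illustration; a real model would supply
# vectors with hundreds of dimensions.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "train": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, candidates):
    """Return the candidate whose vector is most similar to that of `word`."""
    return max(candidates, key=lambda c: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[c]))
```

With these toy values, `nearest("car", ["train", "banana"])` selects `"train"`, because the two vectors point in almost the same direction.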
Distributional semantics
Generating Type Recommendations
 The method for summarizing a tabular dataset can be broken
down into three distinct steps:
1. Collect a set of types and an ontology to use for abstraction
2. Extract any text data from the tabular dataset and embed it
into a vector space to calculate the distance to all the types in
our ontology
3. Aggregate the distance vectors for every keyword in the
dataset into a single vector of distances
Type Ontology
 To generate an abstract term that describes the dataset:
– An ontology of types is collected, from which a descriptive term is selected.
– To this end, the ontology provided by DBpedia [1] is used. It
contains approximately 400 defined types, including everything
from sound to video game and historic place.
– DBpedia also defines parent-child relationships for the
types, which can be used to build a complete hierarchy of types, e.g.
tree is a sub-type of plant, which is a sub-type of eukaryote.
[1] http://downloads.dbpedia.org/2015-10
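A type hierarchy of this kind can be represented as a child-to-parent mapping. The sketch below is an assumption about one convenient representation, not the format DBpedia ships; it contains only the tree, plant and eukaryote chain mentioned above, and the names `PARENT`, `ancestors` and `children_index` are hypothetical.

```python
# Child -> parent pairs; only the chain named on the slide is included.
PARENT = {
    "tree": "plant",
    "plant": "eukaryote",
}


def ancestors(type_name, parent=PARENT):
    """Walk up the hierarchy and list every super-type, nearest first."""
    chain = []
    while type_name in parent:
        type_name = parent[type_name]
        chain.append(type_name)
    return chain


def children_index(parent=PARENT):
    """Invert the mapping so that each type lists its direct sub-types."""
    index = {}
    for child, par in parent.items():
        index.setdefault(par, []).append(child)
    return index
```

The inverted index is what a tree-based aggregation needs, since it has to look up the sub-types of each type.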
Word Embedding
 With the set of types collected:
– Extract each word from the dataset.
– Embed it in a wiki2vec vector space and calculate the distance
between that word and every type in the ontology.
– If a single cell in a column contains more than one word, take the
average of the corresponding embedded vectors.
– This results in a collection of distance vectors representing all text in
the dataset.
– Collect the vectors according to their source within the dataset, i.e.
words in the same column are collected into a matrix of distances
for that column.
– If column headers are provided, treat them as an additional column
in the dataset.
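The embedding step can be sketched as follows. The two-dimensional vectors stand in for a wiki2vec model and are invented for illustration, the type list is hypothetical, and cosine distance is an assumed choice of distance function; the slides do not say which distance is used.

```python
import math

# Toy vectors standing in for a wiki2vec model (an assumption of this sketch).
VECTORS = {
    "red": [1.0, 0.0],
    "apple": [0.8, 0.6],
    "truck": [0.0, 1.0],
    "fruit": [0.9, 0.4],
    "vehicle": [0.1, 0.9],
}
TYPES = ["fruit", "vehicle"]  # hypothetical ontology types


def embed_cell(cell):
    """Average the vectors of the known words in one cell; None if no word is known."""
    vecs = [VECTORS[w] for w in cell.lower().split() if w in VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine_distance(u, v):
    """One minus the cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def column_matrix(cells):
    """Build the distance matrix of one column: one row per cell, one entry per type."""
    rows = []
    for cell in cells:
        vec = embed_cell(cell)
        if vec is not None:
            rows.append([cosine_distance(vec, VECTORS[t]) for t in TYPES])
    return rows
```

Column headers can be passed through `column_matrix` as one more list of cells, which treats them as an additional column.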
Distance Aggregation
 The previous step yields a set of matrices containing the distances between every text
segment in the dataset and the set of types.
– The goal of this step is to reduce them to a single
vector of distances.
 To this end, three successive aggregations are used to compute this
final vector.
 1st aggregation: computed across the rows of each column
matrix in order to produce a single vector of distances between the
column and all types.
 2nd aggregation (tree aggregation): the vector of
distances for a column is revised using the hierarchy of types
described by DBpedia in order to update the score for each type.
– For instance, the score for means of transportation can be updated based on the scores for
airplane, train, and automobile.
 3rd aggregation: performed over the set of distance vectors computed
for each column, producing a single vector of distances to every defined
type.
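The three aggregations can be sketched end to end. Several points are assumptions made for this example rather than details confirmed by the slides: the values are treated as similarity scores (higher is better), the tree aggregation replaces a parent's score by the mean of its own score and the maximum score among its children, and the type list and function names are hypothetical.

```python
from statistics import mean

# Hypothetical slice of the hierarchy, using the example types from the slide.
TYPES = ["means of transportation", "airplane", "train", "automobile"]
PARENT = {
    "airplane": "means of transportation",
    "train": "means of transportation",
    "automobile": "means of transportation",
}


def aggregate_rows(matrix):
    """1st aggregation: mean over the rows of one column matrix."""
    return [mean(col) for col in zip(*matrix)]


def aggregate_tree(scores):
    """2nd aggregation (assumed rule): a parent's score becomes the mean of
    its own score and the best score among its direct children."""
    by_type = dict(zip(TYPES, scores))
    updated = []
    for t in TYPES:
        kids = [by_type[c] for c, p in PARENT.items() if p == t]
        updated.append(mean([by_type[t], max(kids)]) if kids else by_type[t])
    return updated


def aggregate_columns(vectors):
    """3rd aggregation: mean over the per-column vectors."""
    return [mean(col) for col in zip(*vectors)]


def dataset_scores(column_matrices):
    """Run the three aggregations and return one score per type."""
    per_column = [aggregate_tree(aggregate_rows(m)) for m in column_matrices]
    return aggregate_columns(per_column)
```

In this sketch a column that scores highly for airplane also raises the score of means of transportation, which is the effect the tree aggregation is meant to have.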
Aggregation Function Selection
 To select the best function for each aggregation, a collection of datasets labelled with
types from the ontology is used as a sort of 'training set'.
 Then, for each labelled dataset and each combination of aggregation
functions, the percentage of true labels found in the top
three labels predicted by DUKE is computed.
 Using mean for column aggregation, mean-max for tree aggregation, and
mean for the final dataset aggregation step produces the best results.
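The selection procedure can be sketched as a comparison of top-three hit rates. The data layout is an assumption for this example: predictions are stored per combination of aggregation functions as ranked label lists, the true labels are stored as sets, and the function names are hypothetical.

```python
def top3_accuracy(predictions, truth):
    """Fraction of true labels that appear among the top three predicted labels.

    predictions: dataset name -> ranked list of predicted types
    truth:       dataset name -> set of true types
    """
    hits = 0
    total = 0
    for name, true_labels in truth.items():
        top3 = set(predictions[name][:3])
        hits += len(top3 & true_labels)
        total += len(true_labels)
    return hits / total


def best_combination(results_by_combo, truth):
    """Return the combination of aggregation functions with the highest top-3 accuracy."""
    return max(results_by_combo, key=lambda combo: top3_accuracy(results_by_combo[combo], truth))
```

Each key of `results_by_combo` would name one choice of column, tree and dataset aggregation, for example `("mean", "mean-max", "mean")`.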
Summary
 Data summarization
– Unstructured data
– Structured data
– Evaluation
 DUKE: Dataset Understanding via Knowledge-base Embeddings
Questions?

More Related Content

What's hot

Statistics ( central tendency / average)
Statistics ( central tendency / average)Statistics ( central tendency / average)
Statistics ( central tendency / average)RiyaVashisht4
 
Type i and type ii errors
Type i and type ii errorsType i and type ii errors
Type i and type ii errorsp24ssp
 
Research Design (Research Types, Quantitative Research Design and Qualitative...
Research Design (Research Types, Quantitative Research Design and Qualitative...Research Design (Research Types, Quantitative Research Design and Qualitative...
Research Design (Research Types, Quantitative Research Design and Qualitative...Alam Nuzhathalam
 
Methods of data collection (research methodology)
Methods of data collection  (research methodology)Methods of data collection  (research methodology)
Methods of data collection (research methodology)Muhammed Konari
 
Scientific methods of research
Scientific methods of researchScientific methods of research
Scientific methods of researchNursing Path
 
National population policy 2000
National population policy 2000National population policy 2000
National population policy 2000Harsh Rastogi
 
True experimental research design
True experimental research designTrue experimental research design
True experimental research designAsokan R
 
Epidemiology meaning, scope & terminology
Epidemiology meaning, scope & terminology Epidemiology meaning, scope & terminology
Epidemiology meaning, scope & terminology Jagan Kumar Ojha
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniquesMukut Deori
 
Criteria of selecting a sampling procedure
Criteria of selecting a sampling procedureCriteria of selecting a sampling procedure
Criteria of selecting a sampling procedureDr.Sangeetha R
 
4. scientific methods of research
4.  scientific methods of research4.  scientific methods of research
4. scientific methods of researchChanda Jabeen
 
Analysis and interpretation of data
Analysis and interpretation of dataAnalysis and interpretation of data
Analysis and interpretation of datateppxcrown98
 

What's hot (20)

Statistics ( central tendency / average)
Statistics ( central tendency / average)Statistics ( central tendency / average)
Statistics ( central tendency / average)
 
Research process
Research processResearch process
Research process
 
Type i and type ii errors
Type i and type ii errorsType i and type ii errors
Type i and type ii errors
 
Research Design (Research Types, Quantitative Research Design and Qualitative...
Research Design (Research Types, Quantitative Research Design and Qualitative...Research Design (Research Types, Quantitative Research Design and Qualitative...
Research Design (Research Types, Quantitative Research Design and Qualitative...
 
Methods of data collection (research methodology)
Methods of data collection  (research methodology)Methods of data collection  (research methodology)
Methods of data collection (research methodology)
 
Scientific methods of research
Scientific methods of researchScientific methods of research
Scientific methods of research
 
Sampling
SamplingSampling
Sampling
 
National population policy 2000
National population policy 2000National population policy 2000
National population policy 2000
 
EXPERIMENTAL RESEARCH DESIGN
EXPERIMENTAL RESEARCH DESIGNEXPERIMENTAL RESEARCH DESIGN
EXPERIMENTAL RESEARCH DESIGN
 
True experimental research design
True experimental research designTrue experimental research design
True experimental research design
 
Pilot study-research
Pilot study-researchPilot study-research
Pilot study-research
 
Epidemiology meaning, scope & terminology
Epidemiology meaning, scope & terminology Epidemiology meaning, scope & terminology
Epidemiology meaning, scope & terminology
 
Vital Statistics - Uses and Importance's
Vital Statistics - Uses and Importance'sVital Statistics - Uses and Importance's
Vital Statistics - Uses and Importance's
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniques
 
Criteria of selecting a sampling procedure
Criteria of selecting a sampling procedureCriteria of selecting a sampling procedure
Criteria of selecting a sampling procedure
 
Review of Literature
 Review of Literature Review of Literature
Review of Literature
 
4. scientific methods of research
4.  scientific methods of research4.  scientific methods of research
4. scientific methods of research
 
Data collection
Data collectionData collection
Data collection
 
NFHS 3
NFHS 3NFHS 3
NFHS 3
 
Analysis and interpretation of data
Analysis and interpretation of dataAnalysis and interpretation of data
Analysis and interpretation of data
 

Similar to data summarization.pptx

Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957IJMER
 
Recommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assocRecommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & associjerd
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1bPRAWEEN KUMAR
 
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval IJECEIAES
 
Comparative analysis of various data stream mining procedures and various dim...
Comparative analysis of various data stream mining procedures and various dim...Comparative analysis of various data stream mining procedures and various dim...
Comparative analysis of various data stream mining procedures and various dim...Alexander Decker
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysissrinivasa teja
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
IRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET Journal
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET Journal
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...IOSR Journals
 
Delivery Feet Data using K Mean Clustering with Applied SPSS
Delivery Feet Data using K Mean Clustering with Applied SPSSDelivery Feet Data using K Mean Clustering with Applied SPSS
Delivery Feet Data using K Mean Clustering with Applied SPSSijtsrd
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016ijcsbi
 
IRJET- Enhanced Density Based Method for Clustering Data Stream
IRJET-  	  Enhanced Density Based Method for Clustering Data StreamIRJET-  	  Enhanced Density Based Method for Clustering Data Stream
IRJET- Enhanced Density Based Method for Clustering Data StreamIRJET Journal
 

Similar to data summarization.pptx (20)

Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957
 
Recommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assocRecommendation system using unsupervised machine learning algorithm & assoc
Recommendation system using unsupervised machine learning algorithm & assoc
 
I AM SAM web app
I AM SAM web appI AM SAM web app
I AM SAM web app
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
 
Comparative analysis of various data stream mining procedures and various dim...
Comparative analysis of various data stream mining procedures and various dim...Comparative analysis of various data stream mining procedures and various dim...
Comparative analysis of various data stream mining procedures and various dim...
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysis
 
H04564550
H04564550H04564550
H04564550
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
IRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET- Semantics based Document Clustering
IRJET- Semantics based Document Clustering
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Delivery Feet Data using K Mean Clustering with Applied SPSS
Delivery Feet Data using K Mean Clustering with Applied SPSSDelivery Feet Data using K Mean Clustering with Applied SPSS
Delivery Feet Data using K Mean Clustering with Applied SPSS
 
50120130406008
5012013040600850120130406008
50120130406008
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
IRJET- Enhanced Density Based Method for Clustering Data Stream
IRJET-  	  Enhanced Density Based Method for Clustering Data StreamIRJET-  	  Enhanced Density Based Method for Clustering Data Stream
IRJET- Enhanced Density Based Method for Clustering Data Stream
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 

Recently uploaded

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 

Recently uploaded (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 

data summarization.pptx

  • 1. Data Summarization Alsayed Algergawy Presented at the group journal club 20 December, 2018
  • 2. 2 Outline  Data summarization: a survey – 1Mohiuddin Ahmed, Knowledge and Information Systems, 2018, 1-25  Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings: – 2Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan: CoRRabs/1804.01503 (2018)) 1https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf 2https://arxiv.org/abs/1804.01503
  • 3. Data summarization  is a process of creating concise, yet informative, version of the original data. – The terms concise and informative are quite generic and depend on application domains  Summarization is not a compression!! – Compression is a syntactic method for reducing the data – In contrast, summarization uses the semantic content of the data – compression makes the data non-intelligible and summarization makes the data intelligible for further data analysis and decision making  Role of data summarization – Intelligent analysis of data is a challenging task in many domain. In reality, the volume of the datasets is quite high and the time required to perform data analysis increases with data size. – A summary of the large data is easier and faster to analyze 3
  • 4. Taxonomy of data summarization
  • 5. Summarization of unstructured data
      ▪ Text summarization combines the following processes:
          – Extraction: finds the key phrases or sentences and produces a summary.
          – Abstraction: produces the key information in a new way.
          – Fusion: extracts important parts from the text and combines them coherently.
          – Compression: discards irrelevant or unimportant text.
      ▪ The frequency and position of a word are useful measures of its significance.
      ▪ Machine learning (ML) approaches to text summarization started in the 1990s:
          – Naive Bayes classifier, decision tree, hidden Markov model (HMM), artificial neural network (ANN), topic modeling
  • 6. Summarization of structured data
      ▪ Statistical techniques
          – Aggregation (defined for numerical values): estimates the statistical distribution of the data, which can be used to approximate the pattern in the dataset.
          – Sampling:
              ▪ A sample is a subset of the dataset.
              ▪ Sampling is a popular choice for reducing input data in data mining and machine learning.
              ▪ Variants: simple random sampling, stratified random sampling, systematic sampling, cluster random sampling, multi-stage random sampling.
      ▪ Semantic-based: linguistic summary, attribute-oriented induction, fascicle
      ▪ Machine learning summarization: frequent itemsets, clustering
  • 7. Summarization of structured data
      ▪ Sampling:
          – Stratified random sampling:
              ▪ The dataset is divided into non-overlapping subsets, called strata.
              ▪ The sampling scheme selects a random element from each stratum and produces a stratified sample.
          – Systematic sampling:
              ▪ Data instances are sampled from the dataset at equal intervals, beginning from a specified starting point and continuing to the end.
              ▪ For example, if the first random instance's location is 2 (starting point) and the interval value is 5, then for a sample of size 3, the sample instances are from the 2nd, 7th and 12th locations, respectively.
              ▪ The interval is calculated as ⌈size of data / size of sample⌉, i.e. the ratio rounded up.
          – Cluster random sampling:
              ▪ The whole dataset is organized into groups (clusters).
              ▪ Groups are randomly selected according to the sampling rate, and all members of the selected groups are included in the sample.
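The three sampling schemes above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the survey: the function names, the 1-based `start` argument, and the choice to round the number of selected clusters are all assumptions made for the example.

```python
import math
import random


def systematic_sample(data, sample_size, start):
    """Take every k-th instance, beginning at a 1-based start location."""
    # Interval = data size / sample size, rounded up (as on the slide).
    interval = math.ceil(len(data) / sample_size)
    positions = [start + i * interval for i in range(sample_size)]
    # Positions beyond the end of the data are dropped (assumption).
    return [data[p - 1] for p in positions if p <= len(data)]


def stratified_sample(strata, rng):
    """Pick one random element from each non-overlapping stratum."""
    return [rng.choice(stratum) for stratum in strata]


def cluster_sample(clusters, rate, rng):
    """Select whole clusters at random; keep every member of each."""
    # Rounding to at least one cluster is an assumption of this sketch.
    n_selected = max(1, round(rate * len(clusters)))
    chosen = rng.sample(clusters, n_selected)
    return [item for cluster in chosen for item in cluster]


# 15 instances whose values equal their 1-based locations.
data = list(range(1, 16))
print(systematic_sample(data, sample_size=3, start=2))  # [2, 7, 12]

rng = random.Random(0)
strata = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(stratified_sample(strata, rng))
print(cluster_sample(strata, rate=0.67, rng=rng))
```

With 15 instances and a sample of size 3, the interval is 5, which reproduces the 2nd, 7th and 12th locations from the slide's example.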
  • 8. Evaluation metrics
      ▪ Human-based
      ▪ Conciseness
      ▪ Information loss
      ▪ Interestingness
      ▪ The metrics are defined in terms of:
          – S: the summarized dataset size
          – D: the input dataset size
          – T: the number of distinct values present in the original data (D)
          – L: the difference between the number of distinct values present in the summary and in the original data
          – n: the number of data instances in the original dataset that are covered by the summary
  • 9. Outline
      ▪ Data summarization: a survey
          – Mohiuddin Ahmed, Knowledge and Information Systems, 2018, 1-25 [1]
      ▪ Abstractive Tabular Dataset Summarization via Knowledge Base Semantic Embeddings
          – Paul Azunre, Craig Corcoran, David Sullivan, Garrett Honke, Rebecca Ruppel, Sandeep Verma, Jonathon Morgan: CoRR abs/1804.01503 (2018) [2]
      [1] https://link.springer.com/content/pdf/10.1007%2Fs10115-018-1183-0.pdf
      [2] https://arxiv.org/abs/1804.01503
      https://github.com/NewKnowledge/duke
  • 10. DUKE: Dataset Understanding via Knowledge-base Embeddings
      ▪ Objective: to develop a method for summarizing the content of tabular datasets.
      ▪ Assumption: the dataset contains descriptive text in headers, columns and/or some augmenting metadata.
      ▪ Methodology:
          – A knowledge base semantic embedding is used to generate the summary.
          – The embedding is used to recommend a subject/type for each text segment.
          – Recommendations are aggregated into a small collection of super types considered to be descriptive of the dataset, by exploiting the hierarchy of types in a pre-specified ontology.
      ▪ Evaluation: with the February 2015 Wikipedia as the knowledge base and the corresponding DBpedia ontology as types, a set of experiments is carried out on open data taken from several sources (OpenML, CKAN and data.world).
  • 11. Approach
      Definitions:
      ▪ Distributional semantics: a concept that has recently been widely employed in natural language processing (NLP) to embed various NLP concepts into vector spaces.
          – "Words occurring in similar (linguistic) contexts tend to be semantically similar"
      ▪ Word2vec [1]: word2vec models use a large corpus of documents to build a vector space that maps words to points, where proximity implies semantic similarity. This makes it possible to calculate distances between words in the dataset and the set of types in the ontology.
      ▪ Wiki2vec [2]: a wiki2vec model is a form of word2vec model trained on a corpus of Wikipedia KB documents. Training on this data ensures that the types in the DBpedia ontology are included in the vocabulary of the model, and increases the likelihood that topics are discussed in context with their supertypes.
      [1] https://code.google.com/archive/p/word2vec/
      [2] https://github.com/idio/wiki2vec
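To make "proximity implies semantic similarity" concrete, the sketch below measures distances between a few hand-made vectors. The three-dimensional vectors are invented for illustration and do not come from any trained word2vec or wiki2vec model, and cosine distance is an assumed choice of metric; the slides do not name one.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller means closer in direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not real model output).
toy_embeddings = {
    "airplane": [0.9, 0.1, 0.0],
    "train": [0.8, 0.2, 0.1],
    "plant": [0.0, 0.1, 0.9],
}

d_near = cosine_distance(toy_embeddings["airplane"], toy_embeddings["train"])
d_far = cosine_distance(toy_embeddings["airplane"], toy_embeddings["plant"])
print(d_near < d_far)  # True
```

In a real setting the vectors would be looked up in a trained model instead of a hand-written dictionary.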
  • 14. Generating Type Recommendations
      ▪ The method for summarizing a tabular dataset can be broken down into three distinct steps:
          1. Collect a set of types and an ontology to use for abstraction.
          2. Extract any text data from the tabular dataset and embed it into a vector space to calculate the distance to all the types in the ontology.
          3. Aggregate the distance vectors for every keyword in the dataset into a single vector of distances.
  • 15. Type Ontology
      ▪ To generate an abstract term that describes the dataset:
          – An ontology of types is collected, from which a descriptive term can be selected.
          – To this end, an ontology provided by DBpedia [1] is used, which contains approximately 400 defined types, including everything from sound to video game and historic place.
          – DBpedia also contains defined parent-child relationships for the types, which can be used to build a complete hierarchy of types, e.g. tree is a sub-type of plant, which is a sub-type of eukaryote.
      [1] http://downloads.dbpedia.org/2015-10
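The parent-child relationships can be pictured as a simple child-to-parent map. The sketch below uses only the three types named on the slide; the dictionary layout and the `ancestors` helper are illustrative assumptions, not DBpedia's actual data format or API.

```python
# Hypothetical excerpt of the type hierarchy: child -> parent.
PARENT = {
    "tree": "plant",
    "plant": "eukaryote",
}


def ancestors(type_name, parent=PARENT):
    """Walk up the hierarchy and list every supertype, nearest first."""
    chain = []
    while type_name in parent:
        type_name = parent[type_name]
        chain.append(type_name)
    return chain


print(ancestors("tree"))  # ['plant', 'eukaryote']
```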
  • 16. Word Embedding
      ▪ With the set of types collected:
          – Extract each word from the dataset.
          – Embed it in a wiki2vec vector space and calculate the distance between that word and every type in the ontology.
          – If a single cell in a column contains more than one word, take the average of the corresponding embedded vectors.
          – This results in a collection of distance vectors representing all text in the dataset.
          – Collect the vectors according to their source within the dataset, i.e. the words in each column are collected into one matrix of distances for that column.
          – If column headers are provided, treat them as an additional column in the dataset.
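The embedding step above could look roughly like the following sketch. Everything concrete in it is an assumption made for illustration: the tiny `EMBED` dictionary stands in for a trained wiki2vec model, the two entries in `TYPES` stand in for the DBpedia ontology, cosine distance is an assumed metric, and skipping out-of-vocabulary words is a simplification.

```python
import math

# Hypothetical 3-d vectors standing in for a trained wiki2vec model.
EMBED = {
    "airplane": [0.9, 0.1, 0.0],
    "jet": [0.85, 0.15, 0.05],
    "oak": [0.05, 0.1, 0.9],
    "tree": [0.0, 0.2, 0.9],
    "vehicle": [0.8, 0.3, 0.0],
    "plant": [0.0, 0.1, 0.95],
}
TYPES = ["vehicle", "plant"]  # stand-in for the ontology's types


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def embed_cell(cell):
    """Average the vectors of all known words in one cell."""
    vectors = [EMBED[w] for w in cell.lower().split() if w in EMBED]
    if not vectors:
        return None  # nothing in the cell is in the vocabulary
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def column_matrix(cells):
    """One row of distances (to every type) per embeddable cell."""
    rows = []
    for cell in cells:
        vec = embed_cell(cell)
        if vec is not None:
            rows.append([cosine_distance(vec, EMBED[t]) for t in TYPES])
    return rows


def dataset_matrices(table):
    """table maps each header to its list of cell strings."""
    matrices = {header: column_matrix(cells) for header, cells in table.items()}
    # Column headers are treated as one additional column.
    matrices["<headers>"] = column_matrix(list(table))
    return matrices


table = {"airplane": ["jet airplane", "jet"], "tree": ["oak tree", "oak"]}
matrices = dataset_matrices(table)
for name, matrix in matrices.items():
    print(name, matrix)
```

Each matrix has one row per cell and one column per type, which is the shape the later aggregation steps reduce.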
  • 17. Distance Aggregation
      ▪ The previous step yields a set of matrices containing the distances between every text segment in the dataset and the set of types.
          – The goal of this step is to reduce them to a single vector of distances.
      ▪ To this end, three successive aggregations are used to compute the final vector.
      ▪ 1st aggregation: computed across the rows of each column matrix, producing a single vector of distances between the column and all types.
  • 18. Distance Aggregation
      ▪ 2nd aggregation: called the tree aggregation. It takes the vector of distances for a column and uses the hierarchy of types described by DBpedia to update the score for each type.
          – For instance, the score for means of transportation can be updated based on the scores for airplane, train, and automobile.
      ▪ 3rd aggregation: performed over the set of distance vectors computed for each column, producing a single vector of distances to every defined type.
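The three aggregations can be sketched as below. This is one plausible reading, not DUKE's actual implementation: the values are treated as similarity scores where higher means a better match, the tree step takes the mean of a type's own score and the best score among its children, and the type names and numbers are made up for the example.

```python
from statistics import mean

TYPES = ["means_of_transportation", "airplane", "train", "automobile"]
# Hypothetical excerpt of the type hierarchy: parent -> children.
CHILDREN = {"means_of_transportation": ["airplane", "train", "automobile"]}


def aggregate_column(matrix):
    """1st aggregation: mean over the rows of one column matrix."""
    return dict(zip(TYPES, (mean(col) for col in zip(*matrix))))


def tree_aggregate(scores):
    """2nd aggregation (assumed form): mean of own score and best child."""
    def updated(type_name):
        kids = CHILDREN.get(type_name, [])
        if not kids:
            return scores[type_name]
        return mean([scores[type_name], max(updated(k) for k in kids)])
    return {t: updated(t) for t in scores}


def aggregate_dataset(columns):
    """3rd aggregation: mean over the per-column score vectors."""
    return {t: mean([col[t] for col in columns]) for t in TYPES}


# Two toy column matrices: one row per text segment, one entry per type.
column_a = [[0.2, 0.9, 0.1, 0.3], [0.4, 0.7, 0.3, 0.1]]
column_b = [[0.6, 0.5, 0.2, 0.2]]

per_column = [tree_aggregate(aggregate_column(m)) for m in (column_a, column_b)]
final = aggregate_dataset(per_column)
print(final)
```

In this toy run the parent type gains score from its strongest child, which is the effect the transportation example on the slide describes.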
  • 19. Aggregation Function Selection
      ▪ To select the best function for each aggregation, a collection of datasets is labelled by hand with types from the ontology, to serve as a sort of 'training set'.
      ▪ Then, for each labelled dataset and each combination of aggregation functions, the percentage of true labels found in the top three labels predicted by DUKE is computed.
      ▪ Using mean for the column aggregation, mean-max for the tree aggregation, and mean for the final dataset aggregation produces the best results.
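The selection procedure amounts to a small grid search scored by top-three hit rate. The sketch below is a simplified stand-in under several assumptions: the candidate functions are just `mean` and `max`, the tree step applies the chosen function to a type's own score together with its children's scores (not the mean-max variant), values are similarity scores where higher is better, and the labelled data is a single invented dataset.

```python
from itertools import product
from statistics import mean

# Hypothetical excerpt of the type hierarchy: parent -> children.
CHILDREN = {"means_of_transportation": ["airplane", "train"]}


def run_pipeline(dataset, column_fn, tree_fn, dataset_fn):
    """dataset: list of columns; column: list of rows; row: {type: score}."""
    types = list(dataset[0][0])
    per_column = []
    for column in dataset:
        scores = {t: column_fn([row[t] for row in column]) for t in types}
        updated = dict(scores)
        for parent, kids in CHILDREN.items():
            updated[parent] = tree_fn([scores[parent]] + [scores[k] for k in kids])
        per_column.append(updated)
    return {t: dataset_fn([col[t] for col in per_column]) for t in types}


def top3_hit_rate(scores, true_labels):
    """Fraction of true labels found among the three best-scoring types."""
    top3 = sorted(scores, key=scores.get, reverse=True)[:3]
    return sum(label in top3 for label in true_labels) / len(true_labels)


def select_best(labelled, candidates):
    """Score every (column, tree, dataset) combination of functions."""
    results = {}
    for combo in product(candidates.items(), repeat=3):
        names = tuple(name for name, _ in combo)
        fns = [fn for _, fn in combo]
        rates = [top3_hit_rate(run_pipeline(ds, *fns), labels)
                 for ds, labels in labelled]
        results[names] = mean(rates)
    return max(results, key=results.get), results


row1 = {"means_of_transportation": 0.5, "airplane": 0.9, "train": 0.4, "plant": 0.1}
row2 = {"means_of_transportation": 0.3, "airplane": 0.7, "train": 0.6, "plant": 0.2}
labelled = [([[row1, row2]], ["airplane"])]  # one dataset with one column

best, results = select_best(labelled, {"mean": mean, "max": max})
print(best, results[best])
```

With so few types every combination scores well here; the point is the shape of the search, not the numbers.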
  • 20. Summary
      ▪ Data summarization
          – Unstructured data
          – Structured data
          – Evaluation
      ▪ DUKE: Dataset Understanding via Knowledge-base Embeddings
  • 21. ?