Assigning Semantic Labels to Data Sources
Authors:
S.K. Ramnandan [1], Amol Mittal [2], Craig Knoblock [3], Pedro Szekely [3]
[1] Indian Institute of Technology - Madras
[2] Indian Institute of Technology - Delhi
[3] University of Southern California
Introduction
Motivation:
- To automatically construct a semantic model of a set of data sources, using domain ontologies selected by the user
Applications:
- Provides support to automate many tasks:
  - Data integration
  - Source discovery
  - Service composition
  - Building knowledge graphs
- Manual description of sources is tedious and time-consuming
What is a semantic model?
Description of the source in terms of the concepts and
relationships defined by the domain ontology
[Figure: a data source mapped to a domain ontology. Ontology classes: Person, Organization, Place, State, City, Event. Properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, title, isPartOf, postalCode.]
Data Source
Column 1         Column 2  Column 3   Column 4      Column 5
Bill Gates       Oct 1955  Microsoft  Seattle       WA
Mark Zuckerberg  May 1984  Facebook   White Plains  NY
Larry Page       Mar 1973  Google     East Lansing  MI
Example semantic model
[Figure: the example semantic model for the data source above. Classes: Person, Organization, City, State. Data properties: name (on each class), birthdate. Relationships: bornIn, worksFor, state.]
Semantic Labeling Step
           Column 1         Column 2   Column 3      Column 4      Column 5
           Bill Gates       Oct 1955   Microsoft     Seattle       WA
           Mark Zuckerberg  May 1984   Facebook      White Plains  NY
           Larry Page       Mar 1973   Google        East Lansing  MI
Class:     Person           Person     Organization  City          State
Property:  name             birthdate  name          name          name
Assigning a class or data property (semantic type) from
the ontology to each attribute in the source
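A minimal sketch of what the output of the labeling step could look like as a data structure, using the five example columns. The `SemanticType` class and the `labeling` dictionary are hypothetical names chosen for illustration; they are not taken from the authors' system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticType:
    """A (class, data property) pair from the ontology, e.g. Person.name."""
    ontology_class: str
    data_property: str

    def __str__(self) -> str:
        return f"{self.ontology_class}.{self.data_property}"


# Hypothetical labeling of the example source: one semantic type per column.
labeling = {
    "Column 1": SemanticType("Person", "name"),
    "Column 2": SemanticType("Person", "birthdate"),
    "Column 3": SemanticType("Organization", "name"),
    "Column 4": SemanticType("City", "name"),
    "Column 5": SemanticType("State", "name"),
}

for column, semantic_type in labeling.items():
    print(f"{column} -> {semantic_type}")
```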
- Taheriyan et al., ISWC 2013, ICSC 2014
- Problems with model-based machine learning techniques (such as Conditional Random Fields, CRF):
• Low prediction accuracy for numeric data
• Training time scales poorly as the number of ontology data properties increases
Overall approach - semantic modeling
Overall Approach (SemTyper)
- Holistic view of the data values to capture the characteristic properties of a semantic type
- Textual data: TF-IDF cosine similarity
- Numeric data: Kolmogorov-Smirnov test
- Top-k suggestions returned to the user based on the confidence scores
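The top-k step can be sketched as follows, assuming each candidate semantic type has already received a confidence score. The function name `top_k_suggestions` and the score values are hypothetical and only illustrate the ranking.

```python
import heapq


def top_k_suggestions(scores, k=3):
    """Return the k (label, score) pairs with the highest confidence scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical confidence scores for one unlabeled column.
scores = {
    "City.name": 0.82,
    "Person.name": 0.35,
    "State.name": 0.61,
    "Organization.name": 0.12,
}

suggestions = top_k_suggestions(scores, k=2)
print(suggestions)
```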
Approach to Textual Data
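A minimal sketch of how TF-IDF cosine similarity could compare an unlabeled column against labeled training columns. It assumes that all values of a column are pooled into one token list (one "document" per semantic type), that tokens are whitespace-separated, and that a smoothed IDF is used; these choices and all function names are assumptions for illustration, not details confirmed by the slides.

```python
import math
from collections import Counter


def tokenize(values):
    """Lower-case every value of a column and split it into word tokens."""
    return [token for value in values for token in value.lower().split()]


def tfidf_vectors(documents):
    """Build one TF-IDF vector (token -> weight) per token list."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        vectors.append({
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0)
            for token, count in term_freq.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(token, 0.0) for token, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Labeled training columns (values from the example source) and a query column.
training = {
    "Person.name": ["Bill Gates", "Mark Zuckerberg", "Larry Page"],
    "City.name": ["Seattle", "White Plains", "East Lansing"],
}
unlabeled = ["East Lansing", "Seattle"]

labels = list(training)
documents = [tokenize(training[label]) for label in labels] + [tokenize(unlabeled)]
vectors = tfidf_vectors(documents)
query_vector = vectors[-1]

scores = {label: cosine_similarity(query_vector, vec) for label, vec in zip(labels, vectors)}
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

Treating the whole column as one document is what gives the holistic view of the data values: the score reflects the vocabulary of the column as a whole rather than any single cell.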
Approach to Numeric Data
Candidate Statistical Hypothesis tests:
- Welch’s t-test
- Mann-Whitney U-test
- Kolmogorov-Smirnov Test
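As an illustration of the last candidate, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution functions of two samples. The sketch below computes only the statistic, not a p-value, and how the result is turned into a confidence score is not specified here. The property names and sample values are made up for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical training values for two numeric properties and a query column.
elevation = [10, 50, 120, 300]
population = [1000, 5000, 20000, 80000]
query = [15, 60, 200, 250]

d_elevation = ks_statistic(elevation, query)
d_population = ks_statistic(population, query)
print(d_elevation, d_population)
```

A smaller statistic means the two samples look more like draws from the same distribution, so here the query column resembles the elevation values far more than the population values.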
Handling noisy datasets
- How to infer whether data is textual or numeric in a noisy source?
- Training time: based on the fraction of numeric values
• < 60% - trained as purely textual
• > 80% - trained as purely numeric
• else - trained as both textual and numeric
- Prediction time: based on the fraction of numeric values
• > 70% - tested as numeric data
• else - tested as textual data
- Thresholds chosen empirically using a coarse grid search
• Measured by label prediction accuracy on a held-out set
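The threshold rules above can be sketched as two small functions. The thresholds (60%, 80%, 70%) come from the slide; the `is_numeric` check, including its handling of thousands separators, is an assumption about what counts as a numeric value.

```python
def is_numeric(value):
    """Assumed check: a value is numeric if it parses as a float."""
    try:
        float(value.replace(",", ""))  # tolerate thousands separators
        return True
    except ValueError:
        return False


def numeric_fraction(values):
    """Fraction of the column's values that look numeric."""
    return sum(is_numeric(v) for v in values) / len(values) if values else 0.0


def training_mode(values):
    """How a column is used at training time."""
    fraction = numeric_fraction(values)
    if fraction < 0.60:
        return "textual"
    if fraction > 0.80:
        return "numeric"
    return "both"


def prediction_mode(values):
    """How a column is tested at prediction time."""
    return "numeric" if numeric_fraction(values) > 0.70 else "textual"


# Hypothetical noisy columns.
mostly_numeric = ["98101", "10601", "n/a", "48823"]  # 75% numeric
mostly_text = ["12", "n/a", "unknown", "7"]          # 50% numeric

print(training_mode(mostly_numeric), prediction_mode(mostly_numeric))
print(training_mode(mostly_text), prediction_mode(mostly_text))
```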
Datasets (Evaluation)
- Purely textual data
• Museum domain: 29 museum data sources (Taheriyan et al.)
- Purely numeric data
• City domain:
  - 30 numeric data properties from the City class in DBpedia
  - Partitioned into 10 data sources
- Mixture of textual & numeric data
• City domain:
  - 52 data properties from the City class in DBpedia
• Weather, phone directory and flight status domains (Ambite et al.)
Metrics (Evaluation)
- Mean Reciprocal Rank
  - Interested in the rank at which the correct semantic label is predicted
- Average Training Time
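Mean Reciprocal Rank averages 1/rank of the correct label over all test columns. The sketch below assumes that a column whose correct label is missing from the returned suggestions contributes 0; the example predictions are made up.

```python
def mean_reciprocal_rank(ranked_predictions, correct_labels):
    """Average of 1/rank of the correct label; 0 if it was not suggested."""
    if not correct_labels:
        raise ValueError("at least one test column is required")
    total = 0.0
    for ranked, correct in zip(ranked_predictions, correct_labels):
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(correct_labels)


# Hypothetical top-2 suggestions for three columns and their true labels.
ranked_predictions = [
    ["City.name", "State.name"],     # correct at rank 1
    ["Person.name", "City.name"],    # correct at rank 2
    ["State.name", "Person.name"],   # correct label not suggested
]
correct_labels = ["City.name", "City.name", "Organization.name"]

mrr = mean_reciprocal_rank(ranked_predictions, correct_labels)
print(mrr)
```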
Evaluation (Textual data - Museum domain)
Evaluation (Numeric data - City domain)
Evaluation (Mixture data - City domain)
Evaluation (Mixture data - other domains)
Related Work
- Using model-based machine learning techniques
• Goel et al. (ICAI 2012), Limaye et al. (PVLDB 2010), Mulwad et al. (ISWC 2013)
  - Extract features from individual data values and build a graphical model
  - Do not extract characteristic properties of the column data as a whole
  - Training graphical models is not scalable: explosion of the search space
- Using external knowledge
• Venetis et al. (VLDB 2011), Syed et al. (SWSC 2010)
  - Leverage knowledge on the Web to label individual data values
  - Restricted to domains and ontologies - huge amount of extracted data
  - Highly ontology specific: models generated from specific ontologies
- Stonebraker et al. (CIDR 2013)
  - Address the problem of schema matching
  - Draw inspiration from combining a collection of experts
Conclusion
- Label Prediction Accuracy
  - Our approach improves on the accuracy of competing approaches across a wide variety of domains
- Efficiency & Scalability
  - About 250 times faster than the Conditional Random Fields based semantic labeling technique
- Capable of handling noisy datasets
- Ontology agnostic
  - Learns a semantic labeling function with respect to the ontologies selected by users for their application
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Assigning semantic labels to data sources

  • 1. Assigning Semantic Labels to Data Sources
       Authors: S.K. Ramnandan [1], Amol Mittal [2], Craig Knoblock [3], Pedro Szekely [3]
       [1] Indian Institute of Technology - Madras
       [2] Indian Institute of Technology - Delhi
       [3] University of Southern California
  • 2. Introduction
       Motivation:
       - Automatically construct a semantic model of a set of data sources using domain ontologies selected by the user
       Applications:
       - Provides support for automating many tasks
         - Data integration
         - Source discovery
         - Service composition
         - Building knowledge graphs
       - Manual description is tedious and time-consuming
  • 3. What is a semantic model?
       A description of the source in terms of the concepts and relationships defined by the domain ontology.
       [Figure: domain ontology. Classes: Person, Organization, Place, State, City, Event. Properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, title, isPartOf, postalCode.]
       Data source:
       Column 1        | Column 2 | Column 3  | Column 4     | Column 5
       Bill Gates      | Oct 1955 | Microsoft | Seattle      | WA
       Mark Zuckerberg | May 1984 | Facebook  | White Plains | NY
       Larry Page      | Mar 1973 | Google    | East Lansing | MI
  • 4. Example semantic model
       [Figure: the five-column data source from slide 3 (Bill Gates, Mark Zuckerberg, Larry Page rows) mapped onto the ontology. Classes: Person, Organization, City, State. Properties: name, birthdate, bornIn, worksFor, state.]
  • 5. Semantic Labeling Step
       Assigning a class or data property (semantic type) from the ontology to each attribute in the source.
       [Figure: the five-column data source from slide 3 with each column labeled by a class (Person, Organization, City, State) and a data property (name, birthdate).]
  • 6. Overall approach - semantic modeling
       - Taheriyan et al., ISWC 2013, ICSC 2014
       - Problems with model-based machine learning techniques (like CRF):
         • Low prediction accuracy for numeric data
         • Training time scales poorly as the number of ontology data properties increases
  • 7. Overall Approach (SemTyper)
       - Holistic view of the data values to capture the characteristic properties of each semantic type
       - Textual data: TF-IDF cosine similarity
       - Numeric data: Kolmogorov-Smirnov test
       - Top-k suggestions returned to the user based on the confidence scores
  • 9. Approach to Numeric Data
       Candidate statistical hypothesis tests:
       - Welch’s t-test
       - Mann-Whitney U-test
       - Kolmogorov-Smirnov test
  • 10. Handling noisy datasets
       - How to infer whether data is textual or numeric in a noisy source?
       - Training time: fraction of numeric values
         • < 60%: trained as purely textual
         • > 80%: trained as purely numeric
         • else: trained as both textual and numeric
       - Prediction time: fraction of numeric values
         • > 70%: tested as numeric data
         • else: tested as textual data
       - Thresholds chosen empirically using a coarse grid search
         • Measured by label prediction accuracy on a held-out set
  • 11. Datasets (Evaluation)
       - Purely textual data
         • Museum domain: 29 museum data sources (Taheriyan et al.)
       - Purely numeric data
         • City domain:
           - 30 numeric data properties from the City class in DBpedia
           - Partitioned into 10 data sources
       - Mixture of textual & numeric data
         • City domain:
           - 52 data properties from the City class in DBpedia
         • Weather, phone directory and flight status domains (Ambite et al.)
  • 12. Metrics (Evaluation)
       - Mean Reciprocal Rank
         - Interested in the rank at which the correct semantic label is predicted
       - Average training time
  • 13. Evaluation (Textual data - Museum domain)
  • 16. Evaluation (Mixture data - other domains)
  • 17. Related Work
       - Using model-based machine learning techniques
         • Goel et al. (ICAI 2012), Limaye et al. (PVLDB 2010), Mulwad et al. (ISWC 2013)
           - Extract features from individual data values and build a graphical model
           - Do not extract characteristic properties of the column data as a whole
           - Training graphical models is not scalable: explosion of the search space
       - Using external knowledge
         • Venetis et al. (VLDB 2011), Syed et al. (SWSC 2010)
           - Leverage knowledge on the Web to label individual data values
           - Restricted to domains and ontologies: huge amount of extracted data
           - Highly ontology specific: models generated from specific ontologies
       - Stonebraker et al. (CIDR 2013)
         - Address the problem of schema matching
         - Draw inspiration in combining a collection of experts
  • 18. Conclusion
       - Label prediction accuracy
         - Our approach improves on the accuracy of competing approaches across a wide variety of domains
       - Efficiency & scalability
         - About 250 times faster than the Conditional Random Fields based semantic labeling technique
       - Capable of handling noisy datasets
       - Ontology agnostic
         - Learns a semantic labeling function with respect to the ontologies selected by users for their application
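
The numeric-fraction thresholds on slide 10 could be applied roughly as follows. This is a minimal sketch, not the authors' code: the function names and the rule for deciding whether a single cell "is numeric" (here, whether it parses as a float after removing thousands separators) are assumptions made for illustration. Only the 60%, 80% and 70% cut-offs come from the slides.

```python
from typing import List, Set


def is_numeric(value: str) -> bool:
    # Assumed parsing rule: a cell is numeric if it parses as a float
    # once thousands separators are removed.
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def numeric_fraction(values: List[str]) -> float:
    # Share of cells in the column that look numeric.
    if not values:
        return 0.0
    return sum(is_numeric(v) for v in values) / len(values)


def training_modes(values: List[str]) -> Set[str]:
    # Training time: < 60% numeric -> textual only, > 80% -> numeric only,
    # anything in between is trained as both.
    frac = numeric_fraction(values)
    if frac < 0.60:
        return {"textual"}
    if frac > 0.80:
        return {"numeric"}
    return {"textual", "numeric"}


def prediction_mode(values: List[str]) -> str:
    # Prediction time: > 70% numeric -> tested as numeric, else as textual.
    return "numeric" if numeric_fraction(values) > 0.70 else "textual"


noisy_column = ["12", "15.5", "n/a", "18", "20"]  # hypothetical noisy source
print(training_modes(noisy_column), prediction_mode(noisy_column))
```

The training band is deliberately wider than the single prediction cut-off: a column whose numeric fraction falls between the two training thresholds contributes to both the textual and the numeric model.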
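
The Mean Reciprocal Rank metric named on slide 12 can be illustrated with a short sketch. The function names and the example labels are hypothetical. Scoring a column as 0 when the correct label does not appear among the returned suggestions is the usual convention, but it is an assumption here, since the slides do not state how misses are handled.

```python
from typing import List, Sequence


def reciprocal_rank(ranked_labels: Sequence[str], correct_label: str) -> float:
    # 1 / rank of the correct label (rank starts at 1); 0.0 if it is absent.
    for rank, label in enumerate(ranked_labels, start=1):
        if label == correct_label:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predictions: List[Sequence[str]], truths: List[str]) -> float:
    # Average the reciprocal rank over all labeled columns.
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    total = sum(reciprocal_rank(p, t) for p, t in zip(predictions, truths))
    return total / len(truths)


# Hypothetical top-k suggestions for three columns and their correct labels.
predictions = [
    ["name", "title"],           # correct label at rank 1
    ["state", "name", "phone"],  # correct label at rank 2
    ["phone", "title"],          # correct label missing
]
truths = ["name", "name", "birthdate"]
print(mean_reciprocal_rank(predictions, truths))
```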

Editor's Notes

  1. Each textual semantic label has a characteristic set of tokens associated with it, which can collectively help in identifying the correct semantic label. We treat each column of data as a document and create a vector model representation. The dimensions correspond to the vocabulary of extracted tokens, and the weight assigned to each token is its TF-IDF score. Candidate semantic labels are ranked in decreasing order of cosine similarity.
  2. Idea: the distribution of values in each numeric semantic type is different. For example, the distribution of weights is likely to be different from the distribution of temperatures. We use statistical hypothesis testing to compare distributions of numeric values, and rank semantic labels in decreasing order of p-value.
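
The textual procedure in note 1 (column as document, TF-IDF weights, cosine ranking) might look like the sketch below. It uses only the standard library and is not the SemTyper implementation: the tokenizer, the smoothed IDF formula, the `Class.property` label strings, and the query column are all assumptions for illustration. The training values are taken from the example table on slide 3.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(values: List[str]) -> List[str]:
    # Assumed tokenizer: lowercase alphanumeric runs from every cell.
    return re.findall(r"[a-z0-9]+", " ".join(values).lower())


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Cosine similarity between two sparse token-weight vectors.
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(training: Dict[str, List[str]], column: List[str],
                k: int = 3) -> List[Tuple[str, float]]:
    # One "document" per semantic label: all training values for that label.
    counts = {label: Counter(tokenize(vals)) for label, vals in training.items()}
    n_docs = len(counts)
    # Document frequency: in how many label documents each token occurs.
    doc_freq = Counter(tok for c in counts.values() for tok in c)

    def weigh(tf: Counter) -> Dict[str, float]:
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1.
        return {tok: n * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
                for tok, n in tf.items()}

    query = weigh(Counter(tokenize(column)))
    scored = [(label, cosine(query, weigh(c))) for label, c in counts.items()]
    # Top-k labels in decreasing order of cosine similarity.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


training = {
    "Person.name": ["Bill Gates", "Mark Zuckerberg", "Larry Page"],
    "Organization.name": ["Microsoft", "Facebook", "Google"],
    "City.name": ["Seattle", "White Plains", "East Lansing"],
}
new_column = ["Steve Jobs", "Bill Gates", "Larry Ellison"]  # hypothetical source
for label, score in rank_labels(training, new_column):
    print(label, round(score, 3))
```

Because the comparison is made over the whole column rather than cell by cell, a few shared tokens ("bill", "gates", "larry") are enough to rank the matching label first even when most cells are unseen.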
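
The numeric procedure in note 2 can be sketched with a two-sample Kolmogorov-Smirnov test, the test named on slide 7. The code below is a standard-library illustration under stated assumptions, not the authors' implementation: the p-value uses the common asymptotic series approximation with a small-sample correction, and the function names and sample data are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would be a more usual choice.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    # Largest gap between the two empirical CDFs, checked at every data point.
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(d: float, n: int, m: int) -> float:
    # Asymptotic approximation (an assumption of this sketch):
    # p = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2).
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.1:
        return 1.0  # the series is numerically 1 here and converges slowly
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)


def rank_numeric_labels(training: Dict[str, List[float]],
                        column: List[float]) -> List[Tuple[str, float]]:
    # Higher p-value = less evidence that the two distributions differ.
    scored = []
    for label, values in training.items():
        d = ks_statistic(column, values)
        scored.append((label, ks_pvalue(d, len(column), len(values))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


training = {  # hypothetical training values per numeric semantic type
    "temperature": [12, 15, 18, 21, 24, 27, 30, 33],
    "weight": [55, 60, 65, 70, 75, 80, 85, 90],
}
new_column = [14, 17, 20, 23, 26, 29]
for label, p in rank_numeric_labels(training, new_column):
    print(label, p)
```

Ranking by p-value rather than by the raw statistic accounts for sample sizes: the same gap between distributions is weaker evidence of a mismatch when the columns are short.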