Assigning Semantic Labels to Data Sources
Authors:
S.K. Ramnandan [1], Amol Mittal [2], Craig Knoblock [3], Pedro Szekely [3]
[1] Indian Institute of Technology - Madras
[2] Indian Institute of Technology - Delhi
[3] University of Southern California
Introduction
Motivation:
- To automatically construct a semantic model of a set of data sources, using domain ontologies selected by the user
Applications:
- Provides support to automate many tasks:
  - Data integration
  - Source discovery
  - Service composition
  - Building knowledge graphs
- Manual description of sources is tedious and time-consuming
What is a semantic model?
Description of the source in terms of the concepts and
relationships defined by the domain ontology
Data Source
Column 1         Column 2   Column 3   Column 4      Column 5
Bill Gates       Oct 1955   Microsoft  Seattle       WA
Mark Zuckerberg  May 1984   Facebook   White Plains  NY
Larry Page       Mar 1973   Google     East Lansing  MI
Domain Ontology
[Figure: ontology graph. Classes: Person, Organization, Place, State, City, Event. Properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, title, isPartOf, postalCode]
Example semantic model
[Figure: the five source columns mapped onto the ontology. Classes: Person, Organization, City, State. Properties: name, birthdate, bornIn, worksFor, state]
Semantic Labeling Step
Column 1         Column 2   Column 3      Column 4      Column 5
Bill Gates       Oct 1955   Microsoft     Seattle       WA
Mark Zuckerberg  May 1984   Facebook      White Plains  NY
Larry Page       Mar 1973   Google        East Lansing  MI
Person           Person     Organization  City          State
name             birthdate  name          name          name
Assigning a class or data property (semantic type) from
the ontology to each attribute in the source
- Taheriyan et al., ISWC 2013, ICSC 2014
- Problems with model-based machine learning techniques (like CRF):
  • Low prediction accuracy for numeric data
  • Training time scales poorly as the number of ontology data properties increases
Overall approach - semantic modeling
Overall Approach (SemTyper)
- Holistic view of data values to capture the characteristic property of a semantic type
  - Textual data: TF-IDF cosine similarity
  - Numeric data: Kolmogorov-Smirnov test
- Top-k suggestions returned to the user, based on the confidence scores
Approach to Textual Data
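A minimal sketch of TF-IDF cosine similarity applied to a textual column. It assumes that all training values seen for a semantic label are pooled into one document and that an unlabeled column is scored against each label's document. The tokenizer, the smoothed IDF formula, the label strings, and the names `tokenize`, `tfidf_vectors`, `cosine` and `rank_labels` are illustrative choices, not taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(values):
    """Lowercase every cell value and split it into word tokens."""
    tokens = []
    for value in values:
        tokens.extend(re.findall(r"[a-z0-9]+", value.lower()))
    return tokens


def tfidf_vectors(docs):
    """Turn {name: tokens} into {name: {term: tf-idf weight}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_labels(training, column, k=3):
    """Return the top-k (label, score) pairs for an unlabeled column."""
    docs = {label: tokenize(values) for label, values in training.items()}
    docs["__query__"] = tokenize(column)  # hypothetical reserved key
    vectors = tfidf_vectors(docs)
    query = vectors.pop("__query__")
    scores = [(label, cosine(query, vec)) for label, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Training values reuse the example source from the earlier slides.
training = {
    "Person.name": ["Bill Gates", "Mark Zuckerberg", "Larry Page"],
    "Organization.name": ["Microsoft", "Facebook", "Google"],
    "City.name": ["Seattle", "White Plains", "East Lansing"],
}
suggestions = rank_labels(training, ["Google", "Microsoft", "Apple"])
print(suggestions)
```

The scores double as confidence values, so returning the first k entries gives the top-k suggestions described on the previous slide.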
Approach to Numeric Data
Candidate Statistical Hypothesis tests:
- Welch’s t-test
- Mann-Whitney U-test
- Kolmogorov-Smirnov Test
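A minimal sketch of how a two-sample Kolmogorov-Smirnov comparison can rank numeric semantic types. It computes only the KS statistic (the largest gap between two empirical distribution functions) and ranks labels by it. Turning the statistic into a p-value or confidence score is left out; a library routine such as `scipy.stats.ks_2samp` returns both. The label names and all numbers are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the maximum gap is reached at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def rank_numeric_labels(training, column, k=3):
    """Rank labels by distribution distance to the column, smallest first."""
    scores = [(label, ks_statistic(values, column))
              for label, values in training.items()]
    return sorted(scores, key=lambda pair: pair[1])[:k]


# Made-up training values for two hypothetical numeric semantic types.
training = {
    "City.elevation": [2, 15, 43, 56, 120, 305, 610, 1609],
    "City.populationTotal": [8200, 47000, 120000, 650000, 2700000, 8400000],
}
column = [5, 30, 75, 250, 900]
ranking = rank_numeric_labels(training, column)
print(ranking)
```

Because the statistic looks at the whole distribution of a column rather than at individual values, it fits the holistic view of the data described earlier.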
Handling noisy datasets
- How to infer if data is textual or numeric in a noisy source?
- Training time: fraction of numeric values
  • < 60% - trained as purely textual
  • > 80% - trained as purely numeric
  • else - trained as both textual and numeric
- Prediction time: fraction of numeric values
  • > 70% - tested as numeric data
  • else - tested as textual data
- Thresholds empirically chosen using coarse grid search
  • Measuring label prediction accuracy on a held-out set
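The threshold rules above can be sketched as a small routing function. The 60%, 80% and 70% cut-offs come from the slide; how a cell is judged to be numeric (here, whether `float` can parse it after removing thousands separators) and the function names are assumptions.

```python
def is_numeric(value):
    """True if the cell parses as a number (the parsing rule is an assumption)."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False


def numeric_fraction(values):
    """Fraction of cells in a column that parse as numbers."""
    if not values:
        return 0.0
    return sum(is_numeric(v) for v in values) / len(values)


def training_mode(values):
    """Decide how a labeled column is used at training time."""
    fraction = numeric_fraction(values)
    if fraction < 0.60:
        return "textual"
    if fraction > 0.80:
        return "numeric"
    return "both"  # between 60% and 80%: train both models


def prediction_mode(values):
    """Decide which model scores an unlabeled column at prediction time."""
    return "numeric" if numeric_fraction(values) > 0.70 else "textual"


noisy_column = ["12", "15", "n/a", "18", "20"]  # 80% numeric
print(training_mode(noisy_column), prediction_mode(noisy_column))
```

A column that falls in the middle band contributes to both models during training, while prediction always commits to a single model.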
Datasets (Evaluation)
- Purely textual data
  • Museum domain: 29 museum data sources (Taheriyan et al.)
- Purely numeric data
  • City domain:
    - 30 numeric data properties from City class in DBpedia
    - Partitioned into 10 data sources
- Mixture of textual & numeric data
  • City domain:
    - 52 data properties from City class in DBpedia
  • Weather, phone directory and flight status domains (Ambite et al.)
Metrics (Evaluation)
- Mean Reciprocal Rank
  - Interested in the rank at which the correct semantic label is predicted
- Average Training Time
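As an illustration of the Mean Reciprocal Rank metric listed above: each column contributes 1/rank of its correct label in the ranked suggestions, and the contributions are averaged. Scoring a column as 0 when the correct label is absent from the suggestions is a common convention and is assumed here; the example predictions are made up.

```python
def reciprocal_rank(ranked_labels, correct_label):
    """Return 1 / rank of the correct label, or 0.0 if it is not listed."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == correct_label:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predictions):
    """Average reciprocal rank over (ranked_labels, correct_label) pairs."""
    if not predictions:
        return 0.0
    total = sum(reciprocal_rank(ranked, truth) for ranked, truth in predictions)
    return total / len(predictions)


# Made-up top-k suggestions for three columns.
predictions = [
    (["Person.name", "City.name"], "Person.name"),         # rank 1
    (["State.name", "City.name"], "City.name"),            # rank 2
    (["Person.name", "City.name"], "Organization.name"),   # not listed
]
mrr = mean_reciprocal_rank(predictions)
print(mrr)
```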
Evaluation (Textual data - Museum domain)
Evaluation (Numeric data - City domain)
Evaluation (Mixture data - City domain)
Evaluation (Mixture data - other domains)
Related Work
- Using model-based machine learning techniques
  • Goel et al. (ICAI 2012), Limaye et al. (PVLDB 2010), Mulwad et al. (ISWC 2013)
    - Extract features from individual data values and build graphical model
    - Do not extract characteristic properties of column data as a whole
    - Training graphical models not scalable – explosion of search space
- Using external knowledge
  • Venetis et al. (VLDB 2011), Syed et al. (SWSC 2010)
    - Leverage knowledge on Web to label individual data values
    - Restricted to domains and ontologies - huge amount of extracted data
    - Highly ontology specific – models generated from specific ontologies
- Stonebraker et al. (CIDR 2013)
  - Address problem of schema matching
  - Draw inspiration in combining collection of experts
Conclusion
- Label Prediction Accuracy
  - Our approach improves on the accuracy of competing approaches across a wide variety of domains
- Efficiency & Scalability
  - About 250 times faster than the Conditional Random Fields (CRF) based semantic labeling technique
- Capable of handling noisy datasets
- Ontology agnostic
  - Learns a semantic labeling function with respect to the ontologies selected by users for their application
Your enemies use GenAI too - staying ahead of fraud with Neo4jNeo4j
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 

Recently uploaded (20)

The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 

Assigning semantic labels to data sources

  • 1. Assigning Semantic Labels to Data Sources. Authors: S.K. Ramnandan [1], Amol Mittal [2], Craig Knoblock [3], Pedro Szekely [3]. Affiliations: [1] Indian Institute of Technology - Madras; [2] Indian Institute of Technology - Delhi; [3] University of Southern California
  • 2. Introduction
      - Motivation: automatically construct a semantic model of a set of data sources using domain ontologies selected by the user
      - Applications: a semantic model helps automate many tasks, including data integration, source discovery, service composition and building knowledge graphs
      - Manual description of sources is tedious and time-consuming
  • 3. What is a semantic model? A description of the source in terms of the concepts and relationships defined by the domain ontology.
      [Figure: domain ontology. Classes: Person, Organization, Place, State, City, Event. Properties: name, birthdate, bornIn, worksFor, state, phone, livesIn, ceo, location, organizer, nearby, startDate, title, isPartOf, postalCode.]
      Data source:
      Column 1        | Column 2 | Column 3  | Column 4     | Column 5
      Bill Gates      | Oct 1955 | Microsoft | Seattle      | WA
      Mark Zuckerberg | May 1984 | Facebook  | White Plains | NY
      Larry Page      | Mar 1973 | Google    | East Lansing | MI
  • 4. Example semantic model, shown for the same five-column data source as slide 3 (Bill Gates, Mark Zuckerberg, Larry Page).
      [Figure: semantic model. Classes: Person, Organization, City, State. Properties: name, birthdate, bornIn, worksFor, state.]
  • 5. Semantic Labeling Step: assigning a class or data property (semantic type) from the ontology to each attribute in the source.
      [Figure: the five columns of the slide 3 data source, each annotated with a semantic type: Person name, Person birthdate, Organization name, City name, State name.]
  • 6. Overall approach - semantic modeling
      - Taheriyan et al., ISWC 2013, ICSC 2014
      - Problems with model-based machine learning techniques (like CRF): low prediction accuracy for numeric data; training time scales poorly as the number of ontology data properties increases
  • 7. Overall Approach (SemTyper)
      - Holistic view of the data values to capture the characteristic properties of a semantic type
      - Textual data: TF-IDF cosine similarity
      - Numeric data: Kolmogorov-Smirnov test
      - Top-k suggestions returned to the user based on the confidence scores
  • 9. Approach to Numeric Data. Candidate statistical hypothesis tests: Welch's t-test, Mann-Whitney U-test, Kolmogorov-Smirnov test
  • 10. Handling noisy datasets: how to infer whether data is textual or numeric in a noisy source?
      - Training time, by fraction of numeric values: < 60% trained as purely textual; > 80% trained as purely numeric; otherwise trained as both textual and numeric
      - Prediction time, by fraction of numeric values: > 70% tested as numeric data; otherwise tested as textual data
      - Thresholds chosen empirically using a coarse grid search, measuring label prediction accuracy on a held-out set
  • 11. Datasets (Evaluation)
      - Purely textual data: museum domain, 29 museum data sources (Taheriyan et al.)
      - Purely numeric data: city domain, 30 numeric data properties from the City class in DBpedia, partitioned into 10 data sources
      - Mixture of textual and numeric data: city domain, 52 data properties from the City class in DBpedia; weather, phone directory and flight status domains (Ambite et al.)
  • 12. Metrics (Evaluation)
      - Mean Reciprocal Rank: we are interested in the rank at which the correct semantic label is predicted
      - Average training time
  • 13. Evaluation (Textual data- Museum domain)
  • 16. Evaluation (Mixture data- other domains)
  • 17. Related Work
      - Using model-based machine learning techniques: Goel et al. (ICAI 2012), Limaye et al. (PVLDB 2010), Mulwad et al. (ISWC 2013). Extract features from individual data values and build a graphical model; do not extract characteristic properties of the column data as a whole; training graphical models is not scalable (explosion of the search space)
      - Using external knowledge: Venetis et al. (VLDB 2011), Syed et al. (SWSC 2010). Leverage knowledge on the Web to label individual data values; restricted to domains and ontologies (huge amount of extracted data); highly ontology specific (models generated from specific ontologies)
      - Stonebraker et al. (CIDR 2013): address the problem of schema matching; draw inspiration in combining a collection of experts
  • 18. Conclusion
      - Label prediction accuracy: our approach improves on the accuracy of competing approaches on a wide variety of domains
      - Efficiency and scalability: about 250 times faster than the Conditional Random Fields based semantic labeling technique
      - Capable of handling noisy datasets
      - Ontology agnostic: learns the semantic labeling function with respect to the ontologies selected by users for their application
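The textual/numeric routing rule from slide 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 60%, 80% and 70% thresholds come from the slide, but the function names, the regular expression that decides what counts as a "numeric value", and the handling of empty cells are hypothetical and not taken from SemTyper.

```python
import re

# Thresholds as stated on slide 10; everything else here is an assumption.
TRAIN_TEXTUAL_BELOW = 0.60    # fraction below this: train as purely textual
TRAIN_NUMERIC_ABOVE = 0.80    # fraction above this: train as purely numeric
PREDICT_NUMERIC_ABOVE = 0.70  # fraction above this: test as numeric data

# Hypothetical definition of "numeric": an optionally signed decimal number.
_NUMBER = re.compile(r"^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$")


def numeric_fraction(values):
    """Fraction of non-empty cells that look like plain numbers."""
    cells = [str(v).strip() for v in values if str(v).strip()]
    if not cells:
        return 0.0
    return sum(1 for c in cells if _NUMBER.match(c)) / len(cells)


def training_modes(values):
    """Return the set of models a training column contributes to."""
    fraction = numeric_fraction(values)
    if fraction < TRAIN_TEXTUAL_BELOW:
        return {"textual"}
    if fraction > TRAIN_NUMERIC_ABOVE:
        return {"numeric"}
    return {"textual", "numeric"}


def prediction_mode(values):
    """Return which model a column is tested against at prediction time."""
    if numeric_fraction(values) > PREDICT_NUMERIC_ABOVE:
        return "numeric"
    return "textual"
```

For example, a column such as `["1", "2", "3", "unknown"]` is 75% numeric, so under these rules it would be trained as both textual and numeric and tested as numeric.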

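The Mean Reciprocal Rank metric from slide 12 averages 1/rank of the correct label over all labeled columns. The sketch below uses the standard definition; the function name and the choice to score 0 when the correct label is missing from the returned top-k list are assumptions, since the slides do not say how misses are counted.

```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """Average of 1/rank of the correct label, one ranking per column.

    ranked_predictions: list of label lists, best suggestion first.
    true_labels: the correct label for each column, in the same order.
    A column whose correct label is absent contributes 0 (assumption).
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need exactly one true label per ranking")
    if not true_labels:
        return 0.0
    total = 0.0
    for ranking, truth in zip(ranked_predictions, true_labels):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_labels)
```

With three columns whose correct labels appear at ranks 1, 2 and 3, the score is (1 + 1/2 + 1/3) / 3, roughly 0.61.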
Editor's Notes

  1. Each textual semantic label has a characteristic set of tokens associated with it, which can collectively help in identifying the correct semantic label. We treat each column of data as a document and create a vector model representation: the dimensions correspond to the vocabulary of extracted tokens, and the weight assigned to each token is its TF-IDF score. Candidate semantic labels are ranked in decreasing order of cosine similarity.
  2. Idea: the distribution of values differs from one numeric semantic type to another. For example, the distribution of weights is likely to differ from the distribution of temperatures. We use statistical hypothesis testing to compare distributions of numeric values, and rank semantic labels in decreasing order of p-value.
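The two notes above can be illustrated with a small pure-Python sketch: one function ranks labels for a textual column by TF-IDF cosine similarity, the other ranks labels for a numeric column by the p-value of a two-sample Kolmogorov-Smirnov test. This is not the SemTyper implementation. The tokenizer, the smoothed IDF formula, the asymptotic p-value approximation and all function names are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def tokenize(value):
    # Assumption: lowercase alphanumeric tokens; the notes name no tokenizer.
    return re.findall(r"[a-z0-9]+", str(value).lower())


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_textual(train, column):
    """train: {label: [values]}. Returns [(label, cosine)], best first."""
    # Each label's training values form one "document".
    counts = {lab: Counter(t for v in vals for t in tokenize(v))
              for lab, vals in train.items()}
    doc_freq = Counter(t for c in counts.values() for t in c)
    n_docs = len(counts)
    # Assumption: smoothed IDF; other standard TF-IDF variants would also fit.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = {lab: {t: tf * idf[t] for t, tf in c.items()}
               for lab, c in counts.items()}
    query_tf = Counter(t for v in column for t in tokenize(v))
    # Tokens never seen in training are dropped from the query vector.
    query = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    scores = [(lab, cosine(query, vec)) for lab, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an approximate asymptotic p-value."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples, tracking the largest gap between the ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    effective = math.sqrt(n * m / (n + m))
    lam = (effective + 0.12 + 0.11 / effective) * d
    if lam < 0.2:  # the series converges poorly here; the p-value is ~1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def rank_numeric(train, column):
    """train: {label: [numbers]}. Returns [(label, p_value)], best first."""
    scores = [(lab, ks_two_sample(vals, column)[1])
              for lab, vals in train.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A higher p-value means the test found less evidence that the two samples come from different distributions, which is why labels are sorted in decreasing order of p-value. In practice a library routine such as SciPy's `ks_2samp` would normally replace the hand-written test.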