Complex Matching of RDF Datatype Properties
Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2)
1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University
DEXA 2013 – Prague, Czech Republic
Outline
• Introduction
• Motivation
• Related Work
• Schema Matching Principles
• Our approach:
• Phase 1) Estimated Mutual Information – EMI
• Phase 2) Genetic Programming – GP
• Evaluation
• Results
• Conclusions
Besnik Fetahu, DEXA 2013 – Prague, Czech Republic
Introduction
• Data Integration
• Combine different data sources into a unified view of the data
• Originally driven by large organizations:
• Merging company databases after acquisitions
• Currently, driven by new Web trends such as:
• Improvement of Web-based search
• Proliferation of Web applications
• e-business
• Examples: momondo.de, semantic search, price-watch sites, etc.
Introduction
• Challenges
• Heterogeneous data
• Different data formats
• Data quality (data impurities, corrupted information)
• Scalability
• Adaptability
• High cost
Introduction
• Initiatives to address data integration problems
• Linked Data Principles
• Ontology Alignment Initiatives (OAI)
• Schema Matching tools
Motivation
• Given two schemas S and T, a matching from S to T exists when an
element e of S is mapped to an element e’ of T by some expression
that relates the two elements.
Related Work
• Methods
• RiMOM, iMAP, S-Match, DSSim, ATOM, etc.
• Schema-based approach
• Instance-based approach
• Hybrid approach
• Cardinality
• 1:1
• 1:n
• n:m
Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
Cardinality
• Simple match
• 1:1 – direct matching
• Complex match
• 1:1 / n:1 (mapping functions)
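• Simple match example: ISBN (0-671-72287-5) → ISBN (0-671-72287-5)
• Complex match example: Fullname (William Shakespeare) ↔ Firstname (William), Last name (Shakespeare)
• split(fullname) maps Fullname to Firstname and Last name
• concatenate(f,l) maps Firstname and Last name to Fullname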
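The two mapping functions named on this slide can be illustrated with a minimal Python sketch. The function bodies are assumptions made for illustration (the slides only give the names split and concatenate); in particular, splitting at the first space is a simplification that will not suit every name.

```python
from typing import Tuple


def split(fullname: str) -> Tuple[str, str]:
    """Map one source value to two target values (Fullname -> Firstname, Last name).

    Assumption: the value is split at the first space only.
    """
    first, _, last = fullname.partition(" ")
    return first, last


def concatenate(f: str, l: str) -> str:
    """Map two source values to one target value (Firstname, Last name -> Fullname)."""
    return f"{f} {l}".strip()


first, last = split("William Shakespeare")
print(first, "|", last)
print(concatenate(first, last))
```

Applying concatenate to the output of split returns the original value here, which is what makes the two properties a complex match in both directions.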
Our approach
• Two-phase approach:
• Estimated Mutual Information
• Suggest 1:1 and 1:n mappings
• Serve as a filtering step (filter out data properties that have no mutual information)
• Reduce search space for the next phase (speed up the process)
• Genetic Programming
• Automatic process for creating mapping functions
• Reduces the cost of traversing the search space
Estimated Mutual Information (EMI)
• EMI Matrix
• Let p = (p1,…,pu) and q = (q1,…,qv) be two lists of sets (i.e., sets of datatype properties)
• Similarity measures: Cosine Similarity, Jaccard Index, …
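The formula for the EMI matrix did not survive as text on this slide. The sketch below therefore rests on an assumption: it starts from a co-occurrence matrix m, where m[i][j] counts how often values of p_i and q_j co-occur, and uses the usual mutual-information summand EMI_ij = (m_ij / M) · log(M · m_ij / (row_i · col_j)), with M the sum of all entries. The function name emi_matrix and the sample counts are hypothetical.

```python
import math


def emi_matrix(m):
    """Compute an EMI matrix from a co-occurrence matrix m (a list of rows).

    Assumed definition: EMI_ij = (m_ij / M) * log(M * m_ij / (row_i * col_j)).
    """
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    emi = []
    for i, row in enumerate(m):
        out = []
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                out.append(0.0)  # convention: a zero count contributes nothing
            else:
                ratio = total * m_ij / (row_sums[i] * col_sums[j])
                out.append((m_ij / total) * math.log(ratio))
        emi.append(out)
    return emi


# Hypothetical counts: p1 co-occurs mostly with q1, p2 mostly with q2.
counts = [[8, 1],
          [1, 8]]
for row in emi_matrix(counts):
    print([round(x, 3) for x in row])
```

Under this definition, pairs that co-occur more often than chance get a positive score and the rest get zero or negative scores, which is one way to filter candidate pairs before the second phase.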
Estimated Mutual Information (EMI)
• Computing the mutual information:
• Cosine Similarity
• Simple matches: William Shakespeare → William Shakespeare
• Jaccard Similarity
• Simple and Complex matches: William → William Shakespeare
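To make the two measures concrete, here is a small Python sketch that computes both over whitespace-separated tokens. Tokenizing on whitespace and lowercasing are assumptions; the slides do not say how values are tokenized or which thresholds are applied.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
        math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


print(cosine_similarity("William Shakespeare", "William Shakespeare"))
print(jaccard_similarity("William", "William Shakespeare"))
```

With a partial value such as "William" against "William Shakespeare", the token overlap still produces a non-zero score, which is the kind of signal a complex (split or concatenate) match needs.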
Genetic Programming (GP)
• Genetic programming refers to an automated method to create and evolve
programs to solve a problem.
• A solution is represented by a tree, whose nodes are labeled with functions
(concatenate, split, sum) or with values (strings, numbers, etc).
• New individuals are generated by applying genetic operations to the current
population of individuals.
• Individuals that should breed are selected through an evolutionary process.
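The tree representation described above can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the Node class, the FUNCTIONS table and the evaluate helper are hypothetical names, and leaves are treated as property names when found in the record and as constant values otherwise.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A tree node labeled with a function name, a property name, or a constant."""
    label: Union[str, float]
    children: List["Node"] = field(default_factory=list)


# Hypothetical function set; the slides mention concatenate, split and sum.
FUNCTIONS = {
    "concatenate": lambda a, b: f"{a} {b}",
    "sum": lambda a, b: a + b,
}


def evaluate(node: Node, record: dict):
    """Evaluate a candidate mapping function (a tree) on one source record."""
    if node.label in FUNCTIONS:
        args = [evaluate(child, record) for child in node.children]
        return FUNCTIONS[node.label](*args)
    # Leaf: a property looked up in the record, otherwise a constant value.
    return record.get(node.label, node.label)


tree = Node("concatenate", [Node("Firstname"), Node("Lastname")])
print(evaluate(tree, {"Firstname": "William", "Lastname": "Shakespeare"}))
```

Each individual in the population is one such tree, so evaluating it on source records yields the values that are compared against the target property.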
Genetic Programming (GP)
• GP operations:
• Crossover
• The act of swapping gene values between two potential solutions,
simulating the "mating" of the two solutions.
• Mutation
• The act of randomly altering the value of a gene in a potential solution.
• Reproduction
• The act of making a copy of a potential solution.
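The three operations can be sketched on trees stored as nested lists, where a tree is either a leaf string or [function, child, child]. The representation, the helper names and the property names in the example are assumptions made for illustration; real GP systems also apply constraints (for example on tree depth) that are omitted here.

```python
import copy
import random


def subtree_paths(tree, path=()):
    """Yield the index path of every node (the root has the empty path)."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get(tree, path):
    """Return the subtree found by following the index path."""
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by a copy of new."""
    if not path:
        return copy.deepcopy(new)
    tree = copy.deepcopy(tree)
    get(tree, path[:-1])[path[-1]] = copy.deepcopy(new)
    return tree


def crossover(a, b, rng):
    """Swap one randomly chosen subtree between two parents."""
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))


def mutation(tree, leaves, rng):
    """Replace one randomly chosen leaf with a randomly chosen property name."""
    leaf_paths = [p for p in subtree_paths(tree) if isinstance(get(tree, p), str)]
    return replace(tree, rng.choice(leaf_paths), rng.choice(leaves))


def reproduction(tree):
    """Copy an individual unchanged into the next generation."""
    return copy.deepcopy(tree)


rng = random.Random(42)
parent_a = ["+", "Address", ["+", "Number", "Complement"]]
parent_b = ["+", "Neighborhood", "Number"]
print(crossover(parent_a, parent_b, rng))
print(mutation(parent_a, ["Address", "Number", "Complement", "Neighborhood"], rng))
print(reproduction(parent_b))
```

All three functions return new trees and leave their inputs untouched, so the current population can be reused while the next one is being built.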
Genetic Programming (GP)
• Fitness function
• Levenshtein similarity function for string values
• KL-divergence measure for numeric values
• Different measures are applied because data property values can share
many common values (such as 0), which can lead to a wrong match. Thus,
we use KL-divergence to estimate how likely it is that two sets of values are the same.
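Both measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Levenshtein distance is normalized by the longer string to obtain a similarity in [0, 1], and the KL-divergence treats numeric values as discrete categories with a small smoothing constant eps. The slides do not specify either choice, nor how the scores are combined into a fitness value.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def kl_divergence(xs, ys, eps=1e-9):
    """KL(P || Q) between the smoothed empirical distributions of xs and ys."""
    support = set(xs) | set(ys)
    if not support:
        return 0.0
    cx, cy = Counter(xs), Counter(ys)
    zx = len(xs) + eps * len(support)
    zy = len(ys) + eps * len(support)
    return sum(
        ((cx[v] + eps) / zx) * math.log(((cx[v] + eps) / zx) / ((cy[v] + eps) / zy))
        for v in support
    )


print(levenshtein_similarity("William Shakespeare", "Wiliam Shakespeare"))
print(kl_divergence([0, 0, 1, 2], [0, 0, 1, 2]))
print(kl_divergence([0, 0, 1, 2], [0, 5, 6, 7]))
```

Two columns that merely share a frequent value such as 0 still get a large divergence when the rest of their distributions differ, which is the situation the slide describes for numeric values.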
An Example of Implementation
Phase 1 – Co-occurrence matrix
1. Difference between Cosine/Jaccard similarity metrics.
Phase 1 – EMI matrix
2. Possible matchings:
Figure: genetic operations (crossover, mutation, reproduction) applied to trees that combine Address, Number, Complement and Neighborhood with +.
Figure labels: Correct; Repetitive and Incorrect mutation.
Evaluation
• Datasets
• “Personal Information” dataset lists information about people
• “Real Estate” dataset lists information about houses for sale
• “Inventory” dataset describes product inventories
Except for the “Personal Information” dataset, which is withheld for privacy reasons, the datasets are available at:
http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/
Results
Conclusions
• Complex schema matching approach
• Simple + Complex matching:
• Estimated Mutual Information + Genetic Programming
• Reduced search space for matching properties
• Adaptive to variations of 1:1 and n:1 matching instances
• High accuracy and coverage of the generated matches
Questions?
Thank you!
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Complex Matching of RDF Datatype Properties

  • 1. Complex Matching of RDF Datatype Properties. Bernardo Pereira Nunes (1,2), Alexander Mera (1), Marco Antonio Casanova (1), Besnik Fetahu (2), Luiz André P. Paes Leme (3), Stefan Dietze (2). 1) Department of Informatics, PUC-Rio; 2) L3S Research Center, Leibniz University Hannover; 3) Computer Science Institute, Fluminense Federal University. DEXA 2013 – Prague, Czech Republic
  • 2. Outline • Introduction • Motivation • Related Work • Schema Matching Principles • Our approach: • Phase 1) Estimated Mutual Information (EMI) • Phase 2) Genetic Programming (GP) • Evaluation • Results • Conclusions
  • 3. Introduction • Data Integration • Combine different data sources into a unified view of the data • Originally driven by large organizations: • Merging company databases after acquisitions • Currently driven by new Web trends such as: • Improvement of Web-based search • Proliferation of Web applications • e-business • Examples: momondo.de, semantic search, price-watcher sites, etc.
  • 4. Introduction • Challenges • Heterogeneous data • Different data formats • Data quality (data impurities, corrupted information) • Scalability • Adaptability • Costly
  • 5. Introduction • Initiatives to address data integration problems • Linked Data Principles • Ontology Alignment Initiatives (OAI) • Schema Matching tools
  • 6. Motivation • Given two schemas S and T, a matching from S to T exists when an element e of S is mapped to an element e' of T by some expression that relates the two elements.
  • 7. Related Work • Methods • RiMOM, iMAP, S-Match, DSSim, ATOM, etc. • Schema-based approach • Instance-based approach • Hybrid approach • Cardinality • 1:1 • 1:n • n:m • Reference: Rahm, E. and Bernstein, P. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334-350.
  • 8. Cardinality • Simple match • 1:1 – direct matching, e.g. ISBN "0-671-72287-5" to ISBN "0-671-72287-5" • Complex match • 1:1 / n:1 (mapping functions), e.g. Fullname "William Shakespeare" to Firstname "William" and Last name "Shakespeare" via split(fullname), and back via concatenate(f,l)
  • 9. Our approach • Two-phase approach: • Estimated Mutual Information • Suggests 1:1 and 1:n mappings • Serves as a filtering step (filters out data properties that have no mutual information) • Reduces the search space for the next phase (speeds up the process) • Genetic Programming • Automatic process for creating mapping functions • Reduces the cost of traversing the search space
  • 10. Estimated Mutual Information (EMI) • EMI Matrix • p = (p1, …, pu) and q = (q1, …, qv) are two lists of sets (i.e. sets of datatype properties) • [Figure: EMI matrix computed with Cosine Similarity, Jaccard Index, …]
  • 11. Estimated Mutual Information (EMI) • Computing the mutual information: • Cosine Similarity • Simple matches: William Shakespeare → William Shakespeare • Jaccard Similarity • Simple and Complex matches: William → William Shakespeare
  • 12. Genetic Programming (GP) • Genetic programming is an automated method for creating and evolving programs that solve a problem. • A solution is represented by a tree whose nodes are labeled with functions (concatenate, split, sum) or with values (strings, numbers, etc.). • New individuals are generated by applying genetic operations to the current population of individuals. • The individuals that should breed are selected by an evolutionary process.
  • 13. Genetic Programming (GP) • GP Functions: • Crossover • The act of swapping gene values between two potential solutions, simulating the "mating" of the two solutions. • Mutation • The act of randomly altering the value of a gene in a potential solution. • Reproduction • The act of making a copy of a potential solution.
  • 14. Genetic Programming (GP) • Fitness function • Levenshtein similarity function for string values • KL-divergence measure for numeric values • Different measures are applied because the values of numeric data properties can share many common values (such as 0), which can lead to a wrong match. Thus, we use KL-divergence to measure the probability of two sets being the same.
  • 15. An Example of Implementation
  • 16. An Example of Implementation • Phase 1 – Co-occurrence matrix • 1. Difference between Cosine/Jaccard similarity metrics.
  • 17. An Example of Implementation • Phase 1 – EMI matrix • 2. Possible matchings:
  • 18. An Example of Implementation
  • 19. An Example of Implementation • [Figure: candidate trees that combine Address, Number, Complement and Neighborhood with "+", e.g. Complement + Number and Address + Number, transformed by crossover, mutation and reproduction]
  • 20. An Example of Implementation • [Figure: resulting candidates after mutation, labeled Correct, Repetitive and Incorrect]
  • 21. Evaluation • Datasets • "Personal Information" dataset lists information about people • "Real Estate" dataset lists information about houses for sale • "Inventory" dataset describes product inventories • Except for the "Personal Information" dataset, which is withheld for privacy reasons, the datasets are available at: http://pages.cs.wisc.edu/~anhai/wisc-si-archive/domains/
  • 22. Results
  • 23. Results
  • 24. Conclusions • Complex schema matching approach • Simple + Complex matching: • Estimated Mutual Information + Genetic Programming • Reduced search space for matching properties • Adaptive to variations of 1:1 and n:1 matching instances • High accuracy and coverage of the generated matches
  • 25. Questions? Thank you!
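
The two similarity measures named on slides 10 and 11 can be illustrated with a minimal Python sketch. It is not the authors' implementation: comparing whole values for the cosine measure and whitespace tokens for the Jaccard index is an assumption made here to reproduce the slide's "William → William Shakespeare" example, and the sample values for the two properties are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(values_a, values_b):
    """Cosine similarity between two bags of whole property values."""
    a, b = Counter(values_a), Counter(values_b)
    dot = sum(a[v] * b[v] for v in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def jaccard_index(values_a, values_b):
    """Jaccard index between the whitespace-token sets of two properties (assumed tokenization)."""
    tokens_a = {t for v in values_a for t in v.split()}
    tokens_b = {t for v in values_b for t in v.split()}
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical instance values of two datatype properties.
first_name = ["William", "Jane"]
full_name = ["William Shakespeare", "Jane Austen"]

print(cosine_similarity(first_name, full_name))  # no whole value is shared
print(jaccard_index(first_name, full_name))      # half of the tokens are shared
```

Under these assumptions the cosine measure only reacts to identical values (simple matches), while the token-based Jaccard index also gives a non-zero score when one property holds a part of the other, which is what makes it a candidate signal for complex matches.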

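The fitness measures listed on slide 14 can be sketched in the same way. This is an illustration under stated assumptions rather than the method used in the talk: the normalization of the Levenshtein distance into a similarity in [0, 1], the additive smoothing constant `eps`, and the treatment of numeric values as a discrete empirical distribution are all choices made here.

```python
from collections import Counter
from math import log


def levenshtein_distance(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Distance rescaled to [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(s, t) / longest


def kl_divergence(observed, expected, eps=1e-9):
    """KL divergence between the empirical distributions of two value lists.

    eps is an assumed smoothing constant that keeps the logarithm finite
    when a value occurs in only one of the lists.
    """
    support = set(observed) | set(expected)
    p, q = Counter(observed), Counter(expected)
    total = 0.0
    for v in support:
        pv = (p[v] + eps) / (len(observed) + eps * len(support))
        qv = (q[v] + eps) / (len(expected) + eps * len(support))
        total += pv * log(pv / qv)
    return total
```

A candidate mapping whose output strings score close to 1.0 against the target values, or whose numeric outputs have a divergence close to 0 from the target values, would be ranked as fitter. Comparing distributions instead of individual values is one way to avoid rewarding properties that merely share a frequent value such as 0.
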
Editor's Notes

  1. If you are not familiar with GP, see http://www.geneticprogramming.com/Tutorial/ or http://jgap.sourceforge.net (the package we used).
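
For readers new to GP, the tree representation and the three operations described on slides 12, 13 and 19 can be sketched in plain Python. This sketch does not use or reproduce the JGAP API. The tuple encoding, the fixed crossover point, the helper names and the sample record are all hypothetical and chosen only to keep the example short.

```python
import random

# A candidate mapping is a tree: a terminal is a source property name (str),
# an inner node is a tuple ("concat", left, right). Illustrative encoding only.


def evaluate(tree, record):
    """Apply a candidate mapping tree to one source record (a dict)."""
    if isinstance(tree, str):
        return record[tree]
    _, left, right = tree
    return evaluate(left, record) + " " + evaluate(right, record)


def terminals(tree):
    """List the property names at the leaves of a tree."""
    if isinstance(tree, str):
        return [tree]
    return terminals(tree[1]) + terminals(tree[2])


def crossover(a, b):
    """Swap the right subtrees of two inner nodes (one fixed crossover point)."""
    if isinstance(a, str) or isinstance(b, str):
        return a, b
    return (a[0], a[1], b[2]), (b[0], b[1], a[2])


def mutate(tree, properties, rng):
    """Replace one randomly chosen terminal with a random property name."""
    if isinstance(tree, str):
        return rng.choice(properties)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, properties, rng), right)
    return (op, left, mutate(right, properties, rng))


def reproduce(tree):
    """Copy an individual unchanged; tuples are immutable, so it can be reused."""
    return tree


# Hypothetical source record with the property names shown on slide 19.
record = {"Address": "Main St", "Number": "42", "Complement": "Apt 3"}
parent_a = ("concat", "Address", "Complement")
parent_b = ("concat", "Complement", "Number")
child_a, child_b = crossover(parent_a, parent_b)
mutant = mutate(parent_b, list(record), random.Random(0))

print(evaluate(child_a, record))  # Address followed by Number
```

In a full run, each generation would score every tree with a fitness function such as the string or numeric measures from slide 14, keep the best individuals, and produce the next population with these three operations.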