Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms
Simon Price and Peter Flach
ILP 2008
Our Aim
Query heterogeneous data sources as if their data were conveniently held in a single relational database.
Example data sources:
• web pages
• digital libraries
• knowledge bases
• Semantic Web
• databases
Outline of this paper
1. Relational Joins (a quick review)
2. Basic Terms
3. Relational Joins for Basic Terms
4. Basic Term Proximity-Join
5. Application
Contribution of this work
Relational Algebra for Basic Terms
Basic Term Proximity-Join
Application to bibliographic data
1. Relational Joins (a quick review)
θ-Join

publications
publication#  title  venue  year
1             a      b      c
2             d      e      f

authors
author#  publication#  name
1        1             p
2        1             q
3        2             r

Joined rows (publications row paired with authors row on matching publication#)
publication#  title  venue  year  author#  publication#  name
1             a      b      c     1        1             p
2             d      e      f     3        2             r
1             a      b      c     2        1             q
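The θ-join on this slide can be sketched in Python as a filtered Cartesian product. The tuple layout, the name `theta_join` and the lambda used as the join condition are illustrative assumptions, not notation from the talk.

```python
from itertools import product

# Relations as sets of tuples; column order follows the slide's tables.
publications = {(1, "a", "b", "c"), (2, "d", "e", "f")}  # (publication#, title, venue, year)
authors = {(1, 1, "p"), (2, 1, "q"), (3, 2, "r")}        # (author#, publication#, name)


def theta_join(r, s, theta):
    """Return every concatenated pair (x + y) for which theta(x, y) holds."""
    return {x + y for x, y in product(r, s) if theta(x, y)}


# Join condition: publications.publication# = authors.publication#
joined = theta_join(publications, authors, lambda pub, auth: pub[0] == auth[1])

for row in sorted(joined):
    print(row)
```

Passing the condition as a function keeps the sketch general: equality gives an equi-join, and any other comparison gives a different θ-join.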
2. Basic Terms
Basic Terms
• Proposed by John Lloyd
• Family of typed terms in higher-order logic
• Based on Church’s simple theory of types
• Data types representing:
  • tuples
  • structures, e.g. trees and graphs
  • abstractions, e.g. sets and multisets (bags)
• Basic Terms and the “individuals-as-terms” model
Lloyd, J. W.: Logic for Learning. Springer, New York (2003)
Representing Individuals as Basic Terms
1. Define basic type structure
e.g. an academic publication with the following basic type structure
2. Transform data instances to basic terms of that type
e.g. a publication record from the CORA bibliographic database
( { “Mitchell, T.”, “Thrun, S.” },
  “Explanation-Based Learning: A Comparison of Symbolic and Neural Network Approaches.”,
  “In Proceedings of the Tenth International Conference on Machine Learning”,
  “1993” )
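As a rough illustration, the publication record can be written down in Python with a tuple for the outer term and a frozenset for the set of coauthors. The type structure used here (Coauthors, Title, Venue, Year) is an assumption inferred from the example instance and from the Publication.Coauthors, Publication.Title and Publication.Venue labels used later in the talk. Python tuples and frozensets are only a stand-in for typed higher-order terms.

```python
from typing import FrozenSet, NamedTuple


class Publication(NamedTuple):
    """Assumed type structure: (Coauthors, Title, Venue, Year)."""
    coauthors: FrozenSet[str]  # abstraction: a set of author names
    title: str
    venue: str
    year: str


record = Publication(
    coauthors=frozenset({"Mitchell, T.", "Thrun, S."}),
    title="Explanation-Based Learning: A Comparison of Symbolic and Neural Network Approaches.",
    venue="In Proceedings of the Tenth International Conference on Machine Learning",
    year="1993",
)

# The term is immutable and hashable, so it can be a member of a set of terms.
relation = {record}
print(sorted(record.coauthors), record.year)
```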
3. Relational Joins for Basic Terms
Upgrading Relational Joins for Basic Terms
1. Restate (a subset of) traditional relational algebra
   1. Remove schematic metadata from the data itself
   2. Make explicit the tuple item indexing function
2. Replace sets of tuples (relations) with sets of basic terms (“basic term relations”)
3. Upgrade indexing function to index all types of basic terms:
   1. basic tuples (tuples of basic terms)
   2. basic structures (e.g. lists and trees of basic terms)
   3. basic abstractions (e.g. sets and multisets of basic terms)
   4. any combination of the above three
Basic Term θ-Join (Example 1)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, d, g, h) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
A ⋈_{Title = Title} B =
{ ( ({p, q}, a, b, c), ({p, q}, a, b, c) ),
  ( ({r}, d, e, f), ({p, q}, d, e, f) ),
  ( ({s, t}, d, g, h), ({p, q}, d, e, f) ) }
Basic Term θ-Join (Example 2)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, g, h, k) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
A ⋈_{Coauthors = Coauthors} B =
{ ( ({p, q}, a, b, c), ({p, q}, a, b, c) ),
  ( ({p, q}, a, b, c), ({p, q}, d, e, f) ) }
Basic Term θ-Join (Example 3)
A = { ({p, q}, a, b, c),
      ({r}, d, e, f),
      ({s, t}, g, h, k) }
B = { ({p, q}, a, b, c),
      ({p, q}, d, e, f) }
A ⋈_{Publication = Publication} B =
{ ( ({p, q}, a, b, c), ({p, q}, a, b, c) ) }
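The three examples can be reproduced with a small Python sketch. The helper names (`fs`, `index`, `basic_term_theta_join`) and the positional indexes are hypothetical. They imitate the idea of an explicit indexing function rather than the formal definition used in the talk.

```python
from itertools import product

# Component positions in a publication term (Coauthors, Title, Venue, Year).
COAUTHORS, TITLE = 0, 1


def fs(*names):
    """Shorthand for a set-valued component, e.g. {p, q}."""
    return frozenset(names)


def index(term, i=None):
    """Hypothetical indexing function: component i, or the whole term if i is None."""
    return term if i is None else term[i]


def basic_term_theta_join(r, s, i=None, j=None):
    """Pairs (x, y) whose indexed subterms are equal."""
    return {(x, y) for x, y in product(r, s) if index(x, i) == index(y, j)}


A1 = {(fs("p", "q"), "a", "b", "c"), (fs("r"), "d", "e", "f"), (fs("s", "t"), "d", "g", "h")}
A2 = {(fs("p", "q"), "a", "b", "c"), (fs("r"), "d", "e", "f"), (fs("s", "t"), "g", "h", "k")}
B = {(fs("p", "q"), "a", "b", "c"), (fs("p", "q"), "d", "e", "f")}

by_title = basic_term_theta_join(A1, B, TITLE, TITLE)              # Example 1
by_coauthors = basic_term_theta_join(A2, B, COAUTHORS, COAUTHORS)  # Example 2
by_publication = basic_term_theta_join(A2, B)                      # Example 3

print(len(by_title), len(by_coauthors), len(by_publication))
```

Because frozensets compare by content, the Coauthors join treats {p, q} and {q, p} as the same subterm, which is the behaviour expected of a set-valued abstraction.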
4. Basic Term Proximity Join
Replacing Equality with Proximity
Proximity
Distance Threshold
Properties of Proximity
Proximity is a dependency relation, but not an equivalence relation, i.e. proximity is reflexive and symmetric but not necessarily transitive. Due to:
Normalisation
• dist is not constrained to have an upper bound.
• Some normalising function may be used, usually into the closed interval [0, 1].
• Or can normalise in feature space (e.g. normalising kernels).
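A toy Python sketch of these properties, assuming proximity means "distance no greater than a threshold". The numeric `dist`, the use of `<=` and the particular normalising function d / (1 + d) are all assumptions chosen for illustration. The talk does not fix any of them in the surviving text.

```python
def dist(x: float, y: float) -> float:
    """Toy distance on numbers; stands in for a distance between basic terms."""
    return abs(x - y)


def normalise(d: float) -> float:
    """One possible normalising function, mapping non-negative distances into [0, 1]."""
    return d / (1.0 + d)


def proximal(x: float, y: float, threshold: float) -> bool:
    """x and y are proximal when their distance is within the threshold."""
    return dist(x, y) <= threshold


t = 1.0
reflexive = proximal(0.0, 0.0, t)
symmetric = proximal(0.0, 1.0, t) == proximal(1.0, 0.0, t)
# 0 ~ 1 and 1 ~ 2 hold, but 0 and 2 are too far apart, so transitivity fails here.
transitive_here = not (proximal(0.0, 1.0, t) and proximal(1.0, 2.0, t)) or proximal(0.0, 2.0, t)

print(reflexive, symmetric, transitive_here)
print(normalise(0.0), normalise(1.0))
```

The counterexample shows why proximity cannot be treated as an equivalence: chains of near matches can link items that are not themselves near.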
Basic Term Proximity Join
• Basic Term Projection
• Basic Term θ-Restriction
• Basic Term Proximity Join
s is a basic subterm at type tree index i
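The three operators named on this slide might be sketched as follows. The formal definitions are not reproduced in the transcript, so everything here is an assumption: positional indexes stand in for type tree indexes, the join is written as a restriction of a Cartesian product, and `string_dist` (built on Python's `difflib`) replaces the kernel-derived distance used in the talk. The sample terms are abbreviated to three components.

```python
from difflib import SequenceMatcher
from itertools import product


def project(relation, i):
    """Basic term projection: the set of subterms found at index i."""
    return {term[i] for term in relation}


def restrict(relation, predicate):
    """Basic term θ-restriction: keep the terms that satisfy the predicate."""
    return {term for term in relation if predicate(term)}


def string_dist(s, t):
    """Stand-in distance in [0, 1]; not the kernel-derived distance of the talk."""
    return 1.0 - SequenceMatcher(None, s.lower(), t.lower()).ratio()


def proximity_join(r, s, i, j, dist, threshold):
    """Pairs (x, y) whose subterms at indexes i and j lie within the threshold."""
    pairs = set(product(r, s))
    return restrict(pairs, lambda xy: dist(xy[0][i], xy[1][j]) <= threshold)


TITLE = 1
A = {(frozenset({"Mitchell, T.", "Thrun, S."}), "Explanation-Based Learning", "1993")}
B = {
    (frozenset({"Tom Mitchell", "Sven Thrun"}), "Explanation based learning", "'93"),
    (frozenset({"J. W. Lloyd"}), "Logic for Learning", "2003"),
}

matches = proximity_join(A, B, TITLE, TITLE, string_dist, threshold=0.2)
print(sorted(project(B, TITLE)))
print(len(matches))
```

With an exact-equality join the two title spellings would not match. The threshold is what lets the differently formatted record through while still rejecting the unrelated one.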
5. Application
Proximity joins on bibliographic publications data
• Given the ground truth
where
• The goal is to reconstruct V as V’ by choosing an appropriate s
CORA Data Set
• Currently, ground truths for pairs of data sets are rare.
So, we choose with
• CORA-REFS publications data set (1881 instances)
• Data type is the one used in examples throughout this talk
• Example pair of instances that require an approximate join:
( { “Mitchell, T.”, “Thrun, S.” },
  “Explanation-Based Learning: ...”,
  “In Proceedings of the Tenth ...”,
  “1993” )
( { “Tom Mitchell”, “Sven Thrun” },
  “Explanation based learning: ...”,
  “Proceedings of the 10th ...”,
  “ ’93 ” )
Experiments on CORA data set
For each join:
1. Calculate pairwise distances between all basic terms in CORA
2. Construct a dendrogram
3. Calculate precision-recall at each node in the dendrogram, i.e. plot a point on the p-r chart for each node in the dendrogram
e.g. threshold = 120
Confusion Matrix
TP FN
FP TN
TP is no. pairs in same cluster that should be in the same cluster.
TN is no. pairs in different clusters that should be in different clusters.
FP is etc...
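The pairwise confusion matrix for one cut of the dendrogram can be sketched as below. The function names and the toy cluster labels are illustrative assumptions. FP and FN follow the usual pairwise reading (same cluster but should differ, different clusters but should match), which the slide leaves elided.

```python
from itertools import combinations


def pair_confusion(predicted, truth):
    """Count TP, FP, FN, TN over all unordered pairs of instances.

    predicted and truth map each instance id to a cluster label.
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall, defaulting to 1.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data: instances 1-3 are the same publication, 4 is a different one.
truth = {1: "x", 2: "x", 3: "x", 4: "y"}
predicted = {1: "c1", 2: "c1", 3: "c2", 4: "c2"}

tp, fp, fn, tn = pair_confusion(predicted, truth)
print(tp, fp, fn, tn, precision_recall(tp, fp, fn))
```

Repeating this for the clustering obtained at each node of the dendrogram gives one precision-recall point per node, as described in step 3.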
Proximity Joins on CORA Dataset
[Figure: precision-recall curves (axes: Precision, Recall) for proximity joins on Publication, Publication.Coauthors, Publication.Title and Publication.Venue]
• Distance derived from a kernel in the usual way
• Basic term kernel = default kernel for basic terms
• String kernel = p-spectrum kernel
• Default kernel = matching kernel
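A minimal sketch of the ingredients listed above, assuming "the usual way" refers to the kernel-induced distance d(x, y) = sqrt(k(x, x) - 2 k(x, y) + k(y, y)). The value p = 3 is an assumption, since the slide does not give it. The default kernel for basic terms that combines these pieces over tuples and sets is not shown.

```python
from collections import Counter
from math import sqrt


def p_spectrum_kernel(s, t, p=3):
    """Inner product of the counts of all length-p substrings of s and t."""
    grams_s = Counter(s[k:k + p] for k in range(len(s) - p + 1))
    grams_t = Counter(t[k:k + p] for k in range(len(t) - p + 1))
    return float(sum(n * grams_t[g] for g, n in grams_s.items()))


def matching_kernel(x, y):
    """1 when the two values are identical, 0 otherwise."""
    return 1.0 if x == y else 0.0


def kernel_distance(k, x, y):
    """Distance induced by kernel k: sqrt(k(x,x) - 2 k(x,y) + k(y,y))."""
    squared = k(x, x) - 2.0 * k(x, y) + k(y, y)
    return sqrt(max(squared, 0.0))  # guard against tiny negative rounding


d_same = kernel_distance(p_spectrum_kernel, "learning", "learning")
d_near = kernel_distance(p_spectrum_kernel, "Explanation-Based", "Explanation based")
d_far = kernel_distance(p_spectrum_kernel, "Explanation-Based", "Logic for Learning")
d_year = kernel_distance(matching_kernel, "1993", "'93")
print(d_same, d_near, d_far, d_year)
```

The matching kernel only distinguishes identical from non-identical values, so "1993" and "'93" are as far apart as any two different years. The spectrum kernel, by contrast, grades string similarity by shared substrings.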
And finally...
Conclusion
• Relational Algebra for Basic Terms
  • Combines the relational model and basic terms in a single formalism
• Basic Term Proximity-Join
  • Enables approximate querying and merging of basic terms
• Application to bibliographic data
  • Shows potential for data integration
❦ ❦ ❦
Default Kernel for Basic Terms

More Related Content

What's hot

Learning-based Data Cleaning
Learning-based Data CleaningLearning-based Data Cleaning
Learning-based Data Cleaning
Christian Stade-Schuldt
 
Data Structure & Algorithms | Computer Science
Data Structure & Algorithms | Computer ScienceData Structure & Algorithms | Computer Science
Data Structure & Algorithms | Computer Science
Transweb Global Inc
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
irjes
 
Binary tree and Binary search tree
Binary tree and Binary search treeBinary tree and Binary search tree
Binary tree and Binary search tree
Mayeesha Samiha
 
Binary tree
Binary tree Binary tree
Binary tree
Rajendran
 
Trees - Data structures in C/Java
Trees - Data structures in C/JavaTrees - Data structures in C/Java
Trees - Data structures in C/Java
geeksrik
 
computer notes - Data Structures - 13
computer notes - Data Structures - 13computer notes - Data Structures - 13
computer notes - Data Structures - 13
ecomputernotes
 
THREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREETHREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREE
Siddhi Shrivas
 
Hash table methods
Hash table methodsHash table methods
Hash table methods
unyil96
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using R
Knoldus Inc.
 
Information Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataInformation Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram Data
Gerard de Melo
 
Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.
Mehwish Alam
 
1.1 binary tree
1.1 binary tree1.1 binary tree
1.1 binary tree
Krish_ver2
 
Tree and Binary Search tree
Tree and Binary Search treeTree and Binary Search tree
Tree and Binary Search tree
Muhazzab Chouhadry
 
Data Structures
Data StructuresData Structures
Data Structures
Rahul Jamwal
 
Final-Report
Final-ReportFinal-Report
Final-Report
Ben Reichert
 
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMARIntroduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
BHARATH KUMAR
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
Mehwish Alam
 
Furnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree StructuresFurnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree Structures
ijceronline
 
Lecture 8 data structures and algorithms
Lecture 8 data structures and algorithmsLecture 8 data structures and algorithms
Lecture 8 data structures and algorithms
Aakash deep Singhal
 

What's hot (20)

Learning-based Data Cleaning
Learning-based Data CleaningLearning-based Data Cleaning
Learning-based Data Cleaning
 
Data Structure & Algorithms | Computer Science
Data Structure & Algorithms | Computer ScienceData Structure & Algorithms | Computer Science
Data Structure & Algorithms | Computer Science
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
Binary tree and Binary search tree
Binary tree and Binary search treeBinary tree and Binary search tree
Binary tree and Binary search tree
 
Binary tree
Binary tree Binary tree
Binary tree
 
Trees - Data structures in C/Java
Trees - Data structures in C/JavaTrees - Data structures in C/Java
Trees - Data structures in C/Java
 
computer notes - Data Structures - 13
computer notes - Data Structures - 13computer notes - Data Structures - 13
computer notes - Data Structures - 13
 
THREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREETHREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREE
 
Hash table methods
Hash table methodsHash table methods
Hash table methods
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using R
 
Information Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataInformation Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram Data
 
Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.Interactive Knowledge Discovery over Web of Data.
Interactive Knowledge Discovery over Web of Data.
 
1.1 binary tree
1.1 binary tree1.1 binary tree
1.1 binary tree
 
Tree and Binary Search tree
Tree and Binary Search treeTree and Binary Search tree
Tree and Binary Search tree
 
Data Structures
Data StructuresData Structures
Data Structures
 
Final-Report
Final-ReportFinal-Report
Final-Report
 
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMARIntroduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
Introduction of Data Structures and Algorithms by GOWRU BHARATH KUMAR
 
Framester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data HubFramester: A Wide Coverage Linguistic Linked Data Hub
Framester: A Wide Coverage Linguistic Linked Data Hub
 
Furnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree StructuresFurnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree Structures
 
Lecture 8 data structures and algorithms
Lecture 8 data structures and algorithmsLecture 8 data structures and algorithms
Lecture 8 data structures and algorithms
 

Viewers also liked

Adapting CARDIO for BOS
Adapting CARDIO for BOSAdapting CARDIO for BOS
Adapting CARDIO for BOS
Simon Price
 
Webs of People, Webs of Data
Webs of People, Webs of DataWebs of People, Webs of Data
Webs of People, Webs of Data
Simon Price
 
Двигатели серии Hja hjn Marathon-Regal
Двигатели серии Hja hjn  Marathon-RegalДвигатели серии Hja hjn  Marathon-Regal
Двигатели серии Hja hjn Marathon-Regal
Arve
 
Nature Locator
Nature LocatorNature Locator
Nature Locator
Simon Price
 
Co-designing Research IT and Research Data Services
Co-designing Research IT and Research Data ServicesCo-designing Research IT and Research Data Services
Co-designing Research IT and Research Data Services
Simon Price
 
NewsPatterns - visualisation layer of news feed mining
NewsPatterns - visualisation layer of news feed miningNewsPatterns - visualisation layer of news feed mining
NewsPatterns - visualisation layer of news feed mining
Simon Price
 
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
Simon Price
 
Managing Large-scale Multimedia Development Projects
Managing Large-scale Multimedia Development ProjectsManaging Large-scale Multimedia Development Projects
Managing Large-scale Multimedia Development Projects
Simon Price
 
Managing research data at Bristol
Managing research data at BristolManaging research data at Bristol
Managing research data at Bristol
Simon Price
 
Research IT at the University of Bristol
Research IT at the University of BristolResearch IT at the University of Bristol
Research IT at the University of Bristol
Simon Price
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic Web
Simon Price
 
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising ChinaBest of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
Simon Price
 
A Minimum Spanning Tree Approach of Solving a Transportation Problem
A Minimum Spanning Tree Approach of Solving a Transportation ProblemA Minimum Spanning Tree Approach of Solving a Transportation Problem
A Minimum Spanning Tree Approach of Solving a Transportation Problem
inventionjournals
 
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
inventionjournals
 
data.bris - Use case, role and functionality for CKAN adoption
data.bris - Use case, role and functionality for CKAN adoptiondata.bris - Use case, role and functionality for CKAN adoption
data.bris - Use case, role and functionality for CKAN adoption
Simon Price
 
Visualising China - historical photos of China
Visualising China - historical photos of ChinaVisualising China - historical photos of China
Visualising China - historical photos of China
Simon Price
 
Historical Photographs of China - the journey towards sustainability and utility
Historical Photographs of China - the journey towards sustainability and utilityHistorical Photographs of China - the journey towards sustainability and utility
Historical Photographs of China - the journey towards sustainability and utility
Simon Price
 
Supporting Big Data, Open Data, Data Analytics and Data Science
Supporting Big Data, Open Data, Data Analytics and Data ScienceSupporting Big Data, Open Data, Data Analytics and Data Science
Supporting Big Data, Open Data, Data Analytics and Data Science
Simon Price
 
Data Sharing and Standards
Data Sharing and StandardsData Sharing and Standards
Data Sharing and Standards
Simon Price
 

Viewers also liked (20)

Adapting CARDIO for BOS
Adapting CARDIO for BOSAdapting CARDIO for BOS
Adapting CARDIO for BOS
 
Webs of People, Webs of Data
Webs of People, Webs of DataWebs of People, Webs of Data
Webs of People, Webs of Data
 
Двигатели серии Hja hjn Marathon-Regal
Двигатели серии Hja hjn  Marathon-RegalДвигатели серии Hja hjn  Marathon-Regal
Двигатели серии Hja hjn Marathon-Regal
 
Nature Locator
Nature LocatorNature Locator
Nature Locator
 
Co-designing Research IT and Research Data Services
Co-designing Research IT and Research Data ServicesCo-designing Research IT and Research Data Services
Co-designing Research IT and Research Data Services
 
NewsPatterns - visualisation layer of news feed mining
NewsPatterns - visualisation layer of news feed miningNewsPatterns - visualisation layer of news feed mining
NewsPatterns - visualisation layer of news feed mining
 
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
Cost of Migrating Large-Scale Computer Assisted Learning (CAL) Software to We...
 
Managing Large-scale Multimedia Development Projects
Managing Large-scale Multimedia Development ProjectsManaging Large-scale Multimedia Development Projects
Managing Large-scale Multimedia Development Projects
 
Managing research data at Bristol
Managing research data at BristolManaging research data at Bristol
Managing research data at Bristol
 
Research IT at the University of Bristol
Research IT at the University of BristolResearch IT at the University of Bristol
Research IT at the University of Bristol
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic Web
 
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising ChinaBest of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
Best of Bristol Media City - MyMobileBristol, NatureLocator, Visualising China
 
A Minimum Spanning Tree Approach of Solving a Transportation Problem
A Minimum Spanning Tree Approach of Solving a Transportation ProblemA Minimum Spanning Tree Approach of Solving a Transportation Problem
A Minimum Spanning Tree Approach of Solving a Transportation Problem
 
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
Oscillation of Solutions to Neutral Delay and Advanced Difference Equations w...
 
чурсина
чурсиначурсина
чурсина
 
data.bris - Use case, role and functionality for CKAN adoption
data.bris - Use case, role and functionality for CKAN adoptiondata.bris - Use case, role and functionality for CKAN adoption
data.bris - Use case, role and functionality for CKAN adoption
 
Visualising China - historical photos of China
Visualising China - historical photos of ChinaVisualising China - historical photos of China
Visualising China - historical photos of China
 
Historical Photographs of China - the journey towards sustainability and utility
Historical Photographs of China - the journey towards sustainability and utilityHistorical Photographs of China - the journey towards sustainability and utility
Historical Photographs of China - the journey towards sustainability and utility
 
Supporting Big Data, Open Data, Data Analytics and Data Science
Supporting Big Data, Open Data, Data Analytics and Data ScienceSupporting Big Data, Open Data, Data Analytics and Data Science
Supporting Big Data, Open Data, Data Analytics and Data Science
 
Data Sharing and Standards
Data Sharing and StandardsData Sharing and Standards
Data Sharing and Standards
 

Similar to Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model database
Max Neunhöffer
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging data
Hang Dong
 
Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...
inscit2006
 
Phylogenetic Signal with Induction and non-Contradiction - V Berry
Phylogenetic Signal with Induction and non-Contradiction - V BerryPhylogenetic Signal with Induction and non-Contradiction - V Berry
Phylogenetic Signal with Induction and non-Contradiction - V Berry
Roderic Page
 
Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
Gan Keng Hoon
 
Cheminformatics: An overview
Cheminformatics: An overviewCheminformatics: An overview
Cheminformatics: An overview
subhasis banerjee
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
INRIA-OAK
 
Lesson11 transactions
Lesson11 transactionsLesson11 transactions
Lesson11 transactions
teddy demissie
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
Victoria López
 
Online Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsOnline Relation Alignment for Linked Datasets
Online Relation Alignment for Linked Datasets
Maria Koutraki
 
Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)
Evert Sandye Taasiringan
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria
Paulo Faria
 
Db fund
Db fundDb fund
Dbms fundamentals
Dbms fundamentalsDbms fundamentals
Dbms fundamentals
venkatme83
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
Holistic Benchmarking of Big Linked Data
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
WU (Vienna University of Economics and Business)
 
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbPAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
ratnapatil14
 
Dictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.pptDictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.ppt
Manimaran A
 
Cs341
Cs341Cs341
"Principal Component Analysis - the original paper" presentation @ Papers We ...
"Principal Component Analysis - the original paper" presentation @ Papers We ..."Principal Component Analysis - the original paper" presentation @ Papers We ...
"Principal Component Analysis - the original paper" presentation @ Papers We ...
Adrian Florea
 

Similar to Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms (20)

Complex queries in a distributed multi-model database
Complex queries in a distributed multi-model databaseComplex queries in a distributed multi-model database
Complex queries in a distributed multi-model database
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging data
 
Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...
 
Phylogenetic Signal with Induction and non-Contradiction - V Berry
Phylogenetic Signal with Induction and non-Contradiction - V BerryPhylogenetic Signal with Induction and non-Contradiction - V Berry
Phylogenetic Signal with Induction and non-Contradiction - V Berry
 
Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
 
Cheminformatics: An overview
Cheminformatics: An overviewCheminformatics: An overview
Cheminformatics: An overview
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
 
Lesson11 transactions
Lesson11 transactionsLesson11 transactions
Lesson11 transactions
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Online Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsOnline Relation Alignment for Linked Datasets
Online Relation Alignment for Linked Datasets
 
Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)Pertemuan 5_Relation Matriks_01 (17)
Pertemuan 5_Relation Matriks_01 (17)
 
2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria2014-mo444-practical-assignment-02-paulo_faria
2014-mo444-practical-assignment-02-paulo_faria
 
Db fund
Db fundDb fund
Db fund
 
Dbms fundamentals
Dbms fundamentalsDbms fundamentals
Dbms fundamentals
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbPAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
PAT.ppt bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
 
Dictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.pptDictionaries and Tolerant Retrieval.ppt
Dictionaries and Tolerant Retrieval.ppt
 
Cs341
Cs341Cs341
Cs341
 
"Principal Component Analysis - the original paper" presentation @ Papers We ...
"Principal Component Analysis - the original paper" presentation @ Papers We ..."Principal Component Analysis - the original paper" presentation @ Papers We ...
"Principal Component Analysis - the original paper" presentation @ Papers We ...
 

More from Simon Price

Adding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' ProblemsAdding Open Data Value to 'Closed Data' Problems
Adding Open Data Value to 'Closed Data' Problems
Simon Price
 
Citizen Science and Crowd-sourcing Biological Surveys
Citizen Science and Crowd-sourcing Biological SurveysCitizen Science and Crowd-sourcing Biological Surveys
Citizen Science and Crowd-sourcing Biological Surveys
Simon Price
 
Mining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeMining and Mapping the Research Landscape
Mining and Mapping the Research Landscape
Simon Price
 
A Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big DataA Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big Data
Simon Price
 
SubSift web services and workflows for profiling and comparing scientists and...
SubSift web services and workflows for profiling and comparing scientists and...SubSift web services and workflows for profiling and comparing scientists and...
SubSift web services and workflows for profiling and comparing scientists and...
Simon Price
 
SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...SubSift: a novel application of the vector space model to support the academi...
SubSift: a novel application of the vector space model to support the academi...
Simon Price
 
Code Club - a Fight Club inspired approach to software inspection and review
Code Club - a Fight Club inspired approach to software inspection and reviewCode Club - a Fight Club inspired approach to software inspection and review
Code Club - a Fight Club inspired approach to software inspection and review
Simon Price
 
Academic IT support for Data Science
Academic IT support for Data ScienceAcademic IT support for Data Science
Academic IT support for Data Science
Simon Price
 
Mobile Apps for Research Data Collection
Mobile Apps for Research Data CollectionMobile Apps for Research Data Collection
Mobile Apps for Research Data Collection
Simon Price
 
Clinical Experience Recorder
Clinical Experience RecorderClinical Experience Recorder
Clinical Experience Recorder
Simon Price
 

Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

  • 1. Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms. Simon Price and Peter Flach, ILP 2008
  • 2. Our Aim: Query heterogeneous data sources as if their data were conveniently held in a single relational database. Example data sources: • web pages • digital libraries • knowledge bases • Semantic Web • databases
  • 3. Outline of this paper: 1. Relational Joins (a quick review) 2. Basic Terms 3. Relational Joins for Basic Terms 4. Basic Term Proximity-Join 5. Application
  • 4. Contribution of this work: • Relational Algebra for Basic Terms • Basic Term Proximity-Join • Application to bibliographic data
  • 5. 1. Relational Joins (a quick review)
  • 6. θ-Join
      publications (publication#, title, venue, year): (1, a, b, c), (2, d, e, f)
      authors (author#, publication#, name): (1, 1, p), (2, 1, q), (3, 2, r)
      Joined rows (publication#, title, venue, year | author#, publication#, name): (1, a, b, c | 1, 1, p), (2, d, e, f | 3, 2, r), (1, a, b, c | 2, 1, q)
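
The join on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `theta_join` and the choice of lists of tuples as the relation representation are assumptions, and the join condition (matching `publication#` values) is inferred from the rows shown on the slide.

```python
from itertools import product

# Toy relations from the slide.
# publications: (publication#, title, venue, year)
publications = [(1, "a", "b", "c"), (2, "d", "e", "f")]
# authors: (author#, publication#, name)
authors = [(1, 1, "p"), (2, 1, "q"), (3, 2, "r")]


def theta_join(left, right, theta):
    """Return the concatenation of every pair of rows for which theta holds."""
    return [l + r for l, r in product(left, right) if theta(l, r)]


# Condition: publications.publication# = authors.publication#
joined = theta_join(publications, authors, lambda pub, auth: pub[0] == auth[1])
for row in joined:
    print(row)
```

Passing the condition as a predicate keeps the join general: equality is only one possible choice of θ.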
  • 7. 2. Basic Terms
  • 8. Basic Terms • Proposed by John Lloyd • Family of typed terms in higher-order logic • Based on Church’s simple theory of types • Data types representing: tuples; structures, e.g. trees and graphs; abstractions, e.g. sets and multisets (bags) • Basic Terms and the “individuals-as-terms” model. Reference: Lloyd, J. W.: Logic for Learning. Springer, New York (2003)
  • 9. Representing Individuals as Basic Terms 1. Define the basic type structure, e.g. an academic publication with the following basic type structure. 2. Transform data instances to basic terms of that type, e.g. a publication record from the CORA bibliographic database: ( { “Mitchell, T.”, “Thrun, S.” }, “Explanation-Based Learning: A Comparison of Symbolic and Neural Network Approaches.”, “In Proceedings of the Tenth International Conference on Machine Learning”, “1993” )
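
As a rough illustration of the two steps above, the publication term can be mimicked in Python as a typed tuple whose first component is a set. The class name `Publication` and the field names `coauthors`, `title`, `venue` and `year` are assumptions borrowed from labels used elsewhere in the slides; the type definition shown on the original slide is not reproduced in this transcript.

```python
from typing import FrozenSet, NamedTuple


class Publication(NamedTuple):
    """Assumed type structure: a tuple of (set of strings, string, string, string)."""
    coauthors: FrozenSet[str]
    title: str
    venue: str
    year: str


record = Publication(
    coauthors=frozenset({"Mitchell, T.", "Thrun, S."}),
    title="Explanation-Based Learning: A Comparison of Symbolic and Neural Network Approaches.",
    venue="In Proceedings of the Tenth International Conference on Machine Learning",
    year="1993",
)

# The coauthor component is a set, so the order of names does not matter.
same_record = Publication(
    frozenset({"Thrun, S.", "Mitchell, T."}), record.title, record.venue, record.year
)
print(record == same_record)
```

Using `frozenset` rather than `set` keeps the whole term hashable, so terms can themselves be collected into sets.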
  • 10. 3. Relational Joins for Basic Terms
  • 11. Upgrading Relational Joins for Basic Terms 1. Restate (a subset of) traditional relational algebra: (a) remove schematic metadata from the data itself; (b) make explicit the tuple item indexing function. 2. Replace sets of tuples (relations) with sets of basic terms (“basic term relations”). 3. Upgrade the indexing function to index all types of basic terms: (a) basic tuples (tuples of basic terms); (b) basic structures (e.g. lists and trees of basic terms); (c) basic abstractions (e.g. sets and multisets of basic terms); (d) any combination of the above three.
  • 12. Basic Term θ-Join (Example 1)
      A = { ({p, q}, a, b, c), ({r}, d, e, f), ({s, t}, d, g, h) }
      B = { ({p, q}, a, b, c), ({p, q}, d, e, f) }
      A ⋈_{Title = Title} B = { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ), ( ({r}, d, e, f), ({p, q}, d, e, f) ), ( ({s, t}, d, g, h), ({p, q}, d, e, f) ) }
  • 13. Basic Term θ-Join (Example 2)
      A = { ({p, q}, a, b, c), ({r}, d, e, f), ({s, t}, g, h, k) }
      B = { ({p, q}, a, b, c), ({p, q}, d, e, f) }
      A ⋈_{Coauthors = Coauthors} B = { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ), ( ({p, q}, a, b, c), ({p, q}, d, e, f) ) }
  • 14. Basic Term θ-Join (Example 3)
      A = { ({p, q}, a, b, c), ({r}, d, e, f), ({s, t}, g, h, k) }
      B = { ({p, q}, a, b, c), ({p, q}, d, e, f) }
      A ⋈_{Publication = Publication} B = { ( ({p, q}, a, b, c), ({p, q}, a, b, c) ) }
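
The three examples can be reproduced with a small Python sketch. It assumes that the indexing function can be modelled as a path of tuple positions (with the empty path selecting the whole term) and that basic terms can be modelled as nested tuples and frozensets; the names `subterm`, `basic_term_theta_join` and `term` are hypothetical and not taken from the paper.

```python
from itertools import product

# Assumed index paths into a publication term (coauthors, title, venue, year).
COAUTHORS = (0,)
TITLE = (1,)
PUBLICATION = ()  # empty path: the whole term


def subterm(t, path):
    """Follow a path of tuple positions down into a nested term."""
    for position in path:
        t = t[position]
    return t


def basic_term_theta_join(A, B, path_a, path_b):
    """Pair every s in A with every t in B whose indexed subterms are equal."""
    return {(s, t) for s, t in product(A, B)
            if subterm(s, path_a) == subterm(t, path_b)}


def term(authors, *rest):
    """Build a publication term; each character of `authors` is one author."""
    return (frozenset(authors), *rest)


B = {term("pq", "a", "b", "c"), term("pq", "d", "e", "f")}
A1 = {term("pq", "a", "b", "c"), term("r", "d", "e", "f"), term("st", "d", "g", "h")}
A2 = {term("pq", "a", "b", "c"), term("r", "d", "e", "f"), term("st", "g", "h", "k")}

example1 = basic_term_theta_join(A1, B, TITLE, TITLE)
example2 = basic_term_theta_join(A2, B, COAUTHORS, COAUTHORS)
example3 = basic_term_theta_join(A2, B, PUBLICATION, PUBLICATION)
print(len(example1), len(example2), len(example3))
```

Because sets compare by content, the join on coauthors in Example 2 matches `{p, q}` regardless of the order in which the authors were listed.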
  • 15. 4. Basic Term Proximity Join
  • 16. Replacing Equality with Proximity [the definitions labelled Proximity, Distance and Threshold, and the attribution following “Due to:”, are not preserved in this transcript]. Properties of Proximity: proximity is a dependency relation, but not an equivalence relation, i.e. it is reflexive and symmetric but not necessarily transitive. Normalisation: • dist is not constrained to have an upper bound. • Some normalising function may be used, usually into the closed interval [0, 1]. • Or can normalise in feature space (e.g. normalising kernels).
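
A small numeric sketch makes the stated properties concrete. It assumes that two items are proximate when their distance is at most the threshold (the slide's own formula is not preserved, so the use of `<=` is an assumption), uses absolute difference on numbers as a stand-in distance, and uses d / (1 + d) as one possible normalising function; none of these choices is claimed to be the authors'.

```python
def proximate(x, y, dist, threshold):
    """Assumed definition: x and y are proximate if dist(x, y) <= threshold."""
    return dist(x, y) <= threshold


def abs_dist(x, y):
    """Stand-in distance on numbers, used only for illustration."""
    return abs(x - y)


def normalise(d):
    """One possible normalisation of an unbounded distance into [0, 1)."""
    return d / (1.0 + d)


threshold = 1
print(proximate(1, 1, abs_dist, threshold))  # reflexive
print(proximate(1, 2, abs_dist, threshold) == proximate(2, 1, abs_dist, threshold))  # symmetric
# Not transitive: 1 is close to 2 and 2 is close to 3, but 1 is not close to 3.
print(proximate(1, 2, abs_dist, threshold),
      proximate(2, 3, abs_dist, threshold),
      proximate(1, 3, abs_dist, threshold))
print(normalise(0), normalise(1), normalise(1_000_000))
```

The failure of transitivity is why a proximity join cannot in general be treated as grouping terms into equivalence classes.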
  • 17. Basic Term Proximity Join • Basic Term Projection • Basic Term θ-Restriction • Basic Term Proximity Join. Note: s is a basic subterm at type tree index i.
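
The formal definitions on this slide are not preserved in the transcript, so the following is only a sketch of the general idea: an equality join in which the test on the indexed subterms is replaced by a distance threshold. The names `proximity_join` and `subterm`, the use of `<=`, and the Jaccard distance on sets are all assumptions made for illustration; the paper itself derives its distance from a kernel.

```python
from itertools import product


def subterm(t, path):
    """Follow a path of tuple positions down into a nested term."""
    for position in path:
        t = t[position]
    return t


def jaccard_distance(s, t):
    """Stand-in distance between two sets: 1 - |s & t| / |s | t|."""
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def proximity_join(A, B, path_a, path_b, dist, threshold):
    """Pair terms whose indexed subterms lie within `threshold` of each other."""
    return {(s, t) for s, t in product(A, B)
            if dist(subterm(s, path_a), subterm(t, path_b)) <= threshold}


COAUTHORS = (0,)
A = {(frozenset("pq"), "a"), (frozenset("pqr"), "b"), (frozenset("s"), "c")}
B = {(frozenset("pq"), "d")}

loose = proximity_join(A, B, COAUTHORS, COAUTHORS, jaccard_distance, 0.5)
strict = proximity_join(A, B, COAUTHORS, COAUTHORS, jaccard_distance, 0.0)
print(len(loose), len(strict))
```

With a threshold of zero and a distance that is zero only for equal items, the sketch reduces to the equality join of the earlier examples.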
  • 18. 5. Application
  • 19. Proximity joins on bibliographic publications data • Given the ground truth [formula not preserved] where [formula not preserved] • The goal is to reconstruct V as V’ by choosing an appropriate s
  • 20. CORA Data Set • Currently, ground truths for pairs of data sets are rare. So, we choose [formula not preserved] with [formula not preserved] • CORA-REFS publications data set (1881 instances) • Data type is the one used in examples throughout this talk • Example pair of instances that require approximate join:
      ( { “Mitchell, T.”, “Thrun, S.” }, “Explanation-Based Learning: ...”, “In Proceedings of the Tenth ...”, “1993” )
      ( { “Tom Mitchell”, “Sven Thrun” }, “Explanation based learning: ...”, “Proceedings of the 10th ...”, “ ’93 ” )
  • 21. Experiments on CORA data set. For each join: 1. Calculate pairwise distances between all basic terms in CORA. 2. Construct a dendrogram. 3. Calculate precision-recall at each node in the dendrogram, i.e. plot a point on the p-r chart for each node in the dendrogram (e.g. threshold = 120). Confusion matrix (TP, FN, FP, TN): TP is the number of pairs in the same cluster that should be in the same cluster. TN is the number of pairs in different clusters that should be in different clusters. FP is the number of pairs in the same cluster that should be in different clusters. FN is the number of pairs in different clusters that should be in the same cluster.
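
The pair-counting step can be sketched as follows for a single cut of the dendrogram. The sketch assumes that a cut is given as a list of predicted cluster labels, one per instance, alongside a list of ground-truth labels; the function names and the convention of returning 1.0 when a denominator is zero are assumptions, and the dendrogram construction itself is not shown.

```python
from itertools import combinations


def pair_confusion(predicted, truth):
    """Count TP, FP, FN, TN over all unordered pairs of instances."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_truth = truth[i] == truth[j]
        if same_predicted and same_truth:
            tp += 1
        elif same_predicted:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall; 1.0 is returned when a denominator is zero (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


truth = [0, 0, 0, 1, 1]      # hypothetical ground-truth clusters
predicted = [0, 0, 1, 1, 1]  # hypothetical clusters from one dendrogram cut
tp, fp, fn, tn = pair_confusion(predicted, truth)
print(tp, fp, fn, tn)
print(precision_recall(tp, fp, fn))
```

Repeating this for the clustering at each node of the dendrogram gives one precision-recall point per node.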
  • 22. Proximity Joins on CORA Dataset [precision-recall plot with curves for Publication, Publication.Coauthors, Publication.Title and Publication.Venue; axes: Precision, Recall] • Distance derived from a kernel in the usual way • Basic term kernel = default kernel for basic terms • String kernel = p-spectrum kernel • Default kernel = matching kernel
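
The kernels named here can be sketched in Python. The sketch assumes that "the usual way" refers to the standard kernel-induced distance d(x, y) = sqrt(k(x, x) - 2 k(x, y) + k(y, y)), and it implements an unweighted p-spectrum kernel that counts shared substrings of length p; the function names and the default p = 2 are illustrative choices, and the default kernel for whole basic terms is not reproduced.

```python
from collections import Counter
from math import sqrt


def p_spectrum(s, p):
    """Count every contiguous substring of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def p_spectrum_kernel(s, t, p=2):
    """Inner product of the two p-spectrum count vectors."""
    counts_s, counts_t = p_spectrum(s, p), p_spectrum(t, p)
    return sum(count * counts_t[u] for u, count in counts_s.items())


def matching_kernel(x, y):
    """1 if the two items are equal, otherwise 0."""
    return 1 if x == y else 0


def kernel_distance(x, y, k):
    """Distance induced by kernel k (assumed to be the 'usual' derivation)."""
    squared = k(x, x) - 2 * k(x, y) + k(y, y)
    return sqrt(max(squared, 0.0))  # guard against tiny negative rounding errors


print(p_spectrum_kernel("abab", "abab"), p_spectrum_kernel("abab", "ab"))
print(kernel_distance("abab", "ab", p_spectrum_kernel))
print(kernel_distance("1993", "1994", matching_kernel))
```

A distance of this kind is unbounded for the p-spectrum kernel, which is where the normalisation options mentioned earlier in the talk become relevant.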
  • 24. Conclusion • Relational Algebra for Basic Terms: combines relational model and basic terms in single formalism • Basic Term Proximity-Join: enables approximate querying and merging of basic terms • Application to bibliographic data: shows potential for data integration
  • 25. Default Kernel for Basic Terms