Schema Matching Using Machine Learning
Submitted By: Shruti Jadon, Tanvi Sahay, Ankita Mehta
Schema Matching
Schema - A skeleton that represents the logical view of the database.
Schema Matching - The process of matching attributes that are semantically related to each other and/or have similar properties. Matching can be done:
- Across two databases
- Across two tables in the same database
Schema Matching using an Example
SCHEMA 1 - Students: FName, LName, SSN, Major, Address
SCHEMA 2 - Grad_Students: Name, ID, Maj_Stream, House No, St Name
Schema Matching
◦ One To One Matching
◦ One to Many Matching
◦ Strict: St Name → Street Name
◦ Probabilistic: St Name → Street Name (92%), User Name (90%), Street No (42%)
Our Approach
◦ One to Many Mapping
▫ Custom Dictionary Preparation
▫ Attribute matching between the tables
◦ One to One Mapping
▫ Feature Engineering
▫ Feature Extraction
▫ Clustering
▫ Linguistic Matching
One To Many Mapping
◦ Uses a Global Dictionary that holds all possible mappings of the attributes.
◦ The key of the dictionary is the parent attribute (“Name”) and its values are all possible children (“First Name”, “Last Name”).
Sample Global Dictionary
Global Dictionary Keys: Name, PatientName
Global Dictionary Values:
FirstName, First_Name, FName, F_Name
Last_Name, LastName, LName, L_Name

Name   Address   Phone Number
xyz    abc       123-xxx-xxxx

FName   LName   SSN
lmn     cde     123xxxxxx
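The dictionary lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `one_to_many_matches`, the grouping of child-name variants into one list per child, and the choice to give both keys the same children are all hypothetical; only the keys and variant spellings come from the sample dictionary.

```python
# Hypothetical layout: each parent attribute maps to groups of child-name
# variants, one group per child attribute.
NAME_CHILDREN = [
    ["FirstName", "First_Name", "FName", "F_Name"],
    ["Last_Name", "LastName", "LName", "L_Name"],
]
GLOBAL_DICTIONARY = {"Name": NAME_CHILDREN, "PatientName": NAME_CHILDREN}


def one_to_many_matches(parent_columns, child_columns, dictionary=GLOBAL_DICTIONARY):
    """Map each parent column to the child columns it splits into.

    A parent is reported only when every child group has a variant
    present in child_columns.
    """
    child_set = set(child_columns)
    matches = {}
    for parent in parent_columns:
        groups = dictionary.get(parent)
        if groups is None:
            continue  # parent attribute is not covered by the dictionary
        found = []
        for variants in groups:
            hit = next((v for v in variants if v in child_set), None)
            if hit is None:
                break  # one child is missing, so there is no one-to-many match
            found.append(hit)
        else:
            matches[parent] = found
    return matches


result = one_to_many_matches(
    ["Name", "Address", "Phone Number"], ["FName", "LName", "SSN"]
)
print(result)
```

With the two sample tables, `Name` in the first table maps to `FName` and `LName` in the second, while `Address` and `Phone Number` stay unmatched because the dictionary has no entry for them.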
One To One Mapping
Feature Engineering and Extraction
Two tables
◦ Quality measures across various hospitals in the U.S. as the “Train Table”
◦ Quality measures across the best hospital in each U.S. state as the “Test Table”
Features
◦ 20 features created manually for each attribute
◦ Features based on data content, data type and constraints
Features
◦ Type - Float, Int, Char, Boolean, Date, Time
◦ Length (specified by user)
◦ Key - Primary Key, Foreign Key
◦ Unique
◦ Not Null
◦ Average Used Length
◦ Variance of Length
◦ Variance Coefficient of Length
◦ Numerical Average
◦ Numerical Variance
◦ Numerical Variance Coefficient
◦ Numerical Minimum
◦ Numerical Maximum
◦ Ratio of Whitespace Characters to total length
◦ Ratio of Special Characters to total length
◦ Average number of integers in attribute
◦ Average number of characters in attribute
◦ Average number of hyphens in attribute
◦ Average number of brackets in attribute
◦ Average number of backslashes in attribute
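A few of the content-based features listed above could be computed per attribute roughly as shown here. This is a sketch, not the authors' code: the function name `column_features`, the dictionary keys, and the reading of "Variance Coefficient" as standard deviation divided by mean are assumptions, and only a subset of the 20 features is covered.

```python
import statistics


def column_features(values):
    """Compute a few content-based features for one attribute (column).

    `values` is a list of strings; None marks a missing value.
    """
    present = [v for v in values if v is not None]
    lengths = [len(v) for v in present]
    total_chars = sum(lengths)
    mean_len = statistics.fmean(lengths) if lengths else 0.0
    var_len = statistics.pvariance(lengths) if lengths else 0.0

    def ratio(predicate):
        # Share of all characters in the column that satisfy `predicate`.
        if total_chars == 0:
            return 0.0
        return sum(predicate(ch) for v in present for ch in v) / total_chars

    return {
        "unique": len(set(present)) == len(present),
        "not_null": len(present) == len(values),
        "avg_used_length": mean_len,
        "variance_of_length": var_len,
        # Assumed definition: standard deviation divided by the mean.
        "variance_coefficient_of_length": (var_len ** 0.5) / mean_len if mean_len else 0.0,
        "whitespace_ratio": ratio(str.isspace),
        "special_char_ratio": ratio(lambda ch: not ch.isalnum() and not ch.isspace()),
        "avg_hyphens": sum(v.count("-") for v in present) / len(present) if present else 0.0,
    }


features = column_features(["2016-01-01", "2016-12-31", "2017-03-15"])
print(features)
```

Each attribute then becomes one numeric vector, which is what the clustering step operates on; attribute names are not part of these features.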
Clustering and Linguistic Matching: Centroid Method
◦ Cluster the attributes of the Train Table.
◦ Assign each Test Table attribute to the cluster whose centroid is closest to it.
◦ Perform one to one matching using the edit distance between attribute names.
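The centroid assignment step can be illustrated with a small sketch. The slides do not name the clustering algorithm or the distance measure, so Euclidean distance, the helper names `centroid` and `nearest_cluster`, and the toy two-dimensional feature vectors are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def nearest_cluster(vector, centroids):
    """Return the label of the centroid closest to `vector` (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy 2-D feature vectors (hypothetical): [avg_used_length, avg_hyphens].
train_clusters = {
    "dates": [[10.0, 2.0], [10.0, 2.0], [9.0, 2.0]],
    "small_numerics": [[2.0, 0.0], [3.0, 0.0]],
}
centroids = {label: centroid(vs) for label, vs in train_clusters.items()}

# A test attribute that looks like a date column lands in the "dates" cluster.
assigned = nearest_cluster([10.0, 2.0], centroids)
print(assigned)
```

Because `min` always returns some label, every test attribute is assigned to a cluster, even when it is far from all centroids.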
Train Attribute Clustering
Unclustered Attributes of Train Table:
tr_ehr_use_measure_description
tr_fuh_measure_end_date
tr_hospital_name
tr_hbips_2_measure_description
tr_start_date
tr_sub_1_percentage
tr_end_date
tr_fuh_30_percentage
tr_peoc_measure_description
tr_tob_2_percentage
tr_flu_season_start_date
Cluster 1 (only char values):
tr_ehr_use_measure_description
tr_hospital_name
tr_hbips_2_measure_description
tr_peoc_measure_description
Cluster 2 (all dates):
tr_start_date
tr_end_date
tr_flu_season_start_date
tr_fuh_measure_end_date
Cluster 3 (only small numerics):
tr_tob_2_percentage
tr_fuh_30_percentage
tr_sub_1_percentage
Test Attribute Matching (Centroid distance)
Unclustered Attributes of Test Table:
ts_s_flu_season_start_date
ts_s_fuh_30_percentage
ts_s_sub_1_percentage
ts_s_fuh_measure_end_date
Cluster 2 (all dates):
tr_start_date
tr_end_date
tr_flu_season_start_date
tr_fuh_measure_end_date
ts_s_flu_season_start_date
ts_s_fuh_measure_end_date
Cluster 3 (only small numerics):
tr_tob_2_percentage
tr_fuh_30_percentage
tr_sub_1_percentage
ts_s_fuh_30_percentage
ts_s_sub_1_percentage
Linguistic Matching
Test Attribute               Train Attribute            Edit Distance   Percent Match
ts_s_flu_season_start_date   tr_start_date              15              96.6
ts_s_flu_season_start_date   tr_end_date                14              96.9
ts_s_flu_season_start_date   tr_flu_season_start_date   3               99.6
ts_s_flu_season_start_date   tr_fuh_measure_end_date    8               97.0
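The edit-distance comparison between attribute names can be sketched with a standard Levenshtein implementation. The function names `edit_distance` and `best_match` are hypothetical, and the slides do not state which edit-distance variant was used or how "Percent Match" is derived from it, so the percentage is not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def best_match(test_attribute, train_attributes):
    """Pick the train attribute with the smallest edit distance."""
    return min(train_attributes, key=lambda name: edit_distance(test_attribute, name))


cluster = [
    "tr_start_date",
    "tr_end_date",
    "tr_flu_season_start_date",
    "tr_fuh_measure_end_date",
]
print(best_match("ts_s_flu_season_start_date", cluster))
```

Restricting the comparison to names inside one cluster keeps the number of string comparisons small and avoids matching attributes whose data look nothing alike.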
Clustering and Linguistic Matching: Combined Method
◦ Combine Train and Test attributes together and cluster them
◦ Linguistically match test attributes with train attributes lying in the same cluster
Combined Attribute Clustering
All Unclustered Attributes:
ts_s_flu_season_start_date
ts_s_fuh_30_percentage
tr_sub_1_percentage
tr_flu_season_start_date
ts_s_sub_1_percentage
ts_s_fuh_measure_end_date
tr_fuh_30_percentage
tr_fuh_measure_end_date
ts_s_address
Cluster 1 (all dates):
tr_flu_season_start_date
tr_fuh_measure_end_date
ts_s_flu_season_start_date
ts_s_fuh_measure_end_date
Cluster 2 (only small numerics):
ts_s_fuh_30_percentage
ts_s_sub_1_percentage
tr_fuh_30_percentage
tr_sub_1_percentage
Cluster 3:
ts_s_address
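The combined method's matching step can be sketched as below, using the cluster memberships from the example above. This is an assumption-laden illustration: the function `match_within_clusters`, the use of the `tr_`/`ts_` name prefixes to tell train from test attributes, and Levenshtein distance as the linguistic measure are choices made here, not details confirmed by the slides.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def match_within_clusters(cluster_of, distance=edit_distance):
    """Match each test attribute to the closest train attribute in its cluster.

    `cluster_of` maps attribute name -> cluster label. Test attributes whose
    cluster contains no train attribute are mapped to None.
    """
    members = defaultdict(list)
    for name, label in cluster_of.items():
        members[label].append(name)

    matches = {}
    for names in members.values():
        train = [n for n in names if n.startswith("tr_")]
        for test in (n for n in names if n.startswith("ts_")):
            matches[test] = min(train, key=lambda t: distance(test, t)) if train else None
    return matches


cluster_of = {
    "tr_flu_season_start_date": 1, "tr_fuh_measure_end_date": 1,
    "ts_s_flu_season_start_date": 1, "ts_s_fuh_measure_end_date": 1,
    "tr_fuh_30_percentage": 2, "tr_sub_1_percentage": 2,
    "ts_s_fuh_30_percentage": 2, "ts_s_sub_1_percentage": 2,
    "ts_s_address": 3,
}
matches = match_within_clusters(cluster_of)
print(matches)
```

Here `ts_s_address` sits alone in Cluster 3, so it is reported with no match instead of being forced onto a train attribute.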
Trade-Offs between the two methods
◦ In the Centroid method, each test attribute is forced to map to at least one train attribute, whereas in the Combined method a test attribute may match no train attribute.
◦ The Centroid method can be used for selective schema matching, whereas the Combined method can be used for schema merging.
Conclusion - Previous Work
◦ SemInt: Cluster train attributes based on only data; predict test attributes
◦ AutoMatch: Global Dictionary for one to one mapping
◦ iMAP: Custom functions for one to one and one to many mapping
◦ Corpus Based: Learn on train attributes and predict test attributes based on both linguistics and data
◦ CUPID: Exploit schema structure to create a tree
Conclusion - Our Approach
◦ In this project, we implemented schema matching by clustering train attributes based on only data and then performing intra-cluster matching based on linguistic similarity between train and test attributes.
◦ We also prepared a global dictionary for one to many and many to one mappings.
◦ Preparing the dictionary requires a domain expert and is the only point in the process where external intervention is required.
THANK YOU!

Editor's Notes

  1. The dictionary will be extended over time.