Schema matching using machine learning

Submitted By:
Shruti Jadon, Tanvi Sahay, Ankita Mehta

Schema Matching
Schema - A skeleton that represents the logical view
of the database.
Schema Matching - The method of matching those
attributes which are semantically related to each
other and/or portray similar properties.
- Across two databases
- Across two tables in the same database

Schema Matching using an Example
SCHEMA 1 - Students: FName, LName, SSN, Major, Address
SCHEMA 2 - Grad_Students: Name, ID, Maj_Stream, House No, St Name

Schema Matching
◦ Cardinality: One to One Matching, One to Many Matching
◦ Strict: St Name → Street Name
◦ Probabilistic: St Name → Street Name (92%), User Name (90%), Street No (42%)

Our Approach
◦ One to Many Mapping
▫ Custom Dictionary Preparation
▫ Attribute matching between the tables
◦ One to One Mapping
▫ Feature Engineering
▫ Feature Extraction
▫ Clustering
▫ Linguistic Matching

One to Many Mapping
◦ Use of a Global Dictionary which has all possible
mappings of the attributes.
◦ The key of the dictionary is the parent
attribute ("Name") and its values are all possible
children ("First Name", "Last Name").

Sample Global Dictionary
Global Dictionary Keys:
Name, PatientName
Global Dictionary Values:
FirstName, First_Name, FName, F_Name
Last_Name, LastName, LName, L_Name

Table 1:
Name | Address | Phone Number
xyz  | abc     | 123-xxx-xxxx

Table 2:
FName | LName | SSN
lmn   | cde   | 123xxxxxx
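
The dictionary lookup described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `one_to_many_matches` and the layout of `GLOBAL_DICTIONARY` (each parent key mapping to groups of child-name variants, with both keys sharing the same groups) are assumptions made for the example.

```python
# Hypothetical layout: each parent key maps to groups of child-name variants.
CHILD_GROUPS = [
    {"FirstName", "First_Name", "FName", "F_Name"},
    {"Last_Name", "LastName", "LName", "L_Name"},
]
GLOBAL_DICTIONARY = {"Name": CHILD_GROUPS, "PatientName": CHILD_GROUPS}


def one_to_many_matches(parent_columns, child_columns, dictionary=GLOBAL_DICTIONARY):
    """Map each parent column to the child columns the dictionary lists for it."""
    matches = {}
    for parent in parent_columns:
        groups = dictionary.get(parent)
        if groups is None:
            continue  # parent attribute is not a dictionary key
        children = [c for c in child_columns if any(c in g for g in groups)]
        if children:
            matches[parent] = children
    return matches


table_1 = ["Name", "Address", "Phone Number"]
table_2 = ["FName", "LName", "SSN"]
print(one_to_many_matches(table_1, table_2))  # {'Name': ['FName', 'LName']}
```

Attributes that are not dictionary keys (Address, Phone Number) are left for the one to one matching step.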

Feature Engineering and Extraction
Two tables
◦ Quality measures across various hospitals in the U.S., used as the "Train Table"
◦ Quality measures for the best hospital in each U.S. state, used as the
"Test Table"
Features
◦ 20 features of each attribute created manually
◦ Features created based on data content, data type and
constraints

Features
◦ Type - Float, Int, Char, Boolean, Date, Time
◦ Length (specified by user)
◦ Key - Primary Key, Foreign Key
◦ Unique
◦ Not Null
◦ Average Used Length
◦ Variance of Length
◦ Variance Coefficient of Length
◦ Numerical Average
◦ Numerical Variance
◦ Numerical Variance Coefficient
◦ Numerical Minimum
◦ Numerical Maximum
◦ Ratio of whitespace characters to total length
◦ Ratio of special characters to total length
◦ Average number of integers in attribute
◦ Average number of characters in attribute
◦ Average number of hyphens in attribute
◦ Average number of brackets in attribute
◦ Average number of backslashes in attribute
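
A few of the content-based features can be computed as in the sketch below. It covers only a subset of the 20 features and makes several assumptions the slides do not confirm: the function name `extract_features` is hypothetical, "variance coefficient" is taken to mean standard deviation divided by mean, and "integers" is read as digit characters.

```python
from statistics import mean, pvariance


def extract_features(values):
    """Compute a subset of the content-based features for one attribute (column)."""
    texts = [str(v) for v in values if v is not None]
    if not texts:
        raise ValueError("attribute has no non-null values")
    lengths = [len(t) for t in texts]
    total_length = sum(lengths)
    avg_len = mean(lengths)
    var_len = pvariance(lengths)
    return {
        "unique": len(set(texts)) == len(texts),
        "not_null": len(texts) == len(values),
        "avg_used_length": avg_len,
        "variance_of_length": var_len,
        # Assumption: variance coefficient = standard deviation / mean.
        "variance_coeff_of_length": (var_len ** 0.5) / avg_len if avg_len else 0.0,
        "whitespace_ratio": (
            sum(ch.isspace() for t in texts for ch in t) / total_length
            if total_length else 0.0
        ),
        # Assumption: "integers" means digit characters per value.
        "avg_digits": mean(sum(ch.isdigit() for ch in t) for t in texts),
        "avg_hyphens": mean(t.count("-") for t in texts),
    }


features = extract_features(["2015-10-01", "2016-03-31"])
print(features["avg_used_length"], features["avg_hyphens"])
```

Each attribute then becomes a numeric vector (one entry per feature) that a clustering step can consume.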

Clustering and Linguistic Matching - Centroid Method
◦ Cluster attributes in Train Tables.
◦ Assign each test table attribute to the cluster whose
centroid is closest to it.
◦ Perform one to one matching using edit
distance between attribute names.
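
The centroid assignment step can be sketched as follows. The slides do not name the clustering algorithm or the distance measure, so this example assumes the train clusters are already given and uses Euclidean distance; the function names, cluster labels and two-dimensional feature vectors are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_nearest_cluster(test_vector, clusters):
    """Return the id of the train cluster whose centroid is closest (Euclidean)."""
    centroids = {cid: centroid(vectors) for cid, vectors in clusters.items()}
    return min(centroids, key=lambda cid: math.dist(test_vector, centroids[cid]))


# Toy feature vectors (made up for illustration): [average length, digit ratio]
train_clusters = {
    "dates": [[10.0, 0.80], [10.0, 0.78]],
    "small_numerics": [[4.0, 0.90], [5.0, 0.95]],
}
print(assign_to_nearest_cluster([9.5, 0.79], train_clusters))  # dates
```

Because `min` always returns some cluster, every test attribute is assigned to one, however far it is from the nearest centroid.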

Train Attribute Clustering
Unclustered Attributes of Train Table
tr_ehr_use_measure_description
tr_fuh_measure_end_date
tr_hospital_name
tr_hbips_2_measure_description
tr_start_date
tr_sub_1_percentage
tr_end_date
tr_fuh_30_percentage
tr_peoc_measure_description
tr_tob_2_percentage
tr_flu_season_start_date

Cluster 1 (only char values)
tr_ehr_use_measure_description
tr_hospital_name
tr_hbips_2_measure_description
tr_peoc_measure_description

Cluster 2 (all dates)
tr_start_date
tr_end_date

Cluster 3 (only small numerics)
tr_tob_2_percentage
tr_sub_1_percentage

Test Attribute Matching
Unclustered Attributes of Test Table
ts_s_flu_season_start_date
ts_s_fuh_30_percentage
ts_s_sub_1_percentage
ts_s_fuh_measure_end_date

Train clusters compared by centroid distance
tr_start_date, tr_end_date
tr_tob_2_percentage, tr_sub_1_percentage

Linguistic Matching
Test Attribute             | Train Attribute          | Edit Distance | Percent Match
ts_s_flu_season_start_date | tr_start_date            | 15            | 96.6
ts_s_flu_season_start_date | tr_end_date              | 14            | 96.9
ts_s_flu_season_start_date | tr_flu_season_start_date | 3             | 99.6
ts_s_flu_season_start_date | tr_fuh_measure_end_date  | 8             | 97.0
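
Edit-distance matching between attribute names can be sketched as below. The slides do not say which edit-distance variant was used, so this assumes the standard Levenshtein distance (insertions, deletions and substitutions each cost 1); it is not guaranteed to reproduce every value in the table, and the Percent Match formula is not given, so it is not computed here. The helper name `best_match` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def best_match(test_name, train_names):
    """Pick the train attribute whose name has the smallest edit distance."""
    return min(train_names, key=lambda name: edit_distance(test_name, name))


candidates = [
    "tr_start_date",
    "tr_end_date",
    "tr_flu_season_start_date",
    "tr_fuh_measure_end_date",
]
print(best_match("ts_s_flu_season_start_date", candidates))
```

Under this definition the closest candidate is tr_flu_season_start_date, at distance 3.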

Clustering and Linguistic Matching - Combined Method
◦ Combine Train and Test attributes together
and cluster them.
◦ Linguistically match test attributes with train
attributes lying in the same cluster.
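
The grouping step of the combined method can be sketched as follows, assuming the joint clustering has already produced a cluster id for every attribute. The function name `candidates_by_cluster` and the cluster ids are hypothetical; the `tr_` and `ts_` prefixes follow the naming used on the slides.

```python
from collections import defaultdict


def candidates_by_cluster(cluster_of):
    """For each test attribute, list the train attributes in the same cluster.

    cluster_of maps an attribute name to its cluster id; train attributes start
    with 'tr_' and test attributes with 'ts_'.
    """
    train_by_cluster = defaultdict(list)
    for name, cluster_id in cluster_of.items():
        if name.startswith("tr_"):
            train_by_cluster[cluster_id].append(name)
    return {
        name: train_by_cluster.get(cluster_id, [])
        for name, cluster_id in cluster_of.items()
        if name.startswith("ts_")
    }


# Cluster ids below are made up for illustration.
cluster_of = {
    "tr_start_date": 2,
    "tr_end_date": 2,
    "tr_sub_1_percentage": 3,
    "ts_s_flu_season_start_date": 2,
    "ts_s_address": 4,
}
print(candidates_by_cluster(cluster_of))
```

A test attribute that shares its cluster with no train attribute gets an empty candidate list, so it matches nothing; the remaining candidates would then be ranked linguistically.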

Combined Attribute Clustering
All Unclustered Attributes
tr_sub_1_percentage
ts_s_address

Clustered Attributes
tr_sub_1_percentage

Cluster 3
ts_s_address

Trade-Offs between the two methods
◦ In the Centroid method, each test attribute is forced to
map to at least one train attribute, whereas in the
Combined method a test attribute may match
no train attribute at all.
◦ The Centroid method can be used for selective schema
matching, whereas the Combined method can be used for
schema merging.

Conclusion - Previous Work
◦ SemInt: Cluster train attributes based on only data, then predict test attributes.
◦ AutoMatch: Global Dictionary for one to one mapping.
◦ iMAP: Custom functions for one to one and one to many mapping.
◦ Corpus Based: Learn on train attributes and predict test attributes based on
both linguistics and data.
◦ CUPID: Exploit schema structure to create a tree.

Conclusion - Our Approach
◦ In this project, we implemented schema matching by clustering
train attributes based on only data and then performing intra-
cluster matching based on linguistic similarity between train and
test attributes.
◦ We also prepared a global dictionary for one to many and many
to one mappings.
◦ Preparation of the dictionary will require a domain expert and is
the only point in the process where external intervention will be
required.
