This document describes an approach to schema matching using machine learning. It discusses using one-to-many and one-to-one mapping between schemas. For one-to-many mapping, it uses a custom global dictionary. For one-to-one mapping, it performs feature engineering, extracts features from attributes, clusters attributes, and uses linguistic matching between clusters. It compares centroid-based and combined clustering methods, and concludes by discussing its approach to clustering train attributes based on data and performing intra-cluster linguistic matching between train and test attributes, along with using a global dictionary for mappings.
2. Schema Matching
Schema - A skeleton that represents the logical view of the database.
Schema Matching - The method of matching those attributes which are semantically related to each other and/or portray similar properties.
- Across two databases
- Across two tables in the same database
4. Schema Matching
◦ One to One Matching
◦ One to Many Matching
◦ Strict
▫ St Name → Street Name
◦ Probabilistic
▫ St Name → Street Name (92%)
▫ St Name → User Name (90%)
▫ St Name → Street No (42%)
5. Our Approach
◦ One to Many Mapping
▫ Custom Dictionary Preparation
▫ Attribute matching between the tables
◦ One to One Mapping
▫ Feature Engineering
▫ Feature Extraction
▫ Clustering
▫ Linguistic Matching
7. One to Many Mapping
◦ Use of a Global Dictionary which has all possible mappings of the attributes.
◦ The key of the dictionary is the parent attribute (“Name”) and its values are all possible children (“First Name”, “Last Name”).
8. Sample Global Dictionary
Global Dictionary Keys:
Name, PatientName
Global Dictionary Values:
FirstName, First_Name, FName, F_Name
Last_Name, LastName, LName, L_Name
First table:
Name   Address  Phone Number
xyz    abc      123-xxx-xxxx
Second table:
FName  LName    SSN
lmn    cde      123xxxxxx
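The dictionary lookup above could be sketched roughly as follows. This is only an illustration: the slides do not show the actual data structure, so the nested-list layout (one list of spelling variants per child role) and the function name `one_to_many` are assumptions.

```python
# Assumed layout: parent attribute -> list of child roles,
# each child role being a list of accepted spelling variants.
FIRST = ["FirstName", "First_Name", "FName", "F_Name"]
LAST = ["Last_Name", "LastName", "LName", "L_Name"]
GLOBAL_DICTIONARY = {
    "Name": [FIRST, LAST],
    "PatientName": [FIRST, LAST],
}


def one_to_many(parent_columns, child_columns, dictionary):
    """Map each parent column to the child columns that together represent it.

    A parent is mapped only if every child role listed in the dictionary
    is present in the other table under one of its known variants.
    """
    mappings = {}
    for parent in parent_columns:
        roles = dictionary.get(parent)
        if not roles:
            continue  # parent attribute is not in the dictionary
        children = []
        for variants in roles:
            hit = next((c for c in child_columns if c in variants), None)
            if hit is None:
                break  # one child role is missing, so no mapping
            children.append(hit)
        else:
            mappings[parent] = children
    return mappings


first_table = ["Name", "Address", "Phone Number"]
second_table = ["FName", "LName", "SSN"]
print(one_to_many(first_table, second_table, GLOBAL_DICTIONARY))
```

With the two sample tables, only "Name" has all of its children present in the second table, so it is the only parent that receives a mapping.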
10. Feature Engineering and Extraction
Two tables
◦ Quality Measures across various hospitals in the U.S. as “Train Table”
◦ Quality Measures across the best hospital in each state of the U.S. as “Test Table”
Features
◦ 20 features of each attribute created manually
◦ Features created based on data content, data type and constraints
11. Features
◦ Type - Float, Int, Char, Boolean, Date, Time
◦ Length (specified by user)
◦ Key - Primary Key, Foreign Key
◦ Unique
◦ Not Null
◦ Average Used Length
◦ Variance of Length
◦ Variance Coefficient of Length
◦ Numerical Average
◦ Numerical Variance
◦ Numerical Variance Coefficient
◦ Numerical Minimum
◦ Numerical Maximum
◦ Ratio of whitespace characters to total length
◦ Ratio of special characters to total length
◦ Average number of integers in attribute
◦ Average number of characters in attribute
◦ Average number of hyphens in attribute
◦ Average number of brackets in attribute
◦ Average number of backslashes in attribute
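A few of the content-based features could be computed along these lines. This is a minimal sketch, not the project's code: the slides give only feature names, so the exact definitions here (population variance, "variance coefficient" read as standard deviation divided by mean, "special character" read as anything that is neither alphanumeric nor whitespace) and the function name `extract_features` are assumptions.

```python
from statistics import mean, pvariance


def extract_features(values):
    """Compute a subset of the features for one attribute.

    `values` is the list of raw cell values of the attribute, as strings.
    All definitions below are assumed readings of the feature names.
    """
    lengths = [len(v) for v in values]
    total_length = sum(lengths) or 1  # avoid division by zero
    avg_len = mean(lengths)
    var_len = pvariance(lengths)
    chars = [c for v in values for c in v]
    return {
        "avg_used_length": avg_len,
        "variance_of_length": var_len,
        # assumed: coefficient of variation = std / mean
        "variance_coefficient_of_length": (var_len ** 0.5) / avg_len if avg_len else 0.0,
        "whitespace_ratio": sum(c.isspace() for c in chars) / total_length,
        # assumed: special = neither alphanumeric nor whitespace
        "special_char_ratio": sum(
            not c.isalnum() and not c.isspace() for c in chars
        ) / total_length,
        "avg_digits": mean(sum(c.isdigit() for c in v) for v in values),
        "avg_hyphens": mean(v.count("-") for v in values),
    }


print(extract_features(["2015-01-01", "2016-12-31"]))
```

Each attribute then becomes one numeric vector, which is what the clustering step operates on.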
12. Clustering and Linguistic Matching - Centroid Method
◦ Cluster attributes in the Train Table.
◦ Assign each test table attribute to the cluster whose centroid is closest to it.
◦ Perform one to one matching using edit distance between attribute names.
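The centroid method could be sketched as below. It assumes the train clusters are already available, uses Euclidean distance to the cluster means, and uses standard Levenshtein distance for the names; the slides do not state which distance variants were used, and the toy two-dimensional feature vectors and the name `centroid_match` are made up for illustration.

```python
from math import dist


def edit_distance(a, b):
    """Standard Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def centroid_match(train_clusters, train_features, test_name, test_vector):
    """Return (cluster id, best train attribute) for one test attribute."""
    centroids = {}
    for cluster_id, members in train_clusters.items():
        vectors = [train_features[m] for m in members]
        centroids[cluster_id] = [sum(col) / len(vectors) for col in zip(*vectors)]
    # Step 1: the test attribute joins the cluster with the closest centroid.
    nearest = min(centroids, key=lambda cid: dist(test_vector, centroids[cid]))
    # Step 2: within that cluster, pick the train name with the smallest edit distance.
    best = min(train_clusters[nearest], key=lambda tr: edit_distance(test_name, tr))
    return nearest, best


# Hypothetical feature vectors, for illustration only.
train_features = {
    "tr_start_date": [10.0, 0.2],
    "tr_end_date": [10.0, 0.2],
    "tr_flu_season_start_date": [10.0, 0.2],
    "tr_fuh_measure_end_date": [10.0, 0.2],
    "tr_tob_2_percentage": [4.0, 0.0],
    "tr_fuh_30_percentage": [4.5, 0.0],
    "tr_sub_1_percentage": [3.5, 0.0],
}
train_clusters = {
    "dates": ["tr_start_date", "tr_end_date",
              "tr_flu_season_start_date", "tr_fuh_measure_end_date"],
    "small_numerics": ["tr_tob_2_percentage", "tr_fuh_30_percentage",
                       "tr_sub_1_percentage"],
}
print(centroid_match(train_clusters, train_features,
                     "ts_s_flu_season_start_date", [10.0, 0.2]))
```

Because the nearest centroid always exists, every test attribute ends up with some train attribute, even when none of the candidates is a good match.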
14. Test Attribute Matching - Centroid distance
Unclustered Attributes of Test Table
◦ ts_s_flu_season_start_date
◦ ts_s_fuh_30_percentage
◦ ts_s_sub_1_percentage
◦ ts_s_fuh_measure_end_date
Cluster 2 (all dates)
◦ tr_start_date
◦ tr_end_date
◦ tr_flu_season_start_date
◦ tr_fuh_measure_end_date
◦ ts_s_flu_season_start_date
◦ ts_s_fuh_measure_end_date
Cluster 3 (only small numerics)
◦ tr_tob_2_percentage
◦ tr_fuh_30_percentage
◦ tr_sub_1_percentage
◦ ts_s_fuh_30_percentage
◦ ts_s_sub_1_percentage
15. Linguistic Matching
Test Attribute              Train Attribute           Edit Distance  Percent Match
ts_s_flu_season_start_date  tr_start_date             15             96.6
ts_s_flu_season_start_date  tr_end_date               14             96.9
ts_s_flu_season_start_date  tr_flu_season_start_date  3              99.6
ts_s_flu_season_start_date  tr_fuh_measure_end_date   8              97.0
16. Clustering and Linguistic Matching - Combined Method
◦ Combine Train and Test attributes together and cluster them.
◦ Linguistically match test attributes with train attributes lying in the same cluster.
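A rough sketch of the combined method follows. The slides name neither the clustering algorithm nor the linguistic measure, so this illustration swaps in a small deterministic k-means (farthest-point seeding) and `difflib.SequenceMatcher` as a stand-in string similarity; the feature vectors, the attribute `ts_s_hospital_name` and the name `combined_match` are hypothetical.

```python
from difflib import SequenceMatcher
from math import dist


def kmeans(points, k, iterations=20):
    """Tiny k-means with deterministic farthest-point seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def combined_match(train, test, k):
    """Cluster train and test attributes together, then match within clusters.

    `train` and `test` map attribute names to feature vectors.
    A test attribute whose cluster holds no train attribute maps to None.
    """
    names = list(train) + list(test)
    points = [train[n] for n in train] + [test[n] for n in test]
    label_of = dict(zip(names, kmeans(points, k)))
    matches = {}
    for t in test:
        peers = [tr for tr in train if label_of[tr] == label_of[t]]
        matches[t] = max(
            peers,
            key=lambda tr: SequenceMatcher(None, t, tr).ratio(),
            default=None,
        )
    return matches


# Hypothetical feature vectors, for illustration only.
train = {
    "tr_start_date": [10.0, 0.0],
    "tr_fuh_measure_end_date": [10.0, 0.1],
    "tr_sub_1_percentage": [0.0, 5.0],
}
test = {
    "ts_s_fuh_measure_end_date": [10.0, 0.05],
    "ts_s_hospital_name": [50.0, 50.0],
}
print(combined_match(train, test, k=3))
```

Here the hypothetical `ts_s_hospital_name` lands in a cluster of its own and therefore stays unmatched, which is the behaviour that distinguishes this method from the centroid one.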
18. Trade-Offs between the two methods
◦ In the Centroid method, each test attribute is forced to map to at least one train attribute, whereas in the Combined method a test attribute may match no train attribute.
◦ The Centroid method can be used for selective schema matching, whereas the Combined method can be used for schema merging.
19. Conclusion - Previous Work
◦ SemInt - Cluster train attributes based on only data; predict test attributes
◦ AutoMatch - Global Dictionary for one to one mapping
◦ IMAP - Custom Functions for one to one and one to many mapping
◦ Corpus Based - Learn on train attributes and predict test attributes based on both linguistics and data
◦ CUPID - Exploit schema structure to create a tree
20. Conclusion - Our Approach
◦ In this project, we implemented schema matching by clustering train attributes based on only data and then performing intra-cluster matching based on linguistic similarity between train and test attributes.
◦ We also prepared a global dictionary for one to many and many to one mappings.
◦ Preparation of the dictionary will require a domain expert and is the only point in the process where external intervention will be required.