SlideShare a Scribd company logo
Advanced IS Design
Lecture 2
Information Integration
Bing Liu, UIC 2
Introduction
 Much of the Web information integration research
has been focused on the integration of Web query
interfaces.
Bing Liu, UIC 3
Introduction
 In this part, we introduce
 some basic integration techniques, and
 Web query interface integration
Bing Liu, UIC 4
Database integration
 Information integration started with database
integration.
 Fundamental problem: schema matching,
which takes two (or more) database schemas
to produce a mapping between elements (or
attributes) of the two (or more) schemas that
correspond semantically to each other.
 Objective: merge the schemas into a single
global schema.
Bing Liu, UIC 5
Integrating two schemas
 Consider two schemas, S1 and S2,
representing two customer relations, Cust
and Customer.
S1 S2
Cust Customer
CNo CustID
CompName Company
FirstName Contact
LastName Phone
Bing Liu, UIC 6
Integrating two schemas (contd)
 Represent the mapping with a similarity
relation, , over the power sets of S1 and S2,
where each pair in  represents one element
of the mapping. E.g.,
Cust.CNo  Customer.CustID
Cust.CompName  Customer.Company
{Cust.FirstName, Cust.LastName} 
Customer.Contact
Bing Liu, UIC 7
Different types of matching
 Schema-level only matching: only schema
information is considered.
 Domain and instance-level only matching:
some instance data (data records) and
possibly the domain of each attribute are
used. This case is quite common on the Web.
 Integrated matching of schema, domain and
instance data: Both schema and instance
data (possibly domain information) are
available.
Bing Liu, UIC 8
Pre-processing for integration
 Tokenization: break an item into atomic words using
a dictionary, e.g.,
 Break “fromCity” into “from” and “city”
 Break “first-name” into “first” and “name”
 Expansion: expand abbreviations and acronyms to
their full words, e.g.,
 From “dept” to “departure”
 Stopword removal and stemming
 Standardization of words: Irregular words are
standardized to a single form, e.g.,
 From “colour” to “color”
Bing Liu, UIC 9
Schema-level matching
 Schema level matching algorithm relies on
information such as name, description, data
type, relationship type (e.g., part-of, is-a, etc),
constraints, etc.
 Match cardinality:
 1:1 match: one element in one schema matches
one element of another schema.
 1:m match: one element in one schema matches
m elements of another schema.
 m:n match: m elements in one schema matches n
elements of another schema.
Bing Liu, UIC 10
An example
 m:1 match is similar to 1:m match. m:n match
is complex, and there is little work on it.
Bing Liu, UIC 11
1. Linguistic approaches
 They are used to derive match candidates based on
names, comments or descriptions of schema
elements:
 Name match:
 Equality of names
 Synonyms
 Equality of hypernyms: A is a hypernym of B if B is a kind-of
A.
 Common sub-strings
 Cosine similarity
 User-provided name match: usually a domain dependent
match dictionary
Bing Liu, UIC 12
Linguistic approaches (contd)
 Description match: in many databases, there are
comments to schema elements, e.g.,
 Cosine similarity from information retrieval (IR) can
be used to compare comments after stemming and
stopword removal.
Bing Liu, UIC 13
Constraint based approaches
 Constraints such as data types, value ranges,
uniqueness, relationship types, etc.
 An equivalent or compatibility table for data types
and keys can be provided. E.g.,
 string  varchar, and (primiary key)  unique
 For structured schemas, hierarchical relationships
such as
 is-a and part-of
may be utilized to help matching.
 Note: On the Web, the constraint information is often
not available, but some can be inferred based on
the domain and instance data.
Constraint based approaches
Bing Liu, UIC 14
Bing Liu, UIC 15
Domain and instance-level matching
 In many applications, some data instances or
attribute domains may be available.
 Value characteristics are used in matching.
 Two different types of domains
 Simple domain: each value in the domain has only
a single component (the value cannot be
decomposed).
 Composite domain: each value in the domain
contains more than one component.
Bing Liu, UIC 16
Match of simple domains
 A simple domain can be of any type.
 If the data type information is not available (this is
often the case on the Web), the instance values can
often be used to infer types, e.g.,
 Words may be considered as strings
 Phone numbers can have a regular expression pattern.
 Data type patterns (in regular expressions) can be
learnt automatically or defined manually.
 E.g., used to identify such types as integer, real, string,
month, weekday, date, time, zip code, phone numbers, etc.
Bing Liu, UIC 17
Match of simple domains (contd)
 Matching methods:
 Data types are used as constraints.
 For numeric data, value ranges, averages, variances can
be computed and utilized.
 For categorical data: compare domain values.
 For textual data: cosine similarity.
 Schema element names as values: A set of values in a
schema match a set of attribute names of another schema.
E.g.,
 In one schema, the attribute color has the domain {yellow, red,
blue}, but in another schema, it has the element or attribute
names called yellow, red and blue (values are yes and no).
Bing Liu, UIC 18
Handling composite domains
 A composite domain is usually indicated by
its values containing delimiters, e.g.,
 punctuation marks (e.g., “-”, “/”, “_”)
 White spaces
 Etc.
 To detect a composite domain, these
delimiters can be used. They are also used to
split a composite value into simple values.
 Match methods for simple domains can then
be applied.
Bing Liu, UIC ACL-07 19
Combining similarities
 Similarities from many match indicators can be
combined to find the most accurate candidates.
 Given the set of similarity values, sim1(u, v), sim2(u,
v), …, simn(u, v), from comparing two schema
elements u (from S1) and v (from S2), many
combination methods can be used:
 Max:
 Weighted sum:
 Weighted average:
 Machine learning: E.g., each similarity as a feature.
 Many others.
Bing Liu, UIC 20
1:m match: two types
 Part-of type: each relevant schema element on the
many side is a part of the element on the one side.
E.g.,
 “Street”, “city”, and “state” in a schema are parts of
“address” in another schema.
 Is-a type: each relevant element on the many side is
a specialization of the schema element on the one
side. E.g.,
 “NoOfAdults” and “NoOfChildren” in one schema are
specializations of “NoOfPassengers” in another schema.
 Special methods are needed to identify these types
Bing Liu, UIC 21
Web information integration
 Many integration tasks,
 Integrating Web query interfaces (search forms)
 Integrating ontologies (taxonomy)
 Integrating extracted data
 …
 We only introduce query interface integration as it
has been studied extensively.
 Many web sites provide forms (called query interfaces) to
query their underlying databases (often called the deep
web as opposed to the surface Web that can be browsed).
Bing Liu, UIC 22
Global Query Interface
united.com airtravel.com delta.com hotwire.com
Bing Liu, UIC 24
Schema model of query interfaces
 In each domain, there is a set of essential concepts C
= {c1, c2, …, cn}, used in query interfaces to enable the
user to restrict the search.
 Each concept is often represented with a single
attribute.
 Each attribute is labeled with a word or phrase, called the
label of the attribute, which is visible to the user.
 Each attribute may also have a set of possible values, its
domain.
Bing Liu, UIC 25
Schema model of query interfaces (contd)
 All the attributes with their labels in a query interface
are called the schema of the query interface.
 Each attribute also has a name in the HTML code.
The name is attached to a TEXTBOX (which takes
the user input). However,
 this name is not visible to the user.
 It is attached to the input value of the attribute and returned
to the server as the attribute of the input value.
 For practical schema integration, we are concerned
with only the label and name of each attribute and
its domain.
Bing Liu, UIC 26
Interface matching  schema matching
Bing Liu, UIC 28
The interface integration problem
 Identifying synonym attributes in an application domain.
E.g. in the book domain: Author–Writer, Subject–Category
Match Discovery
author name subject category
writer
S2:
writer
title
category
format
S3:
name
title
keyword
binding
S1:
author
title
subject
ISBN
Bing Liu, UIC 29
Schema matching as correlation mining
 It needs a large number of input query interfaces.
 The approach is based on co-occurrences of
schema attributes and the following observations
 Synonym attributes are negatively correlated
 They are semantically alternatives.
 thus, rarely co-occur in query interfaces
 Grouping attributes (they form a bigger concept
together) are positively correlated
 grouping attributes semantically complement
 They often co-occur in query interfaces
 A data mining problem.
Schema matching as correlation
mining
1. Group discovery:
 This step mines co-occurring or positively correlated
attribute groups.
 k-attribute group g is in Lk if every (k-1)-attribute
sub-group of g is in Lk-1
 A 2-attribute group {a, b} is considered positively
correlated if cp(a, b) is greater than a threshold value
, where cp is a positive correlation measure
 Example:
 Let L2 = {{a, b}, {b, c}, {a, c}, {c, d}, {d, f}}, which contains
all 2-attribute groups that are discovered from the data.
 {a, b, c} is in L3, but {a, c, d} is not because {a, d} is not in
L2.
Bing Liu, UIC 30
Schema matching as correlation
mining
2. Match discovery:
 This step mines negatively correlated groups
 including those singleton groups.
 A 2-attribute group {a, b} is considered
negatively correlated if cn(a, b) is greater than
a threshold value, where cn is a negative
correlation measure.
Bing Liu, UIC 31
Schema matching as correlation
mining
3. Matching selection:
 Choose the most confident and consistent
matches and to remove possibly false ones
 a score function is needed, which is defined
as the maximum negative correlation values
of all 2-attribute groups in the match.
Bing Liu, UIC 32
Bing Liu, UIC 33
A clustering approach
Interfaces:
 Similarity measures
 linguistic similarity
 domain similarity
 1:1 match using clustering.
 Clustering algorithm: Agglomerative hierarchical
clustering.
 Each cluster contains a set of candidate matches. E.g.,
final clusters: {{a1,b1,c1}, {b2,c2},{a2},{b3}}
A clustering approach
1. Pre-processing the data.
2. Computing all pair-wise attribute similarities of u
( Si) and v ( Sj), i  j. This produces a similarity
matrix.
3. Identify initial 1:m matches. (part of type, is a
type)
4. Cluster schema elements based on the
similarity matrix. This step
discovers 1:1 matches.
5. Generate the final 1:m matches of attributes.
Bing Liu, UIC 34
thank you
40

More Related Content

Similar to Lec2_Information Integration.ppt

Coates p: the use of genetic programing in exploring 3 d design worlds
Coates p: the use of genetic programing in exploring 3 d design worldsCoates p: the use of genetic programing in exploring 3 d design worlds
Coates p: the use of genetic programing in exploring 3 d design worlds
ArchiLab 7
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
IJCSEIT Journal
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
DataminingTools Inc
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
guest0edcaf
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
Datamining Tools
 
DATA BASE MANAGEMENT SYSTEM - SHORT NOTES
DATA BASE MANAGEMENT SYSTEM - SHORT NOTESDATA BASE MANAGEMENT SYSTEM - SHORT NOTES
DATA BASE MANAGEMENT SYSTEM - SHORT NOTES
suthi
 
Sq lite module3
Sq lite module3Sq lite module3
Sq lite module3
Highervista
 
Fractal analysis of good programming style
Fractal analysis of good programming styleFractal analysis of good programming style
Fractal analysis of good programming style
csandit
 
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLEFRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
cscpconf
 
Rdbms
RdbmsRdbms
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGFEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
IJCI JOURNAL
 
Two Layered HMMs for Search Interface Segmentation
Two Layered HMMs for Search Interface SegmentationTwo Layered HMMs for Search Interface Segmentation
Two Layered HMMs for Search Interface Segmentation
The Children's Hospital of Philadelphia
 
Ijetcas14 347
Ijetcas14 347Ijetcas14 347
Ijetcas14 347
Iasir Journals
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Model
oudesign
 
Cross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentCross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignment
lau
 
Bca3020– data base management system(dbms)
Bca3020– data base management system(dbms)Bca3020– data base management system(dbms)
Bca3020– data base management system(dbms)
smumbahelp
 
2 rel-algebra
2 rel-algebra2 rel-algebra
2 rel-algebra
Mahesh Jeedimalla
 
Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013
Prosanta Ghosh
 
Image-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data IntegrationImage-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data Integration
IJwest
 
G1803054653
G1803054653G1803054653
G1803054653
IOSR Journals
 

Similar to Lec2_Information Integration.ppt (20)

Coates p: the use of genetic programing in exploring 3 d design worlds
Coates p: the use of genetic programing in exploring 3 d design worldsCoates p: the use of genetic programing in exploring 3 d design worlds
Coates p: the use of genetic programing in exploring 3 d design worlds
 
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESEFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASES
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
DATA BASE MANAGEMENT SYSTEM - SHORT NOTES
DATA BASE MANAGEMENT SYSTEM - SHORT NOTESDATA BASE MANAGEMENT SYSTEM - SHORT NOTES
DATA BASE MANAGEMENT SYSTEM - SHORT NOTES
 
Sq lite module3
Sq lite module3Sq lite module3
Sq lite module3
 
Fractal analysis of good programming style
Fractal analysis of good programming styleFractal analysis of good programming style
Fractal analysis of good programming style
 
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLEFRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
FRACTAL ANALYSIS OF GOOD PROGRAMMING STYLE
 
Rdbms
RdbmsRdbms
Rdbms
 
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGFEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
 
Two Layered HMMs for Search Interface Segmentation
Two Layered HMMs for Search Interface SegmentationTwo Layered HMMs for Search Interface Segmentation
Two Layered HMMs for Search Interface Segmentation
 
Ijetcas14 347
Ijetcas14 347Ijetcas14 347
Ijetcas14 347
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Model
 
Cross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignmentCross domain sentiment classification via spectral feature alignment
Cross domain sentiment classification via spectral feature alignment
 
Bca3020– data base management system(dbms)
Bca3020– data base management system(dbms)Bca3020– data base management system(dbms)
Bca3020– data base management system(dbms)
 
2 rel-algebra
2 rel-algebra2 rel-algebra
2 rel-algebra
 
Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013
 
Image-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data IntegrationImage-Based Literal Node Matching for Linked Data Integration
Image-Based Literal Node Matching for Linked Data Integration
 
G1803054653
G1803054653G1803054653
G1803054653
 

More from NaglaaFathy42

reverse engineering.ppt
reverse engineering.pptreverse engineering.ppt
reverse engineering.ppt
NaglaaFathy42
 
introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptx
NaglaaFathy42
 
introduction to web engineering.pdf
introduction to web engineering.pdfintroduction to web engineering.pdf
introduction to web engineering.pdf
NaglaaFathy42
 
understanding computers.ppt
understanding computers.pptunderstanding computers.ppt
understanding computers.ppt
NaglaaFathy42
 
semantic integration.ppt
semantic integration.pptsemantic integration.ppt
semantic integration.ppt
NaglaaFathy42
 
semantic web tech.ppt
semantic web tech.pptsemantic web tech.ppt
semantic web tech.ppt
NaglaaFathy42
 
Bioinformatic_Databases_2.ppt
Bioinformatic_Databases_2.pptBioinformatic_Databases_2.ppt
Bioinformatic_Databases_2.ppt
NaglaaFathy42
 
Web Mining .ppt
Web Mining .pptWeb Mining .ppt
Web Mining .ppt
NaglaaFathy42
 
ch5-georeferencing.ppt
ch5-georeferencing.pptch5-georeferencing.ppt
ch5-georeferencing.ppt
NaglaaFathy42
 
intro to gis
intro to gisintro to gis
intro to gis
NaglaaFathy42
 

More from NaglaaFathy42 (10)

reverse engineering.ppt
reverse engineering.pptreverse engineering.ppt
reverse engineering.ppt
 
introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptx
 
introduction to web engineering.pdf
introduction to web engineering.pdfintroduction to web engineering.pdf
introduction to web engineering.pdf
 
understanding computers.ppt
understanding computers.pptunderstanding computers.ppt
understanding computers.ppt
 
semantic integration.ppt
semantic integration.pptsemantic integration.ppt
semantic integration.ppt
 
semantic web tech.ppt
semantic web tech.pptsemantic web tech.ppt
semantic web tech.ppt
 
Bioinformatic_Databases_2.ppt
Bioinformatic_Databases_2.pptBioinformatic_Databases_2.ppt
Bioinformatic_Databases_2.ppt
 
Web Mining .ppt
Web Mining .pptWeb Mining .ppt
Web Mining .ppt
 
ch5-georeferencing.ppt
ch5-georeferencing.pptch5-georeferencing.ppt
ch5-georeferencing.ppt
 
intro to gis
intro to gisintro to gis
intro to gis
 

Recently uploaded

Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
AnujaGaikwad28
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
SamanArshad11
 
Water bodies of India - Shubham 8 b.pptx
Water bodies of India - Shubham 8 b.pptxWater bodies of India - Shubham 8 b.pptx
Water bodies of India - Shubham 8 b.pptx
manshinain08
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
lenjisoHussein
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
45unexpected
 
PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.
arash8484
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
rightmanforbloodline
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
weiwchu
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
Grant McAlister
 
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
sheetal singh$A17
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
sheetal singh$A17
 
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
sheetal singh$A17
 
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
shardda patel
 
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
Ak47
 
Cyber Insurance Mathematical Model & Pricing 2
Cyber Insurance Mathematical Model & Pricing 2Cyber Insurance Mathematical Model & Pricing 2
Cyber Insurance Mathematical Model & Pricing 2
BaraDaniel1
 
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
arti singh$A17
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
SomalyEng
 
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
sukaniyasunnu
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
erynsouthern
 

Recently uploaded (20)

Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
 
Water bodies of India - Shubham 8 b.pptx
Water bodies of India - Shubham 8 b.pptxWater bodies of India - Shubham 8 b.pptx
Water bodies of India - Shubham 8 b.pptx
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
 
PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.PTT of AI Bots, Avatar, business continuity software.
PTT of AI Bots, Avatar, business continuity software.
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
 
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
Premium Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service A...
 
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
High Profile Girls Call Mumbai 9910780858 Provide Best And Top Girl Service A...
 
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
VVIP Girls Call Noida 9873940964 Provide Best And Top Girl Service And No1 in...
 
Cyber Insurance Mathematical Model & Pricing 2
Cyber Insurance Mathematical Model & Pricing 2Cyber Insurance Mathematical Model & Pricing 2
Cyber Insurance Mathematical Model & Pricing 2
 
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
 
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
VIP Kolkata Girls Call Kolkata 0X0000000X Doorstep High-Profile Girl Service ...
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
 

Lec2_Information Integration.ppt

  • 1. Advanced IS Design Lecture 2 Information Integration
  • 2. Bing Liu, UIC 2 Introduction  Much of the Web information integration research has been focused on the integration of Web query interfaces.
  • 3. Bing Liu, UIC 3 Introduction  In this part, we introduce  some basic integration techniques, and  Web query interface integration
  • 4. Bing Liu, UIC 4 Database integration  Information integration started with database integration.  Fundamental problem: schema matching, which takes two (or more) database schemas to produce a mapping between elements (or attributes) of the two (or more) schemas that correspond semantically to each other.  Objective: merge the schemas into a single global schema.
  • 5. Bing Liu, UIC 5 Integrating two schemas  Consider two schemas, S1 and S2, representing two customer relations, Cust and Customer. S1 S2 Cust Customer CNo CustID CompName Company FirstName Contact LastName Phone
  • 6. Bing Liu, UIC 6 Integrating two schemas (contd)  Represent the mapping with a similarity relation, , over the power sets of S1 and S2, where each pair in  represents one element of the mapping. E.g., Cust.CNo  Customer.CustID Cust.CompName  Customer.Company {Cust.FirstName, Cust.LastName}  Customer.Contact
  • 7. Bing Liu, UIC 7 Different types of matching  Schema-level only matching: only schema information is considered.  Domain and instance-level only matching: some instance data (data records) and possibly the domain of each attribute are used. This case is quite common on the Web.  Integrated matching of schema, domain and instance data: Both schema and instance data (possibly domain information) are available.
  • 8. Bing Liu, UIC 8 Pre-processing for integration  Tokenization: break an item into atomic words using a dictionary, e.g.,  Break “fromCity” into “from” and “city”  Break “first-name” into “first” and “name”  Expansion: expand abbreviations and acronyms to their full words, e.g.,  From “dept” to “departure”  Stopword removal and stemming  Standardization of words: Irregular words are standardized to a single form, e.g.,  From “colour” to “color”
  • 9. Bing Liu, UIC 9 Schema-level matching  Schema level matching algorithm relies on information such as name, description, data type, relationship type (e.g., part-of, is-a, etc), constraints, etc.  Match cardinality:  1:1 match: one element in one schema matches one element of another schema.  1:m match: one element in one schema matches m elements of another schema.  m:n match: m elements in one schema matches n elements of another schema.
  • 10. Bing Liu, UIC 10 An example  m:1 match is similar to 1:m match. m:n match is complex, and there is little work on it.
  • 11. Bing Liu, UIC 11 1. Linguistic approaches  They are used to derive match candidates based on names, comments or descriptions of schema elements:  Name match:  Equality of names  Synonyms  Equality of hypernyms: A is a hypernym of B if B is a kind-of A.  Common sub-strings  Cosine similarity  User-provided name match: usually a domain dependent match dictionary
  • 12. Bing Liu, UIC 12 Linguistic approaches (contd)  Description match: in many databases, there are comments to schema elements, e.g.,  Cosine similarity from information retrieval (IR) can be used to compare comments after stemming and stopword removal.
  • 13. Bing Liu, UIC 13 Constraint based approaches  Constraints such as data types, value ranges, uniqueness, relationship types, etc.  An equivalent or compatibility table for data types and keys can be provided. E.g.,  string  varchar, and (primiary key)  unique  For structured schemas, hierarchical relationships such as  is-a and part-of may be utilized to help matching.  Note: On the Web, the constraint information is often not available, but some can be inferred based on the domain and instance data.
  • 15. Bing Liu, UIC 15 Domain and instance-level matching  In many applications, some data instances or attribute domains may be available.  Value characteristics are used in matching.  Two different types of domains  Simple domain: each value in the domain has only a single component (the value cannot be decomposed).  Composite domain: each value in the domain contains more than one component.
  • 16. Bing Liu, UIC 16 Match of simple domains  A simple domain can be of any type.  If the data type information is not available (this is often the case on the Web), the instance values can often be used to infer types, e.g.,  Words may be considered as strings  Phone numbers can have a regular expression pattern.  Data type patterns (in regular expressions) can be learnt automatically or defined manually.  E.g., used to identify such types as integer, real, string, month, weekday, date, time, zip code, phone numbers, etc.
  • 17. Bing Liu, UIC 17 Match of simple domains (contd)  Matching methods:  Data types are used as constraints.  For numeric data, value ranges, averages, variances can be computed and utilized.  For categorical data: compare domain values.  For textual data: cosine similarity.  Schema element names as values: A set of values in a schema match a set of attribute names of another schema. E.g.,  In one schema, the attribute color has the domain {yellow, red, blue}, but in another schema, it has the element or attribute names called yellow, red and blue (values are yes and no).
  • 18. Bing Liu, UIC 18 Handling composite domains  A composite domain is usually indicated by its values containing delimiters, e.g.,  punctuation marks (e.g., “-”, “/”, “_”)  White spaces  Etc.  To detect a composite domain, these delimiters can be used. They are also used to split a composite value into simple values.  Match methods for simple domains can then be applied.
  • 19. Bing Liu, UIC ACL-07 19 Combining similarities  Similarities from many match indicators can be combined to find the most accurate candidates.  Given the set of similarity values, sim1(u, v), sim2(u, v), …, simn(u, v), from comparing two schema elements u (from S1) and v (from S2), many combination methods can be used:  Max:  Weighted sum:  Weighted average:  Machine learning: E.g., each similarity as a feature.  Many others.
  • 20. Bing Liu, UIC 20 1:m match: two types  Part-of type: each relevant schema element on the many side is a part of the element on the one side. E.g.,  “Street”, “city”, and “state” in a schema are parts of “address” in another schema.  Is-a type: each relevant element on the many side is a specialization of the schema element on the one side. E.g.,  “NoOfAdults” and “NoOfChildren” in one schema are specializations of “NoOfPassengers” in another schema.  Special methods are needed to identify these types
  • 21. Bing Liu, UIC 21 Web information integration  Many integration tasks,  Integrating Web query interfaces (search forms)  Integrating ontologies (taxonomy)  Integrating extracted data  …  We only introduce query interface integration as it has been studied extensively.  Many web sites provide forms (called query interfaces) to query their underlying databases (often called the deep web as opposed to the surface Web that can be browsed).
  • 22. Bing Liu, UIC 22 Global Query Interface united.com airtravel.com delta.com hotwire.com
  • 23. Bing Liu, UIC 24 Schema model of query interfaces  In each domain, there is a set of essential concepts C = {c1, c2, …, cn}, used in query interfaces to enable the user to restrict the search.  Each concept is often represented with a single attribute.  Each attribute is labeled with a word or phrase, called the label of the attribute, which is visible to the user.  Each attribute may also have a set of possible values, its domain.
  • 24. Bing Liu, UIC 25 Schema model of query interfaces (contd)  All the attributes with their labels in a query interface are called the schema of the query interface.  Each attribute also has a name in the HTML code. The name is attached to a TEXTBOX (which takes the user input). However,  this name is not visible to the user.  It is attached to the input value of the attribute and returned to the server as the attribute of the input value.  For practical schema integration, we are concerned with only the label and name of each attribute and its domain.
  • 25. Bing Liu, UIC 26 Interface matching  schema matching
  • 26. Bing Liu, UIC 28 The interface integration problem  Identifying synonym attributes in an application domain. E.g. in the book domain: Author–Writer, Subject–Category Match Discovery author name subject category writer S2: writer title category format S3: name title keyword binding S1: author title subject ISBN
  • 27. Bing Liu, UIC 29 Schema matching as correlation mining  It needs a large number of input query interfaces.  The approach is based on co-occurrences of schema attributes and the following observations  Synonym attributes are negatively correlated  They are semantically alternatives.  thus, rarely co-occur in query interfaces  Grouping attributes (they form a bigger concept together) are positively correlated  grouping attributes semantically complement  They often co-occur in query interfaces  A data mining problem.
  • 28. Schema matching as correlation mining 1. Group discovery:  This step mines co-occurring or positively correlated attribute groups.  k-attribute group g is in Lk if every (k-1)-attribute sub-group of g is in Lk-1  A 2-attribute group {a, b} is considered positively correlated if cp(a, b) is greater than a threshold value , where cp is a positive correlation measure  Example:  Let L2 = {{a, b}, {b, c}, {a, c}, {c, d}, {d, f}}, which contains all 2-attribute groups that are discovered from the data.  {a, b, c} is in L3, but {a, c, d} is not because {a, d} is not in L2. Bing Liu, UIC 30
  • 29. Schema matching as correlation mining 2. Match discovery:  This step mines negatively correlated groups  including those singleton groups.  A 2-attribute group {a, b} is considered negatively correlated if cn(a, b) is greater than a threshold value, where cn is a negative correlation measure. Bing Liu, UIC 31
  • 30. Schema matching as correlation mining 3. Matching selection:  Choose the most confident and consistent matches and to remove possibly false ones  a score function is needed, which is defined as the maximum negative correlation values of all 2-attribute groups in the match. Bing Liu, UIC 32
  • 31. Bing Liu, UIC 33 A clustering approach Interfaces:  Similarity measures  linguistic similarity  domain similarity  1:1 match using clustering.  Clustering algorithm: Agglomerative hierarchical clustering.  Each cluster contains a set of candidate matches. E.g., final clusters: {{a1,b1,c1}, {b2,c2},{a2},{b3}}
  • 32. A clustering approach 1. Pre-processing the data. 2. Computing all pair-wise attribute similarities of u ( Si) and v ( Sj), i  j. This produces a similarity matrix. 3. Identify initial 1:m matches. (part of type, is a type) 4. Cluster schema elements based on the similarity matrix. This step discovers 1:1 matches. 5. Generate the final 1:m matches of attributes. Bing Liu, UIC 34