Semantic integration of data in database systems and ontologies
Ing. Petra Šeflová
Technical University of Liberec
Faculty of Mechatronics
2
Integration of data
- merging a set of given schemas into a global schema
Semantic integration
- part of the broader concept of data integration
- focuses on data exchange between applications with respect to their meaning, content and required business rules
3
Example
Integration of data
[Figure: the user query "Find houses with four bathrooms and price under $500,000" is posed against a mediated schema; three wrappers connect the mediated schema to the source schemas of homeseekers.com, realestate.com and greathomes.com.]
A data integration system in the real estate domain.
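The architecture in the figure can be sketched in Python as follows. This is only an illustration under stated assumptions: the raw field names (`baths`, `cost`, `list_price`, ...), the sample listings and the helper `make_wrapper` are hypothetical and do not come from the three sites. Each wrapper renames one source's fields to a small mediated schema, and the mediator answers the query over the union of the wrapped results.

```python
from typing import Callable, Dict, List

# Mediated schema (assumed): {"bathrooms": int, "price": int, "source": str}
Listing = Dict[str, object]

# Hypothetical raw records; every source uses its own field names.
HOMESEEKERS = [{"baths": 4, "cost": 450000}, {"baths": 2, "cost": 300000}]
REALESTATE = [{"bathrooms": 4, "list_price": 520000}]
GREATHOMES = [{"num_bath": 4, "price_usd": 499000}]


def make_wrapper(source: str, rows: List[dict],
                 bath_key: str, price_key: str) -> Callable[[], List[Listing]]:
    """Build a wrapper that translates one source into the mediated schema."""
    def wrapper() -> List[Listing]:
        return [
            {"bathrooms": r[bath_key], "price": r[price_key], "source": source}
            for r in rows
        ]
    return wrapper


WRAPPERS = [
    make_wrapper("homeseekers.com", HOMESEEKERS, "baths", "cost"),
    make_wrapper("realestate.com", REALESTATE, "bathrooms", "list_price"),
    make_wrapper("greathomes.com", GREATHOMES, "num_bath", "price_usd"),
]


def find_houses(bathrooms: int, max_price: int) -> List[Listing]:
    """Answer a query posed against the mediated schema."""
    results = []
    for wrapper in WRAPPERS:
        for listing in wrapper():
            if listing["bathrooms"] == bathrooms and listing["price"] < max_price:
                results.append(listing)
    return results
```

Calling `find_houses(bathrooms=4, max_price=500_000)` corresponds to the query in the figure. The user never sees the source-specific field names.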
4
Applications
• Catalog integration in B2B applications
• E-commerce
• Bioinformatics
• P2P databases
• Agent communication
• Web services integration
5
Key commonalities of semantic integration applications
• Use structured representations (e.g. relational schemas and XML DTDs)
• Must resolve heterogeneities with respect to the schemas and their data
• Enable their manipulation
– Merging the schemas
– Computing differences
• Enable translation of data and queries across the schemas/ontologies
6
• Database schema
– Defines the physical layout of the system (database)
• Ontology
– A system of knowledge about the world
• Makes no claim to coherence (there are many partial ontologies)
• Frequently an artefact created for a specific purpose
– Definition of Gruber: an ontology is a formal, explicit specification of a shared conceptualization.
7
Problems of semantic integration
• The semantics of elements can be inferred from only a few information sources
– Creators of the data
– Documentation
– Associated schema and data
• Schema elements are typically matched based on clues in the schema and data
• Schema and data clues are often incomplete
• Matching is often subjective, depending on the application
8
Matching process
• Takes as input two schemas/ontologies, each consisting of a set of discrete entities, and determines as output the relationships holding between these entities
9
Schema S
Houses
Location      Price ($)   Agent-id
Atlanta, GA   360,000     32
Raleigh, NC   430,000     15
Agents
Id   Name         City      State
32   Mike Brown   Atlanta   GA
15   Jean Laup    Raleigh   NC
Schema T
Area          List-price   Agent-address   Agent-name
Denver, CO    550,000      Boulder, CO     Laura Smith
Atlanta, GA   370,800      Athens, GA      Mike Brown
Example: The schemas of two relational databases S and T on house listings, and the semantic correspondences between them
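A small Python sketch of how the output of matching, a set of correspondences, might be used to translate data from S to T. The sample rows come from the example (with city names spelled consistently). The correspondences themselves are assumptions made for illustration, since the arrows of the original figure are not recoverable here; in particular, mapping `price` directly to `list_price` and building `agent_address` from city and state are guesses, not the author's stated mapping.

```python
# Sample rows from schema S.
HOUSES = [
    {"location": "Atlanta, GA", "price": 360000, "agent_id": 32},
    {"location": "Raleigh, NC", "price": 430000, "agent_id": 15},
]
AGENTS = [
    {"id": 32, "name": "Mike Brown", "city": "Atlanta", "state": "GA"},
    {"id": 15, "name": "Jean Laup", "city": "Raleigh", "state": "NC"},
]

# Assumed correspondences: attribute of T -> expression over a house row h
# and the agent row a joined to it via agent_id.
CORRESPONDENCES = {
    "area": lambda h, a: h["location"],
    "list_price": lambda h, a: h["price"],
    "agent_name": lambda h, a: a["name"],
    "agent_address": lambda h, a: f'{a["city"]}, {a["state"]}',
}


def translate_s_to_t(houses, agents):
    """Produce rows in the shape of schema T from the tables of schema S."""
    agents_by_id = {a["id"]: a for a in agents}
    rows = []
    for h in houses:
        a = agents_by_id[h["agent_id"]]  # join Houses with Agents
        rows.append({attr: expr(h, a) for attr, expr in CORRESPONDENCES.items()})
    return rows
```

Note that `agent_address` is a complex (many-to-one) correspondence: it combines two attributes of S into one attribute of T.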
10
Matching techniques
Two groups
• Rule-based
• Learning-based
11
Rule-based solutions
• Many early as well as current matching solutions employ hand-crafted rules
• Exploit schema information
– Element names
– Data types
– Structures
– Integrity constraints
• Can provide a quick and concise method to capture valuable user knowledge about the domain
12
Rule-based solutions
• Benefits
– Relatively inexpensive
– Do not require training
– Operate only on the schema
• Drawbacks
– Cannot exploit data instances effectively
– Cannot exploit previous matching efforts
Examples:
• TranScm
• DIKE
• MOMIS
• CUPID
13
• TranScm
– Employs rules such as "two elements match if they have the same name (allowing synonyms) and the same number of subelements"
• DIKE
– Computes the similarity between two schema elements based on the similarity of the elements' characteristics and the similarity of related elements
• MOMIS
– Computes the similarity of schema elements as a weighted sum of the similarities of name, data type and substructure
• CUPID
– Employs rules that categorize elements based on names, data types and domains
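Two of the rules above can be sketched in Python. This is not the actual TranScm or MOMIS code: the `Element` class, the synonym table, the weights and the individual similarity functions are all assumptions chosen to keep the example small. `rule_match` follows the quoted TranScm-style rule, and `weighted_similarity` combines name, data type and substructure similarity as a weighted sum.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List


@dataclass
class Element:
    name: str
    data_type: str
    children: List[str] = field(default_factory=list)


# Hypothetical hand-crafted synonym table.
SYNONYMS = {frozenset({"price", "cost"}), frozenset({"area", "location"})}


def same_name(a: str, b: str) -> bool:
    """True if the names are equal or listed as synonyms."""
    a, b = a.lower(), b.lower()
    return a == b or frozenset({a, b}) in SYNONYMS


def rule_match(x: Element, y: Element) -> bool:
    """Same name (allowing synonyms) and the same number of subelements."""
    return same_name(x.name, y.name) and len(x.children) == len(y.children)


def name_sim(a: str, b: str) -> float:
    if same_name(a, b):
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_sim(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def structure_sim(x: Element, y: Element) -> float:
    """Jaccard overlap of the child names (an assumed substructure measure)."""
    cx, cy = set(x.children), set(y.children)
    if not cx and not cy:
        return 1.0
    return len(cx & cy) / len(cx | cy)


def weighted_similarity(x: Element, y: Element,
                        w_name: float = 0.5,
                        w_type: float = 0.2,
                        w_struct: float = 0.3) -> float:
    """Weighted sum of name, data type and substructure similarity."""
    return (w_name * name_sim(x.name, y.name)
            + w_type * type_sim(x.data_type, y.data_type)
            + w_struct * structure_sim(x, y))
```

Both functions look only at the schema, which illustrates the benefit (no training data needed) and the drawback (data instances are ignored) of rule-based matching.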
14
Learning-based solutions
• Exploit both schema and data information
• Can exploit previous matching efforts
• Examples:
– SemInt system
– LSD system
– iMAP system
– Autoplex
– Automatch
15
• SemInt
– Uses a neural-network learning approach
– Matches schema elements based on attribute specifications and statistics of the data content
• LSD
– Employs Naive Bayes over data instances
– Develops a novel learning solution that exploits the hierarchical nature of XML data
• iMAP
– Matches the schemas of two sources by analyzing the descriptions of objects that are found in both sources
• Autoplex and Automatch
– Use a Naive Bayes learning approach that exploits data instances to match elements
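The idea of Naive Bayes matching over data instances can be illustrated with a minimal pure-Python sketch. It is not the implementation of LSD, Autoplex or Automatch: the class name `NaiveBayesMatcher`, the tokenizer (digits are replaced by `#` so that numeric patterns generalise) and the Laplace smoothing are assumptions. The matcher is trained on instances of already-matched elements and then predicts the element that best explains the values of a new column.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Iterable, List


def tokens(value: str) -> List[str]:
    """Lower-case word tokens; digit runs become runs of '#'."""
    return re.findall(r"[a-z]+|#+", re.sub(r"\d", "#", value.lower()))


class NaiveBayesMatcher:
    def __init__(self) -> None:
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.instance_counts = Counter()          # label -> number of instances
        self.vocabulary = set()

    def train(self, label: str, values: Iterable[str]) -> None:
        for value in values:
            self.instance_counts[label] += 1
            for t in tokens(value):
                self.token_counts[label][t] += 1
                self.vocabulary.add(t)

    def log_score(self, label: str, values: Iterable[str]) -> float:
        """Sum of log P(label) + log P(tokens | label) over all values."""
        total_instances = sum(self.instance_counts.values())
        total_tokens = sum(self.token_counts[label].values())
        vocab_size = len(self.vocabulary) + 1  # one extra slot for unseen tokens
        score = 0.0
        for value in values:
            score += math.log(self.instance_counts[label] / total_instances)
            for t in tokens(value):
                count = self.token_counts[label][t]
                score += math.log((count + 1) / (total_tokens + vocab_size))
        return score

    def predict(self, values: List[str]) -> str:
        """Return the known element whose instances best explain the column."""
        return max(self.instance_counts,
                   key=lambda label: self.log_score(label, values))
```

A possible use: train with `matcher.train("price", ["360,000", "430,000"])` and similar calls for other elements, then call `matcher.predict(column_values)` on a column of a new source. Because the model is learned from earlier matches, previous matching effort is reused.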
16
The matching dimensions
• Input dimensions
• Process dimensions
• Output dimensions
17
Input dimensions
• Concern the kind of input on which the algorithms operate
• First dimension
– Algorithms depend on the data/conceptual model in which the ontologies or schemas are expressed
• Second dimension
– Depends on the kind of data the algorithms exploit
– Different approaches exploit different information from the input data/conceptual models
• Schema-level information
• Instance data
• Both
18
Process dimensions
• The classification of the matching process can be based on its general properties
• It depends on the approximate or exact nature of its computation
– Exact algorithms compute the absolute solution to a problem
– Approximate algorithms sacrifice exactness for performance
• Three large classes, based on intrinsic input, external resources or some semantic theory
– Syntactic
– External
– Semantic
19
Output dimensions
• Concern the form of the result the systems produce
– Is it a one-to-one correspondence?
– Is any relation suitable?
– Does it have to be a final mapping element?
• Does the system deliver a graded answer?
– The correspondence holds with 98% confidence
– The correspondence holds with 4/5 probability
• Or an all-or-nothing answer?
– Some approaches determine correspondences using distance measures
• The kinds of relations between entities a system can provide
– Equivalence
– Subsumption
– Incompatibility
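The output dimensions can be made concrete with a short Python sketch. The `Correspondence` record, the threshold of 0.5 and the greedy one-to-one selection are assumptions for illustration, not something prescribed by the sources. A graded answer keeps the confidence, an all-or-nothing answer thresholds it, and a one-to-one alignment lets each entity take part in at most one correspondence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Relation(Enum):
    EQUIVALENCE = "equivalence"
    SUBSUMPTION = "subsumption"
    INCOMPATIBILITY = "incompatibility"


@dataclass(frozen=True)
class Correspondence:
    source: str
    target: str
    relation: Relation
    confidence: float  # graded answer in [0, 1]


def from_distance(distance: float) -> float:
    """Turn a distance normalised to [0, 1] into a confidence (assumed rule)."""
    return 1.0 - distance


def all_or_nothing(candidates: List[Correspondence],
                   threshold: float = 0.5) -> List[Tuple[str, str, Relation]]:
    """Drop the grade: a correspondence either holds or it does not."""
    return [(c.source, c.target, c.relation)
            for c in candidates if c.confidence >= threshold]


def one_to_one(candidates: List[Correspondence]) -> List[Correspondence]:
    """Greedy selection: best confidence first, each entity used at most once."""
    chosen, used_sources, used_targets = [], set(), set()
    for c in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        if c.source not in used_sources and c.target not in used_targets:
            chosen.append(c)
            used_sources.add(c.source)
            used_targets.add(c.target)
    return chosen
```

The greedy strategy is only one simple way to obtain a one-to-one result; it does not guarantee the alignment with the highest total confidence.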
20
Classification of elementary schema-based matching approaches
[Figure: classification tree of schema-based matching techniques.
Granularity/input interpretation layer: element-level and structure-level techniques, with syntactic, external and semantic interpretations of the input.
Basic techniques layer: string-based, language-based, linguistic resources, constraint-based, alignment reuse, upper-level formal ontologies, graph-based, taxonomy-based, repository of structures, model-based.
Kinds of input: terminological (linguistic), structural (internal, relational) and semantic.]
21
Element-level vs structure-level
• Element-level matching techniques compute mapping elements by analyzing entities in isolation
– Ignoring their relations with other entities
• Structure-level techniques compute mapping elements by analyzing how entities appear together in a structure
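The difference can be shown with a minimal Python sketch. The schema representation (a dictionary from entity name to the names of its sub-entities) and both similarity functions are assumptions for illustration. `element_level` looks only at the two names, while `structure_level` compares the entities through their sub-entities.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Schema = Dict[str, List[str]]  # entity name -> names of its sub-entities


def element_level(a: str, b: str) -> float:
    """Compare two entities in isolation, using only their names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def structure_level(a: str, schema_a: Schema, b: str, schema_b: Schema) -> float:
    """Compare two entities by how well their sub-entities match.

    For every sub-entity of a, take its best element-level score among the
    sub-entities of b, and average these scores.
    """
    kids_a, kids_b = schema_a.get(a, []), schema_b.get(b, [])
    if not kids_a or not kids_b:
        return 0.0
    best = [max(element_level(x, y) for y in kids_b) for x in kids_a]
    return sum(best) / len(best)
```

With two hypothetical schemas `{"Houses": ["price", "location"]}` and `{"Listings": ["price", "location"]}`, the names `Houses` and `Listings` differ, so the element-level score is below 1, while the structure-level score is 1.0 because the sub-entities coincide.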
22
Internal vs external techniques
• Internal
– Exploit information that comes only with the input schemas/ontologies
• Syntactic interpretation of the input
• Semantic interpretation of the input
• External
– Exploit auxiliary (external) resources of a domain to interpret the input
– Resources:
• Human input
• A thesaurus expressing the relationships between terms
23
Schema matching vs ontology matching
Differences
• Database schemas often do not provide explicit semantics for their data
– Semantics is usually specified explicitly at design time, but frequently does not become part of the database specification
– Schema matching is therefore usually performed with techniques that try to guess the meaning encoded in the schemas
• Ontologies are logical systems that themselves obey some formal semantics
– Ontology matching systems primarily try to exploit knowledge explicitly encoded in the ontologies
24
Schema matching vs ontology matching
Commonalities
• Ontologies and schemas are similar in the sense that they:
– Provide a vocabulary of terms that describes a domain of interest
– Constrain the meaning of the terms used in the vocabulary
• Schemas and ontologies are both found in environments such as the Semantic Web
25
Sources:
• Natalya F. Noy: Semantic Integration: A Survey of Ontology-Based Approaches
• AnHai Doan, Alon Y. Halevy: Semantic Integration Research in the Database Community: A Brief Survey
• P. Shvaiko, J. Euzenat: A Survey of Schema-Based Matching Approaches
• G. Antoniou, F. van Harmelen: A Semantic Web Primer
• R. Araújo, H. Sofia Pinto: Toward Semantics-Based Ontology Similarity
• H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann and S. Hübner: Ontology-Based Integration of Information – A Survey of Existing Approaches
• E. Rahm, P. A. Bernstein: A Survey of Approaches to Automatic Schema Matching
