More Related Content
Similar to Pal gov.tutorial2.session13 1.data schema integration (16)
More from Mustafa Jarrar (20)
Pal gov.tutorial2.session13 1.data schema integration
- 1. أكاديمية الحكومة اإللكترونية الفلسطينية
The Palestinian eGovernment Academy
www.egovacademy.ps
Tutorial II: Data Integration and Open Information Systems
Session 13.1
Data Schema Integration
Dr. Mustafa Jarrar
University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
PalGov © 2011 1
- 2. About
This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the
Commission of the European Communities, grant agreement 511159-TEMPUS-1-
2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps
Project Consortium:
Birzeit University, Palestine
University of Trento, Italy
(Coordinator )
Palestine Polytechnic University, Palestine Vrije Universiteit Brussel, Belgium
Palestine Technical University, Palestine
Université de Savoie, France
Ministry of Telecom and IT, Palestine
University of Namur, Belgium
Ministry of Interior, Palestine
TrueTrust, UK
Ministry of Local Government, Palestine
Coordinator:
Dr. Mustafa Jarrar
Birzeit University, P.O.Box 14- Birzeit, Palestine
Telfax:+972 2 2982935 mjarrar@birzeit.eduPalGov © 2011
2
- 3. © Copyright Notes
Everyone is encouraged to use this material, or part of it, but should
properly cite the project (logo and website), and the author of that part.
No part of this tutorial may be reproduced or modified in any form or by
any means, without prior written permission from the project, who have
the full copyrights on the material.
Attribution-NonCommercial-ShareAlike
CC-BY-NC-SA
This license lets others remix, tweak, and build upon your work non-
commercially, as long as they credit you and license their new creations
under the identical terms.
PalGov © 2011 3
- 4. Tutorial Map
Topic h
Intended Learning Objectives
Session 1: XML Basics and Namespaces 3
A: Knowledge and Understanding
Session 2: XML DTD’s 3
2a1: Describe tree and graph data models.
Session 3: XML Schemas 3
2a2: Understand the notation of XML, RDF, RDFS, and OWL.
2a3: Demonstrate knowledge about querying techniques for data Session 4: Lab-XML Schemas 3
models as SPARQL and XPath. Session 5: RDF and RDFs 3
2a4: Explain the concepts of identity management and Linked data. Session 6: Lab-RDF and RDFs 3
2a5: Demonstrate knowledge about Integration &fusion of Session 7: OWL (Ontology Web Language) 3
heterogeneous data. Session 8: Lab-OWL 3
B: Intellectual Skills Session 9: Lab-RDF Stores -Challenges and Solutions 3
2b1: Represent data using tree and graph data models (XML & Session 10: Lab-SPARQL 3
RDF). Session 11: Lab-Oracle Semantic Technology 3
2b2: Describe data semantics using RDFS and OWL. Session 12_1: The problem of Data Integration 1.5
2b3: Manage and query data represented in RDF, XML, OWL. Session 12_2: Architectural Solutions for the Integration Issues 1.5
2b4: Integrate and fuse heterogeneous data. Session 13_1: Data Schema Integration 1
C: Professional and Practical Skills Session 13_2: GAV and LAV Integration 1
2c1: Using Oracle Semantic Technology and/or Virtuoso to store Session 13_3: Data Integration and Fusion using RDF 1
and query RDF stores. Session 14: Lab-Data Integration and Fusion using RDF 3
D: General and Transferable Skills
2d1: Working with team. Session 15_1: Data Web and Linked Data 1.5
2d2: Presenting and defending ideas. Session 15_2: RDFa 1.5
2d3: Use of creativity and innovation in problem solving.
2d4: Develop communication skills and logical reasoning abilities. Session 16: Lab-RDFa 3
PalGov © 2011 4
- 5. Module ILOs
After completing this module students will be able to:
- Integrate heterogeneous information systems by schema integration.
PalGov © 2011 5
- 6. Data Schema Integration: A simple example
In ORM:
bornIn/ locatedIn/
Employee
/WorksIn City Region
locatedIn/
Organization
Employee
Municipality
bornIn/ locatedIn/
/WorksIn
Worker City Region
locatedIn/
Organization Organization
Schema 1 Schema © 2011
PalGov 2 Schema 3 6
- 7. Data Schema Integration: A simple example
Source: Carlo Batini
In ER:
Employee born City in Region
works
Organiza Integrated schema
tion
in
Employee
Munici
pality
Empoloyee born City in Region
works
Organi in
zation
Organiza
tion
Schema 2
Schema 3
Schema 1
PalGov © 2011 7
- 8. Challenges of Data Schema Integration
Source: Carlo Batini
Schema Integration has two major challenges:
1. Identification of all portions of schemas that pertain to the
same concept, in such a way to unify such different
representations in the global schema.
2. Identification, analysis and resolution of the different types
of conflicts (heterogeneities) in different schemas.
PalGov © 2011 8
- 9. A generic framework for Schema Integration
Local
Schemas
Schemas Transformation
Transformation Rules
Schemas Matching
Matching Rules
Schemas Integration
Integration Rules
Integrated Schema
and mappings
Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.),
The MIT Press, 2000 PalGov © 2011 9
- 10. A generic framework for Schema Integration
0. Define the integration strategy
If the number of local schemas to be integrated is large, the order of
schema integration becomes important. Several strategies can be
adopted.
Input: n source schemas
Output: n source schemas + integration strategies
Method used: heuristics
S1 S2 S3 S1 S2 S3 S4 S1 S2 S3 S4
IS1 IS1 IS2
IS2
…
IS
IS IS
One shot strategy Pair at a time strategy Balanced Strategy
- Priority to most relevant and -Example: Production, Marketing,
- Efficient integration process
stable schemas. Sales.
- Many correspondences between
- The integration process is -To be preferred when the
concepts have to be considered
more efficient cohesion among schemas is high.
together.
PalGov © 2011 10
- 11. A generic framework for Schema Integration
Source: Stefano Spaccapietra
1. Schema transformation (or Pre-integration)
Input: n source schemas
Output: n source schemas homogeneized
Methods used: Model and Design Homogeneization
Reduce model heterogeneities as much as possible to make the sources
more suitable for integration.
Goal: use a single, common data model and format.
transformation integration
source DBs homogeneized DBs DW
PalGov © 2011 11
- 12. Schema Transformation
Schema Transformation involves:
• Data model homogeneization
– Where all data sources are described using the same data model.
• Design homogeneization
– Enforce standard design rules to reduce the number of structural
conflicts (e.g., Normalization: one fact in one place)
• Reverse Engineering
– Reverse engineer the schema from existing data (such as COBOL
files, spreadsheets, legacy relational databases, legacy object-
oriented databases).
PalGov © 2011 12
- 13. Example of Design homogeneization
(Normalization)
• ONE TABLE:
R1 (#Student, Name, LastName, #Course, CourseName,
Grade, Date)
• Dependencies:
– #Student Name, LastName
– #Course CourseName
– #Student #Course Grade, Date)
• NORMALIZED INTO 3 TABLES: ONE FACT IN ONE PLACE:
R11 (#Student, Name, LastName)
R12 (#Course, CourseName)
R13 (#Student, #Course, Grade, Date)
PalGov © 2011 13
- 15. 2. Schema matching
(Correspondences investigation)
2. Schema matching (Correspondences investigation)
Input: n source schemas
Output: n source schemas + correspondences
Method used: techniques to discover correspondences
• Correspondences relate (schema) elements which describe the same
phenomena of the real world.
– This step aims at finding and describing all semantic
links between elements of the input schemas and the
corresponding data.
– By doing so, one matches between the schemas to be
integrated.
– This step fixes the conflicts found in the schema.
PalGov © 2011 15
- 16. Semantics of Correspondences
Source: Stefano Spaccapietra
Correspondences relate (schema) elements which
describe the same phenomena of the real world.
PalGov © 2011 16
- 17. Asserting Correspondences
Source: Stefano Spaccapietra
• Finding matching correspondences is done through the use of a rich
language for expressing correspondences (matchings).
• EXAMPLE:
S1.Person S2.Person,
With Corresponding Identifiers: Pin,
With Corresponding Property: name
PalGov © 2011 17
- 18. Automated Matching
• Fully automated matching is considered impossible, as a computer
process can hardly make ultimate decisions about the semantics of
data.
• But even partial assistance in discovering of correspondences (to be
confirmed or guided by humans) is beneficial, due to the complexity of
the task.
• All proposed methods rely on some similarity measures that try to
evaluate the semantic distance between two descriptions.
• Some state of the art matching systems
Cupid (Microsoft Research, USA)
FOAM/QOM (University of Karlsruhe, Germany)
OLA (INRIA Rhône-Alpes, France / Université de Montréal,Canada)
S-Match (University of Trento, Italy)
PalGov © 2011 18
- 20. Examples of Correspondences
Employee
/WorksIn Municipality
locatedIn/
Organization Organization
Schema 1 Schema 3
bornIn/ locatedIn/
Worker City Region
Schema 2
PalGov © 2011 20
- 22. STEP3: Schemas integration and mapping
generation Source: Carlo Batini
3. Schemas integration and mapping generation
Input: n source schemas + correspondences
Output: integrated schema + mapping rules btw the integrated schema and
input source schemas
Method used: New classification of conflicts + Conflict resolution
transformations
GOAL: Creating an Integrated Schema ( IS ) and the
mappings to the local databases.
PalGov © 2011 22
- 23. GAV and LAV Integration
Research has identified two methods to set up mappings between the
integrated schema and the input schemas:
(1) GAV (Global As View): proposes to define the integrated schema
as a view over input schemas.
• GAV is usually considered simpler and more efficient for processing
queries on the integrated database, but is weaker in supporting evolution
of the global system through addition of new sources.
(2) LAV (Local As View): proposes to define the local schemas as
views over the integrated schema.
• LAV generates issues of incomplete information, which adds complexity
in handling global queries, but it better supports dynamic addition and
removal of source.
PalGov © 2011 23
- 24. Integration Process
• After we identified the correspondences (in the previous step), we now
solve the conflicts:
• One can distinguish between four types of conflicts:
– Structural conflicts
– Classification conflicts
– Descriptive conflicts
– Fragmentation conflicts
• Examples of conflicts among related object types
– different classifications (sets of instances)
– different sets of properties
– different structures
– different coding schemes
– …
PalGov © 2011 24
- 25. Integration Rules
• Rules defining the strategy to solve conflicts
• Example rules:
– If an object type corresponds to an attribute, keep the
object type
– If the population of an object type is included in the
population of another object type, build an is-a
hierarchy
• Integration rules depend on how you want the
integrated schema to look like
PalGov © 2011 25
- 26. Structural Conflicts
Source: Stefano Spaccapietra
• Different schema element types, e.g.: class, attribute, relationship
• Library example:
– S1 : Book is a class
S1
– S2 : books is an attribute of Author
• Conflict resolution :
Choose the less constraining structure
– Integrated Schema: Book is a class
S2
PalGov © 2011 26
- 27. Classification Conflicts
• Corresponding elements describe different sets of real
world objects
– S1.Faculty CONTAINS S2.PhD-advisor
• Conflict Resolution:
– Generalization / Specialization hierarchy
S1 Faculty Faculty
S2 Phd-advisor Phd-advisor
– Merging
Faculty
PalGov © 2011 27
- 28. Descriptive Conflicts
• Corresponding types have different properties, or
corresponding properties are described in different ways
• Object / Entity / Relationship type:
– naming conflicts :
• synonyms Node , Extremity
• homonyms Highway (EU) , Highway (USA)
– composition conflicts : different attributes and methods
• Employee ( E# , name , address )
• Employee ( E# , position , salary , department )
PalGov © 2011 28
- 29. Integration Methods: Manual
Source: Stefano Spaccapietra
• First method : manual integration
“ do it yourself ”
a language
mapping
rules
schemas integrated
schema
DBA
Easy to implement , Flexible
BUT
time consuming for the DBA
PalGov © 2011 29
- 30. Integration Methods: Semi-Automatic
Source: Stefano Spaccapietra
• Second method : semi-automatic integration
“ tell me about the problem , I will try to fix it “
correspondences mapping rules
TOOL
schemas integrated
schema
DBA
Opens to visual CASE tools, integration servers
BUT knowledge acquisition can be painful
PalGov © 2011 30
- 31. References
• Carlo Batini: Course on Data Integration. BZU IT Summer School
2011.
• Stefano Spaccapietra: Information Integration. Presentation at the IFIP
Academy. Porto Alegre. 2005.
• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI
International, Artificial Intelligence Center. Menlo Park, USA. 2009.
PalGov © 2011 31