Pal gov.tutorial2.session13 1.data schema integration

Published in: Education, Technology

  1. أكاديمية الحكومة الإلكترونية الفلسطينية (The Palestinian eGovernment Academy), www.egovacademy.ps
     Tutorial II: Data Integration and Open Information Systems
     Session 13.1: Data Schema Integration
     Dr. Mustafa Jarrar, University of Birzeit, mjarrar@birzeit.edu, www.jarrar.info
     PalGov © 2011
  2. About
     This tutorial is part of the PalGov project, funded by the TEMPUS IV program of the Commission of the European Communities, grant agreement 511159-TEMPUS-1-2010-1-PS-TEMPUS-JPHES. The project website: www.egovacademy.ps
     Project Consortium:
     • Birzeit University, Palestine (Coordinator)
     • Palestine Polytechnic University, Palestine
     • Palestine Technical University, Palestine
     • Ministry of Telecom and IT, Palestine
     • Ministry of Interior, Palestine
     • Ministry of Local Government, Palestine
     • University of Trento, Italy
     • Vrije Universiteit Brussel, Belgium
     • Université de Savoie, France
     • University of Namur, Belgium
     • TrueTrust, UK
     Coordinator: Dr. Mustafa Jarrar, Birzeit University, P.O. Box 14, Birzeit, Palestine. Telfax: +972 2 2982935, mjarrar@birzeit.edu
  3. © Copyright Notes
     Everyone is encouraged to use this material, or part of it, but should properly cite the project (logo and website) and the author of that part. No part of this tutorial may be reproduced or modified in any form or by any means without prior written permission from the project, which holds the full copyright on the material.
     Attribution-NonCommercial-ShareAlike (CC-BY-NC-SA): This license lets others remix, tweak, and build upon your work non-commercially, as long as they credit you and license their new creations under the identical terms.
  4. Tutorial Map
     Topic (hours):
     • Session 1: XML Basics and Namespaces (3)
     • Session 2: XML DTD's (3)
     • Session 3: XML Schemas (3)
     • Session 4: Lab-XML Schemas (3)
     • Session 5: RDF and RDFs (3)
     • Session 6: Lab-RDF and RDFs (3)
     • Session 7: OWL (Ontology Web Language) (3)
     • Session 8: Lab-OWL (3)
     • Session 9: Lab-RDF Stores - Challenges and Solutions (3)
     • Session 10: Lab-SPARQL (3)
     • Session 11: Lab-Oracle Semantic Technology (3)
     • Session 12_1: The problem of Data Integration (1.5)
     • Session 12_2: Architectural Solutions for the Integration Issues (1.5)
     • Session 13_1: Data Schema Integration (1)
     • Session 13_2: GAV and LAV Integration (1)
     • Session 13_3: Data Integration and Fusion using RDF (1)
     • Session 14: Lab-Data Integration and Fusion using RDF (3)
     • Session 15_1: Data Web and Linked Data (1.5)
     • Session 15_2: RDFa (1.5)
     • Session 16: Lab-RDFa (3)
     Intended Learning Objectives:
     • A: Knowledge and Understanding
       – 2a1: Describe tree and graph data models.
       – 2a2: Understand the notation of XML, RDF, RDFS, and OWL.
       – 2a3: Demonstrate knowledge about querying techniques for data models such as SPARQL and XPath.
       – 2a4: Explain the concepts of identity management and Linked data.
       – 2a5: Demonstrate knowledge about integration and fusion of heterogeneous data.
     • B: Intellectual Skills
       – 2b1: Represent data using tree and graph data models (XML & RDF).
       – 2b2: Describe data semantics using RDFS and OWL.
       – 2b3: Manage and query data represented in RDF, XML, OWL.
       – 2b4: Integrate and fuse heterogeneous data.
     • C: Professional and Practical Skills
       – 2c1: Using Oracle Semantic Technology and/or Virtuoso to store and query RDF stores.
     • D: General and Transferable Skills
       – 2d1: Working with team.
       – 2d2: Presenting and defending ideas.
       – 2d3: Use of creativity and innovation in problem solving.
       – 2d4: Develop communication skills and logical reasoning abilities.
  5. Module ILOs
     After completing this module, students will be able to:
     – Integrate heterogeneous information systems by schema integration.
  6. Data Schema Integration: A simple example (in ORM)
     [Figure: ORM diagrams of three source schemas (Schema 1, Schema 2, Schema 3) and the integrated schema. Object types shown: Employee, Worker, City, Region, Organization, Municipality. Roles shown: bornIn, locatedIn, WorksIn.]
  7. Data Schema Integration: A simple example (in ER) (Source: Carlo Batini)
     [Figure: ER diagrams of three source schemas (Schema 1, Schema 2, Schema 3) and the integrated schema. Entities shown: Employee, City, Region, Organization, Municipality. Relationships shown: born in, works in.]
  8. Challenges of Data Schema Integration (Source: Carlo Batini)
     Schema integration has two major challenges:
     1. Identifying all portions of the schemas that pertain to the same concept, so that these different representations can be unified in the global schema.
     2. Identifying, analyzing, and resolving the different types of conflicts (heterogeneities) among the schemas.
  9. A generic framework for Schema Integration
     Local Schemas → Schemas Transformation (Transformation Rules) → Schemas Matching (Matching Rules) → Schemas Integration (Integration Rules) → Integrated Schema and mappings
     Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000.
  10. A generic framework for Schema Integration
      Step 0: Define the integration strategy
      If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted.
      Input: n source schemas
      Output: n source schemas + integration strategies
      Method used: heuristics
      [Figure: three integration trees combining source schemas S1, S2, S3, S4 through intermediate schemas IS1, IS2, … into the integrated schema IS.]
      • One shot strategy
        – Efficient integration process.
        – Many correspondences between concepts have to be considered together.
      • Pair at a time strategy
        – Priority to the most relevant and stable schemas.
        – The integration process is more efficient.
      • Balanced strategy
        – Example: Production, Marketing, Sales.
        – To be preferred when the cohesion among schemas is high.
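The difference between the strategies is easiest to see as an integration order. The following is a minimal sketch, assuming that "pair at a time" folds one new source schema into the running intermediate schema at each step and that "balanced" merges schemas pairwise, level by level. The function names and the `IS1`, `IS2` naming are illustrative, not part of the tutorial.

```python
def one_shot(schemas):
    """Integrate every source schema in a single step."""
    return [(tuple(schemas), "IS")]


def pair_at_a_time(schemas):
    """Fold the schemas in one by one: (S1, S2) -> IS1, (IS1, S3) -> IS2, ..."""
    steps = []
    current = schemas[0]
    for i, schema in enumerate(schemas[1:], start=1):
        # The last merge produces the final integrated schema "IS".
        name = "IS" if i == len(schemas) - 1 else f"IS{i}"
        steps.append(((current, schema), name))
        current = name
    return steps


def balanced(schemas):
    """Merge neighbours pairwise, then merge the results, level by level."""
    steps = []
    level = list(schemas)
    counter = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            counter += 1
            name = "IS" if len(level) == 2 else f"IS{counter}"
            steps.append(((level[i], level[i + 1]), name))
            next_level.append(name)
        if len(level) % 2:  # an unpaired schema moves up unchanged
            next_level.append(level[-1])
        level = next_level
    return steps


sources = ["S1", "S2", "S3", "S4"]
ladder_plan = pair_at_a_time(sources)
balanced_plan = balanced(sources)
```

Both plans need three binary merges for four schemas. They differ in which schemas meet early, which is why the choice depends on relevance, stability, and cohesion among the sources.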
  11. A generic framework for Schema Integration (Source: Stefano Spaccapietra)
      Step 1: Schema transformation (or Pre-integration)
      Input: n source schemas
      Output: n homogenized source schemas
      Methods used: model and design homogenization
      Reduce model heterogeneities as much as possible to make the sources more suitable for integration.
      Goal: use a single, common data model and format.
      [Figure: source DBs → transformation → homogenized DBs → integration → DW]
  12. Schema Transformation
      Schema transformation involves:
      • Data model homogenization
        – All data sources are described using the same data model.
      • Design homogenization
        – Enforce standard design rules to reduce the number of structural conflicts (e.g., normalization: one fact in one place).
      • Reverse engineering
        – Reverse engineer the schema from existing data (such as COBOL files, spreadsheets, legacy relational databases, legacy object-oriented databases).
  13. Example of Design Homogenization (Normalization)
      • One table: R1 (#Student, Name, LastName, #Course, CourseName, Grade, Date)
      • Dependencies:
        – #Student → Name, LastName
        – #Course → CourseName
        – #Student, #Course → Grade, Date
      • Normalized into 3 tables (one fact in one place):
        – R11 (#Student, Name, LastName)
        – R12 (#Course, CourseName)
        – R13 (#Student, #Course, Grade, Date)
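The decomposition above can be checked on a small instance. This is a minimal sketch: the sample rows are invented for illustration, and `project` is a hypothetical helper that takes a duplicate-free projection. Joining R11, R12, and R13 back together should reproduce R1 exactly, which shows that no information is lost.

```python
COLUMNS = ["student", "name", "last_name", "course", "course_name", "grade", "date"]

# Invented sample rows for R1 (not from the tutorial).
R1 = [
    (1, "Lina", "Saleh", "C1", "Databases", 90, "2011-06-01"),
    (1, "Lina", "Saleh", "C2", "XML", 85, "2011-06-02"),
    (2, "Omar", "Nasser", "C1", "Databases", 78, "2011-06-01"),
]


def project(rows, attrs):
    """Project rows onto the given attributes, dropping duplicate tuples."""
    idx = [COLUMNS.index(a) for a in attrs]
    result = []
    for row in rows:
        projected = tuple(row[i] for i in idx)
        if projected not in result:
            result.append(projected)
    return result


r11 = project(R1, ["student", "name", "last_name"])       # one row per student
r12 = project(R1, ["course", "course_name"])              # one row per course
r13 = project(R1, ["student", "course", "grade", "date"]) # one row per enrolment

# Rejoin the three tables on their keys and compare with the original.
students = {s: (n, l) for s, n, l in r11}
courses = dict(r12)
rejoined = {(s, *students[s], c, courses[c], g, d) for s, c, g, d in r13}
```

Each student name and each course name is now stored once, so an update touches a single row.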
  14. Example of Reverse Engineering (Source: Stefano Spaccapietra)
  15. Step 2: Schema matching (Correspondences investigation)
      Input: n source schemas
      Output: n source schemas + correspondences
      Method used: techniques to discover correspondences
      • Correspondences relate (schema) elements which describe the same phenomena of the real world.
        – This step aims at finding and describing all semantic links between elements of the input schemas and the corresponding data.
        – By doing so, one matches the schemas to be integrated.
        – This step identifies the conflicts between the schemas; they are resolved in the integration step.
  16. Semantics of Correspondences (Source: Stefano Spaccapietra)
      Correspondences relate (schema) elements which describe the same phenomena of the real world.
  17. Asserting Correspondences (Source: Stefano Spaccapietra)
      • Correspondences (matchings) are asserted in a rich language designed for expressing them.
      • Example:
        S1.Person ≡ S2.Person,
        With Corresponding Identifiers: Pin,
        With Corresponding Property: name
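A correspondence assertion like the one on this slide can be held in a small record. The sketch below is an illustration only: the class, its field names, and the `"equivalent"` relation label are assumptions, not the actual correspondence language used in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One asserted correspondence between two schema elements (hypothetical structure)."""
    left: str                      # element of the first schema, e.g. "S1.Person"
    right: str                     # element of the second schema, e.g. "S2.Person"
    relation: str = "equivalent"   # assumed label for the kind of correspondence
    identifiers: tuple = ()        # corresponding identifiers
    properties: tuple = ()         # corresponding properties

    def describe(self):
        text = f"{self.left} {self.relation} {self.right}"
        if self.identifiers:
            text += "; identifiers: " + ", ".join(self.identifiers)
        if self.properties:
            text += "; properties: " + ", ".join(self.properties)
        return text


person = Correspondence(
    left="S1.Person",
    right="S2.Person",
    identifiers=("Pin",),
    properties=("name",),
)
```

Keeping assertions as data lets the later integration step iterate over them when applying integration rules.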
  18. Automated Matching
      • Fully automated matching is considered impossible, as a computer process can hardly make ultimate decisions about the semantics of data.
      • But even partial assistance in discovering correspondences (to be confirmed or guided by humans) is beneficial, due to the complexity of the task.
      • All proposed methods rely on some similarity measures that try to evaluate the semantic distance between two descriptions.
      • Some state-of-the-art matching systems:
        – Cupid (Microsoft Research, USA)
        – FOAM/QOM (University of Karlsruhe, Germany)
        – OLA (INRIA Rhône-Alpes, France / Université de Montréal, Canada)
        – S-Match (University of Trento, Italy)
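As a rough illustration of a similarity measure, the sketch below proposes candidate correspondences from name similarity alone, using `difflib.SequenceMatcher` from the Python standard library. It is not how any of the systems listed above works. The element names and the 0.8 threshold are assumptions chosen for the example, and every candidate is meant to be confirmed by a human.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(schema_a, schema_b, threshold=0.8):
    """Return (a, b, score) for every cross-schema pair scoring at or above the threshold."""
    candidates = []
    for a in schema_a:
        for b in schema_b:
            score = name_similarity(a, b)
            if score >= threshold:
                candidates.append((a, b, score))
    return candidates


# Hypothetical element names from two source schemas.
schema_a = ["Employee", "Organization", "City"]
schema_b = ["Employe", "Organisation", "Region"]

proposals = candidate_matches(schema_a, schema_b)
proposed_pairs = {(a, b) for a, b, _ in proposals}
```

A purely lexical measure like this one cannot detect synonyms such as Worker and Employee, which is one reason human confirmation and richer semantic measures remain necessary.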
  19. Examples of Correspondences (Source: Stefano Spaccapietra)
  20. Examples of Correspondences
      [Figure: correspondences among the three ORM source schemas. Schema 1: Employee, WorksIn, Organization. Schema 2: Worker, bornIn, City, locatedIn, Region. Schema 3: Municipality, locatedIn, Organization.]
  21. Examples of Correspondences (Source: Stefano Spaccapietra)
  22. Step 3: Schemas integration and mapping generation (Source: Carlo Batini)
      Input: n source schemas + correspondences
      Output: integrated schema + mapping rules between the integrated schema and the input source schemas
      Method used: new classification of conflicts + conflict resolution transformations
      Goal: creating an Integrated Schema (IS) and the mappings to the local databases.
  23. GAV and LAV Integration
      Research has identified two methods to set up mappings between the integrated schema and the input schemas:
      (1) GAV (Global As View): defines the integrated schema as a view over the input schemas.
        • GAV is usually considered simpler and more efficient for processing queries on the integrated database, but it is weaker in supporting evolution of the global system through the addition of new sources.
      (2) LAV (Local As View): defines the local schemas as views over the integrated schema.
        • LAV generates issues of incomplete information, which adds complexity in handling global queries, but it better supports dynamic addition and removal of sources.
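The GAV idea can be sketched in a few lines: the global relation is written as a view (here, a Python generator) over the sources, so a global query is answered by simply evaluating the view. The source tables, attribute names, and sample rows below are invented for illustration.

```python
# Two hypothetical sources with different local schemas.
source1_employee = [{"name": "Ali", "organization": "Ministry of Interior"}]
source2_worker = [{"worker_name": "Sara", "born_in": "Ramallah"}]


def global_employee():
    """GAV mapping: the global Employee relation defined as a view over both sources."""
    for row in source1_employee:
        yield {"name": row["name"], "works_in": row["organization"], "born_in": None}
    for row in source2_worker:
        yield {"name": row["worker_name"], "works_in": None, "born_in": row["born_in"]}


def employees_born_in(city):
    """A global query, answered by unfolding the view over the sources."""
    return [e["name"] for e in global_employee() if e["born_in"] == city]


all_employees = list(global_employee())
```

Adding a third source means editing `global_employee` itself, which illustrates why GAV copes less well with new sources. Under LAV each source would instead be described separately as a view over the global schema, and answering a query would require rewriting it in terms of those views.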
  24. Integration Process
      • Having identified the correspondences in the previous step, we now resolve the conflicts.
      • One can distinguish four types of conflicts:
        – Structural conflicts
        – Classification conflicts
        – Descriptive conflicts
        – Fragmentation conflicts
      • Examples of conflicts among related object types:
        – different classifications (sets of instances)
        – different sets of properties
        – different structures
        – different coding schemes
        – …
  25. Integration Rules
      • Rules defining the strategy to solve conflicts.
      • Example rules:
        – If an object type corresponds to an attribute, keep the object type.
        – If the population of an object type is included in the population of another object type, build an is-a hierarchy.
      • Integration rules depend on what you want the integrated schema to look like.
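The two example rules can be written as small functions. This is a sketch under assumptions: the kind labels `"object_type"` and `"attribute"`, the function names, and the instance identifiers are all made up for illustration, and populations are modelled as plain Python sets.

```python
def resolve_structure(kind_in_s1, kind_in_s2):
    """Rule 1: if an object type corresponds to an attribute, keep the object type."""
    kinds = {kind_in_s1, kind_in_s2}
    if kinds == {"object_type", "attribute"}:
        return "object_type"
    if len(kinds) == 1:
        return kind_in_s1  # no structural conflict
    raise ValueError(f"no rule for combination {sorted(kinds)}")


def isa_links(populations):
    """Rule 2: if one population is strictly included in another, propose sub is-a super."""
    links = []
    for sub, sub_pop in populations.items():
        for sup, sup_pop in populations.items():
            if sub != sup and sub_pop < sup_pop:  # strict subset
                links.append((sub, sup))
    return links


# Hypothetical instance identifiers for two corresponding object types.
populations = {
    "S1.Faculty": {"f1", "f2", "f3"},
    "S2.PhD-advisor": {"f1", "f3"},
}

book_kind = resolve_structure("object_type", "attribute")
hierarchy = isa_links(populations)
```

Different rule sets yield different integrated schemas from the same correspondences, which is the point of the last bullet above.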
  26. Structural Conflicts (Source: Stefano Spaccapietra)
      • The same concept is modeled with different schema element types, e.g., class, attribute, relationship.
      • Library example:
        – S1: Book is a class
        – S2: books is an attribute of Author
      • Conflict resolution: choose the less constraining structure.
        – Integrated schema: Book is a class
  27. Classification Conflicts
      • Corresponding elements describe different sets of real world objects.
        – S1.Faculty CONTAINS S2.PhD-advisor
      • Conflict resolution:
        – Generalization / Specialization hierarchy: PhD-advisor (S2) becomes a subtype of Faculty (S1)
        – Merging: a single Faculty type
  28. Descriptive Conflicts
      • Corresponding types have different properties, or corresponding properties are described in different ways.
      • Object / Entity / Relationship type:
        – Naming conflicts:
          • synonyms: Node, Extremity
          • homonyms: Highway (EU), Highway (USA)
        – Composition conflicts: different attributes and methods
          • Employee (E#, name, address)
          • Employee (E#, position, salary, department)
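One possible way to handle the two kinds of descriptive conflict is sketched below. The slide does not prescribe a resolution, so both choices are assumptions: synonyms are renamed to one canonical name (picking `Node` is arbitrary), and a composition conflict is resolved by taking the union of the attribute lists. The helper names are hypothetical.

```python
# Assumed canonical names for synonyms; choosing "Node" over "Extremity" is arbitrary.
SYNONYMS = {"Extremity": "Node"}


def canonical(name):
    """Map a synonym to its canonical name; other names pass through unchanged."""
    return SYNONYMS.get(name, name)


def merge_attributes(*attribute_lists):
    """Union of attribute lists, keeping first-seen order and dropping repeats."""
    merged = []
    for attributes in attribute_lists:
        for attribute in attributes:
            if attribute not in merged:
                merged.append(attribute)
    return merged


employee_s1 = ["E#", "name", "address"]
employee_s2 = ["E#", "position", "salary", "department"]

integrated_employee = merge_attributes(employee_s1, employee_s2)
```

Homonyms such as Highway (EU) and Highway (USA) need the opposite treatment: the two types must be kept apart, for example by qualifying their names, which no automatic union can decide.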
  29. Integration Methods: Manual (Source: Stefano Spaccapietra)
      • First method: manual integration ("do it yourself")
      [Figure: the DBA takes the schemas and, using a language, writes the integrated schema and the mapping rules.]
      Easy to implement and flexible, BUT time consuming for the DBA.
  30. Integration Methods: Semi-Automatic (Source: Stefano Spaccapietra)
      • Second method: semi-automatic integration ("tell me about the problem, I will try to fix it")
      [Figure: the DBA supplies the schemas and the correspondences to a TOOL, which produces the integrated schema and the mapping rules.]
      Opens the way to visual CASE tools and integration servers, BUT knowledge acquisition can be painful.
  31. References
      • Carlo Batini: Course on Data Integration. BZU IT Summer School, 2011.
      • Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy, Porto Alegre, 2005.
      • Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center, Menlo Park, USA, 2009.
