Jarrar: Data Schema Integration

841 views

Published on

- Data Schema Integration: A simple example
- Challenges of Data Schema Integration
- Framework for Schema Integration
- Schema Transformation
- Design homogeneization (Normalization)
- Schema Matching
-Schema Integration & Mapping Generation
- GAV and LAV Integration
- Integration Process and Rules.
- Structural Conflicts
- Integration Methods

Published in: Education, Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
841
On SlideShare
0
From Embeds
0
Number of Embeds
16
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Jarrar: Data Schema Integration

  1. 1. Mustafa Jarrar Lecture Notes, Web Data Management (MCOM7348) University of Birzeit, Palestine 1st Semester, 2013 Data Schema Integration Dr. Mustafa Jarrar University of Birzeit mjarrar@birzeit.edu www.jarrar.info Jarrar © 2013 1
  2. 2. Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2013/11/web-data-management.html Jarrar © 2013 2
  3. 3. Data Schema Integration: A simple example In ORM: bornIn/ locatedIn/ /WorksIn Employee City Region Integrated schema locatedIn/ Organization /WorksIn Employee Organization Municipality locatedIn/ bornIn/ Worker City Schema 2 Region Organization locatedIn / Schema 3 Schema 1 Jarrar © 2013 3
  4. 4. Data Schema Integration: A simple example Source: Carlo Batini In ER: Employee born City in works Organization Region Integrated schema in Employee works Organization Empoloyee born City in Schema 2 Schema 1 Region Municipality Organization in Schema 3 Jarrar © 2013 4
  5. 5. Challenges of Data Schema Integration Source: Carlo Batini Schema Integration has two major challenges: 1.  Identification of all portions of schemas that pertain to the same concept, in such a way to unify such different representations in the global schema. 2.  Identification, analysis and resolution of the different types of conflicts (heterogeneities) in different schemas. Jarrar © 2013 5
  6. 6. Framework for Schema Integration Local Schemas Schemas Transformation Transformation Rules Schemas Matching Matching Rules Schemas Integration Integration Rules Integrated Schema and mappings Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000 Jarrar © 2013 6
  7. 7. Framework for Schema Integration 0. Define the integration strategy If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted. Input: n source schemas Output: n source schemas + integration strategies Method used: heuristics S1 S2 S3 S1 S2 S3 S4 One shot strategy - Efficient integration process S3 S4 S2 IS1 IS1 IS S1 IS2 IS2 IS … IS Pair at a time strategy - Priority to most relevant and stable schemas. - The integration process is - Many correspondences more efficient between concepts have to be considered together. Jarrar © 2013 Balanced Strategy -e.g.: Production, Marketing, Sales. -To be preferred when the cohesion among schemas is high. 7
  8. 8. Framework for Schema Integration Source: Stefano Spaccapietra 1. Schema transformation (or Pre-integration) Input: n source schemas Output: n source schemas homogeneized Methods used: Model and Design Homogeneization Reduce model heterogeneities as much as possible to make the sources more suitable for integration. Goal: use a single, common data model and format. Transformation Source DBs Integration Homogeneized DBs Jarrar © 2013 DW 8
  9. 9. Schema Transformation Schema Transformation involves: •  Data model homogeneization –  Where all data sources are described using the same data model. •  Design homogeneization –  Enforce standard design rules to reduce the number of structural conflicts (e.g., Normalization: one fact in one place) •  Reverse Engineering –  Reverse engineer the schema from existing data (such as COBOL files, spreadsheets, legacy relational databases, legacy objectoriented databases). Jarrar © 2013 9
  10. 10. Example of Design homogeneization (Normalization) ONE TABLE: R1 (#Student, Name, LastName, #Course, CourseName, Grade, Date) Dependencies: –  #Student à Name, LastName –  #Course à CourseName –  #Student #Course à Grade, Date) Normalized Into 3 Tables: One Fact In One Place: R11 (#Student, Name, LastName) R12 (#Course, CourseName) R13 (#Student, #Course, Grade, Date) Jarrar © 2013 10
  11. 11. Example of Reverse Engineering Source: Stefano Spaccapietra Jarrar © 2013 11
  12. 12. Schema Matching 2. Schema matching (Correspondences investigation) Input: n source schemas Output: n source schemas + correspondences Method used: techniques to discover correspondences Correspondences relate (schema) elements which describe the same phenomena of the real world. –  This step aims at finding and describing all semantic links between elements of the input schemas and the corresponding data. –  By doing so, one matches between the schemas to be integrated. –  This step fixes the conflicts found in the schema. Jarrar © 2013 12
  13. 13. Semantics of Correspondences Source: Stefano Spaccapietra Correspondences relate (schema) elements which describe the same phenomena of the real world. Jarrar © 2013 13
  14. 14. Asserting Correspondences Source: Stefano Spaccapietra Finding matching correspondences is done through the use of a rich language for expressing correspondences (matchings). Example: S1.Person ≡ S2.Person, With Corresponding Identifiers: Pin, With Corresponding Property: name Jarrar © 2013 14
  15. 15. Automated Matching •  Fully automated matching is impossible, as a computer process can hardly make ultimate decisions about the semantics of data. •  But even partial assistance in discovering of correspondences (to be confirmed or guided by humans) is beneficial, due to the complexity of the task. •  All proposed methods rely on some similarity measures that try to evaluate the semantic distance between two descriptions. Some state of the art matching systems Cupid (Microsoft Research, USA) FOAM/QOM (University of Karlsruhe, Germany) OLA (INRIA Rhône-Alpes, France / University of Montreal, Canada) S-Match (University of Trento, Italy) … many others Jarrar © 2013 15
  16. 16. Examples of Correspondences Source: Stefano Spaccapietra Jarrar © 2013 16
  17. 17. Examples of Correspondences Previous example Employee /WorksIn Municipality locatedIn/ Organization Organization Schema 1 Worker Schema 3 bornIn/ City locatedIn/ Region Schema 2 Jarrar © 2013 17
  18. 18. Examples of Correspondences Source: Stefano Spaccapietra Jarrar © 2013 18
  19. 19. Schema Integration & Mapping Generation Source: Carlo Batini 3. Schemas integration and mapping generation Input: n source schemas + correspondences Output: integrated schema + mapping rules btw the integrated schema and input source schemas Method used: New classification of conflicts + Conflict resolution transformations GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases. Jarrar © 2013 19
  20. 20. GAV and LAV Integration Research has identified two methods to set up mappings between the integrated schema and the input schemas: (1)  GAV (Global As View): proposes to define the integrated schema as a view over input schemas. GAV is usually considered simpler and more efficient for processing queries on the integrated database, but is weaker in supporting evolution of the global system through addition of new sources. (2)  LAV (Local As View): proposes to define the local schemas as views over the integrated schema. LAV generates issues of incomplete information, which adds complexity in handling global queries, but it better supports dynamic addition and removal of source. Jarrar © 2013 20
  21. 21. Integration Process After we identified the correspondences (in the previous step), we now solve the conflicts: One can distinguish between four types of conflicts: –  Structural conflicts –  Classification conflicts –  Descriptive conflicts –  Fragmentation conflicts Examples of conflicts among related object types –  different classifications (sets of instances) –  different sets of properties –  different structures –  different coding schemes –  … Jarrar © 2013 21
  22. 22. Integration Rules Rules defining the strategy to solve conflicts Example rules: –  If an class corresponds to an attribute, keep the class –  If the population of a class is included in the population of another class, build an is-a hierarchy Integration rules depend on how you want the integrated schema to look like Jarrar © 2013 22
  23. 23. Structural Conflicts Source: Stefano Spaccapietra Different schema element types, e.g.: class, attribute, relationship Library example: –  S1 : Book is a class –  S2 : books is an attribute of Author S1 Conflict resolution : Choose the less constraining structure S2 –  Integrated Schema: Book is a class Jarrar © 2013 23
  24. 24. Classification Conflicts •  Corresponding elements describe different sets of real world objects –  S1.Faculty CONTAINS S2.PhD-advisor •  Conflict Resolution: –  Generalization / Specialization hierarchy S1 Faculty Faculty S2 Phd-advisor Phd-advisor –  Merging Faculty Jarrar © 2013 24
  25. 25. Descriptive Conflicts Corresponding types have different properties, or corresponding properties are described in different ways Object / Entity / Relationship type: –  Naming conflicts : •  synonyms Node , Extremity •  homonyms Highway (EU) , Highway (USA) –  Composition conflicts : different attributes and methods •  Employee ( E# , name , address ) •  Employee ( E# , position , salary , department ) Jarrar © 2013 25
  26. 26. Integration Methods: Manual Source: Stefano Spaccapietra First method: manual integration “ do it yourself ” a language mapping rules integrated schema schemas DBA Easy to implement , Flexible BUT time consuming for the DBA Jarrar © 2013 26
  27. 27. Integration Methods: Semi-Automatic Second method : semi-automatic integration “ tell me about the problem, I will try to fix it “ correspondences mapping rules TOOL integrated schema schemas DBA Opens to visual CASE tools, integration servers BUT knowledge acquisition can be painful Jarrar © 2013 27
  28. 28. References and Acknowledgement •  Carlo Batini: Course on Data Integration. BZU IT Summer School 2011. •  Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy. Porto Alegre. 2005. •  Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center. Menlo Park, USA. 2009. Thanks to Anton Deik for helping me preparing this lecture Jarrar © 2013 28

×