Mustafa Jarrar
Lecture Notes, Web Data Management (MCOM7348)
University of Birzeit, Palestine
1st Semester, 2013

Data Schema Integration

Dr. Mustafa Jarrar
University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
Jarrar © 2013

Watch this lecture and download the slides from
http://jarrar-courses.blogspot.com/2013/11/web-data-management.html

Data Schema Integration: A simple example
In ORM:
[Figure: three source schemas and the resulting integrated schema, drawn in ORM]
Schema 1: Employee /WorksIn Organization
Schema 2: Worker bornIn/ City, City locatedIn/ Region
Schema 3: Organization locatedIn/ Municipality
Integrated schema: object types Employee, Organization, City, Region; roles bornIn/, /WorksIn, locatedIn/ (twice)

Data Schema Integration: A simple example
Source: Carlo Batini

In ER:
[Figure: the same example drawn as ER diagrams]
Schema 1: Employee works Organization
Schema 2: Employee born City, City in Region
Schema 3: Organization in Municipality
Integrated schema: entities Employee, Organization, City, Region; relationships born, works, in (twice)

Challenges of Data Schema Integration
Source: Carlo Batini

Schema Integration has two major challenges:
1.  Identification of all portions of schemas that pertain to the
same concept, so that these different representations can be
unified in the global schema.
2.  Identification, analysis and resolution of the different
types of conflicts (heterogeneities) in different schemas.

Framework for Schema Integration
Local Schemas
→ Schemas Transformation (using Transformation Rules)
→ Schemas Matching (using Matching Rules)
→ Schemas Integration (using Integration Rules)
→ Integrated Schema and mappings

Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000

Framework for Schema Integration
0. Define the integration strategy
If the number of local schemas to be integrated is large, the order of
schema integration becomes important. Several strategies can be
adopted.
Input: n source schemas
Output: n source schemas + integration strategies
Method used: heuristics
[Figure: integration trees combining the source schemas S1 … S4, through intermediate schemas IS1 and IS2, into the integrated schema IS]

One shot strategy
- Efficient integration process
- Many correspondences between concepts have to be considered together.

Pair at a time strategy
- Priority to most relevant and stable schemas.
- The integration process is more efficient.

Balanced strategy
- e.g.: Production, Marketing, Sales.
- To be preferred when the cohesion among schemas is high.

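The three orderings can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a schema is reduced to a set of element names, and the hypothetical `integrate` function is a plain set union standing in for a real integration step. The element names come from the simple example at the start of the lecture.

```python
from functools import reduce

# Assumption: a schema is just a frozenset of element names.
Schema = frozenset


def integrate(*schemas):
    """Hypothetical stand-in for one integration step: plain set union."""
    return frozenset().union(*schemas)


def one_shot(schemas):
    """Integrate all schemas together in a single n-ary step."""
    return integrate(*schemas)


def pair_at_a_time(schemas):
    """Fold the schemas in one by one, in the given (priority) order."""
    return reduce(integrate, schemas)


def balanced(schemas):
    """Integrate neighbours pairwise, then integrate the intermediate results."""
    level = list(schemas)
    while len(level) > 1:
        level = [integrate(*level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]


s1 = Schema({"Employee", "Organization"})
s2 = Schema({"Worker", "City", "Region"})
s3 = Schema({"Organization", "Municipality"})
sources = [s1, s2, s3]
result = balanced(sources)
```

With a set union the three orderings give the same result; in a real integration the order changes how many correspondences must be examined in each step, which is what the strategy choice is about.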
Framework for Schema Integration
Source: Stefano Spaccapietra

1. Schema transformation (or Pre-integration)
Input: n source schemas
Output: n homogenized source schemas
Methods used: model and design homogenization

Reduce model heterogeneities as much as possible to make
the sources more suitable for integration.
Goal: use a single, common data model and format.
[Figure: Source DBs → Transformation → Homogenized DBs → Integration → DW]

Schema Transformation
Schema Transformation involves:
•  Data model homogenization
–  All data sources are described using the same data model.

•  Design homogenization
–  Enforce standard design rules to reduce the number of structural
conflicts (e.g., normalization: one fact in one place).

•  Reverse engineering
–  Reverse engineer the schema from existing data (such as COBOL
files, spreadsheets, legacy relational databases, legacy object-oriented databases).

Example of Design Homogenization (Normalization)

ONE TABLE:
R1 (#Student, Name, LastName, #Course, CourseName,
Grade, Date)
Dependencies:
–  #Student → Name, LastName
–  #Course → CourseName
–  #Student, #Course → Grade, Date

Normalized Into 3 Tables: One Fact In One Place:
R11 (#Student, Name, LastName)
R12 (#Course, CourseName)
R13 (#Student, #Course, Grade, Date)
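
The decomposition above can be tried out on a few rows. The sketch below is illustrative only: the sample values are invented, the column names are lower-case stand-ins for the attributes of R1, and `project` is a hypothetical helper that projects rows onto some columns and drops duplicates.

```python
# Rows of the single table R1 (values invented for illustration).
r1 = [
    {"student": 1, "name": "Lina", "last_name": "Saleh", "course": 10,
     "course_name": "Databases", "grade": 85, "date": "2013-11-01"},
    {"student": 1, "name": "Lina", "last_name": "Saleh", "course": 20,
     "course_name": "Web Data", "grade": 90, "date": "2013-11-08"},
    {"student": 2, "name": "Omar", "last_name": "Nasser", "course": 10,
     "course_name": "Databases", "grade": 78, "date": "2013-11-01"},
]


def project(rows, columns):
    """Project rows onto the given columns, dropping duplicate tuples."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in columns)
        seen.setdefault(key, dict(zip(columns, key)))
    return list(seen.values())


# One fact in one place: each dependency gets its own table.
r11 = project(r1, ["student", "name", "last_name"])
r12 = project(r1, ["course", "course_name"])
r13 = project(r1, ["student", "course", "grade", "date"])
```

Each student name and each course name is now stored once, while R13 keeps one row per enrolment.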

Example of Reverse Engineering
Source: Stefano Spaccapietra

Schema Matching
2. Schema matching (Correspondences investigation)
Input: n source schemas
Output: n source schemas + correspondences
Method used: techniques to discover correspondences

Correspondences relate (schema) elements which describe
the same phenomena of the real world.
–  This step aims at finding and describing all semantic links between
elements of the input schemas and the corresponding data.
–  In doing so, one matches the schemas to be integrated.
–  This step also exposes the conflicts between the schemas, which are resolved in the integration step.

Semantics of Correspondences
Source: Stefano Spaccapietra

Correspondences relate (schema) elements which describe
the same phenomena of the real world.

Asserting Correspondences
Source: Stefano Spaccapietra

Correspondences (matchings) are asserted using a rich language
for expressing them.

Example:

S1.Person ≡ S2.Person,
With Corresponding Identifiers: Pin,
With Corresponding Property: name
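
One way to hold such an assertion in a program is a small record type. The sketch below is an assumption for illustration, not the correspondence language used in the source: the class `Correspondence` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """Hypothetical record for one asserted correspondence."""
    left: str                 # element of the first schema, e.g. "S1.Person"
    right: str                # element of the second schema, e.g. "S2.Person"
    relation: str = "≡"       # kind of correspondence (equivalence here)
    identifiers: tuple = ()   # corresponding identifiers
    properties: tuple = ()    # corresponding properties

    def __str__(self):
        return (f"{self.left} {self.relation} {self.right}, "
                f"identifiers: {', '.join(self.identifiers)}, "
                f"properties: {', '.join(self.properties)}")


person = Correspondence("S1.Person", "S2.Person",
                        identifiers=("Pin",), properties=("name",))
```

Printing `person` gives back a one-line form of the assertion shown in the example.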
Automated Matching
•  Fully automated matching is impossible, as a computer process can
hardly make ultimate decisions about the semantics of data.
•  But even partial assistance in discovering correspondences (to be
confirmed or guided by humans) is beneficial, due to the complexity of
the task.
•  All proposed methods rely on some similarity measures that try to
evaluate the semantic distance between two descriptions.
Some state-of-the-art matching systems:
Cupid (Microsoft Research, USA)
FOAM/QOM (University of Karlsruhe, Germany)
OLA (INRIA Rhône-Alpes, France / University of Montreal, Canada)
S-Match (University of Trento, Italy)
… many others
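
A very simple similarity measure can be sketched with Python's standard `difflib` module. This is a deliberately naive, name-based measure chosen for illustration; it is not how any of the systems listed above work, and the helper names and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two element names, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(schema_a, schema_b, threshold=0.6):
    """Return (a, b, score) candidates at or above the threshold, best first."""
    pairs = [(a, b, name_similarity(a, b)) for a in schema_a for b in schema_b]
    return sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])


schema_1 = ["Employee", "Organization"]
schema_3 = ["Organization", "Municipality"]
candidates = candidate_matches(schema_1, schema_3)
```

A measure like this proposes Organization/Organization at once, but it cannot know that Employee and Worker describe the same concept, which is why the candidates still need to be confirmed by a human.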
Examples of Correspondences
Source: Stefano Spaccapietra

Examples of Correspondences
Previous example:
Schema 1: Employee /WorksIn Organization
Schema 2: Worker bornIn/ City, City locatedIn/ Region
Schema 3: Organization locatedIn/ Municipality

Examples of Correspondences
Source: Stefano Spaccapietra

Schema Integration & Mapping Generation
Source: Carlo Batini

3. Schema integration and mapping generation
Input: n source schemas + correspondences
Output: integrated schema + mapping rules between the integrated
schema and the input source schemas
Method used: new classification of conflicts + conflict resolution
transformations
GOAL: create an Integrated Schema (IS) and the mappings to the
local databases.

GAV and LAV Integration
Research has identified two methods to set up mappings between the
integrated schema and the input schemas:
(1)  GAV (Global As View): proposes to define the integrated schema
as a view over input schemas.
GAV is usually considered simpler and more efficient for processing
queries on the integrated database, but is weaker in supporting
evolution of the global system through addition of new sources.
(2)  LAV (Local As View): proposes to define the local schemas as
views over the integrated schema.
LAV generates issues of incomplete information, which adds
complexity in handling global queries, but it better supports dynamic
addition and removal of sources.
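
The GAV idea can be illustrated with a toy sketch in which each relation of the integrated schema is a view, written here as a Python function, over the source data. All data values, the relation names and the helper `query_global` are invented for illustration; `birth_region` is a hypothetical global relation that joins two source relations.

```python
# Source data (invented tuples; relation names loosely follow the example schemas).
source_1 = {"works_in": [("Sara", "Acme Org")]}
source_2 = {
    "born_in": [("Sara", "Springfield")],
    "located_in": [("Springfield", "North Region")],
}

# GAV: every global relation is defined as a view over the sources.
gav_views = {
    "works_in": lambda: list(source_1["works_in"]),
    "born_in": lambda: list(source_2["born_in"]),
    "birth_region": lambda: [
        (person, region)
        for person, city in source_2["born_in"]
        for located_city, region in source_2["located_in"]
        if city == located_city
    ],
}


def query_global(relation):
    """Answer a query on the integrated schema by unfolding its view."""
    return gav_views[relation]()


answer = query_global("birth_region")
```

Query answering is a simple unfolding, which matches the efficiency argument for GAV, but adding a source means revisiting the view definitions. Under LAV the sources would instead be described as views over the integrated schema and answering a global query would need a rewriting step, which is not shown here.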
Integration Process
After identifying the correspondences (in the previous
step), we now resolve the conflicts.
Four types of conflicts can be distinguished:
–  Structural conflicts
–  Classification conflicts
–  Descriptive conflicts
–  Fragmentation conflicts

Examples of conflicts among related object types
–  different classifications (sets of instances)
–  different sets of properties
–  different structures
–  different coding schemes
–  …
Integration Rules
Rules defining the strategy to resolve conflicts.
Example rules:
–  If a class corresponds to an attribute, keep the class.
–  If the population of a class is included in the population of another
class, build an is-a hierarchy.

Integration rules depend on how you want the integrated
schema to look.
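
The two example rules can be written down as small functions. This is a minimal sketch under assumptions: populations are modelled as Python sets of instance identifiers, the function names and return shapes are hypothetical, and the case of equal populations is left undecided because the rules above do not cover it.

```python
def resolve_class_vs_attribute(class_name, attribute_name):
    """Rule 1: when a class corresponds to an attribute, keep the class."""
    return {"kind": "class", "name": class_name, "replaces": attribute_name}


def resolve_by_population(name_a, population_a, name_b, population_b):
    """Rule 2: build an is-a link when one population is strictly included in the other."""
    if population_a < population_b:
        return (name_a, "is-a", name_b)
    if population_b < population_a:
        return (name_b, "is-a", name_a)
    return None  # no inclusion found: this rule does not apply


book = resolve_class_vs_attribute("Book", "Author.books")
link = resolve_by_population("Faculty", {"p1", "p2", "p3"},
                             "PhD-advisor", {"p1", "p3"})
```

A different set of rules, for example one that keeps the attribute instead of the class, would yield a different but equally valid integrated schema.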

Structural Conflicts
Source: Stefano Spaccapietra

Different schema element types, e.g.: class, attribute, relationship

Library example:
–  S1: Book is a class
–  S2: books is an attribute of Author

Conflict resolution:
Choose the less constraining structure
–  Integrated Schema: Book is a class

Classification Conflicts
•  Corresponding elements describe different sets of real-world objects
–  S1.Faculty CONTAINS S2.PhD-advisor

•  Conflict Resolution:
–  Generalization / Specialization hierarchy: Faculty with PhD-advisor as its subtype
–  Merging: a single type Faculty
[Figure: S1 with Faculty and S2 with PhD-advisor, shown next to the two resolutions]

Descriptive Conflicts
Corresponding types have different properties, or corresponding
properties are described in different ways
Object / Entity / Relationship type:
–  Naming conflicts:
•  synonyms: Node, Extremity
•  homonyms: Highway (EU), Highway (USA)

–  Composition conflicts: different attributes and methods
•  Employee (E#, name, address)
•  Employee (E#, position, salary, department)
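
For the composition conflict above, one possible resolution is to give the integrated Employee type the union of the attributes of both descriptions. The sketch below assumes exactly that, and also assumes that equal attribute names denote the same property (that is, naming conflicts have already been resolved); the helper `merge_attributes` is hypothetical.

```python
employee_s1 = ["E#", "name", "address"]
employee_s2 = ["E#", "position", "salary", "department"]


def merge_attributes(*attribute_lists):
    """Union of the attribute lists, keeping first-seen order."""
    merged = []
    for attributes in attribute_lists:
        for attribute in attributes:
            if attribute not in merged:
                merged.append(attribute)
    return merged


# Attributes present in both descriptions (here only the identifier).
shared = [a for a in employee_s1 if a in employee_s2]
# Attributes of the integrated Employee type under the union assumption.
merged = merge_attributes(employee_s1, employee_s2)
```

The shared identifier E# is what allows instances from the two sources to be joined into one integrated Employee.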

Integration Methods: Manual
Source: Stefano Spaccapietra

First method: manual integration
"do it yourself"
[Figure: the DBA takes the schemas and, using a language, writes the mapping rules and the integrated schema]

Easy to implement, flexible
BUT
time consuming for the DBA

Integration Methods: Semi-Automatic
Second method: semi-automatic integration
"tell me about the problem, I will try to fix it"
[Figure: the DBA gives the schemas and the correspondences to a TOOL, which produces the mapping rules and the integrated schema]

Opens the way to visual CASE tools, integration servers
BUT knowledge acquisition can be painful

References and Acknowledgement
•  Carlo Batini: Course on Data Integration. BZU IT Summer School
2011.
•  Stefano Spaccapietra: Information Integration. Presentation at the IFIP
Academy. Porto Alegre. 2005.
•  Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI
International, Artificial Intelligence Center. Menlo Park, USA. 2009.

Thanks to Anton Deik for helping me prepare this lecture.
