Jarrar © 2013 1
Dr. Mustafa Jarrar
University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
Lecture Notes, Web Data Management
Birzeit University, Palestine
2013
Introduction to Data Integration
Watch this lecture and download the slides from
http://jarrar-courses.blogspot.com/2014/01/web-data-management.html
Outline
- Example from the Government Domain
- Problem is in all domains
- Challenges of Data Integration:
  - Heterogeneities in Database Schemas
  - Name and Meaning Heterogeneities
  - Heterogeneities in Structure and Type
  - Heterogeneities in the Rules and Constraints
  - Model Heterogeneities
Keywords: Data Integration, Registered data, domain, domain name system, web, distributed database,
database schema, Heterogeneities, Model Heterogeneities, Data model, Synonyms, Homonyms, Attribute,
Entity
Example from the government Domain
Consider all interactions with government agencies in order
to register a new business in Palestine.
Example: Establishing a new Radio Station.
– Ministry of Telecom
– Ministry of Information
– Ministry of National Economy
– Chamber of Commerce
– Ministry of Finance
Consider when the business evolves or changes.
Example: Changing the address of the radio station.
– Address must be changed in 5 different databases.
Consider the data registered about the same radio station in
the databases of different ministries and governmental
agencies:
Agency 1:
ID      Name           Type           City
R2563I  Radio Al-Amal  Radio Station  Ramallah

Agency 2:
B_ID    Business Name      Activity Type       Province
LM1847  Al-Amal Broadcast  Radio Broadcasting  Ramallah and Bireh

Agency 3:
ID      Company Name       Company Type          Location
182NS3  Broadcast Al-Amal  Broadcasting Station  Al-Balu’

. . .
From our simple example, one can point out some challenges in Data Integration:
– No agreed-upon naming (Name, Business Name, Company Name).
– No agreed-upon meaning (does ‘Activity Type’ mean exactly the same as ‘Company Type’?).
– Different registered data: Radio Al-Amal, Al-Amal Broadcast, …
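The naming challenge can be illustrated with a minimal Python sketch. It renames each agency's columns to a shared vocabulary. The shared names (`id`, `name`, `type`, `place`) and the agency keys are assumptions made for illustration only; they do not come from any of the agencies.

```python
# Hypothetical mappings from each agency's column names to a shared vocabulary.
# The target names ("id", "name", "type", "place") are invented for this sketch.
FIELD_MAPS = {
    "agency1": {"ID": "id", "Name": "name", "Type": "type", "City": "place"},
    "agency2": {"B_ID": "id", "Business Name": "name",
                "Activity Type": "type", "Province": "place"},
    "agency3": {"ID": "id", "Company Name": "name",
                "Company Type": "type", "Location": "place"},
}

# The three registered records from the example, one per agency.
RECORDS = {
    "agency1": {"ID": "R2563I", "Name": "Radio Al-Amal",
                "Type": "Radio Station", "City": "Ramallah"},
    "agency2": {"B_ID": "LM1847", "Business Name": "Al-Amal Broadcast",
                "Activity Type": "Radio Broadcasting",
                "Province": "Ramallah and Bireh"},
    "agency3": {"ID": "182NS3", "Company Name": "Broadcast Al-Amal",
                "Company Type": "Broadcasting Station", "Location": "Al-Balu’"},
}


def to_common(agency, record):
    """Rename the columns of one record using that agency's field map."""
    mapping = FIELD_MAPS[agency]
    return {mapping[column]: value for column, value in record.items()}


unified = {agency: to_common(agency, rec) for agency, rec in RECORDS.items()}

# Renaming aligns the column names only: the registered values still differ.
distinct_names = {rec["name"] for rec in unified.values()}
print(distinct_names)
```

Renaming columns addresses only the first challenge. Mapping both ‘Activity Type’ and ‘Company Type’ to one `type` field assumes they mean the same thing, which is exactly what the second challenge calls into question, and the three name values remain different after the mapping.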
Problem is in all domains
The problem is now even more challenging with the Web.
The Data Web envisions the Web as a global, world-wide database.
This means that one can query multiple distributed databases on the Web as if querying a local database.
Challenges of Data Integration:
Heterogeneities in Database Schemas
One can distinguish several kinds of heterogeneity between different schemas:
– Name heterogeneities (differences in the vocabulary used).
– Meaning heterogeneities (different meanings for the same attribute in two schemas).
– Heterogeneities in structure and type.
– Heterogeneities in rules and constraints.
– Data model heterogeneities.
Name and Meaning Heterogeneities
Synonyms – Different names for the same concepts
– employee, clerk
– exam, course
– code, num
Homonyms – Same name for different concepts (different meanings)
– City as city of birth in one schema
– City as city of residence in another schema
Examples:
– Synonyms: Section and Division both denote a specialized division of a large organization.
– Homonyms: Salary means net salary in one schema and gross salary in another.
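As a rough sketch of how synonyms and homonyms might be handled in code, the Python below uses a small thesaurus for synonyms and schema-qualified names for homonyms. The choice of canonical terms and the schema names `schema_a` and `schema_b` are assumptions made for illustration.

```python
# Hypothetical thesaurus: each synonym points to one canonical term.
# Which term of a pair is canonical is an arbitrary choice in this sketch.
SYNONYMS = {"section": "division", "clerk": "employee", "num": "code"}


def canonical(term):
    """Return the canonical name for a term, resolving known synonyms."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


# Homonyms: the same attribute name means different things per schema,
# so the lookup key includes the (hypothetical) schema name.
HOMONYMS = {
    ("schema_a", "salary"): "net_salary",
    ("schema_b", "salary"): "gross_salary",
    ("schema_a", "city"): "city_of_birth",
    ("schema_b", "city"): "city_of_residence",
}


def resolve(schema, attribute):
    """Resolve an attribute to an unambiguous concept name."""
    name = canonical(attribute)
    return HOMONYMS.get((schema, name), name)


print(canonical("Section"), canonical("Division"))
print(resolve("schema_a", "Salary"), resolve("schema_b", "Salary"))
```

Synonyms collapse to one name, while homonyms are pulled apart into two. Both tables have to be written by someone who knows what each schema actually means; nothing here discovers the meanings automatically.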
Heterogeneities in Structure and Type
The same concepts are represented with
different conceptual structures in two schemas:
– Attribute in one schema and derived value in another schema.
– Attribute in one schema and entity in another schema.
– Entity in one schema and relationship in another schema.
– Different abstraction levels for the same concept in two schemas:
e.g. two entities with homonym names related by an IS-A hierarchy
in two schemas.
Source: Carlo Batini
Heterogeneities in Structure
EXAMPLES:
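[Figure: three pairs of schema diagrams: (1) BOOK and PUBLISHER modeled in two ways; (2) EMPLOYEE, DEPARTMENT, and PROJECT versus EMPLOYEE and PROJECT; (3) Person with WOMAN and MAN versus Person with GENDER.]
Source: Carlo Batini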
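The "attribute in one schema and entity in another schema" case can be sketched in Python, assuming one schema stores the publisher as a plain attribute of a book and the other models it as a separate entity. The class names, the integer identifiers, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookA:
    """Schema A (assumed): the publisher is a plain attribute of the book."""
    title: str
    publisher: str


@dataclass(frozen=True)
class Publisher:
    """Schema B (assumed): the publisher is an entity of its own."""
    publisher_id: int
    name: str


@dataclass(frozen=True)
class BookB:
    """Schema B (assumed): the book refers to a publisher entity by id."""
    title: str
    publisher_id: int


def attribute_to_entity(books):
    """Convert schema A books into schema B publishers and books."""
    ids = {}          # publisher name -> generated id
    publishers = []
    converted = []
    for book in books:
        if book.publisher not in ids:
            ids[book.publisher] = len(ids) + 1
            publishers.append(Publisher(ids[book.publisher], book.publisher))
        converted.append(BookB(book.title, ids[book.publisher]))
    return publishers, converted


# Invented sample data.
books_a = [
    BookA("Book One", "Press X"),
    BookA("Book Two", "Press X"),
    BookA("Book Three", "Press Y"),
]
publishers, books_b = attribute_to_entity(books_a)
print(publishers)
print(books_b)
```

Going from attribute to entity is mechanical once publisher names are trusted to identify publishers. That trust is itself an assumption: two spellings of one publisher would produce two entities.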
Heterogeneities in Type
Examples:
• Different types for a single attribute (e.g., numeric, alphanumeric).
E.g., the attribute “gender”:
– Male/Female
– M/F
– 0/1
• Year has a four-digit domain in one schema and a two-digit domain in another schema.
• Different currencies (Euros, US Dollars, etc.).
• Different measurement systems (kilos vs. pounds, centigrade vs. Fahrenheit).
• Different granularities (grams, kilos, etc.).
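A minimal Python sketch of normalizing such type differences is given below. The target encoding `M`/`F`, the pivot used to expand two-digit years, and the function names are assumptions made for illustration; a real integration would take these conventions from each source's documentation.

```python
# Textual gender encodings mapped to one assumed target encoding ("M"/"F").
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def normalize_gender(value, numeric_codes=None):
    """Map Male/Female, M/F, or 0/1 to 'M' or 'F'.

    A 0/1 encoding is ambiguous on its own, so the caller must supply the
    source's convention, e.g. {0: "F", 1: "M"}.
    """
    if isinstance(value, int):
        if numeric_codes is None:
            raise ValueError("0/1 encoding needs an explicit convention")
        return numeric_codes[value]
    return GENDER_CODES[value.strip().lower()]


def expand_year(two_digit, pivot=50):
    """Expand a two-digit year; values below the assumed pivot map to 20xx."""
    century = 2000 if two_digit < pivot else 1900
    return century + two_digit


def pounds_to_kilos(pounds):
    """Convert pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237


def fahrenheit_to_celsius(degrees_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9


print(normalize_gender("Male"), normalize_gender(1, {0: "F", 1: "M"}))
print(expand_year(13), expand_year(87))
```

The 0/1 case shows why type conflicts cannot always be resolved from the data alone: the convention has to come from outside the schema.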
Heterogeneities in the rules and constraints
EXAMPLES:
– Different cardinalities in the same relationships
– Key conflicts
Source: Carlo Batini
Model Heterogeneities
Model heterogeneity occurs when different databases adhere to different data models:
– Relational data model, XML, RDF, object-oriented, OWL, ...
Solution: Reduce model heterogeneity by using one data model.
Example: Convert the relational model to the RDF graph model.
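The conversion idea can be sketched in plain Python by turning one relational row into subject-predicate-object triples. This is only an illustration of the idea, not a full relational-to-RDF mapping: the base URI `http://example.org/`, the table name `agency1`, and the URI patterns are assumptions, and the triples are ordinary tuples rather than objects from an RDF library.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Turn one relational row into (subject, predicate, object) triples.

    The subject URI is built from the table name and the key value; every
    other non-null column becomes one triple. The URI patterns are assumed.
    """
    subject = f"{base}{table}/{row[key_column]}"
    triples = [(subject, RDF_TYPE, f"{base}{table}")]
    for column, value in row.items():
        if column == key_column or value is None:
            continue
        predicate = f"{base}{table}#{column.replace(' ', '_')}"
        triples.append((subject, predicate, value))
    return triples


# The Agency 1 record from the government example.
row = {"ID": "R2563I", "Name": "Radio Al-Amal",
       "Type": "Radio Station", "City": "Ramallah"}
triples = row_to_triples("agency1", "ID", row)
for triple in triples:
    print(triple)
```

Once every source is expressed as triples, the model heterogeneity is gone, but the name, meaning, and type heterogeneities remain and still have to be mapped.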
References and Acknowledgement
• Carlo Batini: Course on Data Integration. BZU IT Summer School, 2011.
• Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy, Porto Alegre, 2005.
• Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center, Menlo Park, USA, 2009.
Thanks to Anton Deik for helping me prepare this lecture.
