MDB1: A Schema Integration Prototype for a Multidatabase System
A Thesis
Presented to
the Faculty of the College of Computer Studies
De La Salle University
In Partial Fulfillment
of the Requirements for the Degree of
Bachelor of Science in Computer Science
by
Cua, Samuel V.
Gaw, Gilbert O.
Kiok, Joseph T.
Lau, Jerwynn Glenn N.
Ms. Charibeth Ko
Faculty Adviser
August 2, 1999
ACKNOWLEDGMENTS
The proponents would like to thank their parents, namely Mr. Cua Hung Cheng, Mrs. Ngo Chin Lian, Mr. Eduardo Gaw, Mr. Manuel Kiok, Mrs. Priscilla Kiok, Mr. Joseph Lau, and Mrs. Brigitte Lau, for their encouragement and support, understanding, prayers, and boarding and lodging.
We are very grateful to the faculty members who helped us in the development of our thesis, especially Ms. Charibeth Ko, our thesis adviser, who gave us her time and effort. Under her guidance, we were able to come out with a great thesis project and presentation.
To Mr. Jefferson Tan, our previous adviser, for guiding
us in the initial development of this thesis. To Mr.
Bernard Yung and to Mr. Rene Arellano for supplying
us with software tools needed for the thesis.
Our gratitude to DLSU-PUSO and Mr. Armando B.
Victor, for providing us with several books used in the
research and development of this thesis. We also thank
the ADRIC for the facilities that we used during the
project’s initial development.
Finally, the proponents would not have been
successful in completing this thesis without the love
and guidance of our Heavenly Father.
Abstract
The need for the integration of data from heterogeneous and physically
distributed information sources has triggered the research and development of
multidatabase systems. The MDB1 system is a tightly-coupled federated database
system that employs a global schema to abstract global clients from heterogeneous
database systems. Participating components in the federation are heterogeneous,
autonomous, relational-model database systems. The main goal of the system is to
manage the integration of heterogeneous databases, which will then allow the
execution of global queries.
The study explored issues on multidatabase architectures and semantic
heterogeneity through a review of existing literature. Through survey, study, and
analysis, the proponents produced a set of simplified methods and mapping rules for
resolving various forms of schema conflicts, namely name difference, format
difference, missing data, conflicting values, semantic difference, and structural
conflict.
The system is composed of two major components, namely the Schema
Integration Tool and the Middleware. The Schema Integration Tool facilitates
schema mapping and enables the resolution of various forms of schema conflicts.
The Middleware then refers to the mapping definitions to create the global database
and perform data integration. The system utilizes an Oracle Database to store the
final integrated data.
TABLE OF CONTENTS
1.0. RESEARCH DESCRIPTION........................................................................................1-1
1.1. OVERVIEW OF THE CURRENT STATE OF THE SPECIFIC TECHNOLOGY..............................1-1
1.2. RESEARCH OBJECTIVE .................................................................................................1-2
1.2.1. General Objective...................................................................................................1-2
1.2.2. Specific Objectives..................................................................................................1-2
1.3. SCOPE AND LIMITATIONS OF THE RESEARCH .................................................................1-3
1.4. SIGNIFICANCE OF THE RESEARCH .................................................................................1-3
1.5. RESEARCH METHODOLOGY..........................................................................................1-4
2.0. REVIEW OF RELATED LITERATURE.....................................................................2-1
2.1. ISSUES IN MULTIDATABASE SYSTEMS ...........................................................................2-1
2.1.1. Access Control in Multidatabase Systems................................................................2-1
2.1.2. Data Integration in Multidatabase Systems..............................................................2-2
2.1.3. Query Processing and Optimization........................................................................2-5
2.2. EXISTING SOFTWARE ...................................................................................................2-5
2.2.1. Cords Schema Integration Environment ..................................................................2-5
2.2.2. Microsoft Transaction Server (Viper)......................................................................2-8
2.2.3. Sybase Jaguar.........................................................................................................2-8
2.2.4. ADDS (Amoco Production Company, Research)......................................................2-9
2.2.5. DOIA....................................................................................................................2-10
2.2.6. Mermaid...............................................................................................................2-11
3.0. THEORETICAL FRAMEWORK ................................................................................3-1
3.1. DEFINITION OF TERMS..................................................................................................3-1
3.2. MULTIDATABASE ARCHITECTURE ................................................................................3-5
3.2.1. Characteristics of Database Systems.......................................................................3-5
3.2.2. Taxonomy of Federated Database Systems..............................................................3-6
3.2.3. Reference Architecture............................................................................................3-8
3.2.4. Processor Types in the Reference Architecture ........................................................3-9
3.2.5. ANSI/SPARC Three-Level Schema Architecture ....................................................3-11
3.2.6. Five-Level Schema Architecture for Federated Databases.....................................3-11
3.3. MULTIDATABASE ISSUES............................................................................................3-14
3.3.1. Schema Integration...............................................................................................3-14
3.3.2. Access Control......................................................................................................3-15
3.3.3. Query Processing and Optimization......................................................................3-15
3.4. SCHEMA INTEGRATION...............................................................................................3-16
3.4.1. Name Difference...................................................................................................3-16
3.4.2. Format Difference ................................................................................................3-17
3.4.3. Missing Data ........................................................................................................3-17
3.4.4. Conflicting Values ................................................................................................3-18
3.4.5. Semantic Difference..............................................................................................3-18
3.4.6. Structural Difference ............................................................................................3-19
3.5. CLIENT / SERVER COMPUTING....................................................................................3-20
3.6. MIDDLEWARE............................................................................................................3-21
3.7. SQL / VIEWS .............................................................................................................3-22
3.8. DISTRIBUTED COMPUTING..........................................................................................3-24
3.9. QUERY PROCESSING AND OPTIMIZATION ....................................................................3-26
4.0. THE MULTIDATABASE SYSTEM.............................................................................4-1
4.1. SYSTEM OVERVIEW .....................................................................................................4-1
4.2. SYSTEM OBJECTIVES....................................................................................................4-2
4.2.1. General Objective...................................................................................................4-2
4.2.2. Specific Objective ...................................................................................................4-2
4.3. SYSTEM FUNCTIONS.....................................................................................................4-3
4.3.1. Middleware ............................................................................................................4-3
4.3.2. Schema Integration Tool.........................................................................................4-5
4.3.3. Client Application...................................................................................................4-7
4.3.4. Catalogs.................................................................................................................4-8
4.4. SYSTEM SCOPE AND LIMITATIONS ................................................................................4-8
4.5. PHYSICAL ENVIRONMENT AND RESOURCES...................................................................4-9
4.6. ARCHITECTURAL DESIGN ...........................................................................................4-10
4.6.1. Middleware ..........................................................................................................4-10
4.6.2. Schema Integration Tool.......................................................................................4-13
4.6.3. Client Application.................................................................................................4-18
4.6.4. Architectural Issues ..............................................................................................4-19
5.0. DESIGN AND IMPLEMENTATION ISSUES.............................................................5-1
5.1. SYSTEM ARCHITECTURE...............................................................................................5-2
5.2. DATABASE SCHEMA DATA STRUCTURES.......................................................................5-9
5.2.1. DBSchema............................................................................................................5-10
5.2.2. DBTableDef..........................................................................................................5-10
5.2.3. DBFieldDef ..........................................................................................................5-11
5.3. THE GLOBAL SCHEMA EDITOR MODULE.....................................................................5-13
5.3.1. Data Structures.....................................................................................................5-13
5.3.2. Major Algorithms Used.........................................................................................5-16
5.3.3. Design Issues........................................................................................................5-20
5.4. COMPONENT SCHEMA MANAGER ...............................................................................5-22
5.4.1. Data Structures.....................................................................................................5-22
5.4.1.1 Component Schema Catalog.....................................................................................5-22
5.4.1.2 Export Schema Catalog ............................................................................................5-23
5.4.1.3 Database Profile .......................................................................................................5-25
5.4.1.4 Schema Definition Errors catalog..............................................................................5-25
5.4.2. Major Sub-Modules ..............................................................................................5-26
5.4.2.1 Schema Loader.........................................................................................................5-27
5.4.2.2 Verify Schema ..........................................................................................................5-28
5.4.2.3 Component Schema Viewer......................................................................................5-29
5.5. MAPPING EDITOR.......................................................................................................5-30
5.5.1. Data Structures.....................................................................................................5-32
5.5.1.1 The Mapping Rule base class....................................................................................5-33
5.5.1.2 Derived Mapping Rule Classes .................................................................................5-34
5.5.1.3 Supporting Classes for Mapping Rules......................................................................5-36
5.5.1.4 Table Mapping and Mapping Entries.........................................................................5-38
5.5.1.5 The Mapping Catalog ...............................................................................................5-41
5.5.1.6 Binding Factor and Table Filter.................................................................................5-43
5.5.2. Major Algorithms .................................................................................................5-44
5.6. FORMULA MANAGER .................................................................................................5-48
5.6.1. Data Structures.....................................................................................................5-48
5.7. MAPPING GENERATOR ...............................................................................................5-51
5.7.1. Create Global Schema Scripts...............................................................................5-51
5.7.2. User-defined PL/SQL function Script ....................................................................5-52
5.7.3. Integration Script..................................................................................................5-53
5.7.4. Pseudocode for Generating Scripts .......................................................................5-53
5.8. THE MIDDLEWARE.....................................................................................................5-63
5.8.1. The Update Process..............................................................................................5-64
5.8.2. Data Loader .........................................................................................................5-67
5.8.3. Data Integrator.....................................................................................................5-68
5.8.4. Update Log File....................................................................................................5-69
5.9. UPDATE MANAGER ....................................................................................................5-71
5.9.1. Major Algorithms Used.........................................................................................5-71
5.10. CLIENT APPLICATION.................................................................................................5-72
5.10.1. Data Structures................................................................................................5-72
5.10.2. Major Algorithms Used ....................................................................................5-72
6.0. RESULTS AND OBSERVATIONS ..............................................................................6-1
6.1. INTRODUCTION ............................................................................................................6-1
6.2. SCHEMA INTEGRATION TOOL .......................................................................................6-1
6.2.1. Component Schema Manager..................................................................................6-1
6.2.2. Global Schema Editor.............................................................................................6-5
6.2.3. Mapping Editor ....................................................................................................6-12
6.3. MIDDLEWARE............................................................................................................6-14
6.3.1. Data Type Conversion ..........................................................................................6-16
6.3.2. Data Type Overflow..............................................................................................6-19
6.3.3. Integrity Constraint...............................................................................................6-21
6.3.4. Data Loading........................................................................................................6-22
6.3.5. Update Manager...................................................................................................6-22
6.4. CASE STUDY – NBA MULTIDATABASE PROJECT.........................................................6-24
6.4.1. PLAYERS Table....................................................................................................6-24
6.4.2. PLAYER_STATS Table .........................................................................................6-29
6.4.3. SCHEDULE Table................................................................................................6-33
6.4.4. NBA_GUARD Table .............................................................................................6-37
6.4.5. NBA_FORWARD Table ........................................................................................6-39
6.4.6. NBA_CENTER Table............................................................................................6-41
7.0. CONCLUSIONS AND RECOMMENDATIONS .........................................................7-1
7.1. CONCLUSION ...............................................................................................................7-1
7.1.1. Schema Integration Tool.........................................................................................7-1
7.1.2. Middleware ............................................................................................................7-3
7.1.3. Client Application...................................................................................................7-4
7.2. RECOMMENDATIONS ....................................................................................................7-4
7.2.1. Facility for Accessing Generic Databases ...............................................................7-4
7.2.2. Facility for viewing actual values in Component Databases.....................................7-5
7.2.3. User-Interface Improvements..................................................................................7-5
7.2.4. Copying of schema and mapping objects .................................................................7-5
7.2.5. Document Printing..................................................................................................7-5
7.2.6. Middleware as NT service or System tray program..................................................7-6
7.2.7. Real-time Data Integration .....................................................................................7-6
7.2.8. Support for additional data types ............................................................................7-6
7.2.9. Support for different Database models ....................................................................7-6
Appendix A. Bibliography …………………………………………………………………… A-1
Appendix B. Oracle Reserved Words…………………………………………………………. B-1
Appendix C. DLSU Sample Schema ………………………………………………………… C-1
Appendix D. NBA Sample Schema…………………………………………………………... D-1
Appendix E. Resource Persons……………………………………………………………….. E-1
Appendix F. Personal Vitae……….…………………………………………………………... F-1
List of Figures
Figure 3-1. System architecture of a centralized DBMS [SHET90]........................3-7
Figure 3-2. Taxonomy of a Multidatabase System [SHET90]. ...............................3-8
Figure 3-3. An accessing processor......................................................................3-10
Figure 3-4. Five-level schema architecture of an FDBS. ......................................3-12
Figure 3-5. System Architecture for an FDBS......................................................3-13
Figure 3-6. Client/Server Interaction....................................................................3-20
Figure 4-1. The MDB1 System..............................................................................4-4
Figure 4-2. Multidatabase system hierarchical chart ............................................4-10
Figure 4-3. MDB1 System Architecture ..............................................................4-11
Figure 4-4. Middleware component hierarchical chart. ........................................4-11
Figure 4-5. The Middleware ................................................................................4-12
Figure 4-6. The Schema Integration Tool ............................................................4-14
Figure 4-7. Schema Integration Tool hierarchical chart........................................4-15
Figure 4-8. Client Application hierarchical chart. ................................................4-19
Figure 4-9. The MDB1 5-level Schema Architecture ..........................................4-21
Figure 5-1. The MDB1 System..............................................................................5-3
Figure 5-2. The Schema Integration Tool ..............................................................5-4
Figure 5-3. Data Integration in the old design........................................................5-5
Figure 5-4. Data integration in the new design.......................................................5-6
Figure 5-5. Schema class hierarchy......................................................................5-10
Figure 5-6. Table class hierarchy.........................................................................5-11
Figure 5-7. Field class hierarchy..........................................................................5-12
Figure 5-8. Global Schema Data Structure...........................................................5-12
Figure 5-9. How the Global Schema Editor interacts with the Global Schema .....5-13
Figure 5-10. Adding a Global Table to a Global Schema .....................................5-16
Figure 5-11. Inserting a Foreign Key into a Global Table ....................................5-19
Figure 5-12. Two ways of representing foreign keys in the global schema...........5-21
Figure 5-13. Component Schema Data Structure .................................................5-23
Figure 5-14. Export Schema Catalog ...................................................................5-24
Figure 5-15. Schema Loader module ...................................................................5-27
Figure 5-16. Mapping editor screen. ....................................................................5-30
Figure 5-17. A mapping rule box.........................................................................5-30
Figure 5-18. The Mapping Rule classes inheritance tree ......................................5-32
Figure 5-19. A mapping entry..............................................................................5-40
Figure 5-20. Define Mapping Entry Dialog .........................................................5-41
Figure 5-21. Mapping Catalog class hierarchy.....................................................5-42
Figure 5-22. The Mapping Catalog data structure ................................................5-42
Figure 5-23. Mapping Rule Set hierarchy ............................................................5-46
Figure 5-24. The Middleware ..............................................................................5-63
Figure 5-25. Verify Schema Changes ..................................................................5-64
Figure 5-26. Create Component schemas.............................................................5-65
Figure 5-27. Load Component Data.....................................................................5-66
Figure 5-28. Integrate Data..................................................................................5-67
Figure 5-29. Query Process .................................................................................5-72
Figure 6-1. Specify bindings notification message..............................................6-12
Figure 6-2. Cannot map autonumber message......................................................6-13
Figure 6-3. Data integration in the new design.....................................................6-15
Figure 6-4. Allowed data type conversions. .........................................................6-16
Figure 6-5. PLAYERS table mapped to Lakers database .....................................6-26
Figure 6-6. PLAYERS table mapped to Lakers database .....................................6-28
Figure 6-7. PLAYER_STATS table mapped to LAKERS database .....................6-31
Figure 6-8. PLAYER_STATS table mapped to LAKERS database .....................6-32
Figure 6-9. SCHEDULE mapped to LAKERS database ......................................6-35
Figure 6-10. SCHEDULE mapped to SPURS database........................................6-36
Figure 6-11. NBA_GUARD mapped to LAKERS database.................................6-37
Figure 6-12. NBA_GUARD mapped to SPURS database....................................6-38
Figure 6-13. NBA_FORWARD mapped to LAKERS database ...........................6-39
Figure 6-14. NBA_FORWARD mapped to SPURS database ..............................6-40
Figure 6-15. NBA_CENTER mapped to LAKERS database ...............................6-41
Figure 6-16. NBA_CENTER mapped to SPURS database...................................6-42
List of Tables
Table 3-1. Sample schema of a database in a university.......................................3-19
Table 5-1. DGlobalSchema, DGlobalTable, and DGlobalField class attributes ....5-15
Table 5-2. Scripts and their filenames..................................................................5-51
Listings
Listing 5-1. Adding of Global Tables ..................................................................5-17
Listing 5-2. Adding of Global Fields ...................................................................5-19
Listing 5-3. Assigning of Foreign Key Constraints to Global Tables....................5-20
Listing 5-4. Pseudocode for the Load Schema function.......................................5-28
Listing 5-5. Pseudocode for the Verify Schema function ....................................5-29
Chapter 1
Research Description
1.0. Research Description
1.1. Overview of the Current State of the Specific Technology
Database systems facilitate the storage, management, and access of valuable data
used in various applications. Different sectors and industries such as Banking and
Finance, Manufacturing, Government, Science and Engineering, and Information
Technology institutions require data storage and data access facilities to support
daily operations. As a result, different database systems have been developed to
meet the specific needs of diverse classes of applications and users. Likewise,
database systems vary in size, capability, and performance. Small office and home
office users utilize simple database software to manage inventories, record sales
invoices, and keep track of customer information. Big organizations and business
institutions employ large-scale database systems to store huge amounts of data.
Over time, different database systems were developed for various classes of
applications by numerous database developers. This resulted in disparate database
systems that have various kinds of incompatibilities. In today’s information age,
however, there is an increasing need to access data from multiple information
sources. A company head-office will probably need access to information from each of
several local sites. For example, the financial manager might want to know the total
expenses incurred in all departments. Or, a purchasing manager may want to know
the available stock of products from all store locations. In general, users need access
to integrated data.
Database integration, however, is a difficult task. Database systems may not
only be physically distributed, but may also differ in many aspects, such as operating
system and computer hardware, supported network protocols, access methods, data
model, query language, and data representation. There are three approaches to
database integration [MART95a]. The first is physical integration of all data into
one database. This is obviously not a good solution since it will be very costly and
will not allow independent maintenance of data [MART95a]. The second approach
is to provide interoperability, that is, integration at the access language level
[MART95a]. The problem with this method is that application developers will have
to deal with the complexity of different database interfaces and formats. The third
and best approach is logical integration of all data into one virtual database that hides
the underlying heterogeneity of local databases [MART95a]. This kind of a system
is usually referred to as a multidatabase system.
A multidatabase system provides an integrated view over a collection of
different database systems. It abstracts users from the location, distribution, and
heterogeneity of different databases. Furthermore, it provides global access to
physically distributed heterogeneous databases via a single query language. For this
reason, a multidatabase is sometimes referred to as a heterogeneous distributed
database. A multidatabase is actually middleware that sits between global
clients and several local database systems. It is a software layer that acts as a front
end to multiple local database systems, while serving as a back end database server
to global clients. Hence, the global clients can access information from multiple
sources with a single direct request through the multidatabase system.
Using a multidatabase to address the problem of database heterogeneity has
several advantages. First, existing organizational investments are preserved. These
include the preservation of investments in computer hardware, software, and user
training. Second, local database sites can continue daily business operations and
exercise local autonomy. From the perspective of a local database system, a
multidatabase system is just another database user. The multidatabase need not
interfere with local operations. Lastly, a multidatabase enables better management
of the entire organization, since it provides a global and integrated view of business
data from all departments. The difficulty in the study of multidatabase systems lies
in the heterogeneity and distribution of local databases.
1.2. Research Objective
1.2.1. General Objective
This research aims to design and implement a prototype federated database system
for integrating at least two heterogeneous database systems.
1.2.2. Specific Objectives
Specifically, the research aims to:
1. Study how to access data from different database systems.
2. Review various issues and techniques regarding schema integration.
3. Research query processing and query optimization (optional).
4. Survey existing multidatabase systems.
5. Study various access control policies.
6. Study and assess whether to use a tightly-coupled or a loosely-coupled design for the system architecture.
1.3. Scope and Limitations of the Research
The aim of the thesis is to design and implement a prototype federated database system that will enable global queries. The multidatabase system will allow the integration of at least two different relational database systems, but not object-oriented database systems, due to resource limitations. In view of time constraints and the multitude of issues in multidatabase updates, the system will not allow write operations on database items. Global queries will be processed by the system so that the appropriate sub-queries to the component databases can be derived. Query optimization, on the other hand, is optional. The system provides only a simple access control policy because it concentrates more on schema integration.
The system will not provide automatic schema integration, as it is not
practical to do so. Instead, tools for schema integration will be developed in order to
facilitate the programmer’s work in the integration of component schemas.
1.4. Significance of the Research
The thesis has significance in both theoretical and pedagogical aspects. Further, the multidatabase has many practical uses in database management and computer science research.
Theoretically, the study will contribute to the already on-going research on
multidatabase systems, particularly on query processing and optimization, and on
schema integration.
The multidatabase system can be utilized by business and government organizations that want to integrate physically distributed databases. The multidatabase approach to integration does away with having to purchase a larger DBMS to accommodate the local data from different sites and transfer it to a central server, thus saving a great deal of resources. Likewise, such a system allows easier management of local branches and enables a holistic view of the organization's situation and performance.
In the field of Computer Science research, the system is an invaluable tool.
The system can be employed by both Internet and Intranet applications for easy
retrieval of information coming from heterogeneous database sources. This will
facilitate and encourage research in the area of distributed information systems such
as schema integration, data mining, Internet agents, and digital libraries.
1.5. Research Methodology
In order to accomplish the multidatabase project successfully, the proponents performed activities such as research, planning, brainstorming, and design in relation to the study. The proponents have outlined a number of activity
phases that comprise the research methodology.
In the data-gathering phase, the proponents consulted books, magazines, journals, and papers on multidatabase systems. Specifically, the proponents researched theories and concepts regarding schema integration, access control, and query processing. Likewise, the proponents consulted people who are knowledgeable in database technology.
A study of existing multidatabase architectures was done in order to understand how the different software modules would fit into the system. Since the multidatabase system must facilitate data integration from heterogeneous data sources, studying schema and data integration was a necessity. In line with this, the proponents surveyed the different types of schema conflicts, both by studying related literature on schema integration and by surveying actual database schemas.
In the planning phase, the proponents discussed and defined the different
functions and features of the system. Then, the project schedule was laid out and the
project milestones were identified. Next, the tasks identified were assigned to the
members.
For the preliminary system analysis and design, the system features were identified and the system architecture was designed. From the identified features, the necessary modules and their functionality were defined. The architecture was laid out so that these modules are properly contained and module interdependencies are carefully considered.
For the development phase, the proponents have used a modified prototyping
approach in implementing the system. Because the multidatabase project is
experimental in nature, it is advantageous to use this approach so that changes to
system design can be carried out whenever necessary, particularly when the previous
design is difficult to realize. The development process has also undergone
preliminary analysis and design phases in order to lay out the initial system
architecture. The architectural design and its details have been allowed to evolve
during the course of the development phase, so that a better solution can be adopted
to enhance system efficiency or to resolve a design flaw.
The project was divided into parts and features, and the project schedule was divided into milestone junctures based on the system features to be accomplished. The most important features were incorporated into the earliest possible subproject. Each subproject then went through a complete development cycle involving coding, feature integration, testing, and debugging.
The main thesis document was written as soon as the proponents achieved a
definite system architecture design. It was then continuously updated and improved
concurrently with the development of the system, until the end of the project. The
technical manual and user’s manual were finalized after the development of the
proposed system.
Chapter 2
Review of Related Literature
2.0. Review of Related Literature
2.1. Issues in Multidatabase Systems
2.1.1. Access Control in Multidatabase Systems
There are many technical problems encountered in building a multidatabase system.
One of these issues is access control. Access control prevents unauthorized access or
malicious destruction of the database. Access control in a multidatabase system not
only includes controlling access in the local DBMSs at each site, but also controlling
and coordinating access to data at multiple sites for multi-site queries [WANG87].
Access control further complicates the implementation of a multidatabase system for several reasons. First, if the multidatabase environment is heterogeneous, problems arise because different sites may use different and incompatible mechanisms for expressing and enforcing access control security policies. Another issue is site autonomy: each local DBMS must maintain control of the data stored at its site and should decide for itself whether a user may access the data it manages [WANG87].
According to [FERN81], there are many kinds of access control that can be used for controlling access to data. Two of these are Content-Independent and Content-Dependent access control. Content-Independent access controls are defined over the base data objects supported by the DBMS. Each access rule is of the form subject/object/privilege, which means that the subject has that privilege over that object. If a user is granted access to a database object in the global model, a corresponding grant must also be issued in the local model for that database object. This maintains consistency between the local and global models.
Content-Dependent access controls, however, base their decisions on whether to allow access on the values of the data in the database. In the example given by [WANG87], an instructor may be allowed to see a student's record only if he/she is the advisor of that student. Relational DBMSs typically use views to implement Content-Dependent access control. Content-Dependent access control policies are more difficult to enforce than Content-Independent policies because the data required to make an access decision may reside at any site of the system.
In order to implement access control policies, as said earlier, one uses views.
A view is a virtual object. The purpose of a view is to identify a subset of an entity
set or relationship set that a user is authorized to access. A view may be defined
with the following construct [CHAM75, ASTR76]:
    Define View_Name <Target_List>
    Where Qualification_Clauses

where:
    View_Name = the name of the view
    Target_List = the subset of attributes of an entity set or relationship set of the global database to be included in the view
    Qualification_Clauses = identifies which elements of the entity set or relationship set referenced in the Target_List are to be included in the view
A user must be granted the right to use a view. Also, the definer of that view
can grant access to the view to other users [WANG87]. In this way, the new user
can use the view to access the subset of a database object defined by the view
without having access to this database object directly [GRIF76].
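To make this concrete, the same idea can be expressed in standard SQL. The following is a minimal sketch, not taken from the cited papers; the table STUDENT, its column ADVISOR_USERNAME, and the role INSTRUCTOR_ROLE are hypothetical names introduced only for illustration:

    -- Content-dependent policy: an instructor may see only the records
    -- of the students he or she advises.
    CREATE VIEW ADVISEE_RECORDS AS
        SELECT STUDENT_ID, NAME, GPA
        FROM STUDENT
        WHERE ADVISOR_USERNAME = USER;  -- USER is the current account

    -- The definer grants access to the view, not to the base table.
    GRANT SELECT ON ADVISEE_RECORDS TO INSTRUCTOR_ROLE;

An instructor granted SELECT on the view can query it freely, yet never sees the records of students advised by someone else.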
Wang and Spooner have proposed the use of protection view mechanisms for
multidatabase systems. The approach is to temporarily produce all the data for the view and then use that result as a base object for processing the second query. This addresses the problem of site autonomy: a user who has been granted access by the global model to a database object he does not own can be served without the local DBMS denying his request.
2.1.2. Data Integration in Multidatabase Systems
In a multidatabase system, there exists a wide variety of independent databases. These local databases are developed independently with differing local requirements. In effect, a multidatabase system is likely to have different models and representations for similar objects. This results in serious problems when generating queries that require data from various preexisting databases. One solution to this is data
integration.
Data integration refers to the creation of an integrated view over apparently
incompatible data typically collected from different sources [WANG87]. It also
provides location transparency and an enhanced global query facility.
Data integration is one of the most significant aspects in a multidatabase
system. In [DEEN87], the data integration problem was grouped into six major categories: name difference, format difference, missing data, conflicting values, semantic difference, and structural difference.
Name difference occurs when two semantically equivalent data items located in different DBMSs are named differently, or when two semantically non-equivalent data items in different DBMSs share the same name [DEEN87]. For instance, one DBMS may name an employee field EMPLOYEE while another names it WORKER; both refer to the same data item, yet they do not share the same name, which clearly leads to conflict. To resolve this conflict, the global system must be able to identify the equivalence of the items and map the differing local names to a single global name.
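As a simplified sketch in standard SQL (the table and column names here are hypothetical, not drawn from [DEEN87]), such a mapping can be expressed with renaming views:

    -- One component database names the field EMPLOYEE; another names it
    -- WORKER. Both are mapped to the single global name EMP_NAME.
    CREATE VIEW GLOBAL_EMPLOYEE AS
        SELECT EMPLOYEE AS EMP_NAME FROM CDB1_STAFF
        UNION ALL
        SELECT WORKER AS EMP_NAME FROM CDB2_PERSONNEL;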
Another problem to consider in data integration is format differences. Format differences include differences in data type, domain, scale, precision, and item combinations [DEEN87]. For instance, a telephone number may be defined as an integer in one DBMS while it is defined as an alphanumeric string in another. In other cases, a data item is broken into components in one database while the combination is treated as a single quantity in another DBMS. This may lead to system errors when not resolved. One solution is to define a transformation function between the local and global representations. The complexity of these functions will depend on the degree of format difference between data items in the various DBMSs.
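The telephone number case above, for example, needs only a trivial transformation function. The sketch below uses Oracle's TO_CHAR, since the prototype stores its integrated data in Oracle; the table and column names are hypothetical:

    -- One component stores PHONE as an integer, the other as an
    -- alphanumeric string; the global representation is a string.
    CREATE VIEW GLOBAL_CONTACT AS
        SELECT TO_CHAR(PHONE) AS PHONE FROM CDB1_CONTACT  -- integer to string
        UNION ALL
        SELECT PHONE FROM CDB2_CONTACT;                   -- already a string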
One serious integration problem is conflicting data. Sometimes, two database systems have a data item that refers to the same real-world object, but their actual data values conflict [DEEN87]. This may be due to incomplete updates, system error, or insufficient demand for such data. For example, two databases may carry the same data item yet record different values for it. Among the various integration problems, this is the most complex and can cause serious data loss to the database system.
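One simple resolution strategy, offered here purely as an illustration and not prescribed by [DEEN87], is to designate one source as authoritative and fall back to the other when its value is absent:

    -- Hypothetical precedence rule: trust the first database's salary
    -- when present, otherwise take the second's (Oracle NVL returns
    -- the first non-null argument).
    SELECT D1.EMP_ID,
           NVL(D1.SALARY, D2.SALARY) AS SALARY
    FROM CDB1_EMP D1, CDB2_EMP D2
    WHERE D1.EMP_ID = D2.EMP_ID;

A production system would need a richer policy, since both sources may hold non-null yet different values.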
Another integration problem is missing data. According to [DEEN87], data
can be missing from one relation, from both relations, or it can be summarized
data. For example, one relation may contain the employee salaries for each month,
while another relation may simply contain the average yearly salary.
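The detailed relation can be mapped down to the summarized form, but not the other way around, since the monthly breakdown is simply absent from the second database. A sketch of the downward mapping, with hypothetical table and column names:

    -- Derive the average yearly salary from the monthly detail.
    CREATE VIEW AVG_YEARLY_SALARY AS
        SELECT EMP_ID, AVG(YEARLY_TOTAL) AS AVG_YEARLY_SALARY
        FROM (SELECT EMP_ID, PAY_YEAR,
                     SUM(MONTHLY_SALARY) AS YEARLY_TOTAL
              FROM MONTHLY_PAYROLL
              GROUP BY EMP_ID, PAY_YEAR)
        GROUP BY EMP_ID;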
Semantic differences occur when two attributes with the same name, belonging to relations of the same name, have different meanings [DEEN87]. To resolve this, the semantic meaning of a relation in a DBMS must be explicitly stated to global users.
Structural differences refer to data items with the same semantic meaning in various DBMSs that are structured differently [DEEN87]. For instance, a data item may be represented as a single relation in one DBMS and as multiple relations in another. To resolve this, there should be a mapping language that is capable of restructuring data from one form into another.
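The effect of such restructuring can be sketched in plain SQL (hypothetical table names; an actual mapping language would generate something similar). Here one component database keeps a single employee relation, while another splits the same information across two relations, so a join restores the single-relation form expected globally:

    CREATE VIEW GLOBAL_EMP AS
        SELECT EMP_ID, NAME, CITY FROM CDB1_EMPLOYEE  -- one relation
        UNION ALL
        SELECT E.EMP_ID, E.NAME, A.CITY               -- two relations joined
        FROM CDB2_EMP E, CDB2_EMP_ADDR A
        WHERE E.EMP_ID = A.EMP_ID;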
In a paper describing the CORDS Integration Environment, Martin and
Powley [MART95a] introduced their own classification of schema conflicts.
[MART95a] classified these conflicts according to two dimensions: location and
type. The location of a conflict can be in an attribute, within a relation, or within a schema (involving multiple relations) [MART95a]. The set of conflict types, which was in turn based on the categorization of Missier and Rusinkiewicz, includes: data
type, scale, precision, default value, name, key, schema isomorphism, union
compatibility, abstraction level incompatibility, missing data, and integrity constraint
[MART95a].
The method of schema integration by [MART95a] was broken down into
steps according to conflict location. First, export schemas are resolved for attribute
conflicts. A view definition is used to map the attributes from export schemas into
the MDBS attribute [MART95a]. Then, relation level conflicts are resolved and a
view definition is created. Third, schema level conflicts are resolved. Finally, the
MDBS views from each level of conflict are merged into a single MDBS view
definition [MART95a].
In addition to the above methodology, [MART95a] described attribute
contexts that would support the resolution of attribute level conflicts. Attribute
contexts are provided by export schemas and would consist of a number of facets
that would describe the semantic properties of an attribute, such as data type, scale,
and precision [MART95a]. The facets identified by [MART95a] are uniqueness,
cardinality, type, precision, scale, and default value.
In many cases, conflicts can be resolved by using transformation functions, as
exemplified in CORDS. Schema isomorphism [MART95a] is one such type of
conflict that is solvable using transformation functions. For instance, one database
contains an attribute ADDRESS, while its equivalent in another database is a
composition of attributes NUMBER, STREET and CITY. [MART95a] resolved this
conflict by applying a transformation function called “StringConcat”, which
combines the three fields into a single address field [MART95a].
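The effect of a StringConcat-style transformation can be written in plain SQL as follows. This is only a sketch: CORDS uses its own extended view definition syntax, which is not reproduced here, and STREET_NO stands in for the attribute NUMBER, an Oracle reserved word:

    CREATE VIEW GLOBAL_ADDRESS AS
        SELECT STREET_NO || ' ' || STREET || ', ' || CITY AS ADDRESS
        FROM CDB2_LOCATION;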
Data Integration plays a vital role in multidatabase systems. Proper
integration of data is needed to resolve the semantic heterogeneity of different
databases.
2.1.3. Query Processing and Optimization
When the multidatabase receives a global query, it must decompose it into sub-
queries, perform query optimization, sends them to the individual DBMS, and
process the results to be returned to the global user. Similar data items (tables)
involved in the query may exist on different DBMS, or different data items (tables),
which must be joined, may be distributed across different DBMS. Either way, some
query processing and optimization must be done in order to yield efficiency.
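As a simple illustration (with hypothetical tables and names, not drawn from the cited papers), a global query over an employee table whose rows are spread across two component DBMSs might be decomposed as follows:

    -- Global query received by the multidatabase:
    --     SELECT NAME FROM GLOBAL_EMP WHERE DEPT = 'SALES';

    -- Sub-query sent to component DBMS 1:
    SELECT NAME FROM CDB1_EMPLOYEE WHERE DEPT = 'SALES';

    -- Sub-query sent to component DBMS 2 (note the differing names):
    SELECT NAME FROM CDB2_WORKER WHERE DEPARTMENT = 'SALES';

    -- The multidatabase then merges (unions) the two result sets
    -- before returning them to the global user.

The optimizer's task is to decide where each sub-query and each join runs so that data transmission and local execution costs are minimized.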
In [ELMA94], an overview of the techniques used by DBMSs in processing
and optimizing high-level queries is presented. SELECT and JOIN operations have
many execution options and are thus prime candidates for query optimization. Likewise,
various approaches to query optimization are discussed. These techniques are
classified into heuristic approaches, cost-estimation approaches, and semantic query
optimization.
The techniques required for query processing in a multidatabase environment
are quite different from those of a single DBMS [LEE94]. In [LEE94], the query optimization techniques presented employ cost estimation. First, various schemas are classified into various types. Then, the costs of executing operations on these conflicting schemas are evaluated and given weights, which are then used in the cost estimation.
Query decomposition, optimization, and processing in multidatabase systems are studied in [EVAS95]. The paper considers the optimization of query decomposition in the case of data replication and the optimization of inter-site joins.
2.2. Existing Software
2.2.1. Cords Schema Integration Environment
The CORDS Multidatabase System (MDBS) provides applications with an
integrated view of a collection of distributed heterogeneous data sources
[MART95a]. Applications are presented with a relational view of the available data
and are able to access the data using standard SQL operations [MART95a]. An
application's view of the data is defined by a process called schema integration,
which is facilitated by the CORDS environment.
The CORDS MDBS is a full-function DBMS. The common data model used
in the CORDS MDBS is the relational model, so schemas define a collection of data
in terms of relational tables and their columns and any applicable constraints.
Applications interact with an MDBS Server via a library of functions called the MDBS Client Library [MART95a]. An MDBS Server performs DBMS functions, such as query processing and optimization, transaction management, and security, at the global level [MART95a]. An MDBS Server connects to a component database system (CDBS) through a Server Library, which accepts SQL requests from the MDBS, interacts with the CDBS through its normal application program interface, then translates the response into the form expected by the MDBS [MART95a]. CDBSs currently supported by the prototype include the Empress, Oracle, and DB2/6000 relational systems, the IMS hierarchical database system, and the VAXDBMS network database system [MART95a].
The MDBS Catalog in CORDS is the central repository for metadata needed by
the multidatabase system [MART95a]. It includes three classes of metadata, namely
schemas, mappings, and descriptions of CDBSs [MART95a]. Two types of schemas
are stored: export schemas and MDBS schemas [MART95a]. An export schema
defines the data made available to the MDBS from a CDBS, while MDBS schemas
define collections of data at the MDBS level, which are drawn from the exported
data [MART95a]. The mappings needed to transform export schema objects into
MDBS schema objects are created during the schema integration process.
The process of schema integration in the CORDS MDBS takes schemas from
a set of CDBSs and produces one or more integrated views of the available data
[MART95a]. Martin and Powley did not define a single all-encompassing global
schema, but instead defined MDBS schemas to provide the data for individual
applications, or groups of applications [MART95a]. MDBS schemas are equivalent
to federated schemas as defined by Sheth and Larson [MART95a]. MDBS schemas
are made up of virtual global relations called MDBS Views [MART95a].
MDBS Views are views that span multiple heterogeneous databases. They
are like relational views in that they are not physically materialized but rather are
stored as mappings that are invoked whenever an MDBS View is accessed. The
syntax for MDBS Views extends the standard SQL view definition facility with
support for attribute contexts and transformation functions. Attribute contexts are
used to describe the semantics of the attributes and transformation functions are used
to resolve several types of schema conflicts [MART95a].
In order to resolve various types of schema conflicts, the CORDS MDBS
provides a Schema Integration Toolkit to support the MDBS DBA in creating MDBS
schemas. The toolkit has an AIX Windows graphical interface and was developed
on an RS/6000 machine [MART95a]. It runs as an application of the CORDS
MDBS [MART95a]. Being a multifunctional toolkit, it includes the following
integration tools:
1. Schema Translator – A tool that automates the translation from one data model to another [MART95a]. It receives as input a file containing the global schema and returns as output a file containing the schema expressed in terms of the target data model.
2. Thesaurus – The main function of the Thesaurus is to resolve name conflicts.
This is possible because it contains information about relationships, in particular
synonyms, among object names [MART95a]. Specifically, the thesaurus
analyzes a schema expressed in the common data model and highlights possible
relationship among names in the schema with names currently stored in the
thesaurus. However, for flexibility purpose, the user is allowed to add new
names and relationship to expand the contents of the thesaurus.
3. Transformation Function Library Manager – This module contains basic transformation functions that are necessary in the schema integration process, such as the conversion of integers to strings and vice versa [MART95a].
4. MDBS View Compiler – The basic function of the MDBS View Compiler is to
parse and review an MDBS View definition and store the suitable information in
the MDBS catalog [MART95a].
In the schema integration method, definitions of the export schemas are made using an extended SQL. Initial versions of the export tables are produced by the Schema Translator from the CDBS schemata [MART95a]. These are then edited by the DBA using the editor supplied with the schema integration toolkit and submitted to the MDBS, where they are parsed and stored in the MDBS catalog [MART95a].
One important step in the CORDS schema integration process is the
identification of attributes to be included in the integrated schema [MART95a]. In
this case, the Thesaurus tool is used to identify relationships among attributes based
on the names used for the attributes. Name conflicts are then resolved using the
MDBS View Definition statement by mapping the export attributes to a common
generic name [MART95a]. The Transformation Library Manager is used to analyze
the contexts of the attributes and, if possible, suggest transformation functions. This
allows the mapping of the export attributes to the view attributes [MART95a].
2.2.2. Microsoft Transaction Server (Viper)
The Microsoft Transaction Server (MTS), code-named “Viper,” was developed primarily for the Internet and other network servers. It also manages application and database
transaction requests on behalf of a client.
The Transaction Server shields the user and client computer from having to formulate requests for unfamiliar databases. It forwards the requests directly to the database servers. The MTS is thus a sort of multidatabase server. Additional
features of MTS include security management, connection to other servers, and
transaction integrity.
Microsoft designed the Transaction Server in such a way that it fits in with its overall object-oriented programming strategy. A drag-and-drop interface is also provided to create a transaction model for a single user and then allow the
Transaction Server to manage the model for multiple users, including the creation
and management of user and task threads and processes.
2.2.3. Sybase Jaguar
Sybase’s Jaguar CTS™ is the first component transaction server that combines a
scalable execution environment with support for multiple component models
including Java/JavaBeans, ActiveX, C/C++ and CORBA. Jaguar CTS’ open
environment extends the Web architecture to provide a platform for developing and delivering transactions to business applications on the Internet, intranets, or extranets.
Another feature of Jaguar CTS is the provision of a component-based
environment that makes it easy for partners to extend the functionality of the core
product. Jaguar CTS combines the features of an object request broker and a TP
monitor to provide an easy-to-use, secure execution environment with support for
multiple component models for building transaction-oriented business applications
on the Net.
Jaguar CTS adapts easily to unpredictable workloads to deliver high
transactional throughput for large numbers of Internet users. Jaguar CTS’ flexible
transaction management delivers high performance for both synchronous and
asynchronous transaction processing.
Jaguar CTS supports all major databases and development tools, offering
developers an open, standards-based environment. The Jaguar CTS can thus be used
to build a multidatabase system prototype. Furthermore, its support for multiple
component models (Java/JavaBeans, ActiveX, C/C++, and CORBA) [SYBA97]
facilitates the development of modular and interoperable software components.
2.2.4. ADDS (Amoco Production Company, Research)
The Amoco Distributed Database System (ADDS) project began in late 1983 in
response to the problem of integrating databases, which are distributed throughout
the corporation [THOM90]. This project was a significant contribution to the
business world at the time, because existing database products did not provide
effective means for accessing and managing such data.
The primary function of ADDS is to provide uniform access to preexisting
heterogeneous distributed databases [THOM90]. It is based on a relational data
model and uses an extended relational algebra query language [THOM90]. In the
terminology of [SHET90], ADDS is a tightly coupled federated system supporting
multiple federated schemata. Mappings are stored in the ADDS data dictionary
[THOM90]. The data dictionary is fully replicated at all ADDS sites to expedite
query processing [THOM90]. Multiple applications and users share CDB
(Component database) definitions [THOM90].
The CDBs support the integration of the hierarchical, relational, and network
data models [THOM90]. Some of the local DBMSs currently supported include
IMS, SQL/DS, DB2, RIM, INGRES, and FOCUS [THOM90]. Data items which are
semantically equivalent from different local databases, as well as appropriate data
conversion for the data items, may be defined [THOM90].
The user interface consists of an Application Program Interface (API) and an
interactive interface [THOM90]. Programs use the API to submit queries for
execution, access the schema of retrieved data, and access retrieved data on a row-
by-row basis [THOM90]. It provides transparency to the users accessing the
distributed database.
The interactive interface, for its part, allows users to execute queries, display
the results of the queries, and save the retrieved data [THOM90]. The interface is
quite flexible: it can be customized to fit the computer knowledge and expertise of
the user, so that any user, whether a novice or an expert, is able to use the system
with ease.
Queries submitted for execution are compiled and optimized for minimal data
transmission cost [THOM90]. One example of query optimization is the application
of semi-joins. A user may submit any number of queries for simultaneous execution
[THOM90].
The interface architecture used by ADDS system is the Network Interface
Facility (NIFTY) architecture [THOM90]. It is an extension of the OSI Reference
Model and provides a uniform and reliable interface to computer systems that use
different physical communication networks [THOM90]. Communication protocol is
not an issue in an ADDS system: an ADDS process on one system can initiate a
session with an ADDS process on another system without regard for the multitude of
heterogeneous network hardware and software that is used to accomplish the session
[THOM90].
2.2.5. DOIA
DOIA was first presented at the Australian Database Conference in 1995 [KUO94].
It was funded by the Cooperative Research Centres Program through the Department
of the Prime Minister and Cabinet of the Commonwealth Government of Australia.
The DOIA system is a heterogeneous multidatabase system that provides a
single unified view of the federated schema [KUO94]. The architecture of the system
was partially based on Sheth & Larson's 5-tier model for database (schema)
integration. The DOIA architecture is composed of the Local Database Agent (LDA)
and the Global Database Agent (GDA).
The LDA acts as a transforming processor to present a view of a local
schema, in the Common Data Model (CDM). On the other hand, the GDA acts as a
constructing processor to present a federated schema in CDM.
Both the GDA and the LDA use the Common Query Language (CQL) for
queries, updates, and transaction management operations (commit, rollback, etc.).
The main difference of the DOIA system from the 5-tier model is that DOIA does
support transaction management.
The GDA is composed of two main components: Transaction Plan Generator
and the Global Transaction Coordinator. The main function of the former component
is to translate each query or update into a series of queries or updates targeted to the
individual agents. The results are then passed to the Global Transaction Coordinator
to distribute the tasks. The outcome is then stored in a temporary location called the
collector database [KUO94].
The LDA is also composed of two components, namely, the Local
Transaction Coordinator and the Translator. The Local Transaction Coordinator is
responsible for providing the transaction management functions not provided by the
underlying database, while the Translator translates queries and updates from CQL
to the local query language [KUO94].
The system currently uses the relational model as its common data model and
SQL as its common query language.
2.2.6. Mermaid
The Mermaid system was first developed at Unisys in 1982, as a project for the
Department of Defense [THOM90]. The system was needed for accessing and
integrating data stored in autonomous databases. Furthermore, the Mermaid system
must operate in a permanently heterogeneous environment consisting of distributed,
heterogeneous database systems [THOM90].
Based on the terminology of [SHET90], Mermaid is a tightly coupled
federated system supporting multiple federated schemata [THOM90]. Mermaid is
said to serve as a front-end system that locates and integrates data from several
DBMSs [THOM90]. Several levels of heterogeneity are supported, namely
hardware, operating system of the DBMS host, network connection to the DBMS
host, data model (relational, network, sequential file), and database schema
[THOM90]. Initially, Mermaid only supported data retrieval from several DBMS
and updates to a single DBMS [THOM90].
The Mermaid system has four major components, namely the User Interface,
the server, the Data Dictionary/Directory (DD/D), and the DBMS Interface
[THOM90]. The User Interface provides functions such as user authentication,
system initialization, query editing, query library maintenance, and so on
[THOM90]. Most of the Mermaid software resides in a server that exists on the
same network as the user workstations and DBMSs [THOM90]. The server consists
of an optimizer that processes queries, and a controller that controls execution
[THOM90]. The Data Dictionary/Directory is a commercial, relational database that
contains information about the databases and the environment [THOM90].
Mermaid has an open architecture that supports the development of interfaces to
many types of DBMSs, thus providing great flexibility in the participation of
various DBMSs [THOM90].
Chapter 3
Theoretical Framework
3.0. Theoretical Framework
3.1. Definition of Terms
Access control - provides local Database Management Systems (DBMSs) the power
to prevent unauthorized access or malicious destruction of its databases.
Accessing Processor – accepts commands and produces data by executing the
commands against a database.
Applications Programming Interface (API) - A set of functions and programs that
allows clients and servers to intercommunicate.
Attribute – a field of a relation.
Auxiliary databases – hold additional data not stored in any component DBMS
and information needed to resolve inconsistencies.
Catalog – a named collection of schemas in a Structured English Query Language
(SQL) environment.
Centralized Database System - refers to a single centralized database management
system managing a single database on the same computer system.
Client - A networked information requester, usually a PC or workstation, that can
query database and/or other information from a server.
Client-Server System – allows remotely located programs to exchange information
in real-time.
Component Database Management System – DBMS that participates in the
multidatabase system.
Component Schema – schema derived by translating local schemas into a data
model called the canonical or common data model (CDM) of the Federated Database
System (FDBS).
Conceptual Schema – schema that describes the conceptual or logical data structure
and the relationships among those structures.
Constructing Processor – a type of processor that replicates and/or partitions an
operation submitted by a single processor into operations that are accepted by two or
more other processors.
Data Integration – refers to the production of union-compatible views for similar
information expressed dissimilarly in different nodes.
Data Model Transparency – a form of transparency wherein the data structure and
commands being used by one processor are hidden from other processors.
Distributed Computing – refers to the services provided by a distributed computing
system.
Distributed Computing System – is a collection of autonomous computers
interconnected through a communication network to perform different functions.
Distributed Database System – consists of a single distributed DBMS managing
multiple databases. These databases can be stored in either a computer system or on
multiple computer systems.
Export Schema – schema that represents a subset of a component schema that is
made available to the FDBS.
External Schema – a schema that enables the management to customize the access
rights of global database users.
Federated Database Management System (FDBMS) – the software that provides
controlled and coordinated manipulation of the component database systems.
Federated Database System (FDBS) – consists of component database systems that
are autonomous yet participate in a federation to allow partial and controlled sharing
of their data.
Federated Schema – schema derived by the integration of multiple export schemas.
Filtering Processors – a type of processor that constrains the commands and
associated data that can be passed to another processor.
Global Data Dictionary - central repository for metadata needed by the
multidatabase system (MDBS).
Global Schema – an integrated global view of the combined component schemas. It
is the layer above the local external schema that provides additional data
independence.
Global Query – a query that is issued to a multidatabase. It uses global schema
specifications.
Heterogeneous Multidatabase System – refers to a multidatabase system that has
different database management systems in its component database systems.
Homogeneous Multidatabase System – refers to a multidatabase system that has
the same database management systems in all of its component database systems.
Internal Schema – schema that describes physical characteristics of the logical data
structures in the conceptual schema.
Local Schema – the conceptual schema of a component database management
system. It is the schema associated with one component database prior to schema
integration.
Loosely Coupled System – pertains to a federated database system that is created
and maintained by its users.
Mappings – are functions that correlate the schema objects in one schema to the
schema objects in another schema.
Mapping rules – define the relationship between the federated schema and the
export schemas.
Middleware - A set of drivers, APIs, or other software that improves connectivity
between a client application and a server.
Multidatabase – is a distributed system that acts as a front end to multiple local
database management systems or is structured as a global system layer on top of
local database management systems.
Non-federated Database System - is an integration of component database
management systems that are not autonomous.
Processor – an application-independent software module of a DBMS that
manipulates commands and data.
Query – a search question that tells the program what kind of data should be
retrieved from the database.
Query Code Generator – a query processor sub-module that generates the query
code based on the execution plan given by the query optimizer.
Query Language – a retrieval and data-editing language that enables you to specify
the criteria by which the program retrieves and displays the information stored in a
database.
Query Optimization – a process that attempts to minimize query response time and
reduce query cost.
Query Processing – the entire process of validating, optimizing, and executing a
query string.
Reference Architecture - provides the framework in which to understand,
categorize, and compare different architectural options for developing federated
database systems.
Runtime Database Processor – a query processor sub-module that has the task of
running the query code, whether compiled or interpreted, to produce the query result.
Schema – description of data managed by one or more database management
systems. It consists of schema objects and their interrelationships.
Schema Integration - related specifically to the problems associated with
distributed databases, in particular the integration of a set of pre-existing local
schemas into a single global schema.
Schema Name - includes an authorization identifier to indicate the user or account
who owns the schema.
Server - A computer, usually a high powered workstation, a minicomputer, or a
mainframe, that houses information for manipulation by networked clients.
Site Autonomy – a key aspect of a multidatabase which provides the local DBMS
complete control over local data and processing.
Structured English Query Language (SQL) – a query language which permits
updates and data definitions.
Tightly Coupled System – refers to a federated database system that is created and
maintained by the administrator alone.
Transforming Processors – transform a command from a certain source language to
a target language.
View – a single table that is derived from other tables. A view does not necessarily
exist in physical form; it is considered as a virtual table.
3.2. Multidatabase Architecture
A database system is said to consist of a database management system (DBMS),
which manages one or more databases. A federated database system (FDBS) is
defined to be a collection of cooperating but autonomous component database
systems (DBSs) [HAMM79]. The software that provides controlled and coordinated
manipulation of the component DBSs is called a federated database management
system (FDBMS).
A component database can join more than one federation. The database
management system of a component DBS (the component DBMS) can be
centralized, distributed, or itself another FDBMS. Component DBMSs can differ in
aspects such as data models, query languages, and transaction management
capabilities.
One of the advantages of a federated database system is that a local database
can continue its local operations simultaneously with its participation in a given
federation. The users or the administrators can configure the integration of the
database systems.
3.2.1. Characteristics of Database Systems
Multiple database systems that are joined together can be characterized along three
dimensions, namely distribution, heterogeneity, and autonomy.
Distribution of data can be done in multiple ways. Data can be stored on a
single or on multiple computer systems, which may be co-located or geographically
distributed. The main advantages of data distribution are increased availability,
reliability, and improved access times.
Heterogeneity can be generally classified into technological differences such
as hardware, software, and communication differences. Heterogeneity in database
systems can be classified into differences in database management systems and those
in the semantics of data.
Heterogeneity due to differences in DBMSs results from differences in data
models and differences at the system level. Data-model differences can be classified
into differences in structure, in constraints, and in query languages. Differences in
structure result from different structural primitives. Differences in constraints derive
from the different constraints supported by different data models. Differences in
query languages (e.g., QUEL versus SQL), and even the different versions of SQL
supported by two relational DBMSs, are also factors in heterogeneity.
Semantic heterogeneity occurs in cases of disagreement in the meaning,
interpretation, or intended use of the same or related data. This type of heterogeneity
is very hard to detect.
Autonomy can be further classified into design autonomy, communication
autonomy, execution autonomy, and association autonomy. Design autonomy refers
to the ability of a component DBS to choose its own design with respect to any
matter, including (a) the data being managed, (b) the representation, (c) the semantic
interpretation of the data, (d) constraints, (e) the functionality of the system, (f)
association and sharing with other systems, and (g) the implementation.
Communication autonomy, on the other hand, refers to the ability of a
component DBMS to decide whether to communicate with other component
DBMSs. Execution autonomy refers to the ability of a component DBMS to execute
local operations without interference from external operations and to decide the order
in which to execute external operations. Association autonomy implies that a
component DBS has the ability to decide whether and how much to share its
functionality and resources with others.
3.2.2. Taxonomy of Federated Database Systems
A database system can be classified into two types: centralized and distributed.
Centralized database system (Figure 3-1) refers to a single centralized database
management system managing a single database on the same computer system.
Distributed DBS consists of a single distributed DBMS managing multiple
databases. These databases can be stored in either a computer system or on multiple
computer systems.
Figure 3-1. System architecture of a centralized DBMS [SHET90]. (The figure
shows external schemas 1 through n mapped through filtering processors onto the
conceptual schema, a transforming processor mapping the conceptual schema to the
internal schema, and an accessing processor executing commands against the
database.)
A multidatabase system (MDBS) supports operations on multiple component
database systems. An MDBS is called homogeneous if the DBMSs of all component
database systems are the same; otherwise it is called a heterogeneous MDBS
[SHET90]. A system is not a multidatabase system if it only allows periodic,
nontransaction-based exchange of data among multiple DBMSs or one that only
provides access to multiple DBMSs one at a time [SHET90].
Figure 3-2. Taxonomy of a Multidatabase System [SHET90].
Multidatabase systems can be classified as non-federated and federated
(Figure 3-2). Non-federated database system is an integration of component DBMSs
that are not autonomous [SHET90]. On the contrary, federated database system
consists of component DBSs that are autonomous yet participate in a federation to
allow partial and controlled sharing of their data [SHET90].
A federated database system is further classified as loosely or tightly coupled
[SHET90]. An FDBS is loosely coupled if its users are responsible for creating and
maintaining the federation [SHET90]. On the other hand, it is tightly coupled if the
administrator is responsible for the configuration of the federation [SHET90],
specifically if there is a global DBA that manages a global schema. A tightly coupled
FDBS may support either a single federation or multiple federations (Figure 3-2).
3.2.3. Reference Architecture
A reference architecture provides the framework in which to understand, categorize,
and compare different architectural options for developing federated database
systems [SHET90]. Such an architecture requires a number of components which
are essential to the system. The components consist of data, database, commands,
processors, schemas, and mappings [SHET90].
These components are joined together to form different data management
architectures. The main considerations in choosing an architecture are its level of
centralization, its distribution, and the manner in which the components hide their
implementation details.
Processors and schemas are significant in defining various architectures. The
processors are application-independent software modules of a DBMS [SHET90],
while the schemas are application-specific components that define database contents
and structure [SHET90].
3.2.4. Processor Types in the Reference Architecture
Data management architectures differ in the types of processors present and the
relationships among those processors. The four types of processors include
transforming, filtering, constructing, and accessing processors.
Transforming processors transform a command from a certain source
language to a target language, or transform data from one format to another
[SHET90]. This type of processor provides a type of data independence called data
model transparency [SHET90]. Data model transparency allows the data structure
and commands being used by one processor to be hidden from other processors
[SHET90]. In effect, a transforming processor abstracts various command formats
and data representations from the receiving processor.
In order to perform a transformation, the transforming processors should be
equipped with a mapping between the objects of each schema. The primary job of
schema translation involves transforming a given schema A (which describes a
certain data in one data model) into another equivalent schema B that is in a different
data model. Furthermore, this task generates the mappings that correlate the schema
objects in one schema (schema B) to the schema objects in another schema (schema
A). The process of using these mappings to translate commands involving the
schema objects of one schema (schema B) into commands involving the schema
objects of the other schema (schema A) is called command transformation.
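As a small sketch of command transformation (all schema and attribute names
invented for this example), suppose local schema A defines EMP(ENO, ENAME)
and its translated component schema B defines EMPLOYEE(ID, NAME). The
mappings generated during schema translation let a transforming processor rewrite
commands on B into commands on A:

-- Command issued against component schema B
SELECT NAME FROM EMPLOYEE WHERE ID = 1234;

-- Equivalent command produced for local schema A, using the mappings
-- EMPLOYEE -> EMP, NAME -> ENAME, ID -> ENO
SELECT ENAME FROM EMP WHERE ENO = 1234;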
Filtering processors constrain the commands and associated data that can be
passed to another processor. Each filtering processor has a mapping that describes
the constraints on commands and data. The constraints may either be embedded into
the code of the processor or be specified in a separate data structure.
Examples of filtering processors include the syntactic constraint checker
(checks commands syntactically), the semantic integrity constraint checker (checks
commands for semantic integrity constraint violations), and the access controller
(verifies the user’s rights to perform a given command on certain data) [SHET90].
Constructing processors partition and/or replicate an operation submitted by a
single processor into operations that are accepted by two or more other processors
[SHET90]. The processor should be able to support location, distribution, and
replication transparencies [SHET90]. These transparencies are provided because a
processor submitting a command does not need to know the location, distribution, or
number of processors participating in processing that command [SHET90].
Some of the jobs that a constructing processor can perform are as follows:
schema integration, negotiation (to determine the protocol to be used among the
owners of various schemas to be integrated), query decomposition and optimization,
and global transaction management (performing the concurrency and atomicity
control).
An accessing processor accepts commands and produces data by executing
the commands against a database (Figure 3-3) [SHET90]. For example, it may
accept commands from several processors and interleave the processing of those
commands.
Figure 3-3. An accessing processor
Examples of accessing processors include the following: (a) a file
management system that executes access procedures against stored files, (b) an
application program that accepts commands and returns the needed data after
processing it, (c) a data manager of a DBMS containing data access methods, or (d) a
dictionary manager that manages access to dictionary data.
3.2.5. ANSI/SPARC Three-Level Schema Architecture
There is a standard three-level schema architecture for centralized DBMSs. The
schema architecture was outlined by the ANSI/X3/SPARC Study Group. The three
levels are the conceptual schema, the internal schema, and the external schema.
The first level, the conceptual schema, describes the conceptual or logical
data structure and the relationships among those structures. Another level, the
internal schema describes physical characteristics of the logical data structures in the
conceptual schema. These characteristics include information such as the placement
of records on physical storage devices, placement and type of indexes and physical
representation of relationship between logical records.
The last schema, external schema, manages the access rights of its users. The
task of a transforming processor includes the translation of commands expressed
using the conceptual schema objects into commands using the internal schema
objects. An accessing processor then executes the commands to retrieve data from
the physical media.
3.2.6. Five-Level Schema Architecture for Federated Databases
The ANSI/SPARC three-level architecture cannot be applied directly to an FDBS.
However, there exists a five-level schema architecture that supports the three
dimensions of an FDBS (distribution, heterogeneity, and autonomy). This five-level
architecture is an extension of the three-level architecture.
The five-level schema architecture consists of the local, component, export,
federated, and external schemas (Figure 3-4). A local schema is the conceptual
schema of a component DBS and is expressed in the native data model of the
component DBMS.
A component schema is derived by translating a local schema into a data
model called the canonical or common data model (CDM). The two main reasons for
defining component schemas in a CDM are that (a) they describe the divergent local
schemas using a single representation and (b) semantics that are missing in a local
schema can be added to its component schema [SHET90]. The translation of the
local schemas
to component schemas greatly facilitates the integration of data in a federated
database system.
Figure 3-4. Five-level schema architecture of an FDBS.
The process of schema translation from a local schema to a component
schema generates the mappings between the component schema objects and the local
schema objects [SHET90]. These mappings are used by the transforming processors
to transform commands on a component schema into commands on the
corresponding local schema (Figure 3-5).
The export schema represents a subset of a component schema that is
available to the FDBS [SHET90]. The main purpose of defining export schemas is
to facilitate control and management of association autonomy [SHET90]. The
filtering processor can be tasked to manage the access control as specified in an
export schema by limiting the set of allowable operations that can be submitted
[SHET90].
A federated schema is an integration of multiple export schemas. It also
includes the information on data distribution that is generated when integrating the
export schemas [SHET90]. It is possible to have a number of federated schemas in an
FDBS, one for each class of federation users [SHET90]. A class
of federation users can either be a group of users or applications performing a related
set of activities [SHET90].
Figure 3-5. System architecture for an FDBS. (For each component DBS, a
transforming processor maps the local schema to the component schema, filtering
processors enforce the export and external schemas, and constructing processors
integrate export schemas into the federated schemas.)
An external schema defines a schema for a user or a class of users [SHET90].
The main reasons for the use of external schemas are as follows: customization,
additional integrity constraints, and access control [SHET90]. The filtering
processor then checks the commands on the external schema for any access control
or integrity constraint violation [SHET90]. The transforming processor will be
needed to transform commands on the external schema into commands on the
federated schema if the external schema is in a different data model [SHET90].
3.3. Multidatabase Issues
In building multidatabase systems, one has to consider several issues that present
difficulties. Three of these issues are schema integration, access control, and query
optimization.
3.3.1. Schema Integration
Given a heterogeneous collection of local databases, a multidatabase system should
provide a facility to integrate these databases and produce a global schema. Ideally,
such a facility, called an integrator's workbench, would output the global data
dictionary, the global and participation schemas, the mapping rules, and the auxiliary
databases. However, to design these, it is necessary first to develop a methodology
for performing the integration, on which the workbench can be built [BELL92].
Schema integration is a relatively new concept, relating specifically to the
problems associated with distributed databases, in particular the integration of a set
of pre-existing local schemas into a single global schema. Schema integration in a
multidatabase is a complex task. The problems arise from the structural and
semantic differences between the local schemas, which have been developed
independently following not only different methodologies but also different models
(e.g., relational model, object model).
In dealing with the process of schema integration, six major problems will be
encountered. These are name difference, format difference, missing data, conflicting
values, semantic difference, and structural difference.
Name differences occur when two semantically equivalent data items located
in different DBMSs are named differently, or two semantically non-equivalent data
items in different DBMSs are named the same. Format differences include
differences in data type, domain, scale, precision, and item combinations. Missing or
conflicting data is the most complex problem and can cause serious data loss to the
database system. It occurs when two database systems have data items that refer to
the same real-world object but whose actual data values conflict, due to incomplete
updates, system error, or insufficient demand for such data.
Semantic differences occur when two attributes of the same name, belonging
to relations of the same name, can have different meanings. Structural differences
refer to data items with the same semantic meaning in various DBMSs but are
structured differently.
3.3.2. Access Control
Access control provides local Database Management Systems (DBMSs) the power to
prevent unauthorized access or malicious destruction of its databases, in this case,
from Multidatabase Systems (MDBSs) [WANG87]. This would not only include
controlling access in the local DBMSs at each site, but also controlling and
coordinating access to data at multiple sites for multi-site queries.
Two types of access control mechanisms are Content-Independent and
Content-Dependent access control. Content-Independent access controls are defined
over the basic data objects supported by the DBMS. Each access rule is of the form
subject/object/privilege, which means that the subject has that privilege over that
object. If access to a database object is granted to a user in the global model, a
corresponding grant must also be issued in the local model for that database object.
This maintains consistency between the local and global models [WANG87].
Content-Dependent access controls, however, base their decision of whether
or not to allow access on the values of the data in the database. In the example given
by [WANG87], an instructor may be allowed to see a student's record only if he/she
is the advisor of that student. Relational DBMSs use views to implement Content-
Dependent access control. Content-Dependent access control policies are more
difficult to enforce than Content-Independent policies because the data required to
make an access decision may reside at any site of the system.
3.3.3. Query Processing and Optimization
In a federated database system, especially a tightly coupled one, query
optimization plays an important role in query performance. The query optimization
process attempts to minimize query response time and reduce query cost. In a
federated system, global queries are decomposed into multiple sub-queries that will
be executed in different component database systems. When the results from each of
the component database systems are returned, the data must be manipulated and
merged in such a way that it conforms to the global schema and the canonical data
format.
Several relational operations, particularly JOIN, may be performed during the
process of combining the different result sets. Several approaches have been
proposed for query optimization. The order in which these relational operations are
carried out can cause tremendous differences in query performance; the approach of
manipulating the ordering of relational operations is referred to as heuristic query
optimization. Other approaches make use of cost estimation and semantics, and will
be discussed in later sections.
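As a small sketch of the heuristic idea, using the LIB schemas from Table 3-1
(Section 3.4.6) and assuming PAID is a 0/1 flag, the two queries below are
equivalent, but the second pushes the selection on FINES below the join so that far
fewer tuples take part in the JOIN:

-- Naive ordering: join first, then filter
SELECT S.NAME, F.AMOUNT
FROM   STUDENT S, FINES F
WHERE  S.LIB_ID = F.LIB_ID AND F.PAID = 0;

-- Heuristic ordering: selection pushed down before the join
SELECT S.NAME, F.AMOUNT
FROM   STUDENT S,
       (SELECT LIB_ID, AMOUNT FROM FINES WHERE PAID = 0) F
WHERE  S.LIB_ID = F.LIB_ID;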
3.4. Schema Integration
Schema integration is the process of combining related schema objects from multiple
component databases into a single, global view of the integrated data [MART95b].
This global view of combined data is commonly called a global schema or integrated
schema. A global schema provides location-transparency to the user and hides the
differences among the component databases, thus facilitating formulation of global
queries [MART95b, DEEN87].
The differences in data among the component databases give rise to semantic
heterogeneity, which appears as schema conflicts and inconsistencies during schema
integration. The idea of schema integration is to resolve these conflicts and
inconsistencies among the local databases. Semantic heterogeneity takes many
forms, and these are name differences, format differences, missing data, conflicting
values, semantic differences, and structural differences.
3.4.1. Name Difference
Local databases may have different conventions for naming objects, leading to the
problems of synonyms and homonyms. Synonym means the same data item has
different names in different databases. The global system must recognize the
semantic equivalence of the items and map the differing local names to a single
global name [HURS94]. Homonym means different data items have the same name
in different databases. The global system must recognize the semantic difference
between items and map the common names to different global names [HURS94].
STUDENT (LIB_ID, STUD_ID, NAME, STREET, CITY, STATE, SEX, PHONE)
MEMBER (CLUB_ID, STUD_NO, NAME, FEMALE, MALE)
For example, the attributes Student.Stud_ID and Member.Stud_NO hold the
same kind of data but have different names. To integrate them, the two relations are
unioned and the attribute is assigned a common name such as STUDENT_ID, as
sketched below.
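A minimal sketch of this mapping in ordinary SQL (the view name
GLOBAL_STUDENT and the global attribute names are illustrative assumptions,
not the prototype's actual mapping syntax):

CREATE VIEW GLOBAL_STUDENT (STUDENT_ID, NAME) AS
    SELECT STUD_ID, NAME FROM STUDENT   -- the library's name for the key
    UNION
    SELECT STUD_NO, NAME FROM MEMBER;   -- the club database's synonym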
3.4.2. Format Difference
Format differences include differences in data type, domain, scale, precision, and
item combinations. Multidatabases typically resolve format differences by defining
transformation functions between the local and global representations [DEEN87].
Some functions may be simple numeric calculations, such as converting square feet
to acres. Some may require table conversions.
For example, temperatures may be recorded as "hot", "cold", or "frigid" in
one place and as exact degrees in another. A table can be used to define what range
of degree readings would correspond to the temperature labels. For example, 50-100
degrees Celsius may be labeled as hot. Other functions may require calls to software
procedures that implement algorithmic transformations. A problem in this area is
that the local-to-global transformation may be simple, but the inverse transformation
(global-to-local, which is required if updates are supported) may be complex.
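A minimal sketch of such a table-driven transformation in ordinary SQL (the
READINGS relation and the exact degree ranges are assumptions for illustration):

CREATE VIEW GLOBAL_TEMP (SITE_ID, TEMP_LABEL) AS
    SELECT SITE_ID,
           CASE WHEN DEGREES_C BETWEEN 50 AND 100 THEN 'hot'
                WHEN DEGREES_C BETWEEN 10 AND 49  THEN 'cold'
                ELSE 'frigid'
           END
    FROM READINGS;

The inverse, global-to-local direction illustrates the difficulty noted above: 'hot'
maps back only to a range of degree readings, not to an exact value.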
3.4.3. Missing Data
Sometimes a local database will not store all the information of interest concerning
an entity. There are three cases of integrating relations with missing data: data that is
missing from both relations, data that is missing from one relation, and data that is
found in full in one relation but only in summarized form in the other [DEEN87].
Data Missing from Both Relations
Sometimes global users may require information, which is implicitly available to
local users, but which is not stored [DEEN87]. For instance, one local database may
describe only those students who are in De La Salle University and another may
describe students from Ateneo De Manila University. To the local users there may
be no need to store the university the students are enrolled in as an attribute, and if
the local databases are pre-existing, they may have been designed without
consideration of a global context. But the global user, seeing a single combined
relation, may require the university as an attribute in the view. In this case the
mapping must append an extra attribute to each of the relations before forming their
union [DEEN87].
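A minimal sketch of appending the extra attribute (the relation names
DLSU_STUDENT and ADMU_STUDENT are assumptions for illustration):

CREATE VIEW GLOBAL_STUDENT (STUD_ID, NAME, UNIVERSITY) AS
    SELECT STUD_ID, NAME, 'DLSU' FROM DLSU_STUDENT   -- constant appended to each relation
    UNION ALL
    SELECT STUD_ID, NAME, 'ADMU' FROM ADMU_STUDENT;  -- before forming the union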
Data Missing from One Relation
Alternatively, one employee relation may store different information from another
employee relation because of differing application requirements. If the difference is
very great, then it may be best to preserve the separate relations in the view. If they
are sufficiently similar to be merged, however, there are a number of options. For
example, the technical department of a company does not have to store the salaries
of the employees; in the same way, the accounting department does not need to know
the projects handled by the employees.
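One such option is to pad each relation's missing attribute with nulls before merging;
a minimal sketch (the relation and attribute names are assumptions for illustration):

CREATE VIEW GLOBAL_EMPLOYEE (EMP_ID, NAME, SALARY, PROJECT) AS
    SELECT EMP_ID, NAME, SALARY, NULL FROM ACCT_EMPLOYEE   -- accounting keeps no projects
    UNION ALL
    SELECT EMP_ID, NAME, NULL, PROJECT FROM TECH_EMPLOYEE; -- technical keeps no salaries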
Summary Data
Another case of missing values is where one relation keeps only the summary data
while the other relation retains all the data. For example, one school might keep all
the grades of each student in each course, whereas another might keep only the
cumulative grade point average (CGPA).
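The detailed relation can be reduced to the same summary level before integration; a
minimal sketch, assuming a COURSE_GRADE(STUD_ID, COURSE, GRADE)
relation at the detailed site:

SELECT STUD_ID, AVG(GRADE) AS CGPA
FROM   COURSE_GRADE
GROUP BY STUD_ID;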
3.4.4. Conflicting Values
If separate local databases store information concerning the same entity, then there is
a danger of conflicting values. There are two difficulties here: establishing that a
conflict exists, and correcting the discrepancy. If there are Employee relations at two
sites, how does one determine when the same employee is being described in each
relation? If the employee has salaries listed in each relation, should these salaries
necessarily be equal, or could they be salaries for different jobs?
If there is a conflict, there are still several options. One possibility is to form
a straight union of the two relations, thereby presenting the user with both values. If
a single value is required, it might be safest to take the average of the two, which
should normally ensure a reasonable approximation to the true value. However, if
the aim is to provide the exact value, then one or the other of the conflicting values
could be assumed to be the correct one; various criteria could be used to determine
which value is the more reliable.
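A minimal sketch of the averaging option (the relation names are assumptions for
illustration):

SELECT E1.EMP_ID, (E1.SALARY + E2.SALARY) / 2 AS SALARY
FROM   SITE1_EMPLOYEE E1, SITE2_EMPLOYEE E2
WHERE  E1.EMP_ID = E2.EMP_ID;   -- presents the average as the single global value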
3.4.5. Semantic Difference
Semantic Difference occurs when two attributes use the same name but actually
mean different things. Take, for example, a database containing a relation for each
team in the Philippine Basketball Association, say the Gordon's Gin Boars and the
Alaska Milkmen. Each relation has an attribute OPPONENT, which refers to the
Boars' and the Milkmen's respective opponents, so integrating these tables would
prove to be a difficulty.
Another example is when an employee has a salary attribute in two relations,
and the values happen to be different. In this case, it is possible that the two salary
attributes pertain to two different salaries of the same employee, who works at two
different jobs.
3.4.6. Structural Difference
Value-to-Attribute Conflict occurs when some values in a relation are expressed as
an attribute in another relation. For example, the values of the attribute sex of
RD.Student are represented as attributes (F and M) in ORG.Member. (Table 3-1)
Value-to-Table Conflict occurs when a value in one relation is expressed
independently as a whole relation in another database. For example, Table 3-1 shows
the relation schemas of STUDENT_FEMALE and STUDENT_MALE for female
and male students in PE; the same information is represented as values of SEX in
other databases (e.g., RD.Student).
Attribute-to-Table Conflict occurs when an attribute in one relation is
expressed independently as a whole relation in another database. For example, the
attribute address in LIB.Student is represented as a relation in RD.Address.
Database Name: REGISTRAR DATABASE (RD)
    STUDENT (STUD_ID, FNAME, MI, LNAME, SEX, PHONE)
    ADDRESS (STUD_ID, STREET, CITY, STATE)
    PAYMENT (STUD_ID, BALANCE_OF_TUITION, LIB_PENALTY_FEES)
    *LIB_PENALTY_FEES is of the format ###.##

Database Name: PE DEPARTMENT DATABASE (PE)
    STUDENT_FEMALE (STUD_ID, NAME, ADDRESS, PHONE)
    STUDENT_MALE (STUD_ID, NAME, ADDRESS, PHONE)
    STUD_DATA (STUD_ID, PE1_GRADE, PE2_GRADE, PE3_GRADE, PE4_GRADE)

Database Name: LIBRARY DATABASE (LIB)
    STUDENT (LIB_ID, STUD_ID, NAME, STREET, CITY, STATE, SEX, PHONE)
    FINES (LIB_ID, NO_OFFENSE, AMOUNT, PAID)
    *AMOUNT is of the format ### - no decimal places

Database Name: ORGANIZATION/CLUB DATABASE (ORG)
    MEMBER (CLUB_ID, STUD_NO, NAME, FEMALE, MALE)
    COMMITTEE (CLUB_ID, COMMITTEE_NAME, POSITION)
    PERSONAL_DATA (CLUB_ID, ADDRESS, PHONE, BIRTHDAY)

Table 3-1. Sample schemas of the databases in a university.
Table-to-Table Conflict occurs when one relation in a database is expressed as
several separate relations in another database. For example, the LIB.Student has a
table-to-table conflict with the RD.Student and RD.Address.
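Minimal sketches of resolving two of these structural conflicts in ordinary SQL over
the Table 3-1 schemas (the encodings of SEX and of the FEMALE/MALE flags are
assumptions):

-- Value-to-attribute: deriving ORG.MEMBER-style FEMALE/MALE flags
-- from the SEX values of RD.STUDENT
SELECT STUD_ID,
       CASE WHEN SEX = 'F' THEN 1 ELSE 0 END AS FEMALE,
       CASE WHEN SEX = 'M' THEN 1 ELSE 0 END AS MALE
FROM   STUDENT;

-- Table-to-table: joining RD.STUDENT with RD.ADDRESS to match the
-- shape of LIB.STUDENT
SELECT S.STUD_ID, S.FNAME, S.MI, S.LNAME, A.STREET, A.CITY, A.STATE,
       S.SEX, S.PHONE
FROM   STUDENT S, ADDRESS A
WHERE  S.STUD_ID = A.STUD_ID;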
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
MDB1-A-Schema-Integration-Prototype-for-a-Multidatabase-System
  • 4. TABLE OF CONTENTS
1.0. RESEARCH DESCRIPTION
1.1. OVERVIEW OF THE CURRENT STATE OF THE SPECIFIC TECHNOLOGY
1.2. RESEARCH OBJECTIVE
1.2.1. General Objective
1.2.2. Specific Objectives
1.3. SCOPE AND LIMITATIONS OF THE RESEARCH
1.4. SIGNIFICANCE OF THE RESEARCH
1.5. RESEARCH METHODOLOGY
2.0. REVIEW OF RELATED LITERATURE
2.1. ISSUES IN MULTIDATABASE SYSTEMS
2.1.1. Access Control in Multidatabase Systems
2.1.2. Data Integration in Multidatabase Systems
2.1.3. Query Processing and Optimization
2.2. EXISTING SOFTWARE
2.2.1. Cords Schema Integration Environment
2.2.2. Microsoft Transaction Server (Viper)
2.2.3. Sybase Jaguar
2.2.4. ADDS (Amoco Production Company, Research)
2.2.5. DOIA
2.2.6. Mermaid
3.0. THEORETICAL FRAMEWORK
3.1. DEFINITION OF TERMS
3.2. MULTIDATABASE ARCHITECTURE
3.2.1. Characteristics of Database Systems
3.2.2. Taxonomy of Federated Database Systems
3.2.3. Reference Architecture
3.2.4. Processor Types in the Reference Architecture
3.2.5. ANSI/SPARC Three-Level Schema Architecture
3.2.6. Five-Level Schema Architecture for Federated Databases
3.3. MULTIDATABASE ISSUES
3.3.1. Schema Integration
3.3.2. Access Control
3.3.3. Query Processing and Optimization
3.4. SCHEMA INTEGRATION
3.4.1. Name Difference
3.4.2. Format Difference
3.4.3. Missing Data
3.4.4. Conflicting Values
3.4.5. Semantic Difference
3.4.6. Structural Difference
3.5. CLIENT / SERVER COMPUTING
3.6. MIDDLEWARE
3.7. SQL / VIEWS
3.8. DISTRIBUTED COMPUTING
  • 5. 3.9. QUERY PROCESSING AND OPTIMIZATION
4.0. THE MULTIDATABASE SYSTEM
4.1. SYSTEM OVERVIEW
4.2. SYSTEM OBJECTIVES
4.2.1. General Objective
4.2.2. Specific Objective
4.3. SYSTEM FUNCTIONS
4.3.1. Middleware
4.3.2. Schema Integration Tool
4.3.3. Client Application
4.3.4. Catalogs
4.4. SYSTEM SCOPE AND LIMITATIONS
4.5. PHYSICAL ENVIRONMENT AND RESOURCES
4.6. ARCHITECTURAL DESIGN
4.6.1. Middleware
4.6.2. Schema Integration Tool
4.6.3. Client Application
4.6.4. Architectural Issues
5.0. DESIGN AND IMPLEMENTATION ISSUES
5.1. SYSTEM ARCHITECTURE
5.2. DATABASE SCHEMA DATA STRUCTURES
5.2.1. DBSchema
5.2.2. DBTableDef
5.2.3. DBFieldDef
5.3. THE GLOBAL SCHEMA EDITOR MODULE
5.3.1. Data Structures
5.3.2. Major Algorithms Used
5.3.3. Design Issues
5.4. COMPONENT SCHEMA MANAGER
5.4.1. Data Structures
5.4.1.1. Component Schema Catalog
5.4.1.2. Export Schema Catalog
5.4.1.3. Database Profile
5.4.1.4. Schema Definition Errors Catalog
5.4.2. Major Sub-Modules
5.4.2.1. Schema Loader
5.4.2.2. Verify Schema
5.4.2.3. Component Schema Viewer
5.5. MAPPING EDITOR
5.5.1. Data Structures
5.5.1.1. The Mapping Rule Base Class
5.5.1.2. Derived Mapping Rule Classes
5.5.1.3. Supporting Classes for Mapping Rules
5.5.1.4. Table Mapping and Mapping Entries
5.5.1.5. The Mapping Catalog
5.5.1.6. Binding Factor and Table Filter
5.5.2. Major Algorithms
5.6. FORMULA MANAGER
5.6.1. Data Structures
  • 6. 5.7. MAPPING GENERATOR
5.7.1. Create Global Schema Scripts
5.7.2. User-defined PL/SQL Function Script
5.7.3. Integration Script
5.7.4. Pseudocode for Generating Scripts
5.8. THE MIDDLEWARE
5.8.1. The Update Process
5.8.2. Data Loader
5.8.3. Data Integrator
5.8.4. Update Log File
5.9. UPDATE MANAGER
5.9.1. Major Algorithms Used
5.10. CLIENT APPLICATION
5.10.1. Data Structures
5.10.2. Major Algorithms Used
6.0. RESULTS AND OBSERVATIONS
6.1. INTRODUCTION
6.2. SCHEMA INTEGRATION TOOL
6.2.1. Component Schema Manager
6.2.2. Global Schema Editor
6.2.3. Mapping Editor
6.3. MIDDLEWARE
6.3.1. Data Type Conversion
6.3.2. Data Type Overflow
6.3.3. Integrity Constraint
6.3.4. Data Loading
6.3.5. Update Manager
6.4. CASE STUDY – NBA MULTIDATABASE PROJECT
6.4.1. PLAYERS Table
6.4.2. PLAYER_STATS Table
6.4.3. SCHEDULE Table
6.4.4. NBA_GUARD Table
6.4.5. NBA_FORWARD Table
6.4.6. NBA_CENTER Table
7.0. CONCLUSIONS AND RECOMMENDATIONS
7.1. CONCLUSION
7.1.1. Schema Integration Tool
7.1.2. Middleware
7.1.3. Client Application
7.2. RECOMMENDATIONS
7.2.1. Facility for Accessing Generic Databases
7.2.2. Facility for Viewing Actual Values in Component Databases
7.2.3. User-Interface Improvements
7.2.4. Copying of Schema and Mapping Objects
7.2.5. Document Printing
7.2.6. Middleware as NT Service or System Tray Program
7.2.7. Real-time Data Integration
7.2.8. Support for Additional Data Types
  • 7. 7.2.9. Support for Different Database Models
Appendix A. Bibliography
Appendix B. Oracle Reserved Words
Appendix C. DLSU Sample Schema
Appendix D. NBA Sample Schema
Appendix E. Resource Persons
Appendix F. Personal Vitae
  • 8. List of Figures
Figure 3-1. System architecture of a centralized DBMS [SHET90]
Figure 3-2. Taxonomy of a Multidatabase System [SHET90]
Figure 3-3. An accessing processor
Figure 3-4. Five-level schema architecture of an FDBS
Figure 3-5. System Architecture for an FDBS
Figure 3-6. Client/Server Interaction
Figure 4-1. The MDB1 System
Figure 4-2. Multidatabase system hierarchical chart
Figure 4-3. MDB1 System Architecture
Figure 4-4. Middleware component hierarchical chart
Figure 4-5. The Middleware
Figure 4-6. The Schema Integration Tool
Figure 4-7. Schema Integration Tool hierarchical chart
Figure 4-8. Client Application hierarchical chart
Figure 4-9. The MDB1 5-level Schema Architecture
Figure 5-1. The MDB1 System
Figure 5-2. The Schema Integration Tool
Figure 5-3. Data Integration in the old design
Figure 5-4. Data integration in the new design
Figure 5-5. Schema class hierarchy
Figure 5-6. Table class hierarchy
Figure 5-7. Field class hierarchy
Figure 5-8. Global Schema Data Structure
Figure 5-9. How the Global Schema Editor interacts with the Global Schema
Figure 5-10. Adding a Global Table to a Global Schema
Figure 5-11. Inserting a Foreign Key into a Global Table
Figure 5-12. Two ways of representing foreign keys in the global schema
Figure 5-13. Component Schema Data Structure
Figure 5-14. Export Schema Catalog
Figure 5-15. Schema Loader module
Figure 5-16. Mapping editor screen
Figure 5-17. A mapping rule box
Figure 5-18. The Mapping Rule classes inheritance tree
Figure 5-19. A mapping entry
Figure 5-20. Define Mapping Entry Dialog
Figure 5-21. Mapping Catalog class hierarchy
Figure 5-22. The Mapping Catalog data structure
Figure 5-23. Mapping Rule Set hierarchy
Figure 5-24. The Middleware
  • 9. Figure 5-25. Verify Schema Changes
Figure 5-26. Create Component schemas
Figure 5-27. Load Component Data
Figure 5-28. Integrate Data
Figure 5-29. Query Process
Figure 6-1. Specify bindings notification message
Figure 6-2. Cannot map autonumber message
Figure 6-3. Data integration in the new design
Figure 6-4. Allowed data type conversions
Figure 6-5. PLAYERS table mapped to Lakers database
Figure 6-6. PLAYERS table mapped to Lakers database
Figure 6-7. PLAYER_STATS table mapped to LAKERS database
Figure 6-8. PLAYER_STATS table mapped to LAKERS database
Figure 6-9. SCHEDULE mapped to LAKERS database
Figure 6-10. SCHEDULE mapped to SPURS database
Figure 6-11. NBA_GUARD mapped to LAKERS database
Figure 6-12. NBA_GUARD mapped to SPURS database
Figure 6-13. NBA_FORWARD mapped to LAKERS database
Figure 6-14. NBA_FORWARD mapped to SPURS database
Figure 6-15. NBA_CENTER mapped to LAKERS database
Figure 6-16. NBA_CENTER mapped to SPURS database
  • 10. List of Tables
Table 3-1. Sample schema of a database in a university
Table 5-1. DGlobalSchema, DGlobalTable, and DGlobalField class attributes
Table 5-2. Scripts and their filenames

Listings
Listing 5-1. Adding of Global Tables
Listing 5-2. Adding of Global Fields
Listing 5-3. Assigning of Foreign Key Constraints to Global Tables
Listing 5-4. Pseudocode for the Load Schema function
Listing 5-5. Pseudocode for the Verify Schema function
  • 11. Chapter 1
Research Description

1.0. Research Description

1.1. Overview of the Current State of the Specific Technology

Database systems facilitate the storage, management, and access of valuable data used in various applications. Different sectors and industries, such as banking and finance, manufacturing, government, science and engineering, and information technology, require data storage and data access facilities to support daily operations. As a result, different database systems have been developed to meet the specific needs of diverse classes of applications and users. Likewise, database systems vary in size, capability, and performance. Small office and home office users utilize simple database software to manage inventories, record sales invoices, and keep track of customer information. Big organizations and business institutions employ large-scale database systems to store huge amounts of data.

Over time, different database systems were developed for various classes of applications by numerous database developers. This resulted in disparate database systems with various kinds of incompatibilities. In today's information age, however, there is an increasing need to access data from multiple information sources. A company head office will probably need access to information from each of several local sites. For example, the financial manager might want to know the total expenses incurred in all departments, or a purchasing manager may want to know the available stock of products across all store locations. In general, users need access to integrated data.

Database integration, however, is a difficult task. Database systems may not only be physically distributed, but may also differ in many aspects, such as operating system and computer hardware, supported network protocols, access methods, data model, query language, and data representation.

There are three approaches to database integration [MART95a]. The first is physical integration of all data into one database. This is obviously not a good solution since it is very costly and does not allow independent maintenance of data [MART95a]. The second approach is to provide interoperability, that is, integration at the access language level [MART95a]. The problem with this method is that application developers have to deal with the complexity of different database interfaces and formats. The third and best approach is logical integration of all data into one virtual database that hides the underlying heterogeneity of the local databases [MART95a]. This kind of system is usually referred to as a multidatabase system.
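As a minimal sketch of logical integration, consider two branch databases reachable through Oracle-style database links (all names below, including the links branch_a and branch_b, are hypothetical). A global view hides the name differences and presents one virtual table:

    -- Hypothetical component tables:
    --   expenses@branch_a      (dept_code, amount)
    --   dept_expense@branch_b  (department, expense_amt)
    CREATE VIEW global_expense AS
      SELECT dept_code AS department, amount
      FROM expenses@branch_a
      UNION ALL
      SELECT department, expense_amt AS amount
      FROM dept_expense@branch_b;

    -- The financial manager's question, "What are the total expenses in
    -- all departments?", then becomes a single global query:
    SELECT department, SUM(amount) AS total_expenses
    FROM global_expense
    GROUP BY department;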
  • 12. A multidatabase system provides an integrated view over a collection of different database systems. It abstracts users from the location, distribution, and heterogeneity of the different databases. Furthermore, it provides global access to physically distributed heterogeneous databases via a single query language. For this reason, a multidatabase is sometimes referred to as a heterogeneous distributed database.

A multidatabase is actually a middleware that sits between global clients and several local database systems. It is a software layer that acts as a front end to multiple local database systems, while serving as a back-end database server to global clients. Hence, global clients can access information from multiple sources with a single request through the multidatabase system.

Using a multidatabase to address the problem of database heterogeneity has several advantages. First, existing organizational investments are preserved, including investments in computer hardware, software, and user training. Second, local database sites can continue daily business operations and exercise local autonomy. From the perspective of a local database system, a multidatabase system is just another database user; the multidatabase need not interfere with local operations. Lastly, a multidatabase enables better management of the entire organization, since it provides a global and integrated view of business data from all departments. The difficulty in the study of multidatabase systems lies in the heterogeneity and distribution of the local databases.

1.2. Research Objective

1.2.1. General Objective

This research aims to design and implement a prototype federated database system for integrating at least two heterogeneous database systems.

1.2.2. Specific Objectives

Specifically, the research aims to:
1. Study how to access data from different database systems.
2. Review various issues and techniques regarding schema integration.
3. Research on query processing and query optimization (optional).
4. Survey existing multidatabase systems.
5. Study various access control policies.
6. Study and assess whether to use a tightly-coupled or a loosely-coupled design for the system architecture.
  • 13. 1.3. Scope and Limitations of the Research

The aim of the thesis is to design and implement a prototype federated database system that will enable global queries. The multidatabase system will allow the integration of at least two different relational database systems, but not object-oriented database systems, due to resource limitations. In view of time constraints and the multitude of issues in multidatabase updates, the system will not allow write operations on database items.

Global queries will be processed by the system so that the appropriate sub-queries to the component databases can be derived. Query optimization, on the other hand, is made optional. The system will provide only a simple access control policy because it concentrates on schema integration. The system will not provide automatic schema integration, as it is not practical to do so. Instead, tools for schema integration will be developed to facilitate the programmer's work in integrating component schemas.

1.4. Significance of the Research

The thesis has significance in both theoretical and pedagogical aspects. Further, the multidatabase has much practical use in database management and computer science research.

Theoretically, the study will contribute to the ongoing research on multidatabase systems, particularly on query processing and optimization, and on schema integration.

The multidatabase system can be utilized by business and government organizations that want to integrate physically distributed databases. The multidatabase approach to integration does away with having to purchase a larger database system to accommodate the local data from different sites and transfer them to a central server, thus saving a great deal of resources. Likewise, such a system allows easier management of an organization's local branches and enables a holistic view of the organization's situation and performance.

In the field of computer science research, the system is an invaluable tool. The system can be employed by both Internet and intranet applications for easy retrieval of information coming from heterogeneous database sources. This will facilitate and encourage research in the area of distributed information systems, such as schema integration, data mining, Internet agents, and digital libraries.
  • 14. 1.5. Research Methodology

In order to accomplish the multidatabase project, the proponents performed activities such as research, planning, brainstorming, and design, in relation to the study. The proponents outlined a number of activity phases that comprise the research methodology.

In the data-gathering phase, the proponents researched books, magazines, journals, and papers on multidatabase systems. Specifically, the proponents researched theories and concepts regarding schema integration, access control, and query processing. Likewise, the proponents consulted people knowledgeable in database technology. A study of existing multidatabase architectures was done in order to understand how the different software modules would fit into the system. And since the multidatabase system must facilitate data integration from heterogeneous data sources, studying schema and data integration was a necessity. In line with this, the proponents surveyed the different types of schema conflicts. The proponents were able to do this by studying related literature on schema integration and by surveying actual database schemas.

In the planning phase, the proponents discussed and defined the different functions and features of the system. Then, the project schedule was laid out and the project milestones were identified. Next, the identified tasks were assigned to the members.

For the preliminary system analysis and design, the system features were identified and the system architecture was designed. Through the identified features, the necessary modules and their functionality were defined. At the same time, the system architecture was designed such that these modules are contained properly and module interdependencies are carefully considered.

For the development phase, the proponents used a modified prototyping approach in implementing the system. Because the multidatabase project is experimental in nature, it is advantageous to use this approach so that changes to the system design can be carried out whenever necessary, particularly when the previous design is difficult to realize. The development process also went through preliminary analysis and design phases in order to lay out the initial system architecture. The architectural design and its details were allowed to evolve during the course of the development phase, so that a better solution could be adopted to enhance system efficiency or to resolve a design flaw.
  • 15. The project was divided into parts and features, and the project schedule was divided into milestone junctures based on the system features that had to be accomplished. The most important features were incorporated into the earliest possible subproject. Each subproject then went through a complete cycle of development involving coding, feature integration, testing, and debugging.

The main thesis document was written as soon as the proponents achieved a definite system architecture design. It was then continuously updated and improved concurrently with the development of the system, until the end of the project. The technical manual and user's manual were finalized after the development of the proposed system.
  • 16. Chapter 2
Review of Related Literature

2.0. Review of Related Literature

2.1. Issues in Multidatabase Systems

2.1.1. Access Control in Multidatabase Systems

There are many technical problems encountered in building a multidatabase system. One of these issues is access control. Access control prevents unauthorized access to, or malicious destruction of, the database. Access control in a multidatabase system not only includes controlling access in the local DBMSs at each site, but also controlling and coordinating access to data at multiple sites for multi-site queries [WANG87].

Access control further complicates the implementation of a multidatabase system for several reasons. First, if the multidatabase environment is heterogeneous, problems arise because different sites may use different and incompatible mechanisms for expressing and enforcing access control security policies. Another is the issue of site autonomy: the local DBMS at each site must maintain control of the data stored at that site. The local DBMS should decide for itself whether a user may access the data it manages [WANG87].

According to [FERN81], there are many kinds of access control that can be used for controlling access to data. Two of these are content-independent and content-dependent access control. Content-independent access controls are defined over the base data objects supported by the DBMS. Each access rule is of the form subject/object/privilege, which means that the subject has that privilege over that object. If access to a database object is granted to a user in the global model, a corresponding grant must also be issued in the local model for that database object. This maintains consistency between the local and global models.

Content-dependent access controls, however, base their decisions of whether to allow access on the values of the data in the database. In the example given by [WANG87], an instructor may be allowed to see a student's record only if he or she is the advisor of that student. Relational DBMSs use views to implement content-dependent access control. Content-dependent access control policies are more difficult to enforce than content-independent policies because the data required to make an access decision may reside at any site of the system.
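The advisor example can be rendered with an ordinary relational view and a grant, as sketched below (table and column names are hypothetical, and the sketch assumes each instructor connects under a database account whose name is stored in the advisor column):

    -- Hypothetical base table: student(student_id, name, grades, advisor)
    CREATE VIEW advisee_records AS
      SELECT student_id, name, grades
      FROM student
      WHERE advisor = USER;   -- a row is visible only to that student's advisor

    -- Access is granted on the view, never on the base table itself:
    GRANT SELECT ON advisee_records TO instructor_role;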
  • 17. In order to implement access control policies, as said earlier, one uses views. A view is a virtual object. The purpose of a view is to identify a subset of an entity set or relationship set that a user is authorized to access. A view may be defined with the following construct [CHAM75, ASTR76]:

Define View_Name <Target_List> Where Qualification_Clauses

where:
View_Name = name of the view
Target_List = subset of attributes of an entity set or relationship set of the global database to be included in the view
Qualification_Clauses = identifies which elements of the entity set or relationship set referenced in the Target_List are to be included in the view

A user must be granted the right to use a view. Also, the definer of that view can grant access to the view to other users [WANG87]. In this way, the new user can use the view to access the subset of a database object defined by the view without having access to the database object directly [GRIF76].

Wang and Spooner have proposed the use of protection view mechanisms for multidatabase systems. The approach is to materialize all the data for the view temporarily and then use the result as a base object for processing the query. This addresses the problem of site autonomy, so that a user who has been granted access by the global model to a database object he does not own can access it without the local DBMS denying the request.

2.1.2. Data Integration in Multidatabase Systems

In a multidatabase system, there exists a wide variety of independent databases. These local databases were developed independently with differing local requirements. In effect, a multidatabase system is likely to have different models and representations for similar objects. This results in serious problems when generating queries that require data from various preexisting databases. One solution to this is data integration. Data integration refers to the creation of an integrated view over apparently incompatible data typically collected from different sources [WANG87]. It also provides location transparency and an enhanced global query facility.
  • 18. Data integration is one of the most significant aspects of a multidatabase system. In [DEEN87], the data integration problem is grouped into six major categories: name difference, format difference, missing data, conflicting values, semantic difference, and structural difference.

Name difference occurs when two semantically equivalent data items located in different DBMSs are named differently, or when two semantically inequivalent data items in different DBMSs are named the same [DEEN87]. For instance, in one DBMS the field name for an employee is EMPLOYEE while in another DBMS the field name is WORKER; both refer to the same data item, but they do not have the same name. This clearly leads to conflict. To resolve this conflict, the global system must be able to identify the equivalence of the items and map the differing local names to a single global name.

Another problem to consider in data integration is format differences. Format differences include differences in data type, domain, scale, precision, and item combinations [DEEN87]. For instance, a telephone number may be defined as an integer in one DBMS while it is defined as an alphanumeric string in another. In another case, some data items are broken into components in one database while the combination is treated as a single quantity in another DBMS. This may lead to system errors when not resolved. One solution is to define transformation functions between the local and global representations. The complexity of these functions will depend on the degree of format difference between data items in the various DBMSs.

One serious integration problem is conflicting data. Sometimes, two database systems have a data item that refers to the same real-world object, but the actual data values conflict [DEEN87]. This is due to incomplete updates, system error, or insufficient demand for such data. One example is when two databases have the same data item but contain different values. Among the various integration problems, this is the most complex one and can cause serious data loss in the database system.

Another integration problem is missing data. According to [DEEN87], data can be missing from one relation, from both relations, or it can be summarized data. For example, one relation may contain the employee salaries for each month, while another relation may simply contain the average yearly salary.

Semantic differences occur when two attributes of the same name, belonging to relations of the same name, have different meanings [DEEN87].
  • 19. To resolve this, the semantic meaning of a relation in a DBMS must be explicitly stated to global users.

Structural differences refer to data items with the same semantic meaning in various DBMSs that are structured differently [DEEN87]. For instance, a data item may be represented as a single relation in one DBMS and as multiple relations in another. To resolve this, there should be a mapping language capable of restructuring data from one form to another.

In a paper describing the CORDS Integration Environment, Martin and Powley [MART95a] introduced their own classification of schema conflicts. [MART95a] classified these conflicts according to two dimensions: location and type. The location of a conflict can be in an attribute, within a relation, or within a schema (involving multiple relations) [MART95a]. The set of conflict types, which was in turn based on the categorization of Missier and Rusinkiewicz, includes: data type, scale, precision, default value, name, key, schema isomorphism, union compatibility, abstraction level incompatibility, missing data, and integrity constraint [MART95a].

The method of schema integration in [MART95a] was broken down into steps according to the conflict location. First, export schemas are resolved for attribute conflicts. A view definition is used to map the attributes from export schemas into the MDBS attributes [MART95a]. Then, relation-level conflicts are resolved and a view definition is created. Third, schema-level conflicts are resolved. Finally, the MDBS views from each level of conflict are merged into a single MDBS view definition [MART95a].

In addition to the above methodology, [MART95a] described attribute contexts that support the resolution of attribute-level conflicts. Attribute contexts are provided by export schemas and consist of a number of facets that describe the semantic properties of an attribute, such as data type, scale, and precision [MART95a]. The facets identified by [MART95a] are uniqueness, cardinality, type, precision, scale, and default value.

In many cases, conflicts can be resolved by using transformation functions, as exemplified in CORDS. Schema isomorphism [MART95a] is one such type of conflict that is solvable using transformation functions. For instance, one database contains an attribute ADDRESS, while its equivalent in another database is a composition of the attributes NUMBER, STREET, and CITY. [MART95a] resolved this conflict by applying a transformation function called "StringConcat", which combines the three fields into a single address field [MART95a].
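In standard SQL, this kind of resolution can be approximated directly in the view mapping (a sketch with hypothetical names; the NUMBER attribute is renamed street_no here because NUMBER is a reserved word in SQL):

    -- Database 1 exports: customer(cust_id, address)
    -- Database 2 exports: client(client_no, street_no, street, city)
    CREATE VIEW global_customer AS
      SELECT cust_id AS id, address
      FROM customer@db1
      UNION ALL
      SELECT client_no AS id,
             street_no || ' ' || street || ', ' || city AS address
             -- StringConcat-style transformation combining three fields
      FROM client@db2;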
  • 20. Data integration plays a vital role in multidatabase systems. Proper integration of data is needed to resolve the semantic heterogeneity of the different databases.

2.1.3. Query Processing and Optimization

When the multidatabase receives a global query, it must decompose it into sub-queries, perform query optimization, send the sub-queries to the individual DBMSs, and process the results to be returned to the global user. Similar data items (tables) involved in the query may exist on different DBMSs, or different data items (tables), which must be joined, may be distributed across different DBMSs. Either way, some query processing and optimization must be done in order to achieve efficiency.

In [ELMA94], an overview of the techniques used by DBMSs in processing and optimizing high-level queries is presented. SELECT and JOIN operations have many execution options and are thus candidates for query optimization. Likewise, various approaches to query optimization are discussed. These techniques are classified into heuristic approaches, cost-estimation approaches, and semantic query optimization.

The techniques required for query processing in a multidatabase environment are quite different from those of a single DBMS [LEE94]. The query optimization techniques presented in [LEE94] employ cost estimation. First, the various schemas are classified into types. Then, the costs of executing operations on these conflicting schemas are evaluated and given weights, which are used in the cost estimation.

Query decomposition, optimization, and processing in multidatabase systems are studied in [EVAS95]. In the paper, the optimization of query decomposition in the case of data replication and the optimization of inter-site joins are considered.

2.2. Existing Software

2.2.1. Cords Schema Integration Environment

The CORDS Multidatabase System (MDBS) provides applications with an integrated view of a collection of distributed heterogeneous data sources [MART95a]. Applications are presented with a relational view of the available data and are able to access the data using standard SQL operations [MART95a]. An application's view of the data is defined by a process called schema integration, which is facilitated by the CORDS environment.
  • 21. The CORDS MDBS is a full-function DBMS. The common data model used in the CORDS MDBS is the relational model, so schemas define a collection of data in terms of relational tables, their columns, and any applicable constraints. Applications interact with an MDBS Server via a library of functions called the MDBS Client Library [MART95a]. An MDBS Server performs DBMS functions, such as query processing and optimization, transaction management, and security, at the global level [MART95a]. An MDBS Server connects to a component database (CDBS) through a Server Library, which accepts SQL requests from the MDBS, interacts with the CDBS through its normal application program interface, and then translates the response into the form expected by the MDBS [MART95a]. CDBSs currently supported by the prototype include the Empress, Oracle, and DB2/6000 relational systems, the IMS hierarchical database system, and the VAX DBMS network database system [MART95a].

The MDBS Catalog in CORDS is the central repository for metadata needed by the multidatabase system [MART95a]. It includes three classes of metadata, namely schemas, mappings, and descriptions of CDBSs [MART95a]. Two types of schemas are stored: export schemas and MDBS schemas [MART95a]. An export schema defines the data made available to the MDBS from a CDBS, while MDBS schemas define collections of data at the MDBS level, which are drawn from the exported data [MART95a]. The mappings needed to transform export schema objects into MDBS schema objects are created during the schema integration process.

The process of schema integration in the CORDS MDBS takes schemas from a set of CDBSs and produces one or more integrated views of the available data [MART95a]. Martin and Powley did not define a single all-encompassing global schema, but instead defined MDBS schemas to provide the data for individual applications or groups of applications [MART95a]. MDBS schemas are equivalent to federated schemas as defined by Sheth and Larson [MART95a]. MDBS schemas are made up of virtual global relations called MDBS Views [MART95a]. MDBS Views are views that span multiple heterogeneous databases. They are like relational views in that they are not physically materialized, but rather are stored as mappings that are invoked whenever an MDBS View is accessed. The syntax for MDBS Views extends the standard SQL view definition facility with support for attribute contexts and transformation functions. Attribute contexts are used to describe the semantics of the attributes, and transformation functions are used to resolve several types of schema conflicts [MART95a].

In order to resolve the various types of schema conflicts, the CORDS MDBS provides a Schema Integration Toolkit to support the MDBS DBA in creating MDBS
  • 22. schemas. The toolkit has an AIX Windows graphical interface and was developed on an RS/6000 machine [MART95a]. It runs as an application of the CORDS MDBS [MART95a]. Being a multifunctional toolkit, it includes the following integration tools:

1. Schema Translator – A tool that automates the translation from one data model to another [MART95a]. It receives a file containing the global schema as input and returns as output a file containing the schema expressed in terms of the target data model.

2. Thesaurus – The main function of the Thesaurus is to resolve name conflicts. This is possible because it contains information about relationships, in particular synonyms, among object names [MART95a]. Specifically, the Thesaurus analyzes a schema expressed in the common data model and highlights possible relationships between names in the schema and names currently stored in the Thesaurus. For flexibility, the user is allowed to add new names and relationships to expand the contents of the Thesaurus.

3. Transformation Function Library Manager – This module contains basic transformation functions that are necessary in the schema integration process, such as the conversion of integers to strings and vice versa [MART95a].

4. MDBS View Compiler – The basic function of the MDBS View Compiler is to parse and review an MDBS View definition and store the suitable information in the MDBS Catalog [MART95a].

In the schema integration method, definitions of the export schemas are made using an extended SQL. Initial versions of the export tables are produced by the Schema Translator from the CDBS schemata [MART95a]. They are then edited by the DBA using the editor supplied with the Schema Integration Toolkit and submitted to the MDBS, where they are parsed and stored in the MDBS Catalog [MART95a].

One important step in the CORDS schema integration process is the identification of attributes to be included in the integrated schema [MART95a]. In this case, the Thesaurus tool is used to identify relationships among attributes based on the names used for the attributes. Name conflicts are then resolved using the MDBS View definition statement by mapping the export attributes to a common generic name [MART95a]. The Transformation Function Library Manager is used to analyze the contexts of the attributes and, if possible, suggest transformation functions. This allows the mapping of the export attributes to the view attributes [MART95a].
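Although [MART95a] does not give a full grammar for these definitions, a rough, invented illustration of an MDBS View that resolves a name conflict and a data type conflict might look as follows (the CREATE MDBS VIEW form, the WITH CONTEXT clause, and the IntFromString function are hypothetical placeholders, not actual CORDS syntax):

    -- Illustrative only; the exact CORDS syntax is not given in [MART95a].
    CREATE MDBS VIEW employee_v (emp_name, salary) AS
      SELECT name, salary
      FROM payroll@empress_cdbs
      UNION ALL
      SELECT worker_name,               -- name conflict: worker_name -> emp_name
             IntFromString(salary_str)  -- transformation function resolving a
                                        -- data type conflict (string -> integer)
      FROM staff@oracle_cdbs
      WITH CONTEXT (salary: type = integer, scale = monthly);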
  • 23. 2.2.2. Microsoft Transaction Server (Viper)

The Microsoft Transaction Server (MTS), code-named "Viper", was developed primarily for the Internet and other network servers. It manages application and database transaction requests on behalf of a client. The Transaction Server shields the user and client computer from having to formulate requests for unfamiliar databases; it forwards the requests directly to the database servers. The MTS is thus a sort of multidatabase server. Additional features of MTS include security management, connection to other servers, and transaction integrity.

Microsoft designed the Transaction Server to fit in with its overall object-oriented programming strategy. A drag-and-drop interface is also provided in order to create a transaction model for a single user, and then allow the Transaction Server to manage the model for multiple users, including the creation and management of user and task threads and processes.

2.2.3. Sybase Jaguar

Sybase's Jaguar CTS™ is the first component transaction server that combines a scalable execution environment with support for multiple component models, including Java/JavaBeans, ActiveX, C/C++, and CORBA. Jaguar CTS' open environment extends the Web architecture to provide a platform for developing and deploying transaction-oriented business applications on the Internet, intranets, or extranets. Another feature of Jaguar CTS is a component-based environment that makes it easy for partners to extend the functionality of the core product.

Jaguar CTS combines the features of an object request broker and a TP monitor to provide an easy-to-use, secure execution environment with support for multiple component models for building transaction-oriented business applications on the Net. Jaguar CTS adapts easily to unpredictable workloads to deliver high transactional throughput for large numbers of Internet users. Its flexible transaction management delivers high performance for both synchronous and asynchronous transaction processing.

Jaguar CTS supports all major databases and development tools, offering developers an open, standards-based environment. Jaguar CTS can thus be used to build a multidatabase system prototype. Furthermore, its support for multiple
component models (Java/JavaBeans, ActiveX, C/C++, and CORBA) [SYBA97] facilitates the development of modular and interoperable software components.

2.2.4. ADDS (Amoco Production Company, Research)

The Amoco Distributed Database System (ADDS) project began in late 1983 in response to the problem of integrating databases distributed throughout the corporation [THOM90]. The project was a significant contribution to the business world at that time because database products did not provide effective means for accessing and managing such data. The primary function of ADDS is to provide uniform access to preexisting heterogeneous distributed databases [THOM90]. It is based on a relational data model and uses an extended relational algebra query language [THOM90]. In the terminology of [SHET90], ADDS is a tightly coupled federated system supporting multiple federated schemata. Mappings are stored in the ADDS data dictionary [THOM90]. The data dictionary is fully replicated at all ADDS sites to expedite query processing [THOM90]. Multiple applications and users share CDB (component database) definitions [THOM90]. The CDBs support the integration of the hierarchical, relational, and network data models [THOM90]. Some of the local DBMSs currently supported include IMS, SQL/DS, DB2, RIM, INGRES, and FOCUS [THOM90]. Data items from different local databases that are semantically equivalent, as well as the appropriate data conversions for those items, may be defined [THOM90]. The user interface consists of an Application Program Interface (API) and an interactive interface [THOM90]. Programs use the API to submit queries for execution, access the schema of retrieved data, and access retrieved data on a row-by-row basis [THOM90]; the API provides transparency to the users accessing the distributed database. The interactive interface allows users to execute queries, display the results of the queries, and save the retrieved data [THOM90]. The interface is quite flexible; it can be customized to fit the computer knowledge and expertise of the user, so that both novice and expert users can use the system with ease. Queries submitted for execution are compiled and optimized for minimal data transmission cost [THOM90]; one example of such query optimization is the application of semi-joins. A user may submit any number of queries for simultaneous execution [THOM90].
The interface architecture used by the ADDS system is the Network Interface Facility (NIFTY) architecture [THOM90]. It is an extension of the OSI Reference Model and provides a uniform and reliable interface to computer systems that use different physical communication networks [THOM90]. Communication protocols are not an issue in an ADDS system: an ADDS process on one system can initiate a session with an ADDS process on another system without regard for the multitude of heterogeneous network hardware and software used to accomplish the session [THOM90].

2.2.5. DOIA

DOIA was first presented at the Australian Database Conference in 1995 [KUO94]. It was funded by the Cooperative Research Centres Program through the Department of the Prime Minister and Cabinet of the Commonwealth Government of Australia. The DOIA system is a heterogeneous multidatabase system that provides a single unified view of the federated schema [KUO94]. The architecture of the system is partially based on Sheth and Larson's five-level model for database (schema) integration. The DOIA architecture is composed of the Local Database Agent (LDA) and the Global Database Agent (GDA). The LDA acts as a transforming processor to present a view of a local schema in the Common Data Model (CDM). The GDA, on the other hand, acts as a constructing processor to present a federated schema in the CDM. Both the GDA and the LDA use the Common Query Language (CQL) for queries, updates, and transaction management operations (commit, rollback, etc.). The main difference between the DOIA system and the five-level model is that DOIA does support transaction management. The GDA is composed of two main components: the Transaction Plan Generator and the Global Transaction Coordinator. The main function of the former is to translate each query or update into a series of queries or updates targeted at the individual agents. The results are then passed to the Global Transaction Coordinator, which distributes the tasks; the outcome is stored in a temporary location called the collector database [KUO94]. The LDA is also composed of two components, namely the Local Transaction Coordinator and the Translator. The Local Transaction Coordinator is responsible for providing the transaction management functions not provided by the underlying database, while the Translator translates each query or update from CQL into the local query language [KUO94].
The system currently uses the relational model as its common data model and SQL as its common query language.

2.2.6. Mermaid

The Mermaid system was first developed at Unisys in 1982 as a project for the Department of Defense [THOM90]. The system was needed for accessing and integrating data stored in autonomous databases. Furthermore, the Mermaid system must operate in a permanently heterogeneous environment consisting of distributed, heterogeneous database systems [THOM90]. In the terminology of [SHET90], Mermaid is a tightly coupled federated system supporting multiple federated schemata [THOM90]. Mermaid serves as a front-end system that locates and integrates data from several DBMSs [THOM90]. Several levels of heterogeneity are supported, namely hardware, the operating system of the DBMS host, the network connection to the DBMS host, the data model (relational, network, sequential file), and the database schema [THOM90]. Initially, Mermaid supported only data retrieval from several DBMSs and updates to a single DBMS [THOM90]. The Mermaid system has four major components, namely the User Interface, the server, the Data Dictionary/Directory (DD/D), and the DBMS Interface [THOM90]. The User Interface provides functions such as user authentication, system initialization, query editing, query library maintenance, and so on [THOM90]. Most of the Mermaid software resides in a server that exists on the same network as the user workstations and DBMSs [THOM90]. The server consists of an optimizer that processes queries and a controller that controls execution [THOM90]. The Data Dictionary/Directory is a commercial relational database that contains information about the databases and the environment [THOM90]. Mermaid has an open architecture that supports the development of interfaces to many types of DBMSs, resulting in great flexibility for the participation of various DBMSs [THOM90].
Chapter 3
Theoretical Framework

3.0. Theoretical Framework

3.1. Definition of Terms

Access control – provides local Database Management Systems (DBMSs) the power to prevent unauthorized access to or malicious destruction of their databases.

Accessing Processor – accepts commands and produces data by executing the commands against a database.

Applications Programming Interface (API) – a set of functions and programs that allows clients and servers to intercommunicate.

Attribute – a field of a relation.

Auxiliary databases – hold additional data not stored in any component DBMS and information needed to resolve inconsistencies.

Catalog – a named collection of schemas in a Structured English Query Language (SQL) environment.

Centralized Database System – a single centralized database management system managing a single database on the same computer system.

Client – a networked information requester, usually a PC or workstation, that can query databases and/or other information from a server.

Client-Server System – allows remotely located programs to exchange information in real time.

Component Database Management System – a DBMS that participates in the multidatabase system.

Component Schema – schema derived by translating a local schema into a data model called the canonical or common data model (CDM) of the Federated Database System (FDBS).

Conceptual Schema – schema that describes the conceptual or logical data structures and the relationships among those structures.
Constructing Processor – a type of processor that replicates and/or partitions an operation submitted by a single processor into operations that are accepted by two or more other processors.

Data Integration – refers to the production of union-compatible views for similar information expressed dissimilarly in different nodes.

Data Model Transparency – a form of transparency wherein the data structure and commands being used by one processor are hidden from other processors.

Distributed Computing – refers to the services provided by a distributed computing system.

Distributed Computing System – a collection of autonomous computers interconnected through a communication network to perform different functions.

Distributed Database System – consists of a single distributed DBMS managing multiple databases. These databases can be stored on a single computer system or on multiple computer systems.

Export Schema – schema that represents the subset of a component schema that is made available to the FDBS.

External Schema – a schema that enables the management to customize the access rights of global database users.

Federated Database Management System (FDBMS) – the software that provides controlled and coordinated manipulation of the component database systems.

Federated Database System (FDBS) – consists of component database systems that are autonomous yet participate in a federation to allow partial and controlled sharing of their data.

Federated Schema – schema derived by the integration of multiple export schemas.

Filtering Processor – a type of processor that constrains the commands and associated data that can be passed to another processor.

Global Data Dictionary – the central repository for metadata needed by the multidatabase system (MDBS).
Global Schema – an integrated global view of the combined component schemas. It is the layer above the local external schemas that provides additional data independence.

Global Query – a query that is issued to a multidatabase. It uses global schema specifications.

Heterogeneous Multidatabase System – a multidatabase system whose component database systems have different database management systems.

Homogeneous Multidatabase System – a multidatabase system whose component database systems all have the same database management system.

Internal Schema – schema that describes the physical characteristics of the logical data structures in the conceptual schema.

Local Schema – the conceptual schema of a component database management system. It is the schema associated with one component database prior to schema integration.

Loosely Coupled System – a federated database system that is created and maintained by its users.

Mappings – functions that correlate the schema objects in one schema to the schema objects in another schema.

Mapping rules – define the relationship between the federated schema and the export schemas.

Middleware – a set of drivers, APIs, or other software that improves connectivity between a client application and a server.

Multidatabase – a distributed system that acts as a front end to multiple local database management systems or is structured as a global system layer on top of local database management systems.

Non-federated Database System – an integration of component database management systems that are not autonomous.

Processor – an application-independent software module of a DBMS that manipulates commands and data.
Query – a search question that tells the program what kind of data should be retrieved from the database.

Query Code Generator – a query processor sub-module that generates the query code based on the execution plan given by the query optimizer.

Query Language – a retrieval and data-editing language that enables users to specify the criteria by which the program retrieves and displays the information stored in a database.

Query Optimization – a process that attempts to minimize query response time and reduce query cost.

Query Processing – the entire process of validating, optimizing, and executing a query string.

Reference Architecture – provides the framework in which to understand, categorize, and compare different architectural options for developing federated database systems.

Runtime Database Processor – a query processor sub-module that has the task of running the query code, whether compiled or interpreted, to produce the query result.

Schema – a description of the data managed by one or more database management systems. It consists of schema objects and their interrelationships.

Schema Integration – relates specifically to the problems associated with distributed databases, in particular the integration of a set of pre-existing local schemas into a single global schema.

Schema Name – includes an authorization identifier to indicate the user or account who owns the schema.

Server – a computer, usually a high-powered workstation, a minicomputer, or a mainframe, that houses information for manipulation by networked clients.

Site Autonomy – a key aspect of a multidatabase which provides the local DBMS complete control over local data and processing.

Structured English Query Language (SQL) – a query language which permits updates and data definitions.
Tightly Coupled System – a federated database system that is created and maintained by the administrator alone.

Transforming Processor – transforms a command from a source language to a target language.

View – a single table that is derived from other tables. A view does not necessarily exist in physical form; it is considered a virtual table.

3.2. Multidatabase Architecture

A database system consists of a database management system (DBMS), which manages one or more databases. A federated database system (FDBS) is defined to be a collection of cooperating but autonomous component database systems (DBSs) [HAMM79]. The software that provides controlled and coordinated manipulation of the component DBSs is called a federated database management system (FDBMS). A component database can join more than one federation. The database management system of a component DBS (the component DBMS) can be centralized, distributed, or itself another FDBMS. Component DBMSs can differ in aspects such as data models, query languages, and transaction management capabilities. One of the advantages of a federated database system is that a local database can continue its local operations while simultaneously participating in a federation. The users or the administrators can configure the integration of the database systems.

3.2.1. Characteristics of Database Systems

Multiple database systems that are joined together can be characterized along three dimensions, namely distribution, heterogeneity, and autonomy. Data can be distributed in multiple ways: it can be stored on a single computer system or on multiple computer systems, which may be co-located or geographically distributed. The main advantages of data distribution are increased availability and reliability and improved access times.
Heterogeneity can be generally classified into technological differences, such as differences in hardware, software, and communications. Heterogeneity in database systems can be classified into differences among database management systems and differences in the semantics of data. Heterogeneity due to differences among DBMSs results from differences in data models and differences at the system level, and can be classified into differences in structure, in constraints, and in query languages. Differences in structure result from different structural primitives. Differences in constraints derive from the different constraints supported by different data models. Differences in query languages (e.g., QUEL versus SQL) and the different versions of SQL supported by two relational DBMSs are also factors in heterogeneity. Semantic heterogeneity occurs when there is disagreement in the meaning, interpretation, or intended use of the same or related data. This type of heterogeneity is very hard to detect.

Autonomy can be further classified into design autonomy, communication autonomy, execution autonomy, and association autonomy. Design autonomy refers to the ability of a component DBS to choose its own design with respect to any matter, including (a) the data being managed, (b) the representation, (c) the semantic interpretation of the data, (d) constraints, (e) the functionality of the system, (f) association and sharing with other systems, and (g) the implementation. Communication autonomy refers to the ability of a component DBMS to decide whether to communicate with other component DBMSs. Execution autonomy refers to the ability of a component DBMS to execute local operations without interference from external operations and to decide the order in which to execute external operations. Association autonomy implies that a component DBS has the ability to decide whether and how much of its functionality and resources to share with others.

3.2.2. Taxonomy of Federated Database Systems

A database system can be classified into two types, centralized and distributed. A centralized database system (Figure 3-1) consists of a single centralized database management system managing a single database on the same computer system. A distributed database system consists of a single distributed DBMS managing multiple databases, which can be stored on a single computer system or on multiple computer systems.
[Figure 3-1 diagram: external schemas 1 through n are mapped through filtering processors 1 through n onto the conceptual schema; a transforming processor maps the conceptual schema to the internal schema, and an accessing processor executes commands against the database.]

Figure 3-1. System architecture of a centralized DBMS [SHET90].

A multidatabase system (MDBS) supports operations on multiple component database systems. An MDBS is called homogeneous if the DBMSs of all component database systems are the same; otherwise it is called a heterogeneous MDBS [SHET90]. A system is not a multidatabase system if it only allows periodic, nontransaction-based exchange of data among multiple DBMSs or if it only provides access to multiple DBMSs one at a time [SHET90].
[Figure 3-2 diagram: multidatabase systems are divided into nonfederated and federated database systems; federated database systems are divided into loosely coupled and tightly coupled systems; tightly coupled systems support either a single federation or multiple federations.]

Figure 3-2. Taxonomy of a Multidatabase System [SHET90].

Multidatabase systems can be classified as non-federated and federated (Figure 3-2). A non-federated database system is an integration of component DBMSs that are not autonomous [SHET90]. In contrast, a federated database system consists of component DBSs that are autonomous yet participate in a federation to allow partial and controlled sharing of their data [SHET90]. A federated database system is further classified as loosely or tightly coupled [SHET90]. An FDBS is loosely coupled if its users are responsible for creating and maintaining the federation [SHET90]. On the other hand, it is tightly coupled if the administrator is responsible for the configuration of the federation [SHET90], specifically if there is a global DBA that manages a global schema.

3.2.3. Reference Architecture

A reference architecture provides the framework in which to understand, categorize, and compare different architectural options for developing federated database systems [SHET90]. Such an architecture requires a number of components which
are essential to the system. These components consist of data, databases, commands, processors, schemas, and mappings [SHET90], and they are joined together to form different data management architectures. The main considerations in choosing an architecture are its level of centralization, its level of distribution, and the manner in which the components hide their implementation details. Processors and schemas are the most significant components in defining the various architectures: processors are application-independent software modules of a DBMS, while schemas are application-specific components that define database contents and structure [SHET90].

3.2.4. Processor Types in the Reference Architecture

Data management architectures differ in the types of processors present and the relationships among those processors. The four types of processors are transforming, filtering, constructing, and accessing processors.

Transforming processors transform a command from a source language to a target language, or transform data from one format to another [SHET90]. This type of processor provides a kind of data independence called data model transparency, in which the data structure and commands used by one processor are hidden from other processors [SHET90]. In effect, a transforming processor abstracts the various command formats and data representations away from the receiving processor. In order to perform a transformation, a transforming processor must be equipped with a mapping between the objects of each schema. The primary job of schema translation is to transform a given schema A (which describes certain data in one data model) into an equivalent schema B in a different data model. This task also generates the mappings that correlate the schema objects of one schema (schema B) with the schema objects of the other schema (schema A). The process of using these mappings to translate commands involving the schema objects of one schema (schema B) into commands involving the schema objects of the other schema (schema A) is called command transformation.
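To make command transformation concrete, the following is a minimal SQL sketch under assumed names: a component (CDM) schema exposes a relation EMPLOYEE(EMP_NAME, EMP_SALARY), while the local schema stores the same data as EMP(ENAME, SAL). All of these names are hypothetical and serve only to illustrate the mapping.

    -- Command as issued against the component schema:
    SELECT EMP_NAME, EMP_SALARY
    FROM EMPLOYEE
    WHERE EMP_SALARY > 20000;

    -- The same command after transformation to the local schema, using
    -- the mappings EMPLOYEE -> EMP, EMP_NAME -> ENAME, EMP_SALARY -> SAL
    -- generated during schema translation:
    SELECT ENAME, SAL
    FROM EMP
    WHERE SAL > 20000;

A transforming processor applies such mappings mechanically, so the processor issuing the command never sees the local names.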
Filtering processors constrain the commands and associated data that can be passed to another processor. Each filtering processor has a mapping that describes the constraints on commands and data; the constraints may either be embedded in the code of the processor or be specified in a separate data structure. Examples of filtering processors include the syntactic constraint checker (which checks commands syntactically), the semantic integrity constraint checker (which checks commands for semantic integrity constraint violations), and the access controller (which verifies a user's right to perform a given command on certain data) [SHET90].

Constructing processors partition and/or replicate an operation submitted by a single processor into operations that are accepted by two or more other processors [SHET90]. A constructing processor should be able to support location, distribution, and replication transparencies [SHET90]; these transparencies are provided because a processor submitting a command does not need to know the location, distribution, or number of processors participating in processing that command [SHET90]. Some of the jobs that a constructing processor can perform are schema integration, negotiation (determining the protocol to be used among the owners of the various schemas to be integrated), query decomposition and optimization, and global transaction management (performing concurrency and atomicity control).

An accessing processor accepts commands and produces data by executing the commands against a database (Figure 3-3) [SHET90]. For example, it may accept commands from several processors and interleave the processing of those commands.

[Figure 3-3 diagram: commands flow into an accessing processor, which executes them against the database and returns data.]

Figure 3-3. An accessing processor.

Examples of accessing processors include the following: (a) a file management system that executes access procedures against stored files, (b) an application program that accepts commands and returns the needed data after
processing them, (c) a data manager of a DBMS containing data access methods, or (d) a dictionary manager that manages access to dictionary data.

3.2.5. ANSI/SPARC Three-Level Schema Architecture

There is a standard three-level schema architecture for centralized DBMSs, outlined by the ANSI/X3/SPARC Study Group. The three levels are the conceptual schema, the internal schema, and the external schema. The first level, the conceptual schema, describes the conceptual or logical data structures and the relationships among those structures. The second level, the internal schema, describes the physical characteristics of the logical data structures in the conceptual schema; these characteristics include information such as the placement of records on physical storage devices, the placement and type of indexes, and the physical representation of relationships between logical records. The third level, the external schema, manages the access rights of the users. The task of a transforming processor includes the translation of commands expressed using conceptual schema objects into commands using internal schema objects. An accessing processor then executes the commands to retrieve data from the physical media.

3.2.6. Five-Level Schema Architecture for Federated Databases

The ANSI/SPARC three-level architecture cannot be applied directly to an FDBS. However, there exists a five-level schema architecture that supports the three dimensions of an FDBS (distribution, heterogeneity, and autonomy). This five-level architecture is an extension of the three-level architecture, and it consists of the local, component, export, federated, and external schemas (Figure 3-4). The local schema is the conceptual schema of a component DBS; it is expressed in the native data model of the component DBMS. A component schema is derived by translating a local schema into a data model called the canonical or common data model (CDM). The two main reasons for defining component schemas in a CDM are that (a) they describe the divergent local schemas using a single representation and (b) semantics that are missing in a local schema can be added to its component schema [SHET90]. The translation of the local schemas
to component schemas greatly facilitates the integration of data in a federated database system.

[Figure 3-4 diagram: external schemas map onto federated schemas; each federated schema integrates several export schemas; each export schema is derived from a component schema, which is in turn translated from the local schema of a component DBS.]

Figure 3-4. Five-level schema architecture of an FDBS.

The process of schema translation from a local schema to a component schema generates the mappings between the component schema objects and the local schema objects [SHET90]. These mappings are used by the transforming processors to transform commands on a component schema into commands on the corresponding local schema (Figure 3-5).

The export schema represents a subset of a component schema that is made available to the FDBS [SHET90]. The main purpose of defining export schemas is to facilitate control and management of association autonomy [SHET90]. A filtering processor can be tasked to manage the access control specified in an export schema by limiting the set of allowable operations that can be submitted [SHET90].

A federated schema is an integration of multiple export schemas. It also includes the information on data distribution that is generated when integrating the export schemas [SHET90]. It is possible to have a number of federated schemas in an FDBS, one for each class of federation users [SHET90]. A class of federation users can be either a group of users or a set of applications performing a related set of activities [SHET90].
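As an illustration, in a relational setting an export schema can be realized as a view over the component schema that exposes only the shareable subset. The following is a minimal sketch under assumed names: EMPLOYEE and its columns are hypothetical, and SALARY is deliberately withheld from the federation.

    -- Export schema exposing only part of a component schema.
    -- EMPLOYEE(EMP_ID, NAME, DEPARTMENT, SALARY) is a hypothetical
    -- component relation; SALARY is not exported.
    CREATE VIEW EXPORT_EMPLOYEE AS
      SELECT EMP_ID, NAME, DEPARTMENT
      FROM EMPLOYEE;

    -- Only the view, not the base table, is made visible to the FDBS.
    GRANT SELECT ON EXPORT_EMPLOYEE TO FEDERATION_USER;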
[Figure 3-5 diagram: for each component DBS, commands on an external schema pass through a filtering processor to a federated schema, through a constructing processor to an export schema, through a filtering processor to the component schema, and through a transforming processor to the local schema of the component DBS.]

Figure 3-5. System Architecture for an FDBS.

An external schema defines a schema for a user or a class of users [SHET90]. The main reasons for the use of external schemas are customization, additional integrity constraints, and access control [SHET90].
The filtering processor then checks the commands on the external schema for any access control or integrity constraint violations [SHET90]. A transforming processor is needed to transform commands on the external schema into commands on the federated schema if the external schema is in a different data model [SHET90].

3.3. Multidatabase Issues

In building multidatabase systems, one has to consider several issues that present difficulties. Three of these issues are schema integration, access control, and query optimization.

3.3.1. Schema Integration

Given a heterogeneous collection of local databases, a multidatabase system should provide a facility to integrate these databases and produce a global schema. Ideally, this kind of facility is called an integrator's workbench, the output of which is composed of the global data dictionary, the global and participation schemas, the mapping rules, and the auxiliary databases. However, in order to design these, it is necessary to first develop a methodology for performing the integration on which the workbench can be built [BELL92].

Schema integration is a relatively new concept, relating specifically to the problems associated with distributed databases, in particular the integration of a set of pre-existing local schemas into a single global schema. Schema integration in a multidatabase is a complex task. The problems arise from the structural and semantic differences between the local schemas, which have been developed independently following not only different methodologies but also different models (e.g., the relational model and the object model). In the process of schema integration, six major problems will be encountered: name difference, format difference, missing data, conflicting values, semantic difference, and structural difference.

Name differences occur when two semantically equivalent data items located in different DBMSs are named differently, or when two semantically different data items in different DBMSs are named the same. Format differences include differences in data type, domain, scale, precision, and item combinations. Missing or conflicting data is the most complex problem and can cause serious data loss in the database system. This occurs when two database systems have a data item that refers to the same real-world object but their actual data values conflict, due to incomplete updates, system error, or insufficient demand for such data.
Semantic differences occur when two attributes of the same name, belonging to relations of the same name, have different meanings. Structural differences refer to data items that have the same semantic meaning in various DBMSs but are structured differently.

3.3.2. Access Control

Access control provides local Database Management Systems (DBMSs) the power to prevent unauthorized access to or malicious destruction of their databases, in this case from Multidatabase Systems (MDBSs) [WANG87]. This includes not only controlling access in the local DBMS at each site, but also controlling and coordinating access to data at multiple sites for multi-site queries. Two types of access control mechanisms are Content-Independent and Content-Dependent access control.

Content-Independent access controls are defined over the basic data objects supported by the DBMS. Each access rule is of the form subject/object/privilege, which means that the subject has that privilege over that object. If a user is granted access to a database object in the global model, a corresponding grant must also be issued in the local model for that database object; this maintains consistency between the local and global models [WANG87].

Content-Dependent access controls, however, base their decisions of whether to allow access on the values of the data in the database. In the example given by [WANG87], an instructor may be allowed to see a student's record only if he/she is the advisor of that student. Relational DBMSs use views to implement Content-Dependent access control. Content-Dependent access control policies are more difficult to enforce than Content-Independent policies because the data required to make an access decision may reside at any site of the system.
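As an illustration of how views implement Content-Dependent access control, the following is a minimal SQL sketch of the advisor example above. The relation STUDENT(STUD_ID, NAME, GRADE, ADVISOR_ID) and the grantee name are hypothetical; USER denotes the current authorization identifier.

    -- A view that exposes only the records of students advised
    -- by the instructor who runs the query (hypothetical schema).
    CREATE VIEW ADVISEE_RECORD AS
      SELECT STUD_ID, NAME, GRADE
      FROM STUDENT
      WHERE ADVISOR_ID = USER;

    -- Instructors are granted access to the view, not the base table.
    GRANT SELECT ON ADVISEE_RECORD TO INSTRUCTOR;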
3.3.3. Query Processing and Optimization

In a federated database system, especially a tightly coupled one, query optimization plays an important role in query performance. The query optimization process attempts to minimize query response time and reduce query cost. In a federated system, global queries are decomposed into multiple sub-queries that are executed in the different component database systems. When the results from each of the component database systems are returned, the data must be manipulated and merged so that it conforms to the global schema and the canonical data format. Several relational operations may be performed during the process of combining the different result sets, particularly JOIN. Several approaches to query optimization have been proposed. The order in which the relational operations are carried out can make a tremendous difference in query performance; the approach of manipulating the ordering of relational operations is referred to as heuristic query optimization. Other approaches make use of cost estimation and semantics, and are discussed in later sections.

3.4. Schema Integration

Schema integration is the process of combining related schema objects from multiple component databases into a single, global view of the integrated data [MART95b]. This global view of the combined data is commonly called a global schema or integrated schema. A global schema provides location transparency to the user and hides the differences among the component databases, thus facilitating the formulation of global queries [MART95b, DEEN87]. The differences in data among the component databases give rise to semantic heterogeneity, which appears as schema conflicts and inconsistencies during schema integration. The goal of schema integration is to resolve these conflicts and inconsistencies among the local databases. Semantic heterogeneity takes many forms: name differences, format differences, missing data, conflicting values, semantic differences, and structural differences.

3.4.1. Name Difference

Local databases may have different conventions for naming objects, leading to the problems of synonyms and homonyms. A synonym means the same data item has different names in different databases; the global system must recognize the semantic equivalence of the items and map the differing local names to a single global name [HURS94]. A homonym means different data items have the same name in different databases; the global system must recognize the semantic difference between the items and map the common names to different global names [HURS94].

STUDENT (LIB_ID, STUD_ID, NAME, STREET, CITY, STATE, SEX, PHONE)
MEMBER (CLUB_ID, STUD_NO, NAME, FEMALE, MALE)

For example, the attributes STUDENT.STUD_ID and MEMBER.STUD_NO hold the same kind of data but have different names. To integrate them, the two relations are unioned and the attributes are mapped to a common name such as STUDENT_ID.
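A minimal SQL sketch of this resolution over the two relations above; the global name STUDENT_ID comes from the text, while the choice of accompanying columns is illustrative:

    -- Map the synonyms STUD_ID and STUD_NO to the single
    -- global name STUDENT_ID before forming the union.
    SELECT STUD_ID AS STUDENT_ID, NAME FROM STUDENT
    UNION
    SELECT STUD_NO AS STUDENT_ID, NAME FROM MEMBER;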
3.4.2. Format Difference

Format differences include differences in data type, domain, scale, precision, and item combinations. Multidatabases typically resolve format differences by defining transformation functions between the local and global representations [DEEN87]. Some functions may be simple numeric calculations, such as converting square feet to acres. Some may require table conversions. For example, temperatures may be recorded as "hot", "cold", or "frigid" in one place and as exact degrees in another; a table can then define what range of degree readings corresponds to each temperature label (for example, 50-100 degrees Celsius may be labeled as hot). Other functions may require calls to software procedures that implement algorithmic transformations. A problem in this area is that the local-to-global transformation may be simple, while the inverse transformation (global-to-local, which is required if updates are supported) may be complex.
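A minimal SQL sketch of such a table conversion, assuming a hypothetical READING relation with an exact DEGREES_C column; the ranges are illustrative only:

    -- Transformation function mapping exact readings to the labels
    -- used by the other database (hypothetical ranges).
    SELECT LOCATION,
           CASE
             WHEN DEGREES_C >= 50 THEN 'hot'
             WHEN DEGREES_C >= 10 THEN 'cold'
             ELSE 'frigid'
           END AS TEMPERATURE
    FROM READING;

This also illustrates why the inverse, global-to-local direction can be complex: a single label corresponds to a whole range of degree readings, so the original value cannot be recovered exactly.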
3.4.3. Missing Data

Sometimes a local database will not store all the information of interest concerning an entity. There are three cases of integrating relations with missing data: data that is missing from both relations, data that is missing from one relation, and data that is found in full in one relation but only summarized in the other [DEEN87].

Data Missing from Both Relations

Sometimes global users may require information that is implicitly available to local users but is not stored [DEEN87]. For instance, one local database may describe only those students who are in De La Salle University and another may describe students from Ateneo De Manila University. To the local users there may be no need to store the university the students are enrolled in as an attribute, and if the local databases are pre-existing, they may have been designed without consideration of a global context. But the global user, seeing a single student relation, may require the university as an attribute in the view. In this case the mapping must append an extra attribute to each of the relations before forming their union [DEEN87].

Data Missing from One Relation

Alternatively, one employee relation may store different information from another employee relation because of differing application requirements. If the difference is very great, then it may be best to preserve the separate relations in the view. If they are sufficiently similar to be merged, however, there are a number of options. For example, the technical department of a company may not store the salaries of the employees, while the accounting department may not record the projects handled by the employees.

Summary Data

Another case of missing values is where one relation keeps only summary data while the other relation retains all the data. For example, a high-standard school might keep all the grades of each student in each course, while a low-quality school might keep only the cumulative grade point average (CGPA).

3.4.4. Conflicting Values

If separate local databases store information concerning the same entity, then there is a danger of conflicting values. There are two difficulties here: establishing that a conflict exists and correcting the discrepancy. If there are employee relations at two sites, how does one determine when the same employee is being described in each relation? If the employee has salaries listed in each relation, should these salaries necessarily be equal, or could they be salaries for different jobs? If there is a conflict, there are still several options. One possibility is to form a straight union of the two relations, thereby presenting the user with both values. If a single value is required, it might be safest to take the average of the two, which should normally ensure a reasonable approximation to the true value. However, if the aim is to provide the exact value, then one or the other of the conflicting values could be assumed to be the correct one; various criteria could be used to determine which value is the more reliable.
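A minimal SQL sketch of the averaging option, assuming hypothetical relations EMP_A and EMP_B that both record a salary for the same employee:

    -- Present the average of the two conflicting salary values
    -- as the single global value (hypothetical relations).
    SELECT A.EMP_ID,
           (A.SALARY + B.SALARY) / 2 AS SALARY
    FROM EMP_A A, EMP_B B
    WHERE A.EMP_ID = B.EMP_ID;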
3.4.5. Semantic Difference

A semantic difference occurs when two attributes use the same name but actually mean different things. Take for example a database containing the relations of different teams in the Philippine Basketball Association, say the Gordon's Gin Boars and the Alaska Milkmen. Each relation has an attribute OPPONENT; in one relation the attribute refers to the Boars' opponents and in the other to the Milkmen's opponents, so integrating these tables would prove to be a difficulty. Another example is when an employee has a salary attribute in two relations and the values happen to be different. In this case, it is possible that the two salary attributes pertain to two different salaries of the employee, who works two different jobs.

3.4.6. Structural Difference

A Value-to-Attribute Conflict occurs when some values in one relation are expressed as attributes in another relation. For example, the values of the attribute SEX of RD.STUDENT are represented as the attributes FEMALE and MALE in ORG.MEMBER (Table 3-1).

A Value-to-Table Conflict occurs when a value in one relation is expressed independently as a whole relation in another database. For example, Table 3-1 shows the relation schemas STUDENT_FEMALE and STUDENT_MALE for the female and male students in PE; the same information is represented as values of the attribute SEX in other databases (e.g., RD.STUDENT).

An Attribute-to-Table Conflict occurs when an attribute in one relation is expressed independently as a whole relation in another database. For example, the address attributes in LIB.STUDENT are represented as a relation, RD.ADDRESS.

REGISTRAR DATABASE (RD)
    STUDENT (STUD_ID, FNAME, MI, LNAME, SEX, PHONE)
    ADDRESS (STUD_ID, STREET, CITY, STATE)
    PAYMENT (STUD_ID, BALANCE_OF_TUITION, LIB_PENALTY_FEES)
    *LIB_PENALTY_FEES is of the format ###.##

PE DEPARTMENT DATABASE (PE)
    STUDENT_FEMALE (STUD_ID, NAME, ADDRESS, PHONE)
    STUDENT_MALE (STUD_ID, NAME, ADDRESS, PHONE)
    STUD_DATA (STUD_ID, PE1_GRADE, PE2_GRADE, PE3_GRADE, PE4_GRADE)

LIBRARY DATABASE (LIB)
    STUDENT (LIB_ID, STUD_ID, NAME, STREET, CITY, STATE, SEX, PHONE)
    FINES (LIB_ID, NO_OFFENSE, AMOUNT, PAID)
    *AMOUNT is of the format ### - no decimal places

ORGANIZATION/CLUB DATABASE (ORG)
    MEMBER (CLUB_ID, STUD_NO, NAME, FEMALE, MALE)
    COMMITTEE (CLUB_ID, COMMITTEE_NAME, POSITION)
    PERSONAL_DATA (CLUB_ID, ADDRESS, PHONE, BIRTHDAY)

Table 3-1. Sample schema of a database in a university.

A Table-to-Table Conflict occurs when one relation in a database is expressed as several separate relations in another database. For example, LIB.STUDENT has a table-to-table conflict with RD.STUDENT and RD.ADDRESS.
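A minimal SQL sketch of resolving the value-to-attribute conflict above. It assumes, purely for illustration, that the FEMALE and MALE columns of ORG.MEMBER hold 'Y' flags; the conflict is resolved by folding the two attributes into a single SEX value compatible with RD.STUDENT:

    -- Fold the FEMALE/MALE flag attributes of MEMBER into a single
    -- SEX attribute ('F' or 'M'), matching the representation used
    -- by RD.STUDENT. The 'Y' flag encoding is an assumption.
    SELECT STUD_NO AS STUD_ID,
           NAME,
           CASE
             WHEN FEMALE = 'Y' THEN 'F'
             WHEN MALE   = 'Y' THEN 'M'
           END AS SEX
    FROM MEMBER;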