Assisting Migration and Evolution
of
Relational Legacy Databases

by

G.N. Wikramanayake
The inability to define and apply rules and constraints on early database systems due to system limitations resulted in th...

global users to incrementally enhance legacy information systems. This offers the potential for users in this type of envi...

The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering backg...

...chapter by drawing conclusions about the research project as a whole.




CHAPTER 2
Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the a...

a high level of abstraction. However, the semantic information now available in the form of rules and constraints in moder...

will become legacy databases in the near future or already may be considered to be legacy databases in that their data mod...

• During the last two decades the relational model has been the most popular model; therefore it has been used to dev...

Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to a par...

2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being ...
[Figure: Enhanced Schema / Visualisation / Enforced ...]
(SMTS). This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using ...

difference being that the output is graphical rather than textual.

Stage 2: Knowledge Augmentation

In a h...

holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a...

process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy applicati...

A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process referred to as SMTS. S...

3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration. To perfo...

...tools.

• It should logically support a model using modern data modelling techniques irrespective of whether it i...
CHAPTER 3
Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints
SEN77, KIM79...

The a...

(i.e. true, ...

values for an...
G.N. Wikramanayake (1996). Assisting Migration and Evolution of Relational Legacy Databases. University of Wales, College of Cardiff, Cardiff, UK.


Assisting Migration and Evolution of Relational Legacy Databases

by

G.N. Wikramanayake
Department of Computer Science, University of Wales Cardiff
Cardiff, September 1996
Abstract

The research work reported here is concerned with enhancing and preparing databases with limited DBMS capability for migration, to keep up with current database technology. In particular, we have addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in a migration process. Special attention has been paid to the case where the legacy database service lacks the specification, representation and enforcement of integrity constraints. We have shown how constraints matching modern DBMS capabilities can be incorporated into these systems to ensure that, when migrated, they can benefit from current database technology. To this end, we have developed a prototype conceptual constraint visualisation and enhancement system (CCVES) to automate as efficiently as possible the process of re-engineering for a heterogeneous distributed database environment, thereby assisting the global system user in preparing their heterogeneous database systems for a graceful migration.

Our prototype system has been developed using a knowledge-based approach to support the representation and manipulation of the structural and semantic information about schemas that the re-engineering and migration process requires. It has a graphical user interface, including graphical visualisation of schemas with constraints using the user's preferred modelling techniques. The system has been implemented using meta-programming technology because of the proven power and flexibility that this technology offers to this type of research application.

The important contributions resulting from our research include extending the benefits of meta-programming technology to the very important application area of evolution and migration of heterogeneous legacy databases. In addition, we have provided an extension to various relational database systems to enable them to overcome their limitations in the representation of meta-data. These extensions contribute towards the automation of the reverse-engineering process of legacy databases, while allowing the user to analyse them using extended database modelling concepts.
CHAPTER 1
Introduction

This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the research undertaken. Section 1.2 presents the broad goals of the research. The original achievements which have resulted from the research are summarised in Section 1.3. Finally, the overall organisation of the thesis is described in Section 1.4.

1.1 Background and Motivations of the Research

Over the years, rapid technological changes have taken place in all fields of computing. Most of these changes have been due to advances in data communications, computer hardware and software [CAM89], which together have provided a reliable and powerful networking environment (i.e. standard local and wide area networks) that allows the management of data stored in computing facilities at many nodes of the network [BLI92]. These changes have turned hardware technology away from centralised mainframes towards networked file-server and client-server architectures [KHO92], which support various ways to use and share data. Modern computers are much more powerful than previous generations and perform business tasks at a much faster rate by using their increased processing power [CAM88, CAM89].

Simultaneous developments in the software industry have produced techniques (e.g. for system design and development) and products capable of utilising the new hardware resources (e.g. multi-user environments with GUIs). These new developments are being used for a wide variety of applications, including modern distributed information processing applications such as office automation, where users can create and use databases with forms and reports with minimal effort compared to the development effort required with 3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology [ELM94, DAT95], as this field too has advanced, allowing users to represent and manipulate advanced forms of data and their functionalities.
Due to the program-data independence feature of DBMSs, the maintenance of database application programs has become easier, as functionalities that were traditionally performed by procedural application routines are now supported declaratively using database concepts such as constraints and rules. In the field of databases, the recent advances resulting from this technological transformation include many areas, such as the use of distributed database technology [OZS91, BEL92], object-oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems [MYL89, GUI94], and 4GLs and CASE tools [COMP90, SCH95, SHA95].

Meanwhile, the older technology was dealing with files and primitive database systems, which now appear inflexible, as the technology itself prevents them from being adapted to meet the changing business needs catalysed by newer technologies. The older systems, which were developed using 3GLs and have been in operation for many years, often suffer from failures, inappropriate functionality, lack of documentation and poor performance, and are referred to as legacy information systems [BRO93, COMS94, IEE94, BRO95, IEEE95]. The current technology is much more flexible, as it supports methods to evolve (e.g. 4GLs, CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]) and can share resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
reflects the changing business needs. However, modern systems need to be properly designed and implemented to benefit from this technology, and even this may not prevent such systems themselves being considered legacy information systems in the near future, due to the advent of the next generation of technology with its own special features. The only salvation would appear to be building evolution paths into current systems.

The increasing power of computers and their software has meant that they have already taken over many day-to-day functions, and they are taking over more of these tasks as time passes. Thus computers are managing a larger volume of information in a more efficient manner. Over the years, most enterprises have adopted the computerisation option to enable them to perform their business tasks efficiently and to compete with their counterparts. As the performance of computers has increased, enterprises still using early computer technology face serious problems due to the difficulties inherent in their legacy systems. This means that new enterprises using systems purely based on the latest technology have an advantage over those which need to continue to use legacy information systems (ISs), as modern ISs have been developed using current technology which provides not only better performance but also the benefits of improved functionality.

Hence, managers of legacy IS enterprises want to retire their legacy code and use modern database management systems (DBMSs) in the latest environment to gain the full benefits of this newer technology. However, they want to use this technology on the information and data they already hold, as well as on data yet to be captured. They also want to ensure that any attempt to incorporate the modern technology will not adversely affect the ongoing functionality of their existing systems.
This means legacy ISs need to be evolved and migrated to a modern environment in such a way that the migration is transparent to the current users. The theme of this thesis is how we can support this form of system evolution.

1.1.1 The Barriers to Legacy Information System Migration

Legacy ISs are usually those systems that have stood the test of time and have become a core service component for a business's information needs. These systems are a mix of hardware and software, sometimes proprietary, often out of date, and built to earlier styles of design, implementation and operation. Although they were productive and fulfilled their original performance criteria and requirements, these systems lack the ability to change and evolve. The following can be seen as barriers to evolution in legacy ISs [IEE94]:

• The technology used to build and maintain the legacy IS is obsolete,
• The system is unable to reflect changes in the business world and to support new needs,
• The system cannot integrate with other sub-systems,
• The cost, time and risk involved in producing new alternative systems to the legacy IS. The risk factor is that a new system may not provide the full functionality of the current system for a period, because of teething problems.

Due to these barriers, large organisations [PHI94] prefer to write independent sub-systems to perform new tasks using modern technology, which run alongside the existing systems, rather than attempt to achieve this by adapting existing code or by writing a new system that replaces the old and has new facilities as well. We see the following immediate advantages of this low-risk approach:
• The performance, reliability and functionality of the existing system is not affected,
• New applications can take advantage of the latest technology,
• There is no need to retrain those staff who only need the facilities of the old system.

However, with this approach, as business requirements evolve with time, more and more new needs arise, resulting in the development and regular use of many diverse systems within the same organisation. Hence, in the long term the above advantages are overshadowed by the more serious disadvantages of this approach, such as:

• The existing systems continue to exist as legacy ISs running on older and older technology,
• The need to maintain many different systems to perform similar tasks increases the maintenance and support costs of the organisation,
• Data becomes duplicated in different systems, which implies the maintenance of redundant data, with its associated increased risk of inconsistency between the data copies if updating occurs,
• The overall maintenance cost for hardware, software and support personnel increases, as many platforms are being supported,
• The performance of the integrated information functions of the organisation decreases, due to the need to interface many disparate systems.

To address the above issues, legacy ISs need to be evolved and migrated to new computing environments when their owning organisation upgrades. This migration should occur within a reasonable time after the upgrade. This means that it is necessary to migrate legacy ISs to new target environments in order to allow the organisation to dispose of technology which is becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400 emulators for IBM S/360, and ICL's DME emulators for 1900 and System 4 users).
An alternative strategy is to translate [SHA93, PHI94, SHE94, BRO95] the software to run in new environments (i.e. code-to-code level translation). The emulator approach perpetuates all the software deficiencies of the legacy ISs, although it successfully removes the old-fashioned hardware technology and so enjoys the increased processing power of the new hardware. The translation approach takes advantage of some of the modern technological benefits in the target environment, as conversions such as IBM's JCL and ICL's SCL code to Unix shell scripts, Assembler to COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS tables are also done as part of the translation process. This approach, although a step forward, still carries over most of the legacy code, as legacy systems are not evolved by this process; for example, the basic design is not changed. Hence the barrier to change and/or integration into a common sub-system still remains, and the translated systems were not designed for the environment they are now running in, so they may not be compatible with it.

There are other approaches to overcoming this problem which have been used by enterprises [SHA93, BRO95]. These include re-implementing systems under the new environment and/or upgrading existing systems to achieve performance improvements. As computer technology continues to evolve at an ever quicker pace, the need to migrate arises more rapidly. This means most small organisations and individuals are left behind and are forced to work in a technologically
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or upgrading existing software, as this process involves time and manpower which cost money. The gap between the older and newer system users will very soon create a barrier to information sharing, unless tools are developed to assist the older technology users' migration to new technology environments. This assistance may take many forms, including tools for: analysing and understanding existing systems; enhancing and modifying existing systems; and migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to consider these requirements and many other aspects, as recently identified by Brodie and Stonebraker in [BRO95].

Our work was primarily motivated by these business-oriented legacy database issues, and by work in the area of extending relational database technology to enable it to represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second consideration is an important aspect of legacy system migration, since if a graceful migration is to be achieved we must be able to enhance a legacy relational database with such knowledge to take full advantage of the new system environment.

1.1.2 Heterogeneous Distributed Environments

As well as the problem of having to use legacy ISs, most large enterprises are faced with the problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises due to the increased use, over time, of different computer systems and software tools for information processing within an organisation. The development of networking capabilities to manage and share information stored over a network has made interoperability a requirement, and the broad acceptance of local area networks in business enterprises has increased the need to perform this task within organisations.
Network file servers, client-server technology and the use of distributed databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is currently being used to create and process information held in heterogeneous databases, which involves linking different databases in an interoperable environment. An aspect of this work is legacy database interoperation, since as time passes these databases will have been built using different generations of software.

In recent years, the demand for distributed database capabilities has been fuelled mostly by the decentralisation of business functions in large organisations to address customer needs, and by the mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a strong requirement among enterprises for the ability to cross-correlate data stored in different existing heterogeneous databases. This has led to the development of products referred to as gateways, which enable users to link different databases together; e.g. Microsoft's Open Database Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases together [COL94, RIC94]. There are similar products for other database vendors, such as Oracle (for IBM's DB2, UNISYS's DMS and DEC RMS) [HOL93] and others (for INGRES, SYBASE, Informix and other popular SQL DBMSs) [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-platform compatibility via SQL access protocols to support interoperability in a heterogeneous environment. (During the lifetime of this project the SQL-3 standards moved from a preliminary draft, through several modifications, before being finalised in 1995.) As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems, cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed database access. The biggest challenge comes in addressing heterogeneity due to differences in local databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our system.

Distributed DBMSs have become increasingly popular in organisations, as they offer the ability to interconnect existing databases, as well as having many other advantages [OZS91, BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely homogeneous and heterogeneous distributed DBMSs. In homogeneous systems, all of the constituent nodes run the same DBMS and the databases can be designed in harmony with each other. This simplifies both the processing of queries at different nodes and the passing of data between nodes. In heterogeneous systems the situation is more complex, as each node can be running a different DBMS and the constituent databases can be designed independently. This is the normal situation when we are linking legacy databases, as the DBMSs and the databases used are more likely to be heterogeneous, since they are usually implemented for different platforms during different technological eras. In such a distributed database environment, heterogeneity may occur in various forms, at different levels [OZS91, BEL92], namely:

• The logical level (i.e. involving different database designs),
• The data management level (i.e. involving different data models),
• The physical level (i.e. involving different hardware, operating systems and network protocols), and
• All three, or any pair, of these levels.
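The interconnection of independently designed databases can be illustrated in miniature with SQLite's ATTACH facility, used here purely as a stand-in for the heterogeneous case; the table and column names are invented for the example. Two databases that model the same staff information under different designs are opened in one session and cross-correlated with a single join, much as a gateway product would do across real DBMSs.

```python
import sqlite3

# Two independently designed databases: the same real-world staff data,
# but different table and column naming conventions (logical heterogeneity).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT)")
conn.execute("INSERT INTO emp VALUES (1, 'Jones')")

# Attach a second, separately designed database to the same session.
conn.execute("ATTACH DATABASE ':memory:' AS payroll")
conn.execute("CREATE TABLE payroll.staff_pay (staff_no INTEGER, salary REAL)")
conn.execute("INSERT INTO payroll.staff_pay VALUES (1, 12000.0)")

# Cross-correlate the two databases in one query, as a gateway would.
rows = conn.execute(
    """SELECT e.ename, p.salary
       FROM emp e JOIN payroll.staff_pay p ON e.empno = p.staff_no"""
).fetchall()
print(rows)  # [('Jones', 12000.0)]
```

The join hides where each table physically lives, which is exactly the transparency the gateway products described above aim to provide; what they cannot do, as argued below, is evolve the underlying legacy schemas.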
1.1.3 The Problems and Search for a Solution

The concept of heterogeneity itself is valuable, as it allows designers a freedom of choice between different systems and design approaches, thus enabling them to identify those most suitable for different applications. The exploitation of this freedom over the years in many organisations has resulted in the creation of multiple local and remote information systems which now need to be made interoperable, to provide an efficient and effective information service to enterprise managers.

Open Database Connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed to support interoperability among databases managed by different DBMSs. Database vendors such as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These products allow limited data transfer and query facilities among databases to support interoperability among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and migrate the system to a target environment so that obsolete support systems can be retired. Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable of accessing older-generation DBMSs, and, if they are, are unlikely to be able to enhance them to take advantage of the newer technologies. Hence there is a need to create tools that provide ODBC-equivalent functionality for older-generation DBMSs. Our work provides such functionality for all the DBMSs we have chosen for this research. It also provides the ability to enhance and evolve legacy databases.
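The ODBC idea (one call-level interface over many DBMSs) can be sketched with Python's DB-API, which plays an analogous role: application code is written against a generic connection and cursor interface, and only the connect call is engine-specific. SQLite stands in for the legacy DBMS here, and the table is invented for the example; this is an illustrative sketch, not any vendor's actual gateway API.

```python
import sqlite3

def row_count(conn, table):
    """DBMS-neutral helper: works on any DB-API style connection that
    accepts standard SQL, regardless of the underlying engine.
    (Interpolating the table name is acceptable only in this sketch.)"""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

# Only this line is engine-specific; swapping in another DB-API driver
# (or an ODBC bridge) leaves row_count unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Smith"), (2, "Evans")])
print(row_count(conn, "customer"))  # 2
```

The point of the sketch is the division of labour: the common interface gives transparent access, but it says nothing about enhancing or evolving the schema behind it, which is the gap our work addresses.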
In order to evolve an information system, one needs to understand the existing system's structure and code. Most legacy information systems are not properly documented, and hence understanding such systems is a complex process. This means that changing any legacy code involves a high risk, as it could result in unexpected system behaviour. Therefore one needs to analyse and understand existing system code before making any changes to the system.

Database system design and implementation tools have appeared recently with the aim of helping new information system development. Reverse- and re-engineering tools are also appearing, in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some of these tools allow the examination of databases built using certain types of DBMS; however, the enhancements they allow are made within the limitations of that system. Due to continuous ongoing technology changes, most current commercial DBMSs do not support the most recent software modelling techniques and features (e.g. Oracle version 7 does not support object-oriented features). Hence a system built using current software tools is guaranteed to become a legacy system in the near future (i.e. when new products with newer techniques and features begin to appear in the commercial marketplace).

Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an existing database, and hence they are an ideal starting point when trying to gain a comprehensive understanding of the information held in the database and its current state, as they create a visual picture of that state. However, in legacy systems the schemas are basic, since most of the information used to compose a conceptual model is not available in these databases. Information such as the constraints that show links between entities is usually embedded in the legacy application code, and users find it difficult to reverse engineer these legacy ISs.
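As a small illustration of what a reverse-engineering step recovers, the sketch below reads SQLite's catalogue to list tables and columns, then guesses an inter-entity link from a naming convention: the kind of heuristic needed when the real constraints live only in application code. The schema and the `*_no` convention are invented for the example, and SQLite stands in for a legacy DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A 'legacy' schema: no declared foreign keys, links implicit in names.
conn.execute("CREATE TABLE dept (dept_no INTEGER PRIMARY KEY, dname TEXT)")
conn.execute("CREATE TABLE emp (emp_no INTEGER PRIMARY KEY, ename TEXT, "
             "dept_no INTEGER)")

# Step 1: recover the structural meta-data from the system catalogue.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Step 2: heuristically recover links the legacy schema never declared.
links = []
for t in tables:
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
    for c in cols:
        # A column that names another table's key column suggests a
        # relationship between the two entities.
        for other in tables:
            if other != t and c == f"{other}_no":
                links.append((t, c, other))

print(tables)  # ['dept', 'emp']
print(links)   # [('emp', 'dept_no', 'dept')]
```

A recovered link such as `('emp', 'dept_no', 'dept')` is exactly the raw material from which a conceptual (E-R style) picture of the legacy database can be drawn.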
Our work addresses these issues while assisting in overcoming this barrier within the knowledge representation limitations of existing DBMSs.

1.1.4 Primary and Secondary Motivations

The research reported in this thesis was therefore primarily prompted by the need to provide, for a logically heterogeneous distributed database environment, a design tool that allows users not only to understand their existing systems but also to enhance and visualise an existing database's structure using new techniques that are either not yet present in existing systems or not supported by the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally enhance existing systems and prepare them to be easily migrated to new target environments, hence avoiding continued use of legacy information systems in the organisation. Although previous work and some design tools address the issue of legacy information system analysis, evolution and migration, these are mainly concerned with 3GL languages such as COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model [CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and evolving them by enhancing them with modern concepts supported in more recent releases of software. Most design tools [COMP90, SHA93] which have been developed to assist in Entity-Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling [RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in developing new systems. However, relatively few tools attempt to support a bottom-up approach (i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT diagrams. Among these tools only a very few allow enhancement of the pre-existing database schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those which do permit this action to some extent always operate on a single database management system and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools that permit only the bottom-up approach are referred to as reverse-engineering tools, and those which support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is primarily concerned with creating re-engineering tools that assist legacy database migration. The commercially available re-engineering tools are customised for particular DBMSs and are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-engineering tools means that a substantial adaptation and reprogramming effort (costing time and money) is involved every time a new DBMS appears in a heterogeneous environment.
An obvious example that reflects this limitation arises in a heterogeneous distributed database environment where there may be a need to visualise each participant database's schema. In such an environment, if the heterogeneity occurs at the database management level (where each node uses a different DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we have to use two different re-engineering tools to display these schemas. This situation is exacerbated for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy databases are migrated to different DBMS environments as newer versions and better database products have appeared since the original release of their DBMS. This means that a re-engineering tool that assists legacy database migration must work in a heterogeneous environment so that its use will not be restricted to particular types of ISs. Existing re-engineering tools provide a single target graphical data model (usually the E-R model or a variant of it), which may differ in presentation style between tools and therefore inhibits the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed database environment. This limitation means that users may need to use different tools to provide the required uniformity of display in such an environment. The ability to visualise the conceptual model of an information system using a user-preferred graphical data model is important as it ensures that no inaccurate enhancements are made to the system due to any misinterpretation of graphical notations used.

c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent legacy data, as preparation for migration or as an enhancement of the database's quality.
The inability to define and apply rules and constraints on early database systems, due to system limitations, meant that these systems did not use constraints to increase the accuracy and consistency of the data they held. This limitation is now a barrier to information system migration, as a new target DBMS is unable to enforce constraints on a migrated database until all violations are investigated and resolved, either by omitting the violating data or by cleaning it. This investigation may also show that a constraint has to be adjusted because the violating data is needed by the organisation. The enhancement of such a system with rules and constraints provides knowledge that can be used to determine possible data violations. The process of detecting constraint violations may be done by applying queries that are generated from these enhanced constraints. Similar methods have been used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional answers [FON92, MOT89]. This is essential as constraints may have been implemented at the application coding level, and that can lead to their inconsistent application.

d) An awareness of the potential contribution that knowledge-based systems and meta-programming technologies, in association with extended relational database technology, have to offer in coping with semantic heterogeneity. The successful production of a conceptual model is highly dependent on the semantic information available, and on the ability to reason about these semantics. A knowledge-based system can be used to assist in this task, as the process to generalise effective exploitation of semantic information for pre-existing heterogeneous databases needs to undergo three sub-processes, namely: knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts the existing knowledge from a database's data dictionaries.
This knowledge may include subsequent enhancements made by the user, as the use of a database to store such knowledge will provide easy access to this information along with its original knowledge. The knowledge representation process represents existing and enhanced knowledge. The knowledge manipulation process is concerned with deriving new knowledge and ensuring consistency of existing knowledge. These stages are addressable using specific processes. For instance, the reverse-engineering process used to produce a conceptual model can be used to perform the knowledge acquisition task. Then the derived and enhanced knowledge can be stored in the same database by adopting a process that will allow us to distinguish this knowledge from its original meta-data. Finally, knowledge manipulation can be done with the assistance of a Prolog based system [GRA88], while data and knowledge consistency can be verified using the query language of the database.

1.2 Goals of the Research

The broad goals of the research reported in this thesis are highlighted here, with detailed aims and objectives presented in section 2.4. These goals are to investigate interoperability problems, schema enhancement and migration in a heterogeneous distributed database environment, with particular emphasis on extended relational systems. This should provide a basis for the design and implementation of a prototype software system that brings together new techniques from the areas of knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of facilitating schema enhancement, by means of generalising the efficient representation of constraints using the current standards. Such a system is a tool that would be a valuable asset in a logically heterogeneous distributed extended relational database environment as it would make it possible for
global users to incrementally enhance legacy information systems. This offers the potential for users in this type of environment to work in terms of such a global schema, through which they can prepare their legacy systems to easily migrate to target environments and so gain the benefits of modern computer technology.

1.3 Original Achievements of the Research

The importance of this research lies in establishing the feasibility of enhancing, cleaning and migrating heterogeneous legacy databases using meta-programming technology, knowledge-based system technology, database system technology and O-O conceptual data modelling concepts, to create a comprehensive set of techniques and methods that form an efficient and useful generalised database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring are also demonstrated and assessed. A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES) [WIK95a] has been developed as a result of the research. To be more specific, our work has made four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very important application area of enhancing and evolving heterogeneous distributed legacy databases to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their visual presentation and hence provide a better understanding of existing applications [WIK95b]. This process is applicable to any relational database application, including those which are unable to naturally support the specification and enforcement of constraints. More importantly, this process does not affect the performance of an existing application.
3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for knowledge representation in our research. This project provides an extension to the representation of the relational data model to cope with automated reuse of knowledge in the re-engineering process. In order to cope with technological changes that result from the emergence of new systems or new versions of existing DBMSs, we also propose a series of extended relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs [WIK95b].

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an easy and convenient method of detecting any constraint violating data in existing systems. The application of this technique in the context of a heterogeneous environment for legacy information systems is a significant step towards detecting and cleaning inconsistent data in legacy systems prior to their migration. This is essential if a graceful migration is to be effected [WIK95c].

1.4 Organisation of the Thesis
The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering background and motivations, and outlining original achievements. The rest of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and objectives for the work undertaken. It begins by identifying the scope of the work in terms of research constraints and development technologies. This is followed by an overview of the research undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous distributed database environment is given. Finally, detailed aims and objectives are drawn together to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents its development along with its terminology, features and query languages. This is followed by a discussion of conceptual data models with special emphasis on the data models and symbols used in our project. Finally, we pay attention to key concepts related to our project, mainly the notion of semantic integrity constraints and extensions to the relational model. Here, we present important integrity constraint extensions to the relational model and its support using different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences with an introduction to legacy and our target information systems. This is followed by migration strategies and methods for such ISs. Finally, we conclude by referring to current techniques and identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational legacy database, is described next.
This is followed by a process for detecting possible keys and structures of legacy databases. Our schema enhancement and knowledge representation techniques are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy heterogeneous distributed database environment and its access processes. Initially, we present the design of our test databases, the selection of our test DBMSs and the prototype system environment. This is followed by the application of our re-engineering approach to our test databases. Finally, the organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our Conceptual Constraint Visualisation and Enhancement System (CCVES) in terms of the design, structure and operation of its interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could be made to the system. A discussion of potential applications is presented. Finally we conclude the
chapter by drawing conclusions about the research project as a whole.
CHAPTER 2

Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the scope of the project. Secondly, an overview of the research approach we have adopted in dealing with heterogeneous distributed legacy database evolution and migration is given in section 2.2. Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a heterogeneous distributed database environment using the existing meta-programming technology developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the stated aims.

2.1 Scope of the Project

We identify the scope of the work in terms of research constraints and the limitations of current development technologies. An overview of the problem is presented along with the drawbacks and limitations of database software development technology in addressing the problem. This will assist in identifying our interests and focussing the issues to be addressed.

2.1.1 Overview of the Problem

In most database designs, a conceptual design and modelling technique is used in developing the specifications at the user requirements and analysis stage of the design. This stage usually describes the real world in terms of object/entity types that are related to one another in various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray the current information content of existing databases, as the original designs are usually either lost, or inappropriate because the database has evolved from its original design.
The resulting pictorial representation of a database can be used for database maintenance, for database re-design, for database enhancement, for database integration or for database migration, as it gives its users a sound understanding of an existing database's architecture and contents. Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture and presentation of database definitions from an existing database, and the analysis and display of this information at a higher level of abstraction. Furthermore, these tools are either restricted to accessing a specific database management system's databases or permit modelling with only a single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a need to cater for multiple database platforms with different user needs to allow access to a set of databases comprising a heterogeneous database, by providing a facility to visualise databases using a preferred conceptual modelling technique which is familiar to the different user communities of the heterogeneous system. The fundamental modelling constructs of current reverse and re-engineering tools are entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules and constraints in modern DBMSs provides their users with a better understanding of the underlying database as its data conforms to these constraints. This may not necessarily be true for legacy systems, which may have constraints defined that were not enforced. The ability to visualise rules and constraints as part of the conceptual model increases user understanding of a database. Users could also exploit this information to formulate queries that more effectively utilise the information held in a database. Having these features in mind, we concentrated on providing a tool that permits specification and visualisation of constraints as part of the graphical display of the conceptual model of a database. With modern technology increasing the number of legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95], the availability of such a visualisation tool will be more important in future as it will let users see the full definition of the contents of their databases in a familiar format. Three types of abstraction mechanism, namely: classification, aggregation and generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not maintain sufficient meta-data information to assist in identifying all these abstraction mechanisms within their data models. This means that reverse and re-engineering tools are semi-automated, in that they extract information, but users have to guide them and decide what information to look for [WAT94]. This requires interactions with the database designer in order to obtain missing information and to resolve possible conflicts. Such additional information is supplied by the tool users when performing the reverse-engineering process.
As this additional information is not retained in the database, it must be re-entered every time a reverse engineering process is undertaken if the full representation is to be achieved. To overcome this problem, knowledge bases are being used to retain this information when it is supplied. However, this approach restricts the use of this knowledge by other tools which may exist in the database’s environment. The ability to hold this knowledge in the database itself would enhance an existing database with information that can be widely used. This would be particularly useful in the context of legacy databases as it would enrich their semantics. One of the issues considered in this thesis is how this can be achieved. Most existing relational database applications record only entities and their properties (i.e. attribute names and data types) as system meta-data. This is because these systems conformed to early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5 and Oracle version 5). However, more recent relational systems record additional information such as constraint and rule definitions, as they conform to the SQL/92 standards [ANSI92] (e.g. Oracle version 7). This additional information includes, for example, primary and foreign key specifications, and can be used to identify classification and aggregation abstractions used in a conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies. This means that early relational database applications are now legacy systems as they fail to naturally represent additional information such as constraint and rule definitions. Such legacy database systems are being migrated to modern database systems not only to gain the benefits of the current technology but also to be compatible with new applications built with the modern technology. 
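The use of key specifications to suggest modelling abstractions, as just described, can be sketched as a simple heuristic. The rule below is our own illustrative assumption, not an algorithm from the literature: a foreign key that coincides with the whole primary key suggests a generalisation (subtype) link, a foreign key that is a proper part of the primary key suggests an identifying aggregation, and any other foreign key an ordinary association.

```python
# Heuristic classification of an inter-relation link from its key
# specification. fk_cols: columns of the foreign key in the child relation;
# pk_cols: columns of the child relation's primary key.
def classify_link(fk_cols, pk_cols):
    if set(fk_cols) == set(pk_cols):
        return "generalisation"   # child's identity IS the parent's: subtype
    if set(fk_cols) < set(pk_cols):
        return "aggregation"      # identifying (weak-entity style) link
    return "association"          # plain referential link

print(classify_link(["eno"], ["eno"]))          # → generalisation
print(classify_link(["dno"], ["dno", "proj"]))  # → aggregation
print(classify_link(["dno"], ["eno"]))          # → association
```

A heuristic of this kind is necessarily incomplete, which is why the reverse-engineering process remains semi-automated and asks the designer to confirm or correct each suggested abstraction.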
The SQL standards are currently subject to review to permit the representation of extra knowledge (e.g. object-oriented features), and we have anticipated some of these proposals in our work, i.e. SQL-3 [ISO94] will be adopted by commercial systems and thus the current modern DBMSs
will become legacy databases in the near future or already may be considered to be legacy databases in that their data model type will have to be mapped onto the newer version. Having observed the development process of recent DBMSs, we believe it is inevitable that most current databases will have to be migrated, either to a newer version of the existing DBMS or to a completely different newer technology DBMS, for a variety of reasons. Thus the migration of legacy databases is perceived to be a continuing requirement, in any organisation, as technology advances continue to be made. Most migrations currently being undertaken are based on code-to-code level translations of the applications and associated databases to enable the older system to be functional in the target environment. Minimal structural changes are made to the original system and database, thus the design structures of these systems are still old-fashioned, although they are running in a modern computing environment. This means that such systems are inflexible and cannot be easily enhanced with new functions or integrated with other applications in their new environment. We have also observed that more recent database systems have often failed to benefit from modern database technology due to inherent design faults that have resulted in the use of unnormalised structures, which cause omission of the features enforcing integrity constraints even when this is possible. The ability to create and use databases without the benefit of a database design course is one reason for such design faults. Hence there is a need to assist existing systems to be evolved, not only to perform new tasks but also to improve their structure so that these systems can maximise the gains they receive from their current technology environment and any environment they migrate to in the future.
2.1.2 Narrowing Down the Problem

Technological advances in both hardware and software have improved the performance and maintenance functionality of all information systems (ISs), and as a result, older ISs suffer from comparatively poor performance and inappropriate functionality when compared with more modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been around for many years, and run on old-fashioned mainframes. Problems associated with legacy systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95]. These systems basically have three functional components, namely: interface, application and a database service, which are sometimes inter-related to each other, depending on how they were used during the design and implementation stages of the IS development. This means that the complexity of a legacy IS depends on what occurred during the design and implementation of the system. These systems may range from a simple single user database application using separate interfaces and applications, to a complex multi-purpose unstructured application. Due to the complex nature of the problem area we do not address this issue as a whole, but focus only on problems associated with one sub-component of such legacy information systems, namely the database service. This in itself is a wide field, and we have further restricted ourselves to legacy ISs using a specific DBMS for their database service. We considered data models ranging from original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs. From these data models we have chosen the traditional relational model for the following reasons.

• The relational model is currently the most widely used database model.
• During the last two decades the relational model has been the most popular model; therefore it has been used to develop many database applications and most of these are now legacy systems.

• There have been many extensions and variations of the relational model, which has resulted in many heterogeneous relational database systems being used in organisations.

• The relational model can be enhanced to represent additional semantics currently supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

As most business requirements change with time, the need to enhance and migrate legacy information systems exists for almost every organisation. We address problems faced by these users while seeking a solution that prevents new systems becoming legacy systems in the near future. The selection of the relational model as our database service to demonstrate how one could achieve these needs means that we shall be addressing only relational legacy database systems and not looking at any other type of legacy information systems. This decision means we are not considering the many common legacy IS migration problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-file structures or hierarchical databases into modern extended relational databases; migration of legacy applications with millions of lines of code written in some COBOL-like language into a modern 4GL/GUI environment). However, as shown later, addressing the problems associated with relational legacy databases has enabled us to identify and solve problems associated with more recent DBMSs, and it also assists in identifying precautions which, if implemented by designers of new systems, will minimise the chance of similar problems being faced by these systems as IS developments occur in the future.
2.2 Overview of the Research Approach

Having presented an overview of our problem and narrowed it down, we identify the following as the main functionalities that should be provided to fulfil our research goal:

• Reverse-engineering of a relational legacy database to fully portray its current information content.

• Enhancing a legacy database with new knowledge to identify modelling concepts that should be available to the database concerned or to applications using that database.

• Determining the extent to which the legacy database conforms to its existing and enhanced descriptions.

• Ensuring that the migrated IS will not become a legacy IS in the future.

We need to consider the heterogeneity issue in order to be able to reverse-engineer any given relational legacy database. Three levels of heterogeneity are present for a particular data model, namely: at a physical, logical and data management level. The physical level of heterogeneity usually arises due to different data model implementation techniques, use of different computer platforms and use of different DBMSs. The physical / logical data independence of DBMSs hides implementation differences from users, hence we need only address how to access databases that are built using different DBMSs, running on different computer platforms.
Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a particular database query language (e.g. SQL or QUEL) and different relational data model features (e.g. handling of integrity constraints and availability of object-oriented features). To tackle heterogeneity at the logical level, we need to be aware of different standards, and to model ISs supporting different features and query languages. Heterogeneity at the data management level arises due to the physical limitations of a DBMS, differences in the logical design and inconsistencies that occurred when populating the database. Logical differences in different database schemas have to be resolved only if we are going to integrate them. The schema integration process is concerned with merging different related database applications. Such a facility can assist the migration of heterogeneous database systems. However, any attempt to integrate legacy database schemas prior to the migration process complicates the entire process, as it is similar to attempting to provide new functionalities within the system which is being migrated. Such attempts increase the chance of failure of the overall migration process. Hence we consider any integration or enhancements in the form of new functionalities only after successfully migrating the original legacy IS. However, the physical limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand to ensure a successful migration. Our work addresses the heterogeneity issues associated with database migration by adopting an approach that allows its users to incrementally increase the number of DBMSs it can handle without having to reprogram its main application modules. Here, the user needs to supply specific knowledge about DBMS schema and query language constructs.
This is held together with the knowledge of the DBMSs already supported and has no effect on the application's main processing modules.

2.2.1 Meta-Programming

Meta-programming technology allows the meta-data (schema information) of a database to be held and processed independently of its source specification language. This allows us to work in a database language independent environment and hence overcome many logical heterogeneity issues. Prolog based meta-programming technology has been used in previous research at Cardiff in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation of database query languages [HOW87] and database schemas [RAM91] has been performed. This work has shown how the heterogeneity issues of different DBMSs can be addressed without having to reprogram the same functionality for each and every DBMS. We use meta-programming technology for our legacy database migration approach as we need to be able to start with a legacy source database and end with a modern target database where the respective database schema and query languages may be different from each other. In this approach the source database schema or query language is mapped on input into an internal canonical form. All the required processing is then done using the information held in this internal form. This information is finally mapped to the target schema or query language to produce the desired output. The advantage of this approach is that processing is not affected by heterogeneity as it is always performed on data held in the canonical form. This canonical form is an enriched collection of semantic data modelling features.
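The source → canonical form → target pipeline just described can be sketched in miniature. The thesis's mappings are written in Prolog; the Python dictionary below merely stands in for the canonical form, and the dialect name, type map and `emit` function are our own illustrative assumptions.

```python
# A schema mapped on input into a DBMS-independent canonical form.
# All processing works only on this form, never on source-specific text.
canonical = {
    "name": "emp",
    "attributes": [("eno", "integer"), ("ename", "string")],
    "primary_key": ["eno"],
}

# Per-target knowledge: supplying a new dialect's table here adds support
# for a new DBMS without touching the emit logic below.
TYPE_MAP = {"sqlish": {"integer": "INTEGER", "string": "VARCHAR(80)"}}

def emit(schema, dialect):
    """Map the canonical form to a target data definition."""
    types = TYPE_MAP[dialect]
    cols = ", ".join(f"{n} {types[t]}" for n, t in schema["attributes"])
    pk = ", ".join(schema["primary_key"])
    return f"CREATE TABLE {schema['name']} ({cols}, PRIMARY KEY ({pk}))"

print(emit(canonical, "sqlish"))
# → CREATE TABLE emp (eno INTEGER, ename VARCHAR(80), PRIMARY KEY (eno))
```

The point of the design is visible even at this scale: adding a target dialect means adding data (a type map and, in the full system, grammar rules), not reprogramming the processing modules.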
2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being the actual migration and the earlier stages being preparatory. At stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2 of figure 2.1). However, in legacy systems much of the information needed to present the database schema in this way is not available as part of the database meta-data, and hence links which are present in the database cannot be shown in this conceptual model. In modern systems such links can be identified using constraint specifications. Thus, if the database does not have any explicit constraints, or it does but these are incomplete, new knowledge about the database needs to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will identify new links that should be present for the database concerned. These new database constraints can next be applied experimentally to the legacy database to determine the extent to which it conforms to them. This is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1). The user can then decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration. At this point the three preparatory stages in the application of our approach are complete, and the actual migration process is then performed. All stages are further described below, to enable us to identify the main processing components of our proposed system as well as to explain how we deal with different levels of heterogeneity.

Stage 1: Reverse Engineering

In stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display of the database. To perform this task, the database’s meta-data must be extracted (cf. path A-1 of figure 2.1).
This is achieved by connecting directly to the heterogeneous database. The accessed meta-data then needs to be represented in our internal form, which is achieved through a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos [RAM91]. The meta-data in our internal formalism then needs to be processed to derive the graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These constructs are in the form of entity types and relationships, and their derivation process is the main processing component of stage 1. The identified graphical constructs are mapped to a display description language to produce a graphical display of the database.
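The derivation step can be sketched as follows: tables yield entity types, and foreign-key references in the meta-data yield relationships. This is a minimal sketch; the meta-data layout used here is hypothetical, not the thesis's internal formalism:

```python
# Sketch of the Stage 1 derivation step: entity types come from tables,
# relationships from foreign-key references held in the meta-data.
# The dictionary layout below is invented for illustration.

meta = {
    "employee": {"columns": ["empno", "name", "deptno"],
                 "foreign_keys": [("deptno", "department")]},
    "department": {"columns": ["deptno", "dname"],
                   "foreign_keys": []},
}

def derive_constructs(meta):
    """Return the entity types and the (referencing, referenced) pairs."""
    entities = sorted(meta)
    relationships = [(src, ref_table)
                     for src, info in meta.items()
                     for _col, ref_table in info["foreign_keys"]]
    return entities, relationships

entities, rels = derive_constructs(meta)
print(entities)  # ['department', 'employee']
print(rels)      # [('employee', 'department')]
```

In a legacy database the `foreign_keys` entries are exactly what is typically missing from the meta-data, which is why stage 2 solicits them from the user.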
[Figure 2.1: Information flow in the 3 stages of our approach prior to migration]

a) Database connectivity for heterogeneous database access

Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which addressed heterogeneity at the logical and data management levels, our system looks at the physical level as well. While these previous systems processed schemas in textual form and did not access actual databases to extract their DDL specification, our system addresses physical heterogeneity by accessing databases running on different hardware/software platforms (e.g. computer systems, operating systems, DBMSs and network protocols). Our aim is to directly access the meta-data of a given database application by specifying its name, the name and version of the host DBMS, and the address of the host machine [4]. If this database access process can produce a description of the database in DDL formalism, then this textual file is used as the starting point for the meta-translation process, as in previous Cardiff systems [RAM91, QUT92]. We found that it is not essential to produce such a textual file, as the required intermediate representation can be produced directly by the database access process. This means that we can also by-pass the meta-translation step that analyses the DDL text to translate it into the intermediate representation [5]. However, the DDL formalism of the schema can be used for optional textual viewing, and could also serve as the starting point for other tools [6] developed at Cardiff for meta-programming database applications.
The initial functionality of the Stage 1 database connectivity process is to access a heterogeneous database and supply the accessed meta-data as input to our schema meta-translator (SMTS). This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using DML commands of the specific DBMS to extract the required meta-data held in database data dictionaries, which are treated like user-defined tables. Relatively recently, the functionality of a heterogeneous database access process has been provided by means of drivers such as ODBC [RIC94]. Use of such drivers allows access to any database supported by them, and hence obviates the need to develop specialised tools for each database type, as happened in our case. These driver products were not available when we undertook this stage of our work.

[4] We assume that access privileges for this host machine and DBMS have been granted.
[5] A list of tokens ready for syntactic analysis in the parsing phase is produced and processed based on the BNF syntax specification of the DDL [QUT92].
[6] e.g. the Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].

b) Schema meta-translation

The schema meta-translation process [RAM91] accepts as input any database schema, irrespective of its DDL and features. The information captured during this process is represented internally, to enable it to be mapped from one database schema to another or to be further processed and supplied to other modules such as the schema meta-visualisation system (SMVS) [QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal canonical form for meta representation has successfully accommodated heterogeneity at the data management and logical levels.

c) Schema meta-visualisation

Schema visualisation using graphical notation and diagrams has proved to be an important step in a number of applications, e.g. during the initial stages of the database design process, and for database maintenance, re-design, enhancement, integration and migration, as it gives users a sound understanding of an existing database’s structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual picture of their database structure instead of textual descriptions of the defining schema, as it is easier for them to comprehend a picture.
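The catalogue-querying idea, i.e. treating the data dictionary like a user-defined table and reading it with ordinary query statements, can be illustrated with SQLite, whose `sqlite_master` catalogue stands in here for a commercial DBMS's data dictionary (the thesis's tools targeted systems such as INGRES and Oracle; SQLite is used only so the sketch is self-contained):

```python
# Sketch of meta-data extraction via DML: the system catalogue is queried
# with plain SELECT statements, just as if it were a user table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (empno INTEGER PRIMARY KEY, name TEXT)")

# Table names come from the catalogue table sqlite_master.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Column-level meta-data (name, type, keys) comes from PRAGMA table_info.
columns = [row[1] for row in con.execute("PRAGMA table_info(employee)")]

print(tables)   # ['employee']
print(columns)  # ['empno', 'name']
```

The rows returned would then be mapped into the internal canonical form by SMTS, exactly as if they had been parsed from a textual DDL file.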
This has led to the production of graphical representations of schema information, effected by a reverse engineering process. Graphical data models of schemas employ a set of data modelling concepts and a language-independent graphical notation (e.g. the Entity Relationship (E-R) model [CHE76], the Extended/Enhanced Entity Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a heterogeneous environment different users may prefer different graphical models, and an understanding of the database structure and architecture beyond that given by the traditional entities and their properties. Therefore, there is a need to produce graphical models of a database’s schema using different graphical notations, such as E-R/EER or OMT, and to accompany them with additional information such as a display of the integrity constraints in force in the database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-object constraints and gain a better understanding of the domain restrictions applicable to particular entities. Current reverse engineering tools do not support this type of display. The generated graphical constructs are held internally in a form similar to the meta-data of the database schema. Hence, using a schema meta-visualisation process (SMVS), it is possible to map the internally held graphical constructs into appropriate graphical symbols and coordinates for the graphical display of the schema. This approach has a similarity to the SMTS, the main difference being that the output is graphical rather than textual.

Stage 2: Knowledge Augmentation

In a heterogeneous distributed database environment, evolution is expected, especially in legacy databases. This evolution can affect the schema description, and in particular schema constraints that are not reflected in the stage 1 (path A-2) graphical display because they are implicit in applications. Thus our system is designed to accept new constraint specifications (cf. path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that these hidden constraints become explicit. The new knowledge accepted at this point is used to enhance the schema and is retained in the database using a database augmentation process (cf. path B-3 of figure 2.1). The new information is stored in a form that conforms with the enhanced target DBMS’s methods of storing such information. This assists the subsequent migration stage.

a) Schema enhancement

Our system needs to permit a database schema to be enhanced by specifying new constraints applicable to the database. This process is performed via the graphical display. These constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural components (e.g. inheritance hierarchies, entity modifications), are specified using a GUI. When they are entered they will appear in the graphical display.

b) Database augmentation

The input data used to enhance a schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself if it is to be readily available for any further processing. Typically, this information is retained in the knowledge base of the tool used to capture the input data, so that it can be reused by the same tool. This approach restricts the use of this knowledge by other tools, and hence it must be re-entered every time the re-engineering process is applied to that database.
This makes it harder for the user to gain a consistent understanding of an application, as different constraints may be specified during two separate re-engineering processes. To overcome this problem, we augment the database itself using the techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3 structures, we store the information in our own augmented table format, which is a natural extension of the SQL-3 approach. When a database is augmented using this method, the new knowledge is available in the database itself. Hence, any further re-engineering processes need not make requests for the same additional knowledge. The augmented tables are created and maintained in a similar way to user-defined tables, but have a special identification to distinguish them. Their structure is in line with the international standards and the newer versions of commercial DBMSs, so that the enhanced database can be easily migrated either to a newer version of the host DBMS or to a different DBMS supporting the latest SQL standards. Migration should then mean that the newer system can enforce the constraints. Our approach should also mean that it is easy to map our tables for holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a well defined structure.

Legacy databases that do not support explicit constraints can be enhanced by using the above knowledge augmentation method. This requirement is less likely to arise for databases managed by more recent DBMSs, as they already hold some constraint specification information in their system tables. The direction taken by Oracle version 6 was a step towards our augmentation approach, as it allowed the database administrator to specify integrity constraints such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of Oracle, i.e. version 7, implemented this constraint enforcement process.

Stage 3: Constraint Enforcement

The enhanced schema can be held in the database, but the DBMS can only enforce these constraints if it has the capability to do so. This will not normally be the case in legacy systems. In this situation, the new constraints may be enforced via a newer version of the DBMS or by migrating the database to another DBMS supporting constraint enforcement. However, the data held in the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS during the migration, thus losing data and/or delaying the migration process. To address this problem and to assist the migration process, we provide an optional constraint enforcement module which can be applied to a database before it is migrated. The objective of this process is to give users the facility to ensure that the database conforms to all the enhanced constraints before migration occurs. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy data prior to its migration, whether the data is best left as it stands, or whether the new constraints are too severe.
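The checking step of this enforcement process amounts to turning each stored constraint definition into a query that retrieves the violating rows. The sketch below illustrates the idea for two constraint kinds; the constraint-record layout and function name are invented for illustration, and a real implementation would generate DML in the host DBMS's own dialect:

```python
# Sketch of Stage 3 checking: each constraint definition is translated into
# a query whose result set is exactly the rows violating that constraint.
# The dictionary layout used for constraint records is hypothetical.

def violation_query(c):
    """Generate SQL that selects the rows violating constraint c."""
    if c["kind"] == "primary_key":
        # Violations are key values occurring more than once.
        col = c["column"]
        return (f"SELECT {col} FROM {c['table']} "
                f"GROUP BY {col} HAVING COUNT(*) > 1")
    if c["kind"] == "foreign_key":
        # Violations are rows whose key value has no matching parent row.
        return (f"SELECT * FROM {c['table']} t WHERE NOT EXISTS "
                f"(SELECT 1 FROM {c['ref_table']} r "
                f"WHERE r.{c['ref_column']} = t.{c['column']})")
    raise ValueError(f"unsupported constraint kind: {c['kind']}")

fk = {"kind": "foreign_key", "table": "employee", "column": "deptno",
      "ref_table": "department", "ref_column": "deptno"}
print(violation_query(fk))
```

An empty result for every generated query indicates that the legacy data already conforms, so migration to a constraint-enforcing target DBMS should not reject any rows.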
The constraint definitions in the augmented schema are employed to perform this task. As all the constraints held have already been internally represented in the form of logical expressions, these can be used to produce data manipulation statements suitable for the host DBMS. Once these statements are produced, they are executed against the current database to identify any data violating a constraint.

Stage 4: Migration Process

The migration process itself is performed incrementally, by initially creating the target database and then copying the legacy data over to it. The schema meta-translation (SMTS) technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can be copied using the import/export tools of the source and target DBMSs, or DML statements of the respective DBMSs. During this process, the legacy applications must continue to function until they too are migrated. To achieve this, an interface can be used to capture and process all database queries of the legacy applications during migration. This interface can decide how to process database queries against the current state of the migration, and re-direct to the target database those queries which now relate to it. The query meta-translation (QMTS) technique of Howells [HOW87] can be used to convert these queries to the target DML. This approach will facilitate transparent migration for legacy databases. Our work does not involve the development of an interface to capture and process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy application code. However, we demonstrate how to create and populate a legacy database schema in the desired target environment, while showing the role of SMTS and QMTS in such a process.

2.3 The Role of CCVES in the Context of Heterogeneous Distributed Databases

Our approach described in section 2.2 is based on preparing a legacy database schema for graceful migration. This involves visualising database schemas with their constraints, and enhancing them with further constraints to capture more knowledge. Hence we call our system the Conceptualised Constraint Visualisation and Enhancement System (CCVES). CCVES has been developed to fit in with the previously developed schema (SMTS) [RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of CCVES, SMTS, QMTS and SMVS during heterogeneous distributed database access in a uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and promotes interoperability in a heterogeneous environment at the logical, physical and data management levels.

Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous distributed databases. It outlines in general terms the process of accessing a remote (legacy) database to perform various database tasks, such as querying, visualisation, enhancement, migration and integration. There are seven sub-processes: the schema mapping process [RAM91], query mapping process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93], database connectivity process, database enhancement process and database migration process.
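The incremental Stage 4 procedure described above, creating each target table first and then copying its legacy rows across, can be sketched as follows. Both databases are SQLite in-memory connections purely so that the sketch is self-contained; in the thesis's setting the target DDL would be produced by SMTS and the copying statements by QMTS:

```python
# Sketch of Stage 4: migrate one table at a time by creating it in the
# target database and then copying the legacy rows over via DML.
import sqlite3

source = sqlite3.connect(":memory:")  # stands in for the legacy DBMS
target = sqlite3.connect(":memory:")  # stands in for the modern DBMS

source.execute("CREATE TABLE employee (empno INTEGER, name TEXT)")
source.executemany("INSERT INTO employee VALUES (?, ?)",
                   [(1, "Smith"), (2, "Jones")])

# Step 1: issue the (translated) DDL against the target.
target.execute("CREATE TABLE employee (empno INTEGER, name TEXT)")

# Step 2: extract the legacy data and insert it into the target.
rows = source.execute("SELECT empno, name FROM employee").fetchall()
target.executemany("INSERT INTO employee VALUES (?, ?)", rows)

count = target.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(count)  # 2
```

Because the copy proceeds table by table, the legacy system can remain in service, with an interposed query interface deciding, per table, whether a query should still go to the source or already to the target.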
The first two processes together have been called the Integrated Translation Support Environment [FID92], and the first four processes together have been called the Meta-Integration/Translation Support Environment [QUT92]. The last three processes were introduced as CCVES to perform database enhancement and migration in such an environment. The schema mapping process, referred to as SMTS, translates the definition of a source schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles heterogeneity at the logical level in a distributed environment containing multiple database schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema): it integrates the local schemas to create the global schema. The meta-visualisation process, referred to as SMVS, generates a graphical representation of a schema. The remaining three processes, namely database connectivity, enhancement and migration, together with their associated processes SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES (centre section of figure 2.2).

The database connectivity process (DBC) queries meta-data from a remote database (route A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process, SMTS. SMTS translates this meta-knowledge into an internal representation based on SQL schema constructs. These SQL constructs are supplied to SMVS for further processing (route A-3 in figure 2.2), which results in the production of a graphical view of the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to identify the entity and relationship types to be used in the graphical model. Meta-knowledge enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in figure 2.2), which allows the definition of new constraints and changes to the existing schema. These enhancements are reflected in the graphical view (routes B-2 and B-3 in figure 2.2) and may be used to augment the database (routes B-4 to B-8 in figure 2.2). This approach to augmentation makes use of the query mapping process, QMTS, to generate the queries required to update the database via the DBC process. At this stage any existing or enhanced constraints may be applied to the database to determine the extent to which it conforms to the new enhancements. Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS due to possible violations. Finally, the database migration process, referred to as DBMI, assists migration by incrementally migrating the database to the target environment (routes C-1 to C-6 in figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and DDL statements are issued to the target DBMS to create the new database schema. The data for the migrated tables are extracted by instructing the source DBMS to export the source data to the target database via QMTS. Here too, the queries which implement this export are issued to the DBMS via the DBC process.
2.4 Research Aims and Objectives

Our relational database enhancement and augmentation approach is important in three respects:

1) by holding the additional defining information in the database itself, this information is usable by any design tool, in addition to assisting the full automation of any future re-engineering of the same database;

2) it allows better user understanding of database applications, as the associated constraints are shown in addition to the traditional entities and attributes at the conceptual level;
3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration. To perform this latter task in a real world situation without an automated support tool is very difficult, tedious, time consuming and error prone.

Therefore the main aim of this project has been the design and development of a tool to assist database enhancement and migration in a heterogeneous distributed relational database environment. Such a system is concerned with enhancing the constituent databases in this type of environment to exploit potential knowledge, both to automate the re-engineering process and to assist in evolving and cleaning the legacy data, so as to prevent data rejection, possible loss of data and/or delays in the migration process. To this end, the following detailed aims and objectives have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a heterogeneous distributed relational legacy database environment, in order to fully understand these processes.

2. Identification of the conceptual foundation on which to successfully base the design and development of a tool for this purpose. This foundation includes:
• A framework to establish meta-data representation and manipulation.
• A real world data modelling framework that facilitates the enhancement of existing working systems and which supports applications during migration.
• A framework to retain the enhanced knowledge for future use which is in line with current international standards and techniques used in newer versions of relational DBMSs.
• Exploiting existing databases in new ways, particularly linking them with data held in other legacy systems or more modern systems.
• Displaying the structure of databases in a graphical form to make it easy for users to comprehend their contents.
• The provision of an interactive graphical response when enhancements are made to a database.
• A higher level of data abstraction for tasks associated with visualising the contents, relationships and behavioural properties of entities and constraints.
• Determining the constraints on the information held and the extent to which the data conforms to these constraints.
• Integration with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration assisting tasks as far as possible. The following development aims have been chosen for this system:
• It should provide a realistic solution to the schema enhancement and migration assistance process.
• It should be able to access and perform this task for legacy database systems.
• It should be suitable for the data model at which it is targeted.
• It should be as generic as possible so that it can be easily customised for other data models.
• It should be able to retain the enhanced knowledge for future analysis by itself and other tools.
• It should logically support a model using modern data modelling techniques, irrespective of whether this model is supported by the DBMS in use.
• It should make extensive use of modern graphical user interface facilities for all graphical displays of the database schema.
• Graphical displays should also be as generic as possible so that they can be easily enhanced or customised for other display methods.
CHAPTER 3

Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are initially presented here, to set in context the evolution of ISs and the emergence of database models. The relational data model is identified as currently the most commonly used database model, and some terminology for this data model, along with its features, including query languages, is then presented. A discussion of conceptual data models, with special emphasis on EER and OMT, is provided to introduce these data models and the symbols used in our project. Finally, we pay attention to concepts crucial to our work, namely the notion of semantic integrity constraints, with special emphasis on those used in semantic extensions to the relational model. The relational database language SQL is also discussed, identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

The origin of data management goes back to the 1950s, and hence this section is subdivided into two parts: the first part describes database technology prior to the relational data model, and the second describes developments since. This division was chosen as the relational model is currently the most dominant database model for information management [DAT90].

3.1.1 Database Technology Prior to the Relational Data Model

Database technology emerged from the need to manipulate large collections of data for frequently used data queries and reports. The first major step in the mechanisation of information systems came with the advent of punched card machines, which worked sequentially on fixed-length fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems were used to perform these tasks with an increase in user efficiency.
These systems used sequential processing of files in batch mode, which was adequate until peripheral storage with random access capabilities (e.g. DASD) and time-sharing operating systems with interactive processing appeared to support real-time processing in computer systems. Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM) [BRA82, MCF91] were used to assist with the storage and location of physical records in stored files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage application files, making the application program dependent on the organisation of the file. This technique caused data redundancy, as several files were used in systems to hold the same data (e.g. emp_name and address in a payroll file; insured_name and address in an insurance file; and depositors_name and address in a bank file). The stored data files used in the applications of the 1960s are now referred to as conventional file systems, and they were maintained using third generation programming languages such as COBOL and PL/1. This evolution of mechanised information systems was influenced by the hardware and software developments which occurred in the 1950s and early 1960s. Most long existing legacy ISs are based on this technology. Our work does not address this type of IS, as they do not use a DBMS for their data management.

The evolution of databases and database management systems [CHA76, FRY76, SIB76, SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the main deficiencies in the use of files, i.e. by reducing data redundancy and making application programs less dependent on file organisation. An important factor in this evolution was the development of data definition languages, which allowed the description of a database to be separated from its application programs. This facility allowed the data definition (often called a schema) to be shared and integrated to provide a wide variety of information to users. The repository of all data definitions (meta-data) is called a data dictionary, and its use allows data definitions to be shared and made widely available to the user community.

In the late 1960s applications began to share their data files using an integrated layer of stored data descriptions, giving the first true databases, e.g. the IMS hierarchical database [MCG77, DAT90]. This type of database was navigational in nature, and applications explicitly followed the physical organisation of records in files to locate data, using commands such as GNP (get next under parent). These databases provided centralised storage management, transaction management, recovery facilities in the event of failure and system-maintained access paths. These were the typical characteristics of early DBMSs.

Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This resulted in the establishment of the DataBase Task Group (DBTG) of CODASYL and the formal introduction of the network model along with its data manipulation commands [DBTG71]. The relational model was proposed during the same period [COD70], followed by the 3-level ANSI/SPARC architecture [ANSI75], which made databases more independent of applications and became a standard for the organisation of DBMSs.
Three popular types of commercial database system [7], classified by their underlying data model, emerged during the 70s [DAT90, ELM94], namely:
• hierarchical
• network
• relational
and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

3.1.2 Database Technology Since the Relational Data Model

At the same time as the relational data model appeared, database systems introduced another layer of data description on top of the navigational functionality of the early hierarchical and network models, to bring extra logical data independence [8]. The relational model also introduced the use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980s many relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle, were in use, and due to their growing maturity in the mid 80s, together with the complexity of programming, navigating, and changing data structures in the older DBMS data models, the relational data model was able to take over the commercial database market, with the result that it is now dominant.

[7] Other types, such as flat file and inverted file systems, were also used.
[8] This allows changes to the logical structure of data without changing the application programs.
The advent of inexpensive and reliable communication between computer systems, through the development of national and international networks, has brought further changes in the design of these systems. These developments led to the introduction of distributed databases, where a processor uses data at several locations and links it as though it were at a single site. This technology has led to distributed DBMSs and the need for interoperability among different database systems [OZS91, BEL92].

Several shortcomings of the relational model have been identified, including its inability to perform efficiently in compute-intensive applications such as simulation, to cope with computer-aided design (CAD) and programming language environments, and to represent and manipulate effectively concepts such as [KIM90]:
• Complex nested entities (e.g. design and engineering objects),
• Unstructured data (e.g. images, textual documents),
• Generalisation and aggregation within a data structure,
• The notion of time and versioning of objects and schemas,
• Long duration transactions.

The notion of a conceptual schema for application-independent modelling introduced by the ANSI/SPARC architecture led to another data model, namely the semantic model. One of the most successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include entities, relationships, value sets and attributes. These concepts are used in traditional database design as they are application-independent. Many modelling concepts based on variants of or extensions to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship (EER) model [TEO86, ELM94], the entity-category-relationship (ECR) model [ELM85], and the Object Modelling Technique (OMT) [RUM91] are the most popular of these.
The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also semantic models. They capture a richer set of semantic relationships among real-world entities in a database than the E-R based models. Semantic relationships such as generalisation/specialisation between a superclass and its subclasses, the aggregation relationship between a class and its attributes, the instance-of relationship between an instance and its class, the part-of relationship between the objects forming a composite object, and the version-of relationship between abstracted versioned objects are semantic extensions supported in these models. The object-oriented data model, with its notions of class hierarchy, class-composition hierarchy (for nested objects) and methods, could be regarded as a subset of this type of semantic data model in terms of its modelling power, except that the semantic data model lacks the notion of methods [KIM90], which is an important aspect of the object-oriented model.

The relational model of data and the relational query language have been extended [ROW87] to allow modelling and manipulation of additional semantic relationships and database facilities. These extensions include data abstraction, encapsulation, object identity, composite objects, class hierarchies, rules and procedures. However, these extended relational systems are still being evolved to incorporate fully such features as the implementation of domains and extended data types, enforcement of primary key, foreign key and referential integrity checking, prohibition of duplicate rows in tables and views, handling of missing information by supporting four-valued predicate logic

Page 30
(i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet available as commercial products.

The early 1990s saw the emergence of new database systems through a natural evolution of database technology, with many relational database systems being extended and other data models (e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened opportunities to use databases for a greater diversity of applications which had not previously been exploited, as they were not perceived as tractable by a database approach (e.g. image, medical, document management, engineering design and multi-media information, used in complex information processing applications such as office automation (OA), computer-aided design (CAD), computer-aided manufacturing (CAM) and hypermedia [KIM90, ZDO90, CAT94]). The object-oriented (O-O) paradigm represents a sound basis for making progress in these areas, and as a result two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented DBMS and the extended relational DBMS.

There are two styles of O-O DBMS, depending on whether they have evolved by extending an O-O programming language or by extending an existing database model. Extensions have been created for two database models, namely the relational and the functional models. Extensions to existing relational DBMSs have resulted in the so-called extended relational DBMSs, which have O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have produced PROBE and OODAPLEX. The approach of extending O-O programming language systems with database management features has resulted in many systems (e.g. Smalltalk into GemStone and ALLTALK, and C++ into many DBMSs including VBase/ONTOS, IRIS and O2).
References to these systems, with additional information and references, can be found in [CAT94]. Research is currently taking place into other kinds of database system, such as active, deductive and expert database systems [DAT90].

This thesis focuses on the relational model and possible extensions to it which can represent semantics in existing relational database information systems in such a way that these systems can be viewed in new ways and easily prepared for migration to more modern database environments.

3.2 Relational Data Model

In this section we introduce some of the commonly used terminology of the relational model. This is followed by a selective description of the features and query languages of this model. Further details of this data model can be found in most introductory database textbooks, e.g. [MCF91, ROB93, ELM94, DAT95].

A relation is represented as a table (entity) in which each row represents a tuple (record), the number of columns being the degree of the relation and the number of rows its cardinality. An example of this representation is shown in figure 3.1, which shows a relation holding Student details, with degree 3 and cardinality 5. The table and each of its columns are named, so that a unique identity for a table column of a given schema is achieved via its table name and column name. The columns of a table are called attributes (fields), each having its own domain (data type) representing its pool of legal data. Basic types of domain (e.g. integer, real, character, text, date) are used to define the domains of attributes. Constraints may be enforced to further restrict the pool of legal

Page 31
values for an attribute. Tables which actually hold data are called base tables, to distinguish them from view tables, which can be used for viewing data associated with one or more base tables. A view table can also be an abstraction over a single base table, used to control access to parts of the data.

A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (key) of the relation. It is customary to designate one candidate key of a relation as its primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation. Additional constraints may be imposed on an attribute to further restrict its legal values. In such cases, there should be a common set of legal values satisfying all the constraints of that attribute, ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first character of SNO is 'S' further restricts the possible values of SNO - see figure 3.1. Many other concepts and constraints are associated with the relational model, although most of them are not supported by early relational systems nor, indeed, by some of the more recent relational systems (e.g. a value set constraint for the Address field, as shown in figure 3.1).
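The key, pattern and value set constraints just described can be sketched in executable form. The sketch below uses Python's sqlite3 module purely as a convenient stand-in for a relational DBMS (early relational systems, as noted, could not express most of these constraints); the table follows the Student relation of figure 3.1, and the CHECK clauses are illustrative formulations, not any particular product's syntax.

```python
import sqlite3

# The Student relation of figure 3.1: PRIMARY KEY enforces uniqueness of
# the candidate key SNO; the first CHECK expresses the pattern constraint
# (SNO begins with 'S'); the second CHECK is an illustrative value set
# constraint on Address.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student (
        SNO     TEXT PRIMARY KEY CHECK (SNO LIKE 'S%'),
        Name    TEXT,
        Address TEXT CHECK (Address IN
                    ('Cardiff', 'Bristol', 'Swansea', 'Newport'))
    )""")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    ("S1", "Jones", "Cardiff"), ("S2", "Smith", "Bristol"),
    ("S3", "Gray",  "Swansea"), ("S4", "Brown", "Cardiff"),
    ("S5", "Jones", "Newport")])

# Degree = number of attributes; cardinality = number of tuples.
degree = len(conn.execute("SELECT * FROM Student LIMIT 1").description)
cardinality = conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0]
print(degree, cardinality)  # -> 3 5

# A tuple violating the pattern constraint (or duplicating the key)
# is rejected by the DBMS rather than stored:
try:
    conn.execute("INSERT INTO Student VALUES ('X9', 'Lee', 'Cardiff')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note that all the constraints on SNO (character domain, key uniqueness, pattern) admit a common set of legal values, so the attribute can still accept data, as required above.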
Student
    SNO     Name      Address
    S1      Jones     Cardiff
    S2      Smith     Bristol
    S3      Gray      Swansea
    S4      Brown     Cardiff
    S5      Jones     Newport

SNO: primary key (unique values), domain of type character, pattern constraint (all values begin with 'S'). Address: value set constraint. Each row is a tuple and each column an attribute; the relation has degree 3 and cardinality 5.

Figure 3.1: The Student relation

3.2.1 Requisite Features of the Relational Model

During the early stages of the development of relational database systems, many requisite features were identified which a comprehensive relational system should have [KIM79, DAT90]. We shall now examine these features to illustrate the kind of features expected of early relational database management systems. They included support for:
• Recovery from both soft and hard crashes,
• A report generator for formatted display of the results of queries,
• An efficient optimiser to meet the response-time requirements of users,
• User views of the stored database,

Page 32
