Database Systems Introduction: Data, Relationships, and Architecture
Database Systems Introduction
A database is a shared collection of logically related data with descriptions designed to satisfy the
information needs of an organisation. The key idea behind the database concept is separating the data from
the application program (data independence).
Some common uses of databases are in large shops and supermarkets. The databases can be linked directly
to the operations of the business.
For example, bar codes on items for sale can be linked to products in the company's database so that stock
levels can be maintained and pricing details can be applied centrally from the database rather than on each
individual item.
The database approach was developed and adopted because of the problems associated with data being
stored within application programs or within file based systems. The problem with data being stored within
application programs was that it was hard to access from other places and there were limits to what could be
done with the data.
There are three main areas within the relational model that can be focused on:
- Data Structure
- Data Integrity
- Data Manipulation
Databases and File Based Systems
A file based system is a collection of application programs that perform services for the users
wishing to access information. Each program within a file based system defines and manages its
own data. Because of this, there are limits as to how that data can be used or transported.
File based systems were developed as better alternatives to paper based filing systems. By having
files stored on computers, the data could be accessed more efficiently. It was common practice for
larger companies to have each department look after its own data.
The problems that arise with this type of file based system are listed below:
- Data separation and isolation
- Data dependence
- Data duplication
- Incompatible data (different file formats)
- Lack of flexibility in organising and querying the data
- Increased number of different application programs
Some advantages of database systems are outlined below:
- Sharing of data
- Consistency of data
- Integrity of data
- Security of data
- Data independence
- Allows for more analysis of the same amount of data
- Improved data access and system performance
- Potentially increased productivity
- Increased concurrency
- Improved data backups and recovery
Some potential disadvantages of database systems are the cost of implementing them, the amount
of effort needed to transfer data into the database from a current system, and also the impact on the
whole company if the database fails (even if only for a relatively short period).
Database Management System (DBMS)
A DBMS is a software system that enables users to define, create and maintain a database. The
DBMS also enforces necessary access restrictions and security measures in order to protect the
database.
The DBMS also has the job of controlling access to the database. Various types of control systems
within the DBMS make sure that the database continues to function properly:
- Integrity system
- Security system
- Concurrency control system
- Recovery control system
Many DBMSs enable "views" of the database to be defined. A view is how the database appears to
a certain user. Views mean that only relevant information needs to be shown to each type of user,
and they increase security because users cannot see data which they are not meant to see. Views
can also decrease the perceived complexity of the database from the user's point of view.
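The idea of a view can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a full security setup: the table, column and view names are assumptions made for the example, and SQLite itself has no per-user permissions, so the view here only shows how a restricted presentation of the data is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
    INSERT INTO staff VALUES (1, 'Ann', 30000), (2, 'Bob', 25000);
    -- The view exposes id and name only; salary is not part of it.
    CREATE VIEW staff_directory AS SELECT id, name FROM staff;
""")

# A user working through the view sees a simpler, smaller table.
cursor = conn.execute("SELECT * FROM staff_directory ORDER BY id")
view_columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
```

In a multi-user DBMS, access rights would typically be granted on the view rather than on the underlying table.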
Data Definition and Manipulation
The DBMS makes use of a Data Definition Language (DDL) and a Data Manipulation Language (DML). The
DDL enables the specification of data types, structures and constraints that should be part of the
database. All specifications defined by the DDL are stored in the database (in its system catalog).
The DML enables those with access to the database to insert, update, delete and retrieve data from it.
Structured Query Language (SQL) is the standard relational database language used today; it includes
both DDL and DML statements. SQL is a non-procedural language.
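The split between definition and manipulation can be sketched with Python's built-in sqlite3 module. The product table loosely follows the supermarket bar code example from the introduction; its column names and values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure, data types and constraints.
conn.execute("""
    CREATE TABLE product (
        barcode TEXT PRIMARY KEY,
        name    TEXT NOT NULL,
        price   REAL NOT NULL,
        stock   INTEGER NOT NULL
    )
""")

# DML: insert, update, delete and retrieve data.
conn.execute("INSERT INTO product VALUES ('5000112', 'Tea', 2.50, 10)")
conn.execute("INSERT INTO product VALUES ('5000113', 'Milk', 1.20, 4)")
conn.execute("UPDATE product SET stock = stock - 1 WHERE barcode = '5000112'")
conn.execute("DELETE FROM product WHERE barcode = '5000113'")
remaining = conn.execute("SELECT name, stock FROM product").fetchall()

# The definition itself is stored by the DBMS; sqlite_master is SQLite's catalog table.
catalog = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

Note that the SELECT statement says what data is wanted, not how to find it, which is what makes SQL non-procedural.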
ANSI SPARC 3 Level Database Architecture
ANSI SPARC is an acronym for the American National Standards Institute Standards Planning and
Requirements Committee. A standard three level approach to database design has been agreed.
- External level
- Conceptual level
- Internal level (includes physical data storage)
The 3 Level Architecture has the aim of enabling users to access the same data but with a
personalised view of it. The distancing of the internal level from the external level means that users
do not need to know how the data is physically stored in the database. This level separation also
allows the Database Administrator (DBA) to change the database storage structures without
affecting the users' views.
External Level (User Views)
A user's view of the database describes a part of the database that is relevant to a particular user. It
excludes irrelevant data as well as data which the user is not authorised to access.
Conceptual Level
The conceptual level is a way of describing what data is stored within the whole database and how the
data is inter-related. The conceptual level does not specify how the data is physically stored.
Internal Level
The internal level involves how the database is physically represented on the computer system. It describes
how the data is actually stored in the database and on the computer hardware.
Database Schema
The database schema provides an overall description of the database structure (not actual data). There are
three types of schema which relate to the 3 Level Database Architecture.
External Schemas or subschemas relate to the user views. The Conceptual Schema describes all the types
of data that appear in the database and the relationships between data items. Integrity constraints are also
specified in the conceptual schema. The Internal Schema provides definitions for stored records, methods
of representation, data fields, indexes, and hashing schemes etc...
Mappings and Data Independence
Mappings between the different database schemas allow for data independence. The Database
Management System (DBMS) is responsible for managing these mappings and checking the
schemas for consistency.
Internal-Conceptual mappings enable the DBMS to find records within the database storage
medium that correspond to the logical record in the conceptual schema. External-Conceptual
mappings enable the DBMS to match names of data items etc... in the user's view with the parts of
the conceptual schema that correspond to those items.
Data Independence
Data Independence means that the higher levels of the database model are designed to be
unaffected by changes to the lower levels (internal and physical). There are two types of Data
Independence.
- Logical data independence
- Physical data independence
Logical Data Independence involves the external schema being unaffected by changes in the
conceptual schema. For example, a new field can be added to a table (relation) without any changes
to application programs etc... being required.
Physical Data Independence means that the conceptual schema is not affected by changes made to
the internal schema. An example of a change to the internal schema would be changing the storage
device used to store the database data. This would not affect the conceptual or external schemas /
layers.
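Both kinds of independence can be sketched with Python's built-in sqlite3 module. The book table and the titles() function are assumptions made for the example; titles() stands in for an application program. Adding an index is used here as a small-scale stand-in for a storage-level change, since swapping the storage device cannot be shown in a short script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT);
    INSERT INTO book VALUES ('111', 'Databases'), ('222', 'Networks');
""")

def titles():
    # The "application program": it names only the columns it needs.
    return conn.execute("SELECT title FROM book ORDER BY title").fetchall()

before = titles()

# Logical change: a new field is added to the relation.
conn.execute("ALTER TABLE book ADD COLUMN publisher TEXT")
# Storage-level change: a new access structure is added.
conn.execute("CREATE INDEX idx_book_title ON book (title)")

after = titles()  # the program runs unchanged and sees the same result
```

The program keeps working because it refers to data by name through the DBMS rather than by physical position.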
Relational Database Principles and Terminology
A relational database is a collection of 2-dimensional tables which consist of rows and columns.
Tables are known as "relations", columns are known as "attributes" and rows (or records) are
known as "tuples".
Tables / Relations are a logical structure therefore they are an abstract concept and they do not
represent how the data is stored on the physical computer system. Each column / attribute in a
relation represents an attribute of an entity. A single row / tuple contains all the information (in the
form of attributes) about a single entity.
The "cardinality" of a relation is the number of rows / tuples it has. The "degree" of a relation refers
to the number of columns / attributes in that relation. The order of records and columns is
irrelevant. Relation names should be unique within the database and column names unique within
their relation, so that each is uniquely identifiable. No duplicate rows occur in a single relation.
Each cell of a relation contains a single value or element which is atomic. This means that arrays or
lists, for example, would not be stored under a single attribute. Multi-valued attributes are still
possible, but they are handled by referring to another relation which holds the multiple
values.
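These terms can be illustrated in plain Python by treating a relation as a set of tuples. This is only a sketch of the abstract idea, not of how a DBMS stores data; the student and phone data are made up for the example.

```python
# A relation modelled as a set of tuples, with attribute names kept alongside.
attributes = ("student_id", "name")
student = {(1, "Ann"), (2, "Bob"), (2, "Bob")}  # the duplicate row collapses

degree = len(attributes)    # number of columns / attributes
cardinality = len(student)  # number of rows / tuples

# A multi-valued attribute (several phone numbers per student) is held in
# a second relation that refers back to the student, one atomic value per cell.
student_phone = {(1, "555-0100"), (1, "555-0101"), (2, "555-0200")}
phones_for_ann = sorted(phone for sid, phone in student_phone if sid == 1)
```

Because a Python set has no order and no duplicates, it mirrors the two properties of relations stated above.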
Attribute Domains
Attribute domains define the set of all possible values an attribute within a relation can take. For example,
the attribute "height" of a person may only take integer values with a 4 digit maximum (if measuring in
millimetres). The attribute "gender" may only take a string of length 1 with the only two characters
accepted being "m" or "f" representing male or female. Defining attribute domains allows the database to
reject invalid or inappropriate data.
An attribute may in some cases be permitted to contain Null values. A Null value makes it possible to deal
with missing, unknown or inapplicable data. Null means that there is an absence of a value. It is different
from using a blank space or zero value in an attribute.
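One way to approximate attribute domains is with CHECK constraints, sketched here with Python's built-in sqlite3 module. The person table is an assumption built from the height and gender examples above. SQLite has no separate domain objects, so the constraints are attached directly to the columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id        INTEGER PRIMARY KEY,
        height_mm INTEGER CHECK (height_mm BETWEEN 0 AND 9999),
        gender    TEXT    CHECK (gender IN ('m', 'f'))
    )
""")
conn.execute("INSERT INTO person VALUES (1, 1750, 'f')")
conn.execute("INSERT INTO person VALUES (2, NULL, 'm')")  # height not known

# A value outside the domain is rejected by the DBMS.
try:
    conn.execute("INSERT INTO person VALUES (3, 1800, 'x')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Null is not the same as zero: it is tested with IS NULL, not with = 0.
unknown_heights = conn.execute(
    "SELECT COUNT(*) FROM person WHERE height_mm IS NULL"
).fetchone()[0]
zero_heights = conn.execute(
    "SELECT COUNT(*) FROM person WHERE height_mm = 0"
).fetchone()[0]
```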
Keys (Primary, Candidate, and Foreign)
A key (also known as a candidate key) is an attribute, or minimal set of attributes, which uniquely identifies
a row / tuple within a relation. The primary key is the key chosen by the database designer as the main
identifier for all records in that relation (even though other keys could be used as the primary key). A
foreign key is an attribute which provides a logical link between tables. A foreign key in one relation will
correspond to a key within another relation. Candidate keys can consist of a single attribute or multiple
attributes. Multiple attribute keys are known as composite keys.
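The different kinds of key can be sketched with Python's built-in sqlite3 module. The staff, book and loan tables are assumptions based loosely on the library examples in this document. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE staff (
        staff_id INTEGER PRIMARY KEY,   -- the chosen primary key
        email    TEXT NOT NULL UNIQUE   -- another candidate key
    );
    CREATE TABLE book (
        isbn     TEXT PRIMARY KEY,
        staff_id INTEGER NOT NULL REFERENCES staff (staff_id)  -- foreign key
    );
    CREATE TABLE loan (
        isbn      TEXT REFERENCES book (isbn),
        loan_date TEXT,
        PRIMARY KEY (isbn, loan_date)   -- composite key
    );
    INSERT INTO staff VALUES (1, 'ann@example.org');
    INSERT INTO book VALUES ('111', 1);
""")

# A foreign key value must match a key in the referenced relation.
try:
    conn.execute("INSERT INTO book VALUES ('222', 99)")  # no staff member 99
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```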
Relational Algebra, Calculus and Operators
Relational algebra and calculus form the basis for relational languages. All the fundamental
operations necessary within a Data Manipulation Language (DML) can be defined in relational
algebra and calculus.
To understand the difference between relational algebra and calculus, relational
algebra can be viewed as procedural and relational calculus as non-procedural. Relational algebra
specifies how to get the required information and how to build a relation from one or more other
relations. Relational calculus simply provides a definition of a relation in terms of one or more
other relations. In other words, it says what data is required, but not how it is actually retrieved.
Every algebraic expression has an equivalent calculus expression and vice versa.
The concept of closure relates to relational algebra. Closure means that relational expressions can
be nested within each other. For example, an operation which leads to a new relation as its output
can be put within brackets and used within another expression. The relations are closed under the
relational algebra.
E. F. Codd (1972) proposed 8 operations used within relational algebra. The 5 fundamental
operators are listed below:
- Selection
- Projection
- Cartesian Product
- Union
- Set Difference
The other 3 operators can be defined using these fundamental operators and are shown below:
- Join
- Intersection
- Division
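The five fundamental operators, and two of the derived ones, can be sketched in plain Python. This is an illustrative model only: a relation is represented as a pair (attribute names, set of tuples), the function names are invented for the example, attribute names are assumed to be distinct across joined relations, and division is left out for brevity.

```python
from itertools import product as pairs

def select(rel, predicate):
    attrs, rows = rel
    return attrs, {r for r in rows if predicate(dict(zip(attrs, r)))}

def project(rel, keep):
    attrs, rows = rel
    idx = [attrs.index(a) for a in keep]
    return tuple(keep), {tuple(r[i] for i in idx) for r in rows}

def times(r1, r2):  # Cartesian product
    return r1[0] + r2[0], {a + b for a, b in pairs(r1[1], r2[1])}

def union(r1, r2):
    return r1[0], r1[1] | r2[1]

def difference(r1, r2):
    return r1[0], r1[1] - r2[1]

# Derived operators, defined only in terms of the fundamental ones.
def intersection(r1, r2):
    return difference(r1, difference(r1, r2))

def join(r1, r2, left, right):  # equi-join: selection over a product
    return select(times(r1, r2), lambda t: t[left] == t[right])

staff = (("staff_id", "name"), {(1, "Ann"), (2, "Bob")})
book = (("isbn", "keeper_id"), {("111", 1), ("222", 1), ("333", 2)})

# Closure: each result is itself a relation, so expressions can be nested.
anns_books = project(
    select(join(staff, book, "staff_id", "keeper_id"),
           lambda t: t["name"] == "Ann"),
    ["isbn"],
)
```

The nested call at the end reads inside-out as a procedure (join, then select, then project), which is the sense in which relational algebra is procedural.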
Selection Conditions
A conditional statement can take the following forms:
<attribute name> <comparison operator> <constant value> or
<attribute name> <comparison operator> <attribute name>
The <comparison operator> shown above can be any of the following comparison operators:
= (equals)
≠ (does not equal)
< (smaller than)
> (larger than)
≤ (smaller than or equal to)
≥ (larger than or equal to)
The Boolean operators AND, OR, and NOT can also be used to connect / combine these selection
conditions.
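The two condition forms and the Boolean connectives can be sketched in plain Python. The helper names (condition, AND, OR, NOT) and the stock data are assumptions made for the example; rows are modelled as dictionaries keyed by attribute name.

```python
import operator

COMPARISONS = {
    "=": operator.eq, "≠": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "≤": operator.le, "≥": operator.ge,
}

def condition(attribute, op, operand, operand_is_attribute=False):
    """Build <attribute> <op> <constant> or <attribute> <op> <attribute>."""
    compare = COMPARISONS[op]
    def test(row):
        right = row[operand] if operand_is_attribute else operand
        return compare(row[attribute], right)
    return test

def AND(a, b):
    return lambda row: a(row) and b(row)

def OR(a, b):
    return lambda row: a(row) or b(row)

def NOT(a):
    return lambda row: not a(row)

rows = [
    {"title": "A", "stock": 0, "reorder_level": 2},
    {"title": "B", "stock": 5, "reorder_level": 2},
    {"title": "C", "stock": 1, "reorder_level": 3},
]

# stock < reorder_level AND NOT (stock = 0)
low = condition("stock", "<", "reorder_level", operand_is_attribute=True)
not_empty = NOT(condition("stock", "=", 0))
selected = [r["title"] for r in rows if AND(low, not_empty)(r)]
```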
Database Design / Application Lifecycle
Before a database design is set into motion, planning is an essential stage. This is typically the
responsibility of management within the organisation. Good planning will enable the design and
implementation to be successfully completed with the desired results. A general definition of the
desired system should also be drawn up at this stage. This would include information such as what
the database application should be able to do, to what areas it will be applied, and who will be
using it.
Three main factors that should be analysed in the planning are:
- What work will need to be done
- The resources needed to complete the work
- How much it will all cost
Requirements Analysis and Collection
Information can be gathered using various techniques. Some of these are listed below:
- Interviewing individuals
- Observation
- Examining documents
- Using questionnaires
- Using expert knowledge and experience from other design work
Interviewing many individuals can be time consuming but can result in quality feedback which can
be used in the system design. Using questionnaires is a quicker and easier way to gain feedback
from relevant people, although it is hard to ensure that the answers people give are accurate. Looking
at the current documents and paperwork in circulation can be useful in determining, for example,
what input/output screens should look like.
Database Design
The database design stage will result in a database model that will hopefully support the organisation's
goals. The two design approaches that can be used are Bottom Up or Top Down. The Top Down approach
starts with a very general overview of the system and details are added in an iterative manner. Bottom Up
design involves getting all the details first, then constructing the overall design from the smaller, more
detailed parts.
The aim of a database design should be to represent all relevant data and the relationships that exist
between data items. The database model should also allow for all necessary transactions on the data to be
possible.
The following stages follow the database design stage:
- DBMS Selection
- Application Design
- Prototyping
- Implementation
- Data Conversion and Loading (if data from an existing system needs to be put into the new database)
- Testing
- Operational Maintenance (follows the installation of the system)
Logical and Conceptual Data Models
The purpose of Data Modelling is to aid the understanding of the data and its semantics (meaning).
Different users will have different perspectives of the data, and so, these perspectives need to be
designed and understood. Data modelling allows independent analysis of the data, irrespective of
how it is physically stored and represented. Because of this distancing of the data model from the
physical storage system, data models can be applied to any platform. Building data models based
on a common syntax that is widely understood means that the data model and design can be
understood by many more individuals than just the designer.
Logical Data Model
Logical database design has the aim of creating a data model that is completely independent from
any particular DBMS or software/hardware platform. A conceptual model is typically needed
before the logical model is constructed. If the system is a particularly large one, it is often the case
that individual logical models are constructed for each user view or area within the business. These
separate models are then merged into a global logical data model.
An example logical data model of a simple library system is shown below:
Conceptual Data Model
The conceptual model purely documents the data and information within the business and how it is
used. The logical model is different from the conceptual model in that it takes into consideration
the relational or object-oriented theory which will be used to store the data. In some cases, the
conceptual data model may be the same as the logical data model.
An example conceptual model is shown below. It is based on the same system as modelled in the
logical data model above. Note that extra entities / relations have been added to enable the
information and relationships to be stored in a relational database.
Entities and Entity-Relationship (ER) Modelling
ER Modelling provides a fully scalable solution to modelling relationships between groups of data
elements. These groups of data elements can be described as "entities" or "entity types". These
entities are items (real or otherwise) that are of relevance to the business.
An ER model will contain the following concepts:
- Entity types
- Relationship types
- Attributes
An entity type describes a type of item which is distinctly identifiable. The identification of entities
is typically open to the interpretative skills of the systems analyst or designer. There are no quick
and easy rules which can be used to identify entities.
An ER model consists of an ER diagram or set of ER diagrams, as well as a set of normalised
relations (tables) which correspond to the ER diagrams. Additional information that can be
provided along with the ER model is as follows:
- written description of entities / relationships
- any assumptions
- additional constraints on the model
Entity-Relationship (ER) Diagrams
Entities are shown within a box. The entities "Student", "Book", "DVD" and "Staff" can therefore be
represented as shown below:
Relationships between entities can be shown by joining the entities with a line:
As shown in the above diagram, the relationship can be named and given a direction (shown with a small
arrow) to indicate which way the named relationship applies. This naming of relationships adds meaning
and can reduce ambiguity.
Cardinality constraints show the number of instances of entities that can be involved in a relationship. For
example, 1 member of staff can look after no books, 1 book, or many books. A book has to be assigned to
some member of staff though. This is represented by the "1..1" cardinality constraint. One and only one
member of staff must be assigned to each book.
Relationship Degrees
The degree of a relationship is the number of entities that participate in that relationship. A simple
relationship involving two entities is known as a binary relationship. Relationships with three entities are
known as ternary relationships. It is possible to have more entities involved in a relationship but this would
lead to an increasing level of complexity.
Unary relationships involve a single entity which has a relationship with itself (i.e. the same entity type).
Unary relationships are also known as recursive relationships.
Relational Normalization / Normal Forms
Being familiar with the process of Normalization will lead to good design practices. The concepts of
Normalization and Normal Forms are automatically considered by most database designers.
Understanding how to use them in a step by step process helps in understanding what
the different Normal Forms mean and how they apply to real databases. It is also useful when
analysing relations that have been designed by another individual.
Normalization
Normalization is defined as a technique for producing a set of well designed relations that measure
up to a set of requirements which are outlined in various levels of normalization (or Normal
Forms). Normalization also reduces information redundancy. The concept of normalization was
first developed and documented by E. F. Codd (1972).
The most commonly used normal forms are as follows:
- First (1NF)
- Second (2NF)
- Third (3NF)
- Boyce Codd (BCNF)
Fourth (4NF) and Fifth (5NF) Normal Forms exist but are not commonly employed in typical
database design. 1NF is essential for a relational database design. 3NF is recommended to prevent
the most common update anomalies. A higher Normal Form does not necessarily mean a better
design though. Normalizing relations to a higher normal form may result in poorer performance.
Normalization has the underlying aim of minimising information redundancy, avoiding data
inconsistency and preventing insertion, deletion, and modification anomalies.
First Normal Form (1NF)
Relations can only be in 1NF if each row and column intersection (cell) contains a single atomic value. A
primary key will be determined at this initial stage of normalization also. It may be necessary that the
primary key is made up of more than one attribute.
Second Normal Form (2NF)
For relations to be in 2NF they must first be in 1NF. They must also have no partial dependencies. A partial
dependency occurs when the primary key is made up of more than one attribute (i.e. it is a composite
primary key) and there exists an attribute (which is a non-primary key attribute) that is dependent on only
part of the primary key.
These partial dependencies can be removed by moving all of the partially dependent attributes into
another relation along with a copy of the determinant attribute (which is part of the primary key in the
original relation).
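Removing a partial dependency can be sketched in plain Python. The loan relation is an assumption made for the example: its composite primary key is (student_id, isbn), while title depends on isbn alone.

```python
# Loan(student_id, isbn, title, due): key is (student_id, isbn),
# but title is determined by isbn only, which is a partial dependency.
loan = [
    {"student_id": 1, "isbn": "111", "title": "Databases", "due": "2024-05-01"},
    {"student_id": 2, "isbn": "111", "title": "Databases", "due": "2024-05-08"},
    {"student_id": 1, "isbn": "222", "title": "Networks", "due": "2024-05-01"},
]

# Move the partially dependent attribute into its own relation,
# together with a copy of its determinant (isbn).
book = {(r["isbn"], r["title"]) for r in loan}
loan_2nf = {(r["student_id"], r["isbn"], r["due"]) for r in loan}

# Joining the two relations on isbn rebuilds the original rows.
title_of = dict(book)
rebuilt = sorted((s, i, title_of[i], d) for s, i, d in loan_2nf)
original = sorted(
    (r["student_id"], r["isbn"], r["title"], r["due"]) for r in loan
)
```

After the split each title is stored once, so renaming a book no longer requires touching every loan row.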
Third Normal Form (3NF)
Getting a relation to 3NF involves removing any transitive dependencies. Therefore, a relation in 3NF must
be in 1NF and 2NF and it must have no non-primary key attributes which are transitively dependent upon
the primary key.
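Removing a transitive dependency can be sketched in plain Python. The staff relation is an assumption made for the example: staff_id determines branch_id, and branch_id in turn determines the branch city, so city depends on the key only transitively.

```python
# Staff(staff_id, name, branch_id, branch_city)
staff = [
    (1, "Ann", "B1", "Leeds"),
    (2, "Bob", "B1", "Leeds"),
    (3, "Cy", "B2", "York"),
]

# Split out the transitively dependent attribute with its determinant.
staff_3nf = {(sid, name, branch_id) for sid, name, branch_id, _ in staff}
branch = {branch_id: city for _, _, branch_id, city in staff}

# A modification is now made in one place only, so rows cannot disagree.
branch["B1"] = "Bradford"
city_of = {name: branch[branch_id] for _, name, branch_id in staff_3nf}
```

In the original relation the same change would have to be applied to every staff row for branch B1, which is the modification anomaly that 3NF is meant to prevent.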
Enhanced ER Modelling (EER)
Enhanced ER Modelling (EER) allows for more complex concepts such as specialisation and
generalisation which were not in the original ER Model. The EER model added the ability to
represent Entity Supertypes and Subtypes as well as Attribute inheritance.
Supertypes and subtypes are used when there exist entities which share common properties. The
entity supertype contains the shared properties of all the subtypes. An entity subtype has a more
specific role and belongs to a supertype. The entity subtype inherits the properties of the supertype.
An example of supertypes and subtypes can be found within various contexts. In a library system,
there may exist the Item entity supertype with subtypes being Books, DVDs, or Magazines etc...
Also, within a business, Staff may be divided into Managers, Secretaries and Sales Representatives.
Here, Staff is the supertype. The entity subtypes will inherit the common attributes from the Staff
supertype (e.g. ID, Name, Address). A Sales Representative subtype will therefore have special
attributes unique to a sales person (e.g. Bonus, Sales Area) as well as the general attributes
associated with the Staff supertype.
Depending on business or system constraints, an entity may belong to multiple subtypes. For
example, a member of Staff could be a Manager as well as a Sales Representative. In some cases,
belonging to a subtype may not be mandatory. This means that a member of Staff can exist with no
specialised role. They will just have the properties associated with the Staff supertype.
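Attribute inheritance between a supertype and its subtypes can be sketched with Python dataclasses. This is an analogy rather than a database design: the class names follow the Staff example above, the department attribute on Manager is a hypothetical addition, and class inheritance does not directly capture an entity that belongs to several subtypes at once.

```python
from dataclasses import dataclass

@dataclass
class Staff:  # supertype: the shared attributes
    staff_id: int
    name: str
    address: str

@dataclass
class SalesRepresentative(Staff):  # subtype: adds its own attributes
    bonus: float
    sales_area: str

@dataclass
class Manager(Staff):
    department: str  # hypothetical attribute, for illustration only

rep = SalesRepresentative(7, "Ann", "1 High St", bonus=500.0, sales_area="North")
plain = Staff(8, "Bob", "2 Low Rd")  # a member of Staff with no specialised role
```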
The following image shows how two subtypes inherit from a supertype.
Database Security Threats and Countermeasures
Databases need to have a level of security in order to protect the database against both malicious and
accidental threats. A threat is any type of situation that will adversely affect the database system.
Some factors that drive the need for security are as follows:
- Theft and fraud
- Confidentiality
- Integrity
- Privacy
- Database availability
Threats to database security can come from many sources. People are a substantial source of
database threats. Different types of people can pose different threats. Users can gain unauthorised
access through the use of another person's account. Some users may act as hackers and/or create
viruses to adversely affect the performance of the system. Programmers can also pose similar
threats. The Database Administrator can also cause problems by not imposing an adequate security
policy.
Some threats related to the hardware of the system are as follows:
- Equipment failure
- Deliberate equipment damage (e.g. arson, bombs)
- Accidental / unforeseen equipment damage (e.g. fire, flood)
- Power failure
- Equipment theft
Countermeasures
Some countermeasures that can be employed are outlined below:
- Access Controls (can be Discretionary or Mandatory)
- Authorisation (granting legitimate access rights)
- Authentication (determining whether a user is who they claim to be)
- Backup
- Journaling (maintaining a log file - enables easy recovery of changes)
- Encryption (encoding data using an encryption algorithm)
- RAID (Redundant Array of Independent Disks - protects against data loss due to disk failure)
- Polyinstantiation (data objects that appear to have different values to users with different access rights /
clearance)
- Views (virtual relations which can limit the data viewable by certain users)
Transaction Management and Concurrency Control
A Transaction is an action or series of actions, carried out by a user or application which can access
or change the database contents. A transaction can be viewed as a single logical unit of work.
Application programs make use of many transactions along with non-database processing in
between these transactions.
A transaction should result in the transformation of the database from one consistent state to
another. A transaction can result in success or failure. A successful transaction commits. This
means that the database reaches a new consistent state. A failed transaction will be aborted and the
database will be restored to the previous consistent state. A failed transaction is "rolled back" or
undone. The DBMS is responsible for ensuring all updates associated with a transaction occur or,
in the case of the transaction being aborted, all the changes must be undone (rolled back).
Aborted transactions are transactions which have been rolled back but can be restarted later. A
committed transaction cannot be aborted.
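Commit and rollback can be sketched with Python's built-in sqlite3 module. The account table and the transfer() function are assumptions made for the example; the CHECK constraint stands in for a consistency rule that a failed transaction must not violate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(amount):
    """Move money from account 1 to account 2 as one logical unit of work."""
    try:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        conn.commit()    # success: a new consistent state
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # failure: the credit already made is undone
        return False

ok = transfer(30)
failed = transfer(500)  # would overdraw account 1, so it is rolled back
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
```

The second transfer credits account 2 before the debit fails, yet after the rollback neither change is visible.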
ACID Properties
Transactions have four basic properties which form the acronym "ACID"
- Atomicity (it is viewed as a single unit of work)
- Consistency (database must not be left in an inconsistent state)
- Isolation (partial effects of an incomplete transaction are not visible to other transactions)
- Durability (committed transactions result in permanent changes to the database state)
Concurrency Control
Concurrency Control is the process of managing / controlling simultaneous operations on the database.
Concurrency control is required because actions from different users / applications taking place upon the
database must not interfere.
Interleaving operations can lead to the database being left in an inconsistent state. Three potential
problems which should be addressed by successful concurrency control are as follows:
- Lost update problem
- Uncommitted Dependency Problem
- Inconsistent Analysis Problem
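The lost update problem can be simulated in plain Python by writing out an interleaved schedule by hand. No real concurrency is involved; the variable names and amounts are assumptions made for the example.

```python
# Two transactions both update the same balance without any concurrency control.
balance = 100

t1_read = balance        # T1 reads 100
t2_read = balance        # T2 reads 100, before T1 has written
balance = t1_read + 50   # T1 writes 150
balance = t2_read - 30   # T2 writes 70, overwriting T1's update
interleaved_result = balance

# The same two transactions run one after the other (a serial schedule).
balance = 100
balance = balance + 50   # T1
balance = balance - 30   # T2
serial_result = balance
```

The interleaved schedule ends with a value that no serial order of the two transactions could produce, which is why it is considered incorrect.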
Locks, 2PL and Deadlocks
Locking
A transaction will use a lock to deny data access to other transactions and so prevent incorrect
updates. Locks can be Read (shared) or Write (exclusive) locks. A Write Lock on a data item prevents
other transactions from reading or writing that data item, whereas a Read Lock simply stops other
transactions editing (writing to) the data item.
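The compatibility of Read and Write locks can be sketched as a small lock table in plain Python. The LockTable class is an assumption made for the example and is deliberately simplified: it has no unlock operation, no lock upgrades and no waiting queue, and a refused request simply returns False.

```python
class LockTable:
    def __init__(self):
        self.locks = {}  # item -> (mode, set of transactions holding it)

    def request(self, txn, item, mode):
        """Grant the lock if it is compatible with those already held."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "read" and held_mode == "read":
            holders.add(txn)  # read locks are shared
            return True
        return False          # any combination involving a write lock conflicts

table = LockTable()
r1 = table.request("T1", "A", "read")    # granted
r2 = table.request("T2", "A", "read")    # granted: shared with T1
w3 = table.request("T3", "A", "write")   # refused: readers hold A
w1 = table.request("T1", "B", "write")   # granted
r2b = table.request("T2", "B", "read")   # refused: B is write-locked
```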
Two-Phase Locking (2PL)
A transaction follows the 2PL protocol if all lock requests come before the first unlock operation
within the transaction. This means there are two main phases within the transaction:
- Growing phase (locks are acquired - no locks can be released)
- Shrinking phase (locks released - no new locks can be acquired)
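The 2PL rule can be checked mechanically. The following plain Python sketch assumes a transaction is given as a list of (action, item) pairs with the action names "read_lock", "write_lock" and "unlock"; this representation is invented for the example.

```python
def follows_2pl(operations):
    """Return True if no lock is requested after the first unlock."""
    shrinking = False
    for action, _item in operations:
        if action == "unlock":
            shrinking = True   # the shrinking phase has begun
        elif action in ("read_lock", "write_lock") and shrinking:
            return False       # a lock was requested after an unlock
    return True

two_phase = [
    ("read_lock", "A"), ("write_lock", "B"),
    ("unlock", "A"), ("unlock", "B"),
]
not_two_phase = [
    ("read_lock", "A"), ("unlock", "A"),
    ("write_lock", "B"), ("unlock", "B"),
]
```

Here follows_2pl(two_phase) is True, while follows_2pl(not_two_phase) is False because B is locked after A has already been released.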
Deadlock
Deadlock occurs when two or more transactions are left waiting for locks held by each other to be
released. The only way to break a deadlock is to abort one of the transactions so that its locks are
released and the other transaction can proceed.
The DBMS can manage this process of aborting a transaction when necessary. The aborted
transaction is typically restarted so that it is able to execute and commit without the user being
aware of any problems occurring.
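One common way to detect a deadlock is to look for a cycle in a "wait-for" graph, sketched below in plain Python. The text above does not say how a DBMS detects deadlocks, so this technique, the function names and the choice of victim are all assumptions made for the example.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a cycle of waits, or None."""
    def visit(node, path):
        if node in path:
            return path[path.index(node):]
        for blocker in waits_for.get(node, ()):
            cycle = visit(blocker, path + [node])
            if cycle:
                return cycle
        return None

    for start in waits_for:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

def abort(waits_for, victim):
    """Drop the victim: it stops waiting and nobody waits for it any more."""
    return {
        txn: [b for b in blockers if b != victim]
        for txn, blockers in waits_for.items()
        if txn != victim
    }

# T1 waits for T2 and T2 waits for T1; T3 is merely queued behind T1.
waits_for = {"T1": ["T2"], "T2": ["T1"], "T3": ["T1"]}
cycle = find_deadlock(waits_for)
victim = cycle[-1]                 # an arbitrary choice of victim
after_abort = abort(waits_for, victim)
```

Once the victim is aborted the cycle disappears, so the remaining transactions can proceed and the victim can be restarted later.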