Transcript of "Mc0077 – advanced database systems"
Advanced Database Systems
Question 1. List and explain various Normal Forms. How does BCNF differ from the
Third and Fourth Normal Forms?
Relations are classified based upon the types of anomalies to which they're vulnerable. A database
that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the
domain/key normal form has no modification anomalies. Normal forms are hierarchical in nature.
That is, the lowest level is the first normal form, and the database cannot meet the requirements for
higher level normal forms without first having met all the requirements of the lesser normal forms.
First Normal Form
Any table that qualifies as a relation is said to be in the first normal form. To be
considered relational, the cells of the table must contain only single values; repeating groups or
arrays are not allowed as values. All attributes (the entries in a column) must be of the same kind, and
each column must have a unique name. Each row in the table must be unique. Databases in the first
normal form are the weakest and suffer from all modification anomalies.
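As an illustration of the single-value rule above, the following minimal sketch flattens a repeating group into one row per value. The table, the column names, and the function name are hypothetical and chosen only for this example.

```python
# Hypothetical unnormalized rows: the "phones" cell holds a repeating group,
# which first normal form does not allow.
unnormalized = [
    {"student_id": 1, "name": "Asha", "phones": ["555-0101", "555-0102"]},
    {"student_id": 2, "name": "Ravi", "phones": ["555-0199"]},
]


def to_first_normal_form(rows):
    """Return rows in which every cell holds exactly one value.

    Note: a row with an empty phone list produces no output row in this sketch.
    """
    flat = []
    for row in rows:
        for phone in row["phones"]:
            flat.append(
                {"student_id": row["student_id"], "name": row["name"], "phone": phone}
            )
    return flat


rows_1nf = to_first_normal_form(unnormalized)
for row in rows_1nf:
    print(row)
```

The result still repeats the student name on every row, which is the kind of redundancy the higher normal forms address.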
Second Normal Form
If all of a relation's non-key attributes depend on the whole of the key, then the database is
considered to be in the second normal form. This normal form solves the
problem of partial dependencies, so it only matters for relations with composite keys.
Third Normal Form
A database is in the third normal form if it meets the criteria for the second normal form and has no
transitive dependencies.
Boyce-Codd Normal Form
A database that meets third normal form criteria, and in which every determinant is a candidate
key, is said to be in the Boyce-Codd Normal Form. This normal form solves the issue of functional
dependencies whose determinant is not a candidate key.
Fourth Normal Form
Fourth Normal Form (4NF) is an extension of BCNF for functional and multi-valued dependencies. A
schema is in 4NF if the left hand side of every non-trivial functional or multi-valued dependency is a
superkey.
Domain/Key Normal Form
The domain/key normal form is the Holy Grail of relational database design, achieved when every
constraint on the relation is a logical consequence of the definition of keys and domains, and
enforcing key and domain constraints and conditions causes all constraints to be met. Thus, it avoids all
non-temporal anomalies. It's much easier to build a database in domain/key normal form than it is to
convert lesser databases which may contain numerous anomalies. However, successfully building a
domain/key normal form database remains a difficult task, even for experienced database
programmers. Thus, while the domain/key normal form eliminates the problems found in most
databases, it tends to be the most costly normal form to achieve. However, failing to achieve the
domain/key normal form may carry long-term, hidden costs due to anomalies which appear in
databases adhering only to lower normal forms over time.
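The Boyce-Codd condition described in this answer (the determinant of every non-trivial functional dependency must be a key for the whole relation) can be checked mechanically with an attribute closure. The sketch below is a minimal illustration, not a full normalization tool; the relation, the dependencies, and the function names are assumptions made for the example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the functional dependencies `fds`.

    `fds` is a list of (lhs, rhs) pairs of frozensets.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Return the non-trivial dependencies whose determinant is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial dependency, never a violation
        if not relation <= closure(lhs, fds):
            violations.append((lhs, rhs))
    return violations


# Hypothetical relation: it is in third normal form but not in BCNF.
relation = frozenset({"student", "course", "instructor"})
fds = [
    (frozenset({"student", "course"}), frozenset({"instructor"})),
    (frozenset({"instructor"}), frozenset({"course"})),
]

violations = bcnf_violations(relation, fds)
for lhs, rhs in violations:
    print(sorted(lhs), "->", sorted(rhs))
```

Here the dependency from instructor to course is reported, because instructor alone does not determine student and is therefore not a key. Checking only the listed dependencies is enough for the original relation; decomposed sub-relations would need their projected dependencies.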
Question 2. Describe the concepts of Structural Semantic Data Model (SSM).
A data model in software engineering is an abstract model that describes how data are represented and
accessed. Data models formally define data elements and relationships among data elements for a
domain of interest. According to Hoberman (2009), "A data model is a wayfinding tool for both
business and IT professionals, which uses a set of symbols and text to precisely explain a subset of
real information to improve communication within the organization and thereby lead to a more
flexible and stable application environment." A data model explicitly determines the structure of data
or structured data. Typical applications of data models include database models, design of information
systems, and enabling exchange of data. Usually data models are specified in a data modeling
language. Communication and precision are the two key benefits that make a data model important to
applications that use and exchange data. A data model is the medium through which project team members
from different backgrounds and with different levels of experience can communicate with one
another. Precision means that the terms and rules on a data model can be interpreted only one way and
are not ambiguous. A data model is sometimes referred to as a data structure, especially in the
context of programming languages. Data models are often complemented by function models,
especially in the context of enterprise models.
A semantic data model in software engineering is a technique to define the meaning of data within the
context of its interrelationships with other data. A semantic data model is an abstraction which defines
how the stored symbols relate to the real world. A semantic data model is sometimes called a
conceptual data model. The logical data structure of a database management system (DBMS), whether
hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition
of data because it is limited in scope and biased toward the implementation strategy employed by the
DBMS. Therefore, the need to define data from a conceptual view has led to the development of
semantic data modeling techniques. The real world, in terms of resources, ideas, events,
etc., is symbolically defined within physical data stores, so the model must be a true
representation of the real world.
Data modeling in software engineering is the process of creating a data model by applying formal data
model descriptions using data modeling techniques. Data modeling is a technique for defining
business requirements for a database. It is sometimes called database modeling because a data model
is eventually implemented in a database.
The way data models are developed and used today is as follows. A conceptual data model is
developed based on the data requirements for the application that is being developed, perhaps in the
context of an activity model. The data model will normally consist of entity types, attributes,
relationships, integrity rules, and the definitions of those objects. This is then used as the starting point
for interface or database design.
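The ingredients named above (entity types, attributes, relationships, and integrity rules) can be written down as a small in-memory structure. The following is only a sketch under assumed names; the classes, the example entities, and the single integrity rule are hypothetical and do not come from any particular modeling notation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    name: str
    datatype: str
    required: bool = True


@dataclass
class EntityType:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Relationship:
    name: str
    source: str
    target: str
    cardinality: str  # for example "1:N"


@dataclass
class ConceptualModel:
    entities: Dict[str, EntityType] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_entity(self, entity: EntityType) -> None:
        self.entities[entity.name] = entity

    def add_relationship(self, rel: Relationship) -> None:
        # Integrity rule assumed for this sketch: both ends must be known entity types.
        for end in (rel.source, rel.target):
            if end not in self.entities:
                raise ValueError(f"unknown entity type: {end}")
        self.relationships.append(rel)


model = ConceptualModel()
model.add_entity(EntityType("Customer", [Attribute("customer_id", "integer")]))
model.add_entity(EntityType("Order", [Attribute("order_id", "integer")]))
model.add_relationship(Relationship("places", "Customer", "Order", "1:N"))
```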
Data architecture is the design of data for use in defining the target state and the subsequent planning
needed to hit the target state. It is usually one of several architecture domains that form the pillars of
an enterprise architecture or solution architecture.
Question 3. Describe the following with respect to Object Oriented Databases:
a. Query Processing in Object-Oriented Database Systems
One of the criticisms of first-generation
object-oriented database management systems (OODBMSs) was their lack of declarative query
capabilities. This led some researchers to brand first generation (network and hierarchical) DBMSs as
object-oriented [Ullman 1988]. It was commonly believed that the application domains that
OODBMS technology targets do not need querying capabilities. This belief no longer holds, and
declarative query capability is accepted as one of the fundamental features of OODBMSs [Atkinson et
al. 1989; Stonebraker et al. 1990]. Indeed, most of the current prototype systems experiment with
powerful query languages and investigate their optimization. Commercial products have started to
include such languages as well (e.g., O2 [Deux et al. 1991], ObjectStore [Lamb et al. 1991]). In this
chapter we discuss the issues related to the optimization and execution of OODBMS query languages
(which we collectively call query processing). Query optimization techniques are dependent upon the
query model and language. For example, a functional query language lends itself to functional
optimization which is quite different from the algebraic, cost-based optimization techniques employed
in relational as well as a number of object-oriented systems. The query model, in turn, is based on the
data (or object) model since the latter defines the access primitives which are used by the query
model. These primitives, at least partially, determine the power of the query model. Despite this close
relationship, in this chapter we do not consider issues related to the design of object models, query
models, or query languages in any detail. Language design issues are discussed elsewhere in this
book. The interrelationship between object and query models is discussed in [Blakeley 1991; Ozsu
and Straube 1991; Ozsu et al. 1993; Yu and Osborn 1991].
Almost all object query processors proposed to date use optimization techniques developed for
relational systems. However, there are a number of issues that make query processing more difficult
in OODBMSs. The following are some of the more important issues:
1. Type system. Relational query languages operate on a simple type system consisting of a single
aggregate type: relation. The closure property of relational languages implies that each relational
operator takes one or more relations as operands and produces a relation as a result. In contrast, object
systems have richer type systems. The results of object algebra operators are usually sets of objects
(or collections) whose members may be of different types. If the object languages are closed under the
algebra operators, these heterogeneous sets of objects can be operands to other operators. This
requires the development of elaborate type inferencing schemes to determine which methods can be
applied to all the objects in such a set. Furthermore, object algebras often operate on semantically
different collection types (e.g., set, bag, list) which imposes additional requirements on the type
inferencing schemes to determine the type of the results of operations on collections of different
types.
2. Encapsulation. Relational query optimization depends on knowledge of the physical storage of data
(access paths) which is readily available to the query optimizer. The encapsulation of methods with
the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of
executing methods is considerably more difficult than estimating the cost of accessing an attribute
according to an access path. In fact, optimizers have to worry about optimizing method execution,
which is not an easy problem because methods may be written using a general-purpose programming
language. Second, encapsulation raises issues related to the accessibility of storage information by the
query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special
application that can break encapsulation and access information directly [Cluet and Delobel 1992].
Others propose a mechanism whereby objects “reveal” their costs as part of their interface [Graefe
and Maier 1988].
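The idea in item 2 that objects can "reveal" their costs as part of their interface can be sketched as follows. This is a loose illustration and not the mechanism of any cited system: the class, the cost figures, and the cheapest-first ordering heuristic are all assumptions made for the example.

```python
class Document:
    # Hypothetical per-call cost estimates that the class publishes for its methods.
    METHOD_COSTS = {"word_count": 50.0, "is_draft": 1.0}

    def __init__(self, text, draft):
        self._text = text    # encapsulated state, not read by the optimizer
        self._draft = draft

    def word_count(self):
        return len(self._text.split())

    def is_draft(self):
        return self._draft

    @classmethod
    def estimated_cost(cls, method_name):
        """Cost revealed through the interface, without exposing storage details."""
        return cls.METHOD_COSTS[method_name]


def order_predicates(cls, method_names):
    """Order method-based predicates cheapest first, using only revealed costs."""
    return sorted(method_names, key=cls.estimated_cost)


evaluation_order = order_predicates(Document, ["word_count", "is_draft"])
print(evaluation_order)
```

Cheapest-first ordering ignores predicate selectivity, so a real optimizer would combine the revealed cost with an estimate of how many objects each predicate filters out.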
b. Query Processing Architecture
In this section we focus on two architectural issues: the query processing methodology and the query
optimizer architecture.
1 Query Processing Methodology
A query processing methodology similar to relational DBMSs, but modified to deal with the
difficulties discussed in the previous section, can be followed in OODBMSs. One such
methodology is proposed in [Straube and Ozsu 1990a]. The steps of the methodology are as follows.
Queries are expressed in a declarative language which requires no user knowledge of object
implementations, access paths or processing strategies. The calculus expression is first
normalized by calculus optimization. The stages of the methodology and the intermediate results
they produce are:
declarative query
-> calculus optimization -> normalized calculus expression
-> calculus-algebra transformation -> object algebra expression
-> type check -> type consistent expression
-> algebra optimization -> optimized algebra expression
-> execution plan generation -> execution plan
2 Optimizer Architecture
Query optimization can be modeled as an optimization problem whose
solution is the choice of the “optimum” state in a state space (also called search space). In query
optimization, each state corresponds to an algebraic query indicating an execution schedule and
represented as a processing tree. The state space is a family of equivalent (in the sense of generating
the same result) algebraic queries. Query optimizers generate and search a state space using a search
strategy, applying a cost function to each state and finding one with minimal cost. Thus, to
characterize a query optimizer, three things need to be specified (in this chapter we are mostly
concerned with cost-based optimization, which is arguably the more interesting case):
1. The search space and the transformation rules that generate the alternative query expressions
which constitute the search space;
2. A search algorithm that allows one to move from one state to another in the search space; and
3. The cost function that is applied to each state.
Many existing OODBMS optimizers are either
implemented as part of the object manager on top of a storage system, or they are implemented as
client modules in client-server architecture. In most cases, the above mentioned three aspects are
“hardwired” into the query optimizer. Given that extensibility is a major goal of OODBMSs, one
would hope to develop an extensible optimizer that accommodates different search strategies,
different algebra specifications with their different transformation rules, and different cost functions.
Rule-based query optimizers provide a limited amount of extensibility by allowing the definition of
new transformation rules. However, they do not allow extensibility in other dimensions. In this
section we discuss some new promising proposals for extensibility in OODBMSs.
The Open OODB project [Wells et al. 1992] at Texas Instruments
concentrates on the definition of an open architectural framework for OODBMSs and on the
description of the design space for these systems. Query processing in Open OODB is described in [Blakeley et al.
1993]. The query module is an example of intra-module extensibility in Open OODB. The query
optimizer, built using the Volcano optimizer generator, is extensible with respect to algebraic
operators, logical transformation rules, execution algorithms, implementation rules (i.e., logical
operator to execution algorithm mappings), cost estimation functions, and physical property
enforcement functions (e.g., presence of objects in memory). The clean separation between the user
query language parsing structures and the operator graph on which the optimizer operates allows the
replacement of the user language or optimizer. The separation between algebraic operators and
execution algorithms allows exploration with alternative methods for implementing algebraic
operators. Code generation is also a well defined subcomponent of the query module which facilitates
porting the query module to work on top of other OODBMSs. The Open OODB query processor
includes a query execution engine containing efficient implementations of scan, indexed scan,
hybrid-hash join [Shapiro 1986], and complex object assembly [Keller et al. 1991].
The EPOQ project
[Mitchell et al. 1993] is another approach to query optimization extensibility, where the search space
is divided into regions. Each region corresponds to an equivalent family of query expressions that are
reachable from each other. The regions are not necessarily mutually exclusive and differ in the queries
that they manipulate, control (search) strategy that they use, query transformation rules that they
incorporate, and optimization objectives they achieve. For example, one region may cover
transformation rules that deal with simple select queries, while another region may deal with
transformations for nested queries. Similarly, one region may have the objective of minimizing a cost
function, while another region may attempt to transform queries into some desirable form. Each region
may be nested to a number of levels, allowing hierarchical search within a region. Since the regions
do not represent equivalence classes, there is a need for a global control strategy to determine how the
query optimizer moves from one region to another. The feasibility and effectiveness of this approach
remain to be verified.
The TIGUKAT project [Peters et al. 1992] uses an object-oriented approach to
query processing extensibility.
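The ingredients that characterize a cost-based optimizer in the discussion above (a search space, a search algorithm, and a cost function) can be shown in a toy form. The sketch below is not taken from Open OODB, EPOQ, or TIGUKAT; the collection names, cardinalities, the single uniform join selectivity, and the exhaustive search are all assumptions chosen to keep the example small.

```python
from itertools import permutations

# Hypothetical collection cardinalities and one uniform join selectivity.
CARD = {"Employee": 1000, "Dept": 50, "Project": 200}
SELECTIVITY = 0.01


def search_space(collections):
    """Search space: every left-deep join order over the collections."""
    return permutations(collections)


def cost(order):
    """Cost function: sum of the estimated intermediate result sizes."""
    size = CARD[order[0]]
    total = 0.0
    for name in order[1:]:
        size = size * CARD[name] * SELECTIVITY
        total += size
    return total


def optimize(collections):
    """Search algorithm: exhaustive enumeration, keeping a minimal-cost state."""
    return min(search_space(collections), key=cost)


best_order = optimize(["Employee", "Dept", "Project"])
print(best_order, cost(best_order))
```

Exhaustive enumeration grows factorially with the number of collections, which is one reason the choice of search strategy is treated as a separate, replaceable part of an extensible optimizer.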
Question 4. Describe the Differences between Distributed & Centralized Databases.
A distributed database is a database that is under the control of a central database management system
(DBMS) in which storage devices are not all attached to a common CPU. It may be stored in multiple
computers located in the same physical location, or may be dispersed over a network of
interconnected computers. Collections of data (e.g. in a database) can be distributed across multiple
physical locations. A distributed database can reside on network servers on the Internet, on corporate
intranets or extranets, or on other company networks. The replication and distribution of databases
improves database performance at end-user worksites. To ensure that the distributed databases are up
to date and current, there are two processes: replication and duplication. Replication involves using
specialized software that looks for changes in the distributed database. Once the changes have been
identified, the replication process makes all the databases look the same. The replication process can
be very complex and can require a lot of time and computer resources, depending on the size and
number of the distributed databases. Duplication, on the other hand, is
not as complicated. It basically identifies one database as a master and then duplicates that database.
The duplication process is normally done at a set time after hours. This is to ensure that each
distributed location has the same data. In the duplication process, changes are allowed to the master
database only. This is to ensure that local data will not be overwritten. Both of the processes can keep
the data current in all distributed locations. Besides distributed database replication and
fragmentation, there are many other distributed database design technologies, for example local
autonomy and synchronous and asynchronous distributed database technologies. How these technologies
are implemented depends on the needs of the business and the sensitivity/confidentiality of
the data to be stored in the database, and hence the price the business is willing to pay to ensure
data security, consistency and integrity.
Basic architecture
A database user accesses the distributed database through:
Local applications: applications which do not require data from other sites.
Global applications: applications which do require data from other sites.
A distributed database does not share main memory or disks.
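The two synchronization processes described earlier in this answer, replication and duplication, can be contrasted in a few lines. This is a simplified sketch: each site is modeled as a dictionary of key to (value, version), and resolving conflicts by keeping the highest version is an assumption of the example, not something the text prescribes.

```python
def replicate(sites):
    """Look for the newest version of every key on any site, then make all sites identical.

    Assumption for this sketch: the highest version number wins.
    """
    newest = {}
    for site in sites:
        for key, (value, version) in site.items():
            if key not in newest or version > newest[key][1]:
                newest[key] = (value, version)
    for site in sites:
        site.clear()
        site.update(newest)


def duplicate(master, sites):
    """Overwrite every other site with a copy of the master; only the master accepts changes."""
    for site in sites:
        if site is not master:
            site.clear()
            site.update(master)


# Hypothetical branch databases.
oslo = {"acct-1": (100, 1)}
bergen = {"acct-1": (120, 2), "acct-2": (50, 1)}

replicate([oslo, bergen])
after_replication = dict(oslo)

master = {"acct-1": (200, 3)}
duplicate(master, [oslo, bergen])
```

Replication has to inspect every site to find changes, while duplication simply copies one source, which matches the difference in complexity described in the text.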
A centralized database has all its data in one place, in contrast to a distributed database,
which has data in different places. Because all the data in a centralized database reside in one place,
bottlenecks can occur, and data availability is not as good as in a distributed database. The
advantages of distributed databases listed below make the difference between centralized
and distributed databases clear.
Advantages of Data Distribution
The primary advantage of distributed database systems is the ability to share and access data in a
reliable and efficient manner.
Data sharing and Distributed Control:
If a number of different sites are connected to each other, then a user at one site may be able to access
data that is available at another site. For example, in the distributed banking system, it is possible for a
user in one branch to access data in another branch. Without this capability, a user wishing to transfer
funds from one branch to another would have to resort to some external mechanism for such a
transfer. This external mechanism would, in effect, be a single centralized database.
The primary advantage to accomplishing data sharing by means of data distribution is that each site is
able to retain a degree of control over data stored locally. In a centralized system, the database
administrator of the central site controls the database. In a distributed system, there is a global
database administrator responsible for the entire system. A part of these responsibilities is delegated to
the local database administrator for each site. Depending upon the design of the distributed database
system, each local administrator may have a different degree of autonomy which is often a major
advantage of distributed databases.
Question 5. Explain the following:
a. Query Optimization
Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to
the database server and parsed by the parser, they are passed to the query optimizer, where
optimization occurs. However, some database engines allow guiding the query optimizer with hints.
A query is a request for information from a database. It can be as simple as "finding the address of a
person with SS# 123-45-6789," or more complex like "finding the average salary of all the employed
married men in California between the ages of 30 and 39 who earn less than their wives." Query results
are generated by accessing relevant database data and manipulating it in a way that yields the
requested information. Since database structures are complex, in most cases, and especially for
not-very-simple queries, the needed data for a query can be collected from a database by accessing it in
different ways, through different data-structures, and in different orders. Each different way typically
requires different processing time. Processing times of the same query may have large variance, from a
fraction of a second to hours, depending on the way selected. The purpose of query optimization,
which is an automated process, is to find the way to process a given query in minimum time. The
large possible variance in time justifies performing query optimization, though finding the exact
optimal way to execute a query, among all possibilities, is typically very complex, time consuming by
itself, may be too costly, and often practically impossible. Thus query optimization typically tries to
approximate the optimum by comparing several common-sense alternatives to provide in a reasonable
time a "good enough" plan which typically does not deviate much from the best possible result.
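To see an optimizer choose between different ways of answering the same query, one can ask a database engine for its plan before and after an index exists. The sketch below uses SQLite through Python's standard sqlite3 module purely as a convenient stand-in; the table, the column names, and the index name are hypothetical, and the exact wording of the plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (ssn TEXT, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(f"{i:09d}", f"name{i}", f"{i} Main St") for i in range(1000)],
)

QUERY = "SELECT address FROM person WHERE ssn = ?"


def plan(connection, sql, params):
    """Return the optimizer's chosen plan as text (last column of each plan row)."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index the only available access path is a full table scan.
plan_without_index = plan(conn, QUERY, ("000000042",))

# With an index on ssn the optimizer can pick a cheaper access path.
conn.execute("CREATE INDEX idx_person_ssn ON person (ssn)")
plan_with_index = plan(conn, QUERY, ("000000042",))

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both cases; only the set of available access paths changed, and the optimizer picked a different plan accordingly.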
b. Text Retrieval Using SQL3/Text Retrieval
SQL3 supports storage of multimedia data, such as text documents, in an O-R database using the
blob/clob data types. However, the standard SQL3 specification does not include support for
processing the media content, such as indexing or querying. Thus it is not possible to use standard
SQL3 to locate documents based on an analysis of their content. Therefore, most of the larger
or-dbms vendors (IBM, Oracle, Ingres, Postgres ...) have used the SQL3 UDT/UDF functionality to
extend their or-dbms with management systems for media data. The approach used has been to add
in-house or purchased specialized media management systems to the basic or-dbms.
Basically, the new - to SQL3 - functionality includes:
Indexing routines for the various types of media data, as discussed in CH.6, for example using:
o Content terms for text data and
o Color, shape, and texture features for image data.
Selection operators for the SQL3 WHERE clause for specification of selection criteria for
Text processing sub-systems for similarity evaluation and result ranking.
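The general idea of extending SQL with a user-defined function that acts as a selection operator in the WHERE clause can be tried out with SQLite's Python binding. This is only an analogy to the SQL3 UDT/UDF approach described above, not SQL3/Text syntax: the function name contains_term, the table, and the sample texts are hypothetical, and the sketch does no indexing or result ranking.

```python
import sqlite3


def contains_term(document, term):
    """Return 1 if `term` occurs as a whitespace-separated word in `document`."""
    return int(term.lower() in document.lower().split())


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it.
conn.create_function("contains_term", 2, contains_term)

conn.execute("CREATE TABLE report (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?)",
    [
        (1, "The election results were announced today"),
        (2, "Weather report for Bergen"),
        (3, "Candidates debate before the election"),
    ],
)

rows = conn.execute(
    "SELECT id FROM report WHERE contains_term(body, ?) ORDER BY id",
    ("election",),
).fetchall()
matching_ids = [row[0] for row in rows]
print(matching_ids)
```

Because the function is evaluated row by row, every document is scanned; the indexing routines listed above are what a real text extension adds to avoid that.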
Unfortunately, the result of this 'independent' activity is non-standard or-dbms/mm (multimedia)
systems that differ in the functionality included and limit data retrieval from multiple or-dbms system
types. For example, unified access to data stored in Oracle and DB2 systems is difficult, both in query
formulation and result presentation. Since actual SQL3/TextRetrieval syntax varies between or-
dbms/mm implementations, the examples used in the following are given in generic
8.1 Text Document Retrieval
Multimedia documents can be complex, but are basically unstructured. They can consist of the raw
text only, or have a few fixed attributes with one or more semi- or unstructured components. For
example, a news report for an election could include the following components, where n, m, k, and x
are the number of occurrences of each component type:
1. Identifier, date, and author(s) of the report,
2. n* text blocks - (titles, abstract, content text),
3. m* images - example: image_of_candidate
4. k* charts, and
5. x* maps.
Note that the document elements listed in pt.1 above function as context metadata for the report, while
the text itself can function as semantic metadata for the image materials (Rønnevik, 2005). A page from the
Grieg site illustrates elements of a semi-structured document. The original Grieg site also contains a list of references/links
which gives access to other multimedia documents about the composer, including some of his music.
Since an OR-DB can contain text documents such as web pages, SQL3 should be extended with
processing operators that support access to each of the element types listed above.
Question 6. Describe the following:
a. Data Mining Functions: Data mining functions can be divided into two categories: supervised
(directed) and unsupervised (undirected).
Supervised functions are used to predict a value; they require the specification of a target (known
outcome). Targets are either binary attributes indicating yes/no decisions (buy/don't buy, churn or
don't churn, etc.) or multi-class targets indicating a preferred alternative (color of sweater, likely
salary range, etc.). Naive Bayes for classification is a supervised mining algorithm.
Unsupervised functions are used to find the intrinsic structure, relations, or affinities in data.
Unsupervised mining does not use a target. Clustering algorithms can be used to find naturally
occurring groups in data.
Data mining can also be classified as predictive or descriptive. Predictive data mining constructs one
or more models; these models are used to predict outcomes for new data sets. Predictive data mining
functions are classification and regression. Naive Bayes is one algorithm used for predictive data
mining. Descriptive data mining describes a data set in a concise way and presents interesting
characteristics of the data. Descriptive data mining functions are clustering, association models, and
feature extraction. k-Means clustering is an algorithm used for descriptive data mining.
Different algorithms serve different purposes; each algorithm has advantages and disadvantages. A
given algorithm can be used to solve different kinds of problems. For example, k-Means clustering is
unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster,
it performs predictive data mining. Similarly, decision tree classification is supervised data mining;
however, the decision tree rules can be used for descriptive purposes.
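The remark that k-Means is unsupervised, yet becomes predictive when it assigns new records to a cluster, can be shown with a tiny one-dimensional version. This is a minimal sketch and not Oracle Data Mining code; the ages, the starting centroids, and the function names are assumptions made for the example.

```python
def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, centroids, iterations=10):
    """Plain k-Means on one numeric attribute, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            clusters[nearest(v, centroids)].append(v)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


# Descriptive use: find natural groupings; no target column is involved.
ages = [21, 23, 25, 52, 55, 58]
centroids = kmeans_1d(ages, centroids=[20.0, 60.0])

# Predictive use: assign a previously unseen record to one of the clusters.
cluster_of_new_customer = nearest(30, centroids)
print(centroids, cluster_of_new_customer)
```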
Oracle Data Mining supports the following data mining functions:
Supervised data mining:
o Classification: Grouping items into discrete classes and predicting which class an
item belongs to
o Regression: Approximating and forecasting continuous values
o Attribute Importance: Identifying the attributes that are most important in predicting
results
o Anomaly Detection: Identifying items that do not satisfy the characteristics of
"normal" data (outliers)
Unsupervised data mining:
o Clustering: Finding natural groupings in the data
o Association models: Analyzing "market baskets"
o Feature extraction: Creating new attributes (features) as a combination of the original
attributes
Oracle Data Mining permits mining of one or more columns of text data.
Oracle Data Mining also supports specialized sequence search and alignment algorithms (BLAST)
used to detect similarities between nucleotide and amino acid sequences.
b. Data Mining Techniques: Several core techniques that are used in data mining describe the
type of mining and data recovery operation. Unfortunately, the different companies and solutions do
not always share terms, which can add to the confusion and apparent complexity.
Let's look at some key techniques and examples of how to use different tools to build data mining solutions.
Association (or relation) is probably the best known, most familiar, and most straightforward data
mining technique. Here, you make a simple correlation between two or more items, often of the same
type to identify patterns. For example, when tracking people's buying habits, you might identify that a
customer always buys cream when they buy strawberries, and therefore suggest that the next time that
they buy strawberries they might also want to buy cream.
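The strawberries-and-cream rule above can be quantified with the two usual association measures, support and confidence. The sketch below uses a handful of made-up baskets; the data and the function names are assumptions for illustration only.

```python
# Hypothetical market baskets, one set of items per purchase.
baskets = [
    {"strawberries", "cream"},
    {"strawberries", "cream", "sugar"},
    {"strawberries", "cream"},
    {"bread", "butter"},
    {"cream", "coffee"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


rule_support = support({"strawberries", "cream"}, baskets)
rule_confidence = confidence({"strawberries"}, {"cream"}, baskets)
print(rule_support, rule_confidence)
```

A confidence of 1.0 here corresponds to the "always buys cream" wording in the text; the reverse rule (cream implies strawberries) has lower confidence in this data because one basket contains cream without strawberries.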
Building association or relation-based data mining tools can be achieved simply with different tools.
For example, within InfoSphere Warehouse a wizard provides configurations of an information flow
that is used in association by examining your database input source, decision basis, and output.
You can use classification to build up an idea of the type of customer, item, or object by describing
multiple attributes to identify a particular class. For example, you can easily classify cars into
different types (sedan, 4x4, convertible) by identifying different attributes (number of seats, car shape,
driven wheels). Given a new car, you might assign it to a particular class by comparing the attributes
with your known definition. You can apply the same principles to customers, for example by
classifying them by age and social group.
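The car example can be turned into a very small classifier that compares a new item's attributes with known class definitions. This is a sketch under stated assumptions: the attribute values in the definitions are invented for the example, and counting matching attributes is just one simple way to pick a class.

```python
# Hypothetical class definitions built from the attributes named in the text.
CLASS_DEFINITIONS = {
    "sedan": {"seats": 5, "shape": "three-box", "driven_wheels": 2},
    "4x4": {"seats": 5, "shape": "two-box", "driven_wheels": 4},
    "convertible": {"seats": 2, "shape": "open-top", "driven_wheels": 2},
}


def classify(car):
    """Return the class whose definition shares the most attribute values with `car`."""

    def score(definition):
        return sum(car.get(attr) == value for attr, value in definition.items())

    return max(CLASS_DEFINITIONS, key=lambda name: score(CLASS_DEFINITIONS[name]))


new_car = {"seats": 5, "shape": "two-box", "driven_wheels": 4}
predicted_class = classify(new_car)
print(predicted_class)
```

The same structure applies to customers if the definitions are replaced with attributes such as age band and social group.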
Additionally, you can use classification as a feeder to, or the result of, other techniques. For example,
you can use decision trees to determine a classification. Clustering allows you to use common
attributes in different classifications to identify clusters.
By examining one or more attributes or classes, you can group individual pieces of data together to
form a structured opinion. At a simple level, clustering is using one or more attributes as your basis for
identifying a cluster of correlating results. Clustering is useful to identify different information
because it correlates with other examples so you can see where the similarities and ranges agree.
Clustering can work both ways. You can assume that there is a cluster at a certain point and then use
your identification criteria to see if you are correct. In this example, a sample of sales data compares the age of
the customer to the size of the sale. It is not unreasonable to expect that people in their twenties
(before marriage and kids), fifties, and sixties (when the children have left home), have more