I.J. Information Technology and Computer Science, 2016, 12,
59-66
Published Online December 2016 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijitcs.2016.12.07
Copyright © 2016 MECS I.J.
Information Technology and Computer Science, 2016, 12, 59-66
SQL Versus NoSQL Movement with Big Data
Analytics
Sitalakshmi Venkatraman
School of Engineering, Construction and Design (IT),
Melbourne Polytechnic, VIC 3072, Australia
E-mail: [email protected]
Kiran Fahd, Samuel Kaspi
School of Engineering, Construction and Design (IT),
Melbourne Polytechnic, VIC 3072, Australia
E-mail: [email protected], [email protected]
Ramanathan Venkatraman
National University of Singapore, Singapore
E-mail: [email protected]
Abstract—Two main revolutions in data management
have occurred recently, namely Big Data analytics and
NoSQL databases. Even though they have evolved with
different purposes, their independent developments
complement each other and their convergence would
benefit businesses tremendously in making real-time
decisions using volumes of complex data sets that could
be both structured and unstructured. While on one hand
many software solutions have emerged in supporting Big
Data analytics, on the other, many NoSQL database
packages have arrived in the market. However, they lack
an independent benchmarking and comparative
evaluation. The aim of this paper is to provide an
understanding of their contexts and an in-depth study to
compare the features of the four main NoSQL data models
that have evolved. The performance comparison of
traditional SQL with NoSQL databases for Big Data
analytics shows that NoSQL databases are a better
option for business situations that require simplicity,
adaptability, high performance analytics and distributed
scalability of large data. This paper concludes that the
NoSQL movement should be leveraged for Big Data
analytics and would coexist with relational (SQL)
databases.
Index Terms—Structured Query Language (SQL), Non
SQL (NoSQL), Big Data, Big Data Analytics, Relational
Database, SQL Database, NoSQL Database.
I. INTRODUCTION
As the technology environment transforms and faces
new challenges, businesses increasingly realize the need
to evaluate new approaches and databases to manage
their data to support changing business requirements and
the growing complexity and expansion of their applications
[1]. The relational database has been the default choice of
data model in businesses worldwide over the
past thirty years, with Structured Query Language (SQL)
as the standard language designed to perform the basic
data operations. However, with the explosion of data
volume, SQL-based data querying loses efficiency, and in
particular, managing larger databases has become a major
challenge [2]. In addition, relational databases exhibit a
variety of limitations in meeting the recent Big Data
analytics requirements of businesses. While cluster-based
architecture has emerged as a solution for large databases,
SQL is not designed to suit clusters, and this mismatch
has led to a search for alternative solutions. There are
mismatches between the persistent data model and
in-memory data structures, and servers based on SQL
standards are now prone to memory footprint, security
and performance issues.
NoSQL (Non SQL) databases with a set of new data
management features, on the other hand, are more
flexible and horizontally scalable. They are considered as
alternatives to overcome the limitations of the current
SQL-dominated persistence landscape and hence they are
also known as non-relational databases [3]. The main
goal of the NoSQL movement is to allow easy storage
and retrieval of data, regardless of its structure and
content, which is possible due to the non-existence of a
rigid data structure in non-relational databases. NoSQL
databases exhibit horizontal scalability by taking
advantage of new clusters and several low-cost servers. In
addition, they are envisaged to automatically manage data
administration including fault recovery, and these
capabilities would result in huge cost savings. Though
non-relational databases provide different features
and advantages, they were initially characterised by a lack
of data consistency and the inability to query stored
records using SQL. As NoSQL databases mature,
new features and optimisation characteristics
are evolving to overcome these limitations as well.
However, their total capabilities are still not disclosed [4].
Also, due to the increasing differences in NoSQL
database offerings and their non-standard features,
businesses are not clear on which stand to take.
In this paper, we first provide an overview of the
present context of Big Data analytics and NoSQL
databases. Next, we discuss the four main data models of
non-relational databases and compare them with SQL
databases. There are a variety of NoSQL databases, and
which one is more appropriate for which business
operation remains an unanswered question so far. We
compare the different data models of NoSQL in terms of
their features and the NoSQL databases available in the
market that support those features. The different data
manipulation mechanisms and optimisation techniques
adopted by NoSQL databases could result in differences
in their performance. We discuss how these factors
play a major role in Big Data analytics and identify the
associated challenges. We also consider the coexistence
of NoSQL databases with relational databases and discuss
their relevance in different business contexts.
II. RELATED WORK: THE CONTEXT OF NOSQL
DATABASES WITH BIG DATA ANALYTICS
From the recent trends reported in literature [5][6], it is
evident that in today's context, there is an exponential
growth in the volume of data, both structured and
unstructured (Big Data), from a variety of data sources,
such as social media, e-mails, text documents, GPS data,
sensor data, surveillance data, etc. with increasing
Internet usage. Hence, we can say that Big Data is
characterised by structured, semi-structured, and
unstructured data collected from digital and non-digital
resources. The main challenge is the effective use of this
Big Data, which represents the data source for efficient
decision-making, by adopting suitable data mining
techniques [7][8].
Based on our literature survey, we have identified that
the current challenges presented by Big Data are due to
the following general characteristics experienced by
businesses:
Data Velocity – rapidly and continuously
updated data streams from different sources and
locations.
– structured, semi-structured and
unstructured data storage.
– huge nu mber of datasets with sizes
of several terabytes or petabytes.
– data organized in several
different locations or data centres.
It is important for businesses to perform Big Data
analytics, which is the process of examining large data
sets containing a variety of data types. Using Big Data
analytics, businesses are able to arrive at a more accurate
analysis of huge amounts of data to uncover hidden
patterns, unknown correlations, market trends, customer
preferences and other useful business information [2][9].
In order to support timely and effective decision making,
Big Data analytics relies on large volumes of data that
require clusters for data storage. However, since
relational databases are not designed for clusters and
exhibit performance issues with regard to Big Data
analytics, businesses are considering the need for the
NoSQL movement [10].
The schema of NoSQL is not fixed. It uses varied
interfaces to store and analyse the sheer volume of
user-generated content, personal data and spatial data being
generated by modern applications, cloud computing and
smart devices [1][11].
In this context, a NoSQL database presents a preferable
solution to an SQL database primarily for its ability to
cater to the horizontal partitioning of data, flexible data
processing and improved performance. Large Internet
companies (Facebook, LinkedIn, Amazon and Google),
which could not deliver their services using existing
relational databases, carried out the research that led to
the advent of NoSQL in order to solve their problems of
dealing with continuously increasing data, optimised data
utilization and horizontal scalability of large data. NoSQL
databases are a better option for information systems that
require high performance and dynamic scalability more
than reliability, given the highly distributed nature of
three-tier Internet architecture systems and
cloud computing [1][3][11]. Therefore, it is necessary to
investigate further and compare SQL versus NoSQL as
well as the salient differences in the performance of
NoSQL data models in supporting the necessary features
for Big Data analytics. This paper presents these
investigations and findings in today's Big Data context.
III. NOSQL DATA MODELS
There are many NoSQL databases available; however,
they fall under the four data models described below
[3][11][12]. Each category has its own specific attributes,
but there are crossovers between the different data
models. Generally, all NoSQL databases are built to be
distributed and scaled horizontally.
Key-Value Store Database – A Key-Value store is a
simple but efficient and powerful NoSQL database. The
data is stored in two parts, a string that represents the key
and the actual data that represents the value, thus creating
a "key-value" pair. This results in values being indexed
by keys for retrieval, a concept similar to hash tables. In
other words, the store allows the user to request the
values according to the key specified. It can handle
structured or unstructured data. It offers high concurrency
and scalability as well as rapid lookups, but little
consistency.
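The hash-table analogy above can be made concrete with a minimal Python sketch. It is an in-memory illustration only and does not reproduce the API of any product named in this paper; the class name KeyValueStore and its methods put, get and put_many are hypothetical names chosen for the example.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store: string keys, opaque values."""

    def __init__(self) -> None:
        # A Python dict is a hash table, so a lookup by key is O(1) on average.
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store (or overwrite) the value held under key."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the value for key, or default when the key is unknown."""
        return self._data.get(key, default)

    def put_many(self, pairs: Iterable[Tuple[str, Any]]) -> int:
        """Store several pairs in one call and return how many were written."""
        count = 0
        for key, value in pairs:
            self._data[key] = value
            count += 1
        return count


store = KeyValueStore()
# The value can be structured (a dict) or unstructured (a plain string).
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
written = store.put_many([("hot", "and spicy"), ("cold", "yet loving")])
```

The store never inspects the value, which is why structured and unstructured data can be mixed freely; the price is that any query other than a lookup by key has to be implemented by the application.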
Such Key-Value store databases can be used to
develop forums, online shopping carts and websites
where user sessions are required to be stored. Some
notable examples are Amazon's DynamoDB, Apache's
Cassandra, Azure Table Storage (ATS), Oracle Berkeley
DB, and Basho Technologies' Riak. Amazon offers the fully
managed NoSQL store service DynamoDB for
internet-scale applications. It is a distributed
key-value storage system which provides fast, reliable
and cost-effective data access, with high availability and
durability due to its replication feature.
One of the advantages of Key-Value store database is
its high insert/read rates compared to traditional SQL
database. This is achieved by saving more than one entry
to the store as shown in the example below:
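# Save three key-value entries with a single bulk call
# (@db is assumed to be an already-open database handle).
@db.bulk_save([
  {"hot" => "and spicy"},
  {"cold" => "yet loving"},
  {"other" => ["set","of","keys"]}
])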
Column Oriented (or wide-column) Store Databases –
In column store databases, columns are defined for each
row instead of being predefined by the table structure
with uniformly sized columns for each row. Such stores
have a two-level aggregate structure, a key and a row
aggregate, which is a group of columns. Any column can
be added to any row, and rows can have very different
columns. In other words, each row can have a different
number of columns stored. A column store can also store data
tables as sections of columns of data. Data can be viewed
as either row-oriented, where each row is an aggregate, or
column-oriented, where each column family defines a
record type. Each key is associated with one or more
columns, and a key for each column family is used for
rapid data retrieval with less I/O activity, thereby offering
very high performance. These databases provide high
scalability as they store data in highly distributed
architectures.
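The two-level aggregate structure described above (a row key, then column families that group columns) can be sketched with nested Python dictionaries. This is an illustrative in-memory model under our own assumptions, not the storage format of Cassandra, HBase or Big Table; the helper names put and get_family and the sample rows are hypothetical.

```python
from typing import Any, Dict

# row key -> column family -> column name -> value
Table = Dict[str, Dict[str, Dict[str, Any]]]


def put(table: Table, row_key: str, family: str, column: str, value: Any) -> None:
    """Add one column to one row; the row and family are created on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get_family(table: Table, row_key: str, family: str) -> Dict[str, Any]:
    """Fetch all columns of a single column family for a single row."""
    return dict(table.get(row_key, {}).get(family, {}))


users: Table = {}
# Row u1 carries two profile columns.
put(users, "u1", "profile", "name", "Alice")
put(users, "u1", "profile", "email", "alice@example.org")
# Row u2 has a different set of columns and an extra column family.
put(users, "u2", "profile", "name", "Bob")
put(users, "u2", "activity", "last_login", "2016-02-20")
```

Reading a single column family touches only the columns of that family for the requested key, which is the intuition behind the reduced I/O activity mentioned above.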
Wide-column databases are ideal for data
mining and analytic applications with Big Data.
Examples of some column-oriented store providers are
Facebook's high-performance Cassandra, Apache Hbase,
Google's Big Table and HyperTable. Google's Big Table
is a high performance wide-column database that can deal
with vast amounts of data. It is developed on the Google File
System (GFS) using C/C++. It is used by multiple Google
applications, like YouTube and Gmail, that place varied
latency demands on the database. It is not distributed
outside Google other than through Google's App
Engine. Big Table is designed for easy scalability across
thousands of machines; thus, it is tolerant to hardware
failures.
Document Store Databases – A document database
extends the basic key-value database concept and stores
complex data in document form such as XML, PDF or
JSON documents. A document store is typically
schema-less, where each document can contain different fields of
any length. Documents are accessed or identified by
using a unique key, which may be a simple string, URI
string or path string. Document databases are more
complex databases but offer high performance, horizontal
scalability and schema flexibility, which allow storing
virtually any structure required by any application.
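As a small illustration of the schema-less behaviour described above, the following Python sketch stores JSON documents under path-like keys and lets each document carry different fields. It is an in-memory toy under our own assumptions and not the API of MongoDB, CouchDB or any other product; the class DocumentStore and its methods save, load and find are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


class DocumentStore:
    """Toy document store: each key maps to one serialised JSON document."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}  # unique key -> JSON text

    def save(self, key: str, document: Dict[str, Any]) -> None:
        """Serialise and store a document; no schema is checked."""
        self._docs[key] = json.dumps(document)

    def load(self, key: str) -> Dict[str, Any]:
        """Return the document stored under key (raises KeyError if absent)."""
        return json.loads(self._docs[key])

    def find(self, predicate: Callable[[Dict[str, Any]], bool]) -> List[Dict[str, Any]]:
        """Return every document for which predicate(document) is true."""
        return [doc for doc in map(json.loads, self._docs.values()) if predicate(doc)]


blog = DocumentStore()
# The two posts do not share the same fields, and no schema was declared.
blog.save("/posts/1", {"title": "Hello", "tags": ["intro"]})
blog.save("/posts/2", {"title": "NoSQL notes", "author": {"name": "alice"}, "views": 10})
with_author = blog.find(lambda doc: "author" in doc)
```

Unlike a plain key-value store, the store can look inside the value, so queries on fields become possible without declaring those fields in advance.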
Document oriented databases are suitable for content
management systems and blog applications. Some
examples of document oriented databases
are 10gen's MongoDB, Apache CouchDB, Basho
Technologies' Riak, Azure's DocumentDB and AWS
DynamoDB. MongoDB is developed by 10gen using
C++ and is a structure free, cross-platform document
oriented database. It uses Grid File System to store large
files such as images and videos in BSON (Binary JSON)
format. It provides efficient performance, high
consistency and high persistence, but it is not very reliable
and is resource hungry.
Graph Store – A graph database focuses on relationships
between data. It uses the graph theory approach to store
the data and optimises the search by using the index-free
adjacency technique. It is designed for data whose
relationships are well represented by graph structures
consisting of nodes, edges and properties. A node
represents an object (an entity in the database), an edge
describes the relationship between two objects, and
properties are the attributes attached to nodes and edges.
In the index-free adjacency technique, each node holds a
pointer which points directly to the adjacent node, as
shown in Fig. 1.
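A minimal Python sketch of index-free adjacency follows: every node keeps direct references to its neighbours, so a traversal follows pointers instead of consulting a global index. The class Node, the function reachable and the sample graph are hypothetical and are not drawn from Neo4j or any other graph database.

```python
from collections import deque
from typing import Any, List, Tuple


class Node:
    """Graph node holding its properties and direct references to its neighbours."""

    def __init__(self, name: str, **properties: Any) -> None:
        self.name = name
        self.properties = properties
        # Index-free adjacency: each edge stores the neighbouring Node object itself.
        self.edges: List[Tuple[str, "Node"]] = []

    def connect(self, label: str, other: "Node") -> None:
        """Add a directed, labelled edge from this node to other."""
        self.edges.append((label, other))


def reachable(start: Node, max_hops: int) -> List[str]:
    """Names of nodes reachable from start within max_hops, in breadth-first order."""
    seen = {start.name}
    order: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for _label, neighbour in node.edges:
            if neighbour.name not in seen:
                seen.add(neighbour.name)
                order.append(neighbour.name)
                queue.append((neighbour, hops + 1))
    return order


alice = Node("alice", age=30)
bob = Node("bob")
carol = Node("carol")
alice.connect("FRIEND", bob)
bob.connect("FRIEND", carol)
friends_of_friends = reachable(alice, 2)
```

The cost of each hop depends only on the number of edges of the node being visited, not on the total size of the graph, which is the property that makes relationship-heavy queries fast in this model.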
These stores provide fast performance, ACID
compliance and rollback support. These databases are
suitable for developing social-networking applications,
bioinformatics applications, content management systems
and cloud management services. Examples of notable
Graph databases are Neo Technology's Neo4j, OrientDB,
Apache Giraph and Titan.
Apache Giraph is an open source large-scale graph
processing system and an implementation of Google Pregel
(a graph processing architecture which has a vertex-centric
approach). It is designed for high scalability to meet
the crucial need for scalable platforms and parallel
architectures that can process the bulk data produced by
modern applications such as social networks and
knowledge bases. For example, it is currently used at
Facebook, LinkedIn and Twitter to analyse the graph
formed by users and their connections. Giraph is a
distributed and fault-tolerant system and offers features
such as master computation, sharded aggregators,
edge-oriented input and out-of-core computation.
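The vertex-centric approach mentioned above can be illustrated with a single-process Python simulation. It is a sketch of the idea only: Giraph itself is a distributed Java system and its API is not reproduced here. The function pregel_max and the sample graph are hypothetical; the computation propagates the largest vertex value through the graph, with each vertex sending messages to its neighbours only when its own value changes.

```python
from typing import Dict, List


def pregel_max(graph: Dict[str, List[str]], values: Dict[str, int]) -> Dict[str, int]:
    """Propagate the maximum value along edges, superstep by superstep.

    graph maps every vertex to its out-neighbours (each neighbour must also be a key);
    values maps every vertex to its initial value.
    """
    values = dict(values)  # work on a copy
    # Superstep 0: every vertex sends its own value to its out-neighbours.
    inbox: Dict[str, List[int]] = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    # Later supersteps: run until no messages are in flight (all vertices halted).
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming messages, so the vertex stays inactive
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
        inbox = outbox
    return values


# Undirected path a - b - c, written as directed edges in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = pregel_max(graph, {"a": 3, "b": 6, "c": 2})
```

Because each vertex only reads its own messages and writes to its neighbours' inboxes, the per-vertex work inside one superstep is independent, which is what allows systems of this kind to spread vertices across many machines.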
Fig.1. Graph algorithm.
IV. HIGH LEVEL COMPARISON BETWEEN NOSQL AND SQL
DATABASES
Based on the features of each type of database recently
reported in the literature [1][3][11][13], we performed a
high level comparison between SQL (relational) and
NoSQL (non-relational) databases and the summary of
findings is given in Table 1.
We considered aspects such as database type, schema,
data model used, scaling model available, transactional
capabilities, data manipulation method used, and popular
database software available in the market in order to
compare SQL databases versus NoSQL databases. Some
examples are also given in Table 1 for a better
understanding of their differences. Overall, Table 1
provides the high level differences in the key features and
properties exhibited by relational and non-relational
databases, which would support businesses in making
decisions about using SQL or NoSQL database options in
various Big Data application scenarios.
Table 1. Relational Versus NoSQL Databases - High Level
Differences

Database Type
  Relational Databases:
    One SQL DBMS product (marginal variations)
  NoSQL Databases:
    Four general types: key-value, document, wide-column and graph stores

Schema
  Relational Databases:
    -defined foreign-key relationships between tables in an explicit database schema
    ct definition of schema and data types is required before inserting data
    entire database.
  NoSQL Databases:
    definition in advance
    stored together as required
    the schema freely with no downtime.

Data Models
  Relational Databases:
    row and columns in different tables joined via relationships
    of columns to store a specific piece of data
    joins two separate tables the "employees" and "departments" together to find out the department of an employee.
  NoSQL Databases:
    data – structured, semi-structured, and unstructured
    different and flexible data models. For example: Document store type organizes all related data using references and embedded documents
    tools.

Scaling Model
  Relational Databases:
    a resides on a single node and capacity is added to existing resources (data storage or I/O capacity)
  NoSQL Databases:
    partitioning of the data across additional servers or cloud instances as required.

Transaction Capabilities
  Relational Databases:
    transactional properties, such as atomicity, consistency, isolation, durability to ensure high data reliability and data integrity.
  NoSQL Databases:
    ance transactions and CAP Theorem of distributed systems supports consistency of data across all nodes of a NoSQL database
    single document.

Data Manipulation
  Relational Databases:
    y Language – SQL DML Statements are used to manipulate data e.g.
      SELECT customer_name
      FROM customers
      WHERE customer_age > 18;
  NoSQL Databases:
    - Oriented APIs are used e.g.
      db.customers.find(
        {customer_age: {$gt: 18}},
        {customer_name: 1})

Software
  Relational Databases:
    Oracle, MySQL, DB2, SQLServer
  NoSQL Databases:
    Couchbase, Rethinkdb, Redis, Aerospike, Leveldb, Hbase, Cassandra, Neo4j, Elasticsearch, Lucene
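To make the Data Manipulation row of Table 1 concrete, the sketch below runs the same "customers older than 18" request in two ways in Python: once as SQL through the standard sqlite3 module, and once against plain dictionaries through a tiny matcher that mimics the shape of a document-store query. The sample rows are invented, and the helper find is a hypothetical stand-in that understands only the $gt operator; it is not MongoDB's implementation.

```python
import sqlite3
from typing import Any, Dict, List

rows = [("Ann", 17), ("Ben", 25), ("Cho", 42)]  # invented sample data

# Relational side: the schema is declared up front and queried with SQL DML.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, customer_age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT customer_name FROM customers "
        "WHERE customer_age > 18 ORDER BY customer_name"
    )
]
conn.close()

# Document side: records are plain dicts and no schema is declared.
documents = [{"customer_name": name, "customer_age": age} for name, age in rows]


def find(docs: List[Dict[str, Any]], query: Dict[str, Dict[str, Any]],
         projection: Dict[str, int]) -> List[Dict[str, Any]]:
    """Return projected documents whose fields exceed the given $gt bounds."""
    results = []
    for doc in docs:
        matches = all(
            field in doc and doc[field] > condition["$gt"]
            for field, condition in query.items()
        )
        if matches:
            results.append({key: doc[key] for key in projection if key in doc})
    return results


doc_names = [
    doc["customer_name"]
    for doc in find(documents, {"customer_age": {"$gt": 18}}, {"customer_name": 1})
]
```

Both paths return the same customers; the difference lies in where the structure lives, in the table definition for SQL and in each individual record for the document model.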
V. PERFORMANCE OF NOSQL AND SQL DATABASES FOR
BIG DATA ANALYTICS
The most important reason for moving from relational
databases towards NoSQL is the requirement for
performance improvements. Choi et al. [1] found that a
NoSQL database such as MongoDB provided more stable
and faster performance at the expense of data consistency.
The tests were done on an internal blog system based on
an open source project. It was found that MongoDB
stored posts 850% faster than a SQL database. It has been
suggested that NoSQL should be used in environments
which are concerned with data availability rather than
consistency.
Fotache & Cogean [14] describe the use of MongoDB
in mobile applications. Certain multiple update operations
like Upsert are easier and faster to perform with NoSQL
than with an SQL database. The use of cloud computing along
with NoSQL is said to increase performance,
especially in the data layer for mobile platforms.
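For readers unfamiliar with the term, an upsert updates a record when its key already exists and inserts it otherwise. The Python sketch below shows these semantics on a dictionary-backed collection; the function upsert and the sample profiles are hypothetical, and the sketch says nothing about how MongoDB or PostgreSQL implement the operation.

```python
from typing import Any, Dict


def upsert(collection: Dict[str, Dict[str, Any]], key: str, changes: Dict[str, Any]) -> bool:
    """Update the document under key, or insert it when absent.

    Returns True when a new document was inserted, False when one was updated.
    """
    inserted = key not in collection
    collection.setdefault(key, {}).update(changes)
    return inserted


profiles = {"u1": {"name": "Ann", "visits": 1}}
first = upsert(profiles, "u1", {"visits": 2})                 # existing key: update
second = upsert(profiles, "u2", {"name": "Ben", "visits": 1})  # new key: insert
```

In a schema-less store the single call covers both cases because the document needs no predeclared columns.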
Ullah [15] compared the performance of a relational
database management system (RDBMS) and a NoSQL
database, where a Resource Description Framework (RDF)
based Triple store was used as the NoSQL database. It
was noted that the NoSQL database was slower than the
relational database due to the large amount of memory
used by the NoSQL database. Reading a large amount of
data takes a toll on the database, and because of the
unstructured format of the NoSQL database, storing a
thousand records requires a huge amount of storage,
whereas the RDBMS uses less storage. For
example, searching for red berry in the database took 5255
ms in the NoSQL database while it only took 165.43 ms
to search for it in the RDBMS.
Floratou et al. [4] performed the Yahoo Cloud Serving
Benchmark (YCSB) test on RDBMS and MongoDB.
They tested an SQL client-sharded database against
MongoDB auto-sharded and client-sharded databases. The tests
found that the SQL client-sharded database was able to attain
higher throughput and lower latency in most of the
benchmarks. The higher performance of SQL is
attributed to the fact that the majority of the read requests are
made to pages in the buffer pool, whereas NoSQL
databases tend to read shards located at different nodes.
The study set out to show that RDBMS still has the
processing power to handle larger workloads similar to
NoSQL.
There are many advantages of NoSQL databases over
SQL databases, such as easy scalability, flexible schema,
lower cost, and efficient and high performance. Having
said that, NoSQL databases also have some weaknesses
compared with SQL databases [12][16]. These are summarised below:
lack of familiarity and limited expertise.
either consistency or availability.
language in all NoSQL databases.
databases
(Cassandra) compared to non-distributed ones
(MongoDB).
difficult to maintain.
We have identified the following situations when
NoSQL should be more suitable than SQL in the context
of Big Data analytics:
1. Simplicity of use – current Big Data technologies
   are complex, requiring highly skilled technical
   expertise, while NoSQL offers simplicity that
   would improve the productivity of both developers
   and users. The simple, small, intuitive and easy to
   learn NoSQL stacks can suit businesses that
   require Big Data analytics to adopt clean
   NoSQL-like APIs.
2. Adaptability to change – when business
   requirements and data models change, warranting
   flexible Big Data analytics, NoSQL databases that support
   flexible data schemas are ideal for integrating siloed
   and disparate backend systems.
3. Efficiency for analytics functionality – the
   foundation data structure of the majority of NoSQL
   technology is the JavaScript Object Notation
   (JSON) data format, which caters to both schema-on-read
   and schema-on-write efficiently for data
   warehousing functionality. For example, SonarW, a NoSQL
   Big Data warehouse for JSON, makes
   analytics functionality efficient for Big Data
   applications.
4. Distributed scalability – with the more and more
   distributed nature of systems and transactions,
   flexible data becomes the norm and a strict schema
   approach is unsuitable. With schema evolution,
   NoSQL provides the necessary scalability for Big
   Data platforms to perform distributed queries
   faster.
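The distinction between schema-on-write and schema-on-read mentioned in item 3 can be sketched in Python with JSON records. The example is illustrative and rests on our own assumptions: the field names sensor and reading, the REQUIRED schema and the sample feed are invented, and the two functions are hypothetical helpers rather than part of any NoSQL product.

```python
import json
from typing import Any, Dict, List

# Assumed schema for the schema-on-write path: field name -> required type.
REQUIRED = {"sensor": str, "reading": float}


def write_with_schema(store: List[Dict[str, Any]], raw: str) -> None:
    """Schema-on-write: validate before storing and reject records that do not fit."""
    record = json.loads(raw)
    for field, kind in REQUIRED.items():
        if not isinstance(record.get(field), kind):
            raise ValueError(f"field {field!r} is missing or is not {kind.__name__}")
    store.append(record)


def read_with_schema(raw_records: List[str]) -> List[float]:
    """Schema-on-read: keep everything as stored and interpret fields at query time."""
    readings = []
    for raw in raw_records:
        value = json.loads(raw).get("reading")
        if isinstance(value, (int, float)):
            readings.append(float(value))
    return readings


raw_feed = [
    '{"sensor": "t1", "reading": 21.5}',
    '{"sensor": "t2", "note": "offline"}',  # no reading at all
    '{"sensor": "t3", "reading": 19}',      # reading is an integer, not a float
]

validated: List[Dict[str, Any]] = []
rejected = 0
for raw in raw_feed:
    try:
        write_with_schema(validated, raw)
    except ValueError:
        rejected += 1

readings = read_with_schema(raw_feed)
```

The strict path keeps only the first record, while the schema-on-read path keeps all three raw records and recovers two usable readings when queried, which is the flexibility the item above attributes to JSON-based stores.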
Table 2. Comparison of NoSQL Data Models
NoSQL Data Models | NoSQL Databases | Performance | Scalability | Flexibility | Complexity | Functionality
Key-Value | DynamoDB, Cassandra, ATS, Riak, Berkeley DB | High | High | High | None | Variable (None)
Wide-Column | Cassandra, Hbase, Big Table, HyperTable | High | High | Moderate | Low | Minimal
Document | MongoDB, CouchDB, Riak, DynamoDB | High | Variable (High) | High | Low | Variable (Low)
Graph | Neo4j, OrientDB, Giraph, Titan | Variable | Variable | High | High | Graph Theory
VI. COMPARISON OF NOSQL DATA MODELS
NoSQL databases vary in their performance depending
on their data model [17]. We compare the key attributes
of the four types of NoSQL data models and summarise
them in Table 2.
As shown in Table 2, we have considered key
attributes such as performance, scalability, flexibility,
complexity and functionality for comparing the four data
models supported by the popular NoSQL database
software available in the market.
Fig. 2 shows the CAP theorem, which forms a visual guide to
NoSQL databases under each NoSQL data model [16] and
is based on the consistency, availability and partition
tolerance features. With NoSQL databases, there are now
other options for storing different kinds of data, where
typically a distributed set of servers has to fit two of the
three requirements of the CAP theorem, which is usually
a deciding factor in what technology could be used.
Bazar & Losif [3] compared the performance of
MongoDB, Cassandra and Couchbase databases, each
possessing different features and functionalities. The tests
were conducted using the YCSB tool.
Fig.2. CAP theorem for NoSQL databases.
VII. RESULTS
The benchmark tests found that Couchbase produced
the lowest latencies for interactive database applications.
Couchbase is able to process more operations per second
with a lower average latency in reading and writing data
than both MongoDB and Cassandra. Document level
locking in the Couchbase database is the primary reason for
faster read and write operations. Cassandra is faster in
writing than MongoDB but both of them have almost
equal reading speed. It is also mentioned that each
NoSQL database is suitable to specific application
environments and cannot be considered a complete
solution for every workload and use case.
Another case study by Klein et al. [18] looked at the
use of the NoSQL databases MongoDB, Riak and Cassandra
in a distributed healthcare organisation. These databases
use different NoSQL data models including key-value
(Riak), column (Cassandra) and document (MongoDB).
Cassandra produced the overall best performance for
all types of database operations (Reading, Writing, and
Updating). Riak's performance was degraded due to its
internal thread pool creating a pool for each client session
instead of creating a shared pool for all client sessions.
Cassandra had the highest average latencies but also
produced the best throughput results. This was firstly due
to the indexing features that allowed Cassandra to
retrieve the most recently written records efficiently,
especially compared to Riak. Secondly, the hash-based
sharding allowed Cassandra to distribute the load of
storage requests better than MongoDB.
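The idea behind hash-based sharding can be sketched in a few lines of Python: a hash of the record key, reduced modulo the number of shards, decides where the record lives, so keys spread roughly evenly regardless of their order. This is a simplified illustration under our own assumptions (MD5 with modulo placement and invented patient keys); it is not the partitioner that Cassandra actually uses, and the function names are hypothetical.

```python
import hashlib
from typing import List


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number in the range 0 .. shard_count - 1."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(keys: List[str], shard_count: int) -> List[List[str]]:
    """Place every key on the shard selected by its hash."""
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards


keys = [f"patient:{i}" for i in range(1000)]  # sequential keys, invented for the example
shards = distribute(keys, 4)
sizes = [len(shard) for shard in shards]
```

Because the hash scrambles sequential keys, no single shard receives all of the most recent writes. A production system would normally use a scheme such as consistent hashing so that adding a node does not force most keys to move, which plain modulo placement does not provide.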
Prasad & Gohil [11] discussed the use of different
NoSQL databases for different work environments. It is
reported that the performance of NoSQL databases is
increased because of the use of a collection of processors
in the distributed system. MongoDB and Cassandra are
considered the best databases to be used in cases where
data is frequently written but rarely read. NoSQL
databases are said to be subject to the Consistency,
Availability and Partition tolerance (CAP) theorem. This means
that a trade-off is always made, e.g. the database can
either be consistent with low performance or offer high
availability and low consistency with fast performance
[11][17][19].
Zhikun et al. [20] suggested the use of a new database
allocation strategy based on load (DASBL) in order to
increase the performance of the NoSQL database. However,
the DASBL only works when four conditions are satisfied
and is unable to cater to an unbalanced system load.
Prasad & Gohil [11] compared different attributes such as
Replication, Sharding, Consistency and Failure handling.
We summarise all these findings in Table 3, which
provides a list of the best NoSQL databases for each of
the features reported in literature.
Several doubts arise about the NoSQL promises, and
studies have been conducted to explore the strengths and
weaknesses of NoSQL [21][22]. A recent study reviews
the trends of storage and computing tools with their
relative capabilities, limitations and the environments they
are suitable to work with [23]. While high-end platforms
like IBM Netezza AMPP could cater to Big Data, due to
economic considerations, choices such as Hadoop have
proliferated world-wide, resulting in the rise in adoption of
NoSQL databases that can integrate easily with Hadoop.
Even though HBase supports strong integration with
Hadoop using Apache Hive, it is a better
choice for application development only, and not for
real-time queries and OLTP applications, due to its very high
latency. On the other hand, graph-based platforms such
as Neo4j and Giraph form better options for storage
and computation due to their capability to model
vertex-edge scenarios in businesses that involve data
environments such as social networks and geospatial
paths.
Overall, Big Data has led to the requirement for a new
generation of data analytics tools [24][25] and hence it is
realistic to believe that both SQL and NoSQL databases
will coexist. With cloud environments that support SQL
databases, fast processing of data is warranted to enable
efficient elasticity [26] and Big Data analytics that
involve current and past data as well as future predictions.
New solutions are being proposed for cloud monitoring
with the use of NoSQL database back-ends to achieve
very quick response times.
Table 3. NoSQL Databases mapped to their features
Features | Best NoSQL Databases
High availability | Riak, Cassandra, Google Big Table, CouchDB
Partition Tolerance | MongoDB, Cassandra, Google Big Table, CouchDB, Riak, Hbase
High Scalability | Google Big Table
Consistency | MongoDB, Google Big Table, Redis, Hbase
Auto-Sharding | MongoDB
Write Frequently, Read Less | MongoDB, Redis, Cassandra
Fault Tolerant (No Single Point Of Failure) | Riak
Concurrency Control (MVCC) | Riak, Dynamo, CouchDB, Cassandra, Google Big Table
Concurrency Control (Locks) | MongoDB, Redis, Google Big Table
VIII. CONCLUSIONS
The industry has been dominated by relational or SQL
databases for several years. However, with business
situations recently having the need to store and process
large datasets for business analytics, NoSQL databases
provide the answer to overcome such challenges.
NoSQL offers schema-less data stores and transactions that
allow businesses to freely add fields to records without
the requirement of defining the schema a priori,
which is a prime constraint in SQL databases. With the
growing need to manage large data and unstructured
business transactions via avenues such as social networks,
NoSQL graphs are well suited for data that has complex
relationship structures, and at the same time simplicity is
achieved through key-value stores. NoSQL data models
provide options for storing unstructured data as
document-oriented, key-value pairs, column-oriented or
graphs. These NoSQL storage models are easy to
understand and implement and do not require complex
SQL optimization techniques to perform Big Data
analytics. This paper has compared SQL versus NoSQL
databases as well as the four data models of NoSQL in
the context of Big Data analytics for business situations.
We conclude that the flexible data modelling of NoSQL
is well suited to support dynamic scalability and
improved performance for Big Data analytics and could
be leveraged as new categories of data architectures
coexisting with traditional SQL databases.
REFERENCES
[1] Choi, Y., Jeon, W., & Yo, S. (2014), 'Imp roving Database
Sy stem Performance by App ly ing NoSQL', Journal Of
Information Processing Systems, 10(3), 355-364.
[2] M oniruzzaman, A. B., & Hossain, S. A. (2013), No SQL
database: New er a of d atabases for big data analy tics -
Classification, char acteristics and comp arison.
International Journal o f Database Theory and
Application, 6(4), 1-14.
[3] Bazar, C., & Losif, C. (2014), 'The Transition from
RDBM S to NoSQL. A Comp arative Analy sis of Three
Pop ular Non-Relational
Solution
s: Cassandra, M ongoDB
and Couchbase', Database Systems Journal, 5(2), 49-59.
[4] Floratou, A., Teletia, N., Dewitt, D., Patel, J., & Zhang, D.
(2012), 'Can the Elep hants Handle the NoSQL
Onslaught?', VLDB Endowment, 5(12), 1712-1723.
[5] M ason, R. T. (2015), 'NoSQL databases and data
modelin g techniqu es for a document -oriented NoSQL
database', Proceedings of In forming Science & IT
Education Conference (InSITE) 2015, 259-268.
[6] Pothuganti, A. (2015) 'Big Data Analy tics: Hadoop -M ap
Reduce & NoSQL Databases', International Journal of
Computer Scien ce and Information Technologies, 6(1),
522-527.
[7] Smolan, R. & Erwit, J. (2012), The Human face of Big
Data, Against all odds p roduction, O‟Reilly , USA.
[8] Sharda, R., Delen, D., & Turban, E. (2015), Business
intelligen ce and analytics: systems for decision support
(10th ed.). Up p er Saddle River, NJ: Pearson.
[9] Ohlhorst, F. (2013), Big data analytics: Turning big data
into big money. Hoboken, NJ. John Wiley and Sons.
[10] Kaur, P.D., Kaur, A. & Kaur, S. (2015), „Performance
Analy sis in Bigdata‟, International Journal of In formation
Technology and Computer Science (IJITCS), 7(11), 55-61.
[11] Prasad, A, & Gohil, B. (2014), 'A Co mp arative Study of
NoSQL Databases', International Journal Of Advanced
Research In Comp uter Science, 5(5), 170-176.
[12] Nay ak, A., Poriy a, A. & Poojary , D. (2013), „Typ e of
NOSQL Databases and its Comp arison with Relational
Databases‟, International Journal o f Applied In formation
Systems (IJAIS), 5(4) Foundation of Computer Science
FCS, New York, USA.
[13] M ongoDB (2014), „Why NoSQL?‟,
https://www.mongodb.com/nosql-exp lained, [Online:
accessed 20-Feb-2016]
[14] Fotache, M ., & Cogean, D. (2013), 'No SQL and SQL
Databases for M obile App lications. Case Study :
M ongoDB versus PostgreSQL', Informatica Economica,
17(2), 41-58.
[15] Ullah, M d A. (2015), „A Digital Library for Plant
Information with Performance Comp arison between a
Relational Database and a NoSQL Database (RDF Trip le
Store)‟, Technical Library, Paper 205.
[16] Hurst, N. (2010, ‘Visual Guide to NoSQL Systems‟,
http ://blog.nahurst.com/v isual- guid e-to-nosql-systems,
[Online: accessed 5-Nov-2015]
[17] Planet Cassandra ‘NoSQL Databases Defin ed and
Explain ed’, http ://www.p lanetcassandra.org/what-is-
nosql,[Online: accessed 24-M ar-2016]
[18] Klein, J., Gorton, I., Ernst, N. & Donohoe, P. (2015),
'Performance Evalu ation of NoSQL Databases: A Case
Study ', Proceedings of the 1st Workshop on Performance
Analysis of Big Data Systems, PABS’15, Austin, 5-10.
[19] M ongoDB (2015), ‘Top 5 Considerations When
Evaluating NoSQL Databases’,
https://s3.amazonaws.com/ info- mon godb-
com/10 gen_Top _5_NoSQL_Considerations.p df [Online:
accessed 5-Nov-2015]
[20] Zhikun, C., Shuqian g, Y., Shu an, T., Hui, Z., Li., Ge,
Z.,&
Huiy u, Z. (2014), „The Data Allocation Strategy Based on
Load in NoSQL Database’, Applied Mechanics and
Materials, 513-517, 1464-1469.
[21] Leavitt, N. (2010), „Will No SQL Databases Liv e Up to
Their Promise?‟, IEEE Computer 43(2) , 12-14.
[22] Subramanian, S. (2012), ‘NoSQL: An Ana lysis of th e
Strengths and Weaknesses’,
https://dzone.com/articles/nosql-an aly sis-strengths-and,
[Online: accessed 15-Jan-2016]
[23] Prasad B.R. & Agarwal S. (2016), 'Co mp arative Study of
Big Data Comp uting and Storage Tools: A Review',
International Journal o f Database Theory and
Application 9(1), 45-66.
[24] Warden P. (2012), Big Da ta Glossary - A Guide to th e
New Generation of Data Tools, O‟Reilly , USA.
[25] Zareian S., Fokaefs, M ., Khazaei H. Litoiu M . & Zhang
X.
(2016), 'A Big Data Framework for C loud M onitoring',
Proceedings of the 2nd In ternationa l Workshop on BIG
Data Software Engineering (BIGDSE'16), ACM Digital
Library, 58-64.
[26] Ramanathan, V. & Venkatraman, S. (2015), C loud
Adoption in Enterprises: Security Issues and Strateg ies,
96-121, Book Chap ter In Haider A. and Pishdad A. (Eds.),
Business Technologies in Contemp orary Organizations:
Adoption, Assimilation, and Institutionalization, IGI
Global Publishers, USA.
Authors' Profiles
Dr. Sitalakshmi Venkatraman obtained
doctoral degree in Computer Science, from
National Institute of Industrial Engineering,
India in 1993 and MEd from University of
Sheffield, UK in 2001. Prior to this, she had
completed MSc in Mathematics in 1985 and
MTech in Computer Science in 1987, both
from Indian Institute of Technology, Madras,
India. This author is a Senior Member (SM) of IASCIT.
In the past 25 years, Sita's work experience involves both
industry and academics - developing turnkey projects for IT
industry and teaching a variety of IT courses for tertiary
institutions, in India, Singapore, New Zealand, and more
recently in Australia since 2007. She currently works as
Lecturer (Information Technology) at the School of Engineering,
Construction & Design, Melbourne Polytechnic, Australia. She
also serves as Member of Register of Experts at Australia's
Tertiary Education Quality and Standards Agency (TEQSA).
Sita has published eight book chapters and more than 100
research papers in internationally well-known refereed journals
and conferences that include Information Sciences, Journal of
Artificial Intelligence in Engineering, International Journal of
Business Information Systems, and Information Management &
Computer Security. She serves as Program Committee Member
of several international conferences and Senior Member of
professional societies and editorial board of three international
journals.
Kiran Fahd received the B.Eng in software
engineering from the National University of
Emerging Technologies, Pakistan in 2001,
and the Master's degree in Enterprise
Planning System - ERP from the Victoria
University, Melbourne in 2010. Since 2001,
she has worked in the capacity of software
engineer and as a teacher.
Kiran has held various lecturing positions in Australian and
overseas universities. She currently teaches the subjects of
Bachelor of Information Technology under the Software
Development major at the School of Engineering, Construction
& Design, Melbourne Polytechnic, Australia.
Dr. Samuel Kaspi earned his PhD (Computer
Science) from Victoria University, a Masters
of Computer Science from Monash University
and a Bachelor of Economics and Politics
from Monash University. He is a member of
Australian Computer Society (ACS) and
Association for Computing Machinery
(ACM).
Sam is currently the Information Technology Discipline
Leader and Senior Lecturer of IT at the School of Engineering,
Construction & Design, Melbourne Polytechnic, Australia.
Previously, Dr Kaspi taught at Victoria University, consulted
privately and was the CIO of OzMiz Pty Ltd.
Sam has been active in both teaching and private enterprise in
the areas of software specification, design and development. As
chief information officer (CIO) of a small private company he
managed the development and submission of five granted and
three pending patents. He also managed the submission of a
successful Federal Government Comet grant under the
Commercialising Emerging Technologies category. He has also
had a number of peer reviewed publications including the
Institute of Electrical and Electronics Engineers (IEEE).
Dr. Ramanathan Venkatraman is
working as Member, Advanced
Technology Application Practice at
National University of Singapore. He has
served industry and academia for more than
32 years and has a wide spectrum of
experience in the fields of IT and business
process engineering. His current research
focuses in evolving decision models for business problems and
more recently, he has been contributing to frontiers of
knowledge by devising innovative architectural models for ICT
in domains such as Service Oriented Architecture, Big Data and
Enterprise Cloud Computing. Dr Venkatraman has a strong
practice approach having worked in large scale IT projects
across Asia, US, Europe and NZ. He has published more than
20 research papers in leading journals. Apart from research and
consulting, he also teaches advanced technical courses for
Masters program at NUS and has been a key architect in setting
up innovative software engineering and business analytics
curriculum in the fast changing IT education scenario.
How to cite this paper: Sitalakshmi Venkatraman, Kiran Fahd,
Samuel Kaspi, Ramanathan Venkatraman, "SQL Versus
NoSQL Movement with Big Data Analytics", International
Journal of Information Technology and Computer
Science (IJITCS), Vol.8, No.12, pp.59-66, 2016. DOI:
10.5815/ijitcs.2016.12.07
Rubric for Journal Entries
Student must meet a tier in the Check+ column to receive a Check+,
starting with Format (Level 1), then Conceptualization (Level 2), and
then with Application (Level 3). If student does not meet the first
level in the check+ column, he or she will be moved to the Check column
and the same process will begin again. The higher bullet points within
each column will receive a higher grade within that box. For instance,
within the Check+ box for Conceptualization, having 4 concepts clearly
listed from the chapter will receive a solid A while listing only 3
concepts may receive an A-.
Criterion Check+ (A grade) Check (B grade) Check– (C grade)
Format
(Level 1)
10% of grade
1.25 points possible
per journal entry
• About one page (but no more), 12 font,
regular margins, double spaced.
• Most of the above is correct but there are a
few small problems
• Minimum half page, typed, but
no more than one page, 12 font,
regular margins, double spaced
• Not a full page. Some other small
problems.
• Less than half a page, or
irregular margins, spacing, or
font
• More problems than above
Conceptualization
(Level 2)
35% of Grade
4.75 points possible
per journal entry
• 4 concepts clearly listed1 from the chapter2
relating to score on questionnaire
• 3 concepts clearly listed
1 Clearly listed = page numbers are cited to show
where concept is found & concepts are defined.
2 Referencing content from assessment does not
count as a concept listed. That info may be used
in the entry, but only concepts found in the
chapter and cited will be counted.
• 2 concepts listed from the
chapter relating to score on
questionnaire
• 1 concept listed from the
chapter
• Concepts are definitely there but
not clearly defined or connected
to questionnaire score
• Concept possibly included but
not clear or from the wrong
chapter and not connected to the
questionnaire
• No concept listed from the
chapter relating to score on
questionnaire
Application
(Level 3)
55% of Grade
7.50 points possible
per journal entry
• Clearly defined implications for how each
concept applies to the student’s success in
the workplace. Implications are thoughtful
and detailed – they explain how you
should change by applying them.
• Implications are clearly explored in depth
but not connected very well to concepts.
Implications somewhat thoughtful.
• Implications clearly explored for
some but not all concepts.
Implications not quite as
thoughtful.
• Vaguely explored for how
concept applies to the student’s
success in the workplace.
• Implications are in the text but
are implicit and hard to discern.
• No implications of concept and
how it applies to the student’s
success in the workplace
• Can’t even draw implicit
implications from the journal
entry.
Information Retrieval P. BAXENDALE, Editor
A Relational Model of Data for
Large Shared Data Banks
E. F. CODD
IBM Research Laboratory, San Jose, California
Future users of large data banks must be protected from
having to know how the data is organized in the machine (the
internal representation). A prompting service which supplies
such information is not a satisfactory solution. Activities of users
at terminals and most application programs should remain
unaffected when the internal representation of data is changed
and even when some aspects of the external representation
are changed. Changes in data representation will often be
needed as a result of changes in query, update, and report
traffic and natural growth in the types of stored information.
Existing noninferential, formatted data systems provide users
with tree-structured files or slightly more general network
models of the data. In Section 1, inadequacies of these models
are discussed. A model based on n-ary relations, a normal
form for data base relations, and the concept of a universal
data sublanguage are introduced. In Section 2, certain opera-
tions on relations (other than logical inference) are discussed
and applied to the problems of redundancy and consistency
in the user’s model.
KEY WORDS AND PHRASES: data bank, data base, data structure, data
organization, hierarchies of data, networks of data, relations, derivability,
redundancy, consistency, composition, join, retrieval language, predicate
calculus, security, data integrity
CR CATEGORIES: 3.70, 3.73, 3.75, 4.20, 4.22, 4.29
1. Relational Model and Normal Form
1.1. INTRODUCTION
This paper is concerned with the application of ele-
mentary relation theory to systems which provide shared
access to large banks of formatted data. Except for a paper
by Childs [1], the principal application of relations to data
systems has been to deductive question-answering systems.
Levein and Maron [2] provide numerous references to work
in this area.
In contrast, the problems treated here are those of data
independence (the independence of application programs
and terminal activities from growth in data types and
changes in data representation) and certain kinds of data
inconsistency which are expected to become troublesome
even in nondeductive systems.
Volume 13 / Number 6 / June, 1970
The relational view (or model) of data described in
Section 1 appears to be superior in several respects to the
graph or network model [3,4] presently in vogue for non-
inferential systems. It provides a means of describing data
with its natural structure only, that is, without superimposing
any additional structure for machine representation
purposes. Accordingly, it provides a basis for a high level
data language which will yield maximal independence be-
tween programs on the one hand and machine representa-
tion and organization of data on the other.
A further advantage of the relational view is that it
forms a sound basis for treating derivability, redundancy,
and consistency of relations; these are discussed in Section
2. The network model, on the other hand, has spawned a
number of confusions, not the least of which is mistaking
the derivation of connections for the derivation of rela-
tions (see remarks in Section 2 on the “connection trap”).
Finally, the relational view permits a clearer evaluation
of the scope and logical limitations of present formatted
data systems, and also the relative merits (from a logical
standpoint) of competing representations of data within a
single system. Examples of this clearer perspective are
cited in various parts of this paper. Implementations of
systems to support the relational model are not discussed.
1.2. DATA DEPENDENCIES IN PRESENT SYSTEMS
The provision of data description tables in recently de-
veloped information systems represents a major advance
toward the goal of data independence [5,6,7]. Such tables
facilitate changing certain characteristics of the data repre-
sentation stored in a data bank. However, the variety of
data representation characteristics which can be changed
without logically impairing some application programs is
still quite limited. Further, the model of data with which
users interact is still cluttered with representational prop-
erties, particularly in regard to the representation of col-
lections of data (as opposed to individual items). Three of
the principal kinds of data dependencies which still need
to be removed are: ordering dependence, indexing depend-
ence, and access path dependence. In some systems these
dependencies are not clearly separable from one another.
1.2.1. Ordering Dependence. Elements of data in a
data bank may be stored in a variety of ways, some involv-
ing no concern for ordering, some permitting each element
to participate in one ordering only, others permitting each
element to participate in several orderings. Let us consider
those existing systems which either require or permit data
elements to be stored in at least one total ordering which is
closely associated with the hardware-determined ordering
of addresses. For example, the records of a file concerning
parts might be stored in ascending order by part serial
number. Such systems normally permit application pro-
grams to assume that the order of presentation of records
from such a file is identical to (or is a subordering of) the
Communications of the ACM 377
stored ordering. Those application programs which take
advantage of the stored ordering of a file are likely to fail
to operate correctly if for some reason it becomes necessary
to replace that ordering by a different one. Similar remarks
hold for a stored ordering implemented by means of
pointers.
It is unnecessary to single out any system as an example,
because all the well-known information systems that are
marketed today fail to make a clear distinction between
order of presentation on the one hand and stored ordering
on the other. Significant implementation problems must be
solved to provide this kind of independence.
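The failure mode described above can be sketched in a few lines. This is only an illustration: the record layout, the part names, and both function names are assumptions introduced here, not anything taken from an existing system.

```python
# Hypothetical parts file, stored in ascending order by part serial number.
stored_by_serial = [(101, "bolt"), (205, "gear"), (310, "axle")]

# The same records after the stored ordering is replaced by ordering on name.
stored_by_name = sorted(stored_by_serial, key=lambda record: record[1])


def highest_serial_dependent(records):
    """Assumes presentation order equals ascending serial order."""
    return records[-1][0]


def highest_serial_independent(records):
    """Makes no assumption about the stored ordering."""
    return max(serial for serial, _name in records)


print(highest_serial_dependent(stored_by_serial))    # correct by accident of ordering
print(highest_serial_dependent(stored_by_name))      # silently wrong after the change
print(highest_serial_independent(stored_by_name))    # unaffected by the change
```

The order-dependent function does not raise an error when the ordering changes; it simply returns a wrong answer, which is why this kind of dependence is hard to detect.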
1.2.2. Indexing Dependence. In the context of for-
matted data, an index is usually thought of as a purely
performance-oriented component of the data representa-
tion. It tends to improve response to queries and updates
and, at the same time, slow down response to insertions
and deletions. From an informational standpoint, an index
is a redundant component of the data representation. If a
system uses indices at all and if it is to perform well in an
environment with changing patterns of activity on the data
bank, an ability to create and destroy indices from time to
time will probably be necessary. The question then arises:
Can application programs and terminal activities remain
invariant as indices come and go?
Present formatted data systems take widely different
approaches to indexing. TDMS [7] unconditionally pro-
vides indexing on all attributes. The presently released
version of IMS [5] provides the user with a choice for each
file: a choice between no indexing at all (the hierarchic se-
quential organization) or indexing on the primary key
only (the hierarchic indexed sequential organization). In
neither case is the user’s application logic dependent on the
existence of the unconditionally provided indices. IDS
[8], however, permits the file designers to select attributes
to be indexed and to incorporate indices into the file struc-
ture by means of additional chains. Application programs
taking advantage of the performance benefit of these in-
dexing chains must refer to those chains by name. Such pro-
grams do not operate correctly if these chains are later
removed.
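The question of whether programs can remain invariant as indices come and go can be made concrete with a small sketch. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in for a system in which queries never name an index; SQLite is not one of the systems discussed here, and the table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (part_no INTEGER PRIMARY KEY, name TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [(1, "bolt", "red"), (2, "gear", "blue"), (3, "axle", "red")],
)

# The application's query never refers to an index by name.
QUERY = "SELECT part_no FROM part WHERE color = ? ORDER BY part_no"


def red_parts():
    return [row[0] for row in conn.execute(QUERY, ("red",))]


before = red_parts()
conn.execute("CREATE INDEX part_color_idx ON part (color)")   # an index comes...
with_index = red_parts()
conn.execute("DROP INDEX part_color_idx")                     # ...and goes
after = red_parts()

print(before, with_index, after)
```

Because the query text is identical in all three calls, only performance can differ between them, in contrast with the chain-by-name references described for IDS.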
1.2.3. Access Path Dependence. Many of the existing
formatted data systems provide users with tree-structured
files or slightly more general network models of the data.
Application programs developed to work with these sys-
tems tend to be logically impaired if the trees or networks
are changed in structure. A simple example follows.
Suppose the data bank contains information about parts
and projects. For each part, the part number, part name,
part description, quantity-on-hand, and quantity-on-order
are recorded. For each project, the project number, project
name, project description are recorded. Whenever a project
makes use of a certain part, the quantity of that part com-
mitted to the given project is also recorded. Suppose that
the system requires the user or file designer to declare or
define the data in terms of tree structures. Then, any one
of the hierarchical structures may be adopted for the information
mentioned above (see Structures 1-5).
Structure 1. Projects Subordinate to Parts
File  Segment    Fields
F     PART       part #
                 part name
                 part description
                 quantity-on-hand
                 quantity-on-order
        PROJECT  project #
                 project name
                 project description
                 quantity committed

Structure 2. Parts Subordinate to Projects
File  Segment    Fields
F     PROJECT    project #
                 project name
                 project description
        PART     part #
                 part name
                 part description
                 quantity-on-hand
                 quantity-on-order
                 quantity committed

Structure 3. Parts and Projects as Peers
Commitment Relationship Subordinate to Projects
File  Segment    Fields
F     PART       part #
                 part name
                 part description
                 quantity-on-hand
                 quantity-on-order
G     PROJECT    project #
                 project name
                 project description
        PART     part #
                 quantity committed

Structure 4. Parts and Projects as Peers
Commitment Relationship Subordinate to Parts
File  Segment    Fields
F     PART       part #
                 part description
                 quantity-on-hand
                 quantity-on-order
        PROJECT  project #
                 quantity committed
G     PROJECT    project #
                 project name
                 project description

Structure 5. Parts, Projects, and
Commitment Relationship as Peers
File  Segment    Fields
F     PART       part #
                 part name
                 part description
                 quantity-on-hand
                 quantity-on-order
G     PROJECT    project #
                 project name
                 project description
H     COMMIT     part #
                 project #
                 quantity committed
Now, consider the problem of printing out the part
number, part name, and quantity committed for every part
used in the project whose project name is “alpha.” The
following observations may be made regardless of which
available tree-oriented information system is selected to
tackle this problem. If a program P is developed for this
problem assuming one of the five structures above (that
is, P makes no test to determine which structure is in
effect), then P will fail on at least three of the remaining
structures. More specifically, if P succeeds with structure 5,
it will fail with all the others; if P succeeds with structure 3
or 4, it will fail with at least 1, 2, and 5; if P succeeds with
1 or 2, it will fail with at least 3, 4, and 5. The reason is
simple in each case. In the absence of a test to determine
which structure is in effect, P fails because an attempt is
made to execute a reference to a nonexistent file (available
systems treat this as an error) or no attempt is made to
execute a reference to a file containing needed information.
The reader who is not convinced should develop sample
programs for this simple problem.
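One such sample program can be sketched as follows. It is a simplified model, not code for any actual tree-oriented system: each structure is represented as a Python dictionary of files, only the fields needed for the "alpha" query are kept, and the sample values and the name `program_p` are assumptions made for illustration.

```python
# Structure 5: parts, projects, and commitments as peer files F, G, and H.
structure_5 = {
    "F": [{"part #": 1, "part name": "bolt"}, {"part #": 2, "part name": "gear"}],
    "G": [{"project #": 10, "project name": "alpha"}],
    "H": [{"part #": 1, "project #": 10, "quantity committed": 40}],
}

# Structure 1: a single file F with projects subordinate to parts.
structure_1 = {
    "F": [
        {"part #": 1, "part name": "bolt",
         "PROJECT": [{"project #": 10, "project name": "alpha",
                      "quantity committed": 40}]},
        {"part #": 2, "part name": "gear", "PROJECT": []},
    ],
}


def program_p(bank):
    """Written against Structure 5; makes no test of which structure is in effect."""
    alpha = {g["project #"] for g in bank["G"] if g["project name"] == "alpha"}
    names = {f["part #"]: f["part name"] for f in bank["F"]}
    return [
        (h["part #"], names[h["part #"]], h["quantity committed"])
        for h in bank["H"]
        if h["project #"] in alpha
    ]


print(program_p(structure_5))
try:
    program_p(structure_1)
except KeyError as missing_file:
    print("P fails: reference to nonexistent file", missing_file)
```

Under Structure 1 the program fails on its first reference to file G, which does not exist there; this corresponds to the first of the two failure reasons given above.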
Since, in general, it is not practical to develop applica-
tion programs which test for all tree structurings permitted
by the system, these programs fail when a change in
structure becomes necessary.
Systems which provide users with a network model of
the data run into similar difficulties. In both the tree and
network cases, the user (or his program) is required to
exploit a collection of user access paths to the data. It does
not matter whether these paths are in close correspondence
with pointer-defined paths in the stored representation: in
IDS the correspondence is extremely simple, in TDMS it is
just the opposite. The consequence, regardless of the stored
representation, is that terminal activities and programs be-
come dependent on the continued existence of the user
access paths.
One solution to this is to adopt the policy that once a
user access path is defined it will not be made obsolete un-
til all application programs using that path have become
obsolete. Such a policy is not practical, because the number
of access paths in the total model for the community of
users of a data bank would eventually become excessively
large.
1.3. A RELATIONAL VIEW OF DATA
The term relation is used here in its accepted mathematical
sense. Given sets $S_1, S_2, \ldots, S_n$ (not necessarily
distinct), R is a relation on these n sets if it is a set of n-tuples
each of which has its first element from $S_1$, its
second element from $S_2$, and so on.1 We shall refer to $S_j$ as
the jth domain of R. As defined above, R is said to have
degree n. Relations of degree 1 are often called unary, degree
2 binary, degree 3 ternary, and degree n n-ary.
1 More concisely, R is a subset of the Cartesian product
$S_1 \times S_2 \times \cdots \times S_n$.
For expository reasons, we shall frequently make use of
an array representation of relations, but it must be remembered
that this particular representation is not an essential
part of the relational view being expounded. An array
which represents an n-ary relation R has the following
properties:
(1) Each row represents an n-tuple of R.
(2) The ordering of rows is immaterial.
(3) All rows are distinct.
(4) The ordering of columns is significant: it corresponds
to the ordering $S_1, S_2, \ldots, S_n$ of the domains
on which R is defined (see, however, remarks
below on domain-ordered and domain-unordered
relations).
(5) The significance of each column is partially conveyed
by labeling it with the name of the corresponding
domain.
The example in Figure 1 illustrates a relation of degree
4, called supply, which reflects the shipments-in-progress
of parts from specified suppliers to specified projects in
specified quantities.
supply (supplier part project quantity)
1 2 5 17
1 3 5 23
2 3 7 9
2 7 5 4
4 1 1 12
FIG. 1. A relation of degree 4
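As a rough sketch, the supply relation of Figure 1 can be modelled as a Python set of 4-tuples, which gives distinct rows and immaterial row ordering for free. The finite ranges chosen for the four domains are assumptions made only so that the Cartesian product can be enumerated; they are not specified by the figure.

```python
from itertools import product

# Rows of Figure 1: (supplier, part, project, quantity).
supply = {
    (1, 2, 5, 17),
    (1, 3, 5, 23),
    (2, 3, 7, 9),
    (2, 7, 5, 4),
    (4, 1, 1, 12),
}

# Assumed finite domains, chosen just large enough to cover the figure.
supplier = set(range(1, 5))
part = set(range(1, 8))
project = set(range(1, 8))
quantity = set(range(1, 24))

degree = len(next(iter(supply)))
is_relation = supply <= set(product(supplier, part, project, quantity))

# Listing the same rows in another order, or twice, yields the same relation.
reordered = set(reversed(sorted(supply))) | {(1, 2, 5, 17)}

print(degree, is_relation, reordered == supply)
```

Column order still matters in this representation, since each tuple position stands for one domain; this is the point taken up in the discussion that follows.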
One might ask: If the columns are labeled by the name
of corresponding domains, why should the ordering of col-
umns matter? As the example in Figure 2 shows, two col-
umns may have identical headings (indicating identical
domains) but possess distinct meanings with respect to the
relation. The relation depicted is called component. It is a
ternary relation, whose first two domains are called part
and third domain is called quantity. The meaning of
component (x, y, z) is that part x is an immediate component
(or subassembly) of part y, and z units of part x are needed
to assemble one unit of part y. It is a relation which plays
a critical role in the parts explosion problem.
component (part part quantity)
1 5 9
2 5 7
3 5 2
2 6 12
3 6 3
4 7 1
6 7 1
FIG. 2. A relation with two identical domains
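To show why the component relation matters for the parts explosion problem, here is a minimal sketch that expands a part into the base parts it ultimately requires, using the rows of Figure 2. The function name `explode` and the recursive approach are illustrative choices, and the sketch assumes the component relation contains no cycles.

```python
from collections import defaultdict

# Rows of Figure 2: (sub part, super part, quantity).
component = {
    (1, 5, 9),
    (2, 5, 7),
    (3, 5, 2),
    (2, 6, 12),
    (3, 6, 3),
    (4, 7, 1),
    (6, 7, 1),
}


def explode(part, units=1):
    """Return {base part: total units} needed to assemble `units` of `part`."""
    immediate = [(sub, qty) for sub, sup, qty in component if sup == part]
    if not immediate:               # a part with no components is a base part
        return {part: units}
    totals = defaultdict(int)
    for sub, qty in immediate:
        for base, needed in explode(sub, units * qty).items():
            totals[base] += needed
    return dict(totals)


print(explode(7))
print(explode(5))
```

The two part columns carry the same heading but opposite roles, and the recursion depends on telling them apart: it always walks from the second column (the assembly) to the first (its component).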
It is a remarkable fact that several existing information
systems (chiefly those based on tree-structured files) fail
to provide data representations for relations which have
two or more identical domains. The present version of
IMS/360 [5] is an example of such a system.
The totality of data in a data bank may be viewed as a
collection of time-varying relations. These relations are of
assorted degrees. As time progresses, each n-ary relation
may be subject to insertion of additional n-tuples, deletion
of existing ones, and alteration of components of any of its
existing n-tuples.
In many commercial, governmental, and scientific data
banks, however, some of the relations are of quite high de-
gree (a degree of 30 is not at all uncommon). Users should
not normally be burdened with remembering the domain
ordering of any relation (for example, the ordering supplier,
then part, then project, then quantity in the relation supply).
Accordingly, we propose that users deal, not with relations
which are domain-ordered, but with relationships which are
their domain-unordered counterparts.2 To accomplish this,
domains must be uniquely identifiable at least within any
given relation, without using position. Thus, where there
are two or more identical domains, we require in each case
that the domain name be qualified by a distinctive role
name, which serves to identify the role played by that
domain in the given relation. For example, in the relation
component of Figure 2, the first domain part might be
qualified by the role name sub, and the second by super, so
that users could deal with the relationship component and
its domains (sub.part, super.part, quantity) without regard
to any ordering between these domains.
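A small sketch may help to contrast the two notions. Below, a domain-ordered relation is a set of positional tuples, while a relationship is modelled as a set of rows keyed by role-qualified domain names. The helper name `relationship` and the use of frozensets are assumptions of this sketch, not part of the proposal itself.

```python
def relationship(rows):
    """Domain-unordered form: each row maps role-qualified names to values."""
    return {frozenset(row.items()) for row in rows}


# The same two facts from Figure 2, written with the domains in different orders.
sub_first = relationship([
    {"sub.part": 1, "super.part": 5, "quantity": 9},
    {"sub.part": 2, "super.part": 5, "quantity": 7},
])
quantity_first = relationship([
    {"quantity": 9, "super.part": 5, "sub.part": 1},
    {"quantity": 7, "super.part": 5, "sub.part": 2},
])

# Positional (domain-ordered) tuples with the two part columns exchanged.
ordered_a = {(1, 5, 9), (2, 5, 7)}
ordered_b = {(5, 1, 9), (5, 2, 7)}

print(sub_first == quantity_first)   # role names make column order irrelevant
print(ordered_a == ordered_b)        # positions alone do not
```

With role names the user never has to remember which part column comes first, which is the burden the proposal aims to remove.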
To sum up, it is proposed that most users should interact
with a relational model of the data consisting of a collection
of time-varying relationships (rather than relations). Each
user need not know more about any relationship than its
name together with the names of its domains (role qualified
whenever necessary).3 Even this information might be
offered in menu style by the system (subject to security
and privacy constraints) upon request by the user.
There are usually many alternative ways in which a re-
lational model may be established for a data bank. In
order to discuss a preferred way (or normal form), we
must first introduce a few additional concepts (active
domain, primary key, foreign key, nonsimple domain)
and establish some links with terminology currently in use
in information systems programming. In the remainder of
this paper, we shall not bother to distinguish between re-
lations and relationships except where it appears advan-
tageous to be explicit.
Consider an example of a data bank which includes rela-
tions concerning parts, projects, and suppliers. One rela-
tion called part is defined on the following domains:
(1) part number
(2) part name
(3) part color
(4) part weight
(5) quantity on hand
(6) quantity on order
and possibly other domains as well. Each of these domains
is, in effect, a pool of values, some or all of which may be
represented in the data bank at any instant. While it is
conceivable that, at some instant, all part colors are present,
it is unlikely that all possible part weights, part
names, and part numbers are. We shall call the set of
values represented at some instant the active domain at that
instant.
2 In mathematical terms, a relationship is an equivalence class of
those relations that are equivalent under permutation of domains
(see Section 2.1.1).
3 Naturally, as with any data put into and retrieved from a computer
system, the user will normally make far more effective use
of the data if he is aware of its meaning.
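The distinction between a domain and its active domain can be sketched as follows. The pool of colors and the three sample rows are hypothetical, and only three of the part relation's domains are modelled.

```python
# Hypothetical pool of permissible values for the part color domain.
PART_COLOR_DOMAIN = {"red", "green", "blue", "black"}

# Hypothetical snapshot of the part relation: (part number, part name, part color).
part = {
    (1, "bolt", "red"),
    (2, "gear", "blue"),
    (3, "axle", "red"),
}


def active_domain(relation, position):
    """Values of one domain that are represented in the relation at this instant."""
    return {row[position] for row in relation}


active_colors = active_domain(part, 2)
print(active_colors <= PART_COLOR_DOMAIN)    # the active domain lies within the pool
print(active_colors == PART_COLOR_DOMAIN)    # but need not exhaust it
```

Inserting or deleting rows changes the active domain from one instant to the next, while the underlying domain stays fixed.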
Normally, one domain (or combination of domains) of a
given relation has values which uniquely identify each ele-
ment (n-tuple) of that relation. Such a domain (or com-
bination) is called a primary key. In the example above,
part number would be a primary key, while part color
would not be. A primary key is nonredundant if it is either
a simple domain (not a combination) or a combination
such that none of the participating simple domains is
superfluous in uniquely identifying each element. A rela-
tion may possess more than one nonredundant primary
key. This would be the case in the example if different parts
were always given distinct names. Whenever a relation
has two or more nonredundant primary keys, one of them
is arbitrarily selected and called the primary key of that re-
lation.
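The definitions above can be explored with a short sketch that searches one snapshot of a relation for nonredundant identifying combinations. An important caveat: a primary key is a property that must hold for the relation as it varies over time, whereas this code can only inspect a single instant, so it may report combinations that are unique by coincidence. The sample rows and function names are assumptions.

```python
from itertools import combinations

# Hypothetical snapshot of the part relation: (part number, part name, part color).
part = {
    (1, "bolt", "red"),
    (2, "gear", "blue"),
    (3, "axle", "red"),
}


def identifies(relation, positions):
    """True if the values at `positions` are distinct across all rows."""
    projected = [tuple(row[i] for i in positions) for row in relation]
    return len(set(projected)) == len(projected)


def nonredundant_keys(relation, degree):
    """Identifying combinations from which no domain can be dropped."""
    keys = []
    for size in range(1, degree + 1):
        for positions in combinations(range(degree), size):
            contains_smaller_key = any(set(k) <= set(positions) for k in keys)
            if not contains_smaller_key and identifies(relation, positions):
                keys.append(positions)
    return keys


print(nonredundant_keys(part, 3))
```

In this snapshot both the part number and the part name identify each row, matching the case in which different parts are always given distinct names; one of the two would then be selected arbitrarily as the primary key.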
A common requirement is for elements of a relation to
cross-reference other elements of the same relation or ele-
ments of a different relation. Keys provide a user-oriented
means (but not the only means) of expressing such cross-references.
We shall call a domain (or domain combination)
of relation R a foreign key if it is not the primary key
of R but its elements are values of the primary key of some
relation S (the possibility that S and R are identical is not
excluded). In the relation supply of Figure 1, the combina-
tion of supplier, part, project is the primary key, while each
of these three domains taken separately is a foreign key.
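The cross-referencing condition in this definition can be sketched as a simple containment check. The supply rows come from Figure 1; the supplier relation, its names, and the function `references` are hypothetical additions for illustration.

```python
# Rows of Figure 1: (supplier, part, project, quantity).
supply = {
    (1, 2, 5, 17),
    (1, 3, 5, 23),
    (2, 3, 7, 9),
    (2, 7, 5, 4),
    (4, 1, 1, 12),
}

# Hypothetical supplier relation whose primary key is its first domain.
supplier = {(1, "Acme"), (2, "Bolt Co"), (4, "Gearworks")}


def references(r, r_position, s, s_key_position):
    """True if every value at r_position in R is a primary key value of S."""
    key_values = {row[s_key_position] for row in s}
    return all(row[r_position] in key_values for row in r)


consistent = references(supply, 0, supplier, 0)
dangling = references(supply | {(3, 1, 1, 5)}, 0, supplier, 0)
print(consistent, dangling)
```

The second check fails because supplier 3 does not appear in the hypothetical supplier relation, so the added row would cross-reference nothing.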
In previous work there has been a strong tendency to
treat the data in a data bank as consisting of two parts, one
part consisting of entity descriptions (for example, descrip-
tions of suppliers) and the other part consisting of rela-
tions between the various entities or types of entities (for
example, the supply relation). This distinction is difficult
to maintain when one may have foreign keys in any rela-
tion whatsoever. In the user’s relational model there ap-
pears to be no advantage to making such a distinction
(there may be some advantage, however, when one applies
relational concepts to machine representations of the user’s
set of relationships).
So far, we have discussed examples of relations which are
defined on simple domains, that is, domains whose elements are
atomic (nondecomposable) values. Nonatomic values can
be discussed within the relational framework. Thus, some
domains may have relations as elements. These relations
may, in turn, be defined on nonsimple domains, and so on.
For example, one of the domains on which the relation em-
ployee is defined might be salary history. An element of the
salary history domain is a binary relation defined on the do-
main date and the domain salary. The salary history domain
is the set of all such binary relations. At any instant of time
there are as many instances of the salary history relation
in the data bank as there are employees. In contrast, there
is only one instance of the employee relation.
The terms attribute and repeating group in present data
base terminology are roughly analogous to simple domain
380 Communications of the ACM Volume 13 / Number 6 / June,
1970
and nonsimple domain, respectively. Much of the confusion
in present terminology is due to failure to distinguish be-
tween type and instance (as in “record”) and between
components of a user model of the data on the one hand
and their machine representation counterparts on the
other hand (again, we cite “record” as an example).
1.4. NORMAL FORM
A relation whose domains are all simple can be repre-
sented in storage by a two-dimensional column-homo-
geneous array of the kind discussed above. Some more
complicated data structure is necessary for a relation with
one or more nonsimple domains. For this reason (and others
to be cited below) the possibility of eliminating nonsimple
domains appears worth investigating.4 There is, in fact, a
very simple elimination procedure, which we shall call
normalization.
Consider, for example, the collection of relations ex-
hibited in Figure 3 (a). Job history and children are non-
simple domains of the relation employee. Salary history is a
nonsimple domain of the relation job history. The tree in
Figure 3 (a) shows just these interrelationships of the non-
simple domains.
employee
  |-- jobhistory
  |     |-- salaryhistory
  |-- children
employee (man#, name, birthdate, jobhistory, children)
jobhistory (jobdate, title, salaryhistory)
salaryhistory (salarydate, salary)
children (childname, birthyear)
FIG. 3(a). Unnormalized set
employee’ (man#, name, birthdate)
jobhistory’ (man#, jobdate, title)
salaryhistory’ (man#, jobdate, salarydate, salary)
children’ (man#, childname, birthyear)
FIG. 3(b). Normalized set
Normalization proceeds as follows. Starting with the re-
lation at the top of the tree, take its primary key and ex-
pand each of the immediately subordinate relations by
inserting this primary key domain or domain combination.
The primary key of each expanded relation consists of the
primary key before expansion augmented by the primary
key copied down from the parent relation. Now, strike out
from the parent relation all nonsimple domains, remove the
top node of the tree, and repeat the same sequence of
operations on each remaining subtree.
The result of normalizing the collection of relations in
Figure 3 (a) is the collection in Figure 3 (b). The primary
key of each relation is italicized to show how such keys
are expanded by the normalization.
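The normalization procedure can be sketched in Python as follows. This is only an illustrative sketch, not part of the original paper: the nested-dictionary encoding, the function name `normalize`, the `keys` mapping, and the sample data are all assumptions. Each nonsimple domain is assumed to be stored as a list of dictionaries, and the primary key of each relation before expansion is assumed to be supplied by the caller.

```python
def normalize(name, rows, keys, parent_key=None, out=None):
    """Flatten a tree of nested relations into relations on simple domains.

    rows: list of dicts; a value that is a list of dicts is a nonsimple domain.
    keys: relation name -> tuple of its own primary-key domains (assumed given).
    parent_key: primary-key values copied down from the parent relation.
    """
    out = {} if out is None else out
    parent_key = parent_key or {}
    for row in rows:
        # Strike out the nonsimple domains and prepend the parent's key.
        simple = {d: v for d, v in row.items() if not isinstance(v, list)}
        out.setdefault(name, []).append({**parent_key, **simple})
        # Expanded primary key = parent's key + this relation's own key.
        key = {**parent_key, **{d: row[d] for d in keys[name]}}
        for d, v in row.items():
            if isinstance(v, list):
                normalize(d, v, keys, key, out)
    return out


# Hypothetical sample data shaped like the unnormalized set of Figure 3(a).
employee = [{
    "man#": 1, "name": "Jones", "birthdate": "1930-01-01",
    "jobhistory": [{
        "jobdate": "1960", "title": "clerk",
        "salaryhistory": [{"salarydate": "1960", "salary": 4000},
                          {"salarydate": "1961", "salary": 4500}],
    }],
    "children": [{"childname": "Ann", "birthyear": 1955}],
}]
keys = {"employee": ("man#",), "jobhistory": ("jobdate",),
        "salaryhistory": ("salarydate",), "children": ("childname",)}
flat = normalize("employee", employee, keys)
```

With this encoding, `flat["salaryhistory"]` carries the domains man#, jobdate, salarydate, salary, matching the shape of salaryhistory' in Figure 3(b).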
4 M. E. Sanko of IBM, San Jose, independently recognized the
desirability of eliminating nonsimple domains.
If normalization as described above is to be applicable,
the unnormalized collection of relations must satisfy the
following conditions :
(1) The graph of interrelationships of the nonsimple
domains is a collection of trees.
(2) No primary key has a component domain which is
nonsimple.
The writer knows of no application which would require
any relaxation of these conditions. Further operations of a
normalizing kind are possible. These are not discussed in
this paper.
The simplicity of the array representation which becomes
feasible when all relations are cast in normal form is not
only an advantage for storage purposes but also for com-
munication of bulk data between systems which use widely
different representations of the data. The communication
form would be a suitably compressed version of the array
representation and would have the following advantages:
(1) It would be devoid of pointers (address-valued or
displacement-valued).
(2) It would avoid all dependence on hash addressing
schemes.
(3) It would contain no indices or ordering lists.
If the user’s relational model is set up in normal form,
names of items of data in the data bank can take a simpler
form than would otherwise be the case. A general name
would take a form such as
R (g).r.d
where R is a relational name; g is a generation identifier
(optional); r is a role name (optional); d is a domain name.
Since g is needed only when several generations of a given
relation exist, or are anticipated to exist, and r is needed
only when the relation R has two or more domains named
d, the simple form R.d will often be adequate.
1.5. SOME LINGUISTIC ASPECTS
The adoption of a relational model of data, as described
above, permits the development of a universal data sublanguage
based on an applied predicate calculus. A first-order
predicate calculus suffices if the collection of relations
is in normal form. Such a language would provide a yardstick
of linguistic power for all other proposed data languages,
and would itself be a strong candidate for embedding
(with appropriate syntactic modification) in a variety
of host languages (programming, command- or problem-oriented).
While it is not the purpose of this paper to
describe such a language in detail, its salient features
would be as follows.
Let us denote the data sublanguage by R and the host
language by H. R permits the declaration of relations and
their domains. Each declaration of a relation identifies the
primary key for that relation. Declared relations are added
to the system catalog for use by any members of the user
community who have appropriate authorization. H per-
mits supporting declarations which indicate, perhaps less
permanently, how these relations are represented in storage.
R permits the specification for retrieval of any subset
of data from the data bank. Action on such a retrieval re-
quest is subject to security constraints.
The universality of the data sublanguage lies in its
descriptive ability (not its computing ability). In a large
data bank each subset of the data has a very large number
of possible (and sensible) descriptions, even when we as-
sume (as we do) that there is only a finite set of function
subroutines to which the system has access for use in
qualifying data for retrieval. Thus, the class of qualification
expressions which can be used in a set specification must
have the descriptive power of the class of well-formed
formulas of an applied predicate calculus. It is well known
that to preserve this descriptive power it is unnecessary to
express (in whatever syntax is chosen) every formula of
the selected predicate calculus. For example, just those in
prenex normal form are adequate [9].
Arithmetic functions may be needed in the qualification
or other parts of retrieval statements. Such functions can
be defined in H and invoked in R.
A set so specified may be fetched for query purposes
only, or it may be held for possible changes. Insertions take
the form of adding new elements to declared relations with-
out regard to any ordering that may be present in their
machine representation. Deletions which are effective for
the community (as opposed to the individual user or sub-
communities) take the form of removing elements from de-
clared relations. Some deletions and updates may be trig-
gered by others, if deletion and update dependencies be-
tween specified relations are declared in R.
One important effect that the view adopted toward data
has on the language used to retrieve it is in the naming of
data elements and sets. Some aspects of this have been dis-
cussed in the previous section. With the usual network
view, users will often be burdened with coining and using
more relation names than are absolutely necessary, since
names are associated with paths (or path types) rather
than with relations.
Once a user is aware that a certain relation is stored, he
will expect to be able to exploit5 it using any combination
of its arguments as “knowns” and the remaining argu-
ments as “unknowns,” because the information (like
Everest) is there. This is a system feature (missing from
many current information systems) which we shall call
(logically) symmetric exploitation of relations. Naturally,
symmetry in performance is not to be expected.
To support symmetric exploitation of a single binary re-
lation, two directed paths are needed. For a relation of de-
gree n, the number of paths to be named and controlled is
n factorial.
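Symmetric exploitation can be illustrated with a small Python sketch. The function name `exploit`, the use of `None` to mark an "unknown", and the sample rows (assumed here to resemble the supply relation of Figure 1) are assumptions made for illustration; the sketch says nothing about how a real system would achieve acceptable performance.

```python
from math import factorial


def exploit(relation, pattern):
    """Return the rows matching `pattern`; None marks an unknown argument."""
    return {row for row in relation
            if all(p is None or p == v for p, v in zip(pattern, row))}


# Hypothetical rows: (supplier, part, project, quantity).
supply = {(1, 2, 5, 17), (1, 3, 5, 23), (2, 3, 7, 9), (2, 7, 5, 4), (4, 1, 1, 12)}

# Any combination of arguments may serve as the "knowns".
by_project = exploit(supply, (None, None, 5, None))
by_supplier_part = exploit(supply, (2, 3, None, None))

# Number of directed paths to name and control for a relation of degree 4.
paths_needed = factorial(4)
```

The same relation answers both queries without any path being named in advance, which is the point of the logical symmetry.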
Again, if a relational view is adopted in which every n-
ary relation (n > 2) has to be expressed by the user as a
nested expression involving only binary relations (see
Feldman’s LEAP System [10], for example) then 2n - 1
names have to be coined instead of only n + 1 with direct
n-ary notation as described in Section 1.2. For example, the
5 Exploiting a relation includes query, update, and delete.
4-ary relation supply of Figure 1, which entails 5 names in
n-ary notation, would be represented in the form
P (supplier, Q (part, R (project, quantity)))
in nested binary notation and, thus, employ 7 names.
A further disadvantage of this kind of expression is its
asymmetry. Although this asymmetry does not prohibit
symmetric exploitation, it certainly makes some bases of
interrogation very awkward for the user to express (con-
sider, for example, a query for those parts and quantities
related to certain given projects via Q and R).
1.6. EXPRESSIBLE, NAMED, AND STORED RELATIONS
Associated with a data bank are two collections of rela-
tions: the named set and the expressible set. The named set
is the collection of all those relations that the community of
users can identify by means of a simple name (or identifier).
A relation R acquires membership in the named set when a
suitably authorized user declares R; it loses membership
when a suitably authorized user cancels the declaration of
R.
The expressible set is the total collection of relations that
can be designated by expressions in the data language. Such
expressions are constructed from simple names of relations
in the named set; names of generations, roles and domains;
logical connectives; the quantifiers of the predicate calculus;6
and certain constant relation symbols such as =, >.
The named set is a subset of the expressible set, usually a
very small subset.
Since some relations in the named set may be time-inde-
pendent combinations of others in that set, it is useful to
consider associating with the named set a collection of
statements that define these time-independent constraints.
We shall postpone further discussion of this until we have
introduced several operations on relations (see Section 2).
One of the major problems confronting the designer of a
data system which is to support a relational model for its
users is that of determining the class of stored representa-
tions to be supported. Ideally, the variety of permitted
data representations should be just adequate to cover the
spectrum of performance requirements of the total col-
lection of installations. Too great a variety leads to un-
necessary overhead in storage and continual reinterpreta-
tion of descriptions for the structures currently in effect.
For any selected class of stored representations the data
system must provide a means of translating user requests
expressed in the data language of the relational model into
corresponding (and efficient) actions on the current
stored representation. For a high level data language this
presents a challenging design problem. Nevertheless, it is a
problem which must be solved: as more users obtain concurrent
access to a large data bank, responsibility for providing
efficient response and throughput shifts from the
individual user to the data system.
6 Because each relation in a practical data bank is a finite set at
every instant of time, the existential and universal quantifiers
can be expressed in terms of a function that counts the number
of elements in any finite set.
2. Redundancy and Consistency
2.1. OPERATIONS ON RELATIONS
Since relations are sets, all of the usual set operations are
applicable to them. Nevertheless, the result may not be a
relation; for example, the union of a binary relation and a
ternary relation is not a relation.
The operations discussed below are specifically for rela-
tions. These operations are introduced because of their key
role in deriving relations from other relations. Their
principal application is in noninferential information systems
(systems which do not provide logical inference
services), although their applicability is not necessarily
destroyed when such services are added.
Most users would not be directly concerned with these
operations. Information systems designers and people con-
cerned with data bank control should, however, be thor-
oughly familiar with them.
2.1.1. Permutation. A binary relation has an array
representation with two columns. Interchanging these col-
umns yields the converse relation. More generally, if a
permutation is applied to the columns of an n-ary relation,
the resulting relation is said to be a permutation of the
given relation. There are, for example, 4! = 24 permuta-
tions of the relation supply in Figure 1, if we include the
identity permutation which leaves the ordering of columns
unchanged.
Since the user’s relational model consists of a collection
of relationships (domain-unordered relations), permuta-
tion is not relevant to such a model considered in isolation.
It is, however, relevant to the consideration of stored
representations of the model. In a system which provides
symmetric exploitation of relations, the set of queries
answerable by a stored relation is identical to the set
answerable by any permutation of that relation. Although
it is logically unnecessary to store both a relation and some
permutation of it, performance considerations could make
it advisable.
2.1.2. Projection. Suppose now we select certain col-
umns of a relation (striking out the others) and then re-
move from the resulting array any duplication in the rows.
The final array represents a relation which is said to be a
projection of the given relation.
A selection operator $\pi$ is used to obtain any desired
permutation, projection, or combination of the two operations.
Thus, if L is a list of k indices7 $L = i_1, i_2, \ldots, i_k$
and R is an n-ary relation ($n \geq k$), then $\pi_L(R)$ is the k-ary
relation whose jth column is column $i_j$ of R ($j = 1, 2, \ldots, k$)
except that duplication in resulting rows is removed. Consider
the relation supply of Figure 1. A permuted projection
of this relation is exhibited in Figure 4. Note that, in this
particular case, the projection has fewer n-tuples than the
relation from which it is derived.
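A minimal Python sketch of the operator $\pi_L$ follows. Representing a relation as a set of tuples, the function name `project`, and the sample rows (assumed to resemble the supply relation of Figure 1, with illustrative quantities) are assumptions, not the paper's notation.

```python
def project(relation, indices):
    """pi_L(R): select columns by the 1-based index list L; a set drops duplicate rows."""
    return {tuple(row[i - 1] for i in indices) for row in relation}


# Hypothetical rows: (supplier, part, project, quantity).
supply = {(1, 2, 5, 17), (1, 3, 5, 23), (2, 3, 7, 9), (2, 7, 5, 4), (4, 1, 1, 12)}

# Permuted projection pi_31: (project, supplier).
project_supplier = project(supply, [3, 1])
```

Because two of the sample rows share project 5 and supplier 1, the projection has four tuples where the original relation has five.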
2.1.3. Join. Suppose we are given two binary rela-
tions, which have some domain in common. Under what
circumstances can we combine these relations to form a
7 When dealing with relationships, we use domain names (role-
qualified whenever necessary) instead of domain positions.
ternary relation which preserves all of the information in
the given relations?
The example in Figure 5 shows two relations R, S, which
are joinable without loss of information, while Figure 6
shows a join of R with S. A binary relation R is joinable
with a binary relation S if there exists a ternary relation U
such that $\pi_{12}(U) = R$ and $\pi_{23}(U) = S$. Any such ternary
relation is called a join of R with S. If R, S are binary relations
such that $\pi_2(R) = \pi_1(S)$, then R is joinable with S.
One join that always exists in such a case is the natural
join of R with S defined by
$R*S = \{(a, b, c) : R(a, b) \wedge S(b, c)\}$
where R (a, b) has the value true if (a, b) is a member of R
and similarly for S(b, c). It is immediate that
$\pi_{12}(R*S) = R$
and
$\pi_{23}(R*S) = S$.
Note that the join shown in Figure 6 is the natural join
of R with S from Figure 5. Another join is shown in Figure
7.
$\pi_{31}$(supply)  (project  supplier)
                   5        1
                   5        2
                   1        4
                   7        2
FIG. 4. A permuted projection of the relation in Figure 1
R (supplier  part)      S (part  project)
   1         1             1     1
   2         1             1     2
   2         2             2     1
FIG. 5. Two joinable relations
R*S (supplier part project)
1 1 1
1 1 2
2 1 1
2 1 2
2 2 1
FIG. 6. The natural join of R with S (from Figure 5)
U (supplier part project)
1 1 2
2 1 1
2 2 1
FIG. 7. Another join of R with S (from Figure 5)
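The definitions of join and natural join can be checked mechanically. The following Python sketch uses the relations of Figures 5 and 7; the set-of-tuples representation and the function names `natural_join` and `is_join` are assumptions introduced for illustration.

```python
def project(relation, indices):
    """pi_L(R) with 1-based column indices; duplicates vanish in the set."""
    return {tuple(row[i - 1] for i in indices) for row in relation}


def natural_join(r, s):
    """R*S = {(a, b, c) : R(a, b) and S(b, c)} for binary relations."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def is_join(u, r, s):
    """U is a join of R with S when pi_12(U) = R and pi_23(U) = S."""
    return project(u, [1, 2]) == r and project(u, [2, 3]) == s


R = {(1, 1), (2, 1), (2, 2)}            # (supplier, part), Figure 5
S = {(1, 1), (1, 2), (2, 1)}            # (part, project), Figure 5
U = {(1, 1, 2), (2, 1, 1), (2, 2, 1)}   # Figure 7

natural = natural_join(R, S)
```

Both `natural` and `U` pass `is_join`, although they are different relations, so the joins of R with S are not unique.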
Inspection of these relations reveals an element (ele-
ment 1) of the domain part (the domain on which the join
is to be made) with the property that it possesses more
than one relative under R and also under S. It is this element
which gives rise to the plurality of joins. Such an element
in the joining domain is called a point of ambiguity
with respect to the joining of R with S.
If either $\pi_{21}(R)$ or S is a function,8 no point of ambiguity
can occur in joining R with S. In such a case, the natural
join of R with S is the only join of R with S. Note that the
reiterated qualification “of R with S” is necessary, because
S might be joinable with R (as well as R with S), and this
join would be an entirely separate consideration. In Figure
5, none of the relations R, $\pi_{21}(R)$, S, $\pi_{21}(S)$ is a function.
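Points of ambiguity can be located directly from the two relations. The sketch below is an assumption-laden illustration: the helper names `relatives`, `converse`, `is_function`, and `points_of_ambiguity` are invented here, and "function" is taken in the sense of footnote 8 (one-one or many-one, not one-many).

```python
from collections import defaultdict


def relatives(pairs):
    """Map each first component to the set of its relatives (second components)."""
    rel = defaultdict(set)
    for a, b in pairs:
        rel[a].add(b)
    return rel


def converse(pairs):
    """pi_21 of a binary relation."""
    return {(b, a) for a, b in pairs}


def is_function(pairs):
    """True when no first component has more than one relative."""
    return all(len(v) == 1 for v in relatives(pairs).values())


def points_of_ambiguity(r, s):
    """Joining-domain elements with more than one relative under R and under S."""
    under_r = relatives(converse(r))
    under_s = relatives(s)
    return {b for b in under_r
            if len(under_r[b]) > 1 and len(under_s.get(b, ())) > 1}


R = {(1, 1), (2, 1), (2, 2)}   # (supplier, part), Figure 5
S = {(1, 1), (1, 2), (2, 1)}   # (part, project), Figure 5
ambiguous = points_of_ambiguity(R, S)
```

For the relations of Figure 5 the only ambiguous element is part 1. When the converse of R is a function, the set is empty and the natural join is the only join.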
Ambiguity in the joining of R with S can sometimes be
resolved by means of other relations. Suppose we are given,
or can derive from sources independent of R and S, a relation
T on the domains project and supplier with the following
properties:
(1) $\pi_1(T) = \pi_2(S)$,
(2) $\pi_2(T) = \pi_1(R)$,
(3) $T(j, s) \rightarrow \exists p\,(R(s, p) \wedge S(p, j))$,
(4) $R(s, p) \rightarrow \exists j\,(S(p, j) \wedge T(j, s))$,
(5) $S(p, j) \rightarrow \exists s\,(T(j, s) \wedge R(s, p))$,
then we may form a three-way join of R, S, T; that is, a
ternary relation U such that
$\pi_{12}(U) = R$, $\pi_{23}(U) = S$, $\pi_{31}(U) = T$.
Such a join will be called a cyclic 3-join to distinguish it
from a linear 3-join which would be a quaternary relation
V such that
$\pi_{12}(V) = R$, $\pi_{23}(V) = S$, $\pi_{34}(V) = T$.
While it is possible for more than one cyclic 3-join to exist
(see Figures 8,9, for an example), the circumstances under
which this can occur entail much more severe constraints
R (s  p)      S (p  j)      T (j  s)
   1  a         a  d         d  1
   2  a         a  e         d  2
   2  b         b  d         e  2
                b  e         e  2
FIG. 8. Binary relations with a plurality of cyclic 3-joins
U (s  p  j)      U' (s  p  j)
   1  a  d          1  a  d
   2  a  e          2  a  d
   2  b  d          2  a  e
   2  b  e          2  b  d
                    2  b  e
FIG. 9. Two cyclic 3-joins of the relations in Figure 8
than those for a plurality of 2-joins. To be specific, the re-
lations R, S, T must possess points of ambiguity with
respect to joining R with S (say point x), S with T (say
8 A function is a binary relation, which is one-one or many-one,
but not one-many.
y), and T with R (say z), and, furthermore, y must be a
relative of x under S, z a relative of y under T, and x a
relative of z under R. Note that in Figure 8 the points
x = a; y = d; z = 2 have this property.
The natural linear 3-join of three binary relations R, S,
T is given by
$R*S*T = \{(a, b, c, d) : R(a, b) \wedge S(b, c) \wedge T(c, d)\}$
where parentheses are not needed on the left-hand side because
the natural 2-join (*) is associative. To obtain the
cyclic counterpart, we introduce the operator $\gamma$ which produces
a relation of degree n - 1 from a relation of degree n
by tying its ends together. Thus, if R is an n-ary relation
($n \geq 2$), the tie of R is defined by the equation
$\gamma(R) = \{(a_1, a_2, \ldots, a_{n-1}) : R(a_1, a_2, \ldots, a_{n-1}, a_n) \wedge a_1 = a_n\}$.
We may now represent the natural cyclic 3-join of R, S, T
by the expression
$\gamma(R*S*T)$.
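The tie operator and the natural cyclic 3-join can be sketched in Python as follows. The function names `natural_linear_3join` and `tie` are assumptions, and the sample relations are taken to be those of Figure 8 as far as they can be read from the figure.

```python
def natural_linear_3join(r, s, t):
    """R*S*T = {(a, b, c, d) : R(a, b) and S(b, c) and T(c, d)}."""
    return {(a, b, c, d)
            for (a, b) in r
            for (b2, c) in s if b == b2
            for (c2, d) in t if c == c2}


def tie(relation):
    """gamma(R): keep rows whose first and last components agree, dropping the last."""
    return {row[:-1] for row in relation if row[0] == row[-1]}


R = {(1, "a"), (2, "a"), (2, "b")}                       # (s, p)
S = {("a", "d"), ("a", "e"), ("b", "d"), ("b", "e")}     # (p, j)
T = {("d", 1), ("d", 2), ("e", 2)}                       # (j, s)

natural_cyclic = tie(natural_linear_3join(R, S, T))
```

On these relations the result has five rows and coincides with the larger of the two cyclic 3-joins shown in Figure 9.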
Extension of the notions of linear and cyclic 3-join and
their natural counterparts to the joining of n binary relations
(where $n \geq 3$) is obvious. A few words may be appropriate,
however, regarding the joining of relations which
are not necessarily binary. Consider the case of two rela-
tions R (degree r ), S (degree s) which are to be joined on
p of their domains (p < r, p < s). For simplicity, suppose
these p domains are the last p of the r domains of R,
and the first p of the s domains of S. If this were not so, we
could always apply appropriate permutations to make it
so. Now, take the Cartesian product of the first r-p do-
mains of R, and call this new domain A. Take the Car-
tesian product of the last p domains of R, and call this B.
Take the Cartesian product of the last s-p domains of S
and call this C.
We can treat R as if it were a binary relation on the
domains A, B. Similarly, we can treat S as if it were a bi-
nary relation on the domains B, C. The notions of linear
and cyclic 3-join are now directly applicable. A similar approach
can be taken with the linear and cyclic n-joins of n
relations of assorted degrees.
2.1.4. Composition. The reader is probably familiar
with the notion of composition applied to functions. We
shall discuss a generalization of that concept and apply it
first to binary relations. Our definitions of composition
and composability are based very directly on the definitions
of join and joinability given above.
Suppose we are given two relations R, S. T is a composition
of R with S if there exists a join U of R with S such
that $T = \pi_{13}(U)$. Thus, two relations are composable if
and only if they are joinable. However, the existence of
more than one join of R with S does not imply the existence
of more than one composition of R with S.
Corresponding to the natural join of R with S is the
natural composition9 of R with S defined by
$R \cdot S = \pi_{13}(R*S)$.
Taking the relations R, S from Figure 5, their natural com-
position is exhibited in Figure 10 and another composition
is exhibited in Figure 11 (derived from the join exhibited
in Figure 7).
R·S (project supplier)
1 1
1 2
2 1
2 2
FIG. 10. The natural composition of R with S (from Figure 5)
T (project supplier)
1 2
2 1
FIG. 11. Another composition of R with S (from Figure 5)
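Composition as the projection $\pi_{13}$ of a join can be sketched as below, using the relations of Figure 5 and the join of Figure 7. The function names `natural_join` and `compose_via` are assumptions made for this illustration.

```python
def natural_join(r, s):
    """R*S for binary relations sharing their middle domain."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def compose_via(join):
    """pi_13(U): the composition determined by a particular join U."""
    return {(a, c) for (a, _b, c) in join}


R = {(1, 1), (2, 1), (2, 2)}            # (supplier, part), Figure 5
S = {(1, 1), (1, 2), (2, 1)}            # (part, project), Figure 5
U = {(1, 1, 2), (2, 1, 1), (2, 2, 1)}   # the join of Figure 7

natural_composition = compose_via(natural_join(R, S))
other_composition = compose_via(U)
```

The two results differ, which shows that a pair of relations with more than one join may also have more than one composition.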
When two or more joins exist, the number of distinct
compositions may be as few as one or as many as the num-
ber of distinct joins. Figure 12 shows an example of two
relations which have several joins but only one composition.
Note that the ambiguity of point c is lost in composing R
with S, because of unambiguous associations made via the
points a, b, d, e.
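R (supplier part) S (part project)
[Table entries largely illegible in the source extraction. Recoverable rows: R contains (1, c) and (2, e); S contains (c, f) and (e, f).]
FIG. 12. Many joins, only one composition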
Extension of composition to pairs of relations which are
not necessarily binary (and which may be of different de-
grees) follows the same pattern as extension of pairwise
joining to such relations.
A lack of understanding of relational composition has led
several systems designers into what may be called the
connection trap. This trap may be described in terms of the
following example. Suppose each supplier description is
linked by pointers to the descriptions of each part supplied
by that supplier, and each part description is similarly
linked to the descriptions of each project which uses that
part. A conclusion is now drawn which is, in general, er-
roneous: namely that, if all possible paths are followed from
a given supplier via the parts he supplies to the projects
using those parts, one will obtain a valid set of all projects
supplied by that supplier. Such a conclusion is correct
only in the very special case that the target relation between
projects and suppliers is, in fact, the natural composition
of the other two relations, and we must normally
add the phrase “for all time,” because this is usually im-
plied in claims concerning path-following techniques.
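The connection trap can be made concrete with a short Python sketch. Path following from a supplier through parts to projects computes exactly the natural composition; whether that equals the real supplier-project relation is a separate fact about the data. The function name `follow_paths` and the relation `actual` (a hypothetical "true" relation, chosen here to equal the composition of Figure 11) are assumptions.

```python
def follow_paths(supplier, supplies, uses):
    """Projects reached from `supplier` by following supplier->part->project links."""
    parts = {p for (s, p) in supplies if s == supplier}
    return {j for (p, j) in uses if p in parts}


supplies = {(1, 1), (2, 1), (2, 2)}   # (supplier, part), Figure 5
uses = {(1, 1), (1, 2), (2, 1)}       # (part, project), Figure 5
actual = {(1, 2), (2, 1)}             # hypothetical true (supplier, project) relation

reached = {(s, j)
           for s in {s for (s, _p) in supplies}
           for j in follow_paths(s, supplies, uses)}
spurious = reached - actual
```

Here path following reports two supplier-project pairs that the assumed true relation does not contain, which is precisely the erroneous conclusion described above.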
9 Other writers tend to ignore compositions other than the natural
one, and accordingly refer to this particular composition as
the composition; see, for example, Kelley’s “General Topology.”
2.1.5. Restriction. A subset of a relation is a relation.
One way in which a relation S may act on a relation R to
generate a subset of R is through the operation restriction
of R by S. This operation is a generalization of the restric-
tion of a function to a subset of its domain, and is defined
as follows.
Let L, M be equal-length lists of indices such that
$L = i_1, i_2, \ldots, i_k$, $M = j_1, j_2, \ldots, j_k$ where $k \leq$ degree
of R and $k \leq$ degree of S. Then the L, M restriction of R by
S denoted $R_{L}|_{M}S$ is the maximal subset R' of R such that
$\pi_L(R') = \pi_M(S)$.
The operation is defined only if equality is applicable between
elements of $\pi_{i_h}(R)$ on the one hand and $\pi_{j_h}(S)$ on
the other for all h = 1, 2, ..., k.
The three relations R, S, R' of Figure 13 satisfy the equation
$R' = R_{(2,3)}|_{(1,2)}S$.
R (s  p  j)      S (p  j)      R' (s  p  j)
   1  a  A          a  A           1  a  A
   2  a  A          c  B           2  a  A
   2  a  B          b  B           2  b  B
   2  b  A
   2  b  B
FIG. 13. Example of restriction
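Restriction can be sketched in Python as shown below. As an interpretive assumption, the sketch keeps every row of R whose L-columns occur among the M-columns of S; this reading reproduces Figure 13, where one row of S matches nothing in R. The function names `project` and `restrict` are invented for the illustration.

```python
def project(relation, indices):
    """pi_L(R) with 1-based column indices."""
    return {tuple(row[i - 1] for i in indices) for row in relation}


def restrict(r, s, L, M):
    """Rows of R whose L-columns appear in pi_M(S) (assumed reading of R L|M S)."""
    allowed = project(s, M)
    return {row for row in r if tuple(row[i - 1] for i in L) in allowed}


R = {(1, "a", "A"), (2, "a", "A"), (2, "a", "B"), (2, "b", "A"), (2, "b", "B")}
S = {("a", "A"), ("c", "B"), ("b", "B")}

R_prime = restrict(R, S, [2, 3], [1, 2])
```

The result is a subset of R, in keeping with the opening remark that a subset of a relation is a relation.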
We are now in a position to consider various applications
of these operations on relations.
2.2. REDUNDANCY
Redundancy in the named set of relations must be dis-
tinguished from redundancy in the stored set of representa-
tions. We are primarily concerned here with the former.
To begin with, we need a precise notion of derivability for
relations.
Suppose $\theta$ is a collection of operations on relations and
each operation has the property that from its operands it
yields a unique relation (thus natural join is eligible, but
join is not). A relation R is $\theta$-derivable from a set S of relations
if there exists a sequence of operations from the collection
$\theta$ which, for all time, yields R from members of S.
The phrase “for all time” is present, because we are dealing
with time-varying relations, and our interest is in derivability
which holds over a significant period of time. For the
named set of relationships in noninferential systems, it appears
that an adequate collection $\theta_1$ contains the following
operations: projection, natural join, tie, and restriction.
Permutation is irrelevant and natural composition need
not be included, because it is obtainable by taking a natural
join and then a projection. For the stored set of representations,
an adequate collection $\theta_2$ of operations would include
permutation and additional operations concerned with sub-
setting and merging relations, and ordering and connecting
their elements.
2.2.1. Strong Redundancy. A set of relations is strongly
redundant if it contains at least one relation that possesses
a projection which is derivable from other projections of
relations in the set. The following two examples are in-
tended to explain why strong redundancy is defined this
way, and to demonstrate its practical use. In the first example
the collection of relations consists of just the following
relation:
employee (serial#, name, manager#, managername)
with serial# as the primary key and manager# as a foreign
key. Let us denote the active domain by $\Delta_t$, and suppose
that
$\Delta_t(\text{manager\#}) \subset \Delta_t(\text{serial\#})$
and
$\Delta_t(\text{managername}) \subset \Delta_t(\text{name})$
for all time t. In this case the redundancy is obvious: the
domain managername is unnecessary. To see that it is a
strong redundancy as defined above, we observe that
$\pi_{34}(\text{employee}) = \pi_{12}(\text{employee})_{1}|_{1}\,\pi_3(\text{employee})$.
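The strong redundancy in the employee relation can be verified on sample data. In the sketch below the rows are hypothetical (including a top-level employee recorded as managing himself, so that every manager# is also a serial#), and the helper names `project` and `restrict` are assumptions; restriction is read as keeping the rows of R whose L-columns occur in $\pi_M(S)$.

```python
def project(relation, indices):
    """pi_L(R) with 1-based column indices."""
    return {tuple(row[i - 1] for i in indices) for row in relation}


def restrict(r, s, L, M):
    """Rows of R whose L-columns appear in pi_M(S)."""
    allowed = project(s, M)
    return {row for row in r if tuple(row[i - 1] for i in L) in allowed}


# Hypothetical rows: (serial#, name, manager#, managername).
employee = {
    (1, "Adams", 1, "Adams"),
    (2, "Baker", 1, "Adams"),
    (3, "Clark", 2, "Baker"),
}

lhs = project(employee, [3, 4])
rhs = restrict(project(employee, [1, 2]), project(employee, [3]), [1], [1])
```

Both sides yield the (number, name) pairs of those employees who are managers, so the managername column carries no information that cannot be derived from the other three.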
In the second example the collection of relations includes a
relation S describing suppliers with primary key s#, a re-
lation D describing departments with primary key d#, a
relation J describing projects with primary key j#, and the
following relations:
P(s#, d#, ...), Q(s#, j#, ...), R(d#, j#, ...),
where in each case ... denotes domains other than s#, d#,
j#. Let us suppose the following condition C is known to
hold independent of time: supplier s supplies department
d (relation P) if and only if supplier s supplies some project
j (relation Q) to which d is assigned (relation R). Then, we
can write the equation
$\pi_{12}(P) = \pi_{12}(Q) \cdot \pi_{21}(R)$
and thereby exhibit a strong redundancy.
An important reason for the existence of strong re-
dundancies in the named set of relationships is user con-
venience. A particular case of this is the retention of semi-
obsolete relationships in the named set so that old pro-
grams that refer to them by name can continue to run cor-
rectly. Knowledge of the existence of strong redundancies
in the named set enables a system or data base adminis-
trator greater freedom in the selection of stored representa-
tions to cope more efficiently with current traffic. If the
strong redundancies in the named set are directly reflected
in strong redundancies in the stored set (or if other strong
redundancies are introduced into the stored set), then, gen-
erally speaking, extra storage space and update time are
consumed with a potential drop in query time for some
queries and in load on the central processing units.
2.2.2. Weak Redundancy. A second type of redun-
dancy may exist. In contrast to strong redundancy it is not
characterized by an equation. A collection of relations is
weakly redundant if it contains a relation that has a projec-
tion which is not derivable from other members but is at
all times a projection of some join of other projections of
relations in the collection.
We can exhibit a weak redundancy by taking the second
example (cited above) for a strong redundancy, and as-
suming now that condition C does not hold at all times.
The relations $\pi_{12}(P)$, $\pi_{12}(Q)$, $\pi_{12}(R)$ are complex10
relations
with the possibility of points of ambiguity occurring from
time to time in the potential joining of any two. Under
these circumstances, none of them is derivable from the
other two. However, constraints do exist between them,
since each is a projection of some cyclic join of the three of
them. One of the weak redundancies can be characterized
by the statement: for all time, $\pi_{12}(P)$ is some composition
of $\pi_{12}(Q)$ with $\pi_{21}(R)$. The composition in question might
be the natural one at some instant and a nonnatural one at
another instant.
Generally speaking, weak redundancies are inherent in
the logical needs of the community of users. They are not
removable by the system or data base administrator. If
they appear at all, they appear in both the named set and
the stored set of representations.
2.3. CONSISTENCY
Whenever the named set of relations is redundant in
either sense, we shall associate with that set a collection of
statements which define all of the redundancies which hold
independent of time between the member relations. If the
information system lacks (and it most probably will) detailed
semantic information about each named relation, it
cannot deduce the redundancies applicable to the named
set. It might, over a period of time, make attempts to
induce the redundancies, but such attempts would be fal-
lible.
Given a collection C of time-varying relations, an as-
sociated set Z of constraint statements and an instantaneous
value V for C, we shall call the state (C, Z, V) consistent
or inconsistent according as V does or does not satisfy Z.
For example, given stored relations R, S, T together with
the constraint statement “$\pi_{12}(T)$ is a composition of
$\pi_{12}(R)$ with $\pi_{12}(S)$”, we may check from time to time that
the values stored for R, S, T satisfy this constraint. An al-
gorithm for making this check would examine the first two
columns of each of R, S, T (in whatever way they are repre-
sented in the system) and determine whether
(1) $\pi_1(T) = \pi_1(R)$,
(2) $\pi_2(T) = \pi_2(S)$,
(3) for every element pair (a, c) in the relation $\pi_{12}(T)$
there is an element b such that (a, b) is in $\pi_{12}(R)$
and (b, c) is in $\pi_{12}(S)$.
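The three checks can be sketched as a single Python predicate. This is an illustrative sketch only: the function name `is_consistent` and the set-of-tuples representation are assumptions, and the sample data reuse the relations of Figure 5 together with the composition of Figure 11.

```python
def is_consistent(R, S, T):
    """Check that pi_12(T) is a composition of pi_12(R) with pi_12(S)."""
    r12 = {row[:2] for row in R}   # first two columns of each relation
    s12 = {row[:2] for row in S}
    t12 = {row[:2] for row in T}
    same_first = {a for a, _ in t12} == {a for a, _ in r12}      # condition (1)
    same_second = {c for _, c in t12} == {c for _, c in s12}     # condition (2)
    middles = {b for _, b in r12}
    linked = all(any((a, b) in r12 and (b, c) in s12 for b in middles)
                 for a, c in t12)                                # condition (3)
    return same_first and same_second and linked


R = {(1, 1), (2, 1), (2, 2)}
S = {(1, 1), (1, 2), (2, 1)}
T_ok = {(1, 2), (2, 1)}
T_bad = {(1, 2), (2, 1), (2, 3)}
```

`T_ok` satisfies the constraint, while `T_bad` fails condition (2) because its second column contains a value absent from S.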
There are practical problems (which we shall not discuss
here) in taking an instantaneous snapshot of a collection
of relations, some of which may be very large and highly
variable.
It is important to note that consistency as defined above
is a property of the instantaneous state of a data bank, and
is independent of how that state came about. Thus, in
particular, there is no distinction made on the basis of
whether a user generated an inconsistency due to an act of
omission or an act of commission. Examination of a simple
example will show the reasonableness of this (possibly
unconventional) approach to consistency.
10 A binary relation is complex if neither it nor its converse is a function.
Communications of the ACM, Volume 13, Number 6, June 1970
Suppose the named set C includes the relations S, J, D,
P, Q, R of the example in Section 2.2 and that P, Q, R
possess either the strong or weak redundancies described
therein (in the particular case now under consideration, it
does not matter which kind of redundancy occurs). Further,
suppose that at some time t the data bank state is consistent
and contains no project j such that supplier 2 supplies
project j and j is assigned to department 5. Accordingly,
there is no element (2, 5) in $\pi_{12}(P)$. Now, a user introduces
the element (2, 5) into $\pi_{12}(P)$ by inserting some appropriate
element into P. The data bank state is now inconsistent.
The inconsistency could have arisen from an act of omis-
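The insertion scenario above can be illustrated with a small sketch. It assumes, purely for illustration, that Q holds (supplier, project) pairs, R holds (project, department) pairs and P holds (supplier, department) pairs; this column order and the names `witnesses` and `is_consistent` are hypothetical. The check covers only the condition the example relies on, namely that every pair in P is backed by at least one project, which is required under either kind of redundancy.

```python
def witnesses(q, r, supplier, department):
    """Projects j such that (supplier, j) is in Q and (j, department) is in R."""
    return {j for s, j in q if s == supplier and (j, department) in r}


def is_consistent(p, q, r):
    """True if every (supplier, department) pair in P has a witnessing project."""
    return all(witnesses(q, r, s, d) for s, d in p)


# Illustrative state at time t: supplier 2 supplies no project assigned to department 5.
Q = {(2, "j1"), (3, "j2")}
R = {("j1", 4), ("j2", 5)}
P = {(2, 4), (3, 5)}
print(is_consistent(P, Q, R))

# A user inserts an element whose first two columns are (2, 5).
P_after = P | {(2, 5)}
print(is_consistent(P_after, Q, R))
```

The check takes only the current values of P, Q and R as input, so it cannot tell whether the offending pair should be withdrawn or whether matching entries are missing from Q and R.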

More Related Content

Similar to I.J. Information Technology and Computer Science, 2016, 12, 59.docx

TOP NEWSQL DATABASES AND FEATURES CLASSIFICATION
TOP NEWSQL DATABASES AND FEATURES CLASSIFICATIONTOP NEWSQL DATABASES AND FEATURES CLASSIFICATION
TOP NEWSQL DATABASES AND FEATURES CLASSIFICATIONijdms
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerijaprr
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsRackspace
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Sheena Crouch
 
Assignment_4
Assignment_4Assignment_4
Assignment_4Kirti J
 
Evaluation of graph databases
Evaluation of graph databasesEvaluation of graph databases
Evaluation of graph databasesijaia
 
A Comparative Study of NoSQL and Relational Database.pdf
A Comparative Study of NoSQL and Relational Database.pdfA Comparative Study of NoSQL and Relational Database.pdf
A Comparative Study of NoSQL and Relational Database.pdfJennifer Roman
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerijaprr_editor
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTHYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTIJCSEA Journal
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTHYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTIJCSEA Journal
 
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSDATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSijdms
 
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdfKrishnaShah908060
 
Representing Non-Relational Databases with Darwinian Networks
Representing Non-Relational Databases with Darwinian NetworksRepresenting Non-Relational Databases with Darwinian Networks
Representing Non-Relational Databases with Darwinian NetworksIJERA Editor
 
The Rise of Nosql Databases
The Rise of Nosql DatabasesThe Rise of Nosql Databases
The Rise of Nosql DatabasesJAMES NGONDO
 
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: C...
A META DATA VAULT APPROACH FOR  EVOLUTIONARY INTEGRATION OF BIG DATA SETS:  C...A META DATA VAULT APPROACH FOR  EVOLUTIONARY INTEGRATION OF BIG DATA SETS:  C...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: C...AIRCC Publishing Corporation
 
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...AIRCC Publishing Corporation
 
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...ijcsit
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 

Similar to I.J. Information Technology and Computer Science, 2016, 12, 59.docx (20)

TOP NEWSQL DATABASES AND FEATURES CLASSIFICATION
TOP NEWSQL DATABASES AND FEATURES CLASSIFICATIONTOP NEWSQL DATABASES AND FEATURES CLASSIFICATION
TOP NEWSQL DATABASES AND FEATURES CLASSIFICATION
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseer
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High Expectations
 
Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...Relational Databases For An Efficient Data Management And...
Relational Databases For An Efficient Data Management And...
 
Assignment_4
Assignment_4Assignment_4
Assignment_4
 
No sql database
No sql databaseNo sql database
No sql database
 
Evaluation of graph databases
Evaluation of graph databasesEvaluation of graph databases
Evaluation of graph databases
 
A Comparative Study of NoSQL and Relational Database.pdf
A Comparative Study of NoSQL and Relational Database.pdfA Comparative Study of NoSQL and Relational Database.pdf
A Comparative Study of NoSQL and Relational Database.pdf
 
Ijaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseerIjaprr vol1-2-6-9naseer
Ijaprr vol1-2-6-9naseer
 
Report 2.0.docx
Report 2.0.docxReport 2.0.docx
Report 2.0.docx
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTHYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENTHYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
 
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONSDATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
DATABASE SYSTEMS PERFORMANCE EVALUATION FOR IOT APPLICATIONS
 
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf
3170722_BDA_GTU_Study_Material_Presentations_Unit-3_29092021094744AM.pdf
 
Representing Non-Relational Databases with Darwinian Networks
Representing Non-Relational Databases with Darwinian NetworksRepresenting Non-Relational Databases with Darwinian Networks
Representing Non-Relational Databases with Darwinian Networks
 
The Rise of Nosql Databases
The Rise of Nosql DatabasesThe Rise of Nosql Databases
The Rise of Nosql Databases
 
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: C...
A META DATA VAULT APPROACH FOR  EVOLUTIONARY INTEGRATION OF BIG DATA SETS:  C...A META DATA VAULT APPROACH FOR  EVOLUTIONARY INTEGRATION OF BIG DATA SETS:  C...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: C...
 
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...
A Meta Data Vault Approach for Evolutionary Integration of Big Data Sets : Ca...
 
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...
A META DATA VAULT APPROACH FOR EVOLUTIONARY INTEGRATION OF BIG DATA SETS: CAS...
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 

More from wilcockiris

Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docx
Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docxBarbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docx
Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docxwilcockiris
 
BARGAIN CITY Your career is moving along faster than you e.docx
BARGAIN CITY Your career is moving along faster than you e.docxBARGAIN CITY Your career is moving along faster than you e.docx
BARGAIN CITY Your career is moving along faster than you e.docxwilcockiris
 
Barbara schedules a meeting with a core group of clinic  managers. T.docx
Barbara schedules a meeting with a core group of clinic  managers. T.docxBarbara schedules a meeting with a core group of clinic  managers. T.docx
Barbara schedules a meeting with a core group of clinic  managers. T.docxwilcockiris
 
Barbara schedules a meeting with a core group of clinic managers.docx
Barbara schedules a meeting with a core group of clinic managers.docxBarbara schedules a meeting with a core group of clinic managers.docx
Barbara schedules a meeting with a core group of clinic managers.docxwilcockiris
 
Barbara schedules a meeting with a core group of clinic managers. Th.docx
Barbara schedules a meeting with a core group of clinic managers. Th.docxBarbara schedules a meeting with a core group of clinic managers. Th.docx
Barbara schedules a meeting with a core group of clinic managers. Th.docxwilcockiris
 
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docx
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docxBarbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docx
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docxwilcockiris
 
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docx
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docxBARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docx
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docxwilcockiris
 
Banks 5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docx
Banks    5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docxBanks    5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docx
Banks 5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docxwilcockiris
 
Banking industry•Databases that storeocorporate sensiti.docx
Banking industry•Databases that storeocorporate sensiti.docxBanking industry•Databases that storeocorporate sensiti.docx
Banking industry•Databases that storeocorporate sensiti.docxwilcockiris
 
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docx
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docxBAOL 531 Managerial AccountingWeek Three Article Research Pape.docx
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docxwilcockiris
 
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docx
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docxbankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docx
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docxwilcockiris
 
Barbara and Judi entered into a contract with Linda, which provi.docx
Barbara and Judi entered into a contract with Linda, which provi.docxBarbara and Judi entered into a contract with Linda, which provi.docx
Barbara and Judi entered into a contract with Linda, which provi.docxwilcockiris
 
bappsum.indd 614 182014 30258 PMHuman Reso.docx
bappsum.indd   614 182014   30258 PMHuman Reso.docxbappsum.indd   614 182014   30258 PMHuman Reso.docx
bappsum.indd 614 182014 30258 PMHuman Reso.docxwilcockiris
 
Bank ReservesSuppose that the reserve ratio is .25, and that a b.docx
Bank ReservesSuppose that the reserve ratio is .25, and that a b.docxBank ReservesSuppose that the reserve ratio is .25, and that a b.docx
Bank ReservesSuppose that the reserve ratio is .25, and that a b.docxwilcockiris
 
Bank Services, Grading GuideFIN366 Version 21Individual.docx
Bank Services, Grading GuideFIN366 Version 21Individual.docxBank Services, Grading GuideFIN366 Version 21Individual.docx
Bank Services, Grading GuideFIN366 Version 21Individual.docxwilcockiris
 
Baldwins Kentucky Revised Statutes AnnotatedTitle XXXV. Domesti.docx
Baldwins Kentucky Revised Statutes AnnotatedTitle XXXV. Domesti.docxBaldwins Kentucky Revised Statutes AnnotatedTitle XXXV. Domesti.docx
Baldwins Kentucky Revised Statutes AnnotatedTitle XXXV. Domesti.docxwilcockiris
 
Bank confirmations are critical to the cash audit. What information .docx
Bank confirmations are critical to the cash audit. What information .docxBank confirmations are critical to the cash audit. What information .docx
Bank confirmations are critical to the cash audit. What information .docxwilcockiris
 
BalShtBalance SheetBalance SheetBalance SheetBalance SheetThe Fran.docx
BalShtBalance SheetBalance SheetBalance SheetBalance SheetThe Fran.docxBalShtBalance SheetBalance SheetBalance SheetBalance SheetThe Fran.docx
BalShtBalance SheetBalance SheetBalance SheetBalance SheetThe Fran.docxwilcockiris
 
BAM 515 - Organizational Behavior(Enter your answers on th.docx
BAM 515 - Organizational Behavior(Enter your answers on th.docxBAM 515 - Organizational Behavior(Enter your answers on th.docx
BAM 515 - Organizational Behavior(Enter your answers on th.docxwilcockiris
 
BalanchineGeorge Balanchine is an important figure in the histor.docx
BalanchineGeorge Balanchine is an important figure in the histor.docxBalanchineGeorge Balanchine is an important figure in the histor.docx
BalanchineGeorge Balanchine is an important figure in the histor.docxwilcockiris
 

More from wilcockiris (20)

Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docx
Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docxBarbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docx
Barbara Silva is the CIO for Peachtree Community Hospital in Atlanta.docx
 
BARGAIN CITY Your career is moving along faster than you e.docx
BARGAIN CITY Your career is moving along faster than you e.docxBARGAIN CITY Your career is moving along faster than you e.docx
BARGAIN CITY Your career is moving along faster than you e.docx
 
Barbara schedules a meeting with a core group of clinic  managers. T.docx
Barbara schedules a meeting with a core group of clinic  managers. T.docxBarbara schedules a meeting with a core group of clinic  managers. T.docx
Barbara schedules a meeting with a core group of clinic  managers. T.docx
 
Barbara schedules a meeting with a core group of clinic managers.docx
Barbara schedules a meeting with a core group of clinic managers.docxBarbara schedules a meeting with a core group of clinic managers.docx
Barbara schedules a meeting with a core group of clinic managers.docx
 
Barbara schedules a meeting with a core group of clinic managers. Th.docx
Barbara schedules a meeting with a core group of clinic managers. Th.docxBarbara schedules a meeting with a core group of clinic managers. Th.docx
Barbara schedules a meeting with a core group of clinic managers. Th.docx
 
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docx
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docxBarbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docx
Barbara Rosenwein, A Short History of the Middle Ages 4th edition (U.docx
 
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docx
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docxBARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docx
BARBARA NGAM, MPAShoreline, WA 98155 ▪ 801.317.5999 ▪ [email pro.docx
 
Banks 5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docx
Banks    5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docxBanks    5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docx
Banks 5Maya BanksProfessor Debra MartinEN106DLGU1A2018.docx
 
Banking industry•Databases that storeocorporate sensiti.docx
Banking industry•Databases that storeocorporate sensiti.docxBanking industry•Databases that storeocorporate sensiti.docx
Banking industry•Databases that storeocorporate sensiti.docx
 
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docx
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docxBAOL 531 Managerial AccountingWeek Three Article Research Pape.docx
BAOL 531 Managerial AccountingWeek Three Article Research Pape.docx
 
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docx
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docxbankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docx
bankCustomer1223333SmithJamesbbbbbb12345 Abrams Rd Dallas TX 75043.docx
 
Barbara and Judi entered into a contract with Linda, which provi.docx
I.J. Information Technology and Computer Science, 2016, 12, 59.docx

  • 1. I.J. Information Technology and Computer Science, 2016, 12, 59-66 Published Online December 2016 in MECS (http://www.mecs - press.org/) DOI: 10.5815/ijitcs.2016.12.07 Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 12, 59-66 SQL Versus NoSQL Movement with Big Data Analytics Sitalaks hmi Ve nkatraman School of Engineering, Construction and Design (IT), Melbourne Polytechnic, VIC 3072, Australia E-mail: [email protected] Kiran Fahd, Samue l Kas pi School of Engineering, Construction and Design (IT), Melbourne Polytechnic, VIC 3072, Australia E-mail: [email protected], [email protected] Ramanathan Ve nkatraman National University of Singapore, Singapore
  • 2. E-mail: [email protected] Abstract—Two ma in revolutions in data manage ment have occurred recently, name ly Big Data analytics and NoSQL databases. Even though they have evolved with diffe rent purposes, their independent developments comple ment each other and their convergence would benefit businesses tremendously in ma king real-t ime decisions using volumes of co mple x data sets that could be both structured and unstructured. While on one hand many software solutions have emerged in supporting Big Data analytics, on the other, many NoSQL database packages have arrived in the market. However, they lack an independent benchmarking and co mparat ive evaluation. The a im of this paper is to provide an understanding of their contexts and an in -depth study to compare the features of four ma in NoSQL data models that have evolved. The performance compa rison of traditional SQL with No SQL databases for Big Data
  • 3. analytics shows that NoSQL database poses to be a better option for business situations that require simplicity, adaptability, high performance analytics and distributed scalability of large data. This paper concludes that the NoSQL move ment should be leveraged for Big Data analytics and would coe xist with re lational (SQL) databases . Index Terms—Structured Query Language (SQL), Non SQL (NoSQL), Big Data, Big Data Analytics, Re lational Database, SQL Database, NoSQL Database. I. INT RODUCT ION As the technology environment transforms and faces new challenges, businesses increasingly realize the need to evaluate new approaches and databases to ma nage their data to support changing business requirements and growing co mple xity and e xpansion of their applicat ions [1]. Re lational database has been the default choice for
  • 4. data model adoption in businesses worldwide over the past thirty years with Structured Query Language (SQL) as the standard language designed to perform the basic data operations. However, with the e xplosion of data volume, SQL-based data querying lose effic iency, and in particular, managing larger databases has become a ma jor challenge [2]. In addit ion, re lational databases exh ibit a variety of limitations in meeting the recent Big Data analytics require ment in businesses. While clusters -based architecture has emerged as a solution for la rge databases, SQL is not designed to suit clusters and this mis match has led to think of alternate solutions. There are mis matches between persistent data model and in - me mo ry data structures, and servers based on SQL standards are now prone to me mory footprint, security risks and performance issues. NoSQL (Non SQL) databases with a set of new data manage ment features, on the other hand, are more
  • 5. fle xib le and horizontally scalable. They a re considered as alternatives to overcome the limitations of the current SQL-dominated persistence landscape and hence they are also known as non-relational databases [3]. The main goal for the NoSQL move ment is to allo w easy storage and retrieval of data, regardless of its structure and content, which is possible due to the non -existence of a rig id data structure in non-relat ional databases. NoSQL databases exhib it horizontal scalability by taking advantage of new clusters and several low -cost servers. In addition, they are envisaged to automatically manage data administration including fault recovery and these capabilit ies would result in huge cost savings. Though non-relational databases are providing different features and advantages, they were init ially characterised by lack of data consistency and non-ability to query stored records using SQL. With the e mergence of NoSQL databases new features and optimisation characteristics
  • 6. are evolving to overcome these limitations as well. However, their total capabilities are still not disclosed [4]. Also, due to the increasing differences in NoSQL database offerings and their non-standard features, businesses are not clear on what is the stand to take. In this paper, we first provide an overview of the 60 SQL Versus NoSQL M ovement with Big Data Analytics Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 12, 59-66 present context o f Big Data ana lytics and NoSQL databases. Ne xt, we discuss the four ma in data models of non-relational databases and compare them with SQL databases. There are a variety of No SQL databases and which one is more appropriate for wh ich business operation re ma ins an unanswered question so far. We compare the different data models of NoSQL in terms of their features and the NoSQL databases available in the
  • 7. ma rket that support those features. The different data man ipulation mechanis ms and optimisation techniques adopted by NoSQL databases could result in their diffe rence in performance. We discuss how these factors play a ma jor ro le in Big Data analytics and identify the associated challenges. We also consider the coexistence of NoSQL databases with re lational databases and discuss their relevance in different business contexts. II. RELAT ED WORK: T HE CONT EXT OF NOSQL DAT ABASES WIT H BIG DAT A ANALYT ICS Fro m the recent trends reported in literature [5][6], it is evident that in today's context, there is an e xponential growth of data volume that are structured as well as unstructured (Big Data) fro m a variety of data sources, such as social media, e -ma ils, te xt documents, GPS data, sensor data, surveillance data, etc. with increasing Internet usage. Hence, we can say that Big Data is characterised by structured, semi-structured, and
  • 8. unstructured data collected fro m digita l and non-digital resources. The ma in cha llenge is the effective use of this Big Data that represents the data source for effic ient decision-ma king by adopting suitable data min ing techniques [7][8]. Based on our literature survey, we have identified that the current challenges presented by Big Data are due to the following general characteristics e xperienced by businesses: ta Veloc ity – rapid ly and continuously updated data streams fro m d ifferent sources and locations. – structured, semi-structured and unstructured data storage. – huge nu mber of datasets with sizes of several terabytes or petabytes. – data organized in several different locations or data centres.
  • 9. It is important for businesses to perform Big Data analytics, which is the process of e xa min ing la rge data sets containing a variety of data types. Using Big Data Analytics, businesses are able to a rrive at more accurate analysis of huge a mounts of data to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information [2][9]. In order to support timely and effect ive decision ma king, Big Data analytics re lies on large volu mes of data that requires clusters for data storage. However, sinc e relational databases are not designed for clusters, and e xhibit performance issues with regard to Big Data analytics, businesses are considering the need for the NoSQL movement [10]. The schema of NoSQL is not fixed. It uses varied interfaces to store and analyse sheer volume of user- generated content, personal data and spatial data being generated by modern applications, clou d computing and
  • 10. smart devices. [1][11]. In this context, NoSQL database presents a preferred solution than SQL database primarily for its ability to cater to the horizontal partitioning of data, fle xib le data processing and improved performance. Large Internet companies (Facebook, Lin kedIn, A ma zon and Google), which cannot process services by using existing re lational databases, had researched and led to the advent of NoSQL to solve their proble m of dealing with continuously increasing data, optimised data utilizat ion and horizontal scalability of large data. No SQL databases are a better option for the information systems that require h igh performance and dynamic scalability more than the requirements of reliability, highly distributed nature of the three-tier Internet architecture systems and cloud computing [1][3][11]. There fore, it is necessary to investigate further and compare SQ L versus NoSQL as well as the salient differences in the performance of
  • 11. NoSQL data models in supporting the necessary features for Big Data analytics. This paper presents these investigations and findings in today's Big Data context. III. NOSQL DAT A MODELS There are many NoSQL databases available, however, they fall under four data models described below [3][11][12]. Each category has its own specific attributes but there are cross overs between the different data models. Generally, a ll NoSQL databases are built to be distributed and scaled horizontally. Key-Va lue Store Database – Key-Va lue store is a simp le but effic ient and powerful NoSQL database. The data is stored in two parts, a string that represents the key and the actual data that represents the value, thus creating a “key-value” pair. This results in values being indexed by keys for retrieval, a concept simila r to hash tables. In other words, the store allo ws the user to request the
  • 12. values according to the key specified. It can handle structured or unstructured data. It offers high concurrency and scalability as we ll as rap id lookups, but little consistency. Such Key-Value store databases can be used to develop forums and online shopping carts and websites where user sessions are required to be stored. So me notable exa mp les are A mazon‟s Dyan moDB, Apache‟s Cassandra, Azure Table Storage (AT S), Orac le Berke ley DB, and Basho Technologies‟ Ria k. A ma zon offers fu lly managed No SQL store service DynamoDB for the purpose of internet scale applications . It is a distributed key-value storage system wh ich provides fast, reliable and cost-effective data access and high availability and durability due to its replica feature. One of the advantages of Key-Value store database is its high insert/read rates compared to traditional SQL
  • 13. SQL Versus NoSQL M ovement with Big Data Analytics 61 Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 12, 59-66 database. This is achieved by saving more than one entry to the store as shown in the example below: @db.bulk_save([ {"hot" => "and spicy"}, {"cold" => "yet loving"}, {"other" => ["set","of","keys"]} ]) Colu mn Oriented (o r wide -colu mn) Store Databases – In colu mn store databases, columns are defined for each row instead of being predefined by the table structure having uniform sized co lu mns for each row. Such stores have a two-level aggregate structure, a key and a row aggregate, which is a group of columns. Any column can be added to any row, and rows can have very different columns. In other words, each row has diffe rent
  • 14. number of colu mns that are stored. It can also store data tables as sections of columns of data. Data can be vie wed as either row-oriented where each row is an aggregate, or column-o riented where each colu mn fa mily defines a record type. Each key is associated with one or more columns and a key for each colu mn fa mily is used for rapid data retrieval with less I/O activity thereby offering very high performance. These databases provide high scalability as they store data in highly distributed architectures. Wide-colu mn databases is ideal to be used for data mining and analytic applicat ions with Big Data. Exa mples of some colu mn-oriented store providers are Facebook‟s high-performance Cassandra, Apache Hbase, Google ‟s Big Table and HyperTable. Google ‟s Big Table is high performance wide-colu mn database that can deal with vast amount of data. It is developed on Google File System GFS using C/C++. It is used by multip le Google
  • 15. applications like YouTube and Gma il that have varied latency demand of the database. It is not distributed outside Google besides the usage inside Google's App Engine. Big Tab le is designed for easy scalability across thousands of machines, thus, it is tolerant to hardware failures. Document Store Databases – Document database e xtends the basic key-value database concept and stores comple x data in document form such as XML, PDF or JSON documents. A document store is typically schema- less where each document can contain different fie lds of any length. Documents are accessed or identified by using a unique key wh ich may be simple string, URI string or path string. Docu ment databases are more comple x databases but offer h igh performance, horizontal scalability and schema fle xib ility which a llo w storing virtually any structure required by any application. Document oriented databases are suitable for content
  • 16. manage ment systems and blog application s. So me e xa mples of providers using document oriented databases are 10gen‟s MongoDB, Apache CouchDB, Basho Technologies‟ Ria k, Azure 's Docu mentDB and AWS Dynamo DB. MongoDB is developed by 10gen using C++ and is a structure free, c ross -platform document oriented database. It uses Grid File System to store large files such as images and videos in BSON (Binary JSON) format. It prov ides effic ient performance, h igh consistency and high persistence but it is not very reliable and is resource hungry. Graph Store – Graph database focuses on relationships between data. It uses the graph theory approach to store the data and optimises the search by using index free adjacency technique. It is designed for data whose relationships are we ll represented by graph structures consisting of nodes, edges and properties. A node represents an object (an entity in the database), an edge
  • 17. describes the relationship between the objects and the property is the node on the other end of the relationship. In inde x free adjacency technique, each node consists of a pointer which directly points to the adjacent node as shown in Fig. 1. These stores provide fast performance, A CID compliance and rollback support. These databases are suitable to develop social-networking applications, bioinformat ics applications, content manage ment systems and cloud management services. Exa mp les of notable Graph databases are Neo Technology‟s Neo4j , Orient DB, Apache Giraph and Titan. Apache Giraph is an open source large-scale graph processing system and imp le mentation of Google Pregel (a graph processing architecture which has vertex-centric approach). It is designed for high scalability to overcome the crucial need for scalable platforms and parallel architectures that can process the bulk data p roduced by
  • 18. modern applications such as social networks and knowledge bases. For e xa mp le, it is currently used at Facebook, Lin kedIn and Twitter to analyse the graph formed by users and their connections. Giraph is a distributed and fault-tolerant system and offers features such as, master co mputation, sharded aggregators, edge - oriented input and out-of-core computation. Fig.1. Graph algorit hm. IV. HIGH LEVEL COMP ARISON BET WEEN NOSQL AND SQL DAT ABASES Based on the features of each type of database recently reported in the literature [1][3][11][13], we performed a high level co mparison between SQL (re lational) and NoSQL (non-re lational) databases and the summary of findings is given in Table 1. We considered aspects such as database type, schema,
  • 19. data model used, scaling model availab le, transactional capabilit ies, data man ipulation method used, and popular 62 SQL Versus NoSQL M ovement with Big Data Analytics Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 12, 59-66 database software available in the market in order to compare SQL databases versus NoSQL databases. Some e xa mples are a lso given in Table 1 for a better understanding of their diffe rences . Overall, Tab le 1 provides the high level diffe rences in the key features and properties exhib ited by relational and non -relational databases, which would support businesses in making decisions about using SQL or NoSQL database options in various Big Data application scenarios . T able 1. Relat ional Versus NoSQL Dat abases - High Level Differences Relational Databases NoSQL Databases Data
  • 20. base Type One SQL DBMS product (marginal variations) Four general types: key- value, document and wide- column and graph st ores S ch ema -defined foreign-key relationships bet ween t ables in an explicit dat abase schema ct definit ion of schema and dat a t ypes is required before insert ing dat a ent ire dat abase. definit ion in advance st ored t ogether as required
  • 21. t he schema freely wit h no downt ime. Data Mode l s row and columns in different tables joined via relat ionships of columns t o store a specific piece of dat a joins t wo separate t ables t he "employees" and "depart ments" t ogether to find out t he department of an employee. dat a – st ruct ured, semi- st ruct ured, and unst ruct ured
  • 22. different and flexible dat a models. For example:. Document st ore t ype organizes all relat ed dat a using references and embedded document s t ools. S cal i n g Mode l a resides on a single node and capacit y is added t o exist ing resources(data st orage or I/O capacity) part it ioning of t he dat a across addit ional servers or cloud inst ances as required. Tran s - acti on C apab-
  • 23. i l i ti e s t ransact ional propert ies, such as atomicity, consistency, isolat ion, durabilit y to ensure high dat a reliabilit y and dat a int egrit y. ance t ransactions and CAP T heorem of dist ributed syst ems supports consist ency of dat a across all nodes of a NoSQL dat abase single document . Data Mani pul ati on y Language – SQL DML
  • 24. St atement s are used to manipulat e dat a e.g. SELECT cust omer_name FROM cust omers WHERE cust omer_age>18; - Oriented APIs are used e.g. db.cust omers.find( {cust omer_age: {$gt : 18 }} { cust omer_name:1 }) S oftware Oracle, MySQL, DB2, SQLServer Couchbase, Ret hinkdb, Redis, Aerospike, Leveldb, Hbase, Cassandra, Neo4j, Elast icsearch, Lucene V. PERFORMANCE OF NOSQL AND SQL DAT ABASES FOR BIG DAT A ANALYT ICS
  • 25. The most important reason in moving towards NoSQL fro m re lational database is due to require ments of performance imp rovements. Choi et al. [1] found that a NoSQL database such as MongoDB provided mo re stable and faster performance at the e xpense of data consistency. The tests were done on an internal blog system based on an open source project. It was found that MongoDB stored posts 850% faster than a SQL database. It has been suggested that NoSQL should be used in environ ments which a re concerned with data availability rather than consistency. Fotache & Cogean [14] describe the use of MongoDB in mobile applications. Ce rtain mu ltiple update operations like Upsert are easier and faster to perform with NoSQL than SQL database. The use of cloud computing along with NoSQL is said to increase the performance especially in the data layer for mobile platforms. Ullah [15] co mpared performance of both re lational
  • 26. database management system (RDBMS) and NoSQL database where Resource Description Fra me work (RDF) based Trip le store was used as the NoSQL database. It was noted that NoSQL database was slower than the relation database due to the mass amount of me mory usage by the NoSQL database. Reading a large a mount of data takes toll on the database and because of the unstructured format of NoSQL database the storage of thousand records requires a huge amount of storage whereas the RDBMS uses less amount of storage. For e xa mple , searching red berry in the database took 5255 ms in the NoSQL database while it only took 165.43 ms to search it in RDBMS. Floratou et al. [4] performed the Yahoo Cloud Serving Benchma rk (YCSB) test on RDBMS and MongoDB. They tested SQL client sharded database against MongoDB auto and client sharded databases. The tests found that SQL client sharded database was able to attain
  • 27. higher throughput and lower latency in most of the benchmarks. The reason for higher performance is SQL is attributed to the fact that majority of the read requests are made to pages in the buffer pool whereas NoSQL databases tend to read shards located at different nodes. The study has tried to prove that RDBMS still has the processing power to handle larger wo rkloads similar to NoSQL. There are many advantages of NoSQL databases over SQL databases like easy scalability, fle xib le schema, lower cost and efficient and high performance. Having said that, there are some weaknesses of NoSQL over SQL databases to [12][16]. These are summarised below: lack of familiarity and limited expertise. either consistency or availability. language in all NoSQL databas es.
  • 28. databases SQL Versus NoSQL M ovement with Big Data Analytics 63 Copyright © 2016 MECS I.J. Information Technology and Computer Science, 2016, 12, 59-66 (Cassandra) compared to non -distributed ones (MongoDB). difficult to maintain. We have identified the following situations when NoSQL should be more suitable than SQL in the context of Big Data analytics: 1. Simp licity of use – current Big Data technologies are co mple x requiring highly skilled technical e xpertise, wh ile NoSQL offers simplicity that would improve the productivity of both developers
  • 29. and users. The simple, s mall, intuitive and easy to learn NoSQL stacks can suit businesses that require Big Data analytics to adopt a clean NoSQL-like APIs. 2. Adaptability to change – when business require ments and data models change warranting fle xib le Big Data analytics, NoSQL that supports fle xib le data schemas are idea l to integrate siloed and disparate backend systems. 3. Efficiency for analytics functionality – The foundation data structure of ma jority of NoSQL technology is the Javascript Object Notation (JSON) data format that caters to both schema-on- read and schema-on-write efficiently for data warehousing functionality. For e xa mp le, NoSQL Big Data Warehouse, SonarW for JSON ma kes analytics functionality effic ient for Big Data applications.
  • 30. 4. Distributed scalability – as systems and transactions become more and more distributed, flexible data becomes the norm and a strict schema approach is unsuitable. With schema evolution, NoSQL provides the necessary scalability for Big Data platforms to perform distributed queries faster.

Table 2. Comparison of NoSQL Data Models
NoSQL Data Model | NoSQL Databases | Performance | Scalability | Flexibility | Complexity | Functionality
Key-Value | DynamoDB, Cassandra, ATS, Riak, Berkeley DB | High | High | High | None | Variable (None)
Wide-Column | Cassandra, HBase, Big Table, HyperTable | High | High | Moderate | Low | Minimal
Document | MongoDB, CouchDB, Riak, DynamoDB | High | Variable (High) | High | Low | Variable (Low)
Graph | Neo4j, OrientDB, Giraph, Titan | Variable | Variable | High | High | Graph Theory

  • 31. VI. COMPARISON OF NOSQL DATA MODELS
NoSQL databases vary in their performance depending on their data model [17]. We compare the key attributes of the four types of NoSQL data models and summarise them in Table 2. As shown in Table 2, we have considered key attributes such as performance, scalability, flexibility, complexity and functionality for comparing the four data
  • 32. models supported by the popular NoSQL database software available in the market. Fig. 2 shows the CAP theorem as a visual guide to the NoSQL databases under each NoSQL data model [16], based on the features of consistency, availability and partition tolerance. With NoSQL databases, there are now other options for storing different kinds of data, where typically a distributed set of servers has to fit two of the three requirements of the CAP theorem, which is usually a deciding factor in what technology can be used. Bazar & Losif [3] compared the performance of the MongoDB, Cassandra and Couchbase databases, each possessing different features and functionalities. The tests were conducted using the YCSB tool.
Fig. 2. CAP theorem for NoSQL databases.
VII. RESULTS
The benchmark tests found that Couchbase produced
  • 33. the lowest latencies for interactive database applications. Couchbase is able to process more operations per second, with a lower average latency in reading and writing data, than both MongoDB and Cassandra. Document-level locking in the Couchbase database is the primary reason for its faster read and write operations. Cassandra is faster at writing than MongoDB, but both have almost equal reading speed. The study also notes that each NoSQL database suits specific application environments and cannot be considered a complete solution for every workload and use case. Another case study by Klein et al. [18] looked at the use of the NoSQL databases MongoDB, Riak and Cassandra in a distributed healthcare organisation. These databases use different NoSQL data models including key-value
  • 34. (Riak), column (Cassandra) and document (MongoDB). Cassandra produced the best overall performance for all types of database operations (reading, writing and updating). Riak's performance was degraded because its internal thread pool created a pool for each client session instead of a shared pool for all client sessions. Cassandra had the highest average latencies but also produced the best throughput results. This was due firstly to the indexing features that allowed Cassandra to retrieve the most recently written records efficiently, especially compared to Riak. Secondly, hash-based sharding allowed Cassandra to distribute storage requests, and hence load, better than MongoDB. Prasad & Gohil [11] discussed the use of different NoSQL databases for different work environments. They report that the performance of NoSQL databases is increased by the use of a collection of processors in the distributed system. MongoDB and Cassandra are
  • 35. considered the best databases to use in cases where data is frequently written but rarely read. NoSQL databases are described as victims of the Consistency, Availability and Partitioning (CAP) theorem. This means that a trade-off is always made, e.g. the database can either be consistent with low performance or offer high availability and low consistency with fast performance [11][17][19]. Zhikun et al. [20] suggested a new data allocation strategy based on load (DASBL) in order to increase the performance of a NoSQL database. However, DASBL only works when four conditions are satisfied and is unable to cater to an unbalanced system load. Prasad et al. [11] compared different attributes such as replication, sharding, consistency and failure handling. We summarise all these findings in Table 3, which provides a list of the best NoSQL databases for each of the features reported in the literature.
  • 36. Several doubts have arisen about the promises of NoSQL, and studies have been conducted to explore its strengths and weaknesses [21][22]. A recent study reviews the trends in storage and computing tools, along with their relative capabilities, limitations and the environments they are suited to [23]. While high-end platforms like IBM Netezza AMPP could cater to Big Data, economic considerations have led choices such as Hadoop to proliferate worldwide, resulting in a rise in the adoption of NoSQL databases that integrate easily with Hadoop. Even though HBase supports strong integration with Hadoop using Apache Hive, it is a better choice for application development only, and not for real-time queries and OLTP applications, owing to its very high latency. On the other hand, graph-based platforms such as Neo4j and Giraph form better options for storage and computation due to their capability to model vertex-edge scenarios in businesses that involve data
  • 37. environments such as social networks and geospatial paths. Overall, Big Data has led to the requirement for a new generation of data analytics tools [24][25], and hence it is realistic to believe that SQL and NoSQL databases will coexist. With cloud environments that support SQL databases, fast processing of data is warranted to enable efficient elasticity [26] and Big Data analytics that involve current and past data as well as future predictions. New solutions are being proposed for cloud monitoring that use NoSQL database back-ends to achieve very quick response times.

  • 38. Table 3. NoSQL databases mapped to their features
Features | Best NoSQL Databases
High Availability | Riak, Cassandra, Google Big Table, CouchDB
Partition Tolerance | MongoDB, Cassandra, Google Big Table, CouchDB, Riak, HBase
High Scalability | Google Big Table
Consistency | MongoDB, Google Big Table, Redis, HBase
Auto-Sharding | MongoDB
Write Frequently, Read Less | MongoDB, Redis, Cassandra
Fault Tolerant (No Single Point Of Failure) | Riak
Concurrency Control (MVCC) | Riak, Dynamo, CouchDB, Cassandra, Google Big Table
Concurrency Control (Locks) | MongoDB, Redis, Google Big Table
  • 39. VIII. CONCLUSIONS
The industry has been dominated by relational, or SQL, databases for several years. However, with business situations recently needing to store and process large datasets for business analytics, NoSQL databases provide an answer to such challenges. NoSQL offers schema-less data stores and transactions that allow businesses to freely add fields to records without having to define the schema a priori, which is a prime constraint in SQL databases. With the growing need to manage large data and unstructured business transactions via avenues such as social networks, NoSQL graphs are well suited to data that has complex relationship structures, while simplicity is achieved through key-value stores. NoSQL data models provide options for storing unstructured data as documents, key-value pairs, columns or graphs. These NoSQL storage models are easy to
  • 40. understand and implement and do not require complex SQL optimization techniques to perform Big Data analytics. This paper has compared SQL versus NoSQL databases as well as the four data models of NoSQL in the context of Big Data analytics for business situations. We conclude that the flexible data modelling of NoSQL is well suited to support dynamic scalability and improved performance for Big Data analytics and could be leveraged as new categories of data architectures coexisting with traditional SQL databases.
REFERENCES
[1] Choi, Y., Jeon, W., & Yo, S. (2014), 'Improving Database System Performance by Applying NoSQL', Journal of Information Processing Systems, 10(3), 355-364.
[2] Moniruzzaman, A. B., & Hossain, S. A. (2013), NoSQL database: New era of databases for big data analytics -
  • 41. Classification, characteristics and comparison. International Journal of Database Theory and Application, 6(4), 1-14.
[3] Bazar, C., & Losif, C. (2014), 'The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase', Database Systems Journal, 5(2), 49-59.
[4] Floratou, A., Teletia, N., Dewitt, D., Patel, J., & Zhang, D. (2012), 'Can the Elephants Handle the NoSQL Onslaught?', VLDB Endowment, 5(12), 1712-1723.
[5] Mason, R. T. (2015), 'NoSQL databases and data modeling techniques for a document-oriented NoSQL database', Proceedings of Informing Science & IT Education Conference (InSITE) 2015, 259-268.
  • 42. [6] Pothuganti, A. (2015), 'Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases', International Journal of Computer Science and Information Technologies, 6(1), 522-527.
[7] Smolan, R. & Erwit, J. (2012), The Human face of Big Data, Against all odds production, O'Reilly, USA.
[8] Sharda, R., Delen, D., & Turban, E. (2015), Business intelligence and analytics: systems for decision support (10th ed.). Upper Saddle River, NJ: Pearson.
[9] Ohlhorst, F. (2013), Big data analytics: Turning big data into big money. Hoboken, NJ. John Wiley and Sons.
[10] Kaur, P.D., Kaur, A. & Kaur, S. (2015), 'Performance Analysis in Bigdata', International Journal of Information Technology and Computer Science (IJITCS), 7(11), 55-61.
[11] Prasad, A, & Gohil, B. (2014), 'A Comparative Study of
  • 43. NoSQL Databases', International Journal Of Advanced Research In Computer Science, 5(5), 170-176.
[12] Nayak, A., Poriya, A. & Poojary, D. (2013), 'Type of NOSQL Databases and its Comparison with Relational Databases', International Journal of Applied Information Systems (IJAIS), 5(4), Foundation of Computer Science FCS, New York, USA.
[13] MongoDB (2014), 'Why NoSQL?', https://www.mongodb.com/nosql-explained, [Online: accessed 20-Feb-2016]
[14] Fotache, M., & Cogean, D. (2013), 'NoSQL and SQL Databases for Mobile Applications. Case Study: MongoDB versus PostgreSQL', Informatica Economica, 17(2), 41-58.
[15] Ullah, Md A. (2015), 'A Digital Library for Plant Information with Performance Comparison between a
  • 44. Relational Database and a NoSQL Database (RDF Triple Store)', Technical Library, Paper 205.
[16] Hurst, N. (2010), 'Visual Guide to NoSQL Systems', http://blog.nahurst.com/visual-guide-to-nosql-systems, [Online: accessed 5-Nov-2015]
[17] Planet Cassandra, 'NoSQL Databases Defined and Explained', http://www.planetcassandra.org/what-is-nosql, [Online: accessed 24-Mar-2016]
[18] Klein, J., Gorton, I., Ernst, N. & Donohoe, P. (2015), 'Performance Evaluation of NoSQL Databases: A Case Study', Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems, PABS'15, Austin, 5-10.
[19] MongoDB (2015), 'Top 5 Considerations When Evaluating NoSQL Databases',
  • 45. https://s3.amazonaws.com/info-mongodb-com/10gen_Top_5_NoSQL_Considerations.pdf [Online: accessed 5-Nov-2015]
[20] Zhikun, C., Shuqiang, Y., Shuan, T., Hui, Z., Li., Ge, Z., & Huiyu, Z. (2014), 'The Data Allocation Strategy Based on Load in NoSQL Database', Applied Mechanics and Materials, 513-517, 1464-1469.
[21] Leavitt, N. (2010), 'Will NoSQL Databases Live Up to Their Promise?', IEEE Computer, 43(2), 12-14.
[22] Subramanian, S. (2012), 'NoSQL: An Analysis of the Strengths and Weaknesses', https://dzone.com/articles/nosql-analysis-strengths-and, [Online: accessed 15-Jan-2016]
[23] Prasad B.R. & Agarwal S. (2016), 'Comparative Study of
  • 46. Big Data Computing and Storage Tools: A Review', International Journal of Database Theory and Application, 9(1), 45-66.
[24] Warden P. (2012), Big Data Glossary - A Guide to the New Generation of Data Tools, O'Reilly, USA.
[25] Zareian S., Fokaefs, M., Khazaei H., Litoiu M. & Zhang X. (2016), 'A Big Data Framework for Cloud Monitoring', Proceedings of the 2nd International Workshop on BIG Data Software Engineering (BIGDSE'16), ACM Digital Library, 58-64.
[26] Ramanathan, V. & Venkatraman, S. (2015), Cloud Adoption in Enterprises: Security Issues and Strategies, 96-121, Book Chapter In Haider A. and Pishdad A. (Eds.), Business Technologies in Contemporary Organizations: Adoption, Assimilation, and Institutionalization, IGI
  • 47. Global Publishers, USA.
Authors' Profiles
Dr. Sitalakshmi Venkatraman obtained her doctoral degree in Computer Science from the National Institute of Industrial Engineering, India in 1993 and an MEd from the University of Sheffield, UK in 2001. Prior to this, she completed an MSc in Mathematics in 1985 and an MTech in Computer Science in 1987, both from the Indian Institute of Technology, Madras, India. This author is a Senior Member (SM) of IASCIT.
  • 48. In the past 25 years, Sita's work experience has involved both industry and academia: developing turnkey projects for the IT industry and teaching a variety of IT courses for tertiary institutions in India, Singapore, New Zealand and, since 2007, Australia. She currently works as Lecturer (Information Technology) at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia. She also serves as Member of the Register of Experts at Australia's Tertiary Education Quality and Standards Agency (TEQSA). Sita has published eight book chapters and more than 100 research papers in internationally well-known refereed journals and conferences that include Information Sciences, Journal of Artificial Intelligence in Engineering, International Journal of
  • 49. Business Information Systems, and Information Management & Computer Security. She serves as Program Committee Member of several international conferences and as Senior Member of professional societies, and sits on the editorial board of three international journals.
Kiran Fahd received the B.Eng in software engineering from the National University of Emerging Technologies, Pakistan in 2001,
  • 50. and the Master's degree in Enterprise Planning System - ERP from Victoria University, Melbourne in 2010. Since 2001, she has worked as a software engineer and as a teacher. Kiran has held various lecturing positions in Australian and overseas universities. She currently teaches subjects in the Bachelor of Information Technology under the Software Development major at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia.
Dr. Samuel Kaspi earned his PhD (Computer Science) from Victoria University, a Masters
  • 51. of Computer Science from Monash University and a Bachelor of Economics and Politics from Monash University. He is a member of the Australian Computer Society (ACS) and the Association for Computing Machinery (ACM). Sam is currently the Information Technology Discipline Leader and Senior Lecturer of IT at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia. Previously, Dr Kaspi taught at Victoria University, consulted privately and was the CIO of OzMiz Pty Ltd. Sam has been active in both teaching and private enterprise in
  • 52. the areas of software specification, design and development. As chief information officer (CIO) of a small private company he managed the development and submission of five granted and three pending patents. He also managed the submission of a successful Federal Government Comet grant under the Commercialising Emerging Technologies category. He has also had a number of peer-reviewed publications, including with the Institute of Electrical and Electronics Engineers (IEEE).
Dr. Ramanathan Venkatraman is working as Member, Advanced Technology Application Practice at the National University of Singapore. He has
  • 53. served industry and academia for more than 32 years and has a wide spectrum of experience in the fields of IT and business process engineering. His current research focuses on evolving decision models for business problems and, more recently, he has been contributing to the frontiers of knowledge by devising innovative architectural models for ICT in domains such as Service Oriented Architecture, Big Data and Enterprise Cloud Computing. Dr Venkatraman has a strong practice approach, having worked on large-scale IT projects across Asia, the US, Europe and NZ. He has published more than 20 research papers in leading journals. Apart from research and
  • 54. consulting, he also teaches advanced technical courses for the Masters program at NUS and has been a key architect in setting up innovative software engineering and business analytics curricula in the fast-changing IT education scenario.
How to cite this paper: Sitalakshmi Venkatraman, Kiran Fahd, Samuel Kaspi, Ramanathan Venkatraman, "SQL Versus NoSQL Movement with Big Data Analytics", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.59-66, 2016. DOI: 10.5815/ijitcs.2016.12.07
Rubric for Journal Entries
  • 55. Student must meet a tier in the Check+ column to receive a Check+, starting with Format (Level 1), then Conceptualization (Level 2), and then with Application (Level 3). If student does not meet the first level in the check+ column, he or she will be moved to the Check column and the same process will begin again. The higher bullet points within each column will receive a higher grade within that box. For instance within the Check+ box for Conceptualization having 4 concepts clearly listed from the chapter will receive a solid A while listing only 3 concepts may receive an A-. Criterion Check+ (A grade) Check (B grade) Check– (C grade) Format (Level 1) 10% of grade 1.25 points possible per journal entry
  • 56. • About one page (but no more), 12 font, regular margins, double spaced. • Most of the above is correct but there are a few small problems • Minimum half page, typed, but no more than one page, 12 font, regular margins, double spaced • Not a full page. Some other small problems. • Less than half a page, or irregular margins, spacing, or font • More problems than above Conceptualization
  • 57. (Level 2) 35% of Grade 4.75 points possible per journal entry
• 4 concepts clearly listed (note 1) from the chapter (note 2) relating to score on questionnaire
• 3 concepts clearly listed
Note 1: Clearly listed = page numbers are cited to show where concept is found & concepts are defined.
Note 2: Referencing content from assessment does not count as a concept listed. That info may be used in the entry, but only concepts found in the chapter and cited will be counted.
• 2 concepts listed from the chapter relating to score on questionnaire
  • 58. • 1 concept listed from the chapter • Concepts are definitely there but not clearly defined or connected to questionnaire score • Concept possibly included but not clear or from the wrong chapter and not connected to the questionnaire • No concept listed from the chapter relating to score on questionnaire Application (Level 3)
  • 59. 55% of Grade 7.50 points possible per journal entry • Clearly defined implications for how each concept applies to the student’s success in the workplace. Implications are thoughtful and detailed – they explain how you should change by applying them. • Implications are clearly explored in depth but not connected very well to concepts. Implications somewhat thoughtful. • Implications clearly explored for some but not all concepts. Implications not quite as thoughtful. • Vaguely explored for how concept applies to the student’s
  • 60. success in the workplace.
• Implications are in the text but are implicit and hard to discern.
• No implications of concept and how it applies to the student's success in the workplace
• Can't even draw implicit implications from the journal entry.

Information Retrieval
P. BAXENDALE, Editor
A Relational Model of Data for Large Shared Data Banks
  • 61. E. F. CODD
IBM Research Laboratory, San Jose, California
Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.
  • 62. Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user's model.
KEY WORDS AND PHRASES: data bank, data base, data structure, data organization, hierarchies of data, networks of data, relations, derivability, redundancy, consistency, composition, join, retrieval language, predicate
  • 63. calculus, security, data integrity
CR CATEGORIES: 3.70, 3.73, 3.75, 4.20, 4.22, 4.29
1. Relational Model and Normal Form
1.1. INTRODUCTION
This paper is concerned with the application of elementary relation theory to systems which provide shared access to large banks of formatted data. Except for a paper by Childs [1], the principal application of relations to data systems has been to deductive question-answering systems. Levein and Maron [2] provide numerous references to work in this area. In contrast, the problems treated here are those of data independence (the independence of application programs and terminal activities from growth in data types and changes in data representation) and certain kinds of data inconsistency which are expected to become troublesome even in nondeductive systems.
Volume 13 / Number 6 / June, 1970
  • 64. The relational view (or model) of data described in Section 1 appears to be superior in several respects to the graph or network model [3,4] presently in vogue for noninferential systems. It provides a means of describing data with its natural structure only, that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other. A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations; these are discussed in Section 2. The network model, on the other hand, has spawned a number of confusions, not the least of which is mistaking the derivation of connections for the derivation of relations (see remarks in Section 2 on the "connection trap"). Finally, the relational view permits a clearer evaluation of the scope and logical limitations of present formatted data systems, and also the relative merits (from a logical standpoint) of competing representations of data within a single system. Examples of this clearer perspective are
  • 65. cited in various parts of this paper. Implementations of systems to support the relational model are not discussed.
1.2. DATA DEPENDENCIES IN PRESENT SYSTEMS
The provision of data description tables in recently developed information systems represents a major advance toward the goal of data independence [5,6,7]. Such tables facilitate changing certain characteristics of the data representation stored in a data bank. However, the variety of data representation characteristics which can be changed without logically impairing some application programs is still quite limited. Further, the model of data with which users interact is still cluttered with representational properties, particularly in regard to the representation of collections of data (as opposed to individual items). Three of the principal kinds of data dependencies which still need to be removed are: ordering dependence, indexing dependence, and access path dependence. In some systems these dependencies are not clearly separable from one another.
1.2.1. Ordering Dependence. Elements of data in a data bank may be stored in a variety of ways, some involving no concern for ordering, some permitting each element to participate in one ordering only, others permitting each
  • 66. element to participate in several orderings. Let us consider those existing systems which either require or permit data elements to be stored in at least one total ordering which is closely associated with the hardware-determined ordering of addresses. For example, the records of a file concerning parts might be stored in ascending order by part serial number. Such systems normally permit application programs to assume that the order of presentation of records from such a file is identical to (or is a subordering of) the stored ordering.
Communications of the ACM 377
Those application programs which take advantage of the stored ordering of a file are likely to fail to operate correctly if for some reason it becomes necessary to replace that ordering by a different one. Similar remarks hold for a stored ordering implemented by means of pointers. It is unnecessary to single out any system as an example, because all the well-known information systems that are marketed today fail to make a clear distinction between order of presentation on the one hand and stored ordering
  • 67. on the other. Significant implementation problems must be solved to provide this kind of independence.
1.2.2. Indexing Dependence. In the context of formatted data, an index is usually thought of as a purely performance-oriented component of the data representation. It tends to improve response to queries and updates and, at the same time, slow down response to insertions and deletions. From an informational standpoint, an index is a redundant component of the data representation. If a system uses indices at all and if it is to perform well in an environment with changing patterns of activity on the data bank, an ability to create and destroy indices from time to time will probably be necessary. The question then arises: Can application programs and terminal activities remain invariant as indices come and go? Present formatted data systems take widely different approaches to indexing. TDMS [7] unconditionally provides indexing on all attributes. The presently released version of IMS [5] provides the user with a choice for each file: a choice between no indexing at all (the hierarchic sequential organization) or indexing on the primary key only (the hierarchic indexed sequential organization). In neither case is the user's application logic dependent on the
  • 68. existence of the unconditionally provided indices. IDS [8], however, permits the file designers to select attributes to be indexed and to incorporate indices into the file structure by means of additional chains. Application programs taking advantage of the performance benefit of these indexing chains must refer to those chains by name. Such programs do not operate correctly if these chains are later removed.
1.2.3. Access Path Dependence. Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired if the trees or networks are changed in structure. A simple example follows. Suppose the data bank contains information about parts and projects. For each part, the part number, part name, part description, quantity-on-hand, and quantity-on-order are recorded. For each project, the project number, project name, project description are recorded. Whenever a project makes use of a certain part, the quantity of that part committed to the given project is also recorded. Suppose that the system requires the user or file designer to declare or define the data in terms of tree structures. Then, any one
  • 69. of the hierarchical structures may be adopted for the information mentioned above (see Structures 1-5).

Structure 1. Projects Subordinate to Parts
File | Segment | Fields
F | PART | part #, part name, part description, quantity-on-hand, quantity-on-order
  | PROJECT | project #, project name, project description, quantity committed

Structure 2. Parts Subordinate to Projects
File | Segment | Fields
F | PROJECT | project #, project name, project description
  | PART | part #, part name, part description, quantity-on-hand, quantity-on-order, quantity committed

  • 70. Structure 3. Parts and Projects as Peers; Commitment Relationship Subordinate to Projects
File | Segment | Fields
F | PART | part #, part name, part description, quantity-on-hand, quantity-on-order
G | PROJECT | project #, project name, project description
  | PART | part #, quantity committed

  • 71. Structure 4. Parts and Projects as Peers; Commitment Relationship Subordinate to Parts
File | Segment | Fields
F | PART | part #, part description, quantity-on-hand, quantity-on-order
  | PROJECT | project #, quantity committed
G | PROJECT | project #, project name, project description

Structure 5. Parts, Projects, and Commitment Relationship as Peers
File | Segment | Fields
F | PART | part #, part name, part description, quantity-on-hand, quantity-on-order
G | PROJECT | project #, project name, project description
H | COMMIT | part #, project #, quantity committed

  • 72. Now, consider the problem of printing out the part number, part name, and quantity committed for every part used in the project whose project name is "alpha." The following observations may be made regardless of which available tree-oriented information system is selected to tackle this problem. If a program P is developed for this problem assuming one of the five structures above (that is, P makes no test to determine which structure is in effect), then P will fail on at least three of the remaining structures. More specifically, if P succeeds with structure 5, it will fail with all the others; if P succeeds with structure 3
  • 73. or 4, it will fail with at least 1, 2, and 5; if P succeeds with 1 or 2, it will fail with at least 3, 4, and 5. The reason is simple in each case. In the absence of a test to determine which structure is in effect, P fails because an attempt is made to execute a reference to a nonexistent file (available systems treat this as an error) or no attempt is made to execute a reference to a file containing needed information. The reader who is not convinced should develop sample programs for this simple problem. Since, in general, it is not practical to develop application programs which test for all tree structurings permitted by the system, these programs fail when a change in structure becomes necessary. Systems which provide users with a network model of the data run into similar difficulties. In both the tree and network cases, the user (or his program) is required to exploit a collection of user access paths to the data. It does not matter whether these paths are in close correspondence with pointer-defined paths in the stored representation: in IDS the correspondence is extremely simple, in TDMS it is just the opposite. The consequence, regardless of the stored representation, is that terminal activities and programs become dependent on the continued existence of the user
  • 74. access paths. One solution to this is to adopt the policy that once a user access path is defined it will not be made obsolete until all application programs using that path have become obsolete. Such a policy is not practical, because the number of access paths in the total model for the community of users of a data bank would eventually become excessively large.
1.3. A RELATIONAL VIEW OF DATA
The term relation is used here in its accepted mathematical sense. Given sets $S_1, S_2, \ldots, S_n$ (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from $S_1$, its second element from $S_2$, and so on.¹ We shall refer to $S_j$ as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.
¹ More concisely, R is a subset of the Cartesian product $S_1 \times S_2 \times \cdots \times S_n$.
For expository reasons, we shall frequently make use of an array representation of relations, but it must be remembered that this particular representation is not an essential part of the relational view being expounded. An array
  • 75. which represents an n-ary relation R has the following properties:
(1) Each row represents an n-tuple of R.
(2) The ordering of rows is immaterial.
(3) All rows are distinct.
(4) The ordering of columns is significant: it corresponds to the ordering $S_1, S_2, \ldots, S_n$ of the domains on which R is defined (see, however, remarks below on domain-ordered and domain-unordered relations).
(5) The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.
The example in Figure 1 illustrates a relation of degree 4, called supply, which reflects the shipments-in-progress of parts from specified suppliers to specified projects in specified quantities.
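The definition above can be illustrated with a minimal sketch, assuming Python sets of tuples as a stand-in for relations. The tuples are those of the supply relation in Figure 1; the domain ranges and the helper name `is_relation_on` are illustrative assumptions, not part of the paper.

```python
# Assumed value pools for the four domains (illustrative ranges only).
suppliers = range(1, 5)
parts = range(1, 8)
projects = range(1, 8)
quantities = range(0, 100)

# The supply relation of Figure 1 as a set of 4-tuples.
supply = {
    (1, 2, 5, 17),
    (1, 3, 5, 23),
    (2, 3, 7, 9),
    (2, 7, 5, 4),
    (4, 1, 1, 12),
}


def is_relation_on(rel, *domains):
    """True if every tuple takes its j-th element from the j-th domain,
    i.e. rel is a subset of the Cartesian product of the domains."""
    return all(
        len(t) == len(domains) and all(v in d for v, d in zip(t, domains))
        for t in rel
    )


# The degree is the common length of the tuples.
degree = len(next(iter(supply)))
```

Because a Python set has no row order and holds no duplicates, properties (2) and (3) of the array representation hold automatically in this sketch; column order, property (4), is carried by tuple position.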
  • 76. supply (supplier  part  project  quantity)
                   1       2       5        17
                   1       3       5        23
                   2       3       7         9
                   2       7       5         4
                   4       1       1        12
FIG. 1. A relation of degree 4
One might ask: If the columns are labeled by the name of corresponding domains, why should the ordering of columns matter? As the example in Figure 2 shows, two columns may have identical headings (indicating identical domains) but possess distinct meanings with respect to the relation. The relation depicted is called component. It is a ternary relation, whose first two domains are called part and third domain is called quantity. The meaning of component (x, y, z) is that part x is an immediate component (or subassembly) of part y, and z units of part x are needed to assemble one unit of part y. It is a relation which plays a critical role in the parts explosion problem.
  • 77. component (part  part  quantity)
                     1     5       9
                     2     5       7
                     3     5       2
                     2     6      12
                     3     6       3
                     4     7       1
                     6     7       1
FIG. 2. A relation with two identical domains
It is a remarkable fact that several existing information systems (chiefly those based on tree-structured files) fail to provide data representations for relations which have two or more identical domains. The present version of IMS/360 [5] is an example of such a system.
The totality of data in a data bank may be viewed as a collection of time-varying relations. These relations are of assorted degrees. As time progresses, each n-ary relation may be subject to insertion of additional n-tuples, deletion of existing ones, and alteration of components of any of its existing n-tuples.
  • 78. In many commercial, governmental, and scientific data banks, however, some of the relations are of quite high degree (a degree of 30 is not at all uncommon). Users should not normally be burdened with remembering the domain ordering of any relation (for example, the ordering supplier, then part, then project, then quantity in the relation supply). Accordingly, we propose that users deal, not with relations which are domain-ordered, but with relationships which are their domain-unordered counterparts.² To accomplish this, domains must be uniquely identifiable at least within any given relation, without using position. Thus, where there are two or more identical domains, we require in each case that the domain name be qualified by a distinctive role name, which serves to identify the role played by that domain in the given relation. For example, in the relation component of Figure 2, the first domain part might be qualified by the role name sub, and the second by super, so that users could deal with the relationship component and its domains (sub.part, super.part, quantity) without regard to any ordering between these domains.
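The move from domain-ordered relations to domain-unordered relationships can be sketched as follows. This is only an illustration under assumed conventions: each tuple is stored as a frozenset of (name, value) pairs, and the helper `relationship` is a hypothetical name. The data are the component tuples of Figure 2 with the role names sub and super suggested above.

```python
def relationship(names, rows):
    """Build the domain-unordered counterpart of a relation.

    Each row becomes a frozenset of (domain name, value) pairs, so the
    position of a column no longer carries meaning. Names must be unique,
    which is why identical domains need role qualification.
    """
    if len(set(names)) != len(names):
        raise ValueError("domain names must be unique; qualify them by role")
    return {frozenset(zip(names, row)) for row in rows}


component_rows = [
    (1, 5, 9), (2, 5, 7), (3, 5, 2), (2, 6, 12),
    (3, 6, 3), (4, 7, 1), (6, 7, 1),
]

component = relationship(("sub.part", "super.part", "quantity"), component_rows)

# The same data presented with its columns permuted.
permuted = relationship(
    ("quantity", "super.part", "sub.part"),
    [(q, sup, sub) for (sub, sup, q) in component_rows],
)

# Quantities needed wherever part 2 is the immediate component.
quantities_for_part_2 = {
    dict(t)["quantity"] for t in component if dict(t)["sub.part"] == 2
}
```

In this representation `component` and `permuted` compare equal, which is the sense in which a relationship is indifferent to column ordering; using the unqualified names ("part", "part", "quantity") is rejected.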
  • 79. To sum up, it is proposed that most users should interact with a relational model of the data consisting of a collection of time-varying relationships (rather than relations). Each user need not know more about any relationship than its name together with the names of its domains (role qualified whenever necessary).³ Even this information might be offered in menu style by the system (subject to security and privacy constraints) upon request by the user.
There are usually many alternative ways in which a relational model may be established for a data bank. In order to discuss a preferred way (or normal form), we must first introduce a few additional concepts (active domain, primary key, foreign key, nonsimple domain) and establish some links with terminology currently in use in information systems programming. In the remainder of this paper, we shall not bother to distinguish between relations and relationships except where it appears advantageous to be explicit.
Consider an example of a data bank which includes relations concerning parts, projects, and suppliers. One relation called part is defined on the following domains:
(1) part number
  • 80. (2) part name
(3) part color
(4) part weight
(5) quantity on hand
(6) quantity on order
and possibly other domains as well. Each of these domains is, in effect, a pool of values, some or all of which may be represented in the data bank at any instant. While it is conceivable that, at some instant, all part colors are present, it is unlikely that all possible part weights, part names, and part numbers are. We shall call the set of values represented at some instant the active domain at that instant.
² In mathematical terms, a relationship is an equivalence class of those relations that are equivalent under permutation of domains (see Section 2.1.1).
³ Naturally, as with any data put into and retrieved from a computer system, the user will normally make far more effective use of the data if he is aware of its meaning.
Normally, one domain (or combination of domains) of a
  • 81. given relation has values which uniquely identify each element (n-tuple) of that relation. Such a domain (or combination) is called a primary key. In the example above, part number would be a primary key, while part color would not be. A primary key is nonredundant if it is either a simple domain (not a combination) or a combination such that none of the participating simple domains is superfluous in uniquely identifying each element. A relation may possess more than one nonredundant primary key. This would be the case in the example if different parts were always given distinct names. Whenever a relation has two or more nonredundant primary keys, one of them is arbitrarily selected and called the primary key of that relation.
A common requirement is for elements of a relation to cross-reference other elements of the same relation or elements of a different relation. Keys provide a user-oriented means (but not the only means) of expressing such cross-references. We shall call a domain (or domain combination) of relation R a foreign key if it is not the primary key of R but its elements are values of the primary key of some relation S (the possibility that S and R are identical is not excluded). In the relation supply of Figure 1, the combination of supplier, part, project is the primary key, while each
  • 82. of these three domains taken separately is a foreign key.
In previous work there has been a strong tendency to treat the data in a data bank as consisting of two parts, one part consisting of entity descriptions (for example, descriptions of suppliers) and the other part consisting of relations between the various entities or types of entities (for example, the supply relation). This distinction is difficult to maintain when one may have foreign keys in any relation whatsoever. In the user’s relational model there appears to be no advantage to making such a distinction (there may be some advantage, however, when one applies relational concepts to machine representations of the user’s set of relationships).
So far, we have discussed examples of relations which are defined on simple domains, that is, domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on. For example, one of the domains on which the relation employee is defined might be salary history. An element of the salary history domain is a binary relation defined on the domain date and the domain salary. The salary history domain
  • 83. is the set of all such binary relations. At any instant of time there are as many instances of the salary history relation in the data bank as there are employees. In contrast, there is only one instance of the employee relation.
The terms attribute and repeating group in present data base terminology are roughly analogous to simple domain and nonsimple domain, respectively. Much of the confusion in present terminology is due to failure to distinguish between type and instance (as in “record”) and between components of a user model of the data on the one hand and their machine representation counterparts on the other hand (again, we cite “record” as an example).
1.4. NORMAL FORM
A relation whose domains are all simple can be represented in storage by a two-dimensional column-homogeneous array of the kind discussed above. Some more
  • 84. complicated data structure is necessary for a relation with one or more nonsimple domains. For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating.⁴ There is, in fact, a very simple elimination procedure, which we shall call normalization.
Consider, for example, the collection of relations exhibited in Figure 3 (a). Job history and children are nonsimple domains of the relation employee. Salary history is a nonsimple domain of the relation job history. The tree in Figure 3 (a) shows just these interrelationships of the nonsimple domains.
            employee
           /        \
    jobhistory     children
        |
   salaryhistory
employee (man#, name, birthdate, jobhistory, children)
jobhistory (jobdate, title, salaryhistory)
salaryhistory (salarydate, salary)
children (childname, birthyear)
FIG. 3(a). Unnormalized set
  • 85. employee′ (man#, name, birthdate)
jobhistory′ (man#, jobdate, title)
salaryhistory′ (man#, jobdate, salarydate, salary)
children′ (man#, childname, birthyear)
FIG. 3(b). Normalized set
Normalization proceeds as follows. Starting with the relation at the top of the tree, take its primary key and expand each of the immediately subordinate relations by inserting this primary key domain or domain combination. The primary key of each expanded relation consists of the primary key before expansion augmented by the primary key copied down from the parent relation. Now, strike out from the parent relation all nonsimple domains, remove the top node of the tree, and repeat the same sequence of operations on each remaining subtree.
The result of normalizing the collection of relations in
  • 86. Figure 3 (a) is the collection in Figure 3 (b). The primary key of each relation is italicized to show how such keys are expanded by the normalization.
⁴ M. E. Sanko of IBM, San Jose, independently recognized the desirability of eliminating nonsimple domains.
If normalization as described above is to be applicable, the unnormalized collection of relations must satisfy the following conditions:
(1) The graph of interrelationships of the nonsimple domains is a collection of trees.
(2) No primary key has a component domain which is nonsimple.
The writer knows of no application which would require any relaxation of these conditions. Further operations of a normalizing kind are possible. These are not discussed in this paper.
The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication
  • 87. of bulk data between systems which use widely different representations of the data. The communication form would be a suitably compressed version of the array representation and would have the following advantages:
(1) It would be devoid of pointers (address-valued or displacement-valued).
(2) It would avoid all dependence on hash addressing schemes.
(3) It would contain no indices or ordering lists.
If the user’s relational model is set up in normal form, names of items of data in the data bank can take a simpler form than would otherwise be the case. A general name would take a form such as
R (g).r.d
where R is a relational name; g is a generation identifier (optional); r is a role name (optional); d is a domain name. Since g is needed only when several generations of a given relation exist, or are anticipated to exist, and r is needed only when the relation R has two or more domains named
  • 88. d, the simple form R.d will often be adequate.
1.5. SOME LINGUISTIC ASPECTS
The adoption of a relational model of data, as described above, permits the development of a universal data sublanguage based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relations is in normal form. Such a language would provide a yardstick of linguistic power for all other proposed data languages, and would itself be a strong candidate for embedding (with appropriate syntactic modification) in a variety of host languages (programming, command- or problem-oriented). While it is not the purpose of this paper to describe such a language in detail, its salient features would be as follows.
Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their domains. Each declaration of a relation identifies the primary key for that relation. Declared relations are added to the system catalog for use by any members of the user community who have appropriate authorization. H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage.
  • 89. R permits the specification for retrieval of any subset of data from the data bank. Action on such a retrieval request is subject to security constraints.
The universality of the data sublanguage lies in its descriptive ability (not its computing ability). In a large data bank each subset of the data has a very large number of possible (and sensible) descriptions, even when we assume (as we do) that there is only a finite set of function subroutines to which the system has access for use in qualifying data for retrieval. Thus, the class of qualification expressions which can be used in a set specification must have the descriptive power of the class of well-formed formulas of an applied predicate calculus. It is well known that to preserve this descriptive power it is unnecessary to express (in whatever syntax is chosen) every formula of the selected predicate calculus. For example, just those in prenex normal form are adequate [9].
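As a rough illustration of retrieval by qualification, the sketch below uses a Python set comprehension in place of the data sublanguage, with `any` standing in for the existential quantifier. It is not the language the paper proposes; the relation contents and part names are hypothetical sample data. The query is the one posed earlier: part number, part name, and quantity committed for every part used in the project named "alpha".

```python
# Hypothetical sample data (not from the paper).
part = {(1, "bolt"), (2, "nut"), (3, "gear")}        # (part#, part name)
project = {(10, "alpha"), (20, "beta")}               # (project#, project name)
commit = {(1, 10, 100), (3, 10, 5), (2, 20, 40)}      # (part#, project#, quantity)

# { (p, n, q) : part(p, n) AND EXISTS j ( commit(p, j, q) AND project(j, "alpha") ) }
result = {
    (pno, pname, qty)
    for (pno, pname) in part
    for (cpno, jno, qty) in commit
    if cpno == pno
    and any(j == jno and jname == "alpha" for (j, jname) in project)
}
```

The comprehension states what is wanted in terms of the relations alone; it names no file, pointer, or access path, so it does not depend on which of the stored structures happens to be in effect.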
  • 90. Arithmetic functions may be needed in the qualification or other parts of retrieval statements. Such functions can be defined in H and invoked in R.
A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or subcommunities) take the form of removing elements from declared relations. Some deletions and updates may be triggered by others, if deletion and update dependencies between specified relations are declared in R.
One important effect that the view adopted toward data has on the language used to retrieve it is in the naming of data elements and sets. Some aspects of this have been discussed in the previous section. With the usual network view, users will often be burdened with coining and using more relation names than are absolutely necessary, since names are associated with paths (or path types) rather than with relations.
Once a user is aware that a certain relation is stored, he
  • 91. will expect to be able to exploit⁵ it using any combination of its arguments as “knowns” and the remaining arguments as “unknowns,” because the information (like Everest) is there. This is a system feature (missing from many current information systems) which we shall call (logically) symmetric exploitation of relations. Naturally, symmetry in performance is not to be expected.
⁵ Exploiting a relation includes query, update, and delete.
To support symmetric exploitation of a single binary relation, two directed paths are needed. For a relation of degree n, the number of paths to be named and controlled is n factorial.
Again, if a relational view is adopted in which every n-ary relation (n > 2) has to be expressed by the user as a nested expression involving only binary relations (see Feldman’s LEAP System [10], for example) then 2n − 1 names have to be coined instead of only n + 1 with direct n-ary notation as described in Section 1.2. For example, the 4-ary relation supply of Figure 1, which entails 5 names in n-ary notation, would be represented in the form
  • 92. P (supplier, Q (part, R (project, quantity)))
in nested binary notation and, thus, employ 7 names.
A further disadvantage of this kind of expression is its asymmetry. Although this asymmetry does not prohibit symmetric exploitation, it certainly makes some bases of interrogation very awkward for the user to express (consider, for example, a query for those parts and quantities related to certain given projects via Q and R).
1.6. EXPRESSIBLE, NAMED, AND STORED RELATIONS
Associated with a data bank are two collections of relations: the named set and the expressible set. The named set is the collection of all those relations that the community of users can identify by means of a simple name (or identifier). A relation R acquires membership in the named set when a suitably authorized user declares R; it loses membership when a suitably authorized user cancels the declaration of R.
The expressible set is the total collection of relations that can be designated by expressions in the data language. Such expressions are constructed from simple names of relations
  • 93. in the named set; names of generations, roles and domains; logical connectives; the quantifiers of the predicate calculus;⁶ and certain constant relation symbols such as =, >. The named set is a subset of the expressible set, usually a very small subset.
Since some relations in the named set may be time-independent combinations of others in that set, it is useful to consider associating with the named set a collection of statements that define these time-independent constraints. We shall postpone further discussion of this until we have introduced several operations on relations (see Section 2).
One of the major problems confronting the designer of a data system which is to support a relational model for its users is that of determining the class of stored representations to be supported. Ideally, the variety of permitted data representations should be just adequate to cover the spectrum of performance requirements of the total collection of installations. Too great a variety leads to unnecessary overhead in storage and continual reinterpretation of descriptions for the structures currently in effect.
For any selected class of stored representations the data system must provide a means of translating user requests
  • 94. expressed in the data language of the relational model into corresponding (and efficient) actions on the current stored representation. For a high level data language this presents a challenging design problem. Nevertheless, it is a problem which must be solved: as more users obtain concurrent access to a large data bank, responsibility for providing efficient response and throughput shifts from the individual user to the data system.
⁶ Because each relation in a practical data bank is a finite set at every instant of time, the existential and universal quantifiers can be expressed in terms of a function that counts the number of elements in any finite set.
2. Redundancy and Consistency
2.1. OPERATIONS ON RELATIONS
Since relations are sets, all of the usual set operations are
  • 95. applicable to them. Nevertheless, the result may not be a relation; for example, the union of a binary relation and a ternary relation is not a relation.
The operations discussed below are specifically for relations. These operations are introduced because of their key role in deriving relations from other relations. Their principal application is in noninferential information systems (systems which do not provide logical inference services), although their applicability is not necessarily destroyed when such services are added.
Most users would not be directly concerned with these operations. Information systems designers and people concerned with data bank control should, however, be thoroughly familiar with them.
2.1.1. Permutation. A binary relation has an array representation with two columns. Interchanging these columns yields the converse relation. More generally, if a permutation is applied to the columns of an n-ary relation, the resulting relation is said to be a permutation of the given relation. There are, for example, 4! = 24 permutations of the relation supply in Figure 1, if we include the identity permutation which leaves the ordering of columns
  • 96. unchanged.
Since the user’s relational model consists of a collection of relationships (domain-unordered relations), permutation is not relevant to such a model considered in isolation. It is, however, relevant to the consideration of stored representations of the model. In a system which provides symmetric exploitation of relations, the set of queries answerable by a stored relation is identical to the set answerable by any permutation of that relation. Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.
2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.
A selection operator $\pi$ is used to obtain any desired permutation, projection, or combination of the two operations. Thus, if L is a list of k indices⁷ $L = i_1, i_2, \ldots, i_k$ and R is an n-ary relation ($n \ge k$), then $\pi_L(R)$ is the k-ary relation whose jth column is column $i_j$ of R ($j = 1, 2, \ldots, k$)
  • 97. except that duplication in resulting rows is removed. Consider the relation supply of Figure 1. A permuted projection of this relation is exhibited in Figure 4. Note that, in this particular case, the projection has fewer n-tuples than the relation from which it is derived.
⁷ When dealing with relationships, we use domain names (role-qualified whenever necessary) instead of domain positions.
2.1.3. Join. Suppose we are given two binary relations, which have some domain in common. Under what circumstances can we combine these relations to form a ternary relation which preserves all of the information in the given relations?
The example in Figure 5 shows two relations R, S, which are joinable without loss of information, while Figure 6 shows a join of R with S. A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that $\pi_{12}(U) = R$ and $\pi_{23}(U) = S$. Any such ternary relation is called a join of R with S. If R, S are binary relations such that $\pi_2(R) = \pi_1(S)$, then R is joinable with S. One join that always exists in such a case is the natural join of R with S defined by
  • 98. $R * S = \{(a, b, c) : R(a, b) \wedge S(b, c)\}$
where R (a, b) has the value true if (a, b) is a member of R and similarly for S(b, c). It is immediate that
$\pi_{12}(R * S) = R$ and $\pi_{23}(R * S) = S$.
Note that the join shown in Figure 6 is the natural join of R with S from Figure 5. Another join is shown in Figure 7.
$\pi_{31}$(supply) (project  supplier)
                       5        1
                       5        2
                       1        4
                       7        2
FIG. 4. A permuted projection of the relation in Figure 1
  • 99. R (supplier  part)     S (part  project)
              1       1             1       1
              2       1             1       2
              2       2             2       1
FIG. 5. Two joinable relations
R*S (supplier  part  project)
         1       1       1
         1       1       2
         2       1       1
         2       1       2
         2       2       1
FIG. 6. The natural join of R with S (from Figure 5)
U (supplier  part  project)
       1       1       2
       2       1       1
       2       2       1
FIG. 7. Another join of R with S (from Figure 5)
Inspection of these relations reveals an element (element
  • 100. 1) of the domain part (the domain on which the join is to be made) with the property that it possesses more than one relative under R and also under S. It is this element which gives rise to the plurality of joins. Such an element in the joining domain is called a point of ambiguity with respect to the joining of R with S.
If either $\pi_{21}(R)$ or S is a function,⁸ no point of ambiguity can occur in joining R with S. In such a case, the natural join of R with S is the only join of R with S. Note that the reiterated qualification “of R with S” is necessary, because S might be joinable with R (as well as R with S), and this join would be an entirely separate consideration. In Figure 5, none of the relations R, $\pi_{21}(R)$, S, $\pi_{21}(S)$ is a function.
Ambiguity in the joining of R with S can sometimes be resolved by means of other relations. Suppose we are given, or can derive from sources independent of R and S, a relation T on the domains project and supplier with the following
  • 101. properties:
(1) $\pi_1(T) = \pi_2(S)$,
(2) $\pi_2(T) = \pi_1(R)$,
(3) $T(j, s) \rightarrow \exists p\,(R(s, p) \wedge S(p, j))$,
(4) $R(s, p) \rightarrow \exists j\,(S(p, j) \wedge T(j, s))$,
(5) $S(p, j) \rightarrow \exists s\,(T(j, s) \wedge R(s, p))$,
then we may form a three-way join of R, S, T; that is, a ternary relation U such that
$\pi_{12}(U) = R$, $\pi_{23}(U) = S$, $\pi_{31}(U) = T$.
Such a join will be called a cyclic 3-join to distinguish it from a linear 3-join which would be a quaternary relation V such that
$\pi_{12}(V) = R$, $\pi_{23}(V) = S$, $\pi_{34}(V) = T$.
While it is possible for more than one cyclic 3-join to exist (see Figures 8, 9, for an example), the circumstances under
  • 102. which this can occur entail much more severe constraints than those for a plurality of 2-joins.
R (s  p)     S (p  j)     T (j  s)
   1  a         a  d         d  1
   2  a         a  e         d  2
   2  b         b  d         e  2
                b  e
FIG. 8. Binary relations with a plurality of cyclic 3-joins
U (s  p  j)     U′ (s  p  j)
   1  a  d          1  a  d
   2  a  e          2  a  d
   2  b  d          2  a  e
   2  b  e          2  b  d
                    2  b  e
FIG. 9. Two cyclic 3-joins of the relations in Figure 8
To be specific, the
  • 103. relations R, S, T must possess points of ambiguity with respect to joining R with S (say point x), S with T (say y), and T with R (say z), and, furthermore, y must be a relative of x under S, z a relative of y under T, and x a relative of z under R. Note that in Figure 8 the points x = a; y = d; z = 2 have this property.
⁸ A function is a binary relation, which is one-one or many-one, but not one-many.
The natural linear 3-join of three binary relations R, S, T is given by
$R * S * T = \{(a, b, c, d) : R(a, b) \wedge S(b, c) \wedge T(c, d)\}$
where parentheses are not needed on the left-hand side because the natural 2-join (*) is associative. To obtain the cyclic counterpart, we introduce the operator $\gamma$ which produces a relation of degree n − 1 from a relation of degree n by tying its ends together. Thus, if R is an n-ary relation ($n \ge 2$), the tie of R is defined by the equation
$\gamma(R) = \{(a_1, a_2, \ldots, a_{n-1}) : R(a_1, a_2, \ldots, a_{n-1}, a_n) \wedge a_1 = a_n\}$.
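The natural linear 3-join and the tie operator can be sketched in a few lines, assuming relations are Python sets of tuples. The function names are illustrative, and letting `natural_join` match the last column of its left operand against the first column of its right operand is an assumption made here so that the 2-join can be chained. The data are the relations R, S, T of Figure 8.

```python
def natural_join(r, s):
    """Join on the last column of r and the first column of s,
    keeping the shared column once."""
    return {a + b[1:] for a in r for b in s if a[-1] == b[0]}


def tie(rel):
    """The tie: keep tuples whose first and last elements are equal,
    then drop the last element (degree n becomes n - 1)."""
    return {t[:-1] for t in rel if t[0] == t[-1]}


# Relations of Figure 8.
R = {(1, "a"), (2, "a"), (2, "b")}                       # (s, p)
S = {("a", "d"), ("a", "e"), ("b", "d"), ("b", "e")}   # (p, j)
T = {("d", 1), ("d", 2), ("e", 2)}                       # (j, s)

# Natural linear 3-join; either grouping gives the same result.
linear = natural_join(natural_join(R, S), T)
linear_alt = natural_join(R, natural_join(S, T))

# Tying the ends of the linear 3-join together.
tied = tie(linear)
```

For these inputs `linear` has nine 4-tuples and `tied` has five triples, each of which satisfies R on columns 1 and 2, S on columns 2 and 3, and T on columns 3 and 1.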
  • 104. We may now represent the natural cyclic 3-join of R, S, T by the expression $\gamma(R * S * T)$.
Extension of the notions of linear and cyclic 3-join and their natural counterparts to the joining of n binary relations (where $n \ge 3$) is obvious. A few words may be appropriate, however, regarding the joining of relations which are not necessarily binary. Consider the case of two relations R (degree r), S (degree s) which are to be joined on p of their domains (p < r, p < s). For simplicity, suppose these p domains are the last p of the r domains of R, and the first p of the s domains of S. If this were not so, we could always apply appropriate permutations to make it so. Now, take the Cartesian product of the first r − p domains of R, and call this new domain A. Take the Cartesian product of the last p domains of R, and call this B. Take the Cartesian product of the last s − p domains of S and call this C.
We can treat R as if it were a binary relation on the domains A, B. Similarly, we can treat S as if it were a binary relation on the domains B, C. The notions of linear
  • 105. and cyclic 3-join are now directly applicable. A similar approach can be taken with the linear and cyclic n-joins of n relations of assorted degrees.
2.1.4. Composition. The reader is probably familiar with the notion of composition applied to functions. We shall discuss a generalization of that concept and apply it first to binary relations. Our definitions of composition and composability are based very directly on the definitions of join and joinability given above.
Suppose we are given two relations R, S. T is a composition of R with S if there exists a join U of R with S such that $T = \pi_{13}(U)$. Thus, two relations are composable if and only if they are joinable. However, the existence of more than one join of R with S does not imply the existence of more than one composition of R with S.
Corresponding to the natural join of R with S is the
  • 106. natural composition⁹ of R with S defined by
$R \cdot S = \pi_{13}(R * S)$.
Taking the relations R, S from Figure 5, their natural composition is exhibited in Figure 10 and another composition is exhibited in Figure 11 (derived from the join exhibited in Figure 7).
R·S (project  supplier)
        1        1
        1        2
        2        1
        2        2
FIG. 10. The natural composition of R with S (from Figure 5)
T (project  supplier)
      1        2
      2        1
FIG. 11. Another composition of R with S (from Figure 5)
When two or more joins exist, the number of distinct compositions may be as few as one or as many as the number
  • 107. of distinct joins. Figure 12 shows an example of two relations which have several joins but only one composition. Note that the ambiguity of point c is lost in composing R with S, because of unambiguous associations made via the points a, b, d, e.
R (supplier  part)     S (part  project)
       1       a             a       g
       1       b             b       f
       1       c             c       f
       2       c             c       g
       2       d             d       g
       2       e             e       f
FIG. 12. Many joins, only one composition
Extension of composition to pairs of relations which are not necessarily binary (and which may be of different degrees) follows the same pattern as extension of pairwise joining to such relations.
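The relationship between joins and compositions can be checked with a small sketch, assuming relations are Python sets of tuples; the helper names are illustrative. R and S are the relations of Figure 5 and U is the join of Figure 7. Pairs in the results are written in (supplier, project) order, the order produced by projecting columns 1 and 3.

```python
def project(rel, cols):
    """Projection on 1-based column positions, duplicates removed."""
    return {tuple(t[c - 1] for c in cols) for t in rel}


def natural_join(r, s):
    """Natural join of two binary relations on r's second / s's first column."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def is_join(u, r, s):
    """u is a join of r with s if it projects back onto both of them."""
    return project(u, (1, 2)) == r and project(u, (2, 3)) == s


R = {(1, 1), (2, 1), (2, 2)}                 # (supplier, part), Figure 5
S = {(1, 1), (1, 2), (2, 1)}                 # (part, project), Figure 5
U = {(1, 1, 2), (2, 1, 1), (2, 2, 1)}        # the join of Figure 7

natural_composition = project(natural_join(R, S), (1, 3))
other_composition = project(U, (1, 3))
```

Both `natural_join(R, S)` and `U` pass `is_join`, yet they project to different compositions: the natural composition pairs every supplier with every project, while the composition derived from U contains only (1, 2) and (2, 1).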
  • 108. A lack of understanding of relational composition has led several systems designers into what may be called the connection trap. This trap may be described in terms of the following example. Suppose each supplier description is linked by pointers to the descriptions of each part supplied by that supplier, and each part description is similarly linked to the descriptions of each project which uses that part. A conclusion is now drawn which is, in general, er- roneous: namely that, if all possible paths are followed from a given supplier via the parts he supplies to the projects using those parts, one will obtain a valid set of all projects supplied by that supplier. Such a conclusion is correct only in the very special case that the target relation be- tween projects and suppliers is, in fact, the natural com- position of the other two relations-and we must normally add the phrase “for all time,” because this is usually im- plied in claims concerning path-following techniques. 0 Other writers tend to ignore compositions other than the na- tural one, and accordingly refer to this particular composition as the composition-see, for example, Kelley’s “General Topology.” 2.1.5. Restriction. A subset of a relation is a relation. One way in which a relation S may act on a relation R to generate a subset of R is through the operation restriction
  • 109. of R by S. This operation is a generalization of the restriction of a function to a subset of its domain, and is defined as follows. Let L, M be equal-length lists of indices such that $L = i_1, i_2, \ldots, i_k$, $M = j_1, j_2, \ldots, j_k$, where $k \le$ degree of R and $k \le$ degree of S. Then the L, M restriction of R by S, denoted $R_{L}|_{M}S$, is the maximal subset R' of R such that $\pi_L(R') = \pi_M(S)$. The operation is defined only if equality is applicable between elements of $\pi_{i_h}(R)$ on the one hand and $\pi_{j_h}(S)$ on the other for all $h = 1, 2, \ldots, k$. The three relations R, S, R' of Figure 13 satisfy the equation $R' = R_{(2,3)}|_{(1,2)}S$.
    R (s  p  j)    S (p  j)    R' (s  p  j)
       1  a  A        a  A         1  a  A
       2  a  A        c  B         2  a  A
       2  a  B        b  B         2  b  B
       2  b  A
       2  b  B
  • 110. FIG. 13. Example of restriction
    We are now in a position to consider various applications of these operations on relations.
    2.2. REDUNDANCY
    Redundancy in the named set of relations must be distinguished from redundancy in the stored set of representations. We are primarily concerned here with the former. To begin with, we need a precise notion of derivability for relations. Suppose $\theta$ is a collection of operations on relations and each operation has the property that from its operands it yields a unique relation (thus natural join is eligible, but join is not). A relation R is $\theta$-derivable from a set S of relations if there exists a sequence of operations from the collection $\theta$ which, for all time, yields R from members of S. The phrase “for all time” is present, because we are dealing with time-varying relations, and our interest is in derivability which holds over a significant period of time. For the named set of relationships in noninferential systems, it appears that an adequate collection $\theta_1$ contains the following operations: projection, natural join, tie, and restriction.
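The restriction operation of Section 2.1.5 can be sketched in Python as follows. This is only an illustration under one interpretive assumption: the definition is read as keeping every tuple of R whose projection on L occurs in $\pi_M(S)$. The helper names `project` and `restrict` are hypothetical. The data are the relations R and S as read from Figure 13.

```python
def project(rel, idx):
    """Projection of a relation on a list of 1-based column indices."""
    return {tuple(row[i - 1] for i in idx) for row in rel}


def restrict(r, l_idx, s, m_idx):
    """L, M restriction of R by S (assumed reading): keep the tuples of R
    whose projection on L occurs in the projection of S on M."""
    allowed = project(s, m_idx)
    return {row for row in r if tuple(row[i - 1] for i in l_idx) in allowed}


# R (s, p, j) and S (p, j) as read from Figure 13.
R = {(1, "a", "A"), (2, "a", "A"), (2, "a", "B"), (2, "b", "A"), (2, "b", "B")}
S = {("a", "A"), ("c", "B"), ("b", "B")}

# Restrict R on columns (2, 3) by S on columns (1, 2).
R_prime = restrict(R, [2, 3], S, [1, 2])
print(sorted(R_prime))
```

Under this reading the result contains the three tuples listed for R' in Figure 13. Note that the tuple (c, B) of S matches nothing in R, so it simply has no effect on the result.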
  • 111. Permutation is irrelevant and natural composition need not be included, because it is obtainable by taking a natural join and then a projection. For the stored set of representations, an adequate collection $\theta_2$ of operations would include permutation and additional operations concerned with subsetting and merging relations, and ordering and connecting their elements.
    2.2.1. Strong Redundancy. A set of relations is strongly redundant if it contains at least one relation that possesses a projection which is derivable from other projections of relations in the set. The following two examples are intended to explain why strong redundancy is defined this way, and to demonstrate its practical use. In the first example the collection of relations consists of just the following relation: employee (serial#, name, manager#, managername)
  • 112. with serial# as the primary key and manager# as a foreign key. Let us denote the active domain by $\Delta_t$, and suppose that $\Delta_t(\text{manager\#}) \subset \Delta_t(\text{serial\#})$ and $\Delta_t(\text{managername}) \subset \Delta_t(\text{name})$ for all time t. In this case the redundancy is obvious: the domain managername is unnecessary. To see that it is a strong redundancy as defined above, we observe that $\pi_{34}(\text{employee}) = \pi_{12}(\text{employee})_{1}|_{1}\pi_{3}(\text{employee})$. In the second example the collection of relations includes a relation S describing suppliers with primary key s#, a relation D describing departments with primary key d#, a relation J describing projects with primary key j#, and the following relations: P(s#, d#, ...), Q(s#, j#, ...), R(d#, j#, ...), where in each case ... denotes domains other than s#, d#, j#. Let us suppose the following condition C is known to
  • 113. hold independent of time: supplier s supplies department d (relation P) if and only if supplier s supplies some project j (relation Q) to which d is assigned (relation R). Then, we can write the equation $\pi_{12}(P) = \pi_{12}(Q) \cdot \pi_{21}(R)$ and thereby exhibit a strong redundancy. An important reason for the existence of strong redundancies in the named set of relationships is user convenience. A particular case of this is the retention of semi-obsolete relationships in the named set so that old programs that refer to them by name can continue to run correctly. Knowledge of the existence of strong redundancies in the named set gives a system or data base administrator greater freedom in the selection of stored representations to cope more efficiently with current traffic. If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.
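A minimal sketch of how the equation for the second example might be checked is given below. It assumes that only the first two columns of P, Q, R matter, represents each projection as a set of pairs, and uses hypothetical identifiers and sample data throughout.

```python
def converse(rel):
    """Swap the two columns of a binary relation (pi_21)."""
    return {(b, a) for (a, b) in rel}


def natural_composition(r, s):
    """All (a, c) such that (a, b) is in R and (b, c) is in S for some b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def satisfies_condition_c(p12, q12, r12):
    """Check pi_12(P) = pi_12(Q) . pi_21(R) for the current values."""
    return p12 == natural_composition(q12, converse(r12))


# Hypothetical instantaneous values.
Q12 = {(1, "j1"), (2, "j2")}                      # supplier s supplies project j
R12 = {("d1", "j1"), ("d2", "j1"), ("d3", "j2")}  # department d is assigned to project j
P12 = {(1, "d1"), (1, "d2"), (2, "d3")}           # supplier s supplies department d

print(satisfies_condition_c(P12, Q12, R12))                # True
print(satisfies_condition_c(P12 | {(2, "d1")}, Q12, R12))  # False
```

Because the right-hand side can be recomputed from Q and R whenever condition C holds, a stored copy of $\pi_{12}(P)$ trades storage space and update time for query time, as the paragraph above notes.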
  • 114. 2.2.2. Weak Redundancy. A second type of redundancy may exist. In contrast to strong redundancy it is not characterized by an equation. A collection of relations is weakly redundant if it contains a relation that has a projection which is not derivable from other members but is at all times a projection of some join of other projections of relations in the collection. We can exhibit a weak redundancy by taking the second example (cited above) for a strong redundancy, and assuming now that condition C does not hold at all times. The relations $\pi_{12}(P)$, $\pi_{12}(Q)$, $\pi_{12}(R)$ are complex¹⁰ relations with the possibility of points of ambiguity occurring from time to time in the potential joining of any two. Under these circumstances, none of them is derivable from the other two. However, constraints do exist between them, since each is a projection of some cyclic join of the three of them. One of the weak redundancies can be characterized by the statement: for all time, $\pi_{12}(P)$ is some composition of $\pi_{12}(Q)$ with $\pi_{21}(R)$. The composition in question might be the natural one at some instant and a nonnatural one at another instant.
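The statement "T is some composition of R with S" can be tested mechanically. The sketch below assumes the earlier definition of a join of R with S as any ternary relation U whose projections on columns (1, 2) and (2, 3) are R and S respectively, so that T is a composition when $T = \pi_{13}(U)$ for some such U. The function name and data are hypothetical. The test relies on an observation that is not stated in the paper: if any suitable U exists, then the largest candidate (every triple of the natural join whose end points lie in T) is also suitable, so only that one candidate needs checking.

```python
def is_some_composition(t, r, s):
    """Return True if T = pi_13(U) for some join U of R with S."""
    # Every triple of a suitable join must pair a tuple of R with a tuple
    # of S on the shared middle value and must project into T.
    candidates = {
        (a, b, c)
        for (a, b) in r
        for (b2, c) in s
        if b == b2 and (a, c) in t
    }
    # The largest candidate set works whenever any join works, so test it.
    return (
        {(a, b) for (a, b, _) in candidates} == r
        and {(b, c) for (_, b, c) in candidates} == s
        and {(a, c) for (a, _, c) in candidates} == t
    )


# Hypothetical data.
R = {(1, "a"), (2, "a"), (2, "b")}
S = {("a", "x"), ("a", "y"), ("b", "x")}

natural = {(1, "x"), (1, "y"), (2, "x"), (2, "y")}
nonnatural = {(1, "y"), (2, "x")}
too_small = {(1, "x")}

print(is_some_composition(natural, R, S))     # True
print(is_some_composition(nonnatural, R, S))  # True
print(is_some_composition(too_small, R, S))   # False
```

Both the natural and the nonnatural composition pass, which is the situation described for a weak redundancy: the constraint holds at all times even though no single equation picks out which composition is current.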
  • 115. Generally speaking, weak redundancies are inherent in the logical needs of the community of users. They are not removable by the system or data base administrator. If they appear at all, they appear in both the named set and the stored set of representations.
    2.3. CONSISTENCY
    Whenever the named set of relations is redundant in either sense, we shall associate with that set a collection of statements which define all of the redundancies which hold independent of time between the member relations. If the information system lacks (and it most probably will) detailed semantic information about each named relation, it cannot deduce the redundancies applicable to the named set. It might, over a period of time, make attempts to induce the redundancies, but such attempts would be fallible. Given a collection C of time-varying relations, an associated set Z of constraint statements and an instantaneous value V for C, we shall call the state (C, Z, V) consistent or inconsistent according as V does or does not satisfy Z. For example, given stored relations R, S, T together with the constraint statement “$\pi_{12}(T)$ is a composition of
  • 116. $\pi_{12}(R)$ with $\pi_{12}(S)$”, we may check from time to time that the values stored for R, S, T satisfy this constraint. An algorithm for making this check would examine the first two columns of each of R, S, T (in whatever way they are represented in the system) and determine whether (1) $\pi_1(T) = \pi_1(R)$, (2) $\pi_2(T) = \pi_2(S)$, (3) for every element pair (a, c) in the relation $\pi_{12}(T)$ there is an element b such that (a, b) is in $\pi_{12}(R)$ and (b, c) is in $\pi_{12}(S)$. There are practical problems (which we shall not discuss here) in taking an instantaneous snapshot of a collection of relations, some of which may be very large and highly variable. It is important to note that consistency as defined above is a property of the instantaneous state of a data bank, and is independent of how that state came about. Thus, in particular, there is no distinction made on the basis of whether a user generated an inconsistency due to an act of omission or an act of commission. Examination of a simple
  • 117. example will show the reasonableness of this (possibly unconventional) approach to consistency.
    ¹⁰ A binary relation is complex if neither it nor its converse is a function.
    Suppose the named set C includes the relations S, J, D, P, Q, R of the example in Section 2.2 and that P, Q, R possess either the strong or weak redundancies described therein (in the particular case now under consideration, it does not matter which kind of redundancy occurs). Further, suppose that at some time t the data bank state is consistent and contains no project j such that supplier 2 supplies project j and j is assigned to department 5. Accordingly, there is no element (2, 5) in $\pi_{12}(P)$. Now, a user introduces the element (2, 5) into $\pi_{12}(P)$ by inserting some appropriate element into P. The data bank state is now inconsistent. The inconsistency could have arisen from an act of omis-