A Comparative Analysis of Relational Databases and
Graph Databases
Thesis submitted in partial fulfilment of the requirements for the award of degree of:
Masters of Science
in
Software Engineering & Database Technologies
Department of Information Technology
Head of Department:
Dr. James Duggan, B.E., M.Eng.Sc., Ph.D.
Submitted by:
Darroch Greally
Thesis Advisor:
Dr. Robert Mason, Ph.D.
August 2016
Certificate of Authorship
I hereby certify that I am the author of this document and that any assistance I received in its
preparation is fully acknowledged and disclosed in the document. I have also cited all sources
from which I obtained data, ideas or words that are copied directly or paraphrased in the
document. Sources are properly credited according to accepted standards for professional
publications. I also certify that this paper was prepared by me for the purpose of partial
fulfilment of requirements for the Degree Programme.
Signed: __________________________ Date: 20 – August – 2015
Acknowledgements
I would first like to thank my thesis advisor Dr. Robert Mason of the College of Computer &
Information Sciences at Regis University. His advice and expertise were extremely valuable to
me when carrying out my research.
I must also express my very profound gratitude to my parents for providing me with unfailing
support and continuous encouragement throughout my years of study and through the process
of researching and writing this thesis. This accomplishment would not have been possible
without them. Thank you.
Abstract
Data storage technologies are not evolving at the required rate to deal with the significant
increase in the amount and type of data collected today by software across almost every
industry. The data flowing through businesses is becoming more and more complex due to an increase in velocity, volume and variety, and its storage and management are becoming increasingly difficult, mainly due to the legacy technologies employed to do so.
Newer technologies are now starting to emerge in an attempt to overcome the disadvantages of
previous database management systems. Two popular database types, the relational database management system and the graph database, are tested. The aim of this thesis was to examine
and compare two databases from these two database models and answer the question of which
one performs better when the data contains many relationships that require traversing multiple
relationships to resolve queries.
From the comparison of the results it was found that an increase in the number of joins/traversals required led to a noticeable decrease in the performance of Oracle 12c. Due to the model
which Neo4j implements, its performance levels remained consistent when retrieving the same
information.
Table of Contents
Certificate of Authorship
Acknowledgements
Abstract
List of Figures
1 Introduction
2 Literature Review
3 The Relational Database
3.1 Relational Data Model
3.1.1 The Relation
3.2 Relational Algebra
3.3 Data Structure
3.4 Integrity Constraints
3.5 Operations
3.5.1 Selection Operation
3.5.2 Projection Operation
3.5.3 Cartesian Product
3.5.4 Union
3.5.5 Set Difference
4 Graph Databases
4.1 Introduction
4.2 Data Model
4.2.1 Nodes
4.2.2 Edges
4.2.3 Labels
4.3 Terminology
4.3.1 The Graph
4.3.2 Graph size
4.3.3 Degree of a node
4.3.4 Path and path length
4.4 Key Features of the Graph Database
4.4.1 Performance
4.4.2 Flexibility
4.5 Real Life Applications of the Graph Database
4.5.1 Google's page rank
4.5.2 Master data management
4.5.3 Social networking
4.5.4 Telecommunication
4.5.5 Security and access management
4.5.6 Bioinformatics
4.6 Neo4j
5 Comparative Analysis
5.1 Data Modelling
5.1.1 Relational Modelling
5.1.2 Graph Data Modelling
5.2 Querying The Database
5.2.1 Querying the Relational Database
5.2.2 Querying the Graph Database
6 Conclusion
7 References
8 Appendix
8.1 Appendix A
9 Appendix B
9.1 Neo4j Scripts
List of Figures
Figure 1: Relation Instance
Figure 2: Selection Operation
Figure 3: Selection Operation
Figure 4: Reviews Table
Figure 5: Projection Operation
Figure 6: Keanu Reeves Movies
Figure 7: Cartesian Product
Figure 8: Directors and actors in Unforgiven
Figure 9: Set Difference Operation
Figure 10: The property graph
Figure 11: UML schema for graph data model
Figure 12: A Simple Graph
Figure 13: The Logical Model
Figure 14: The Relational Model
Figure 15: The Graph Data Model
Figure 16: A look inside the Graph Data Model
Figure 17: SQL Query
Figure 18: SQL Query 'Join Map'
Figure 19: Oracle 12c Query Times
Figure 20: Relational Database Index Lookups
Figure 21: Cypher Query
Figure 22: Graph Data Index Lookup
Figure 23: Neo4j Data Output
Figure 24: Neo4j Graphical Output
Figure 25: Neo4j Query Times
Figure 26: Neo4j vs Oracle 12c
1 Introduction
NoSQL is all the rage and as a consequence we are currently witnessing a rise in popularity of
the graph database. There is an ongoing debate over whether relational database systems are outdated technology, and people are looking towards newer technologies that will serve as their
replacement. The main talking point of the debate surrounds the issue of complex data and this
thesis aims to establish if graph databases are a better alternative to relational databases in terms
of performance, when the data contains many relationships that require traversing multiple
relationships to resolve queries.
Relational databases have been around as a standard for more than thirty years. They are
utilised industry wide and have been a major success to date. Relational databases have been
the number one data storage choice for decades, providing businesses with a flexible, standard
interface to store their data. However, storage needs are changing as data becomes bigger and more complex, and relational databases apply much of the same overhead necessary for complex update operations to every activity, which can lead to inferior performance and an altogether more limited data store than their graphical counterpart. Modern day data is very
complex, large in volume and highly connected and it cannot be efficiently managed by
relational database technologies because of the number of data relationships involved. The
graph database can provide a solution to this issue, and in turn a successful rival to the
seemingly irreplaceable relational database.
This thesis aims to quantitatively measure the performance of both database types, that is, by benchmarking them. The benchmark is designed to be general enough to model all of the capabilities of both types, and the information queried is to be of a realistic and unbiased nature. To achieve this, a dataset containing a number of movies along with their actors, directors, producers and writers was created. The code for both Neo4j and Oracle 12c is contained in the appendices of this document.
The remainder of this thesis is organised as follows: Chapter 2 provides a review of all relevant
literature. Chapter 3 and Chapter 4 provide introductions and the fundamental design
considerations of relational databases and graph databases respectively. The performance of
each database type is tested and analysed in Chapter 5. Chapter 6 provides a conclusion for this
work.
2 Literature Review
Since the evolution of database management systems, there has been a continuing argument
about what database model should be used for a particular purpose. The development and
assortment of existing database models shows that there are numerous circumstances that affect
their development. Angles & Gutierrez (2008) believed some of the more important factors to be the structure of the domain to be modelled, the types of theoretical tools desirable to the intended end user, and the hardware and software constraints.
Long before the invention of what is now the modern computer system, the storage of information posed many challenges. During these times, information retrieval and indexing were made more efficient by systems such as the Dewey Decimal System, a proprietary system developed by Melvil Dewey in 1876 (Dewey, Mitchell, Beall, Matthews, & New, 1996). This somewhat eased the difficulties associated with information storage; however, storing data still required a significant amount of physical volume, and human intelligence and understanding were needed to process complicated relations
in the data.
The first commercial database management system (DBMS) was developed in 1964 by Charles Bachman, while he worked at Honeywell (Bachman, 1972). Based on an early data model, the Integrated Data Store (IDS) stored a single set of shared files on disk and automated several data processing tasks by providing effective data manipulation commands to application programmers (Bressan & Catania, 2005). Some advances were made on this model throughout the 1960s, however many problems remained, primarily the lack of good data abstraction. The data structures associated with Bachman's model and its successors were unsuitable for modelling non-traditional applications (Kahate, 2004). A first solution to these problems led to the conceptual basis and initial definition of a relational model, which was proposed by E.F. Codd in 1970. Codd introduced a system of structuring data using relations, organised in a
mathematical structure consisting of columns and rows (Codd, 1970). Due to its applicability,
it was widely accepted among business applications in comparison to previous models
(Peckham & Maryanski, 1988). This was a revolutionary concept, which would prove to
become an extremely significant development in the design of relational databases. The
relational model's purity and mathematical foundation are the primary reasons why it has held its position as the dominant DBMS over the past thirty years (Silberschatz, Korth &
Sudarshan, 2011).
An initial serious interest in the area of graph databases arrived in the early nineties, before
being forgotten due to the emergence of XML (Angles & Gutierrez, 2008). Before this, however, a number of pioneering papers focussing on the idea of the graph database had been published.
Roussopoulos & Mylopoulos (1975) presented a semantic network view of the database. In
1981 the Functional Data Model was presented to represent an implicit structure of graphs for
a data set, with the aim being the visionary representation of a logical database (Shipman,
1981). A few years later Logical Data Model (LDM), an explicit graph database model
proposed to consolidate the relational, hierarchical and network models was developed (Kuper
& Vardi, 1984). The G-Base, a graph database model intended for the representation of
complex structures was presented in 1987 (Kunii, 1987). An object orientated database model
based on a graph structure known as O2 was presented in 1988 (Lecluse, Richard, & Velez,
1988). Furthering this concept was the development of a system known as GOOD, in which data manipulation, along with the representation of the data model, was graph based (Gemis, Paredaens, Thyssens, & Van den Bussche, 1993). Further advancements to GOOD led
to the development of GMOD, which focussed on approaches for graph-orientated database
user interfaces (Andries, Gemis, Paredaens, Thyssens, & Van den Bussche, 1992); Gram, an
explicit graph database model for hypertext data (Amann & Scholl, 1992); PaMal, extending GOOD with a clear representation of tuples and sets (Gemis & Paredaens, 1993); GOAL, which introduced the idea of association nodes (Hidders & Paredaens, 1994); G-Log, a proposal for a graph-orientated declarative query language, which incorporates the expressive capabilities of logic, the modelling capabilities of complex objects with identity and the representation capabilities of graphs (Paredaens, Peelman, & Tanca, 1995); and a final proposal, GDM, a graph-based data model in which database instances and database schemas are
illustrated by instance graphs and schema graphs (Hidders, 2002).
The early nineties also saw proposals using the general concept of graphs with data modelling
functions. The Hypernode Model was a database model introduced by Levene and Poulovassils
in 1990 (Levene & Poulovassilis, 1990). Their model was based on nested graphs and provided
a basis from which later developments followed, such as for modelling multi-scaled networks
(Mainguenaud & Simatic, 1992). The concepts proposed in the Hypernode Model were also used for modelling genome data, since mapping and other genomic data can be clearly represented by graphs, and graphs can be stored in a database (Graves, Bergeman, & Lawrence, 1995).
Levene and Poulovassilis introduced the data model GROOVY (Graphically-Represented Object-Oriented data model with Values) in 1991. GROOVY provided a pure generalisation of the primary concepts of object-oriented data modelling through the use of hypergraphs (Levene & Poulovassilis, 1991).
GraphDB was a proposal made in 1994, driven by the requirement to manage data in transport networks; its approach involved modelling and querying graphs in object-orientated databases (Güting, 1994). A subsequent proposal, Database Graph View, provided an abstraction mechanism to enable graphs to be defined and manipulated over relational, object-orientated or file systems (Gutiérrez, Pucheral, Steffen, & Thévenin, 1994). Complex data associated with software engineering projects was modelled using attributed graphs by a project known as GRAS (Kiesel, Schuerr, & Westfechtel, 1995). The popular OEM model proposed unified
access to heterogeneous information sources, concentrating on the transfer of information
(Klyne & Carroll, 2006).
A further and extremely important development involves data representation models and the
World Wide Web. These include data exchange models like XML, metadata representation
models such as RDF, and ontology representation models such as OWL (Böhnlein & vom Ende,
1999) (Klyne & Carroll, 2006) (McGuinness & Van Harmelen, 2004).
It is, however, the growth of the Web and the large amount of available resources that pose the
greatest threat to accurate information retrieval (Barabási & Albert, 1999). This huge store of
unstructured data has resulted in making efficient information search and retrieval a very
tiresome task, particularly in the case of a less scalable data model. Scalability in databases is
their capability to manage a growing number of transactions and stored data at the same speed.
The significant amounts of data stored today by software in almost every area imaginable is
progressively leading to major problems due to current storage technologies not advancing at
a required rate to cater for the performance scalability needed (Agrawal, El Abbadi, Das, &
Elmore, 2011). Scalability can be achieved in two ways; vertical scalability and horizontal
scalability (Kaur & Rani, 2013). Vertical scalability means to scale up. This is achieved by
increasing the resources of a single machine. Horizontal scalability, on the other hand, is called scaling out. This can be achieved by adding commodity servers alongside the existing nodes. Vertical scalability is expensive in comparison to horizontal scalability, and the degree to which a database can scale vertically is also limited. Vertical scalability is rarely an efficient option and in some cases is not a feasible choice where the amount of data to be managed continues to
increase. To help overcome such issues a number of new systems have been designed to
provide good horizontal scalability, unlike traditional database products which have in
comparison minimal capacity to scale horizontally (Pokorny, 2011). Many of the new systems
are referred to as “NoSQL” data stores (Cattell, 2011).
The NoSQL movement, where “NoSQL” stands for “Not Only SQL”, came about as it was
believed that the traditional relational database system was not an effective solution for all
database requirements, particularly databases that were concerned with processing large
amounts of data with high scalability needs (Tauro, Patil, & Prashanth, 2013). Relational
databases are often not very well suited to particular operations essential to Big Data
management. Firstly, due to a lack of scalability, large data solutions tend to be more expensive
with relational databases. In certain circumstances, grid solutions can improve this weakness,
however the creation of new clusters on the grid is not dynamic, thus the potential and
exploration for a more efficient solution. Furthermore, relational databases don’t handle
unstructured data search as well as one would wish, and the same applies to data appearing in unexpected formats. Additionally, it is difficult to implement particular types of queries using
SQL and relational databases, such as the shortest path between two points (Cui, Mei, & Ooi,
2014).
Big Data and social networking organisations, including Facebook, Google, and Amazon, were the initial proponents of the idea that relational databases were not the most effective solutions for the volumes and types of data that they were required to handle. Such limitations led to the development of the Hadoop File System, a reliable store for very large data sets (Shvachko, Kuang, Radia, & Chansler, 2010). Also developed as a consequence were the MapReduce programming model, designed for processing and generating large data sets with a parallel, distributed algorithm on a cluster, and associated NoSQL databases such as Cassandra and
HBase (Dean & Ghemawat, 2008) (Rabl et al., 2012).
A fundamental element of the NoSQL concept is “shared nothing” horizontal scaling, a
distributed computing architecture in which each node is independent and self-sufficient, and
there is no single point of contention across the system. This enables the databases to support a
greater number of basic read/write operations per second (Topor et al., 2009).
The area of SQL vs NoSQL has been well debated; however, such debates do not specifically cover graph databases, taking instead a broader view of the topic. Tauro, Patil, & Prashanth
(2013) provided an overview, evaluation and analysis of various NoSQL Databases. The
conclusion they came to was that NoSQL databases were more effective when there was a
requirement to handle a large amount of data with high scalability, compared to the traditional
relational database systems, and that a NoSQL database is most suitable as a solution where the
database will be required to scale over time. Strauch & Kriha (2011) carried out a methodical
examination of the purposes and justifications behind the NoSQL movement. Their
conclusions supported the view presented in Tauro, Patil, & Prashanth (2013), that the need for
high scalability is a primary reason for the development of NoSQL databases.
It is such scalability that provides us with the issue of ‘complex data’. Bitner (2015) states that
while there is no particular method of determining the level of complexity associated with data,
there are two features of the data that can allow it to be called complex: if you are working with Big Data, and if your data comes from varying data sources. Russom (2011) states that
complexity in data can be broken into three separate categories, which he labels “The 3 v’s”.
The 3 v's stand for velocity, volume and variety. Naturally, volume deals with the size of the data, velocity refers to the speed at which the data must be processed, and variety deals with which category the data falls into: structured, unstructured or semi-structured.
Unstructured data is schema-less, and comes in various formats, ranging from social media
posts and sensor data to email, images and web logs. It has become much more common with the increase in popularity of online services and information systems, and the
growing need for high data volumes (Lomotey & Deters, 2013). Studies show that unstructured
data is growing at an unprecedented pace. It is estimated that more than 80% of all potentially
useful business information is unstructured data (Gharehchopogh & Khalifelu, 2011). The
world creates 2.5 quintillion bytes of data per day from unstructured data sources like sensors,
social media posts and digital photos (Bajaj, 2014). Clearly, unstructured data is growing exponentially, and the need to manage it in the most efficient manner possible is of pivotal
importance.
Relational databases are best suited to structured data, which readily fits in well-organized tables, but the opposite is the case when they are required to deal with unstructured data, according to Leavitt (2010). He goes on to state that with relational databases, users are required to
convert all data into tables and if the case is such that the data does not readily fit into a table,
the database’s structure can be complex, difficult, and slow to work with. Reeve (2012)
supports this and is also of the opinion that relational databases are unsuited to performing an efficient unstructured data search, and also questions their ability to handle data in unexpected formats well. Both also agree that it is difficult to implement some basic queries using SQL
and relational databases, such as the shortest path between two points (Reeve, 2012). The use
of SQL with data that is unstructured proves problematic because it is designed to work with
structured, relationally organized databases with fixed table information (Leavitt, 2010). On
the other hand, Kuala et al. (2013) conducted a study to prove that relational databases can manipulate unstructured data. The fact that relational databases are now able to manipulate unstructured data tells us that database companies now recognize the important role this type of data will play in current and future environments.
The graph database is highly efficient in the processing of large, interrelated datasets
(Rodriguez & Neubauer, 2010). The properties of its design allow the development of
predictive models, and the exposure of interactions and patterns (Shimpi & Chaudhari, 2012).
This highly dynamic model provides a means for exceptionally fast traversals along edges and
between vertices due to the linking together of all nodes by relations, resulting in more localised
traversals which are not required to take any unrelated data into account, thus, overcoming an
implicit problem in SQL (Rodriguez & Neubauer, 2012). Bachman (2013) even goes as far as stating that all but the simplest of graph queries would result in the use of a join operator in a relational database, the performance of which worsens exponentially with the increase in the size of the data. Another paper points out that because the graph data model stores the relationships alongside the data itself, no further computation is required to join the data in order to extract the desired information (Mohamed Ali & Padma, 2016).
A comparative investigation of a NoSQL database and a relational database, with respect to
their performance and scalability capabilities was carried out by Hadjigeorgiou (2013). The study investigated the performance and scalability of MongoDB, a document-orientated
database and MySQL, a relational database. The experiments carried out involved running
various numbers and types of queries, with the complexity of the query changing throughout
experimentation to allow for analysis on how the databases scaled with increased load. It was
found that the NoSQL database could handle the more complex queries faster. The study does
show, however, that in spite of the performance advantage MongoDB possesses over MySQL in
terms of how it deals with complex queries, when the benchmark modelled the MySQL query
in a similar way to the MongoDB complex query by utilising nested SELECTs, it was
discovered that MySQL performed better, although at higher numbers of connections the two
behaved similarly.
Oracle recently released a white paper, Unstructured Data Management with Oracle Database
12c, which emphasises the value they place on keeping up with competitors in a highly competitive market. The white paper states that Oracle Database 12c has focused on dramatic performance improvements for unstructured data query and analysis. It claims that Oracle
Database 12c allows you to store and query unstructured data efficiently, with highly efficient
compression and, in many instances, query languages, semantics, and other mechanisms
designed for specific data types. Oracle Database 12c supports specialized data types for many
common forms of unstructured data. This enables application developers, development tools
and database utilities to interact with unstructured data with the same ease as with standard
relational data.
While this literature review has provided a plethora of information on the rise of NoSQL, its
efficiency and in some cases its significant advantages over relational databases when it comes
to managing complex data, no such literature is available that refers specifically to the graph
database. Due to the rising popularity of the graph database, it is deemed necessary to have some sort of factual analysis and critical comparison to test the claims of graph database vendors, specifically Neo4j. This paper aims to achieve this, along with proving or disproving Oracle's claims of keeping up with competitors in the storage of complex data.
3 The Relational Database
3.1 Relational Data Model
The relational model was first proposed in 1970 by E.F. Codd as a new model for database
systems. With its mathematical foundation, the relational model paved the way for modern
database systems. It had a significant impact on everything from the theory, to the actual
development and implementations of database systems. Information is retrieved from a
database with the use of what is known as a query language. Query languages are typically higher level than general-purpose programming languages. In general, query languages fall into
two categories; procedural and non-procedural, with their method of obtaining results the
defining factor. With a procedural language, any query constructed by a user with the intention
of retrieving information from the database will need to be a definitive chain of tasks on the
database. With a non-procedural language, the user must only provide a description of data
needed. It is unnecessary to provide the system with a particular procedure. Relational algebra
is representative of a procedural language, while relational calculus is an example of a non-procedural language (Silberschatz, Korth, & Sudarshan, 2011). Modern relational database
systems however, tend to integrate aspects of both the procedural and non-procedural
methodologies.
3.1.1 The Relation
In order to define a relation, a domain and an attribute must be initially defined. Let
𝐷1, 𝐷2, . . . , 𝐷𝑛 (𝑛 ≥ 1) be 𝑛 domains, with a domain being defined as a set of values of similar type. For each column in the dataset, a domain and a unique identifier are defined, and this pair becomes known as an attribute.
The Cartesian product is the set of all 𝑛-tuples (𝑡1, 𝑡2, . . . , 𝑡𝑛) such that 𝑡𝑖 ∈ 𝐷𝑖 for all 𝑖. If a
relation is a subset of this Cartesian product, it can be defined on these 𝑛 domains and is
described as degree 𝑛 (Codd, 1979). This theory forms the basis of a relational model in that
the relation is a set of rows, where each row is required to have the same set of attributes. In
the relational model, a series of rows come together to form a table.
3.2 Relational Algebra
The basis of the relational model comes from the mathematical definition of a relation. Ullman
(1995) describes a relation as “a subset of a Cartesian product of a list of domains”. Using this
logic, in the RDBMS, the database table becomes the mathematical relation. The relational
model’s mathematical foundation provides a collection of algebraic operations that can manage
tables in an RDBMS. Relational algebra is a procedural query language that has five central operations: projection, selection, Cartesian product, union and set difference (Silberschatz,
Korth, & Sudarshan, 2011). A number of other operations such as inner join and natural join
can also be defined. All of the aforementioned operators serve to define or manipulate operands
that are relations, with a new relation occurring as a result of any operations carried out. More
complex queries tend to be carried out by a sequence of these operators. When defining his
relational model, E.F. Codd highlighted that the relational algebraic operators were a
significant part of the definition, and could be considered as influential as the model itself
(Codd, 1979). For operations in relational algebra, a relation is defined as a set of n-tuples. It can therefore be concluded that the operations of relational algebra are the operations of set theory, working in collaboration with further operators that take the specific structure of the relations in the dataset into account (Shenai, 1992). An extension of these relational
algebra operations also helps to define further operators in the relational model (Silberschatz,
Korth & Sudarshan, 2011) (Ullman, 1995). Codd (1970) outlined three components of his
relational model as; data structures to store data in a relational database, integrity constraints
and operations. These are explained in greater detail below.
3.3 Data Structure
The data structures used to store data are: relation (table), attributes (columns), tuples (rows), relation instance and relation schema (table header). An attribute is described by the relational model as a <Name, Domain> pair, in which the domain consists of the collection of valid values and operations for that attribute. Figure 1 below provides an example of an
attribute <name, VARCHAR2>, with the attribute name being Name and the domain consisting
of the collection of valid characters and operations of the VARCHAR2 attribute type.
The relational model describes a relation schema as a <Name, Set of Attributes> pair. For
example, Figure 1 is a table schema used to store information on a particular set of actors. It has a relation schema Actors (Actor_ID, Name, Born). For any given relation schema, a tuple is a mapping from every attribute of that schema into the domain of that attribute. A tuple is an element of a relation instance, and a relation instance is a fixed collection of tuples for a particular schema, forming a subset of the Cartesian product of the attribute domains. A relation instance for the schema in Figure 1 can be described as {relation
instance} = {tuple1, tuple2, tuple3} where tuple1 = (Keanu, Keanu Reeves, 1964); tuple2 =
(Carrie, Carrie-Ann Moss, 1967) and so on.
Figure 1: Relation Instance
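To make the structures above concrete, the following is a minimal sketch of how the Actors relation of Figure 1 might be declared in Oracle 12c; the column sizes are assumptions for illustration rather than the exact definitions used in the appendices.

CREATE TABLE actors (
    actor_id VARCHAR2(30),   -- unique identifier for the actor, e.g. 'Keanu'
    name     VARCHAR2(100),  -- full name, e.g. 'Keanu Reeves'
    born     NUMBER(4,0)     -- year of birth, e.g. 1964
);

Each row inserted into this table corresponds to one tuple of the relation instance shown in Figure 1.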
3.4 Integrity Constraints
A collection of integrity constraints comprises the second component of the relational model.
Such integrity constraints are necessary to ensure that the data stored within the RDBMS is
valid. A RDBMS implements integrity constraints, meaning it allows only legal instances to
be stored in the database. Integrity constraints are stated and implemented at particular times
in each database instance (Asirelli, De Santis & Martelli, 1985). Firstly, when an end user
describes a database schema, they stipulate the integrity constraints that are necessary to be
applied on any instance of the database. Secondly, when a database application is run, the
database management system seeks violations and forbids changes to the data that breach the
defined integrity constraints. Business rules are enforced as integrity constraints to ensure the
database only supports valid data. Once more referring to the example in Figure 1, the attribute
Born has a business rule associated with it, that requires it to be an integer with the precision
of four and the scale of zero. This constraint is implemented from the outset, when the data
type and domain are defined. As well as domain constraints, there are other integrity constraints that can be specified in the relational model, such as key constraints, referential integrity constraints, NOT NULL constraints and general assertions (Khodaei, 2008).
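As an illustration, a key constraint, a NOT NULL constraint and a business rule on Born could be declared in Oracle 12c roughly as follows; the constraint names and the four-digit year range are assumptions for this sketch.

ALTER TABLE actors ADD CONSTRAINT actors_pk PRIMARY KEY (actor_id);   -- key constraint
ALTER TABLE actors MODIFY (name NOT NULL);                            -- NOT NULL constraint
ALTER TABLE actors ADD CONSTRAINT actors_born_chk
    CHECK (born BETWEEN 1000 AND 9999);                               -- assumed four-digit year rule

Any INSERT or UPDATE that violates one of these constraints is rejected by the DBMS, which is how the requirement that only legal instances be stored is enforced.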
3.5 Operations
Relational algebra and relational calculus provide the basis for performing operations in the
relational model. Harrington (2003) explains how relational algebra provides operations that act on whole sets of rows at a time in a single operation. Relational operations perform
on multiple levels. Firstly, we consider the basic operations of the relational algebra where pure
mathematical abstractions form the foundations for rationale about a relational database
(Sumathi & Esakkirajan, 2007). Codd’s model describes operations to manage data in the
database by employing relational algebra (Silberschatz, Korth & Sudarshan, 2011), (Codd,
1970). As described above, a relational database is a collection of relations, each of which
conform to a certain relation schema. The fundamental operations define what can be done to
these relations. The principal relational algebra operations are divided into two categories: unary operations and binary operations (Sumathi & Esakkirajan, 2007). Unary operations take a single relation as an operand, whereas binary operations require two relation
operands. There are five fundamental operators that provide relational algebra with the power
to develop complex queries. Each of these operators can be thought of as functions mapping
one or more relations into another relation. The primary database operations are selection,
projection and join (Sumathi & Esakkirajan, 2007). Binary operations are join, union, set
difference and Cartesian product.
3.5.1 Selection Operation
The selection operation performs on a single relation and defines a relation that holds certain
rows of that relation, the rows that satisfy the declared condition. The syntax of the operation
is σpredicate(R), where predicate refers to a condition and R is the relation. The selection
condition may be any legally formed expression that includes; constants, attribute names,
arithmetic and/or logical operators (Roman, 2002). An example of a selection operation in the
above table would be as follows:
If we apply the select operator to the relation above as:
𝜎 𝑎𝑐𝑡𝑜𝑟_𝑖𝑑 = ‘𝐾𝑒𝑎𝑛𝑢’(𝑎𝑐𝑡𝑜𝑟)
The result is a relation listing the actor with the Actor_ID of ‘Keanu’. As Keanu acts as the primary key, the resulting tuple will appear as below:
Figure 2: Selection Operation
A further example, still working off the same tables would be if one wishes to select all actors
born after a certain year. The relational algebraic equation is as follows:
𝜎𝑏𝑜𝑟𝑛 > 1965(𝑎𝑐𝑡𝑜𝑟)
Figure 3: Selection Operation
In general, selection operations employ Boolean expressions to establish which rows of data to return.
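In SQL, the two selections above correspond to WHERE clauses; a minimal sketch against the assumed actors table from Section 3.3:

SELECT * FROM actors WHERE actor_id = 'Keanu';   -- equivalent to σ actor_id = 'Keanu' (actor)
SELECT * FROM actors WHERE born > 1965;          -- equivalent to σ born > 1965 (actor)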
3.5.2 Projection Operation
The projection operation performs on a single relation and defines a relation that holds a
vertical subset of that relation, obtaining only the attributes stated and nothing more. In
simplified terms, it is the ‘column version’ of selection, meaning projection filters by column,
rather than by row which we saw above with selection.
The syntax for projection is ∏<𝑎𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒 𝑙𝑖𝑠𝑡>(𝑅), where once more 𝑅 stands for the relation.
To show an example of a projection we will introduce another table from our dataset, which
consists of a critic, a movie and their rating of that movie.
Figure 4: Reviews Table
A projection on this table would be as follows:
𝛱 𝑚𝑜𝑣𝑖𝑒_𝑖𝑑(𝑟𝑒𝑣𝑖𝑒𝑤𝑠)
This projection will list all movies in the reviews table as follows:
Figure 5: Projection Operation
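In SQL terms, this projection amounts to listing a single column; because relational algebra treats relations as sets, DISTINCT is used here to suppress the duplicate movie identifiers that plain SQL would otherwise retain (the reviews table and movie_id column are assumed to match Figure 4):

SELECT DISTINCT movie_id FROM reviews;   -- equivalent to Π movie_id (reviews)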
3.5.3 Cartesian Product
The Cartesian product is the first binary operation, denoted by a cross (×) and it is utilised
where there is a necessity to retrieve information from more than one relation. The Cartesian
product of two relations, 𝑟1 and 𝑟2, is described in relational algebra as 𝑟1 × 𝑟2. Fully qualified attribute names are required when defining the final relation scheme; the name of the originating relation is attached to each attribute in order to differentiate between 𝑟1.𝐴 and 𝑟2.𝐴. If 𝑟1(𝐴1, . . . , 𝐴𝑛) and 𝑟2(𝐵1, . . . , 𝐵𝑚) are relations, then the Cartesian product 𝑟1 × 𝑟2 is a relation with a scheme comprising all fully qualified attributes located in 𝑟1 and 𝑟2: (𝑟1.𝐴1, . . . , 𝑟1.𝐴𝑛, 𝑟2.𝐵1, . . . , 𝑟2.𝐵𝑚). The rows of the Cartesian product are constructed by combining every possible pair of rows: one from the 𝑟1 relation and one from the 𝑟2 relation. If 𝑟1 contains 𝑛1 rows and 𝑟2 contains 𝑛2 rows, then the Cartesian product contains 𝑛1 × 𝑛2 rows.
Assume that two relations in the dataset used throughout contain only Keanu Reeves as a name,
and the seven movies he features in below:
Figure 6: Keanu Reeves Movies
Now we are going to use the Cartesian product operator on these relations to obtain our result.
𝑎𝑐𝑡𝑜𝑟 × 𝑚𝑜𝑣𝑖𝑒
The above expression produces a relation whose scheme is a concatenation of the actor scheme and the movie scheme. It should be taken into account that the Cartesian product encompasses no more information than its individual components; however, it uses significantly more memory than the two original relations consume before they are combined. This
is the primary reason why the Cartesian product is primarily for explanatory or conceptual
purposes only (Codd, 1990). In real life situations it is generally replaced by the natural join
operator.
Figure 7: Cartesian Product
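In SQL, the Cartesian product corresponds to a CROSS JOIN, or equivalently a comma-separated FROM list with no join condition; a sketch against assumed actors and movies tables:

SELECT *
FROM actors
CROSS JOIN movies;   -- every actor row paired with every movie row

In practice, as noted above, a join condition on a shared key is almost always added, which is what the natural join and inner join operators express.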
3.5.4 Union
The union operation is a binary operation on two relations, denoted by the symbol ∪. Given two relations 𝑥 and 𝑦, a union operation on these is written as 𝑥 ∪ 𝑦. The union of the two relations can only be executed if the relations have the same degree. In addition, the first attribute of 𝑥 must be compatible with the first attribute of 𝑦, the second attribute of 𝑥 must be compatible with the second attribute of 𝑦, and so on. The degree of the
resulting relation is the same as the degree of the input relations.
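In SQL, union-compatible queries are combined with the UNION operator, which, like the algebraic operation, removes duplicate rows (UNION ALL would retain them); the directors table here is an assumption:

SELECT name FROM actors
UNION
SELECT name FROM directors;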
3.5.5 Set Difference
The final fundamental operator in relational algebra is set difference. It is a binary operator,
denoted by the symbol “–”. In order for two relations to be operated on by this operator, they need to be union compatible. If one wished to retrieve values from one table that did not appear in another
table, then the expression 𝑟1 – 𝑟2 is used to achieve this. No duplicate tuples will be found in
the query result. To take an example from our movie database, we can say that some actors are
also directors, but not all. For simplicity, the example specifically refers to the movie
‘Unforgiven’.
Figure 8: Directors and actors in Unforgiven
Applying the set difference operator leaves us with the following set of actors:
Figure 9: Set Difference Operation
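Oracle expresses set difference with the MINUS operator (standard SQL uses EXCEPT); a sketch of the 'Unforgiven' example, with hypothetical unforgiven_actors and unforgiven_directors tables standing in for the two relations of Figure 8:

SELECT name FROM unforgiven_actors
MINUS
SELECT name FROM unforgiven_directors;   -- people who act in the film but do not also direct it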
4 Graph Databases
4.1 Introduction
A graph database is an online database management system with Create, Read, Update, and
Delete (CRUD) methods that expose a graph data model. Graph databases are generally built
for use with online transaction processing (OLTP) systems. Subsequently, they are typically
developed for transactional performance, and engineered with primary focus on transactional
integrity and operational availability (Robinson, Webber & Eifrem, 2015).
Relationships are the most important aspect of the graph data model, a characteristic that
distinguishes it from other database management systems. In the case of relational
databases, one is required to deduce connections between entities using mechanisms such as
foreign keys. Graph databases use the formal notation for a graph as the basis of their design.
A graph can be described as simply being a collection of vertices and edges (Rodriguez &
Neubauer, 2010). Translating this to the case of graph databases, this becomes a set of nodes
and the relationships that connect them. Graphs represent entities as nodes and the ways in
which those entities relate to the world are represented as relationships. By assembling the
simple abstractions of nodes and relationships into connected structures, graph databases
enable us to build practical and refined models that map closely to our problem domain. This
results in models which are simpler to read and understand, whilst also managing to be far more
articulate than those produced using traditional relational modelling (Robinson, Webber &
Eifrem, 2015). Graph databases have been shown to be a powerful tool for modelling data,
when the concern on the relationship between entities is a fundamental element in the design
of the data model (Shimpi & Chaudhari, 2012). Modelling objects and the relationships
between them means virtually anything can be expressed in an associative graph. Diestel
(2010) describes a graph structure as 𝐺 = (𝑉, 𝐸), where 𝑉 = {𝜗1, 𝜗2, 𝜗3, … , 𝜗𝑛} is a set of vertices and 𝐸 = {𝜀1, 𝜀2, 𝜀3, … , 𝜀𝑛} is a set of edges. An edge 𝜀𝑖 ∈ 𝐸 is defined by the triple (𝑖, 𝑗, 𝜔𝑖), where 𝑖, 𝑗 ∈ 𝑉 and 𝜔𝑖 is a positive real number. A directed edge is defined as 𝑖 → 𝑗. Graph database systems today are still a relatively new technology that is undergoing significant development. The majority of graph databases support the directed, attributed, labelled multi-graph, also known as the property graph; it is a popular choice because it is supported by the majority of systems and because it provides the capability to associate attributes with every node and relationship on top of the graph structure (Rodriguez & Neubauer, 2010). A visual representation of a property graph is depicted below in Figure 10.
Figure 10: The property graph
An advantage of the property graph is its expressiveness: all other graph types can be effectively modelled by the property graph, because they are subsets of the property graph implementation. The graph database is highly
suitable and efficient for the powerful processing of substantial, interrelated datasets
(Rodriguez & Neubauer, 2010). The properties of its design allow the development of
predictive models, and the exposure of interactions and patterns (Shimpi & Chaudhari, 2012).
This highly dynamic model provides a means for exceptionally fast traversals along edges and
between vertices due to the linking together of all nodes by relations, resulting in more localised
traversals which are not required to take any unrelated data into account, thus overcoming an
implicit problem in SQL (Rodriguez & Neubauer, 2012). The property graph is an important
prerequisite for calculating the weighted shortest path which is implemented in the core of the
graph database.
4.2 Data Model
The graph database data model is a relatively simple structure consisting of nodes, edges and
labels. For example, take a social network where the nodes will be people. Every node may
have additional information attached such as name, address and so on. These nodes are
connected by binary edges. An edge in the example of a social network could be ‘friends with’
or ‘related to’. An edge may also have additional information attached. Figure 11 below shows
the UML schema for the representation of graphs.
Figure 11: UML schema for graph data model, retrieved from Grust, T., Freytag, J., & Leser, U. (2016). Cost-based
Optimization of Graph Queries in Relational Database Management Systems (Masters). University of Berlin.
4.2.1 Nodes
The node in a graph database is similar to a row or a record in the relational database. Nodes
represent entities such as people, products, appointments, or any other item of data that requires
storage. As with a record in the relational database, it is necessary for each node to be uniquely
identifiable. In the UML above, the integer node_id could be used for this purpose.
4.2.2 Edges
A node may be connected to another node by an edge. The edge acts as a visual representation
of the relationship between the nodes which it joins. It is the patterns that materialise when
examining the connections and interconnections between nodes on a graph, that make the graph
database what it is. Edges are the concept exclusive to graph databases that distinguishes them from other data models, providing an abstraction that is not achievable in other systems.
4.2.3 Labels
Nodes are grouped into sets using labels. Any node marked with a particular label belongs to the set of nodes with which it shares that label. Labels are regularly used when querying a database, adding to the efficiency of a search. Labels are not mandatory, so a node does not necessarily belong to a set; equally, a node can belong to a number of sets, meaning it will be attributed a number of labels.
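As an illustration of nodes, relationships, labels and properties together, the following is a minimal Cypher sketch using the movie domain of this thesis; the Actor and Movie labels, the ACTED_IN relationship type and the property names are assumptions for this example rather than the exact schema used in the appendices.

CREATE (keanu:Actor {name: 'Keanu Reeves', born: 1964})      // a node labelled Actor
CREATE (matrix:Movie {title: 'The Matrix', released: 1999})  // a node labelled Movie
CREATE (keanu)-[:ACTED_IN {role: 'Neo'}]->(matrix);          // a directed, attributed edge

Here keanu and matrix are nodes, Actor and Movie are labels grouping them into sets, and ACTED_IN is an edge that carries its own role property.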
4.3 Terminology
Below is a simple undirected graph structure that will be used for the purposes of explaining
the terminology that will be used throughout this thesis. The nodes in this instance are simply A, B, C, D, E, F, G and H. Each edge is labelled with a number and, for the sake of explanation, this
number will be used as the distance between two nodes.
Figure 12: A Simple Graph, retrieved from Burstein, M. (2013). Introduction to Graph Theory: Finding The Shortest Path.
Max Burstein. Retrieved from http://www.maxburstein.com/blog/introduction-to-graph-theory-finding-shortest-path/
4.3.1 The Graph
A graph 𝐺 = (𝑉(𝐺), 𝐸(𝐺)) is a structure containing a collection of nodes 𝑉(𝐺) and a collection of edges 𝐸(𝐺), with 𝐸(𝐺) ⊆ 𝑉(𝐺) × 𝑉(𝐺). Once the conditions of the graph are
clear, it can be defined as 𝐺 = (𝑉, 𝐸). Graphs may be either directed or undirected. The graph
database focuses on a directed graph structure. A directed graph contains an ordered pair of
nodes in 𝐸. Consider (𝑢, 𝑣) ∈ 𝐸, with 𝑢, 𝑣 ∈ 𝑉 and 𝑢 ≠ 𝑣. 𝑣 in this case will be adjacent to 𝑢
and it is said that 𝑢 has the outgoing edge (𝑢, 𝑣), thus 𝑢 is the start node of (𝑢, 𝑣), with (𝑢, 𝑣)
being the incoming edge of node 𝑣, which makes 𝑣 the target node of (𝑢, 𝑣). A parent child
relationship is developed with 𝑢 becoming the parent with a child 𝑣.
4.3.2 Graph size
From before, 𝐺 = (𝑉, 𝐸), and the size of 𝐺 is determined by the number of nodes |𝑉| plus the number of edges |𝐸| that it contains. In mathematical terms, |𝐺| = |𝑉| + |𝐸|. The density of a graph is defined as the ratio of edges to nodes. Graphs are loosely split into two categories based on their density: dense or sparse. A graph is considered dense if |𝐸| is close to |𝑉|², and any graph where |𝐸| is considerably lower than |𝑉|² can be considered sparse. The size of the above graph is 16 (8 nodes plus 8 edges).
4.3.3 Degree of a node
The degree of a node 𝑣 ∈ 𝑉, written deg(𝑣), is the number of edges associated with that node. Naturally, each node in a directed graph will have two distinct values: one for the number of edges where 𝑣 acts as the target node and the other where it acts as the start node. The degree of node F in the above example is 5. These nodes and edges are labelled in a
graph database meaning it is also necessary to define a label function for the nodes and edges.
A label is made up of a type and a value.
4.3.4 Path and path length
A sequence of nodes (𝑣0, 𝑣1, 𝑣2, … , 𝑣𝑛), 𝑣𝑖 ∈ 𝑉, such that (𝑣𝑖−1, 𝑣𝑖) ∈ 𝐸 for 𝑖 = 1, 2, … , 𝑛, is described as a path, and the number of edges in the path is defined as the path length. A path is only a path if all the nodes are distinct. In the case of a directed graph where 𝑣0 = 𝑣𝑘 and 𝑘 ≥ 2, the path becomes a cycle. A directed graph that does not contain any cycles is known as a directed acyclic graph (DAG). A node in a DAG may have numerous parent nodes. If it is the case that every node has a maximum of one parent node, then due to the appearance of the graphical structure it is often referred to as a “tree”.
4.3.4.1 Dijkstra’s Algorithm
Dijkstra's algorithm is a graph search algorithm utilised in a graph database to solve the shortest
path problem for a given graph 𝐺 = (𝑉, 𝐸). The solution to this problem is achieved by initially
selecting a start node and an end node. The start note is immediately added to the list of solved
nodes and assigned a 0 value, in that from the outset it is 0 distance away from itself. We can
then traverse breadth-first from this start node to its neighbours calculating and recording the
path length against each node. This process is repeated for every node that will be on the path
between the start and end node. On finishing, the shortest path is found. Even though the
conventional Dijkstra algorithm searches the shortest paths from start node to every other node,
in Neo4j the algorithm locates the shortest path between the start node and the end node. This leads to Neo4j's efficiency, as it is only necessary to record the lengths of a small subset of the
possible paths through the graph in comparison to how many paths that are theoretically
available. When the length of a destination node is solved, this then reveals the shortest path
from the start node, from which all subsequent paths can be safely built.
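To make the idea concrete, the sketch below shows how a shortest-path search between two bound nodes can be expressed in Cypher. It is a minimal, hypothetical example: the Person label, the name property and both values are illustrative assumptions, and Cypher’s built-in shortestPath function searches by hop count, with weighted Dijkstra searches exposed separately through Neo4j’s graph algorithm procedures.

// Minimal sketch: find a shortest path (by hop count) between two bound nodes.
// The Person label, the name property and both values are hypothetical examples.
MATCH (start:Person {name: 'Alice'}),
      (finish:Person {name: 'Bob'}),
      p = shortestPath((start)-[*..15]-(finish))
RETURN p, length(p) AS hops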
4.4 Key Features of the Graph Database
4.4.1 Performance
When connected data is the primary concern, the graph database offers a significant increase
in performance in comparison to the relational database. In the case of the relational database,
difficulties often arise as the dataset increases in size, because the joins required to resolve
queries become increasingly expensive. As dataset sizes grow, it becomes almost impossible
to maintain database performance, since query times grow ever longer. Join-intensive models
are generally the result of an effort to solve a connection or path problem, but the mathematical
foundations that form the basis of relational databases are not well suited to emulating path
operations. Such difficulties do not affect graph databases, because query times scale with the
amount of data the query touches rather than with the overall size of the dataset. Queries are
localised to a section of the graph, resulting in an execution time that is proportional to the size
of the section of the graph traversed to solve that query, rather than to the entire graph.
4.4.2 Flexibility
Due to the flexible nature of a graph, it is now possible for the database to advance at the same
rate as the business which it serves. Graph databases are extremely comfortable with the
addition of new data, a feature which enables the subtle addition of new kinds of relationships,
new nodes, and new sub-graphs to the current structure without negatively affecting any
existing queries and application functionality. This is a significant advantage of the graph
database as it allows for some freedom during the modelling phase and reduces the risk that
comes with imprecise or inaccurate modelling associated with the relational database model.
The graph database allows the structure and schema to evolve in parallel with a growing
understanding of the problem space. Another advantage of the flexibility of the
graph database is fewer database migrations will be required, which will naturally reduce the
risk of data loss and reduce business expenses.
4.5 Real Life Applications of the Graph Database
4.5.1 Google’s page rank
PageRank is an algorithm used by Google Search to rank websites in its search engine results
by popularity, and Google employs a graph structure in organising the order in which search
results are displayed. Here, the directed graph ideology is applied: the web pages (nodes) are
connected together by hyperlinks (edges). The number of outgoing edges from a page
determines the weight given to each of its edges, and a page’s rank is calculated from the
weight on each incoming edge in relation to other edges (Wills, 2006).
4.5.2 Master data management
Major enterprises such as Cisco, StarHub and Polyvore are using Neo4j to reveal business
value by taking advantage of the data relationships contained in their information. Cisco’s
whole sales organisation is modelled by a complex graph structure and Neo4j provides the
capabilities for real-time queries to be carried out on it. The data relationships in StarHub’s
product and customer data are used to significantly reduce the amount of time required for the
company to assemble product bundles. Polyvore’s extensive catalogue of items is managed by
Neo4j, and the data relationships between the items help form real-time recommendations that
are made available to its users (Carlsson, 2016).
4.5.3 Social networking
Due to the fact that social networks naturally take the form of a graph, it could be considered
counterproductive and unnecessary to convert all the data and relationships into tabular format.
Gamesys, a British betting and gaming website, wished to create a social network for its users
and after exploring several database options, they decided that graph databases would be the
most natural fit for their problem domain. One of the main factors considered when reaching
their decision to opt for the graph database over a relational one was the issue of impedance
mismatch. The existing graphical nature of the data would mean that queries would also be
graph-orientated, and therefore the use of a relational model would result in substantial project
cost and performance overhead. A graph database was preferred to fulfil requirements in both
the operational and analytical environments (Nixon, 2015).
4.5.4 Telecommunication
Deutsche Telekom, CenturyLink and 3 are just some of the telecommunication companies that
have turned to graph databases, and specifically to Neo4j, to model their networks of highly
interconnected data. Telecommunications is all about connections, making graph databases a
natural fit. A telecommunication company would have a vast amount of interconnected data in
the form of plans, customers and groups and graphs are useful when analysing networks and
data centres. Graph databases are now an integral component of telecommunications
companies’ approach to managing the massive growth in that sector (Agricola, 2014).
4.5.5 Security and access management
Adobe’s new Creative Cloud, powered by Neo4j, uses a graph model to connect
authentication details in order to permit access to content for all its clients. It also makes a
new range of services available to its customers and facilitates the storage of vast amounts of
connected data across the world, while providing high query performance (Tangen, 2012).
4.5.6 Bioinformatics
There are numerous reasons as to why this growing industry is turning to the graph database.
Graphs containing a significant amount of nodes and edges are everywhere in bioinformatics.
Graph databases are a natural fit for the huge network of relationships between extremely large
biological data sets. Also, because much information in this industry remains unknown,
the flexibility of the graph database makes it ideally suitable. Neo4j has been leveraged by
several of the leading competitors in this market, including Curaspan Health Group, GoodStart
Genetics and Janssen Pharmaceuticals, Inc. (Merkl Sasaki, 2016).
4.6 Neo4j
Neo4j is currently the most popular graph database management system on the market (Van Bruggen,
2014). First released in 2007, it is an open-source project written completely in Java (Vicknair
et al., 2010). It is an embedded, disk-based, fully transactional Java persistence engine that
stores data structured in graphs rather than in tables. Neo4j is comprised of two parts, a client
and a server. The client is responsible for sending commands to the server, where they are then
processed before the results are returned to the client. It is claimed to be extremely scalable by
its developers, with possibilities for several billion nodes on a single machine. Its API is very
user friendly and helps facilitate efficient traversals. Neo4j is built using Apache’s Lucene for
indexing and search. Lucene is a text search engine, written in Java, geared toward high
performance (DeCandia et al., 2007).
Neo4j’s graph model adheres to that of the Property graph model introduced earlier, consisting
of two primitive types: nodes and relationships, with added properties and labels. Nodes
contain properties that are stored as arbitrary key-value pairs. The keys are strings in Neo4j,
while the values may be Java strings or primitive data types, along with arrays of those types.
These nodes are then tagged by the labels, whose responsibility it is to arrange the nodes
in a manner which will specify their roles inside the dataset. The nodes are then connected by
associative relationships that will form the graph structure. A relationship in Neo4j must consist
of a start and end node, a direction and a name. The consistent presence of a start node and end
node ensure that there will be no dangling relationships, while the presence of the direction
proves useful when traversing the graph. As with nodes, relationships may also contain
properties. This can prove quite useful in Neo4j for a number of reasons. Further metadata is
made available for graph algorithms; it provides a means for the addition of further semantics
to relationships and also a means for constraining queries at runtime.
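To make these primitives concrete, the sketch below creates two labelled nodes with key-value properties and connects them with a named, directed relationship that carries a property of its own. It is a minimal, hypothetical example; the labels, relationship type and values are illustrative assumptions rather than excerpts from the thesis dataset.

// Minimal sketch of Neo4j's property graph primitives.
// All labels, property names and values below are hypothetical examples.
CREATE (a:Actor {name: 'Some Actor', born: 1970})
CREATE (m:Movie {title: 'Some Movie', released: 2000})
CREATE (a)-[:ACTED_IN {roles: ['Lead Character']}]->(m)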
5 Comparative Analysis
5.1 Data Modelling
Data modelling is a technique motivated by a particular requirement that involves the
conversion of a complex collection of data into a readily understandable diagram. It is a highly
necessary practice, in which particular aspects of an unorganised domain are placed into a
model where they can be structured and manipulated.
What differentiates graph database modelling techniques from their relational database
equivalents is the similarity between the logical and physical models. With modelling in the
relational database, one is required to veer away from what could be considered a natural
representation of the domain. Initially the representation must be converted to a logical model,
prior to being transformed to a physical model. Oftentimes, such conversions and
transformations introduce semantic dissonance between what we perceive the dataset to be and
how an instance of the dataset can be created in the relational model. With graph databases,
this disparity is significantly reduced.
To demonstrate this, our Movie dataset will be examined. The data model should be designed
in a manner that permits us to store and query data efficiently. We will also want to be able to
update the underlying model, as the dataset we have is liable to change.
5.1.1 Relational Modelling
The first stage of modelling a dataset in the world of relational databases is identifying the
entities in the domain, how they are connected and the business rules involved. Below is the
resulting logical model.
Figure 13: The Logical Model
Having arrived at a suitable logical model, the next relational modelling step was to map that
model to suitable tables and relationships. Here, data redundancy is eliminated through a
process known as normalisation. This is the process that database architects employ to reduce
database redundancy and cater for disk storage savings. It requires splitting off data elements
that present themselves more than once in a relational database table into individual table
structures. While a highly necessary step, it can also lead to additional complexity with the data
model. This complexity is clearly evident in the entity relationship diagram in Figure 14. Such
complexity arises due to the addition of foreign key constraints, necessary to support one-to-
many relationships, and join tables such as actor_movie_role and producer_movie, which
support many-to-many relationships. It is these join tables that cause the biggest problems. To
query the database at a later date, these tables regularly need to be joined back together again.
As data schemas start to become more complex, relational systems may become increasingly
difficult to work with. Primarily, this difficulty arises from complex join operations, where
users ask queries of the database that would require data to be retrieved from a number of
different tables. These joins can become extremely complicated and resource intensive for the
database management system.
Figure 14: The Relational Model
5.1.2 Graph Data Modelling
It is evident thus far that the series of steps involved in creating a relational model adds
significantly to the complexity of the design and also moves the model away from the
conceptual view that the stakeholders possess, towards a more chaotic and less understandable
form. As the database changes and increases in complexity, the rigid schema of the relational
design leads to a much less scalable model. This drawback creates a desire for a model that is
closely aligned with the domain, maintains database performance and offers high scalability
without sacrificing the integrity of the data. The graph database model can achieve this.
In a similar way to drawing up the conceptual model in the relational world, the graph database
requires that a domain model is sought and agreed upon. This is the final similarity between the two.
Rather than converting a domain model’s graph-like representation into tables, it is enhanced,
with the aspiration to produce a precise representation of the elements of the domain applicable
to the application goals. To achieve this, every entity is granted relevant roles as labels,
attributes as properties, and connections as relationships. The resulting model is illustrated
below.
Figure 15: The Graph Data Model
Taking a look inside the model, it is easier to notice how the join tables are replaced. This is
achieved by attaching properties to the relationships, for example ‘roles’ in the acted_in
relationship.
Figure 16: A look inside the Graph Data Model
In order to allow each node the opportunity to accomplish its own particular data-centric
domain obligations, we must make sure that each node has its own necessary role-specific
labels and properties. Relationships are added in order to connect nodes and provide the domain
with a suitable degree of structure. These relationships are named, directed and regularly carry
attributes of their own.
The join problem is something which the graph database prides itself on avoiding. The
relationships that connect nodes together are effectively Neo4j’s equivalent of the Cartesian
product calculation over the full indices of the tables involved that is required when querying
a relational database. Connecting data thereby becomes as simple as traversing from one node
to another. Complex questions that are so difficult to ask in a relational world are extremely
simple, efficient and fast to answer in a graph structure.
5.2 Querying The Database
Now that the domain model is refined, a number of realistic queries will be asked to test its
suitability for handling complex data.
A query language can be described as a selection of operators that can be applied to any
database instance, with the aim of manipulating and querying the data in those structures.
Relational database systems use a shared standard known as SQL (Structured Query Language)
to query data. Graph databases, in turn, employ a common querying approach known as graph
traversal; the required data is retrieved from the database by using these traversals. A graph
traversal involves “walking” along the elements of a graph and is the essential process for data
retrieval. The primary distinction between a SQL query and a traversal is that traversals are
localised operations, and therein lies their major advantage. Rather than utilising a global
adjacency index, each vertex and edge in the graph stores a local index of the nodes connected
to it. Thus, the size of the graph becomes irrelevant and the
complex joins associated with a relational database are no longer necessary. This is not to say
that global indexes do not exist in Neo4j, because as a matter of fact they do. Indexes are
necessary to allow vertices to be readily retrieved based on their value, but they are only used
when retrieving the starting point of a traversal. Traversals in Neo4j are carried out by Cypher.
Cypher is a declarative graph query language, inspired by SQL, that aims to simplify query
writing by avoiding the requirement to write traversals in code. Cypher provides a keyword-based
structure similar to that of SQL. One weakness when compared with the more mature SQL
standard of the RDBMS is the lack of standardisation across graph query languages, which
requires one to learn the various implementations before understanding which approach is most
suitable to the problem.
In Cypher, a query begins at one or more known starting points on the graph; these points are
referred to as bound nodes. Cypher applies the labels and property predicates that are provided
in the MATCH and WHERE clauses, in conjunction with the information supplied by indexes
and constraints, to locate the starting points which support our graph patterns.
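As a brief illustration of this starting-point mechanism, the minimal sketch below locates a single bound node through its label and a property predicate; the label, property names and value are hypothetical examples rather than a query taken from the thesis figures.

// Minimal sketch: the Movie label and the title predicate in the WHERE clause
// give Cypher its bound starting point; the names used here are hypothetical.
MATCH (m:Movie)
WHERE m.title = 'Cloud Atlas'
RETURN m.title, m.released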
In order to investigate the efficiency and power of querying in each database, we will take some
examples from our Movie database.
5.2.1 Querying the Relational Database
The first query we will look at is one which seeks to find all actors, directors, producers, writers
and critics along with their reviews and ratings for the movie ‘Cloud Atlas’.
Firstly, we will take a look at this query in the relational database.
Figure 17: SQL Query
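As a rough indication of the shape such a statement takes, the following sketch is a hypothetical reconstruction written against the schema in Appendix A; it is not necessarily the exact query shown in Figure 17.

-- Hypothetical sketch only, written against the schema in Appendix A.
-- Retrieves everyone associated with 'Cloud Atlas' together with critic
-- reviews and ratings; note the ten join operations required.
SELECT m.Title, a.Name AS actor, d.Name AS director, p.Name AS producer,
       w.Name AS writer, c.Name AS critic, r.Rating, r.Summary
FROM   MOVIES m
LEFT JOIN ACTOR_MOVIE_ROLE amr ON amr.Movie_ID = m.Movie_ID
LEFT JOIN ACTOR a              ON a.Actor_ID = amr.Actor_ID
LEFT JOIN DIRECTOR_MOVIE dm    ON dm.Movie_ID = m.Movie_ID
LEFT JOIN DIRECTOR d           ON d.Director_ID = dm.Director_ID
LEFT JOIN PRODUCER_MOVIE pm    ON pm.Movie_ID = m.Movie_ID
LEFT JOIN PRODUCER p           ON p.Producer = pm.Producer_ID
LEFT JOIN WRITER_MOVIE wm      ON wm.Movie_ID = m.Movie_ID
LEFT JOIN WRITER w             ON w.Writer_ID = wm.Writer_ID
LEFT JOIN Reviews r            ON r.Movie_ID = m.Movie_ID
LEFT JOIN CRITIC c             ON c.Critic_ID = r.Critic_ID
WHERE  m.Title = 'Cloud Atlas';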
What is immediately clear is the number of joins necessary in order to achieve the desired
results. A full ‘map’ highlighting how this query works its way through the database is outlined
below in Figure 18.
Figure 18: SQL Query 'Join Map'
In this instance each broken arrow represents a join path that the query takes. The SQL query
first searches Movies to find Cloud Atlas, then is required to search all of the join tables
(actor_movie_role, director_movie, producer_movie, writer_movie) in order to find the names
of the actors, directors, producers, writers and critics associated with the movie Cloud Atlas.
The ten joins required in this query are manageable and results can be readily obtained;
however, as we traverse further, additional joins will have to be computed and this will further
hinder query times. Below is a graph that shows the results when the database was searched
for all the people associated with one, five, eight, fifteen and finally all thirty-eight movies.
Figure 19: Oracle 12c Query Times
The decrease in performance is due to the fact that each join requires an index lookup for each
movie, actor, director, producer, writer and critic. Each index lookup adds overhead, and the
cumulative cost can become extremely difficult for the database to handle. As can be seen,
such difficulties grow rapidly with the number of lookups required. A graphic of the index
lookups required for querying one movie is shown in Figure 20. The number of lookups
involved, even in this relatively simple query, is immediately obvious.
Query Times - Oracle 12c (data shown in Figure 19):
No. of Movies Queried:    1      5      8      15     38
Query Time (seconds):     0.019  0.055  0.102  0.165  0.203
Figure 20: Relational Database Index Lookups
5.2.2 Querying the Graph Database
The following is the same query in the Cypher query language:
Figure 21: Cypher Query
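As with the SQL version, the following is a hypothetical sketch based on the graph model described in Section 5.1.2, not necessarily the exact statement shown in Figure 21; it assumes that role, rating and summary information is stored as properties on the relationships.

// Hypothetical Cypher sketch. A single index lookup locates the Movie node;
// all remaining data is reached by traversing its incoming relationships.
// Property and relationship names are assumptions based on the graph model.
MATCH (person)-[rel]->(movie:Movie {title: 'Cloud Atlas'})
RETURN movie.title, person.name, type(rel),
       rel.roles AS roles, rel.rating AS rating, rel.summary AS summary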
Only one index lookup is required here in order to find the start node. From there the query
will traverse the relationships needed to retrieve the required data.
Figure 22: Graph Data Index Lookup
The diagram above clearly indicates the need for only one index lookup, the main action
associated with poor performance in the relational database. The dashed orange lines are
representative of traversals, with each traversal returning the required data.
Figure 23 and Figure 24 below show the data output in Neo4j form.
Figure 23: Neo4j Data Output
Figure 24: Neo4j Graphical Output
As with earlier tests of the relational database, the graph database was then required to search
for all persons involved in five, eight, fifteen and thirty-eight movies. The query times are
illustrated below:
Figure 25: Neo4j Query Times
As illustrated, the increase in query time is significantly less than it was in the relational
database when both instances were tested under increased load.
Figure 26: Neo4j vs Oracle 12c
Query Times - Neo4j (data shown in Figure 25):
No. of Movies Queried:    1      5      8      15     38
Query Time (seconds):     0.014  0.022  0.026  0.027  0.032

Neo4j vs Oracle 12c (data shown in Figure 26):
No. of Movies Queried:    1      5      8      15     38
Neo4j (seconds):          0.014  0.022  0.026  0.027  0.032
Oracle 12c (seconds):     0.019  0.055  0.102  0.165  0.203
6 Conclusion
The thesis was an investigation and analysis of the performance of Relational Database
Management Systems and Graph Databases, with the aim of discovering how more complex
data is handled in each database and whether graph database technology is more suitable than
its widely used relational counterpart. Relational databases were designed with structure in
mind and developed in a strict tabular format which conforms to a pre-specified schema. The
foundation of their design is the database schema, which provides a logical view of the
database alongside the relations between tables. In comparison with the graph database, a
significantly greater amount of work was required in order to get the dataset to fit. Several of
the dreaded join tables were required in order to incorporate all the many-to-many relationships
that appeared in the logical model. In comparison, the graph model was extremely
straightforward: by labelling each node and attaching properties to relationships as required,
the logical model effectively became the data model.
When it came to querying the database, again the graph database came out on top. The first
thing to be realised when carrying out the analysis was the importance of a fundamental concept
in both database types: the use of indexes. In the relational database, indexes are expensive but
necessary tools, used to quickly find the desired records within the data using either a foreign
key or a primary key. When two or more tables are joined, the indexes on both tables may need
to be scanned completely and repeatedly to locate all the data elements matching the query
specification. This is why performing joins is so computationally expensive, and it is also
where graph databases excel: they are extremely fast for join-intensive queries. With a graph
database, the index on the data is used only at the beginning of the query, when locating the
start node. Once the starting nodes are found, one can simply “walk the network” and find the
next data element by traversing along the relationships without the need for further index
lookups. This is known as “index-free adjacency” and it is a fundamental concept in graph
databases. It
can be concluded that graph databases manage complex data better than the relational database
in cases where the data contains many relationships and queries require traversing multiple
relationships to resolve.
7 References
Agrawal, D., El Abbadi, A., Das, S., & Elmore, A. (2011). Database Scalability, Elasticity, and
Autonomy in the Cloud. Database Systems For Advanced Applications, 2-15.
http://dx.doi.org/10.1007/978-3-642-20149-3_2
Agricola, A. (2014). World’s Leading Telcos Turn to Neo4j. Neo4j News. Retrieved from
https://neo4j.com/news/worlds-leading-telcos-turn-neo4j/
Amann, B. & Scholl, M. (1992). Gram. Proceedings Of The ACM Conference On Hypertext -
ECHT '92. http://dx.doi.org/10.1145/168466.168527
Andries, M., Gemis, M., Paredaens, J., Thyssens, I., & Van den Bussche, J. (1992). Concepts
for graph-oriented object manipulation. Advances In Database Technology — EDBT '92, 21-
38. http://dx.doi.org/10.1007/bfb0032421
Angles, R. & Gutierrez, C. (2008). Survey of graph database models. CSUR, 40(1), 1-39.
http://dx.doi.org/10.1145/1322432.1322433
Asirelli, P., De Santis, M., & Martelli, M. (1985). Integrity constraints in logic databases. The
Journal Of Logic Programming, 2(3), 221-232. http://dx.doi.org/10.1016/0743-
1066(85)90020-2
Bachman, C. (1972). The evolution of storage structures. Communications Of The ACM, 15(7),
628-634. http://dx.doi.org/10.1145/361454.361495
Bachman, M. (2013). GraphAware: Towards Online Analytical Processing in Graph
Databases (MSc Degree in Computing (Distributed Systems). Imperial College London
Department of Computing.
Bajaj, R. (2014). Big Data – The New Era of Data. (IJCSIT) International Journal Of Computer
Science And Information Technologies, 5(2), 1875-1885.
Barabási, A. & Albert, R. (1999). Emergence of Scaling in Random
Networks. Science, 286(5439), 509-512. http://dx.doi.org/10.1126/science.286.5439.509
Bitner, S. (2015). The future of Business Analytics is in complex data. Big Data Made Simple
- One source. Many perspectives. Retrieved 11 June 2016, from http://bigdata-
madesimple.com/future-business-analytics-complex-data/
Böhnlein, M. & vom Ende, A. (1999). XML — Extensible Markup
Language. Wirtschaftsinf, 41(3), 274-276. http://dx.doi.org/10.1007/bf03254940
Bonnet, L., Laurent, A., Sala, M., Laurent, B., & Sicard, N. (2011). Reduce, You Say: What
NoSQL Can Do for Data Aggregation and BI in Large Repositories. 2011 22Nd International
Workshop On Database And Expert Systems Applications.
http://dx.doi.org/10.1109/dexa.2011.71
Bressan, S. & Catania, B. (2005). Introduction to database systems. Singapore: McGraw-Hill.
Broder, A. (2002). A taxonomy of web search. ACM SIGIR Forum, 36(2), 3.
http://dx.doi.org/10.1145/792550.792552
Brunner, R. (2006). The basics of relational database systems. Developing with Apache Derby
-- Hitting the Trifecta: Database development with Apache Derby, Part 2. Retrieved from
http://www.ibm.com/developerworks/library/os-ad-trifecta3/
Burstein, M. (2013). Introduction to Graph Theory: Finding The Shortest Path. Max Burstein.
Retrieved from http://www.maxburstein.com/blog/introduction-to-graph-theory-finding-
shortest-path/
Carlsson, T. (2016). Neo4j Powers Master Data Management (MDM) Applications for
Enterprises Across the Globe. Yahoo Finance. Retrieved from
http://finance.yahoo.com/news/neo4j-powers-master-data-management-153517563.html
Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12.
http://dx.doi.org/10.1145/1978915.1978919
Chen, P. (1976). The entity-relationship model---toward a unified view of data. ACM
Transactions On Database Systems, 1(1), 9-36. http://dx.doi.org/10.1145/320434.320440
Codd, E. (1970). A relational model of data for large shared data banks. Communications Of
The ACM, 13(6), 377-387. http://dx.doi.org/10.1145/362384.362685
Codd, E. (1979). Extending the database relational model to capture more meaning. ACM
Transactions On Database Systems, 4(4), 397-434. http://dx.doi.org/10.1145/320107.320109
Codd, E. (1990). The relational model for database management. Reading, Mass.: Addison-
Wesley.
Cook, W. (2009). On understanding data abstraction, revisited. ACM SIGPLAN
Notices, 44(10), 557. http://dx.doi.org/10.1145/1639949.1640133
Cui, B., Mei, H., & Ooi, B. (2014). Big data: the driver for innovation in databases. National
Science Review, 1(1), 27-30. http://dx.doi.org/10.1093/nsr/nwt020
Davidson, S., Overton, C., & Buneman, P. (1995). Challenges in Integrating
Biological Data Sources. Journal Of Computational Biology, 2(4), 557-572.
http://dx.doi.org/10.1089/cmb.1995.2.557
Dean, J. & Ghemawat, S. (2008). MapReduce. Communications Of The ACM, 51(1), 107.
http://dx.doi.org/10.1145/1327452.1327492
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., & Pilchin, A. et al.
(2007). Dynamo. ACM SIGOPS Operating Systems Review, 41(6), 205.
http://dx.doi.org/10.1145/1323293.1294281
Dewey, M., Mitchell, J., Beall, J., Matthews, W., & New, G. (1996). Dewey decimal
classification and relative index. Albany, N.Y.: Forest Press, a division of OCLC Online
Computer Library Center.
Diestel, R. (2010). Graph theory. Berlin: Springer.
Gemis, M. & Paredaens, J. (1993). An object-oriented pattern matching language. Lecture
Notes In Computer Science, 339-355. http://dx.doi.org/10.1007/3-540-57342-9_82
Gemis, M., Paredaens, J., Thyssens, I., & Van den Bussche, J. (1993). GOOD. Proceedings Of
The 1993 ACM SIGMOD International Conference On Management Of Data - SIGMOD '93.
http://dx.doi.org/10.1145/170035.171533
Gharehchopogh, F. & Khalifelu, Z. (2011). Analysis and evaluation of unstructured data: text
mining versus natural language processing. 2011 5Th International Conference On Application
Of Information And Communication Technologies (AICT).
http://dx.doi.org/10.1109/icaict.2011.6111017
Graves, M., Bergeman, E., & Lawrence, C. (1995). A graph-theoretic data model for genome
mapping databases. Proceedings Of The Twenty-Eighth Hawaii International Conference On
System Sciences, Vol.5. http://dx.doi.org/10.1109/hicss.1995.375353
Grust, T., Freytag, J., & Leser, U. (2016). Cost-based Optimization of Graph Queries in
Relational Database Management Systems (Masters). University of Berlin.
Gutiérrez, A., Pucheral, P., Steffen, H., & Thévenin, J. (1994). Database Graph Views: A
Practical Model to Manage Persistent Graphs. International Conference On Very Large
Databases, 94, 391-402.
Güting, R. (1994). GraphDB: Modeling and Querying Graphs in Databases. Proceedings Of
The 20Th International Conference On Very Large Data Bases, 297-308.
Harrington, J. (2003). SQL clearly explained. (3rd ed.). Elsevier.
Hidders, J. (2002). Typing Graph-Manipulation Operations. Lecture Notes In Computer
Science, 394-409. http://dx.doi.org/10.1007/3-540-36285-1_26
Hidders, J. & Paredaens, J. (1994). Goal, a Graph-Based Object and Association
Language. Advances In Database Systems, 247-265. http://dx.doi.org/10.1007/978-3-7091-
2704-9_13
Kahate, A. (2004). Introduction to database management systems. Delhi, India: Pearson
Education (Singapore).
Kaur, K. & Rani, R. (2013). Modeling and querying data in NoSQL databases. 2013 IEEE
International Conference On Big Data. http://dx.doi.org/10.1109/bigdata.2013.6691765
Khodaei, M. (2008). Case Study: Implementation of Integrity Constraints in Actual Database
Systems (MSc Electrical Engineering and Information Technology). Czech Technical
University in Prague.
Kiesel, N., Schuerr, A., & Westfechtel, B. (1995). Gras, a graph-oriented (software)
engineering database system. Information Systems, 20(1), 21-51.
http://dx.doi.org/10.1016/0306-4379(95)00002-l
Klyne, G. & Carroll, J. (2006). Resource description framework (RDF): Concepts and abstract
syntax.
Kunii, H. (1987). DBMS with graph data model for knowledge handling. Proceedings Of The
1987 Fall Joint Computer Conference On Exploring Technology: Today And Tomorrow, 138-
142. Retrieved from
http://dl.acm.org/citation.cfm?id=42071&CFID=776948696&CFTOKEN=76722333
Kuper, G. & Vardi, M. (1984). A new approach to database logic. Proceedings Of The 3Rd
ACM SIGACT-SIGMOD Symposium On Principles Of Database Systems - PODS '84.
http://dx.doi.org/10.1145/588011.588026
Leavitt, N. (2010). Will NoSQL Databases Live Up to Their Promise?. Computer, 43(2), 12-
14. http://dx.doi.org/10.1109/mc.2010.58
Lecluse, C., Richard, P., & Velez, F. (1988). O2, an object-oriented data model. ACM SIGMOD
Record, 17(3), 424-433. http://dx.doi.org/10.1145/971701.50253
Levene, M. & Poulovassilis, A. (1990). The hypernode model and its associated query
language. Proceedings Of The 5Th Jerusalem Conference On Information Technology, 1990.
'Next Decade In Information Technology'. http://dx.doi.org/10.1109/jcit.1990.128324
Lomotey, R. & Deters, R. (2013). Unstructured data extraction in distributed NoSQL. 2013
7Th IEEE International Conference On Digital Ecosystems And Technologies (DEST).
http://dx.doi.org/10.1109/dest.2013.6611347
Mainguenaud, M. & Simatic, X. (1992). A data model to deal with multi-scaled
networks. Computers, Environment And Urban Systems, 16(4), 281-288.
http://dx.doi.org/10.1016/0198-9715(92)90009-g
McGuinness, D. & Van Harmelen, F. (2004). OWL web ontology language overview. W3C
Recommendation, 10(10).
Merkl Sasaki, B. (2016). Neo4j Graph Database Powers the Healthcare Sector. Neo4j News.
Retrieved from https://neo4j.com/news/neo4j-graph-database-powers-healthcare-sector/
Miler, M., Medak, D., & Odobašić, D. (2014). A shortest path algorithm performance
comparison in graph and relational database on a transportation network. PROMET -
Traffic&Transportation, 26(1). http://dx.doi.org/10.7307/ptt.v26i1.1268
Mohamed Ali, N. & Padma, D. (2016). Graph Database: A Contemporary Storage Mechanism
for Connected Data. International Journal Of Advanced Research In Computer And
Communication Engineering, 5(3). http://dx.doi.org/10.17148/IJARCCE.2016.53220
Nixon, K. (2015). How Gamesys Harnessed Neo4j for Competitive Advantage. Neo4j Blog.
Retrieved from https://neo4j.com/blog/gamesys-neo4j-competitive-advantage/
Paredaens, J., Peelman, P., & Tanca, L. (1995). G-Log: a graph-based query language. IEEE
Trans. Knowl. Data Eng., 7(3), 436-453. http://dx.doi.org/10.1109/69.390249
Peckham, J. & Maryanski, F. (1988). Semantic data models. CSUR, 20(3), 153-189.
http://dx.doi.org/10.1145/62061.62062
Pokorny, J. (2011). NoSQL databases. Proceedings Of The 13Th International Conference On
Information Integration And Web-Based Applications And Services - Iiwas '11.
http://dx.doi.org/10.1145/2095536.2095583
Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H., & Mankovskii,
S. (2012). Solving big data challenges for enterprise application performance
management. Proc. VLDB Endow., 5(12), 1724-1735.
http://dx.doi.org/10.14778/2367502.2367512
Reeve, A. (2012). Big Data and NoSQL: The Problem with Relational Databases |
InFocus. InFocus. Retrieved 11 June 2016, from https://infocus.emc.com/april_reeve/big-
data-and-nosql-the-problem-with-relational-databases/
Rodriguez, M. & Neubauer, P. (2010). Constructions from dots and lines. Bulletin Of The
American Society For Information Science And Technology, 36(6), 35-41.
http://dx.doi.org/10.1002/bult.2010.1720360610
Rodriguez, M. & Neubauer, P. (2012). The Graph Traversal Pattern. Techniques And
Applications, 29-46. http://dx.doi.org/10.4018/978-1-61350-053-8.ch002
Roman, S. (2002). Access database design and programming. Sebastopol [CA]: O'Reilly.
Roussopoulos, N. & Mylopoulos, J. (1975). Using semantic networks for data base
management. Proceedings Of The 1St International Conference On Very Large Data Bases -
VLDB '75. http://dx.doi.org/10.1145/1282480.1282490
Russom, P. (2011). Big data analytics (pp. 1-35). 1105 Media, Inc.
Selvarani, S. & Sadhasivam, G. (2010). Improved cost-based algorithm for task scheduling in
cloud computing. 2010 IEEE International Conference On Computational Intelligence And
Computing Research. http://dx.doi.org/10.1109/iccic.2010.5705847
Shenai, K. (1992). Introduction to database and knowledge-base systems. Singapore: World
Scientific.
Shimpi, D. & Chaudhari, S. (2012). An overview of Graph Databases. International Journal
Of Computer Applications® (IJCA) (0975 – 8887), 18-22. Retrieved from
http://research.ijcaonline.org/icrtitcs2012/number3/icrtitcs1351.pdf
Shipman, D. (1981). The functional data model and the data languages DAPLEX. ACM
Transactions On Database Systems, 6(1), 140-173. http://dx.doi.org/10.1145/319540.319561
Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop Distributed File
System. 2010 IEEE 26Th Symposium On Mass Storage Systems And Technologies (MSST).
http://dx.doi.org/10.1109/msst.2010.5496972
Silberschatz, A., Korth, H., & Sudarshan, S. (2011). Database system concepts. New York:
McGraw-Hill.
Strauch, C. & Kriha, W. (2011). NoSQL Databases. Selected Topics On Software-Technology
Ultra-Large Scale Sites Hochschule Der Medien, Stuttgart.
Sumathi, D. & Esakkirajan, S. (2007). Fundamentals of Relational Database Management
Systems. Springer Berlin Heidelberg.
Tangen, J. (2012). Companies Worldwide Flock to Neo4j to Make Their Applications
Social. Neo Technology. Retrieved from http://www.marketwired.com/press-
release/companies-worldwide-flock-to-neo4j-to-make-their-applications-social-1634667.htm
Tauro, C., Patil, B., & Prashanth, K. (2013). A Comparative Analysis of Different NoSQL
Databases on Data Model, Query Model and Replication Model. EREICA.
Topor, R., Salem, K., Gupta, A., Goda, K., Gehrke, J., & Palmer, N. et al. (2009). Shared-
Nothing Architecture. Encyclopedia Of Database Systems, 2638-2639.
http://dx.doi.org/10.1007/978-0-387-39940-9_1512
Ullman, J. (1995). Principles of database and knowledge-base systems. Rockville, Md:
Computer Science Press.
Van Bruggen, R. (2014). Learning Neo4j. Birmingham, UK: Packt Pub.
Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., & Wilkins, D. (2010). A comparison of
a graph database and a relational database. Proceedings Of The 48Th Annual Southeast
Regional Conference On - ACM SE '10. http://dx.doi.org/10.1145/1900008.1900067
Wills, R. (2006). Google’s pagerank. The Mathematical Intelligencer, 28(4), 6-11.
http://dx.doi.org/10.1007/bf02984696
8 Appendix
8.1 Appendix A
ALTER TABLE ACTOR_MOVIE_ROLE DROP CONSTRAINT
ACTOR_MOVIE_ROLE_ACTOR_FK;
ALTER TABLE ACTOR_MOVIE_ROLE DROP CONSTRAINT
ACTOR_MOVIE_ROLE_MOVIES_FK;
ALTER TABLE CRITIC_FOLLOWS DROP CONSTRAINT
CRITIC_FOLLOWS_CRITIC_FK;
ALTER TABLE CRITIC_FOLLOWS DROP CONSTRAINT
CRITIC_FOLLOWS_CRITIC_FKv1;
ALTER TABLE PRODUCER_MOVIE DROP CONSTRAINT
PRODUCER_MOVIE_MOVIES_FK;
ALTER TABLE PRODUCER_MOVIE DROP CONSTRAINT
PRODUCER_MOVIE_PRODUCER_FK;
ALTER TABLE DIRECTOR_MOVIE DROP CONSTRAINT
DIRECTOR_MOVIE_DIRECTOR_FK;
ALTER TABLE DIRECTOR_MOVIE DROP CONSTRAINT
DIRECTOR_MOVIE_MOVIES_FK;
ALTER TABLE Reviews DROP CONSTRAINT Reviews_CRITIC_FK;
ALTER TABLE Reviews DROP CONSTRAINT Reviews_MOVIES_FK;
ALTER TABLE WRITER_MOVIE DROP CONSTRAINT
WRITER_MOVIE_MOVIES_FK;
ALTER TABLE WRITER_MOVIE DROP CONSTRAINT
WRITER_MOVIE_WRITER_FK;
DROP TABLE ACTOR;
DROP TABLE MOVIES;
DROP TABLE ACTOR_MOVIE_ROLE;
DROP TABLE DIRECTOR;
DROP TABLE DIRECTOR_MOVIE;
DROP TABLE PRODUCER;
DROP TABLE PRODUCER_MOVIE;
DROP TABLE WRITER;
DROP TABLE WRITER_MOVIE;
DROP TABLE CRITIC;
DROP TABLE CRITIC_FOLLOWS;
DROP TABLE Reviews;
CREATE
TABLE ACTOR
(
Actor_ID VARCHAR2 (50) NOT NULL ,
Name VARCHAR2 (50) NOT NULL ,
Born NUMBER (4)
) ;
ALTER TABLE ACTOR ADD CONSTRAINT Actor_PK PRIMARY KEY ( Actor_ID ) ;
CREATE
TABLE ACTOR_MOVIE_ROLE
(
Actor_ID VARCHAR2 (50) NOT NULL ,
Movie_ID VARCHAR2 (225) NOT NULL ,
Role_ID VARCHAR2 (50) NOT NULL
) ;
ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT
ACTOR_MOVIE_ROLE_PK PRIMARY KEY (
Actor_ID, Movie_ID, Role_ID ) ;
CREATE
TABLE CRITIC
(
Critic_ID VARCHAR2 (50) NOT NULL ,
Name VARCHAR2 (50)
) ;
ALTER TABLE CRITIC ADD CONSTRAINT CRITIC_PK PRIMARY KEY ( Critic_ID ) ;
CREATE
TABLE CRITIC_FOLLOWS
(
Critic_ID VARCHAR2 (50) NOT NULL ,
Following VARCHAR2 (50) NOT NULL
) ;
ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT CRITIC_FOLLOWS_PK
PRIMARY KEY (
Critic_ID, Following ) ;
CREATE
TABLE DIRECTOR
(
Director_ID VARCHAR2 (50) NOT NULL ,
Name VARCHAR2 (50) NOT NULL ,
Born NUMBER (4)
) ;
ALTER TABLE DIRECTOR ADD CONSTRAINT DIRECTOR_PK PRIMARY KEY (
Director_ID ) ;
CREATE
TABLE DIRECTOR_MOVIE
(
Director_ID VARCHAR2 (50) NOT NULL ,
Movie_ID VARCHAR2 (225) NOT NULL
) ;
ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT DIRECTOR_ID_PK
PRIMARY KEY (
Director_ID, Movie_ID ) ;
CREATE
TABLE MOVIES
(
Movie_ID VARCHAR2 (225) NOT NULL ,
Title VARCHAR2 (225) ,
Released NUMBER (4) NOT NULL ,
Tagline VARCHAR2 (4000)
) ;
ALTER TABLE MOVIES ADD CONSTRAINT MOVIES_PK PRIMARY KEY ( Movie_ID
) ;
CREATE
TABLE PRODUCER
(
Producer VARCHAR2 (50) NOT NULL ,
Name VARCHAR2 (50) ,
Born NUMBER (4)
) ;
ALTER TABLE PRODUCER ADD CONSTRAINT PRODUCER_PK PRIMARY KEY (
Producer ) ;
CREATE
TABLE PRODUCER_MOVIE
(
Producer_ID VARCHAR2 (50) NOT NULL ,
Movie_ID VARCHAR2 (225) NOT NULL
) ;
ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT PRODUCER_MOVIE_PK
PRIMARY KEY (
Producer_ID, Movie_ID ) ;
CREATE
TABLE Reviews
(
Critic_ID VARCHAR2 (50) NOT NULL ,
Movie_ID VARCHAR2 (225) NOT NULL ,
Rating NUMBER (5,2) ,
Summary VARCHAR2 (4000)
) ;
ALTER TABLE Reviews ADD CONSTRAINT Reviews_PK PRIMARY KEY ( Critic_ID,
Movie_ID
) ;
CREATE
TABLE WRITER
(
Writer_ID VARCHAR2 (50) NOT NULL ,
Name VARCHAR2 (50) ,
Born NUMBER (4)
) ;
ALTER TABLE WRITER ADD CONSTRAINT WRITER_PK PRIMARY KEY ( Writer_ID
) ;
CREATE
TABLE WRITER_MOVIE
(
Writer_ID VARCHAR2 (50) NOT NULL ,
Movie_ID VARCHAR2 (50) NOT NULL
) ;
ALTER TABLE WRITER_MOVIE ADD CONSTRAINT WRITER_MOVIE_PK
PRIMARY KEY ( Writer_ID
, Movie_ID ) ;
ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT
ACTOR_MOVIE_ROLE_ACTOR_FK FOREIGN
KEY ( Actor_ID ) REFERENCES ACTOR ( Actor_ID ) ON
DELETE CASCADE ;
ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT
ACTOR_MOVIE_ROLE_MOVIES_FK FOREIGN
KEY ( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON
DELETE CASCADE ;
ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT
CRITIC_FOLLOWS_CRITIC_FK FOREIGN KEY
( Critic_ID ) REFERENCES CRITIC ( Critic_ID ) ON
DELETE CASCADE ;
ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT
CRITIC_FOLLOWS_CRITIC_FKv1 FOREIGN
KEY ( Following ) REFERENCES CRITIC ( Critic_ID ) ON
DELETE CASCADE ;
ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT
DIRECTOR_MOVIE_DIRECTOR_FK FOREIGN
KEY ( Director_ID ) REFERENCES DIRECTOR ( Director_ID ) ON
DELETE CASCADE ;
ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT
DIRECTOR_MOVIE_MOVIES_FK FOREIGN KEY
( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON
DELETE CASCADE ;
ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT
PRODUCER_MOVIE_MOVIES_FK FOREIGN KEY
( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON
DELETE CASCADE ;
ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT
PRODUCER_MOVIE_PRODUCER_FK FOREIGN
KEY ( Producer_ID ) REFERENCES PRODUCER ( Producer ) ON
DELETE CASCADE ;
ALTER TABLE Reviews ADD CONSTRAINT Reviews_CRITIC_FK FOREIGN KEY (
Critic_ID )
REFERENCES CRITIC ( Critic_ID ) ON
DELETE CASCADE ;
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 

Recently uploaded (20)

Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 

A Comparative analysis of Graph Databases vs Relational Database

2 Literature Review .......................................................... 3
3 The Relational Database .................................................... 12
3.1 Relational Data Model .................................................... 12
3.1.1 The Relation ........................................................... 12
3.2 Relational Algebra ....................................................... 13
3.3 Data Structure ........................................................... 14
3.4 Integrity Constraints .................................................... 14
3.5 Operations ............................................................... 15
3.5.1 Selection Operation .................................................... 16
3.5.2 Projection Operation ................................................... 17
3.5.3 Cartesian Product ...................................................... 18
3.5.4 Union .................................................................. 20
3.5.5 Set Difference ......................................................... 20
4 Graph Databases ............................................................ 21
4.1 Introduction ............................................................. 21
4.2 Data Model ............................................................... 23
4.2.1 Nodes .................................................................. 23
4.2.2 Edges .................................................................. 23
4.2.3 Labels ................................................................. 24
4.3 Terminology .............................................................. 24
4.3.1 The Graph .............................................................. 25
4.3.2 Graph size ............................................................. 25
4.3.3 Degree of a node ....................................................... 25
4.3.4 Path and path length ................................................... 25
4.4 Key Features of the Graph Database ....................................... 26
4.4.1 Performance ............................................................ 26
4.4.2 Flexibility ............................................................ 27
4.5 Real Life Applications of the Graph Database ............................. 27
4.5.1 Google's page rank ..................................................... 27
4.5.2 Master data management ................................................. 28
4.5.3 Social networking ...................................................... 28
4.5.4 Telecommunication ...................................................... 29
4.5.5 Security and access management ......................................... 29
4.5.6 Bioinformatics ......................................................... 29
4.6 Neo4j .................................................................... 29
5 Comparative Analysis ....................................................... 31
5.1 Data Modelling ........................................................... 31
5.1.1 Relational Modelling ................................................... 31
5.1.2 Graph Data Modelling ................................................... 33
5.2 Querying The Database .................................................... 35
5.2.1 Querying the Relational Database ....................................... 37
5.2.2 Querying the Graph Database ............................................ 41
6 Conclusion ................................................................. 44
7 References ................................................................. 46
8 Appendix ................................................................... 55
8.1 Appendix A ............................................................... 55
9 Appendix B ................................................................. 95
9.1 Neo4j Scripts ............................................................ 95
List of Figures
Figure 1: Relation Instance .................................................. 14
Figure 2: Selection Operation ................................................ 16
Figure 3: Selection Operation ................................................ 17
Figure 4: Reviews Table ...................................................... 17
Figure 5: Projection Operation ............................................... 18
Figure 6: Keanu Reeves Movies ................................................ 19
Figure 7: Cartesian Product .................................................. 19
Figure 8: Directors and actors in Unforgiven ................................. 20
Figure 9: Set Difference Operation ........................................... 20
Figure 10: The property graph ................................................ 22
Figure 11: UML schema for graph data model ................................... 23
Figure 12: A Simple Graph .................................................... 24
Figure 13: The Logical Model ................................................. 32
Figure 14: The Relational Model .............................................. 33
Figure 15: The Graph Data Model .............................................. 34
Figure 16: A look inside the Graph Data Model ................................ 35
Figure 17: SQL Query ......................................................... 37
Figure 18: SQL Query 'Join Map' .............................................. 38
Figure 19: Oracle 12c Query Times ............................................ 39
Figure 20: Relational Database Index Lookups ................................. 40
Figure 21: Cypher Query ...................................................... 41
Figure 22: Graph Data Index Lookup ........................................... 41
Figure 23: Neo4j Data Output ................................................. 42
Figure 24: Neo4j Graphical Output ............................................ 42
Figure 25: Neo4j Query Times ................................................. 43
Figure 26: Neo4j vs Oracle 12c ............................................... 43
1 Introduction

NoSQL is all the rage, and as a consequence we are currently witnessing a rise in the popularity of the graph database. There is an ongoing debate that relational database systems are outdated technology, and people are looking towards newer technologies that will serve as their replacement. The main talking point of the debate surrounds the issue of complex data, and this thesis aims to establish whether graph databases are a better alternative to relational databases in terms of performance when the data contains many relationships that require traversing multiple relationships to resolve queries.

Relational databases have been around as a standard for more than thirty years. They are utilised industry wide and have been a major success to date, serving as the number one data storage choice for decades and providing businesses with a flexible, standard interface to store their data. However, storage needs are changing as data becomes bigger and more complex, and relational databases apply much of the overhead necessary for complex update operations to every activity, which can lead to inferior performance and an altogether more limited data store than their graph counterpart. Modern data is very complex, large in volume and highly connected, and it cannot be efficiently managed by relational database technologies because of the number of data relationships involved. The graph database can provide a solution to this issue, and in turn a successful rival to the seemingly irreplaceable relational database.

This thesis aims to quantitatively measure the performance of both database types, that is, by benchmarking them. The benchmark is designed to be general enough to model the capabilities of both types, and the information queried is intended to be realistic and unbiased.
To achieve this, a dataset containing a number of movies along with their actors, directors, producers and writers was constructed. The code for both Neo4j and Oracle 12c is contained in the appendices of this document.

The remainder of this thesis is organised as follows: Chapter 2 provides a review of the relevant literature. Chapter 3 and Chapter 4 introduce the fundamental design considerations of relational databases and graph databases respectively. The performance of each database type is tested and analysed in Chapter 5. Chapter 6 provides a conclusion for this work.
2 Literature Review

Since the advent of database management systems, there has been a continuing argument about which database model should be used for a particular purpose. The development and assortment of existing database models show that there are numerous circumstances that affect their development. Angles & Gutierrez (2008) believed some of the more important factors to be the structure of the domain to be modelled, the types of theoretical tools desired by the intended end user, and the hardware and software constraints.

Long before the invention of the modern computer system, the storage of information posed many challenges. During these times, information retrieval and indexing was made more efficient by systems such as the Dewey Decimal System, a proprietary system developed by Melvil Dewey in 1876 (Dewey, Mitchell, Beall, Matthews, & New, 1996). This somewhat eased the difficulties associated with information storage; however, a significant amount of physical volume was still required to store data, and human intelligence and understanding were needed to process complicated relations in the data.

The first commercial database management system (DBMS) was developed in 1964 by Charles Bachman while he worked at Honeywell (Bachman, 1972). Its Integrated Data Store (IDS), which took its design from an early data model, stored a single set of shared files on disk and automated several data processing tasks through data manipulation commands made available to application programmers (Bressan & Catania, 2005). Some advances were made on this model throughout the 1960s; however, many problems remained, primarily the lack of good data abstraction. The data structures associated with Bachman's model and its successors were unsuitable for modelling non-traditional applications (Kahate, 2004). A first solution to these problems led to the conceptual basis and initial definition of the relational model, proposed by E.F. Codd in 1970.
Codd introduced a system of structuring data using relations, organised in a mathematical structure consisting of columns and rows (Codd, 1970). Due to its applicability, it was widely accepted for business applications in comparison to previous models (Peckham & Maryanski, 1988). This was a revolutionary concept, which would prove to be an extremely significant development in the design of relational databases. The relational model's purity and mathematical foundation are the primary factors in why it has held its position as the dominant DBMS model over the past thirty years (Silberschatz, Korth & Sudarshan, 2011).

An initial serious interest in the area of graph databases arrived in the early nineties, before fading with the emergence of XML (Angles & Gutierrez, 2008). Before this, however, a number of pioneering papers focusing on the idea of a graph database were published. Roussopoulos & Mylopoulos (1975) proposed using a semantic network to store data about the database. In 1981 the Functional Data Model was presented to represent an implicit structure of graphs for a data set, with the aim being the visionary representation of a logical database (Shipman, 1981). A few years later the Logical Data Model (LDM), an explicit graph database model proposed to consolidate the relational, hierarchical and network models, was developed (Kuper & Vardi, 1984). G-Base, a graph database model intended for the representation of complex structures, was presented in 1987 (Kunii, 1987). An object-orientated database model based on a graph structure, known as O2, was presented in 1988 (Lecluse, Richard, & Velez, 1988). Furthering this concept was the development of a system known as GOOD, in which both the representation of the data model and data manipulation were graph based (Gemis, Paredaens, Thyssens, & Van den Bussche, 1993). Further advancements to GOOD led to the development of GMOD, which focused on approaches for graph-orientated database user interfaces (Andries, Gemis, Paredaens, Thyssens, & Van den Bussche, 1992); Gram, an explicit graph database model for hypertext data (Amann & Scholl, 1992);
PaMal, extending GOOD with a clear representation of tuples and sets (Gemis & Paredaens, 1993); GOAL, which introduced the idea of association nodes (Hidders & Paredaens, 1994); G-Log, a proposal for a graph-orientated declarative query language, which incorporates the expressive capabilities of logic, the modelling capabilities of complex objects with identity and the representation capabilities of graphs (Paredaens, Peelman, & Tanca, 1995); and a final proposal, GDM, a graph-based data model in which database instances and database schemas are illustrated by instance graphs and schema graphs (Hidders, 2002).

The early nineties also saw proposals using the general concept of graphs for data modelling. The Hypernode Model was a database model introduced by Levene and Poulovassilis in 1990 (Levene & Poulovassilis, 1990). Their model was based on nested graphs and provided a basis from which later developments followed, such as the modelling of multi-scaled networks (Mainguenaud & Simatic, 1992). The concepts proposed in the Hypernode Model were also used for modelling genome data, since mapping and other genomic data can be clearly represented by graphs, and graphs can be stored in a database (Graves, Bergeman, & Lawrence, 1995). Levene and Poulovassilis also introduced the data model GROOVY (Graphically-Represented Object-Oriented data model with Values), which provided a pure generalisation of the primary concepts of object-oriented data modelling through the use of hypergraphs (Levene & Poulovassilis, 1991). GraphDB, a proposal made in 1994 and driven by the requirement to manage data in transport networks, involved modelling and querying graphs in object-orientated databases (Güting, 1994). A subsequent proposal, Database Graph View, provided an abstraction mechanism to enable graphs to be defined and manipulated in relational, object-orientated or file systems (Gutiérrez, Pucheral, Steffen, & Thévenin, 1994). Complex data associated with software engineering projects was modelled using attributed graphs by a project known as GRAS (Kiesel, Schuerr, & Westfechtel, 1995).
The popular OEM model proposed unified access to heterogeneous information sources, concentrating on the transfer of information (Klyne & Carroll, 2006).

A further and extremely important development involves data representation models and the World Wide Web. These include data exchange models such as XML, metadata representation models such as RDF and ontology representation models such as OWL (Böhnlein & vom Ende, 1999; Klyne & Carroll, 2006; McGuinness & Van Harmelen, 2004). It is, however, the growth of the Web and the large amount of available resources that pose the greatest threat to accurate information retrieval (Barabási & Albert, 1999). This huge store of unstructured data has made efficient information search and retrieval a very tiresome task, particularly in the case of a less scalable data model.

Scalability in databases is their capability to manage a growing number of transactions and a growing amount of stored data at the same speed. The significant amounts of data stored today by software in almost every area imaginable are progressively leading to major problems, because current storage technologies are not advancing at the rate required to cater for the performance and scalability needed (Agrawal, El Abbadi, Das, & Elmore, 2011). Scalability can be achieved in two ways: vertical scalability and horizontal scalability (Kaur & Rani, 2013). Vertical scalability means to scale up, which is achieved by increasing the resources of a single machine. Horizontal scalability, on the other hand, is called scaling out, and is achieved by adding commodity servers alongside the existing nodes. Vertical scalability is expensive in comparison to horizontal scalability, and the degree to which a database can scale vertically is also limited. Vertical scalability is rarely an efficient option, and in some cases not a feasible choice where the amount of data to be managed continues to increase. To help overcome such issues, a number of new systems have been designed to provide good horizontal scalability, unlike traditional database products which have comparatively minimal capacity to scale horizontally (Pokorny, 2011). Many of these new systems are referred to as "NoSQL" data stores (Cattell, 2011).
The NoSQL movement, where "NoSQL" stands for "Not Only SQL", came about because it was believed that the traditional relational database system was not an effective solution for all database requirements, particularly for databases concerned with processing large amounts of data with high scalability needs (Tauro, Patil, & Prashanth, 2013). Relational databases are often not well suited to particular operations essential to Big Data management. Firstly, due to a lack of scalability, large data solutions tend to be more expensive with relational databases. In certain circumstances grid solutions can improve this weakness; however, the creation of new clusters on the grid is not dynamic, hence the search for a more efficient solution. Furthermore, relational databases do not handle unstructured data search as well as one would wish, and the same applies to data appearing in unexpected formats. Additionally, it is strenuous to implement particular types of queries, such as the shortest path between two points, using SQL and relational databases (Cui, Mei, & Ooi, 2014).

Big Data and social networking organisations, including Facebook, Google, and Amazon, were the initial exponents of the idea that relational databases were not the most effective solution for the volumes and types of data that they were required to handle. Such limitations led to the development of the Hadoop Distributed File System, a reliable store for very large data sets (Shvachko, Kuang, Radia, & Chansler, 2010). Also developed as a consequence were the MapReduce programming model, designed for processing and generating large data sets with a parallel, distributed algorithm on a cluster, and associated NoSQL databases such as Cassandra and HBase (Dean & Ghemawat, 2008; Rabl et al., 2012). A fundamental element of the NoSQL concept is "shared nothing" horizontal scaling, a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. This enables the databases to support a greater number of basic read/write operations per second (Topor et al., 2009).
The area of SQL vs NoSQL has been well debated; however, such debates do not specifically cover graph databases, taking instead a broader view of the topic. Tauro, Patil, & Prashanth (2013) provided an overview, evaluation and analysis of various NoSQL databases. The conclusion they came to was that NoSQL databases were more effective than traditional relational database systems when there was a requirement to handle a large amount of data with high scalability, and that a NoSQL database is the most suitable solution where the database will be required to scale over time. Strauch & Kriha (2011) carried out a methodical examination of the purposes and justifications behind the NoSQL movement. Their conclusions supported the view presented in Tauro, Patil, & Prashanth (2013), that the need for high scalability is a primary reason for the development of NoSQL databases.

It is such scalability that presents us with the issue of 'complex data'. Bitner (2015) states that while there is no particular method of determining the level of complexity associated with data, there are two features that can allow data to be called complex: if you are working with Big Data, and if your data comes from varying data sources. Russom (2011) states that complexity in data can be broken into three separate categories, which he labels "the 3 Vs": velocity, volume and variety. Volume deals with the size of the data, velocity relates to the rate at which data must be processed, and variety deals with whether the data is structured, unstructured or semi-structured.

Unstructured data is schema-less and comes in various formats, ranging from social media posts and sensor data to email, images and web logs. It has become much more common with the increasing popularity of online services and information systems and the growing need to handle high data volumes (Lomotey & Deters, 2013). Studies show that unstructured data is growing at an unprecedented pace. It is estimated that more than 80% of all potentially useful business information is unstructured data (Gharehchopogh & Khalifelu, 2011).
The world creates 2.5 quintillion bytes of data per day from unstructured data sources such as sensors, social media posts and digital photos (Bajaj, 2014). Clearly, unstructured data is growing exponentially, and the need for it to be managed in the most efficient manner possible is of pivotal importance.

According to Leavitt (2010), relational databases are best suited to structured data, which readily fits into well-organised tables, but the opposite is the case when they are required to deal with unstructured data. He goes on to state that with relational databases users are required to convert all data into tables, and if the data does not readily fit into a table, the database's structure can be complex, difficult and slow to work with. Reeve (2012) supports this and is also of the opinion that relational databases are unsuitable for performing an efficient unstructured data search, and also questions their ability to handle data in unexpected formats well. Both also agree that it is difficult to implement some basic queries, such as the shortest path between two points, using SQL and relational databases (Reeve, 2012). The use of SQL with unstructured data proves problematic because it is designed to work with structured, relationally organised databases with fixed table information (Leavitt, 2010). On the other hand, Kuala et al. (2013) conducted a study to prove that relational databases can manipulate unstructured data. The fact that relational databases are now able to manipulate unstructured data tells us that database companies now recognise the important role this type of data will play in current and future environments.

The graph database is highly efficient in the processing of large, interrelated datasets (Rodriguez & Neubauer, 2010). The properties of its design allow the development of predictive models and the exposure of interactions and patterns (Shimpi & Chaudhari, 2012). This highly dynamic model provides a means for exceptionally fast traversals along edges and between vertices, because all nodes are linked together by relationships; traversals are therefore more localised and are not required to take any unrelated data into account, thus overcoming an implicit problem in SQL (Rodriguez & Neubauer, 2012).
Bachman (2013) even goes as far as stating that all but the simplest of graph queries would result in the use of a join operator in a relational database, the performance of which worsens exponentially as the size of the data increases. Another paper points out that because the graph data model stores all the relationships of the data along with the data itself, no further computation is required to join the data in order to extract the desired information (Mohamed Ali & Padma, 2016).

A comparative investigation of a NoSQL database and a relational database, with respect to their performance and scalability capabilities, was carried out by Hadjigeorgiou (2013), who investigated the performance and scalability of MongoDB, a document-orientated database, and MySQL, a relational database. The experiments involved running various numbers and types of queries, with the complexity of the query changing throughout experimentation to allow analysis of how the databases scaled with increased load. It was found that the NoSQL database could handle the more complex queries faster. The study does show, however, that in spite of the performance advantage MongoDB possesses over MySQL in dealing with complex queries, when the benchmark modelled the MySQL query in a similar way to the MongoDB complex query by utilising nested SELECTs, MySQL performed better, although at higher numbers of connections the two behaved similarly.

Oracle recently released a white paper, Unstructured Data Management with Oracle Database 12c, which emphasises the importance they place on keeping up with competitors in a highly competitive market. The white paper states that Oracle Database 12c has focused on dramatic performance improvements for unstructured data query and analysis. It claims that Oracle Database 12c allows you to store and query unstructured data efficiently, with highly efficient compression and, in many instances, query languages, semantics, and other mechanisms designed for specific data types. Oracle Database 12c supports specialised data types for many common forms of unstructured data.
This enables application developers, development tools and database utilities to interact with unstructured data with the same ease as with standard relational data.

While this literature review has provided a plethora of information on the rise of NoSQL, its efficiency and, in some cases, its significant advantages over relational databases when it comes to managing complex data, no such literature is available that refers specifically to the graph database. Due to the rising popularity of the graph database, it is deemed necessary to have a factual analysis and critical comparison to test the claims of graph database vendors, specifically Neo4j. This paper aims to achieve this, along with proving or disproving Oracle's claims of keeping up with competitors in the storage of complex data.
3 The Relational Database

3.1 Relational Data Model

The relational model was first proposed in 1970 by E.F. Codd as a new model for database systems. With its mathematical foundation, the relational model paved the way for modern database systems. It had a significant impact on everything from the theory to the actual development and implementation of database systems.

Information is retrieved from a database with the use of what is known as a query language. Query languages are typically higher level than general-purpose programming languages. In general, query languages fall into two categories, procedural and non-procedural, with their method of obtaining results the defining factor. With a procedural language, any query constructed by a user with the intention of retrieving information from the database needs to be a definitive chain of tasks on the database. With a non-procedural language, the user must only provide a description of the data needed; it is unnecessary to provide the system with a particular procedure. Relational algebra is representative of procedural languages, while relational calculus is an example of non-procedural languages (Silberschatz, Korth, & Sudarshan, 2011). Modern relational database systems, however, tend to integrate aspects of both the procedural and non-procedural methodologies.

3.1.1 The Relation

In order to define a relation, a domain and an attribute must first be defined. Let D1, D2, ..., Dn (n > 0) be n domains, with a domain being defined as a set of values of similar type. For each column in a dataset, a domain and a unique identifier are defined, and this pairing becomes known as an attribute. The Cartesian product D1 × D2 × ... × Dn is the set of all n-tuples (t1, t2, ..., tn) such that ti ∈ Di for all i.
If a relation is a subset of this Cartesian product, it can be defined on these n domains and is described as being of degree n (Codd, 1979). This theory forms the basis of the relational model, in that a relation is a set of rows, where each row is required to have the same set of attributes. In the relational model, a series of rows come together to form a table.

3.2 Relational Algebra

The basis of the relational model comes from the mathematical definition of a relation. Ullman (1995) describes a relation as "a subset of a Cartesian product of a list of domains". Using this logic, in an RDBMS the database table becomes the mathematical relation. The relational model's mathematical foundation provides a collection of algebraic operations that can manage tables in an RDBMS. Relational algebra is a procedural query language with five central operations: selection, projection, Cartesian product, union and set difference (Silberschatz, Korth, & Sudarshan, 2011). A number of other operations, such as inner join and natural join, can also be defined. All of the aforementioned operators serve to define or manipulate operands that are relations, with a new relation produced as the result of any operation carried out. More complex queries tend to be carried out by a sequence of these operators. When defining his relational model, E.F. Codd highlighted that the relational algebraic operators were a significant part of the definition, and could be considered as influential as the model itself (Codd, 1979).

For the purposes of relational algebra, a relation is defined as a set of n-tuples. It can therefore be concluded that the operations of relational algebra are the operations of set theory, working in collaboration with further operators that take into account the explicit nature of the relations of the dataset (Shenai, 1992). An extension of these relational algebra operations also helps to define further operators in the relational model (Silberschatz, Korth & Sudarshan, 2011; Ullman, 1995).
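As an illustrative sketch only (the two small domains below are chosen to match the actor data shown later in Figure 1), the definition of a relation as a subset of a Cartesian product can be made concrete as follows:

    \[ D_1 = \{\text{Keanu},\ \text{Carrie}\}, \qquad D_2 = \{1964,\ 1967\} \]
    \[ D_1 \times D_2 = \{(\text{Keanu},1964),\ (\text{Keanu},1967),\ (\text{Carrie},1964),\ (\text{Carrie},1967)\} \]
    \[ r = \{(\text{Keanu},1964),\ (\text{Carrie},1967)\} \subseteq D_1 \times D_2 \]

Here r is a relation of degree 2 on the domains D1 and D2: it retains only those tuples of the Cartesian product that pair each actor identifier with the correct year of birth.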
Codd (1970) outlined the three components of his relational model as: data structures to store data in a relational database, integrity constraints, and operations. These are explained in greater detail below.

3.3 Data Structure

The data structures used to store data are: the relation (table), attributes (columns), tuples (rows), the relation instance and the relation schema (table header). An attribute is described by the relational model as a <Name, Domain> pair, in which the domain consists of the collection of values and operators associated with the attribute. Figure 1 below provides an example of an attribute <Name, VARCHAR2>, with the attribute name being Name and the domain consisting of the collection of valid characters and operations of the VARCHAR2 attribute type. The relational model describes a relation schema as a <Name, Set of Attributes> pair. For example, Figure 1 is a table schema used to store information on a particular set of actors; it has the relation schema Actors (Actor_ID, Name, Born). In any given relation schema, a tuple is a mapping from every attribute of that relation schema into the domain of that attribute. A tuple is an element of a relation instance, with a relation instance being a subset of a Cartesian product of a list of domains, consisting of a fixed collection of tuples for a particular schema. A relation instance for the schema in Figure 1 can be described as {relation instance} = {tuple1, tuple2, tuple3}, where tuple1 = (Keanu, Keanu Reeves, 1964), tuple2 = (Carrie, Carrie-Ann Moss, 1967) and so on.

Figure 1: Relation Instance
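As a minimal sketch of how such a relation schema might be declared and populated in Oracle 12c (the column sizes below are assumptions made purely for illustration; the scripts actually used in this study are contained in the appendices):

    -- Relation schema Actors (Actor_ID, Name, Born) expressed as an Oracle table
    CREATE TABLE actors (
        actor_id VARCHAR2(30),   -- attribute <Actor_ID, VARCHAR2>
        name     VARCHAR2(100),  -- attribute <Name, VARCHAR2>
        born     NUMBER(4,0)     -- attribute <Born, NUMBER(4,0)>
    );

    -- Two tuples of the relation instance shown in Figure 1
    INSERT INTO actors (actor_id, name, born) VALUES ('Keanu', 'Keanu Reeves', 1964);
    INSERT INTO actors (actor_id, name, born) VALUES ('Carrie', 'Carrie-Ann Moss', 1967);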
3.4 Integrity Constraints

A collection of integrity constraints comprises the second component of the relational model. Such integrity constraints are necessary to ensure that the data stored within the RDBMS is valid. An RDBMS implements integrity constraints, meaning it allows only legal instances to be stored in the database. Integrity constraints are stated and enforced at particular times in each database instance (Asirelli, De Santis & Martelli, 1985). Firstly, when an end user describes a database schema, they stipulate the integrity constraints that are to be applied to any instance of the database. Secondly, when a database application is run, the database management system seeks violations and forbids changes to the data that breach the defined integrity constraints. Business rules are enforced as integrity constraints to ensure the database only supports valid data. Referring once more to the example in Figure 1, the attribute Born has a business rule associated with it that requires it to be an integer with a precision of four and a scale of zero. This constraint is implemented from the outset, when the data type and domain are defined. As well as domain constraints, there are other integrity constraints that can be specified in the relational model, such as key constraints, referential integrity constraints, NOT NULL constraints and general assertions (Khodaei, 2008).
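The following sketch shows how those constraint types could be expressed in Oracle 12c for the hypothetical actors table above; the movies and acted_in tables and the constraint names are assumptions introduced only for illustration, not the schema used in Chapter 5:

    -- Key, NOT NULL, domain (CHECK) and referential integrity constraints
    CREATE TABLE movies (
        movie_id VARCHAR2(30)  CONSTRAINT pk_movies PRIMARY KEY,
        title    VARCHAR2(200) NOT NULL
    );

    ALTER TABLE actors ADD CONSTRAINT pk_actors PRIMARY KEY (actor_id);
    ALTER TABLE actors ADD CONSTRAINT ck_actors_born CHECK (born BETWEEN 1800 AND 2100);

    CREATE TABLE acted_in (
        actor_id VARCHAR2(30) CONSTRAINT fk_acted_actor REFERENCES actors (actor_id),
        movie_id VARCHAR2(30) CONSTRAINT fk_acted_movie REFERENCES movies (movie_id)
    );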
3.5 Operations

Relational algebra and relational calculus provide the basis for performing operations in the relational model. Harrington (2003) explains how relational algebra provides operations on rows, a set at a time, in a single non-procedural operation. Relational operations can be considered on multiple levels. Firstly, we consider the basic operations of relational algebra, where pure mathematical abstractions form the foundation for reasoning about a relational database (Sumathi & Esakkirajan, 2007). Codd's model describes operations to manage data in the database by employing relational algebra (Silberschatz, Korth & Sudarshan, 2011; Codd, 1970). As described above, a relational database is a collection of relations, each of which conforms to a certain relation schema. The fundamental operations define what can be done to these relations. The principal relational algebra operations are apportioned into two categories: unary operations and binary operations (Sumathi & Esakkirajan, 2007). Unary operations take a single relation as an operand, whereas binary operations require two relation operands. There are five fundamental operators that provide relational algebra with the power to develop complex queries. Each of these operators can be thought of as a function mapping one or more relations into another relation. The primary database operations are selection, projection and join (Sumathi & Esakkirajan, 2007). Binary operations are join, union, set difference and Cartesian product.

3.5.1 Selection Operation

The selection operation performs on a single relation and defines a relation that holds certain rows of that relation, namely the rows that satisfy the declared condition. The syntax of the operation is σ predicate(R), where predicate refers to a condition and R is the relation. The selection condition may be any legally formed expression that includes constants, attribute names, and arithmetic and/or logical operators (Roman, 2002). An example of a selection operation on the table above would be as follows. If we apply the select operator to the relation above as:

σ actor_id = 'Keanu'(actor)

the result is a relation listing the actor with the Actor_ID of 'Keanu'. As Keanu acts as the primary key, the resulting tuple will appear as below:

Figure 2: Selection Operation
A further example, still working with the same table, would be if one wishes to select all actors born after a certain year. The relational algebra expression is as follows:

σ born > 1965(actor)

Figure 3: Selection Operation

In general, selection operations employ Boolean expressions to establish the logical attributes of the data to return.

3.5.2 Projection Operation

The projection operation performs on a single relation and defines a relation that holds a vertical subset of that relation, obtaining only the attributes stated and nothing more. In simplified terms, it is the 'column version' of selection, meaning projection filters by column rather than by row, as we saw above with selection. The syntax for projection is ∏<attribute list>(R), where once more R stands for the relation. To show an example of a projection we will introduce another table from our dataset, which consists of a critic, a movie and their rating of that movie.

Figure 4: Reviews Table

A projection on this table would be as follows:

∏ movie_id(reviews)

This projection will list all movies in the reviews table, as follows:

Figure 5: Projection Operation
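For reference, the SQL counterparts of these two operators, written against the hypothetical actors and reviews tables sketched earlier (illustrative only; the actual benchmark queries appear in Chapter 5 and the appendices), might look as follows:

    -- Selection: σ born > 1965(actor)
    SELECT * FROM actors WHERE born > 1965;

    -- Projection: ∏ movie_id(reviews)
    -- DISTINCT mirrors the set semantics of relational algebra, which never
    -- returns duplicate tuples
    SELECT DISTINCT movie_id FROM reviews;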
• 27. 18 A projection on this table would be as follows:
Π Movie_ID (reviews)
This projection will list all movies in the reviews table as follows:
Figure 5: Projection Operation
3.5.3 Cartesian Product
The Cartesian product is the first binary operation, denoted by a cross (×), and it is utilised where there is a need to retrieve information from more than one relation. The Cartesian product of two relations, r1 and r2, is written in relational algebra as r1 × r2. Fully qualified attribute names are required when defining the final relation scheme, attaching the name of the originating relation to each attribute in order to differentiate between r1.A and r2.A. If r1(A1, ..., An) and r2(A1, ..., An) are relations, then the Cartesian product r1 × r2 is a relation with a scheme comprising all fully qualified attributes of r1 and r2: (r1.A1, ..., r1.An, r2.A1, ..., r2.An). The rows of the Cartesian product are constructed by connecting every possible combination of rows: one from the r1 relation and one from the r2 relation. If r1 contains n1 rows and r2 contains n2 rows, then the Cartesian product contains n1 × n2 rows. Assume that two relations in the dataset used throughout contain only Keanu Reeves as a name, and the seven movies he features in, as shown below:
• 28. 19 Figure 6: Keanu Reeves Movies
Now we apply the Cartesian product operator to those relations to obtain our result:
actor × movie
The above expression produces a relation whose scheme is a concatenation of the actor scheme and the movie scheme. It should be taken into account that although the Cartesian product encompasses no more information than its individual components, it uses significantly more memory than the two original relations consume before they are combined. This is the primary reason why the Cartesian product is used for explanatory or conceptual purposes only (Codd, 1990); in real-life situations it is generally replaced by the natural join operator.
Figure 7: Cartesian Product
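In SQL, the Cartesian product corresponds to a CROSS JOIN (or, equivalently, listing both tables in the FROM clause with no join condition). The sketch below, written against the Appendix A schema, is illustrative only.
-- Cartesian product actor × movie: every row of ACTOR is paired with
-- every row of MOVIES, so the result has |ACTOR| * |MOVIES| rows.
SELECT *
FROM   ACTOR CROSS JOIN MOVIES;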
• 29. 20 3.5.4 Union
The union operation is a binary operation on two relations and is denoted by the symbol ∪. Given two relations x and y, a union operation on them is written as x ∪ y. The union of the two relations can only be performed if the relations have the same degree. In addition, the first attribute of x must be compatible with the first attribute of y, the second attribute of x must be compatible with the second attribute of y, and so on. The degree of the resulting relation is the same as the degree of the input relations.
3.5.5 Set Difference
The final fundamental operator in relational algebra is set difference. It is a binary operator, denoted by the symbol "–". In order for relations to be operated on by this operator they need to be union-compatible. If one wishes to retrieve values from one table that do not appear in another table, the expression r1 – r2 is used to achieve this. No duplicate tuples will be found in the query result. To take an example from our movie database, we can say that some actors are also directors, but not all. For simplicity, the example refers specifically to the movie 'Unforgiven'.
Figure 8: Directors and actors in Unforgiven
Applying the set difference operator leaves us with the following set of actors:
Figure 9: Set Difference Operation
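For completeness, the remaining operators can also be sketched in SQL against the Appendix A schema; these are illustrative equivalents rather than queries taken from the experiments, and the set-difference example is a simplified version of the Unforgiven case above (it ignores the movie filter).
-- Projection: only the Movie_ID column of Reviews. DISTINCT mirrors the
-- duplicate elimination performed by the relational projection operator.
SELECT DISTINCT Movie_ID FROM Reviews;

-- Union of two union-compatible column lists.
SELECT Name FROM ACTOR
UNION
SELECT Name FROM DIRECTOR;

-- Set difference: names that appear in ACTOR but not in DIRECTOR.
-- Oracle uses MINUS where the SQL standard uses EXCEPT.
SELECT Name FROM ACTOR
MINUS
SELECT Name FROM DIRECTOR;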
• 30. 21 4 Graph Databases
4.1 Introduction
A graph database is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with online transaction processing (OLTP) systems. Consequently, they are typically developed for transactional performance, and engineered with a primary focus on transactional integrity and operational availability (Robinson, Webber & Eifrem, 2015). Relationships are the most important aspect of the graph data model, a characteristic that distinguishes it from other database management systems. In the case of relational databases, one is required to deduce connections between entities using mechanisms such as foreign keys. Graph databases use the formal notion of a graph as the basis of their design. A graph can be described as simply being a collection of vertices and edges (Rodriguez & Neubauer, 2010). Translated to the case of graph databases, this becomes a set of nodes and the relationships that connect them. Graphs represent entities as nodes, and the ways in which those entities relate to the world are represented as relationships. By assembling the simple abstractions of nodes and relationships into connected structures, graph databases enable us to build practical and refined models that map closely to our problem domain. This results in models which are simpler to read and understand, whilst also managing to be far more articulate than those produced using traditional relational modelling (Robinson, Webber & Eifrem, 2015). Graph databases have been shown to be a powerful tool for modelling data when the relationships between entities are a fundamental element in the design of the data model (Shimpi & Chaudhari, 2012). Modelling objects and the relationships between them means virtually anything can be expressed in an associative graph. Diestel (2010) describes a graph structure as G = (V, E), where V = {v1, v2, v3, ..., vn} is a set of
• 31. 22 vertices and E = {e1, e2, e3, ..., en} is a set of edges. An edge ei ∈ E is defined by the triple (i, j, wi), where i, j ∈ V and wi is a positive real number. A directed edge is written as i → j. Graph database systems today are still a relatively new technology that is undergoing significant development. The majority of graph databases support the directed, attributed, labelled multi-graph, or property graph, a popular choice because it is supported by most systems and because it provides the ability to associate attributes with every node and relationship on top of the graph structure (Rodriguez & Neubauer, 2010). A visual representation of a property graph is depicted below in Figure 10.
Figure 10: The property graph
An advantage of the property graph is its expressiveness: all other graph types can be effectively modelled by the property graph, because they are subsets of the property graph implementation. The graph database is highly suitable and efficient for the powerful processing of substantial, interrelated datasets (Rodriguez & Neubauer, 2010). The properties of its design allow the development of predictive models, and the exposure of interactions and patterns (Shimpi & Chaudhari, 2012). This highly dynamic model provides a means for exceptionally fast traversals along edges and between vertices, because all nodes are linked together by relationships, resulting in more localised
• 32. 23 traversals which are not required to take any unrelated data into account, thus overcoming an implicit problem in SQL (Rodriguez & Neubauer, 2012). The property graph is an important prerequisite for calculating the weighted shortest path, which is implemented in the core of the graph database.
4.2 Data Model
The graph database data model is a relatively simple structure consisting of nodes, edges and labels. For example, take a social network, where the nodes will be people. Every node may have additional information attached such as name, address and so on. These nodes are connected by binary edges. An edge in the example of a social network could be 'friends with' or 'related to'. An edge may also have additional information attached. Figure 11 below shows the UML schema for the representation of graphs.
Figure 11: UML schema for graph data model, retrieved from Grust, T., Freytag, J., & Leser, U. (2016). Cost-based Optimization of Graph Queries in Relational Database Management Systems (Masters). University of Berlin.
4.2.1 Nodes
The node in a graph database is similar to a row or a record in the relational database. Nodes represent entities such as people, products, appointments, or any other item of data that requires storage. As with a row in the relational database, it is necessary for each node to be uniquely identifiable. In the UML above, the integer node_id could be used for this purpose.
4.2.2 Edges
A node may be connected to another node by an edge. The edge acts as a visual representation of the relationship between the nodes which it joins. It is the patterns that materialise when
• 33. 24 examining the connections and interconnections between nodes on a graph that make the graph database what it is. Edges are the concept exclusive to graph databases that distinguishes them from other data models, providing an abstraction that is not achievable in other systems.
4.2.3 Labels
Nodes are grouped into sets using labels. Any node tagged with a particular label belongs to the set of nodes that share that label. Labels are regularly used to query a database, adding to the efficiency of a search. Labels are not mandatory, so a node need not belong to any set; equally, a node can belong to a number of sets, meaning it is attributed a number of labels.
4.3 Terminology
Below is a simple undirected graph structure that will be used for the purposes of explaining the terminology used throughout this thesis. The nodes in this instance are simply A, B, C, D, E, F, G, H. Each edge is labelled with a number and, for the sake of explanation, this number will be used as the distance between two nodes.
Figure 12: A Simple Graph, retrieved from Burstein, M. (2013). Introduction to Graph Theory: Finding The Shortest Path. Max Burstein. Retrieved from http://www.maxburstein.com/blog/introduction-to-graph-theory-finding-shortest-path/
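In a property-graph database this structure would be stored as labelled nodes and relationships carrying the distance as a property. The Cypher fragment below is a minimal sketch of how two of the nodes and one edge of Figure 12 might be encoded; the :Node label, the CONNECTED_TO relationship type and the distance value are illustrative choices rather than part of the original figure, and the relationship is stored with a direction even though it can be traversed in both directions.
// Two nodes and one weighted edge of the example graph, expressed in Cypher.
// The stored direction can be ignored at query time for an undirected reading.
CREATE (a:Node {name: 'A'})
CREATE (b:Node {name: 'B'})
CREATE (a)-[:CONNECTED_TO {distance: 4}]->(b)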
• 34. 25 4.3.1 The Graph
A graph G = (V(G), E(G)) is a database record containing a collection of nodes V(G) and a collection of edges E(G), with E(G) ⊆ V(G) × V(G). Once the conditions of the graph are clear, it can simply be written as G = (V, E). Graphs may be either directed or undirected; the graph database focuses on a directed graph structure. In a directed graph, E contains ordered pairs of nodes. Consider (u, v) ∈ E, with u, v ∈ V and u ≠ v. In this case v is adjacent to u and it is said that u has the outgoing edge (u, v); thus u is the start node of (u, v), with (u, v) being the incoming edge of node v, which makes v the target node of (u, v). A parent-child relationship is formed, with u becoming the parent of the child v.
4.3.2 Graph size
From before, G = (V, E), and the size of G is determined by the number of nodes |V| plus the number of edges |E| that it contains; in mathematical terms |G| = |V| + |E|. The density of a graph is defined as the ratio of edges to nodes. Graphs are loosely split into two categories based on their density: dense or sparse. A graph is considered dense if the value of |E| is close to |V|², and any graph with a value of |E| that is considerably lower than |V|² can be considered sparse. The size of the above graph is 16 (8 nodes plus 8 edges).
4.3.3 Degree of a node
The degree of a node v ∈ V, deg(v), is the number of connections or edges associated with that node. Naturally, each node in a directed graph has two distinct values: one counting the edges for which v acts as the target node and the other counting those for which it acts as the start node. The degree of node F in the above example is 5. Nodes and edges are labelled in a graph database, meaning it is also necessary to define a label function for the nodes and edges. A label is made up of a type and a value.
4.3.4 Path and path length
• 35. 26 A sequence of nodes (v0, v1, v2, ..., vn), vi ∈ V, such that (vi-1, vi) ∈ E for i = 1, 2, ..., n, is described as a path, and the number of edges in the path is defined as the path length. A sequence is only a path if all the nodes are distinct. In the case of a directed graph where v0 = vn and n ≥ 2, the path becomes a cycle. A directed graph that does not contain any cycles is known as a directed acyclic graph (DAG). A node in a DAG may have numerous parent nodes. If it is the case that every node has at most one parent node, then due to the appearance of the graphical structure it is often referred to as a "tree".
4.3.4.1 Dijkstra's Algorithm
Dijkstra's algorithm is a graph search algorithm utilised in a graph database to solve the shortest path problem for a given graph G = (V, E). The solution is achieved by initially selecting a start node and an end node. The start node is immediately added to the list of solved nodes and assigned a value of 0, since from the outset it is a distance of 0 away from itself. We can then traverse breadth-first from this start node to its neighbours, calculating and recording the path length against each node. This process is repeated for every node that will be on the path between the start and end node. On finishing, the shortest path is found. Even though the conventional Dijkstra algorithm searches for the shortest paths from the start node to every other node, in Neo4j the algorithm locates the shortest path between the start node and the end node. This contributes to Neo4j's efficiency, as it is necessary to record the lengths of only a small subset of the possible paths through the graph in comparison to how many paths are theoretically available. When the length of a destination node is solved, this reveals the shortest path from the start node, from which all subsequent paths can be safely built.
4.4 Key Features of the Graph Database
4.4.1 Performance
When connected data is the primary concern, the graph database offers a significant increase in performance in comparison to the relational database. In the case
• 36. 27 of the relational database, difficulties often arise as the dataset increases in size, due to the fact that joins become increasingly expensive to compute. As dataset sizes grow, it becomes almost impossible to maintain database performance, as query times grow increasingly long. Join-intensive models are generally the result of an effort to solve a connection or path problem, but the mathematical foundations that form the basis of relational databases are not ideally suited to emulating path operations. Such difficulties cannot be associated with graph databases, as query times scale linearly with the amount of data the query touches rather than with the actual size of the dataset. Queries are localised to a section of the graph, resulting in an execution time that is proportional to the size of the section of the graph traversed to solve that query, rather than to the entire graph.
4.4.2 Flexibility
Due to the flexible nature of a graph, it is now possible for the database to advance at the same rate as the business which it serves. Graph databases are extremely comfortable with the addition of new data, a feature which enables the subtle addition of new kinds of relationships, new nodes, and new subgraphs to the current structure without negatively affecting any existing queries or application functionality. This is a significant advantage of the graph database, as it allows for some freedom during the modelling phase and reduces the risk that comes with the imprecise or inaccurate modelling associated with the relational database model. The graph database provides the ability for the structure and schema to develop in parallel with an advancing knowledge of the problem space. Another advantage of the flexibility of the graph database is that fewer database migrations will be required, which naturally reduces the risk of data loss and reduces business expenses.
4.5 Real Life Applications of the Graph Database
4.5.1 Google's page rank
• 37. 28 Page rank is an algorithm used by Google Search to organise websites in its search engine results by popularity. Google employs the graph model in organising the order in which search results are displayed. Here, the directed graph concept is applied: the web pages (nodes) are connected together by hyperlinks (edges). The number of outgoing edges from each page determines the weight assigned to its edges, and page rank is calculated from the weight of one edge in relation to the other edges (Wills, 2006).
4.5.2 Master data management
Major enterprises such as Cisco, StarHub and Polyvore are using Neo4j to reveal business value by taking advantage of the data relationships contained in their information. Cisco's entire sales organisation is modelled by a complex graph structure, and Neo4j provides the capabilities for real-time queries to be carried out on it. The data relationships in StarHub's product and customer data are used to significantly reduce the amount of time required for the company to assemble product bundles. Polyvore's extensive catalogue of items is managed by Neo4j, and the data relationships between the items help form real-time recommendations that are made available to its users (Carlsson, 2016).
4.5.3 Social networking
Due to the fact that social networks naturally take the form of a graph, it could be considered counterproductive and unnecessary to convert all the data and relationships into tabular format. Gamesys, a British betting and gaming website, wished to create a social network for its users and, after exploring several database options, decided that graph databases would be the most natural fit for its problem domain. One of the main factors considered when reaching the decision to opt for the graph database over a relational one was the issue of impedance mismatch. The existing graphical nature of the data would mean that queries would also be graph-orientated, and therefore the use of a relational model would result in substantial project
• 38. 29 cost and performance overhead. A graph database was preferred to fulfil requirements in both the operational and analytical environments (Nixon, 2015).
4.5.4 Telecommunication
Deutsche Telekom, CenturyLink and 3 are just some of the telecommunication companies that have turned to graph databases, and specifically to Neo4j, to model their networks of highly interconnected data. Telecommunications is all about connections, making graph databases a natural fit. A telecommunication company holds a vast amount of interconnected data in the form of plans, customers and groups, and graphs are useful when analysing networks and data centres. Graph databases are now an integral component in telecommunications companies' approach to managing the massive growth in that sector (Agricola, 2014).
4.5.5 Security and access management
Adobe's new Creative Cloud, powered by Neo4j, uses a graphical model to connect authentication details in order to permit access to content for all its clients. It also makes a new range of services available to its customers and facilitates the storage of vast amounts of connected data across the world, while providing high query performance (Tangen, 2012).
4.5.6 Bioinformatics
There are numerous reasons why this growing industry is turning to the graph database. Graphs containing a significant number of nodes and edges are everywhere in bioinformatics. Graph databases are a natural fit for the huge network of relationships between extremely large biological datasets. Also, due to the fact that much information in this industry remains unknown, the flexibility of the graph database makes it ideally suitable. Neo4j has been leveraged by several of the leading competitors in this market, including Curaspan Health Group, GoodStart Genetics and Janssen Pharmaceuticals, Inc. (Merkl Sasaki, 2016).
4.6 Neo4j
• 39. 30 Neo4j is currently the most popular graph database management system on the market (Van Bruggen, 2014). First released in 2007, it is an open-source project written completely in Java (Vicknair et al., 2010). It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j consists of two parts, a client and a server. The client is responsible for sending commands to the server, where they are processed before the results are returned to the client. It is claimed to be extremely scalable by its developers, with the possibility of several billion nodes on a single machine. Its API is very user friendly and helps facilitate efficient traversals. Neo4j is built using Apache's Lucene for indexing and search. Lucene is a text search engine, written in Java, geared toward high performance (DeCandia et al., 2007). Neo4j's graph model adheres to the property graph model introduced earlier, consisting of two primitive types, nodes and relationships, with added properties and labels. Nodes contain properties that are stored as arbitrary key-value pairs. The keys are strings in Neo4j, with the values being Java strings and primitive data types, along with arrays of these types. Nodes are then tagged with labels, whose responsibility it is to arrange the nodes in a manner which specifies their roles inside the dataset. The nodes are connected by associative relationships that form the graph structure. A relationship in Neo4j must consist of a start node and an end node, a direction and a name. The consistent presence of a start node and an end node ensures that there are no dangling relationships, while the presence of the direction proves useful when traversing the graph. As with nodes, relationships may also contain properties. This can prove quite useful in Neo4j for a number of reasons: further metadata is made available for graph algorithms; it provides a means for the addition of further semantics to relationships; and it also provides a means for constraining queries at runtime.
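To make the traversal model concrete, the Cypher fragment below sketches how the shortest path between two bound nodes of the example graph from Figure 12 might be requested. The :Node label and CONNECTED_TO relationship type follow the illustrative encoding used earlier; Cypher's built-in shortestPath() finds the unweighted shortest path, with weighted (Dijkstra-style) variants available through Neo4j's graph algorithm procedures.
// Bind the two end points, then ask for the shortest path between them,
// allowing up to 15 hops in either direction along CONNECTED_TO edges.
MATCH (a:Node {name: 'A'}), (h:Node {name: 'H'})
MATCH p = shortestPath((a)-[:CONNECTED_TO*..15]-(h))
RETURN p, length(p)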
• 40. 31 5 Comparative Analysis
5.1 Data Modelling
Data modelling is a technique, motivated by a particular requirement, that involves the conversion of a complex collection of data into a readily understandable diagram. It is a highly necessary practice, in which particular aspects of an unorganised domain are placed into a model in which they can be structured and manipulated. What differentiates graph database modelling techniques from their relational database equivalent is the similarity between the logical and physical models. With modelling in the relational database, one is required to veer away from what could be considered a natural representation of the domain. Initially the representation must be converted to a logical model, prior to being transformed into a physical model. Oftentimes, such conversions and transformations introduce semantic dissonance between what we perceive the dataset to be and how an instance of the dataset can be created in the relational model. With graph databases, this disparity is significantly reduced. To demonstrate this, our Movie dataset will be examined. The data model should be designed in a manner that permits us to store and query data efficiently. We will also want to be able to update the underlying model, as the dataset we have is liable to change.
5.1.1 Relational Modelling
The first stage of modelling a dataset in the world of relational databases is identifying the entities in the domain, how they are connected and the business rules involved. Below is the resulting logical model.
  • 41. 32 Figure 13: The Logical Model Having arrived at a suitable logical model, the next relational modelling step was to map that model to suitable tables and relationships. Here, data redundancy is eliminated through a process known as normalisation. This is the process that database architects employ to reduce database redundancy and cater for disk storage savings. It requires splitting off data elements that present themselves more than once in a relational database table into individual table structures. While a highly necessary step, it can also lead to additional complexity with the data model. This complexity is clearly evident in the entity relationship diagram in Figure 14. Such complexity arises due to the addition of foreign key constraints, necessary to support one-to- many relationships, and join tables such as actor_movie_role and producer_movie, which support many-to-many relationships. It is these join tables that cause the biggest problems. To query the database at a later date, these tables regularly need to be joined back together again. As data schemas start to become more complex, relational systems may become increasingly difficult to work with. Primarily, this difficulty arises from complex join operations, where users ask queries of the database that would require data to be retrieved from a number of
• 42. 33 different tables. These joins can become extremely complicated and resource intensive for the database management system.
Figure 14: The Relational Model
5.1.2 Graph Data Modelling
It is evident thus far that the series of steps involved in creating a relational model significantly adds to its design complexity and also moves the model away from the conceptual view that the stakeholders possess, towards a more chaotic and less understandable model. As the database changes and increases in complexity, the rigid schema of the relational design leads to a much less scalable model. This drawback results in a desire to develop a model that is closely aligned with the domain, maintains database performance and has high scalability, without sacrificing the integrity of the data. The graph database model can achieve this. In a similar way to drawing up the conceptual model in the relational world, the graph database requires that a domain model is sought and agreed upon. This is the final similarity between the two.
  • 43. 34 Rather than converting a domain model’s graph-like representation into tables, it is enhanced, with the aspiration to produce a precise representation of the elements of the domain applicable to the application goals. To achieve this, every entity is granted relevant roles as labels, attributes as properties, and connections as relationships. The resulting model is illustrated below. Figure 15: The Graph Data Model Taking a look inside the model it is easier to notice how the join tables are replaced. This is achieved by attaching properties to the relationships, for example ‘roles’ in the acted_in relationship.
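Figure 16 below illustrates this; as a textual sketch of the same idea, the roles property can be attached directly to the relationship in Cypher, where the Actor and Movie labels, the ACTED_IN type and the example values are illustrative assumptions rather than an exact reproduction of the model.
// Connect an existing actor to an existing movie and record the role
// played on the relationship itself, removing the need for a join table.
MATCH (a:Actor {name: 'Tom Hanks'}), (m:Movie {title: 'Cloud Atlas'})
CREATE (a)-[:ACTED_IN {roles: ['Zachry']}]->(m)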
• 44. 35 Figure 16: A look inside the Graph Data Model
In order to allow each node to fulfil its own particular data-centric domain obligations, we must make sure that each node has the necessary role-specific labels and properties. Relationships are added in order to connect nodes and provide the domain with a suitable degree of structure. These relationships are named, directed and regularly attributed. The join problem is something which the graph database prides itself on avoiding. The relationships that connect nodes together are effectively Neo4j's equivalent to the Cartesian product calculation over the full indices on the tables involved, required when querying a relational database. As a result, connecting data becomes as simple as traversing from one node to another. Complex questions that are so difficult to ask in the relational world are extremely simple, efficient and fast in a graph structure.
5.2 Querying The Database
• 45. 36 Now that the domain model is refined, a number of realistic queries will be asked to test its suitability for handling complex data. A query language can be described as a selection of operators that can be applied to any database instance, with the aim of manipulating and querying data in those structures. Relational database systems use a shared standard known as SQL (Structured Query Language) in order to query data. Similarly, graph databases employ a common querying approach known as the graph traversal. The required data is retrieved from the database by using these traversals. A graph traversal involves "walking" along the elements of a graph, and the traversal is an essential process for data retrieval. The primary distinction between a SQL query and a traversal is that traversals are localised operations, and therein lies their major advantage. Rather than utilising a global adjacency index, each vertex and edge in the graph stores a local index of the nodes connected to it. Thus, the size of the graph becomes irrelevant and the complex joins associated with a relational database are no longer necessary. This is not to say that global indexes do not exist in Neo4j, because as a matter of fact they do. Indexes are necessary to allow vertices to be readily retrieved based on their value, but they are only used when retrieving the starting point of a traversal. Traversals in Neo4j are carried out by Cypher. Cypher is a declarative graph query language, inspired by SQL, that aims to simplify query writing by avoiding the requirement to write traversals in code. Cypher provides a keyword-based structure, similar to SQL. This is a weakness when compared to the more mature SQL of the relational world, as the lack of consistency across graph query languages requires one to learn each implementation before understanding which approach is most suitable to the problem. In Cypher, a query begins at one or more known starting points on the graph; these points are referred to as bound nodes. Cypher applies the labels and property predicates that are provided in the MATCH and WHERE clauses, in conjunction with the information supplied by indexes and constraints, to locate the starting points which support our graph patterns.
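A minimal sketch of such a starting point, assuming a Movie label with a title property as in the graph model above, might look as follows; the label and property predicate identify the bound node, and an index on the title property, if one exists, is consulted only at this point.
// Locate the bound starting node via its label and property predicate,
// then return it; any traversal would begin from this node.
MATCH (m:Movie)
WHERE m.title = 'Cloud Atlas'
RETURN m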
  • 46. 37 In order to investigate the efficiency and power of the query in each database we will take some examples from our Movie database. 5.2.1 Querying the Relational Database The first query we will look at is one which seeks to find all actors, directors, producers, writers and critics along with their reviews and ratings for the movie ‘Cloud Atlas’. Firstly, we will take a look at this query in the relational database. Figure 17: SQL Query
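The query itself appears as an image in the original document; as a rough sketch (not the exact query of Figure 17), gathering the people connected to 'Cloud Atlas' from the Appendix A schema might look like the following, where the producer and writer branches are omitted because they follow the same join pattern as the actor branch.
-- Actors in 'Cloud Atlas' (two joins through the actor_movie_role join table).
SELECT a.Name AS person, 'ACTED_IN' AS relationship, CAST(NULL AS NUMBER) AS rating
FROM   MOVIES m
JOIN   ACTOR_MOVIE_ROLE amr ON amr.Movie_ID = m.Movie_ID
JOIN   ACTOR a              ON a.Actor_ID   = amr.Actor_ID
WHERE  m.Title = 'Cloud Atlas'
UNION ALL
-- Directors of the same movie.
SELECT d.Name, 'DIRECTED', CAST(NULL AS NUMBER)
FROM   MOVIES m
JOIN   DIRECTOR_MOVIE dm ON dm.Movie_ID   = m.Movie_ID
JOIN   DIRECTOR d        ON d.Director_ID = dm.Director_ID
WHERE  m.Title = 'Cloud Atlas'
UNION ALL
-- Critics and their ratings from the Reviews table.
SELECT c.Name, 'REVIEWED', r.Rating
FROM   MOVIES m
JOIN   Reviews r ON r.Movie_ID  = m.Movie_ID
JOIN   CRITIC c  ON c.Critic_ID = r.Critic_ID
WHERE  m.Title = 'Cloud Atlas';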
• 47. 38 What is immediately clear is the number of joins necessary in order to achieve the desired results. A full 'map' highlighting how this query works its way through the database is outlined below in Figure 18.
Figure 18: SQL Query 'Join Map'
In this instance each broken arrow is a join path the query takes. The query first searches Movies to find Cloud Atlas, then is required to search all the join tables (actor_movie_role, director_movie, producer_movie, writer_movie) in order to find the names of the actors, directors, producers, writers and critics associated with the movie Cloud Atlas. The ten joins required in this query are manageable and results can be readily achieved; however, as we traverse further, additional joins will have to be computed and this will further hinder query times. Below is a graph that shows the results when the database was searched for all the people associated with one, five, eight, fifteen and finally all thirty-eight movies.
• 48. 39 Figure 19: Oracle 12c Query Times (query time in seconds by number of movies queried: 1 movie, 0.019 s; 5 movies, 0.055 s; 8 movies, 0.102 s; 15 movies, 0.165 s; 38 movies, 0.203 s)
The decrease in performance is due to the fact that each join requires an index lookup for each movie, actor, director, producer, writer and critic. Each index lookup adds overhead, and the cumulative cost can become extremely difficult for the database to handle. As can be seen, such difficulties grow with the number of lookups required. A graphic of the index lookups required for querying one movie is shown in Figure 20. The number of traversals, even in this relatively simple query, is immediately obvious.
  • 49. 40 Figure 20: Relational Database Index Lookups
• 50. 41 5.2.2 Querying the Graph Database
The following is the same query in the Cypher query language:
Figure 21: Cypher Query
Only one index lookup is required here in order to find the start node. From there the query traverses the necessary relationships in order to find the required data.
Figure 22: Graph Data Index Lookup
The diagram above clearly indicates the need for only one index lookup, the main action associated with poor performance in the relational database. The dashed orange lines are representative of traversals, with each traversal returning the required data.
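As with the SQL example, the Cypher query itself appears as an image in the original; a rough sketch of how the same information might be requested, assuming relationship properties such as rating on the review relationships as in the graph model above, is shown below.
// One index lookup binds the movie node; everything else is local traversal.
MATCH (m:Movie {title: 'Cloud Atlas'})<-[r]-(p)
RETURN p.name  AS person,
       type(r) AS relationship,
       r.rating AS rating   // populated only on review relationships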
  • 51. 42 Figure 23 and Figure 24 below show the data output in Neo4j form. Figure 23: Neo4j Data Output Figure 24: Neo4j Graphical Output
• 52. 43 As with earlier tests of the relational database, the graph database was then required to search for all persons involved in five, eight, fifteen and thirty-eight movies. The query times are illustrated below:
Figure 25: Neo4j Query Times (query time in seconds by number of movies queried: 1 movie, 0.014 s; 5 movies, 0.022 s; 8 movies, 0.026 s; 15 movies, 0.027 s; 38 movies, 0.032 s)
As illustrated, the increase in query time is significantly less than it was in the relational database when both instances were tested under increased load.
Figure 26: Neo4j vs Oracle 12c (query time in seconds for 1, 5, 8, 15 and 38 movies: Neo4j 0.014, 0.022, 0.026, 0.027, 0.032; Oracle 12c 0.019, 0.055, 0.102, 0.165, 0.203)
• 53. 44 6 Conclusion
This thesis was an investigation and analysis of the performance of Relational Database Management Systems and Graph Databases, with the aim of discovering how more complex data is handled in each database and whether graph database technology is more suitable than its widely used relational counterpart. Relational databases were designed with structure in mind and developed in a strict tabular format which conformed to a pre-specified schema. The foundation of their design is the database schema, which provides a logical view of the database alongside the relations between tables. In comparison with the graph database, significantly more work was required in order to get the dataset to fit. Numerous join tables, so often dreaded, were required in order to incorporate all the many-to-many relationships that appeared in the logical model. In comparison with this, the graphical model was extremely straightforward: by labelling each node and attaching properties to relationships as required, the logical model effectively became the data model. When it came to querying the database, again the graph database came out on top. The first thing to be realised when carrying out the analysis was the importance of a fundamental concept in both database types, the use of indexes. In the relational database, indexes are expensive but necessary tools, utilised to quickly find the desired records within the data using either a foreign key or a primary key. When two or more tables are joined, the indexes on both tables need to be scanned completely and recursively to locate all the data elements matching the query specification. This is why performing joins is so computationally expensive. This is also where graph databases excel: they are extremely fast for join-intensive queries. With a graph database, the index on the data is used only at the beginning of the query, when locating the start node. Once you have the starting nodes, you can simply "walk the network" and find the next data element by traversing along the relationships without the need for further index lookups. This is known as "index-free adjacency" and it is a fundamental concept in graph databases. It
  • 54. 45 can be concluded that graph databases better manage complex data than the relational database in cases when the data contains many relationships that require traversing multiple relationships.
  • 55. 46 7 References Agrawal, D., El Abbadi, A., Das, S., & Elmore, A. (2011). Database Scalability, Elasticity, and Autonomy in the Cloud. Database Systems For Advanced Applications, 2-15. http://dx.doi.org/10.1007/978-3-642-20149-3_2 Agricola, A. (2014). World’s Leading Telcos Turn to Neo4j. Neo4j News. Retrieved from https://neo4j.com/news/worlds-leading-telcos-turn-neo4j/ Amann, B. & Scholl, M. (1992). Gram. Proceedings Of The ACM Conference On Hypertext - ECHT '92. http://dx.doi.org/10.1145/168466.168527 Andries, M., Gemis, M., Paredaens, J., Thyssens, I., & Van den Bussche, J. (1992). Concepts for graph-oriented object manipulation. Advances In Database Technology — EDBT '92, 21- 38. http://dx.doi.org/10.1007/bfb0032421 Angles, R. & Gutierrez, C. (2008). Survey of graph database models. CSUR, 40(1), 1-39. http://dx.doi.org/10.1145/1322432.1322433 Asirelli, P., De Santis, M., & Martelli, M. (1985). Integrity constraints in logic databases. The Journal Of Logic Programming, 2(3), 221-232. http://dx.doi.org/10.1016/0743- 1066(85)90020-2 Bachman, C. (1972). The evolution of storage structures. Communications Of The ACM, 15(7), 628-634. http://dx.doi.org/10.1145/361454.361495 Bachman, M. (2013). GraphAware: Towards Online Analytical Processing in Graph Databases (MSc Degree in Computing (Distributed Systems). Imperial College London Department of Computing. Bajaj, R. (2014). Big Data – The New Era of Data. (IJCSIT) International Journal Of Computer Science And Information Technologies, 5(2), 1875-1885.
  • 56. 47 Barabási, A. & Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439), 509-512. http://dx.doi.org/10.1126/science.286.5439.509 Bitner, S. (2015). The future of Business Analytics is in complex data. Big Data Made Simple - One source. Many perspectives.. Retrieved 11 June 2016, from http://bigdata- madesimple.com/future-business-analytics-complex-data/ Böhnlein, M. & vom Ende, A. (1999). XML — Extensible Markup Language. Wirtschaftsinf, 41(3), 274-276. http://dx.doi.org/10.1007/bf03254940 Bonnet, L., Laurent, A., Sala, M., Laurent, B., & Sicard, N. (2011). Reduce, You Say: What NoSQL Can Do for Data Aggregation and BI in Large Repositories. 2011 22Nd International Workshop On Database And Expert Systems Applications. http://dx.doi.org/10.1109/dexa.2011.71 Bressan, S. & Catania, B. (2005). Introduction to database systems. Singapore: McGraw-Hill. Broder, A. (2002). A taxonomy of web search. ACM SIGIR Forum, 36(2), 3. http://dx.doi.org/10.1145/792550.792552 Brunner, R. (2006). The basics of relational database systems. Developing with Apache Derby -- Hitting the Trifecta: Database development with Apache Derby, Part 2. Retrieved from http://www.ibm.com/developerworks/library/os-ad-trifecta3/ Burstein, M. (2013). Introduction to Graph Theory: Finding The Shortest Path. Max Burstein. Retrieved from http://www.maxburstein.com/blog/introduction-to-graph-theory-finding- shortest-path/ Carlsson, T. (2016). Neo4j Powers Master Data Management (MDM) Applications for Enterprises Across the Globe. Yahoo Finance. Retrieved from http://finance.yahoo.com/news/neo4j-powers-master-data-management-153517563.html
  • 57. 48 Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12. http://dx.doi.org/10.1145/1978915.1978919 Chen, P. (1976). The entity-relationship model---toward a unified view of data. ACM Transactions On Database Systems, 1(1), 9-36. http://dx.doi.org/10.1145/320434.320440 Codd, E. (1970). A relational model of data for large shared data banks. Communications Of The ACM,13(6), 377-387. http://dx.doi.org/10.1145/362384.362685 Codd, E. (1979). Extending the database relational model to capture more meaning. ACM Transactions On Database Systems, 4(4), 397-434. http://dx.doi.org/10.1145/320107.320109 Codd, E. (1990). The relational model for database management. Reading, Mass.: Addison- Wesley. Cook, W. (2009). On understanding data abstraction, revisited. ACM SIGPLAN Notices, 44(10), 557. http://dx.doi.org/10.1145/1639949.1640133 Cui, B., Mei, H., & Ooi, B. (2014). Big data: the driver for innovation in databases. National Science Review, 1(1), 27-30. http://dx.doi.org/10.1093/nsr/nwt020 DAVIDSON, S., OVERTON, C., & BUNEMAN, P. (1995). Challenges in Integrating Biological Data Sources. Journal Of Computational Biology, 2(4), 557-572. http://dx.doi.org/10.1089/cmb.1995.2.557 Dean, J. & Ghemawat, S. (2008). MapReduce. Communications Of The ACM, 51(1), 107. http://dx.doi.org/10.1145/1327452.1327492 DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., & Pilchin, A. et al. (2007). Dynamo. ACM SIGOPS Operating Systems Review, 41(6), 205. http://dx.doi.org/10.1145/1323293.1294281
  • 58. 49 Dewey, M., Mitchell, J., Beall, J., Matthews, W., & New, G. (1996). Dewey decimal classification and relative index. Albany, N.Y.: Forest Press, a division of OCLC Online Computer Library Center. Diestel, R. (2010). Graph theory. Berlin: Springer. Gemis, M. & Paredaens, J. (1993). An object-oriented pattern matching language. Lecture Notes In Computer Science, 339-355. http://dx.doi.org/10.1007/3-540-57342-9_82 Gemis, M., Paredaens, J., Thyssens, I., & Van den Bussche, J. (1993). GOOD. Proceedings Of The 1993 ACM SIGMOD International Conference On Management Of Data - SIGMOD '93. http://dx.doi.org/10.1145/170035.171533 Gharehchopogh, F. & Khalifelu, Z. (2011). Analysis and evaluation of unstructured data: text mining versus natural language processing. 2011 5Th International Conference On Application Of Information And Communication Technologies (AICT). http://dx.doi.org/10.1109/icaict.2011.6111017 Graves, M., Bergeman, E., & Lawrence, C. (1995). A graph-theoretic data model for genome mapping databases. Proceedings Of The Twenty-Eighth Hawaii International Conference On System Sciences, Vol.5. http://dx.doi.org/10.1109/hicss.1995.375353 Grust, T., Freytag, J., & Leser, U. (2016). Cost-based Optimization of Graph Queries in Relational Database Management Systems (Masters). University of Berlin. Gutiérrez, A., Pucheral, P., Steffen, H., & Thévenin, J. (1994). Database Graph Views: A Practical Model to Manage Persistent Graphs. International Conference On Very Large Databases, 94, 391-402. Güting, R. (1994). GraphDB: Modeling and Querying Graphs in Databases. Proceedings Of The 20Th International Conference On Very Large Data Bases, 297-308.
  • 59. 50 Harrington, J. (2003). SQL clearly explained. (3rd ed.). Elsevier. Hidders, J. (2002). Typing Graph-Manipulation Operations. Lecture Notes In Computer Science, 394-409. http://dx.doi.org/10.1007/3-540-36285-1_26 Hidders, J. & Paredaens, J. (1994). Goal, a Graph-Based Object and Association Language. Advances In Database Systems, 247-265. http://dx.doi.org/10.1007/978-3-7091- 2704-9_13 Kahate, A. (2004). Introduction to database management systems. Delhi, India: Pearson Education (Singapore). Kaur, K. & Rani, R. (2013). Modeling and querying data in NoSQL databases. 2013 IEEE International Conference On Big Data. http://dx.doi.org/10.1109/bigdata.2013.6691765 Khodaei, M. (2008). Case Study: Implementation of Integrity Constraints in Actual Database Systems(MSc Electrical Engineering and Information Technology). Czech Technical University in Prague. Kiesel, N., Schuerr, A., & Westfechtel, B. (1995). Gras, a graph-oriented (software) engineering database system. Information Systems, 20(1), 21-51. http://dx.doi.org/10.1016/0306-4379(95)00002-l Klyne, G. & Carroll, J. (2006). Resource description framework (RDF): Concepts and abstract syntax. Kunii, H. (1987). DBMS with graph data model for knowledge handling. Proceedings Of The 1987 Fall Joint Computer Conference On Exploring Technology: Today And Tomorrow, 138- 142. Retrieved from http://dl.acm.org/citation.cfm?id=42071&CFID=776948696&CFTOKEN=76722333
  • 60. 51 Kuper, G. & Vardi, M. (1984). A new approach to database logic. Proceedings Of The 3Rd ACM SIGACT-SIGMOD Symposium On Principles Of Database Systems - PODS '84. http://dx.doi.org/10.1145/588011.588026 Leavitt, N. (2010). Will NoSQL Databases Live Up to Their Promise?. Computer, 43(2), 12- 14. http://dx.doi.org/10.1109/mc.2010.58 Lecluse, C., Richard, P., & Velez, F. (1988). O2, an object-oriented data model. ACM SIGMOD Record,17(3), 424-433. http://dx.doi.org/10.1145/971701.50253 Levene, M. & Poulovassilis, A. (1990). The hypernode model and its associated query language.Proceedings Of The 5Th Jerusalem Conference On Information Technology, 1990. 'Next Decade In Information Technology'. http://dx.doi.org/10.1109/jcit.1990.128324 Lomotey, R. & Deters, R. (2013). Unstructured data extraction in distributed NoSQL. 2013 7Th IEEE International Conference On Digital Ecosystems And Technologies (DEST). http://dx.doi.org/10.1109/dest.2013.6611347 Mainguenaud, M. & Simatic, X. (1992). A data model to deal with multi-scaled networks. Computers, Environment And Urban Systems, 16(4), 281-288. http://dx.doi.org/10.1016/0198-9715(92)90009-g McGuinness, D. & Van Harmelen, F. (2004). OWL web ontology language overview. W3C Recommendation, 10(10). Merkl Sasaki, B. (2016). Neo4j Graph Database Powers the Healthcare Sector. Neo4j News. Retrieved from https://neo4j.com/news/neo4j-graph-database-powers-healthcare-sector/ Miler, M., Medak, D., & Odobašić, D. (2014). A shortest path algorithm performance comparison in graph and relational database on a transportation network. PROMET - Traffic&Transportation,26(1). http://dx.doi.org/10.7307/ptt.v26i1.1268
  • 61. 52 Mohamed Ali, N. & Padma, D. (2016). Graph Database: A Contemporary Storage Mechanism for Connected Data. International Journal Of Advanced Research In Computer And Communication Engineering, 5(3). http://dx.doi.org/10.17148/IJARCCE.2016.53220 Nixon, K. (2015). How Gamesys Harnessed Neo4j for Competitive Advantage. Neo4j Blog. Retrieved from https://neo4j.com/blog/gamesys-neo4j-competitive-advantage/ Paredaens, J., Peelman, P., & Tanca, L. (1995). G-Log: a graph-based query language. IEEE Trans. Knowl. Data Eng., 7(3), 436-453. http://dx.doi.org/10.1109/69.390249 Peckham, J. & Maryanski, F. (1988). Semantic data models. CSUR, 20(3), 153-189. http://dx.doi.org/10.1145/62061.62062 Pokorny, J. (2011). NoSQL databases. Proceedings Of The 13Th International Conference On Information Integration And Web-Based Applications And Services - Iiwas '11. http://dx.doi.org/10.1145/2095536.2095583 Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proc. VLDB Endow., 5(12), 1724-1735. http://dx.doi.org/10.14778/2367502.2367512 Reeve, A. (2012). Big Data and NoSQL: The Problem with Relational Databases | InFocus. InFocus. Retrieved 11 June 2016, from https://infocus.emc.com/april_reeve/big- data-and-nosql-the-problem-with-relational-databases/ Rodriguez, M. & Neubauer, P. (2010). Constructions from dots and lines. Bulletin Of The American Society For Information Science And Technology, 36(6), 35-41. http://dx.doi.org/10.1002/bult.2010.1720360610
  • 62. 53 Rodriguez, M. & Neubauer, P. (2012). The Graph Traversal Pattern. Techniques And Applications, 29-46. http://dx.doi.org/10.4018/978-1-61350-053-8.ch002 Roman, S. (2002). Access database design and programming. Sebastopol [CA]: O'Reilly. Roussopoulos, N. & Mylopoulos, J. (1975). Using semantic networks for data base management.Proceedings Of The 1St International Conference On Very Large Data Bases - VLDB '75. http://dx.doi.org/10.1145/1282480.1282490 Russom, P. (2011). Big data analytics (pp. 1-35). 1105 Media, Inc. Selvarani, S. & Sadhasivam, G. (2010). Improved cost-based algorithm for task scheduling in cloud computing. 2010 IEEE International Conference On Computational Intelligence And Computing Research. http://dx.doi.org/10.1109/iccic.2010.5705847 Shenai, K. (1992). Introduction to database and knowledge-base systems. Singapore: World Scientific. Shimpi, D. & Chaudhari, S. (2012). An overview of Graph Databases. International Journal Of Computer Applications® (IJCA) (0975 – 8887), 18-22. Retrieved from http://research.ijcaonline.org/icrtitcs2012/number3/icrtitcs1351.pdf Shipman, D. (1981). The functional data model and the data languages DAPLEX. ACM Transactions On Database Systems, 6(1), 140-173. http://dx.doi.org/10.1145/319540.319561 Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop Distributed File System. 2010 IEEE 26Th Symposium On Mass Storage Systems And Technologies (MSST). http://dx.doi.org/10.1109/msst.2010.5496972 Silberschatz, A., Korth, H., & Sudarshan, S. (2011). Database system concepts. New York: McGraw-Hill.
  • 63. 54 Strauch, C. & Kriha, W. (2011). NoSQL Databases. Selected Topics On Software-Technology Ultra-Large Scale Sites Hochschule Der Medien, Stuttgart. Sumathi, D. & Esakkirajan, S. (2007). Fundamentals of Relational Database Management Systems. Springer Berlin Heidelberg. Tangen, J. (2012). Companies Worldwide Flock to Neo4j to Make Their Applications Social. Neo Technology. Retrieved from http://www.marketwired.com/press- release/companies-worldwide-flock-to-neo4j-to-make-their-applications-social-1634667.htm Tauro, C., Patil, B., & Prashanth, K. (2013). A Comparative Analysis of Different NoSQL Databases on Data Model, Query Model and Replication Model. EREICA. Topor, R., Salem, K., Gupta, A., Goda, K., Gehrke, J., & Palmer, N. et al. (2009). Shared- Nothing Architecture. Encyclopedia Of Database Systems, 2638-2639. http://dx.doi.org/10.1007/978-0-387-39940-9_1512 Ullman, J. (1995). Principles of database and knowledge-base systems. Rockville, Md: Computer Science Press. Van Bruggen, R. (2014). Learning Neo4j. Birmingham, UK: Packt Pub. Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., & Wilkins, D. (2010). A comparison of a graph database and a relational database. Proceedings Of The 48Th Annual Southeast Regional Conference On - ACM SE '10. http://dx.doi.org/10.1145/1900008.1900067 Wills, R. (2006). Google’s pagerank. The Mathematical Intelligencer, 28(4), 6-11. http://dx.doi.org/10.1007/bf02984696
  • 64. 55 8 Appendix 8.1 Appendix A ALTER TABLE ACTOR_MOVIE_ROLE DROP CONSTRAINT ACTOR_MOVIE_ROLE_ACTOR_FK; ALTER TABLE ACTOR_MOVIE_ROLE DROP CONSTRAINT ACTOR_MOVIE_ROLE_MOVIES_FK; ALTER TABLE CRITIC_FOLLOWS DROP CONSTRAINT CRITIC_FOLLOWS_CRITIC_FK; ALTER TABLE CRITIC_FOLLOWS DROP CONSTRAINT CRITIC_FOLLOWS_CRITIC_FKv1; ALTER TABLE PRODUCER_MOVIE DROP CONSTRAINT PRODUCER_MOVIE_MOVIES_FK; ALTER TABLE PRODUCER_MOVIE DROP CONSTRAINT PRODUCER_MOVIE_PRODUCER_FK; ALTER TABLE DIRECTOR_MOVIE DROP CONSTRAINT DIRECTOR_MOVIE_DIRECTOR_FK; ALTER TABLE DIRECTOR_MOVIE DROP CONSTRAINT DIRECTOR_MOVIE_MOVIES_FK; ALTER TABLE Reviews DROP CONSTRAINT Reviews_CRITIC_FK; ALTER TABLE Reviews DROP CONSTRAINT Reviews_MOVIES_FK; ALTER TABLE WRITER_MOVIE DROP CONSTRAINT WRITER_MOVIE_MOVIES_FK;
  • 65. 56 ALTER TABLE WRITER_MOVIE DROP CONSTRAINT WRITER_MOVIE_WRITER_FK; DROP TABLE ACTOR; DROP TABLE MOVIES; DROP TABLE ACTOR_MOVIE_ROLE; DROP TABLE DIRECTOR; DROP TABLE DIRECTOR_MOVIE; DROP TABLE PRODUCER; DROP TABLE PRODUCER_MOVIE; DROP TABLE WRITER; DROP TABLE WRITER_MOVIE; DROP TABLE CRITIC; DROP TABLE CRITIC_FOLLOWS; DROP TABLE Reviews; CREATE TABLE ACTOR ( Actor_ID VARCHAR2 (50) NOT NULL , Name VARCHAR2 (50) NOT NULL , Born NUMBER (4)
  • 66. 57 ) ; ALTER TABLE ACTOR ADD CONSTRAINT Actor_PK PRIMARY KEY ( Actor_ID ) ; CREATE TABLE ACTOR_MOVIE_ROLE ( Actor_ID VARCHAR2 (50) NOT NULL , Movie_ID VARCHAR2 (225) NOT NULL , Role_ID VARCHAR2 (50) NOT NULL ) ; ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT ACTOR_MOVIE_ROLE_PK PRIMARY KEY ( Actor_ID, Movie_ID, Role_ID ) ; CREATE TABLE CRITIC ( Critic_ID VARCHAR2 (50) NOT NULL , Name VARCHAR2 (50) ) ; ALTER TABLE CRITIC ADD CONSTRAINT CRITIC_PK PRIMARY KEY ( Critic_ID ) ;
  • 67. 58 CREATE TABLE CRITIC_FOLLOWS ( Critic_ID VARCHAR2 (50) NOT NULL , Following VARCHAR2 (50) NOT NULL ) ; ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT CRITIC_FOLLOWS_PK PRIMARY KEY ( Critic_ID, Following ) ; CREATE TABLE DIRECTOR ( Director_ID VARCHAR2 (50) NOT NULL , Name VARCHAR2 (50) NOT NULL , Born NUMBER (4) ) ; ALTER TABLE DIRECTOR ADD CONSTRAINT DIRECTOR_PK PRIMARY KEY ( Director_ID ) ; CREATE TABLE DIRECTOR_MOVIE
  • 68. 59 ( Director_ID VARCHAR2 (50) NOT NULL , Movie_ID VARCHAR2 (225) NOT NULL ) ; ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT DIRECTOR_ID_PK PRIMARY KEY ( Director_ID, Movie_ID ) ; CREATE TABLE MOVIES ( Movie_ID VARCHAR2 (225) NOT NULL , Title VARCHAR2 (225) , Released NUMBER (4) NOT NULL , Tagline VARCHAR2 (4000) ) ; ALTER TABLE MOVIES ADD CONSTRAINT MOVIES_PK PRIMARY KEY ( Movie_ID ) ; CREATE TABLE PRODUCER (
  • 69. 60 Producer VARCHAR2 (50) NOT NULL , Name VARCHAR2 (50) , Born NUMBER (4) ) ; ALTER TABLE PRODUCER ADD CONSTRAINT PRODUCER_PK PRIMARY KEY ( Producer ) ; CREATE TABLE PRODUCER_MOVIE ( Producer_ID VARCHAR2 (50) NOT NULL , Movie_ID VARCHAR2 (225) NOT NULL ) ; ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT PRODUCER_MOVIE_PK PRIMARY KEY ( Producer_ID, Movie_ID ) ; CREATE TABLE Reviews ( Critic_ID VARCHAR2 (50) NOT NULL , Movie_ID VARCHAR2 (225) NOT NULL ,
  • 70. 61 Rating NUMBER (5,2) , Summary VARCHAR2 (4000) ) ; ALTER TABLE Reviews ADD CONSTRAINT Reviews_PK PRIMARY KEY ( Critic_ID, Movie_ID ) ; CREATE TABLE WRITER ( Writer_ID VARCHAR2 (50) NOT NULL , Name VARCHAR2 (50) , Born NUMBER (4) ) ; ALTER TABLE WRITER ADD CONSTRAINT WRITER_PK PRIMARY KEY ( Writer_ID ) ; CREATE TABLE WRITER_MOVIE ( Writer_ID VARCHAR2 (50) NOT NULL , Movie_ID VARCHAR2 (50) NOT NULL
  • 71. 62 ) ; ALTER TABLE WRITER_MOVIE ADD CONSTRAINT WRITER_MOVIE_PK PRIMARY KEY ( Writer_ID , Movie_ID ) ; ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT ACTOR_MOVIE_ROLE_ACTOR_FK FOREIGN KEY ( Actor_ID ) REFERENCES ACTOR ( Actor_ID ) ON DELETE CASCADE ; ALTER TABLE ACTOR_MOVIE_ROLE ADD CONSTRAINT ACTOR_MOVIE_ROLE_MOVIES_FK FOREIGN KEY ( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON DELETE CASCADE ; ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT CRITIC_FOLLOWS_CRITIC_FK FOREIGN KEY ( Critic_ID ) REFERENCES CRITIC ( Critic_ID ) ON DELETE CASCADE ; ALTER TABLE CRITIC_FOLLOWS ADD CONSTRAINT CRITIC_FOLLOWS_CRITIC_FKv1 FOREIGN KEY ( Following ) REFERENCES CRITIC ( Critic_ID ) ON DELETE CASCADE ;
  • 72. 63 ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT DIRECTOR_MOVIE_DIRECTOR_FK FOREIGN KEY ( Director_ID ) REFERENCES DIRECTOR ( Director_ID ) ON DELETE CASCADE ; ALTER TABLE DIRECTOR_MOVIE ADD CONSTRAINT DIRECTOR_MOVIE_MOVIES_FK FOREIGN KEY ( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON DELETE CASCADE ; ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT PRODUCER_MOVIE_MOVIES_FK FOREIGN KEY ( Movie_ID ) REFERENCES MOVIES ( Movie_ID ) ON DELETE CASCADE ; ALTER TABLE PRODUCER_MOVIE ADD CONSTRAINT PRODUCER_MOVIE_PRODUCER_FK FOREIGN KEY ( Producer_ID ) REFERENCES PRODUCER ( Producer ) ON DELETE CASCADE ; ALTER TABLE Reviews ADD CONSTRAINT Reviews_CRITIC_FK FOREIGN KEY ( Critic_ID ) REFERENCES CRITIC ( Critic_ID ) ON DELETE CASCADE ;