1. Source: NoSQL Distilled
Prepared by Dr. Dipali Meher
Introduction to NoSQL
Not Only SQL
Dr. Dipali Meher
Assistant Professor
Modern College of Arts, Science and Commerce, Ganeshkhind, Pune 411016
mailtomeher@gmail.com/dipalimeher@moderncollegegk.org
MCS, M.Phil, NET, Ph.D
Agenda
• Introduction
• Why NoSQL?
• Aggregate data models
• Data Modeling Details
• Distribution models
• Consistency
• Version stamps
• Map-reduce
Introduction
• A NoSQL database (originally "non-SQL" or "non-relational") provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.
• Such databases have existed since the late 1960s.
• They are used in real-time web applications and big data.
• They are sometimes called "Not only SQL" to emphasize the fact that they may support SQL-like query languages.
• Most NoSQL stores lack true ACID transactions, although a few, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB, have made them central to their designs.
JSON Format
• JSON stands for JavaScript Object Notation.
• JSON objects are used for transferring data between server and client. XML serves the same purpose, but JSON has several advantages over XML, which the following slides cover along with JSON concepts and usage.
• Example JSON object:
var chaitanya = {
  "firstName" : "Chaitanya",
  "lastName" : "Singh",
  "age" : "28"
};
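A minimal sketch of how a program might consume this object, assuming it arrives as JSON text. It uses Python's standard `json` module; the variable names are illustrative only.

```python
import json

# The slide's example object, written as JSON text (as it would travel
# between server and client).
text = '{"firstName": "Chaitanya", "lastName": "Singh", "age": "28"}'

person = json.loads(text)  # JSON text -> Python dict
full_name = person["firstName"] + " " + person["lastName"]

# "age" is a JSON string here, exactly as on the slide, so it has to be
# converted before doing arithmetic with it.
age_next_year = int(person["age"]) + 1

round_trip = json.dumps(person)  # Python dict -> JSON text
```

Storing the age as the number `28` rather than the string `"28"` would avoid the conversion step.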
Features of JSON
• It is light-weight
• It is language independent
• Easy to read and write
• Text based, human readable data exchange format
Why use JSON?
• Standard structure: JSON objects have a standard structure, which makes it easier for developers to read and write code because they know what to expect from JSON.
• Lightweight: When working with AJAX, it is important to load data quickly and asynchronously without requesting a page reload. Because JSON is lightweight, it is easier to fetch and load the requested data quickly.
• Language independent: JSON works well with most modern programming languages. If the server-side language needs to change, the change is easier because the JSON structure is the same for all languages.
Difference as example between JSON and XML Style DB

JSON style:
{"students": [
  {"name":"John", "age":"23", "city":"Agra"},
  {"name":"Steve", "age":"28", "city":"Delhi"},
  {"name":"Peter", "age":"32", "city":"Chennai"},
  {"name":"Chaitanya", "age":"28", "city":"Bangalore"}
]}

XML style:
<students>
  <student> <name>John</name> <age>23</age> <city>Agra</city> </student>
  <student> <name>Steve</name> <age>28</age> <city>Delhi</city> </student>
  <student> <name>Peter</name> <age>32</age> <city>Chennai</city> </student>
  <student> <name>Chaitanya</name> <age>28</age> <city>Bangalore</city> </student>
</students>
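As a rough illustration that the two styles carry the same information, the sketch below parses a shortened version of each (two students only) with Python's standard `json` and `xml.etree.ElementTree` modules and compares the results. The size comparison holds for this sample only.

```python
import json
import xml.etree.ElementTree as ET

json_text = (
    '{"students": [{"name": "John", "age": "23", "city": "Agra"}, '
    '{"name": "Steve", "age": "28", "city": "Delhi"}]}'
)
xml_text = (
    "<students>"
    "<student><name>John</name><age>23</age><city>Agra</city></student>"
    "<student><name>Steve</name><age>28</age><city>Delhi</city></student>"
    "</students>"
)

# JSON maps directly onto lists and dictionaries.
from_json = json.loads(json_text)["students"]

# XML needs a small amount of code to turn elements into dictionaries.
root = ET.fromstring(xml_text)
from_xml = [
    {field.tag: field.text for field in student}
    for student in root.findall("student")
]

same_data = from_json == from_xml
json_is_shorter = len(json_text) < len(xml_text)
```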
Limitations of Relational DB
• In a relational database we must define the structure and schema of the data first; only then can we process the data.
• Relational database systems provide consistency and integrity of data by enforcing the ACID properties (Atomicity, Consistency, Isolation, and Durability). This is useful in some scenarios, such as a banking system. In many other cases, however, these properties add significant performance overhead and can slow down database responses.
• Many applications store their data in JSON format, and an RDBMS does not offer a convenient way to perform operations such as create, insert, update, and delete on such data.
• Many NoSQL databases, on the other hand, store their data in a JSON-like format, which is compatible with most of today's applications.
RDBMS vs NoSQL
• RDBMS: stores structured data; provides more functionality but gives less performance.
• NoSQL: stores structured or semi-structured data; provides less functionality but higher performance.
What "less functionality" in NoSQL means:
• Constraints are typically not available at the database level.
• Joins are typically not supported.
• Supporting these features hinders the scalability of a database, so when using a NoSQL database like MongoDB you implement them at the application level.
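A sketch of what "at the application level" can look like, using plain Python lists in place of a real database. The collection names, field names, and the `join_orders_to_customers` helper are hypothetical and not part of any database API.

```python
# Two hypothetical "collections", as a document store might return them.
customers = [
    {"_id": 1, "name": "John"},
    {"_id": 2, "name": "Steve"},
]
orders = [
    {"_id": 101, "customer_id": 1, "total": 250},
    {"_id": 102, "customer_id": 2, "total": 90},
    {"_id": 103, "customer_id": 1, "total": 40},
]

def join_orders_to_customers(customers, orders):
    """Application-level stand-in for a join on customer_id = _id."""
    by_id = {c["_id"]: c for c in customers}
    joined = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            # Application-level constraint: reject dangling references.
            raise ValueError(f"order {order['_id']} references unknown customer")
        joined.append(
            {"order_id": order["_id"], "customer": customer["name"], "total": order["total"]}
        )
    return joined

rows = join_orders_to_customers(customers, orders)
```

The cost of this design is that every application touching the data has to repeat (and agree on) the same checks.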
When to go for NoSQL:
You would choose NoSQL over a relational database when:
• You want to store and retrieve a huge amount of data.
• The relationships between the data you store are not that important.
• The data is not structured and changes over time.
• Support for constraints and joins is not required at the database level.
• The data is growing continuously and you need to scale the database regularly to handle it.
Why NoSQL?
• NoSQL databases are different from relational databases like MySQL.
• In a relational database you need to create the table, define the schema, set the data types of fields, and so on before you can actually insert data.
• In NoSQL you don't have to worry about that; you can insert and update data on the fly.
• One advantage of NoSQL databases is that they are easy to scale, and they are faster for many of the common operations we perform on a database.
• There are situations where you would prefer a relational database over NoSQL; however, when you are dealing with a huge amount of data, a NoSQL database is often the better choice.
Introduction Continued…
Motivations for the NoSQL approach include:
• Simplicity of design
• Simpler horizontal scaling to clusters of machines
• Finer control over availability
• The data structures used by NoSQL databases are different from those used by default in relational databases, which makes some operations faster in NoSQL.
• The data structures used by NoSQL databases are flexible.
Barriers to NoSQL
• Low-level query languages
• Lack of standardized interfaces
• Huge previous investments in existing relational databases
• Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability) transactions
Types of NoSQL DB
• MongoDB falls in the category of document-based NoSQL databases.
• Key-value store: Memcached, Redis, Coherence
• Tabular: HBase, Bigtable, Accumulo
• Document based: MongoDB, CouchDB, Cloudant
Other problems faced by NoSQL
• Stale reads: Most NoSQL databases offer eventual consistency, in which database changes are propagated to all nodes eventually rather than immediately. A query might therefore not return updated data right away, or might read data that is no longer accurate; this problem is known as stale reads.
• NoSQL databases may exhibit lost writes and other forms of data loss.
• Data consistency is a bigger challenge.
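The stale-read problem can be reproduced with a toy model. The sketch below is not how any particular database replicates data; `EventuallyConsistentStore` and its `sync` step are hypothetical names used only to show a replica lagging behind the primary.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy model: writes go to a primary and reach replicas only on sync()."""

    def __init__(self, replica_count=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet propagated to replicas

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, index, key):
        return self.replicas[index].data.get(key)

    def sync(self):
        # Propagate pending writes, in order, to every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("stock", 10)
store.sync()
store.write("stock", 9)
stale = store.read_from_replica(0, "stock")  # replica has not seen the update
store.sync()
fresh = store.read_from_replica(0, "stock")  # replica has caught up
```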
Advantages
• High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and placing it on multiple machines in such a way that the order of the data is preserved.
• Vertical scaling means adding more resources to the existing machine.
• Horizontal scaling means adding more machines to handle the data. Vertical scaling is limited by the capacity of a single machine, whereas NoSQL databases are designed to make horizontal scaling easy.
• Examples of horizontally scaling databases are MongoDB, Cassandra, etc.
• NoSQL can handle huge amounts of data because of this scalability: as the data grows, the database scales out to handle it efficiently.
• High availability: replication makes NoSQL databases highly available, because after a failure the data can be restored from a replica to the previous consistent state.
Disadvantages of NoSQL
• Narrow focus: NoSQL databases are mainly designed for storage and provide very little other functionality. Relational databases are a better choice than NoSQL for transaction management.
• Open source: most NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be unequal.
• Management challenge: the purpose of big data tools is to make management of a large amount of data as simple as possible, but it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular, has a reputation for being challenging to install and even more demanding to manage on a daily basis.
• GUI is not available: GUI tools for accessing the database are not widely available in the market.
• Backup: backup is a weak point for some NoSQL databases. MongoDB, for example, has been criticized for lacking a simple way to back up data in a consistent manner.
• Large document size: some database systems, such as MongoDB and CouchDB, store data in JSON format. Documents can therefore be quite large (a concern for big data, network bandwidth, and speed), and descriptive key names actually hurt, since they increase the document size.
When should NoSQL be used
• When a huge amount of data needs to be stored and retrieved.
• When the relationships between the data you store are not that important.
• When the data changes over time and is not structured.
• When support for constraints and joins is not required at the database level.
• When the data is growing continuously and you need to scale the database regularly to handle it.
RDBMS
• Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration mechanism.
• Application developers have been frustrated with the impedance mismatch between the relational model and the in-memory data structures.
• There is a movement away from using databases as integration points towards encapsulating databases within applications and integrating through services.
• The vital factor for a change in data storage was the need to support large volumes of data by running on clusters. Relational databases are not designed to run efficiently on clusters.
Impedance mismatch
Impedance mismatch is the term used for the problems that occur due to differences between the database model and the programming language model.
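To make the mismatch concrete, the sketch below takes one nested in-memory object and splits it into the flat rows a relational schema would need, then reassembles it. The order data and the two-table layout are assumptions chosen for illustration.

```python
# One cohesive in-memory structure, as application code would hold it.
order = {
    "id": 99,
    "customer": "Ann",
    "items": [
        {"product": "NoSQL Distilled", "qty": 1},
        {"product": "Refactoring", "qty": 2},
    ],
}

def to_rows(order):
    """Split the nested object into flat rows for two hypothetical tables."""
    order_row = (order["id"], order["customer"])
    item_rows = [(order["id"], i["product"], i["qty"]) for i in order["items"]]
    return order_row, item_rows

def from_rows(order_row, item_rows):
    """Reassemble the object; this is the work a join plus mapping code does."""
    order_id, customer = order_row
    items = [
        {"product": product, "qty": qty}
        for (row_order_id, product, qty) in item_rows
        if row_order_id == order_id
    ]
    return {"id": order_id, "customer": customer, "items": items}

order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```

The translation code in `to_rows` and `from_rows` is exactly the overhead that the term describes.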
NoSQL
• NoSQL is an accidental neologism. There is no prescriptive definition; all you can make is an observation of common characteristics.
• The common characteristics of NoSQL databases are:
  • Not using the relational model
  • Running well on clusters
  • Open-source
  • Built for 21st-century web estates
  • Schemaless
• The most important result of the rise of NoSQL is polyglot persistence.
Aggregate Data Models
• An aggregate is a collection of data that we interact with as a unit.
• These units of data, or aggregates, form the boundaries for ACID operations with the database.
• Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented database.
Data Model
• A data model is the model through which we perceive and manipulate our data.
• The data model describes how we interact with the data in the database.
• A data model (or datamodel) is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. For instance, a data model may specify that the data element representing a car be composed of a number of other elements which, in turn, represent the color and size of the car and define its owner.
• Data models are described using concepts such as entities, attributes, relations, or tables.
• Data models are distinct from storage models.
• A storage model describes how the database stores and manipulates the data internally.
• A storage model is a model that captures key physical aspects of data structure in a data store.
• Ideally we should be ignorant of the storage model, but in practice we need at least some inkling (a rough idea) of it, primarily to achieve decent (acceptably good) performance.
• "Data model" often means the model of the specific data in an application. A developer might point to an entity-relationship diagram of their database and refer to that as their data model, containing customers, orders, and products.
• Metamodel: the model by which the database organizes data.
Aggregates
• Aggregate orientation recognizes that you often want to operate on data in units that have a more complex structure than a set of tuples. It can be handy to think in terms of a complex record that allows lists and other record structures to be nested inside it.
• Complex record = aggregate.
• Programmers manipulate data through aggregate structures.
• The term "aggregate" comes from Domain-Driven Design.
• An aggregate is a collection of related objects treated as a unit.
• It is a unit for data manipulation and management of consistency.
• Aggregates are updated with atomic operations.
• Key-value, document, and column-family databases all work this way.
• Aggregates make it easier to operate a database on a cluster, because an aggregate makes a natural unit for replication and sharding.
• There is no universal answer for how to draw your aggregate boundaries.
• It depends entirely on how you tend to manipulate your data.
• If you tend to access a customer together with all of that
customer’s orders at once, then you would prefer a single
aggregate.
• However, if you tend to focus on accessing a single order at a
time, then you should prefer having separate aggregates for
each order.
• Naturally, this is very context-specific; some applications will
prefer one or the other, even within a single system, which is
exactly why many people prefer aggregate ignorance
Summary of aggregate data models
• An aggregate is a collection of data that we interact with
as a unit. Aggregates form the boundaries for ACID
operations with the database.
• Key-value, document, and column-family databases can
all be seen as forms of aggregate oriented database.
• Aggregates make it easier for the database to manage
data storage over clusters.
• Aggregate-oriented databases work best when most data
interaction is done with the same aggregate; aggregate-
ignorant databases are better when interactions use data
organized in many different formations.
Aggregate Data Models Continued…
• Aggregates make it easier for the database to manage data storage over
clusters, since the unit of data now could reside on any machine and
when retrieved from the database gets all the related data along with it.
• Aggregate-oriented databases work best when most data interaction is done with the same aggregate. For example, when you need to get an order and all its details, it is better to store the order as an aggregate object; working through these aggregates to get item details across all orders, however, is not elegant.
• Aggregate-oriented databases make inter-aggregate relationships more
difficult to handle than intra-aggregate relationships.
• Aggregate-ignorant databases are better when interactions use data
organized in many different formations.
• Aggregate-oriented databases often compute materialized views to
provide data organized differently from their primary aggregates. This
is often done with map-reduce computations, such as a map-reduce job
to get items sold per day.
Details of Data Models
• Relationships
• Graph Databases
• Schemaless databases
• Materialized Views
• Modeling for Data Access
Relationships
• Aggregates group together data that is commonly accessed together.
• In practice, however, related data may be accessed differently by different applications.
• Example: one customer has many orders.
  • Some applications want to access the order history whenever they access the customer; this fits well with combining the customer and the order history into a single aggregate.
  • Other applications want to process orders individually and thus model orders as independent aggregates. In this case the customer and order aggregates are separate, but the relationship between them (one customer, many orders) remains.
• Many databases, even key-value stores, provide ways to make these relationships visible to the database. Document stores make the content of the aggregate available to the database to form indexes and queries. Riak, a key-value store, allows you to put link information in metadata, supporting partial retrieval and link-walking capability.
Important aspect about relationships and aggregates
• How are updates handled?
• Aggregate-oriented databases treat the aggregate as the unit of data retrieval.
• Consequently, atomicity is only supported within the contents of a single aggregate.
• If you update multiple aggregates at once, you have to deal with a failure partway through yourself.
• Relational databases help you with this by allowing you to modify multiple records in a single transaction, providing ACID guarantees while altering many rows.
• Most NoSQL databases run on clusters and are aggregate oriented.
• These aggregate data models are of large records with
simple connections.
• In case of graph databases there are small records with
complex interconnections. See example in next slide.
Here a graph isn't a bar chart or histogram; instead, we refer to a graph data structure of nodes connected by edges.
• Querying a graph database differs from querying a relational database. In a graph database you keep the network structure of nodes and edges in mind when forming the query; in an RDBMS you keep the schema in mind (foreign keys, joins).
• In graph query languages, a query is answered by navigating through the network of edges.
• Most of the query work in a graph database consists of navigating relationships.
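A minimal sketch of "navigating through the network of edges", using a plain adjacency list and a breadth-first search rather than a real graph query language. The node names and the `reachable_within` helper are hypothetical.

```python
from collections import deque

# Directed "friend" edges between nodes, stored as an adjacency list.
edges = {
    "Anna": ["Barbara", "Carol"],
    "Barbara": ["Elizabeth"],
    "Carol": ["Dawn"],
    "Dawn": ["Jill"],
    "Elizabeth": ["Jill"],
    "Jill": [],
}

def reachable_within(graph, start, max_hops):
    """Return {node: hop_count} for nodes reachable in at most max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen

friends_of_friends = reachable_within(edges, "Anna", 2)
```

In a relational schema the same question would need one self-join per hop, which is why traversal-heavy queries suit graph databases.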
• The emphasis on relationships makes graph databases
very different from aggregate-oriented databases.
Schemaless Databases
• A common theme across all the forms of NoSQL
databases is that they are schemaless.
• With NoSQL databases, storing data is much more casual.
• A key-value store allows you to store any data you like under
a key.
• A document database effectively does the same thing, since
it makes no restrictions on the structure of the documents
you store.
• Column-family databases allow you to store any data under
any column you like.
• Graph databases allow you to freely add new edges and
freely add properties to nodes and edges as you wish.
Schemaless databases
• A schemaless database gives freedom and flexibility.
• With a schema, you have to figure out in advance what you need to store, then document and diagram it, which is hard to do.
• Without a schema binding you, you can easily change your data storage as you learn more about your project.
• You can easily add new things as you discover them.
• If you no longer want to store certain attributes or records, you can simply stop storing them.
A schemaless store also makes it easier to deal with nonuniform data:
Schemaless databases: Nonuniform data
• Nonuniform data is data where each record has a different set of fields.
• A schema puts all rows of a table into a straitjacket, which becomes awkward if you have different kinds of data in different rows. You either end up with lots of columns that are usually null (a sparse table), or you end up with meaningless columns like "custom column".
• Schemalessness avoids this, allowing each record to contain just what it needs: no more, no less.
In a schemaless database, implicit schemas are present.
• An implicit schema is a set of assumptions about the data's structure in the code that manipulates the data.
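The sketch below illustrates both ideas with plain Python dictionaries standing in for stored records: the store accepts nonuniform records, yet the code that reads them still carries an implicit schema. The record fields and the `mailing_label` function are made up for the example.

```python
# Nonuniform records: each one holds only the fields it needs.
records = [
    {"name": "John", "city": "Agra"},
    {"name": "Steve", "city": "Delhi", "phone": "555-0100"},
    {"name": "Peter"},  # no city stored at all
]

# This function carries an implicit schema: it assumes "name" always
# exists and that "city" may be missing.
def mailing_label(record):
    return f'{record["name"]}, {record.get("city", "city unknown")}'

labels = [mailing_label(r) for r in records]

# The sparse-table alternative: every row gets every column, and the
# missing values are filled with None (null).
columns = sorted({key for r in records for key in r})
sparse_rows = [[r.get(c) for c in columns] for r in records]
```

If a record ever arrived without a `name`, the store would accept it but `mailing_label` would fail, which is the practical meaning of an implicit schema.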
Materialized Views
• Views in an RDBMS
• Views provide a mechanism to hide from the client whether data is derived data or base data, but they can't avoid the fact that some views are expensive to compute.
• To cope with this, materialized views were invented: views that are computed in advance and cached on disk.
• Materialized views are effective for data that is read heavily but can stand being somewhat stale (slightly out of date; the view is only read, not modified directly).
• NoSQL databases don't have views, but they may have precomputed and cached queries, and they reuse the term "materialized view" to describe them.
• The map-reduce technique is often used to compute them.
• Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results.
• Materialized views can be used within the same aggregate.
2 main ways to build the materialized view
• Eager approach: update the materialized view at the same time you update the base data for it. This approach is good when you have more frequent reads of the materialized view than writes and you want the materialized view to be as fresh as possible. It is easier with an application database, where you can ensure that any update to base data also updates the materialized views.
• Batch approach: if you don't want to pay that overhead on each update, you can run batch jobs to update the materialized views at regular intervals.
• Views can also be built outside of the database by reading the data, computing the view, and saving it back to the database.
• Views are often updated with the map-reduce technique.
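The two approaches can be contrasted in a few lines. This is a sketch under the assumption that the view is "quantity sold per day"; the `OrderStore` class is hypothetical and keeps everything in memory.

```python
from collections import defaultdict

class OrderStore:
    def __init__(self, eager=True):
        self.orders = []                       # base data
        self.sales_per_day = defaultdict(int)  # materialized view
        self.eager = eager

    def add_order(self, day, quantity):
        self.orders.append((day, quantity))
        if self.eager:
            # Eager approach: update the view together with the base data.
            self.sales_per_day[day] += quantity

    def rebuild_view(self):
        # Batch approach: recompute the whole view from the base data.
        view = defaultdict(int)
        for day, quantity in self.orders:
            view[day] += quantity
        self.sales_per_day = view

eager_store = OrderStore(eager=True)
batch_store = OrderStore(eager=False)
for store in (eager_store, batch_store):
    store.add_order("2024-01-01", 2)
    store.add_order("2024-01-01", 3)

stale_value = batch_store.sales_per_day.get("2024-01-01", 0)  # before the batch job
batch_store.rebuild_view()
```

The eager store pays a small cost on every write; the batch store writes cheaply but serves stale results until the next rebuild.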
MAP REDUCE
• A MapReduce job usually splits the input data-set into
independent chunks which are processed by
the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are
then input to the reduce tasks.
• Typically both the input and the output of the job are
stored in a file-system.
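The three stages can be imitated in a single process. The sketch below is not the Hadoop API; it only mirrors the map, sort/group, and reduce steps, using the "items sold per day" idea with made-up order data.

```python
from collections import defaultdict

orders = [
    {"day": "2024-01-01", "items": [{"product": "tea", "qty": 2}, {"product": "mug", "qty": 1}]},
    {"day": "2024-01-01", "items": [{"product": "tea", "qty": 1}]},
    {"day": "2024-01-02", "items": [{"product": "mug", "qty": 4}]},
]

def map_order(order):
    """Map: emit one ((day, product), quantity) pair per order line."""
    for item in order["items"]:
        yield (order["day"], item["product"]), item["qty"]

def shuffle(pairs):
    """Framework step: group values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_quantities(key, values):
    """Reduce: collapse all values for one key into a total."""
    return key, sum(values)

# Each order is mapped independently, so this step could run in parallel.
mapped = [pair for order in orders for pair in map_order(order)]
result = dict(reduce_quantities(key, values) for key, values in shuffle(mapped))
```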
Key points
• Aggregate-oriented databases make inter-aggregate
relationships more difficult to handle than intra-aggregate
relationships.
• Graph databases organize data into node and edge graphs;
they work best for data that has complex relationship
structures.
• Schemaless databases allow you to freely add fields to
records, but there is usually an implicit schema expected by
users of the data.
• Aggregate-oriented databases often compute materialized
views to provide data organized differently from their
primary aggregates. This is often done with map-reduce
computations.
The dominant data model of the last couple of decades is the relational data model, which is best visualized as a set of tables,
rather like a page of a spreadsheet.
Each table has rows, with each row representing some entity of interest.
We describe this entity through columns, each having a single value.
A column may refer to another row in the same or different table, which constitutes a relationship between those entities.
(We’re using informal but common terminology when we speak of tables and rows; the more formal terms would be relations and tuples.)
NoSQL is a move away from the relational model.
Each NoSQL solution has a different model that it uses, which we put into four categories widely used in the NoSQL ecosystem:
key-value, document, column-family, and graph.
Of these, the first three share a common characteristic of their data models which we will call aggregate orientation.
The relational model takes the information that we want to store and divides it into tuples (rows).
A tuple is a limited data structure:
It captures a set of values, so you cannot nest one tuple within another to get nested records, nor can you put a list of values or tuples within another.
This simplicity underpins the relational model—it allows us to think of all operations as operating on and returning tuples.
A database shard, or simply a shard, is a horizontal partition of data in a database or search engine. Each shard is held on a separate database server instance, to spread load.
Sharding is a database architecture pattern related to horizontal partitioning —
the practice of separating one table's rows into multiple different tables, known as partitions.
Sharding and partitioning are both about breaking up a large data set into smaller subsets.
The difference is that sharding implies the data is spread across multiple computers while partitioning does not.
Partitioning is about grouping subsets of data within a single database instance.
Sharding is necessary if a dataset is too large to be stored in a single database.
Moreover, many sharding strategies allow additional machines to be added.
Sharding allows a database cluster to scale along with its data and traffic growth.
Sharding is sometimes loosely referred to as horizontal partitioning.
A shard is one horizontal partition in a table, relation, or database. The difference between a shard and horizontal partitioning is that the shard is located on a separate network node. The benefit of sharding is that you will have less data on each node, so the data will be smaller, more likely to be held in cache, and the indexes will be smaller.
Sharding is not so useful for graph databases. The highly connected nature of nodes and edges in a typical graph database can make it difficult to partition the data effectively. Many graph databases do not provide facilities for edges to reference nodes in different databases. For these databases, scaling up rather than scaling out may be a better option.
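A minimal sketch of routing keys to shards. It uses hash-based placement, which is only one possible strategy and does not preserve key order (range-based sharding would). The `ShardedStore` class is hypothetical, and each "shard" is just an in-memory dictionary standing in for a separate server.

```python
import hashlib

def shard_for(key, shard_count):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

class ShardedStore:
    def __init__(self, shard_count):
        # Each dict stands in for a separate database server instance.
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(shard_count=3)
for customer_id in ("c1", "c2", "c3", "c4", "c5", "c6"):
    store.put(customer_id, {"id": customer_id})

total_records = sum(len(shard) for shard in store.shards)
```

With plain modulo placement, changing the shard count moves most keys to a different shard, which is one reason production systems tend to use more elaborate schemes.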
We are going to be selling items directly to customers over the web, and we will have to store information about users,
our product catalog, orders, shipping addresses, billing addresses, and payment data.
We can use this scenario to model the data using a relational data store as well as NoSQL data stores and talk about their pros and cons.
For a relational database, we might start with a data model shown in above figure
The customer contains a list of billing addresses;
the order contains a list of order items, a shipping address, and payments.
The payment itself contains a billing address for that payment.
A single logical address record appears three times in the example data, but instead of using IDs it’s treated as a value and copied each time.
This fits the domain where we would not want the shipping address, nor the payment’s billing address, to change.
In a relational database, we would ensure that the address rows aren’t updated for this case, making a new row instead.
With aggregates, we can copy the whole address structure into the aggregate as we need to.
The link between the customer and the order isn’t within either aggregate—it’s a relationship between aggregates.
Similarly, the link from an order item would cross into a separate aggregate structure for products, which we haven’t gone into.
We’ve shown the product name as part of the order item here—this kind of denormalization is similar to the tradeoffs with relational databases,
but is more common with aggregates because we want to minimize the number of aggregates we access during a data interaction.
The important thing to notice here isn’t the particular way we’ve drawn the aggregate boundary so much as the fact that you have
to think about accessing that data—and make that part of your thinking when developing the application data model.
Indeed, we could draw our aggregate boundaries differently, putting all the orders for a customer into the customer aggregate.
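The aggregate boundaries described above can be sketched as nested dictionaries. All field names and values are illustrative assumptions; the `embed_orders` helper is hypothetical and shows the alternative boundary in which orders live inside the customer aggregate.

```python
import copy

address = {"city": "Chicago"}

# Customer aggregate: contains a list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billingAddresses": [copy.deepcopy(address)],
}

# Order aggregate: items, shipping address, and payments are nested inside.
# The address is copied as a value each time, not referenced by ID.
order = {
    "id": 99,
    "customerId": 1,  # relationship between aggregates, held as an ID
    "orderItems": [{"productId": 27, "productName": "NoSQL Distilled", "price": 32.45}],
    "shippingAddress": copy.deepcopy(address),
    "payments": [{"txnId": "hypothetical-txn-1", "billingAddress": copy.deepcopy(address)}],
}

def embed_orders(customer, orders):
    """Alternative boundary: put a customer's orders inside the customer aggregate."""
    combined = copy.deepcopy(customer)
    combined["orders"] = [
        copy.deepcopy(o) for o in orders if o["customerId"] == customer["id"]
    ]
    return combined

# Changing the customer's current address leaves the order's copies untouched.
customer["billingAddresses"][0]["city"] = "Boston"
combined = embed_orders(customer, [order])
```

Which boundary is preferable depends on whether the application usually reads a customer with all orders or one order at a time.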
When you want to store data in a relational database, you first have to define a schema—a defined structure for the database which
says what tables exist, which columns exist, and what data types each column can hold.
Before you store some data, you have to have the schema defined for it.
Having the implicit schema in the application code results in some problems.
It means that in order to understand what data is present you have to dig into the application code.
If that code is well structured you should be able to find a clear place from which to deduce the schema.
But there are no guarantees; it all depends on how clear the application code is
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS).