NoSQL Consepts

NO SQL
Concepts (Part II)
No SQL Databases
Based on Chapter 2 – Making Sense of NoSQL
1
Chapter 2: NoSQL Concepts

Outline
• Keeping Components Simple
• Using Application Tier
• Speeding Performance
• Using Consistent Hashing
• Comparing ACID and BASE
• Database sharding
• Brewer’s CAP theorem
2

Keeping Components
Simple
• Relational databases can be very complex. They perform lot of
functions to manage multiple tables, perform join operations, do query
optimization, replicate transactions, run stored procedures, set
triggers, enforce security, and perform indexing.
• From the NoSQL perspective, simple is good. It’s not necessary to
include all functions in a single software application. They often have
a number of simpler components that can be reassembled to meet
the needs of different applications.
• For example, one function allows sharing of objects in RAM
(memcache), another function runs batch jobs (MapReduce,) and yet
another function stores binary documents (key-value stores).
3

Using Application Tier
• Using application tiers in your architecture allows you to create
flexible and reusable applications.
• By segregating an application into tiers, you have the option of
modifying or adding a specific layer instead of reworking an
entire application when modifications are required.
• Application designers have to choose whether they need to
use application tiers to divide the overall functionality of their
application.
• Identifying each layer breaks the application into separate
architectural components, which allows the software designer
to determine the responsibility of each component. This
separation of concerns allows designers to make the
complicated seem simple when explaining the system to
others.
4

(Cont.)
Application tiers are typically viewed in a
layer-cake-like drawing.
• The user events (like a user clicking a
button on a web page) trigger code in
the user interface. The output or
response from the user interface is sent
to the middle tier.
• This middle tier may respond by
sending something back to the user
interface, or it could access the
database layer.
• The database layer may in turn run a
query and send a response back to the
middle tier. The middle tier then uses
the data to create a report and sends it
to the user.
5

(Cont.)
6

(Cont.)
• Both models have a user interface tier at the top. In the
relational database, most of the application functionality is
found in the database layer.
• In the NoSQL application, most of the application functionality
is found in the middle tier.
• In addition, NoSQL systems leverage more services for
managing BLOBs of data (the key-value store), for storing full-
text indexes (the Lucene indexes), and for executing batch
jobs (MapReduce).
• In traditional relational database systems, the complexity found
in the database tier impacts the overall scalability of the
application.
7

Pros and Cons of
RDBMS
RDBMS pros:
ACID transactions at the database
level makes development easier.
Fine-grained security on columns and
rows using views prevents views and
changes by unauthorized users.
Most SQL code is portable to other
SQL databases, including open source
options.
Typed columns and constraints will
validate data before it’s added to the
database and increase data quality.
Existing staff members are already
familiar with entity-relational design
and SQL.
RDBMS cons:
The object-relational mapping
layer can be complex.
Entity-relationship modeling must
be completed before testing
begins, which slows development.
RDBMSs don’t scale out when
joins are required.
Sharding over many servers can
be done but requires application
code and will be operationally
inefficient.
Full-text search requires third-party
tools.
It can be difficult to store high-
variability data in tables.
8

Pros and Cons of
NOSQL
NoSQL pros:
Loading test data can be done with drag-
and-drop tools before ER modeling is
complete.
Modular architecture allows components to
be exchanged.
Linear scaling takes place as new
processing nodes are added to the cluster.
Lower operational costs are obtained by
autosharding.
Integrated search functions provide high-
quality ranked search results.
There’s no need for an object-relational
mapping layer.
It’s easy to store high-variability data.
NoSQL cons:
ACID transactions can be done
only within a document at the
database level. Other transactions
must be done at the application
level.
Document stores don’t provide
fine-grained security at the
element level.
NoSQL systems are new to many
staff members and additional
training may be required.
The document store has its own
proprietary nonstandard query
language, which prohibits
portability.
The document store won’t work
with existing reporting and OLAP
tools.
9

Terminology of
database cluster
• In general, each cluster consists of racks filled with commodity
computer hardware.
• Nodes contain a single logical processor called a CPU, local RAM
and disk. The CPU may in fact be implemented using multiple
chips, each with multiple core processors. The disk system may
also be composed of several independent drives.
10
• Nodes are grouped together in
racks that have high-bandwidth
connections between all the nodes
within a rack.
• Racks are grouped together to form
a database cluster within a data
center.
• A single data center location may
contain many database clusters.

Outline
11

Speeding Performance
• Generally, traditional database management systems weren’t concerned
with memory management optimization.
• The solution to faster systems is to keep as much of the right information
in RAM as you can, and check your local servers to see if they might
also have a copy. This local fast data store is often called a RAM cache
or memory cache.
• Many memory caches use a simple timestamp for each block of memory
in the cache as a way of keeping the most recently used objects in
memory.
• When memory fills up, the timestamp is used to determine which items
in memory are the oldest and should be overwritten.
• A more refined view can take into account how much time or resources
it’ll take to re-create the dataset and store it in memory.
• This “cost model” allows more expensive queries to be kept in RAM
longer than similar items that could be regenerated much faster.
12

Speeding Performance
(Cont.)
The effective use of RAM cache is predicated on the efficient
answer to the question:
• Have we run this query before?
• Have we seen this document before?
These questions can be answered by using consistent hashing,
which lets you know if an item is already in the cache or if you
need to retrieve it from SSD or HDD.
13

Outline
14

Using Consistent Hashing
• Consistent hashing quickly tells you if a new query or document is
the same as one already in your cache.
• Knowing this information prevents you from making unnecessary
calls to disk for information and keeps your databases running fast.
• A hash string (also known as a checksum or hash) is a process
that calculates a sequence of letters by looking at each byte of a
document.
• The hash string uniquely identifies each document.
• It is used to determine whether the document you’re presented
with is the same document you already have on hand.
• If there’s any difference between two documents (even a single
byte), the resulting hash will be different.
15

Using Consistent Hashing (Cont.)
• Consistent hashing occurs when two different processes running
on different nodes in your network create the same hash for the
same object.
• Consistent hashing confirms that the information in the document
hasn’t been altered and allows you to determine whether the object
exists in your cache or message store, saving precious resources
by only rerunning processes when necessary.
• Consistent hashing is also critical for synchronizing distributed
databases. For example, version control systems such as Git or
Subversion can run a hash not only on a single document but also
on hashes of hashes for all files within a directory.
16

Outline
17

ACID vs. BASE
• Typically one of two types of transaction control models are used:
1. ACID, used in RDBMS that focuses on consistency
2. BASE, found in many NoSQL systems that focus on availability.
• The difference is in the amount of effort required by application
developers and the location (tier) of the transactional controls.
18

BASE
BASE stands for these concepts:
• Basic Availability allows systems to be temporarily inconsistent so
that transactions are manageable. In BASE systems, the
information and service capability are “basically available.”
• Soft-state recognizes that some inaccuracy is temporarily allowed
and data may change while being used to reduce the amount of
consumed resources.
• Eventual consistency means eventually, when all service logic is
executed, the system is left in a consistent state.
19

BASE (Cont.)
• The number-one objective is to allow new data to be stored, even at
the risk of being out of sync for a short period of time. They relax
the rules and allow reports to run even if not all portions of the
database are synchronized.
• They’re optimistic in that they assume that eventually all systems
will catch up and become consistent.
• BASE systems tend to be simpler and faster because they don’t
have to write code that deals with locking and unlocking resources.
• Their mission is to keep the process moving and deal with broken
parts at a later time.
• BASE systems are ideal for web storefronts, where filling a
shopping cart and placing an order is the main priority.
20

Outline
21

Horizontal scalability
with sharding
• There may come a point when the amount of data needed to
run the business exceeds the current environment and some
mechanism for breaking the information into reasonable
chunks is required.
• Breaking a database into chunks called shards and spreading
the chunks across a number of distributed servers.
• Organizations and systems that reach this capacity can use
automatic database sharding.
22

Database Sharding
• Let’s say you’ve created a website that allows users to log in
and create their own personal space to share with friends.
• They have profiles, text, and product information on things they
like (or don’t like).
• You set up your website, store the information in a MySQL
database, and you run it on a single CPU.
• People love it, they log in, create pages, invite their friends,
and before you realize it your disk space is 95% full.
What do you do?
23

Database Sharding
(Cont.)
The answer is buy a new system and transfer half the users to the
new system. Oh, and your old system might have to be down for a
while so you can rewrite your application to know which database to
get information from.
The process of moving from a single to multiple databases can be
done in a number of ways; for example:
1. You can keep the users with account names that start with the
letters A-N on the first drive and put users from O-Z on the new
system.
2. You can keep the people in the United States on the original
system and put the people who live in Europe on the new
system.
3. You can randomly move half the users to the new system.
24

Database Sharding
(Cont.)
Some important Questions
• In option 1, if a user changes their name, should they be automatically
moved to the new drive?
• In option 2, if a user moves to a new country, should all their data be moved?
• If people tend to share links with people near them, would there be
performance advantages to keeping these users together?
• What if people in the United States tend to be active at the same time in the
evening?
• Would one database get overwhelmed and the other be idle?
• What happens if your site doubles in size again?
• Do you have to continue to rewrite your code each time this happens?
• Do you have to shut the system down for a weekend while you upgrade your
software?
25

Database Sharding
(Cont.)
As the number of servers grows, you find that the chance of any one
server being down remains the same.
• So for every server you add the chance of one part not working
increases.
• So you think that perhaps the same process you used to split the
database between two systems can also be used to duplicate
data to a backup or mirrored system if the first one fails.
• But then you have another problem. When there are changes to a
master copy, you must also keep the backup copies in sync.
• You must have a method of data replication. The time it takes to
keep these databases in sync can decrease system performance.
You now need more servers to keep up!
26

Database Sharding
(Cont.)
Welcome to the world of database sharding, replication, and
distributed computing.
• NoSQL systems have been noted for having many ways to allow
you to grow your database without ever having to shut down your
servers.
• Keeping your database running when there are node or network
failures is called partition tolerance—a new concept in the NoSQL
community and one that traditional database managers struggle
with.
27

Outline
28

Understanding trade-offs
with Brewer’s CAP theorem
The CAP theorem states that any distributed database system can have at most two
of the following three desirable properties:
1. Consistency—Having a single, up-to-date, readable version of your data
available to all clients. This isn’t the same as the consistency we talked about in
ACID. Consistency here is concerned with multiple clients reading the same
items from replicated partitions and getting consistent results.
2. High availability—Knowing that the distributed database will always allow
database clients to update items without delay. Internal communication failures
between replicated data shouldn’t prevent updates.
3. Partition tolerance—The ability of the system to keep responding to client
requests even if there’s a communication failure between database partitions.
This is analogous to a person still having an intelligent conversation even after a
link between parts of their brain isn’t working.
29

CAP theorem
30

31
Questions?
Introduction to Pervasive ComputingChapter 2: NoSQL Concepts

NoSQL Consepts

More Related Content

What's hot

Similar to NoSQL Consepts

More from Maynooth University

Recently uploaded

NoSQL Consepts

Editor's Notes