The complexity for minimum component costs has increased at a rate of roughly a
factor of two per year...Certainly over the short term this rate can be expected to
continue, if not to increase. Over the longer term, the rate of increase is a bit more
uncertain, although there is no reason to believe it will not remain nearly constant
for at least 10 years.
-- Gordon Moore, 1965
…Then you better start swimmin’…Or you’ll sink like a
stone…For the times they are a-changin’.
-- Bob Dylan
• NoSQL is a set of concepts that allows the rapid
and efficient processing of data sets with a focus
on performance, reliability, and agility.
Definition of NoSQL
Sounds great… What???
Operational Data
• Read and written by applications to carry out their ordinary functions.
• Examples:
• Shopping cart data in Amazon.com
• Information about employees in a human resources system
• Buy/Sell prices in Fidelity
• Posts made by Facebook users
• Travel Itineraries for bookings done on Expedia
Two Categories of Data
Analytical Data
• Used to provide business intelligence (BI).
• Data is often created by storing the operational data used by applications
over time, and it’s commonly read-only.
• Because these analytical datasets provide a historical record, they’re
commonly much bigger than an application’s current operational data.
• Examples:
• An e-commerce company might record all of the purchase data from its web
application, then analyze this data to learn about customer buying habits or market
trends.
• Facebook might sell all the posts made by its users to other companies who can
analyze the posts to determine each user’s significant events so that they can tailor
offers based on user needs, likes and dislikes.
The Problem called Big Data
Cracks appear in the single-CPU RDBMS model under pressure from the
four business drivers of the current age: volume, velocity, variability, and agility.
Volume
• The need to query large data sets has long raised performance
concerns in RDBMSs.
• These performance concerns were traditionally addressed by
purchasing faster processors.
• But once chip makers hit the power wall, increasing
processor speed was no longer a practical option.
• System designers shifted their focus from increasing speed on a
single chip (vertical scaling or scale up) to using more processors
working together (horizontal scaling or scale out).
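Scaling out usually means splitting the data itself across machines (the recap slide calls this sharding). A minimal sketch, assuming plain Python dicts stand in for separate servers; `shard_for` and `ShardedStore` are hypothetical names used only for illustration:

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to keep the key-to-shard mapping the same across runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store; each dict plays the role of one server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

# Number of keys that landed on each shard.
sizes = [len(shard) for shard in store.shards]
```

Each lookup touches exactly one shard, which is why adding shards can add capacity. Real systems also have to rebalance keys when the shard count changes, which this sketch ignores.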
Velocity
• Many single-processor RDBMSs are unable to keep up with the
demands of real-time inserts and online queries to the database
made by public-facing websites.
• RDBMSs frequently index many columns of every new row, a
process that slows inserts and decreases system performance.
• When single-processor RDBMSs are used as a back end to a web
store front, random bursts in web traffic slow down response
for everyone, and tuning these systems can be costly when both
high read and high write throughput are desired.
• This was another reason for engineers to look for a scale-out
solution.
Variability
• Companies that want to capture and report on exception data
struggle when attempting to use the rigid schema structures
imposed by RDBMSs. For example, if a business unit
wants to capture a few custom fields for a particular customer,
every customer row in the table must carry those columns,
even where they don't apply.
• Adding new columns to an RDBMS requires ALTER TABLE commands,
which on some systems lock the table or require downtime. When a
database is large, this process can impact system availability,
costing time and money.
• This was another reason engineers looked for a more viable
solution.
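The custom-fields problem above can be made concrete. A minimal sketch, assuming two hypothetical customer records (all field names are illustrative): flattening them into fixed-width rows forces every row to carry every column.

```python
# Hypothetical customer records; only the second has custom fields.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace", "loyalty_tier": "gold", "referral_code": "X9"},
]


def relational_rows(docs):
    """Flatten documents into fixed-width rows, padding gaps with None."""
    columns = sorted({key for doc in docs for key in doc})
    rows = [tuple(doc.get(col) for col in columns) for doc in docs]
    return columns, rows


columns, rows = relational_rows(customers)

# Cells that exist only because the schema is shared by every row.
null_cells = sum(cell is None for row in rows for cell in row)
```

In the document form, the first customer simply omits the fields. In the row form, the same customer carries two empty cells, and the cost grows with every new custom field.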
Agility
• The most complex part of building applications using RDBMSs is
the process of putting data into and getting data out of the
database.
• If your data has nested and repeated subgroups of data
structures, you need to include an object-relational mapping
layer. The responsibility of this layer is to generate the correct
combination of INSERT, UPDATE, DELETE, and SELECT SQL
statements to move object data to and from the RDBMS
persistence layer.
• This process isn't simple, and it is often the largest
barrier to rapid change when developing new applications or
modifying existing ones.
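To illustrate the mapping work described above, here is a rough sketch that saves one nested order two ways: through a hand-written mapping layer over SQLite, and as a single document. The table names, the `order` object, and the dict used as a document store are all assumptions made for this example.

```python
import json
import sqlite3

# A hypothetical order with a nested, repeated subgroup (its line items).
order = {
    "order_id": 42,
    "customer": "Ada",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def save_relational(conn, order):
    """Hand-written mapping layer: one INSERT per row, across two tables."""
    statements = 0
    conn.execute(
        "INSERT INTO orders (order_id, customer) VALUES (?, ?)",
        (order["order_id"], order["customer"]),
    )
    statements += 1
    for item in order["items"]:
        conn.execute(
            "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
            (order["order_id"], item["sku"], item["qty"]),
        )
        statements += 1
    return statements


def save_document(store, order):
    """Document-style write: the whole nested object goes under one key."""
    store[order["order_id"]] = json.dumps(order)
    return 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

relational_writes = save_relational(conn, order)  # 1 order row + 2 item rows
doc_store = {}
document_writes = save_document(doc_store, order)
```

The relational path needs three statements and a second table for one object; reading it back needs a join or two queries. An ORM generates that code for you, but the mapping still has to be designed and maintained.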
• It’s more than rows in tables
• NoSQL systems store and retrieve data in many formats: key-value stores, graph
databases, column-family stores, document stores, and even rows in tables.
• It’s free of joins
• NoSQL systems allow you to extract your data using simple interfaces without joins.
• It’s schema-free
• Some NoSQL systems let you drag and drop your data into a folder and then query it
without first creating an entity-relationship model.
The Solution called NoSQL
• It works on many processors
• NoSQL systems allow you to store your database on multiple processors and maintain
high-speed performance.
• It uses shared-nothing commodity computers
• Most NoSQL systems leverage low-cost commodity processors that have separate
RAM and disk.
• It supports linear scalability
• When you add more processors, you get a consistent increase in performance.
• It’s innovative
• NoSQL offers alternatives to a single way of storing, retrieving, and manipulating data.
NoSQL supporters (also known as NoSQLers) have an inclusive attitude about NoSQL
and recognize SQL solutions as viable options. To the NoSQL community, NoSQL
means “Not only SQL.”
What else?
• It’s not about not using the SQL language
• It’s not only open source
• It’s not only about volume
• It’s not about cloud computing
• It’s not just a clever use of RAM and SSD
• It’s not an elite group of products
• It’s not just Hadoop
What is NoSQL not…
Single Complex Component vs. Multiple Simple Components
• Removes Complexity
• Promotes Reuse
• Easier Maintenance
• Functions are distributed across many NoSQL (and SQL) databases,
each a simple tool with a simpler interface and a well-defined
role.
• NoSQL products take a "master of one" rather than a "jack of
all trades" approach.
• Example: Memcached to share objects in RAM, MapReduce to
run batch jobs, DynamoDB to store key-value items.
NoSQL Concepts
ACID | BASE
Get transaction details right | Never block a write
Block any reports while you are working | Focus on throughput, not consistency
Be pessimistic, anything might go wrong! | Be optimistic, if one service fails it will eventually get caught up
Detailed testing and failure mode analysis | Some reports may be inconsistent for a while, but don't worry
Lots of locks and unlocks | Keep things simple and avoid locks
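The BASE column can be sketched in a few lines. This is a toy model, not any product's replication protocol: `EventuallyConsistentStore` is a hypothetical class in which replica 0 accepts every write at once and the other replicas catch up later.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy BASE-style store: writes never block, replicas converge later."""

    def __init__(self, replica_count=2):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = deque()  # updates not yet applied to replicas 1..n-1

    def write(self, key, value):
        """Accept the write on replica 0 immediately; queue it for the rest."""
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def sync(self):
        """Background catch-up: apply queued updates to lagging replicas."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value


store = EventuallyConsistentStore()
store.write("cart:ada", ["book"])
stale = store.read("cart:ada", replica=1)  # replica 1 has not caught up yet
store.sync()
fresh = store.read("cart:ada", replica=1)  # now matches replica 0
```

Between `write` and `sync`, a report run against replica 1 would be inconsistent for a while, which is exactly the trade the BASE column accepts in exchange for never blocking a write.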
Eric Brewer’s CAP Theorem for Replication
Consistency—Having a single, up-to-date, readable version of your data
available to all clients. Consistency here is concerned with multiple clients
reading the same items from replicated partitions and getting consistent
results.
High availability—Knowing that the distributed database will always allow
database clients to update items without delay. Internal communication
failures between replicated data shouldn’t prevent updates.
Partition tolerance—The ability of the system to keep responding to client
requests even if there’s a communication failure between database partitions.
This is analogous to a person still having an intelligent conversation even after
a link between parts of their brain isn’t working.
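The trade-off behind these three properties shows up when the link between two replicas fails. A minimal sketch, assuming a hypothetical `ReplicatedPair` class with two modes: "CP" refuses writes during a partition, and "AP" accepts them and lets the replicas diverge.

```python
class ReplicatedPair:
    """Two replicas of one data set, with a switchable partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = {}
        self.b = {}
        self.partitioned = False

    def write(self, key, value):
        """Write through replica A; return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False      # refuse rather than let the replicas diverge
            self.a[key] = value   # AP: accept locally; replica B is now stale
            return True
        self.a[key] = value
        self.b[key] = value
        return True

    def consistent(self, key):
        return self.a.get(key) == self.b.get(key)


results = {}
for mode in ("CP", "AP"):
    pair = ReplicatedPair(mode)
    pair.write("price", 10)      # link is healthy: both replicas updated
    pair.partitioned = True      # communication failure between replicas
    accepted = pair.write("price", 12)
    results[mode] = (accepted, pair.consistent("price"))
```

During the partition, the CP pair stays consistent but rejects the update, while the AP pair accepts the update but the two replicas disagree until they are reconciled.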
Four Quadrants of Data Technologies
SQL, operational data: Relational Databases
• Oracle, SQL Server, MySQL
SQL, analytical data: Relational Analytics
• Oracle, SQL Server, MySQL
NoSQL, operational data:
• Key-Value Stores: DynamoDB, Azure Tables, Riak, etc.
• Column Family Stores: Apache HBase, Apache Cassandra, Google Bigtable, etc.
• Document Stores: MongoDB, DocumentDB, etc.
• Graph Stores: Neo4j, AllegroGraph, etc.
NoSQL, analytical data: Big Data Analytics
• Hadoop, HDInsight
Hadoop Core Technologies
• Hadoop Distributed File System (HDFS)
• Provides a way to store and access very large binary files across a cluster of
commodity servers and disk drives.
• Hadoop MapReduce
• Supports the creation of applications that process large amounts of analytical data in
parallel. That data is commonly stored in HDFS.
• Hive
• A Hadoop-based framework for querying and analyzing data. Among other things, it
provides HiveQL, a SQL-like language that can generate MapReduce jobs.
• Pig
• Another Hadoop-based framework for working with data. It provides a language called
Pig Latin for creating MapReduce jobs.
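The MapReduce model behind these tools can be shown without a cluster. The sketch below is plain Python, not the Hadoop API: `map_phase`, `shuffle`, and `reduce_phase` are illustrative names for the three steps, applied to a word count.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "data moves to clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each call to `map_phase` depends on only one line, and each call to `reduce_phase` on only one key, so a framework such as Hadoop can run them on different machines in parallel.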
Big Data Analytics using Hadoop
• NoSQL really means Not Only SQL
• Volume, Velocity, Variability & Agility are the main business
drivers for NoSQL.
• Key NoSQL Concepts: Multiple Simple Components, Application
Tiers With External Services, Strategic Use of RAM, SSD, HDD,
BASE Transaction Control, Automatic Sharding, Replication Using
CAP.
• Popular NoSQL Datastores: Key-Value, Column Family,
Document, Graph.
• Big Data Analytics using Hadoop
Quick Recap