No SQL introduction

NO SQL
10/20/2014 @ Surabhi Dwivedi

Contents
 Introduction and Feature of NoSQL
 CAP Theorem
 RDBMS VS NoSQL
 NoSQL Database family

Features- Not Only SQL
 No RDBMS
◦ No relational
 Distributed Data Store
◦ Horizontally scalable
 Schema-free / Flexible schema
◦ Database JOINs generally not supported
 A huge amount of data
◦ E.g. Google and Facebook, which collect terabits of data
 BASE properties
◦ Basically Available
◦ Soft state
 It does not have to be consistent all the time
◦ Eventually consistent
 The system will eventually become consistent when the updates
propagate, in particular, when there are not too many updates
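
The BASE properties above can be illustrated with a small simulation. This is a minimal sketch only: the `Replica` and `Cluster` classes and the pending-update queue are hypothetical names introduced for illustration, not part of any real NoSQL product. A write lands on one replica first (soft state), and the copies agree again once the queued updates have propagated.

```python
from collections import deque


class Replica:
    """Hypothetical in-memory replica holding key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Toy cluster: a write hits one replica, the others are updated later."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = deque()  # updates not yet applied everywhere

    def write(self, key, value):
        # Apply to the first replica immediately; queue for the rest.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Drain the queue in order, applying each update to the other replicas.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica.data[key] = value

    def is_consistent(self):
        first = self.replicas[0].data
        return all(r.data == first for r in self.replicas)


cluster = Cluster(["a", "b", "c"])
cluster.write("user:1", "Alice")
stale_read = cluster.replicas[1].data.get("user:1")  # update not propagated yet
consistent_before = cluster.is_consistent()
cluster.propagate()
consistent_after = cluster.is_consistent()
```

Between `write` and `propagate` a read from another replica can return stale data; that window is what "eventually consistent" accepts in exchange for availability.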

NoSQL
 Provides a mechanism for
◦ storage and retrieval of data that is modeled in
means other than the tabular relations used in
relational databases
 Used in big data and real-time web
applications
 NoSQL isn’t a single product or technology,
but an umbrella term for a category of
databases

NoSQL does not Provide
 Joins
 Group by
 ACID transactions
 SQL
 NoSQL databases reject:
◦ Overhead of ACID transactions
◦ “Complexity” of SQL
◦ Burden of up-front schema design
◦ Declarative query expression

Requirement of NoSQL

NoSQL - Users

CAP Theorem
 Three properties of a system
◦ Consistency
 all copies have same value
◦ Availability
 system can run even if parts have failed, via replication
◦ Partitions
 network can break into two or more parts, each with active
systems that can’t talk to other parts
 Very large systems will partition at some point
◦ Choose one of consistency or availability
◦ Traditional databases choose consistency
◦ Most Web applications choose availability
 Except for specific parts such as order processing
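
The choice between consistency and availability during a partition can be sketched as follows. The `PartitionedPair` class and its `mode` flag are hypothetical and exist only for this illustration; real systems make the trade-off with far more nuance.

```python
class PartitionedPair:
    """Two hypothetical replicas that may lose contact with each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.left = {}
        self.right = {}

    def write_left(self, key, value):
        if not self.partitioned:
            # Healthy network: both copies receive the same value.
            self.left[key] = value
            self.right[key] = value
            return True
        if self.mode == "CP":
            # Choose consistency: refuse the write rather than diverge.
            return False
        # Choose availability: accept the write on the reachable side only.
        self.left[key] = value
        return True


cp = PartitionedPair("CP")
cp.write_left("stock", 10)
cp.partitioned = True
cp_accepted = cp.write_left("stock", 9)

ap = PartitionedPair("AP")
ap.write_left("stock", 10)
ap.partitioned = True
ap_accepted = ap.write_left("stock", 9)
```

In CP mode the copies stay identical but the write is rejected; in AP mode the write succeeds but the two sides now hold different values until the partition heals.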

RDBMS VS NoSQL database
RDBMS | NoSQL
Structured and organized data | Stands for Not Only SQL
Structured query language (SQL) | No declarative query language
Data and its relationships are stored in separate tables | No predefined schema
Data Manipulation Language, Data Definition Language | Variants: Key-Value Pair Store, Column Store, Document Store, Graph Store
Tight consistency | Eventual consistency rather than ACID properties
ACID transactions | CAP theorem
- | Prioritizes high performance, high availability and scalability

Example –NoSQL Databases

NoSQL Database Family

NoSQL Database Types
Distributed Key-Value Systems
• Hash table of keys
• Lookup a single value for a key
• Amazon’s Dynamo
Document-based Systems
• Stores documents made up of tagged elements
• Access data by key or by search of “document” data
• CouchDB, MongoDB
Column-based Systems
• Each storage block contains data from only one column
• Google’s BigTable
• Facebook’s Cassandra
Graph-based Systems
• Use a graph structure
• Google’s Pregel, Neo4j

Column-oriented databases
• Column-family stores let you store data with keys mapped to
values, and the values grouped into multiple column families
• Each column family is a map of data
• Among the most popular types of non-relational databases
• Column-family databases store data in column families as rows
• Rows have many columns associated with a row key
• Column families are groups of related data that is often
accessed together
• The basic unit of storage in a column-family database is a column
• Examples
• Hadoop / HBase
• Cassandra: Apache Cassandra was initially developed at Facebook to
power their Inbox Search feature
• Cloudata: a clone of Google's Bigtable, like HBase

Column-Oriented Databases Cont …
 Data tables are stored as sections of columns of
data, rather than as rows of data.
 The column is used as a store for the value, and
has a timestamp that is used to differentiate the
valid content from stale ones.
 Application will use the timestamp to find out
which of the stored values in the backup nodes
are up-to-date.
 Column Family
◦ A container for columns, analogous to table in a
relational database.
◦ The column Family has a name, a map with a key and
a value (which is a map containing columns).
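
The role of the timestamp can be sketched in a few lines. This is an illustrative assumption about how an application might pick the valid value; the `Column` type and the `latest` helper are hypothetical names, not the API of any particular database.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # e.g. milliseconds since the epoch


def latest(copies):
    """Return the copy with the highest timestamp (the up-to-date one)."""
    return max(copies, key=lambda column: column.timestamp)


# The same column as read from three hypothetical replica nodes.
copies = [
    Column("zip", "94301", timestamp=100),
    Column("zip", "10001", timestamp=250),  # most recent write
    Column("zip", "94301", timestamp=100),
]
current = latest(copies)
```

Here `current` is the copy written at timestamp 250; the two older copies are treated as stale.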

Example
 Cassandra
 HBase
 Hypertable
 Amazon SimpleDB

{
  "row_key_1" : {
    "name" : {
      ...
    },
    "location" : {
      ...
    },
    "preferences" : {
      ...
    }
  },
  "row_key_2" : {
    "name" : {
      ...
    },
    "location" : {
      ...
    },
    ...
  },
  "row_key_3" : {
    ...
  }
}
• Row key (row_key_1, row_key_2, ...): uniquely identifies a record in a
column database
• Column-family identifier (name, location, preferences): second level key

{
  "row_key_1" : {
    "name" : {
      "first_name" : "Jolly",
      "last_name" : "Goodfellow"
    },
    "location" : {
      "zip" : "94301"
    },
    "preferences" : {
      "d/r" : "D"
    }
  },
  "row_key_2" : {
    "name" : {
      "first_name" : "Very",
      "middle_name" : "Happy",
      "last_name" : "Guy"
    },
    "location" : {
      "zip" : "10001"
    },
    "preferences" : {
      "v/nv" : "V"
    }
  },
  ...
}
Each row may have a different set of
columns within a column-family
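
The nested structure above maps naturally onto nested dictionaries: row key, then column family, then column name. The sketch below is illustrative only; the helper names `get_column` and `columns_in` are hypothetical, and the sample rows assume that "d/r" sits under a "preferences" family, mirroring row_key_2.

```python
# row key -> column family -> column name -> value
rows = {
    "row_key_1": {
        "name": {"first_name": "Jolly", "last_name": "Goodfellow"},
        "location": {"zip": "94301"},
        "preferences": {"d/r": "D"},
    },
    "row_key_2": {
        "name": {"first_name": "Very", "middle_name": "Happy", "last_name": "Guy"},
        "location": {"zip": "10001"},
        "preferences": {"v/nv": "V"},
    },
}


def get_column(rows, row_key, family, column, default=None):
    """Look up one column; a missing column returns `default`, not an error."""
    return rows.get(row_key, {}).get(family, {}).get(column, default)


def columns_in(rows, row_key, family):
    """Return the sorted column names a given row stores in a family."""
    return sorted(rows.get(row_key, {}).get(family, {}))


middle_1 = get_column(rows, "row_key_1", "name", "middle_name")
middle_2 = get_column(rows, "row_key_2", "name", "middle_name")
```

Only row_key_2 stores a `middle_name` column; row_key_1 simply omits it, with no null placeholder needed.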

Contrasting Column Databases with
RDBMS
• Column-oriented database
– minimal need for schema definition
– easily accommodates newer columns
– column families are predefined
– a column family is a set of columns grouped together into a bundle
– a column family (no data type) corresponds to a column in an
RDBMS (with data type)
– column databases are designed to scale and can easily
accommodate millions of columns and billions of
rows

Contrasting Column Databases with
RDBMS Cont …

Hadoop distributed filesystem (HDFS) –
Background for Distributed Storage
• Apache Hadoop is an open source software
project
• Enables the distributed processing of large data
sets across clusters of servers
• Designed to scale up from a single server to
thousands of machines, with a very high degree
of fault tolerance.
• Data in a Hadoop cluster is broken down into
smaller pieces (called blocks) and distributed
throughout the cluster.
• The map and reduce functions can be executed
on smaller subsets of larger data sets

Hadoop distributed filesystem
(HDFS)
MapReduce
◦ Map() procedure - performs filtering and sorting
(such as sorting students by first name into
queues, one queue for each name)
◦ Reduce() procedure performs a summary
operation (such as counting the number of
students in each queue, yielding name
frequencies).
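
The student example above can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's API; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    for first_name, _last_name in students:
        yield first_name, 1


def shuffle(pairs):
    """Group emitted values by key, one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


students = [
    ("Ann", "Lee"), ("Bob", "Ray"), ("Ann", "Kim"),
    ("Cy", "Fox"), ("Ann", "Poe"),
]
name_frequencies = reduce_phase(shuffle(map_phase(students)))
```

In a real cluster the map calls run on many servers in parallel and the grouping step moves data between them; here everything runs in one process to keep the flow visible.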

Hadoop distributed filesystem
(HDFS) - Example
 A file containing the phone numbers for everyone in the
United States;
 The people with a last name starting with A might be
stored on server 1, B on server 2, and so on.
 In a Hadoop world, pieces of this phonebook would be
stored across the cluster
 To reconstruct the entire phonebook, your program
would need the blocks from every server in the cluster.
 To achieve availability as components fail, HDFS
replicates these smaller pieces onto two additional
servers by default.
◦ This redundancy offers multiple benefits,
 Higher availability.
 Scalability: a Hadoop cluster breaks work into smaller chunks and runs
those jobs on all the servers in the cluster
 Data locality, which is critical when working with large data sets.
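
Splitting a file into blocks and keeping three copies of each can be sketched as below. The round-robin placement is a simplifying assumption made for this example (real HDFS placement takes rack topology into account), and the function names, server names and sample data are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + offset) % len(servers)]
            for offset in range(replication)
        ]
    return placement


phonebook = b"Adams 555-0100\nBaker 555-0101\nClark 555-0102\n"
blocks = split_into_blocks(phonebook, block_size=16)
servers = ["server1", "server2", "server3", "server4"]
placement = place_blocks(len(blocks), servers)

# Simulate losing one server: every block should still have copies left.
survivors = {
    block_id: [s for s in where if s != "server2"]
    for block_id, where in placement.items()
}
```

With one original plus two additional copies per block, the loss of a single server leaves at least two copies of every block available.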

HBase - Distributed Storage
 HBase is a column-oriented database
management system that runs on top of
HDFS.
 HBase’s distributed architecture is designed
for applications storing up to billions of
rows and millions of columns
 A good option to replace a relational
database that cannot support such large data
sets.

HBase Distributed Storage Architecture

• Master-worker pattern
• A master and a set of workers (range servers)
• When HBase starts, the master allocates a set of ranges to each range
server.
• Each range stores an ordered set of rows, where each row is
identified by a unique row-key.
• As the number of rows stored in a range grows beyond a
configured threshold, the range is split into two and the rows are
divided between the two new ranges.
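
The split behaviour described above can be sketched as follows. This is a toy model under stated assumptions: the `Range` class, the `put_with_split` helper and the middle-key split point are hypothetical and chosen for clarity, not taken from HBase's implementation.

```python
import bisect


class Range:
    """Hypothetical range holding rows ordered by row key."""

    def __init__(self, rows=None):
        self.keys = sorted(rows or {})
        self.rows = dict(rows or {})

    def put(self, row_key, value):
        if row_key not in self.rows:
            bisect.insort(self.keys, row_key)  # keep row keys ordered
        self.rows[row_key] = value

    def split(self):
        """Divide the rows between two new ranges at the middle key."""
        mid = len(self.keys) // 2
        lower = Range({k: self.rows[k] for k in self.keys[:mid]})
        upper = Range({k: self.rows[k] for k in self.keys[mid:]})
        return lower, upper


def put_with_split(ranges, index, row_key, value, threshold):
    """Insert into ranges[index]; split it if it grows past `threshold`."""
    target = ranges[index]
    target.put(row_key, value)
    if len(target.keys) > threshold:
        ranges[index:index + 1] = list(target.split())


ranges = [Range()]
for key in ["row5", "row1", "row4", "row2", "row3"]:
    put_with_split(ranges, 0, key, value=key.upper(), threshold=4)
```

The fifth insert pushes the single range past the threshold of four rows, so it is replaced by two ranges that together still hold the rows in key order.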

write-ahead-log (WAL)
• WAL is a common technique for providing atomicity
and durability (two of the ACID properties).
• When data is written to a region, it’s first written to the
write-ahead-log, if enabled.
• Later, it’s written to the region’s in-memory store.
• If the in-memory store is full, data is flushed to disk
and persisted in the underlying distributed storage.
• In HBase a client program could decide to turn WAL
on or switch it off.
• Switching it off would boost performance but reduce
reliability and recovery, in case of failure.
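
The write path and the effect of switching the WAL off can be sketched as below. This is an in-memory illustration only: the `Region` class, its `memstore_limit` parameter and the `crash_and_recover` method are hypothetical names, and the lists and dicts merely stand in for a durable log and on-disk files.

```python
class Region:
    """Toy region: optional write-ahead log, in-memory store, 'disk'."""

    def __init__(self, wal_enabled=True, memstore_limit=2):
        self.wal_enabled = wal_enabled
        self.memstore_limit = memstore_limit
        self.wal = []        # stands in for a durable, append-only log
        self.memstore = {}   # in-memory store, lost on a crash
        self.disk = {}       # stands in for files on distributed storage

    def put(self, key, value):
        if self.wal_enabled:
            self.wal.append((key, value))  # 1. log the write first
        self.memstore[key] = value         # 2. then the in-memory store
        if len(self.memstore) >= self.memstore_limit:
            self.flush()                   # 3. flush when the store is full

    def flush(self):
        self.disk.update(self.memstore)
        self.memstore.clear()
        self.wal.clear()  # flushed entries no longer need replaying

    def crash_and_recover(self):
        self.memstore = {}                 # memory contents are gone
        for key, value in self.wal:        # replay whatever was logged
            self.memstore[key] = value

    def get(self, key):
        return self.memstore.get(key, self.disk.get(key))


safe = Region(wal_enabled=True)
safe.put("row1", "a")
safe.crash_and_recover()

fast = Region(wal_enabled=False)
fast.put("row1", "a")
fast.crash_and_recover()
```

After the simulated crash, the region with the WAL enabled recovers `row1` by replaying the log, while the region without it has lost the unflushed write.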

Document Model
 Notion of a schema is dynamic: each
document can contain different fields.
◦ Helpful for modeling unstructured and
polymorphic data.
◦ It also makes it easier to evolve an application
during development, such as adding new fields.
◦ Data can be queried based on any fields in a
document

DOCUMENT STORE
• Documents are grouped together into collections
• Collections are analogous to relational tables.
• Collections don’t impose strict schema
constraints
• Records are not documents in the sense of a
word processing document
• Structure of any document can be modified
• By adding and removing members from the document
- by reading the document into program, modifying it
and re-saving it
• By using various update commands.

DOCUMENT STORE
• Each document is stored in BSON format.
• Binary data (using BSON format) can be stored
in any of the fields in the document.
• BSON is a binary-encoded representation of a JSON-type
document format
– nested set of key/value pairs.
– JSON – JavaScript Object Notation
• BSON is a superset of JSON
– supports additional types
• regular expression,
• binary data,
• date.
• Each document has a unique identifier, which
MongoDB can generate automatically (an auto-generated object id)
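
Why the extra types matter can be seen with Python's standard `json` module, which refuses dates and binary data out of the box. This sketch does not use a BSON library; the `json_encodable` helper and the sample documents are assumptions made for illustration.

```python
import json
from datetime import datetime


def json_encodable(value):
    """Return True if the standard json module can serialize `value`."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


plain = {"name": "Thomas", "tags": ["a", "b"], "age": 83}
with_date = {"created": datetime(2014, 10, 20)}
with_binary = {"photo": b"\x89PNG"}

results = {
    "plain": json_encodable(plain),
    "date": json_encodable(with_date),
    "binary": json_encodable(with_binary),
}
```

Plain strings, numbers and lists serialize as JSON, but the date and the binary value do not; a format with native date and binary types avoids converting such fields to strings by hand.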

DOCUMENT STORE
 Document databases –
◦ Good for storing and managing Big Data-size
collections of literal documents
 like text documents, email messages, and XML
documents
 conceptual “documents” like de-normalized
(aggregate) representations of a database entity
 Good for storing “sparse” data
◦ irregular (semi-structured) data that would
require an extensive use of “nulls” in an
RDBMS.

DOCUMENT STORE
 “Documents” are encoded in a standard data exchange
format
◦ XML, JSON (JavaScript Object Notation) or BSON (Binary
JSON).
 Unlike the simple key-value stores, the value column in
document databases contains semi-structured data
◦ specifically attribute name/value pairs.
 A single column can house hundreds of such attributes
 Number and type of attributes recorded can vary from
row to row.
 Both keys and values are fully searchable in document
databases.

DOCUMENT STORE
 Records within a single table can have different structures.
 An example record from Mongo, using JSON format, might
look like
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}

Document Store - Internals
 Document Stores
◦ Like Key-Value Stores, except Value is a “Document”
 Data model: (key, “document”) pairs
 Basic operations:
◦ Insert(key, document)
◦ Fetch(key), Update(key)
◦ Delete(key)
 Also Fetch() based on document contents
 Example systems
◦ CouchDB, MongoDB
 Document stores
◦ Store arbitrary/extensible structures as a “value”

Advantages of the Document Model
 More natural to represent data at the database level
 An aggregated document can be accessed with a
single call to the database
◦ rather than having to JOIN multiple tables to respond to a
query.
 The MongoDB document is physically stored as a
single object, requiring only a single read from
memory or disk.
◦ RDBMS JOINs require multiple reads from multiple
physical locations.
 Distributing the database across multiple nodes (a
process called sharding) is easier
◦ horizontal scalability
◦ documents are self-contained

MongoDB- Features
 MongoDB provides high performance data persistence.
◦ Support for embedded data models reduces I/O activity on database
system.
◦ Indexes support faster queries and can include keys from embedded
documents and arrays.
 High Availability
◦ automatic failover.
◦ data redundancy.
 A replica set is a group of MongoDB servers that maintain the
same data set, providing redundancy and increasing data
availability.
 Automatic Scaling
◦ MongoDB provides horizontal scalability as part of its core
functionality.
◦ Automatic sharding distributes data across a cluster of machines.
◦ Replica sets can provide eventually-consistent reads for low-latency
high throughput deployments.

MongoDB - Sharding
• Data is distributed across multiple range servers
• MongoDB allows ordered collections to be saved across
multiple machines.
• Shards are replicated to allow failover.
• Large collection could be split into four shards
• Each shard in turn may be replicated three times.
• This would create 12 units of a MongoDB server.
• The two additional copies of each shard serve as failover units.
• Sharding addresses the challenge of scaling to support
high throughput and large data sets:
• Each shard processes fewer operations as the cluster grows.
• As a result, a cluster can increase capacity and throughput
horizontally.
• For example, to insert data, the application only needs to access
the shard responsible for that record.
• Sharding reduces the amount of data that each server needs to
store. Each shard stores less data as the cluster grows.

• The data set is divided and
distributed over
multiple servers, or shards.
• Each shard is an
independent database, and
collectively, the shards make
up a single logical database.
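
The four-shard, three-copy arrangement can be sketched as a range-sharded collection. This is an illustration only and not how MongoDB is implemented: the `ShardedCollection` class, the split points and the synchronous copying to every replica are assumptions made to keep the example small.

```python
import bisect


class ShardedCollection:
    """Toy range-sharded collection; each shard is kept in `replicas` copies."""

    def __init__(self, split_points, replicas=3):
        # Split points ["g", "n", "t"] give 4 shards: <g, g..m, n..s, >=t
        self.split_points = sorted(split_points)
        self.shards = [
            [dict() for _ in range(replicas)]
            for _ in range(len(self.split_points) + 1)
        ]

    def shard_for(self, key):
        return bisect.bisect_right(self.split_points, key)

    def insert(self, key, document):
        # Only the shard responsible for the key is touched.
        for copy_of_shard in self.shards[self.shard_for(key)]:
            copy_of_shard[key] = document

    def find(self, key, failed_copy=None):
        # Read from the first copy that has not failed (failover).
        for index, copy_of_shard in enumerate(self.shards[self.shard_for(key)]):
            if index != failed_copy:
                return copy_of_shard.get(key)
        return None

    def server_units(self):
        return sum(len(copies) for copies in self.shards)


users = ShardedCollection(split_points=["g", "n", "t"], replicas=3)
users.insert("alice", {"city": "Pune"})
users.insert("tom", {"city": "Delhi"})
```

Four shards with three copies each give `users.server_units()` of 12, and a read still succeeds when one copy of the responsible shard is marked as failed.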

Distributed Key-Value Systems
 Key-Value Pair (KVP) Stores
◦ Access data (values) by strings called keys.
◦ Data has no required format – data may have any format
◦ Extremely simple interface
 Data model: (key, value) pairs
 NoSQL Key-Value store is a single table with two
columns:
◦ one being the (Primary) Key, and the other being the Value.
 Basic Operations: Insert (key, value), Fetch
(key),Update (key), Delete (key)
◦ Implementation: efficiency, scalability, fault-tolerance
 Records are distributed to nodes based on key, with replication
 Single-record transactions, “eventual consistency”
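
The key-value interface, with records distributed by key and replicated, can be sketched as below. The `KeyValueCluster` class, the hash-modulo placement and the neighbour-node replication are hypothetical simplifications made for this example; production systems typically use more elaborate schemes.

```python
import hashlib


class KeyValueCluster:
    """Toy key-value store spread over nodes by hashing the key."""

    def __init__(self, node_count, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash picks the first node; neighbours hold the replicas.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def insert(self, key, value):
        for node_id in self._owners(key):
            self.nodes[node_id][key] = value

    update = insert  # same mechanics: overwrite the value for the key

    def fetch(self, key, down=()):
        # Try each owner in turn, skipping nodes listed as down.
        for node_id in self._owners(key):
            if node_id not in down:
                return self.nodes[node_id].get(key)
        return None

    def delete(self, key):
        for node_id in self._owners(key):
            self.nodes[node_id].pop(key, None)


kv = KeyValueCluster(node_count=4)
kv.insert("session:42", b"opaque bytes, any format")
primary = kv._owners("session:42")[0]
value_with_primary_down = kv.fetch("session:42", down=(primary,))
```

The store never inspects the value, so any format works, and each operation touches a single record on the nodes that own its key.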

Example- Key Value
 Riak
 Redis
 Memcached DB
 Berkeley DB
 Hamster DB (especially suited for
embedded use)
 Amazon DynamoDB (not open source)
 Project Voldemort (open source
implementation of Amazon's Dynamo design)

References
 Professional NoSQL – Shashank Tiwari
 MongoDB Manual
 http://docs.mongodb.org
 http://docs.mongodb.org/manual/core/sharding-introduction/
 Wikipedia References
 Intro to Hbase Internals & Schema Design
(for HBase Users)
◦ Alex Baranau, Sematext International, 2012
