2. Contents
• Introduction and Features of NoSQL
• CAP Theorem
• RDBMS vs. NoSQL
• NoSQL Database Family
10/20/2014 @ Surabhi Dwivedi 2
3. Features - Not Only SQL
• Not an RDBMS
  ◦ Non-relational
• Distributed data store
  ◦ Horizontally scalable
• Schema-free / flexible schema
  ◦ Database JOINs generally not supported
• Handles huge amounts of data
  ◦ E.g. Google and Facebook, which collect terabytes of data
• BASE properties
  ◦ Basically Available
  ◦ Soft state
    ▪ The system does not have to be consistent all the time
  ◦ Eventually consistent
    ▪ The system will eventually become consistent once the updates have propagated, in particular when there are not too many updates
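A minimal sketch of eventual consistency, assuming a toy in-memory model: the names `Replica`, `write`, and `propagate` are hypothetical and do not come from any real database. A write lands on one replica first, and the second replica returns stale data until the pending update has propagated.

```python
class Replica:
    """A toy replica holding its own copy of the data."""
    def __init__(self):
        self.data = {}


def write(primary, pending, key, value):
    # The write is acknowledged after reaching only the primary (soft state).
    primary.data[key] = value
    pending.append((key, value))


def propagate(pending, replicas):
    # Later, queued updates are applied to the other replicas.
    while pending:
        key, value = pending.pop(0)
        for replica in replicas:
            replica.data[key] = value


primary, secondary = Replica(), Replica()
pending = []

write(primary, pending, "user:1", "Alice")
stale = secondary.data.get("user:1")   # the update has not arrived yet
propagate(pending, [secondary])
fresh = secondary.data.get("user:1")   # the replicas now agree
```

Between the write and the propagation step the two replicas disagree, which is the window that "soft state" describes.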
4. NoSQL
• Provides a mechanism for
  ◦ storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases
• Used in big data and real-time web applications
• NoSQL isn't a single product or technology, but an umbrella term for a category of databases
5. NoSQL does not Provide
• Joins
• Group by
• ACID transactions
• SQL
• NoSQL databases reject:
  ◦ Overhead of ACID transactions
  ◦ "Complexity" of SQL
  ◦ Burden of up-front schema design
  ◦ Declarative query expression
10. CAP Theorem
• Three properties of a system
  ◦ Consistency
    ▪ All copies have the same value
  ◦ Availability
    ▪ The system can keep running even if parts have failed (via replication)
  ◦ Partition tolerance
    ▪ The network can break into two or more parts, each with active systems that can't talk to the other parts
• Very large systems will partition at some point
  ◦ When that happens, choose one of consistency or availability
  ◦ Traditional databases choose consistency
  ◦ Most web applications choose availability
    ▪ Except for specific parts such as order processing
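The consistency-versus-availability choice can be illustrated with a small sketch. The `Node` class and its `mode` flag are hypothetical, chosen only for illustration: during a partition a consistency-first node refuses writes it cannot replicate, while an availability-first node accepts them and reconciles later.

```python
class Node:
    """Toy node that behaves differently during a network partition."""
    def __init__(self, mode):
        self.mode = mode          # "CP" favors consistency, "AP" favors availability
        self.data = {}
        self.partitioned = False  # True when the node cannot reach its peers

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Refuse rather than risk copies with different values.
            raise RuntimeError("unavailable: cannot reach peers")
        # An AP node accepts the write and must reconcile after the partition heals.
        self.data[key] = value
        return "ok"


def try_write(node, key, value):
    try:
        return node.write(key, value)
    except RuntimeError:
        return "rejected"


cp_node, ap_node = Node("CP"), Node("AP")
cp_node.partitioned = ap_node.partitioned = True

cp_result = try_write(cp_node, "order:1", "paid")  # consistency chosen
ap_result = try_write(ap_node, "cart:1", "book")   # availability chosen
```

This mirrors the slide's point that order processing tends to take the consistency-first path while most other parts of a web application do not.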
11. RDBMS vs. NoSQL database
RDBMS                                                       | NoSQL
Structured and organized data                               | Stands for Not Only SQL
Structured Query Language (SQL)                             | No declarative query language
Data and its relationships are stored in separate tables    | No predefined schema
Data Manipulation Language, Data Definition Language        | Variants: key-value pair store, column store, document store, graph store
Tight consistency                                           | Eventual consistency rather than ACID properties
ACID transactions                                           | CAP theorem
-                                                           | Prioritizes high performance, high availability and scalability
14. NoSQL Database Types
Distributed Key-Value Systems
• Hash table of keys
• Look up a single value for a key
• Amazon's Dynamo
Document-based Systems
• Store documents made up of tagged elements
• Access data by key or by search of "document" data
• CouchDB, MongoDB
Column-based Systems
• Each storage block contains data from only one column
• Google's BigTable
• Facebook's Cassandra
Graph-based Systems
• Use a graph structure
• Google's Pregel, Neo4j
15. Column-oriented databases
• Column-family stores let you store data with keys mapped to values, with the values grouped into multiple column families, each column family being a map of data
• Among the most popular types of non-relational databases
• Column-family databases store data in column families as rows
• Each row has many columns associated with a row key
• Column families are groups of related data that are often accessed together
• The basic unit of storage in a column-family database is a column
• Examples
  - Hadoop / HBase
  - Cassandra: Apache Cassandra was initially developed at Facebook to power its Inbox Search feature
  - Cloudata: a clone of Google's Bigtable, like HBase
16. Column-Oriented Databases (cont.)
• Data tables are stored as sections of columns of data, rather than as rows of data.
• The column is used as a store for the value, and has a timestamp that is used to differentiate valid content from stale content.
• The application uses the timestamp to find out which of the stored values in the backup nodes are up to date.
• Column family
  ◦ A container for columns, analogous to a table in a relational database.
  ◦ A column family has a name and a map with a key and a value (which is a map containing columns).
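As a rough illustration of how a timestamp separates valid content from stale content, the sketch below picks the most recent copy of one column from several nodes. The function name `latest` and the sample values are assumptions made for this example, not part of any specific database's API.

```python
def latest(versions):
    """Return the value with the highest timestamp.

    versions: list of (timestamp, value) pairs, one per copy of the column.
    """
    timestamp, value = max(versions, key=lambda pair: pair[0])
    return value


# Three copies of the same column, as they might be read from different nodes.
copies = [(100, "94301"), (250, "10001"), (180, "60601")]
current = latest(copies)
```

The copies stamped 100 and 180 are treated as stale because a newer write (250) exists.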
17. Examples
• Cassandra
• HBase
• Hypertable
• Amazon SimpleDB
19.
{
  "row_key_1" : {
    "name" : {
      "first_name" : "Jolly",
      "last_name" : "Goodfellow"
    },
    "location" : {
      "zip" : "94301"
    },
    "preferences" : {
      "d/r" : "D"
    }
  },
  "row_key_2" : {
    "name" : {
      "first_name" : "Very",
      "middle_name" : "Happy",
      "last_name" : "Guy"
    },
    "location" : {
      "zip" : "10001"
    },
    "preferences" : {
      "v/nv" : "V"
    }
  },
  ...
}
Each row may have a different set of
columns within a column-family
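The same layout can be sketched as nested Python dictionaries (row key, then column family, then column). The helper `get_column` is a hypothetical name used only for this illustration; it shows that a column missing from one row is simply absent and needs no NULL placeholder.

```python
rows = {
    "row_key_1": {
        "name": {"first_name": "Jolly", "last_name": "Goodfellow"},
        "location": {"zip": "94301"},
        "preferences": {"d/r": "D"},
    },
    "row_key_2": {
        "name": {"first_name": "Very", "middle_name": "Happy", "last_name": "Guy"},
        "location": {"zip": "10001"},
        "preferences": {"v/nv": "V"},
    },
}


def get_column(rows, row_key, family, column, default=None):
    """Look up one column; missing rows, families, or columns yield the default."""
    return rows.get(row_key, {}).get(family, {}).get(column, default)


with_middle = get_column(rows, "row_key_2", "name", "middle_name")
without_middle = get_column(rows, "row_key_1", "name", "middle_name")
```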
20. Contrasting Column Databases with RDBMS
• Column-oriented database
  - Minimal need for schema definition
  - Easily accommodates new columns
  - Column families are predefined
  - A column family is a set of columns grouped together into a bundle
  - A column family (no data type) contrasts with a column in an RDBMS (which has a data type)
  - Column databases are designed to scale and can easily accommodate millions of columns and billions of rows
22. Hadoop Distributed File System (HDFS) - Background for Distributed Storage
• Apache Hadoop is an open source software project
• It enables the distributed processing of large data sets across clusters of servers
• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster.
• The map and reduce functions can then be executed on smaller subsets of the larger data set
23. Hadoop Distributed File System (HDFS)
• MapReduce
  ◦ Map() procedure: performs filtering and sorting (such as sorting students by first name into queues, one queue for each name)
  ◦ Reduce() procedure: performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
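The student example can be sketched in plain Python. This is a single-process illustration of the map, group, and reduce steps, not Hadoop's actual API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample names are assumptions.

```python
from collections import defaultdict


def map_phase(students):
    # Emit a (first_name, 1) pair for each student.
    for full_name in students:
        first_name = full_name.split()[0]
        yield (first_name, 1)


def shuffle(pairs):
    # Group the pairs into one queue per first name.
    queues = defaultdict(list)
    for name, count in pairs:
        queues[name].append(count)
    return queues


def reduce_phase(queues):
    # Summarize each queue by counting its entries.
    return {name: sum(counts) for name, counts in queues.items()}


students = ["Ann Lee", "Bob Ray", "Ann Cho"]
freq = reduce_phase(shuffle(map_phase(students)))
# freq == {"Ann": 2, "Bob": 1}
```

In a real cluster the map and reduce steps would run on different machines, each over the blocks stored locally.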
24. Hadoop Distributed File System (HDFS) - Example
• Consider a file containing the phone numbers of everyone in the United States.
• The people with a last name starting with A might be stored on server 1, B on server 2, and so on.
• In a Hadoop world, pieces of this phonebook would be stored across the cluster.
• To reconstruct the entire phonebook, your program would need the blocks from every server in the cluster.
• To achieve availability as components fail, HDFS replicates these smaller pieces onto two additional servers by default.
  ◦ This redundancy offers multiple benefits:
    ▪ Higher availability.
    ▪ Scalability: a Hadoop cluster breaks work into smaller chunks and runs those jobs on all the servers in the cluster.
    ▪ Data locality, which is critical when working with large data sets.
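A rough sketch of splitting a file into blocks and keeping three copies of each (the original plus two additional servers). The round-robin placement below is an assumption made to keep the example short; it is not HDFS's real placement policy, and the function names are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + offset) % len(servers)]
            for offset in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4"])
# placement[0] == ["s1", "s2", "s3"]
```

With this layout, losing any single server still leaves two copies of every block.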
25. HBase - Distributed Storage
• HBase is a column-oriented database management system that runs on top of HDFS.
• HBase's distributed architecture is designed for applications storing up to billions of rows and millions of columns
• A good option to replace a relational database that cannot support such large data sets.
28. Master-worker pattern
• A master and a set of workers (range servers)
• When HBase starts, the master allocates a set of ranges to each range server.
• Each range stores an ordered set of rows, where each row is identified by a unique row key.
• When the number of rows stored in a range grows beyond a configured threshold, the range is split into two and its rows are divided between the two new ranges.
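The split-on-threshold behavior can be sketched as follows. This is a simplified single-process model, with ranges held as sorted Python lists; `insert_row` and the threshold value are hypothetical and are not HBase's API.

```python
import bisect


def insert_row(ranges, row_key, threshold):
    """Insert a row key into the matching range; split the range if it grows too large.

    ranges: list of sorted lists of row keys, ordered by their first key.
    """
    # Pick the last range whose first key is <= row_key (default: the first range).
    idx = 0
    for i, rng in enumerate(ranges):
        if rng and rng[0] <= row_key:
            idx = i
    bisect.insort(ranges[idx], row_key)  # keep rows ordered by row key

    if len(ranges[idx]) > threshold:
        full = ranges[idx]
        mid = len(full) // 2
        ranges[idx:idx + 1] = [full[:mid], full[mid:]]  # split into two ranges
    return ranges


ranges = [[]]
for key in ["r05", "r01", "r09", "r03", "r07"]:
    insert_row(ranges, key, threshold=4)
# ranges == [["r01", "r03"], ["r05", "r07", "r09"]]
```

The fifth insert pushes the single range past the threshold of four rows, so it is divided into two ordered ranges.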
29. Write-ahead log (WAL)
• A WAL is a common technique for providing atomicity and durability (two of the ACID properties).
• When data is written to a region, it's first written to the write-ahead log, if enabled.
• Then it's written to the region's in-memory store.
• When the in-memory store is full, data is flushed to disk and persisted in the underlying distributed storage.
• In HBase, a client program can decide to turn the WAL on or off.
• Switching it off boosts performance but reduces reliability and recoverability in case of failure.
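The write path above can be sketched with a toy `Region` class. Everything here is an assumption for illustration (in-memory lists and dicts stand in for the log, the in-memory store, and disk); it is not HBase's implementation.

```python
class Region:
    """Toy region: write-ahead log first, then in-memory store, then flush to disk."""
    def __init__(self, memstore_limit, wal_enabled=True):
        self.wal = []            # stands in for the durable log
        self.memstore = {}       # volatile in-memory store
        self.disk = {}           # stands in for persisted storage
        self.memstore_limit = memstore_limit
        self.wal_enabled = wal_enabled

    def put(self, key, value):
        if self.wal_enabled:
            self.wal.append((key, value))   # log the write first
        self.memstore[key] = value          # then apply it in memory
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        self.disk.update(self.memstore)     # persist the in-memory data
        self.memstore.clear()
        self.wal.clear()                    # flushed entries need no replay

    def recover(self):
        # After a crash the in-memory store is gone; replay whatever was logged.
        self.memstore = dict(self.wal)


with_wal = Region(memstore_limit=3)
with_wal.put("a", 1)
with_wal.put("b", 2)
with_wal.memstore.clear()   # simulate a crash
with_wal.recover()

no_wal = Region(memstore_limit=3, wal_enabled=False)
no_wal.put("a", 1)
no_wal.memstore.clear()     # simulate a crash
no_wal.recover()
```

With the log enabled both writes survive the simulated crash; with it disabled the unflushed write is lost, which is the reliability trade-off the slide describes.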
31. Document Model
• The notion of a schema is dynamic: each document can contain different fields.
  ◦ Helpful for modeling unstructured and polymorphic data.
  ◦ It also makes it easier to evolve an application during development, such as by adding new fields.
  ◦ Data can be queried based on any field in a document
32. DOCUMENT STORE
• Documents are grouped together into collections
• Collections are analogous to relational tables.
• Collections don't impose strict schema constraints
• Records are not documents in the sense of a word processing document
• The structure of any document can be modified
  - By adding and removing members from the document: reading the document into a program, modifying it and re-saving it
  - By using various update commands.
33. DOCUMENT STORE
• In MongoDB, each document is stored in BSON format.
• Binary data (using BSON format) can be stored in any of the fields in the document.
• BSON is a binary-encoded representation of a JSON-type document format
  - nested set of key/value pairs.
  - JSON: JavaScript Object Notation
• BSON extends JSON with additional types
  - regular expression,
  - binary data,
  - date.
• Each document has a unique identifier, which MongoDB can generate automatically (an ObjectId)
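To see why the extra types matter, the sketch below uses only Python's standard `json` module (not a BSON library) to check which field values plain JSON can encode. The helper `json_encodable` and the sample document are assumptions for illustration.

```python
import json
from datetime import datetime

doc = {
    "name": "report",
    "created": datetime(2014, 10, 20),  # a date value
    "payload": b"\x00\x01",             # raw binary data
}


def json_encodable(value):
    """Return True if the standard JSON encoder accepts the value as-is."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


results = {field: json_encodable(value) for field, value in doc.items()}
# results == {"name": True, "created": False, "payload": False}
```

Plain JSON has no native date or binary type, so these values would need a string workaround; a format with additional types can store them directly.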
34. DOCUMENT STORE
• Document databases
  ◦ Good for storing and managing Big Data-size collections of literal documents
    ▪ like text documents, email messages, and XML documents
    ▪ conceptual "documents" like de-normalized (aggregate) representations of a database entity
• Good for storing "sparse" data
  ◦ irregular (semi-structured) data that would require an extensive use of "nulls" in an RDBMS.
35. DOCUMENT STORE
• "Documents" are encoded in a standard data exchange format
  ◦ XML, JSON (JavaScript Object Notation) or BSON (Binary JSON).
• Unlike the simple key-value stores, the value column in document databases contains semi-structured data
  ◦ specifically attribute name/value pairs.
• A single column can house hundreds of such attributes
• The number and type of attributes recorded can vary from row to row.
• Both keys and values are fully searchable in document databases.
36. DOCUMENT STORE
• Records within a single table can have different structures.
• An example record from Mongo, using JSON format, might look like
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
37. Document Store - Internals
• Document stores
  ◦ Like key-value stores, except the value is a "document"
• Data model: (key, "document") pairs
• Basic operations:
  ◦ Insert(key, document)
  ◦ Fetch(key), Update(key)
  ◦ Delete(key)
• Also Fetch() based on document contents
• Example systems
  ◦ CouchDB, MongoDB
• Document stores
  ◦ Store arbitrary/extensible structures as a "value"
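The operations listed above can be sketched with an in-memory class. `DocumentStore` and its method names are hypothetical and only mirror the slide's Insert/Fetch/Update/Delete vocabulary; real systems such as CouchDB or MongoDB expose their own, different APIs.

```python
class DocumentStore:
    """Toy document store: (key, document) pairs, where a document is a dict."""
    def __init__(self):
        self.docs = {}

    def insert(self, key, document):
        self.docs[key] = document

    def fetch(self, key):
        return self.docs.get(key)

    def update(self, key, fields):
        # Add or overwrite fields; the structure of a document can change freely.
        self.docs[key].update(fields)

    def delete(self, key):
        self.docs.pop(key, None)

    def find(self, **criteria):
        # Fetch based on document contents rather than by key.
        return [
            doc for doc in self.docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
store.insert("p1", {"first_name": "Thomas", "last_name": "Jefferson"})
store.insert("p2", {"first_name": "Very", "middle_name": "Happy", "last_name": "Guy"})
store.update("p1", {"state": "DC"})
matches = store.find(last_name="Jefferson")
store.delete("p2")
```

The two documents carry different fields, and the content-based `find` is what distinguishes this model from a plain key-value store.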
39. Advantages of the Document Model
• More natural to represent data at the database level
• An aggregated document can be accessed with a single call to the database
  ◦ rather than having to JOIN multiple tables to respond to a query.
• The MongoDB document is physically stored as a single object, requiring only a single read from memory or disk.
  ◦ RDBMS JOINs require multiple reads from multiple physical locations.
• Distributing the database across multiple nodes (a process called sharding) is easier
  ◦ horizontal scalability
  ◦ documents are self-contained
40. MongoDB - Features
• MongoDB provides high-performance data persistence.
  ◦ Support for embedded data models reduces I/O activity on the database system.
  ◦ Indexes support faster queries and can include keys from embedded documents and arrays.
• High availability
  ◦ Automatic failover.
  ◦ Data redundancy.
• A replica set is a group of MongoDB servers that maintain the same data set, providing redundancy and increasing data availability.
• Automatic scaling
  ◦ MongoDB provides horizontal scalability as part of its core functionality.
  ◦ Automatic sharding distributes data across a cluster of machines.
  ◦ Replica sets can provide eventually consistent reads for low-latency, high-throughput deployments.
41. MongoDB - Sharding
• Data is distributed across multiple range servers
• MongoDB allows ordered collections to be saved across multiple machines.
• Shards are replicated to allow failover.
  - A large collection could be split into four shards
  - Each shard in turn may be replicated three times.
  - This would create 12 units of a MongoDB server.
  - The two additional copies of each shard serve as failover units.
• Sharding addresses the challenge of scaling to support high throughput and large data sets:
  - Each shard processes fewer operations as the cluster grows.
  - As a result, a cluster can increase capacity and throughput horizontally.
  - For example, to insert data, the application only needs to access the shard responsible for that record.
  - Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows.
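The four-shard, three-copy arithmetic and the idea of routing a record to one responsible shard can be sketched as below. The range boundaries and the function `shard_for` are assumptions chosen for illustration; they are not MongoDB's API or its actual chunk layout.

```python
import bisect

NUM_SHARDS = 4
COPIES_PER_SHARD = 3

# Hypothetical split points over an ordered key space: keys below "g" go to
# shard 0, keys from "g" up to "n" go to shard 1, and so on.
BOUNDARIES = ["g", "n", "t"]


def shard_for(key):
    """Return the index of the shard responsible for this key."""
    return bisect.bisect_right(BOUNDARIES, key)


# Every (shard, copy) pair is one server unit in the deployment.
units = [(shard, copy) for shard in range(NUM_SHARDS) for copy in range(COPIES_PER_SHARD)]
total_units = len(units)  # 4 shards x 3 copies

routes = {key: shard_for(key) for key in ["alice", "mike", "nancy", "zoe"]}
# routes == {"alice": 0, "mike": 1, "nancy": 2, "zoe": 3}
```

An insert for "mike" touches only shard 1, which is why each shard handles fewer operations as more shards are added.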
42.
• The data set is divided and distributed over multiple servers, or shards.
• Each shard is an independent database, and collectively, the shards make up a single logical database.
43. Distributed Key-Value Systems
• Key-value pair (KVP) stores
  ◦ Access data (values) by strings called keys.
  ◦ Data has no required format; it may have any format
  ◦ Extremely simple interface
• Data model: (key, value) pairs
• A NoSQL key-value store is a single table with two columns:
  ◦ one being the (primary) key, and the other being the value.
• Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
  ◦ Implementation: efficiency, scalability, fault tolerance
• Records distributed to nodes based on key
• Replication
• Single-record transactions, "eventual consistency"
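A small sketch of a key-value store whose records are distributed to nodes based on the key. The class `KeyValueCluster` is hypothetical, and simple hash-modulo placement is an assumption made to keep the example short; production systems typically use more elaborate schemes and add replication.

```python
import hashlib


class KeyValueCluster:
    """Toy key-value store spread over several in-memory nodes."""
    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.nodes)
        return self.nodes[index]

    def insert(self, key, value):
        self._node_for(key)[key] = value

    def fetch(self, key):
        return self._node_for(key).get(key)

    def update(self, key, value):
        node = self._node_for(key)
        if key not in node:
            raise KeyError(key)
        node[key] = value

    def delete(self, key):
        self._node_for(key).pop(key, None)


cluster = KeyValueCluster(num_nodes=3)
cluster.insert("user:1", "Alice")
first_read = cluster.fetch("user:1")
cluster.update("user:1", "Bob")
second_read = cluster.fetch("user:1")
cluster.delete("user:1")
after_delete = cluster.fetch("user:1")
```

Each operation touches exactly one node and one record, which matches the single-record transaction model noted above.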
44. Examples - Key-Value
• Riak
• Redis
• MemcacheDB
• Berkeley DB
• HamsterDB (especially suited for embedded use)
• Amazon DynamoDB (not open source)
• Project Voldemort (open-source implementation of Amazon's Dynamo)