NoSQL
Technologies
HBase | Cassandra | MongoDB | Redis
Girish Khanzode
Contents
• NoSQL
– Horizontal Scalability
– CAP Theorem
– Gossip Protocol & Hinted Handoffs
• HBase
– HBase Data Model
– HBase Regions
– Column Families
– HBase API
• Redis NoSQL Database
• Cassandra
– Architecture Overview
– Partitioning
– Write Properties
– Gossip Protocols
– Accrual Failure Detector
– Data Model
– Tunable Consistency
• CQL
• Memcached Database
• MongoDB
• References
NoSQL
• Not Only SQL
• Class of non-relational data storage systems
• Usually no fixed table schema
• No concept of joins
• Relax one or more of the ACID properties
NoSQL
• Column Store Type
– Each storage block contains data from only one column
– More efficient than row (or document) store if
• Multiple rows/records/documents are inserted at the same time, so updates of column blocks can
be aggregated
• Retrievals access only some of the columns in a row/record/document
• Document Store Type – stores documents made up of tagged elements
• Key-Value Store Type – Hash table of keys
• Graph Databases Type
Categories
• Key-Value Store
– Big HashTable of keys & values
• Products
– Memcached
– Membase
– Redis – Data structure server
– Riak – Amazon Dynamo based
– Amazon S3 (Dynamo)
Categories
• Schema-less - column-based, document-based, graph-based
• Document-based Store - Stores documents made up of tagged elements (CouchDB, MongoDB)
• Column-based Store - Each storage block contains data from only one column
– Google BigTable
– Cassandra
– HBase
• Graph-based - A network database that uses edges and nodes to represent and store data (Neo4J)
NoSQL Types Comparison
RDBMS Scaling - Master-Slave
• All writes are written to the master
• All reads performed against the replicated slave databases
• Critical reads may be incorrect as writes may not have been propagated
down
• Large data sets can pose problems as master needs to duplicate data to
slaves
RDBMS Scaling
• Partition or Sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– Requires de-normalizing data
• In-memory databases
RDBMS Limitations
• One size does not fit all
• Impedance mismatch
• Rigid schema design
• Harder to scale
• Replication
• Difficult to join across multiple nodes
• Can not easily handle data growth
• Need a DBA
RDBMS Limitations
• Many issues while scaling up for
massive datasets
• Not designed for distributed
computing
• Expensive specialized hardware
• Multi-node databases considered
as solutions - Known as ‘scaling
out’ or ‘horizontal scaling’
– Master-slave
– Sharding
Horizontal Scalability
• Scale out
• Easily add servers to existing system - Elastically scalable
– Bugs, hardware errors, things fail all the time
– Cost efficient
• Non sharing
• Use commodity/cheap hardware
• Heterogeneous systems
Horizontal Scalability
• Controlled concurrency (avoids locks)
• Service Oriented Architecture
– Local states
– Decentralized to reduce bottlenecks
– Avoids single point of failures
• Asynchronous
• All nodes are symmetric
Horizontal Scalability
NoSQL Database Features
• Large data volumes
• Scalable replication and distribution
– Potentially thousands of machines
– Potentially distributed around the world
• Queries need to return answers quickly
• CAP Theorem
• Open source development
• Key/Value
NoSQL Database Features
• Mostly query, few updates
• Asynchronous Inserts & Updates
• Schema-less
• ACID transaction properties not needed – BASE
• Schema-Less Stores
– Richer model than key/value pairs
– Eventual consistency
– Distributed
– Excellent performance and scalability
– Downside - typically no ACID transactions or joins
Key-Value Store
• A simple Hash table
• Read and write values using a key
– Get(key), returns the value associated with the provided key
– Put(key, value), associates the value with the key
– Multi-get(key1, key2, .., keyN), returns the list of values associated with the
list of keys
– Delete(key), removes the entry for the key from the data store
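The four operations above can be sketched with a plain in-memory hash table. This is only an illustration: the class name `KeyValueStore` and its method names are assumptions made for this sketch, not the API of any particular product.

```python
from typing import Any, Dict, Iterable, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        # Returns the value for the key, or None if the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Associates the value with the key, overwriting any old value.
        self._data[key] = value

    def multi_get(self, keys: Iterable[str]) -> List[Optional[Any]]:
        # Returns values in the same order as the requested keys.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Removes the entry; deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1000:name", "alice")
store.put("user:1000:age", 30)
print(store.multi_get(["user:1000:name", "user:1000:age", "missing"]))
store.delete("user:1000:age")
print(store.get("user:1000:age"))
```

A distributed store would add partitioning and replication behind the same four calls; the sketch keeps everything in one process.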
Key-Value Store
• Pros
– Very fast
– Scalable
– Simple model
– Distribute horizontally
• Cons
– Many data structures (objects) not easily modeled
– As data volume rises, maintaining unique values as keys is difficult
Document Store
• The data is a collection of key-value pairs stored as a document, so a
document store is similar to a key-value store
• The difference is that the stored values (documents) provide some structure
and encoding of the managed data
• XML, JSON (JavaScript Object Notation) and BSON (binary JSON objects)
are some common standard encodings
Column Store
• Data stored in cells grouped in columns of data rather than rows
• Columns logically grouped into column families
• Families can contain a virtually unlimited number of columns that can be created
at runtime or in the schema definition
• Reads and writes are done using columns rather than rows
• Benefit of storing data in columns is fast search/access and data aggregation
• All the cells corresponding to a column are stored as a contiguous disk entry, which
makes search/access faster
Column Store - Data Model
• ColumnFamily - A single structure that can group Columns and SuperColumns
• Key - permanent name of the record. Keys have different numbers of columns, so
the database can scale in an irregular way
• Key-space - Defines the outermost level of an organization, typically the name of
the application
• Column - Ordered list of elements - a tuple with a name and a defined value
ACID Transactions - Atomic
• Either the whole process is done or none
• If transaction successful – commit
• System responsible for saving all changes to database
• If transaction unsuccessful - abort
• System responsible for rollback of all changes
ACID Transactions - Consistent
• Database constraints preserved
• Enterprise rules limit occurrence of some real-world events
• Customer cannot withdraw if balance less than minimum
• These limitations are integrity constraints: assertions that must be
satisfied by all database states (state invariants)
• Isolated - User sees as if only one process executes at a time - two
concurrent transactions will not see one another’s changes while “in flight”
ACID Transactions - Durable
• Effects of a process not lost if the system crashes
• System ensures that once a transaction commits, its effect on the database state
is not lost despite subsequent failures
• Database stored redundantly on mass storage devices to protect against media
failure
• Related to Availability - extent to which a (possibly distributed) system can
provide service despite failure
– Non-stop DBMS (mirrored disks)
– Recovery based DBMS (log)
CAP Theorem
• Brewer’s Theorem by Prof. Eric Brewer of the University of California,
Berkeley, published in 2000
• Consistency: Every node in the system contains the same data
• Replicas never out of date
• Availability - Every request to a non-failing node in the system returns a
response
– System available during software and hardware upgrades and node failures
– Traditionally thought of as server/process available for five 9’s (99.999 %)
– For a system with many nodes, at any point there’s a good chance that a node is
down or that there is a network disruption among the nodes
• Need system resilience during network disruption
CAP Theorem
CAP Theorem
• Partition Tolerance - System properties (consistency and/or availability) hold even
when the system is partitioned (communication lost) and data is lost (node lost)
• A system can continue to operate in the presence of network partitions
• At most two of these three properties supported for any shared-data system
• Scaling out requires partition
• It leaves either consistency or availability to choose from
• In almost all cases, availability chosen over consistency
Eventual Consistency
• BASE (Basically Available, Soft-state, Eventual consistency)
• BASE is an alternative to ACID
• Weak consistency – stale data OK
• When no updates occur for a long period of time, eventually all updates
propagate through the system and all the nodes are consistent
• For a given accepted update and a given node, eventually either the update
reaches the node or the node is removed from service
• Availability first
• Approximate answers
Eventual Consistency
• Given a sufficiently long period of time over which no changes are sent, all
updates can be expected to propagate eventually through the system and all the
replicas will be consistent
• Conflict resolution
– Read repair - The correction is done when a read finds an inconsistency. This slows
down the read operation
– Write repair - The correction takes place during a write operation, if an inconsistency
has been found, slowing down the write operation
– Asynchronous repair - The correction is not part of a read or write operation
NoSQL Advantages
• Cheap - open source
• Easy to implement
• Data replicated to multiple nodes (identical and fault-tolerant)
• Partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• No predefined schema
• Scale up and down
• Relax the data consistency requirement (CAP)
NoSQL Downsides
• Joins
• Group by
• Order by
• ACID transactions
• SQL frustrating but still a powerful query language
• Easy integration with other applications that support SQL
Gossip Protocol & Hinted Handoffs
• Most preferred communication protocol in a distributed environment
• All the nodes talk to each other peer wise
• No global state
• No single coordinator
• If one node goes down and there is still a quorum, the load for the down node is
shared by the others
• Self managing system
• If a new node joins, load is also distributed
• Requests coming to node F handled by node C. When F becomes available, it will get this Information
from C
• Self healing property
Gossip Protocol & Hinted Handoffs
HBASE
HBase
• An open-source, distributed, column-oriented database built on top of
HDFS based on BigTable
• A distributed data store scalable horizontally to 1,000’s of commodity
servers and petabytes of indexed storage
• Designed to operate on top of the Hadoop distributed file system (HDFS)
or Kosmos File System (KFS - Cloudstore) for scalability, fault tolerance
and high availability
HBase History
• Started by Chad Walters and Jim
• 2006.11 - Google releases paper on BigTable
• 2007.2 - Initial HBase prototype created as Hadoop contribution
• 2007.10 - First useable HBase
• 2008.1 - Hadoop becomes Apache top-level project and HBase becomes subproject
• 2008.10 - HBase 0.18, 0.19 released
A Big Map
• Row Key + Column Key + timestamp => value
Row Key | Column Key | Timestamp | Value
1 | Info:name | 1273516197868 | Sakis
1 | Info:age | 1273871824184 | 21
1 | Info:sex | 1273746281432 | Male
2 | Info:name | 1273863723227 | Themis
2 | Info:name | 1273973134238 | Andreas
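The "big map" idea can be sketched as a dictionary keyed by (row key, column key), where each entry holds every timestamped version of the value. The class `BigMap` and its methods are hypothetical names for this sketch and are not the HBase client API; returning the newest version when no timestamp is given is an assumption made for the example.

```python
from typing import Dict, Optional, Tuple

# (row key, column key) -> {timestamp: value}
Cell = Tuple[str, str]


class BigMap:
    def __init__(self) -> None:
        self._cells: Dict[Cell, Dict[int, str]] = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        # A new timestamp adds a version instead of overwriting the old one.
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            # No timestamp given: return the newest version.
            return versions[max(versions)]
        return versions.get(timestamp)


table = BigMap()
table.put("1", "Info:name", 1273516197868, "Sakis")
table.put("2", "Info:name", 1273863723227, "Themis")
table.put("2", "Info:name", 1273973134238, "Andreas")
print(table.get("2", "Info:name"))
print(table.get("2", "Info:name", 1273863723227))
```

Row 2 in the table above has two versions of `Info:name`; the sketch keeps both and lets the caller pick one by timestamp.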
Why BigTable?
• RDBMS performance good for transaction processing
• Very large scale analytic processing solutions are commercial, expensive,
and specialized
• Very large scale analytic processing
– Big queries – typically range or table scans
– Big databases (100s of TB)
Why BigTable?
• Map reduce on Bigtable with optional cascading on top to support some
relational algebras - a cost effective solution
• Sharding not a solution to scale open source RDBMS platforms
– Application specific
– Labor intensive (re)partitioning
HBase as Hadoop Component
• HBase built on top of HDFS
• HBase files internally stored in HDFS
HBase Data Model
• Based on Google’s Bigtable model - Key-Value pairs
• HBase schema consists of several tables
• Each table consists of a set of column families
– Columns not part of schema
• Tables sorted by Row
[Figure: a cell is addressed by row key, column family and timestamp, and holds a value]
HBase Data Model
• Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns
• Table schema only defines its column families
– Each family has any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free.
– Columns within a family sorted and stored together
• Everything except table names is byte[]
• (Row, Family:Column, Timestamp) = Value
Components
• Region
– A subset of a table rows, like horizontal range partitioning
– Automatic
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
HBase Members
• Master
– Monitors region servers
– Load balancing for regions
– Redirect client to correct region servers
– Current SPOF
– Assigns regions, detects failures of Region
Servers
– Control admin function
• Slaves – Region Servers
– Region - A subset of table's rows
– Serves data for reads and writes
– Send Heartbeat to Master
HBase Regions
• Each HTable (column family) is partitioned horizontally into regions
– Regions are counterpart to HDFS blocks
Regions
• Contain an in-memory data store (MemStore) and a persistent data store (HFile)
• All regions on a region server share a reference to the write-ahead log (WAL)
which is used to store new data that hasn't yet been persisted to permanent
storage and to recover from region server crashes
• Each region holds a specific range of row keys, and when a region exceeds a
configurable size, HBase automatically splits the region into two child regions,
which is the key to scaling HBase
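Automatic region splitting can be sketched as follows. This is a simplified model, not HBase code: the `Region` class and `put` function are hypothetical, and the split threshold is a row count (`max_rows`) standing in for the configurable size on disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    start_key: str                      # inclusive lower bound ("" = open)
    end_key: str                        # exclusive upper bound ("" = open)
    rows: Dict[str, str] = field(default_factory=dict)

    def split(self) -> Tuple["Region", "Region"]:
        # Split at the middle row key into two child regions.
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left = Region(self.start_key, mid,
                      {k: self.rows[k] for k in keys if k < mid})
        right = Region(mid, self.end_key,
                       {k: self.rows[k] for k in keys if k >= mid})
        return left, right


def put(regions: List[Region], key: str, value: str, max_rows: int = 4) -> None:
    # Find the region whose key range contains the row key.
    for i, r in enumerate(regions):
        if key >= r.start_key and (r.end_key == "" or key < r.end_key):
            r.rows[key] = value
            if len(r.rows) > max_rows:
                # Replace the oversized region with its two children.
                regions[i:i + 1] = list(r.split())
            return


regions = [Region("", "")]
for k in ["a", "b", "c", "d", "e"]:
    put(regions, k, "v")
print([(r.start_key, r.end_key, sorted(r.rows)) for r in regions])
```

Each child region covers a contiguous range of row keys, so a lookup still needs to consult only one region.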
Regions
Logical View
Column Families
Each row has a Key
Each record is divided into Column Families
Each column family consists of one or more Columns
Column Families
HBase vs. HDFS
• Both distributed systems that scale to hundreds or thousands
of nodes
• HDFS is good for batch processing (scans over big files)
– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates
HBase vs. HDFS
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)
• HBase updates are done by creating new versions of values
HBase vs. HDFS
• If application has neither random reads nor writes, stick to HDFS
HBase vs. RDBMS
When to Use HBase
• Random read, write or both are required
• Need to do many thousands of operations per second on multiple TB of
data
• Access patterns are well-known and simple
Row key | Timestamp | Column “contents:” | Column “anchor:”
“com.apache.www” | t12 | “<html>…” |
“com.apache.www” | t11 | “<html>…” |
“com.apache.www” | t10 | | “anchor:apache.com” = “APACHE”
“com.cnn.www” | t15 | | “anchor:cnnsi.com” = “CNN”
“com.cnn.www” | t13 | | “anchor:my.look.ca” = “CNN.com”
“com.cnn.www” | t6 | “<html>…” |
“com.cnn.www” | t5 | “<html>…” |
“com.cnn.www” | t3 | “<html>…” |
Column family named “Contents”
Column family named “anchor”
Column named “apache.com”
• Key
– Byte array
– Serves as the primary key for
the table
– Indexed for fast lookup
• Column Family
– Has a name (string)
– Contains one or more related
columns
• Column
– Belongs to one column family
– Included inside the row
• familyName:columnName
[Figure: the table above repeated, annotated with “Version number for each row value”]
• Version Number
– Unique within each key
– By default - System timestamp
– Data type is Long
• Value (Cell)
– Byte array
Data Model
• Version number can be user-supplied
– Even does not have to be inserted in increasing order
– Version numbers are unique within each key
• Table can be very sparse
– Many cells are empty
• Keys are indexed as the primary key
Has two columns
[cnnsi.com & my.look.ca]
Physical Model
• Each column family is stored in separate files (called HFiles)
• Key & version numbers are replicated with each column family
• Empty cells are not stored
HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>
Architecture
Zookeeper and HBase
• HBase depends on Zookeeper
• To manage master election and
server availability, Zookeeper used
• Set up a cluster, provides distributed
coordination primitives
• A tool for building cluster
management systems
Connecting to HBase
• Java client
– get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, C++, & java (via thrift) clients
– REST server hosts HBase client
• TableInput / OutputFormat for MapReduce
– HBase as MR source or sink
• HBase Shell
– JRuby IRB with “DSL” to add get, scan, and admin
– ./bin/hbase shell YOUR_SCRIPT
Apache Thrift
• $ hbase-daemon.sh start thrift
• $ hbase-daemon.sh stop thrift
• High performance, scalable, cross-language serialization and RPC framework
• Created at Facebook along with Cassandra
• A cross-language, service-generation framework
• Binary Protocol (like Google Protocol Buffers)
• Compiles to: C++, Java, Python, PHP, Ruby, Perl, …
HBase API
• get(key) – Extract value given a key
– get(row)
• put(key, value) - Create or update the value given its key
– put(row, Map<column, value>)
• delete(key) -- Remove the key and its associated value
• execute(key, operation, parameters)
– operate on value given a key
– List, Set, Map…
Hive HBase Integration
• Reasons to use Hive on HBase
– Large data in HBase for use in a real-time environment, but never used for analysis
– Give access to data in HBase usually only queried through MapReduce to people
that don’t code (business analysts)
– When needing a more flexible storage solution, so that rows can be updated live
by either a Hive job or an application and can be seen immediately to the other
• Reasons not to do it
– Run SQL queries on HBase to answer live user requests (it’s still a MR job)
– Hoping to see interoperability with other SQL analytics systems
Hive HBase Integration
HBase - Benefits
• Distributed storage
• Table-like in data structure - Multi-dimensional map
• High scalability, availability and performance
• No real indexes
• Automatic partitioning
• Scale linearly and automatically with new nodes
• Commodity hardware
• Fault tolerance
• Batch processing
HBase Limitations
• Tables have one primary index / key , the row key
• Each row can have any number of columns
• Table schema only defines column families (column family can have any
number of columns)
• Each cell value has a timestamp
• No join operators
• Scans and queries can select a subset of available columns using a
wildcard
HBase Limitations
• Lookups
– Fast lookup using row key and optional timestamp
– Full table scan
– Range scan from region start to end
• Limited atomicity and transaction support
– Supports multiple batched mutations of single rows only
– Data is unstructured and un-typed
• No access via SQL
– Programmatic access - Java, Thrift (Ruby, PHP, Python, Perl, C++, ..), HBase Shell
REDIS
Redis NoSQL Database
• Redis is an open source, advanced key-value data store
• Often referred to as a data structure server since keys can contain strings,
hashes, lists, sets and sorted sets
• Redis works with an in-memory dataset
• It is possible to persist dataset either by
– dumping the dataset to disk every once in a while
– or by appending each command to a log
Redis NoSQL Database
• Distributed data structure server
• Consistent hashing at client
• Non-blocking I/O, single threaded
• Values are binary safe strings: byte strings
• String : Key/Value Pair, set/get. O(1) for many string operations.
• Lists: lpush, lpop, rpush, rpop - use as stack or queue. O(1)
Redis NoSQL Database
• Publisher/Subscriber model
• Set: collection of unique elements - add, pop, union, intersection - set operations.
• Sorted set: unique elements sorted by scores. O(logn). Range operations
• Hash: multiple key/value pairs
– HMSET user:1 username foo password bar age 30
– HGET user:1 age
Architecture
Redis Keys
• Keys are binary safe - it is possible to use any binary sequence as a key
• The empty string is also a valid key
• Too long keys are not a good idea
• Too short keys are often also not a good idea ("u:1000:pwd" versus
"user:1000:password")
• Nice idea is to use some kind of schema, like: "object-type:id:field"
Redis Data Types
• Redis is often referred to as a data structure server since keys
can contain
– Strings
– Lists
– Sets
– Hashes
– Sorted Sets
Redis Strings
• Most basic kind of Redis value
• Binary safe - can contain any kind of data, for instance a JPEG image or a
serialized Ruby object
• Max 512 Megabytes in length
• Can be used as atomic counters using commands in the INCR family
• Can be appended with the APPEND command
Redis Strings - Example
Redis Lists
• Lists of strings, sorted by insertion order
• Add elements to a Redis List pushing new elements on the head (on the left) or on
the tail (on the right) of the list
• Max length: (2^32 - 1) elements
• Model a timeline in a social network, using LPUSH to add new elements, and
using LRANGE in order to retrieve recent items
• Use LPUSH together with LTRIM to create a list that never exceeds a given
number of elements
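The LPUSH plus LTRIM pattern for a capped list can be sketched locally. The two functions below are stand-ins that mimic the semantics of the Redis commands on a Python list (for non-negative indexes only); they are not a Redis client, and no server is involved.

```python
from typing import List


def lpush(lst: List[str], value: str) -> None:
    # LPUSH semantics: the new element goes to the head (left).
    lst.insert(0, value)


def ltrim(lst: List[str], start: int, stop: int) -> None:
    # LTRIM semantics for non-negative indexes: keep [start, stop], inclusive.
    lst[:] = lst[start:stop + 1]


timeline: List[str] = []
for post in ["p1", "p2", "p3", "p4", "p5"]:
    lpush(timeline, post)
    ltrim(timeline, 0, 2)   # keep only the 3 most recent items

print(timeline)
```

Trimming after every push keeps the list bounded, which is why the pattern suits timelines of recent items.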
Redis Lists - Example
Redis Sorted Sets
• Every member of a Sorted Set is associated with a score, which is used to
keep the sorted set ordered, from the smallest to the greatest
score
• You can do a lot of tasks with great performance that are really hard to
model in other kind of databases
• Probably the most advanced Redis data type
Redis Hashes
• Map between string fields and string values
• Perfect data type to represent objects
HMSET user:1000 username antirez password P1pp0 age 34
HGETALL user:1000
HSET user:1000 password 12345
HGETALL user:1000
Redis Operations
• It is possible to run atomic operations on data types:
• Appending to a string
• Incrementing the value in a hash
• Pushing to a list
• Computing set intersection, union and difference
• Getting the member with highest ranking in a sorted set
CASSANDRA
Cassandra
• Structured Storage System over a P2P Network
• Was created to power the Facebook Inbox Search
• Facebook open-sourced Cassandra in 2008, and it became an Apache
Incubator project
• In 2010, Cassandra graduated to a top-level project, regular update and
releases followed
Cassandra
• High availability
• Designed to handle large amount of data across multiple servers
• Eventual consistency - trade-off strong consistency in favor of high
availability
• Incremental scalability
• Optimistic Replication
Cassandra
• “Knobs” to tune tradeoffs between consistency, durability and latency
• Low total cost of ownership
• Minimal administration
• Tunable consistency
• Decentralized - No single point of failure
• Writes faster than reads
• Uses consistent hashing (logical partitioning) when clustered.
Cassandra
• Hinted handoffs
• Peer to peer routing(ring)
• Thrift API
• Multi data center support
• Mimics traditional relational database systems, but with triggers and
lightweight transactions
• Raw, simple data structures
Features
• Emphasis on performance over analysis
– Still supports analysis tools like Hadoop
• Organization
– Rows are organized into tables
– First component of a table’s primary key is the partition key
– Rows clustered by the remaining columns of the key
– Columns may be indexed separately from the primary key
– Tables may be created, dropped, altered at runtime without blocking queries
Features
• Language
– CQL (Cassandra Query Language) introduced, similar to SQL (flattened
learning curve)
• Peer-to-Peer cluster
– Decentralized design
• Each node has the same role
– No single point of failure
• Avoids issues of master-slave DBMS’s
– No bottlenecking
Comparisons
Feature | Apache Cassandra | Google Big Table | Amazon DynamoDB
Storage Type | Column | Column | Key-Value
Best Use | Write often, read less | Designed for large scalability | Large database solution
Concurrency Control | MVCC | Locks | ACID
Characteristics | High Availability, Partition Tolerance, Persistence | Consistency, High Availability, Partition Tolerance, Persistence | Consistency, High Availability
Key Point – Cassandra offers a healthy cross between BigTable and Dynamo.
Cassandra History
Google Bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: HBase, Hypertable
Amazon Dynamo (2007)
• O(1) DHT
• consistency model: client tunable
• clones: Riak, Voldemort
Cassandra ~= Bigtable + Dynamo
Architecture Overview
• Cassandra was designed with the understanding that system/ hardware
failures can and do occur
• Peer-to-peer, distributed system
• All nodes are the same
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Read/Write-anywhere design
Architecture Overview
Architecture Overview
• Google BigTable - data model
– Column Families
– Memtables
– SSTables
• Amazon Dynamo - distributed systems technologies
– Consistent hashing
– Partitioning
– Replication
– One-hop routing
Architecture
Transparent Elasticity
• Nodes can be added and removed from Cassandra online,
with no downtime being experienced.
[Figure: a 6-node ring (nodes 1-6) grows to a 12-node ring (nodes 1-12)]
Transparent Scalability
• Addition of Cassandra nodes increases performance linearly
and ability to manage TB’s-PB’s of data
[Figure: a 6-node ring with performance throughput = N grows to a 12-node ring with performance throughput = N x 2]
High Availability
• Cassandra has no single point of failure due to peer-to-peer
architecture
Multi-Geography - Zone Aware
Cassandra allows a single logical database to span 1-N datacenters that are
geographically dispersed. Also supports a hybrid on-premise/Cloud
implementation
Partitioning
• Nodes are logically structured in a ring topology
• Hashed value of the key associated with a data partition is used to assign it to a
node in the ring
• Hashing wraps around after a certain value to support the ring structure
• Lightly loaded nodes move position to alleviate highly loaded nodes
Partitioning
Data Redundancy
• Cassandra allows for customizable data redundancy so that data is
completely protected
• Supports rack awareness (data can be replicated between different racks
to guard against machine/rack failures)
• Uses Zookeeper to choose a leader which tells nodes the range they are
replicas for
Data Redundancy
Operations
• A client issues a write request to a random node in the Cassandra cluster
• Partitioner determines the nodes responsible for the data
• Locally, write operations are logged and then applied to an in-memory
version
• Commit log is stored on a dedicated disk local to the machine
• Relies on local file system for data persistency
Operations
• Write operations happen in 2 steps
– Write to commit log in local disk of the node
– Update in-memory data structure.
– Why 2 steps, and does the order of execution matter?
• Read operation
– Looks up the in-memory data structure first before looking up files on disk.
– Uses Bloom Filter (summarization of keys in file store in memory) to
avoid looking up files that do not contain the key
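A Bloom filter of the kind described above can be sketched in a few lines. This is a minimal illustration, not Cassandra's implementation: the class name, the bit-array size, the number of hash functions and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    sstable_filter.add(key)

print(sstable_filter.might_contain("user:2"))
```

A read can skip any file whose filter answers False for the key. A True answer may be a false positive, so the file must still be checked.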
Consistency
• Read Consistency
– Number of nodes that must agree before read request returns
– ONE to ALL
• Write Consistency
– Number of nodes that must be updated before a write is considered successful
– ANY to ALL
– At ANY, a hinted handoff is all that is needed to return.
• QUORUM
– Commonly used middle-ground consistency level
– Defined as (replication_factor / 2) + 1
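The QUORUM definition above uses integer division, which a short sketch makes explicit. The function names are hypothetical. The overlap check (read nodes + write nodes > replication factor) is the commonly cited condition for a read to see the latest write; it is added here as context and is not stated in the slides.

```python
def quorum(replication_factor: int) -> int:
    # (replication_factor / 2) + 1, with the division rounded down.
    return replication_factor // 2 + 1


def read_sees_latest_write(read_nodes: int, write_nodes: int,
                           replication_factor: int) -> bool:
    # The read set and write set must overlap in at least one replica.
    return read_nodes + write_nodes > replication_factor


for rf in (1, 3, 5, 6):
    q = quorum(rf)
    print(rf, q, read_sees_latest_write(q, q, rf))
```

With a replication factor of 3, a quorum is 2 nodes, so quorum reads and quorum writes always share at least one replica, while reading and writing at ONE do not.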
Hinted Handoff Write
• Write intended for a node
that is offline
• An online node, processing
the request, makes a note
to carry out the write once
the node comes back online
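A hinted handoff can be sketched with two toy nodes. The `Node` class, `write` and `replay_hints` are hypothetical names for this illustration; a real system would persist hints and replay them when gossip reports that the target is back.

```python
from typing import Dict, List, Tuple


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.data: Dict[str, str] = {}
        # Hints held on behalf of offline peers: (target, key, value).
        self.hints: List[Tuple["Node", str, str]] = []


def write(coordinator: Node, target: Node, key: str, value: str) -> None:
    if target.online:
        target.data[key] = value
    else:
        # Target is down: note the write so it can be replayed later.
        coordinator.hints.append((target, key, value))


def replay_hints(coordinator: Node) -> None:
    remaining = []
    for target, key, value in coordinator.hints:
        if target.online:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


c, f = Node("C"), Node("F")
f.online = False
write(c, f, "k1", "v1")      # stored as a hint on C
f.online = True
replay_hints(c)              # F receives the missed write
print(f.data, c.hints)
```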
Write Properties
• No locks in the critical path
• Sequential disk access
• Behaves like a write back Cache
• Append support without read ahead
• Atomicity guarantee for a key
• Always Writable
– accept writes during failure scenarios
Write Operations
• Stages
– Logging data in the commit log
– Writing data to the memtable
– Flushing data from the memtable
– Storing data on disk in SSTables
• Commit Log
– First place a write is recorded
– Crash recovery mechanism
– Write not successful until recorded in commit log
– Once recorded in commit log, data is written to Memtable
Write Operations
• Memtable
– Data structure in memory
– Once memtable size reaches a threshold, it is flushed (appended) to SSTable
– Several may exist at once (1 current, any others waiting to be flushed)
– First place read operations look for data
• SSTable
– Kept on disk
– Immutable once written
– Periodically compacted for performance
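The write path described above (commit log, memtable, flush to SSTable) can be sketched as a toy single-node store. The class `StorageNode` and the tiny `memtable_limit` are assumptions for illustration; everything lives in memory, compaction is omitted, and the commit log is never truncated.

```python
from typing import Dict, List, Optional, Tuple


class StorageNode:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []   # append-only, for crash recovery
        self.memtable: Dict[str, str] = {}            # in-memory, mutable
        self.sstables: List[Dict[str, str]] = []      # stand-in for immutable disk files
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))          # step 1: log first
        self.memtable[key] = value                    # step 2: then update memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable as a new SSTable, sorted by key.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:                      # memtable is checked first
            return self.memtable[key]
        for table in reversed(self.sstables):         # newest SSTable wins
            if key in table:
                return table[key]
        return None


node = StorageNode()
node.write("a", "1")
node.write("b", "2")     # reaches the limit, triggers a flush
node.write("a", "3")     # newer value lives in the memtable
print(node.read("a"), node.read("b"), len(node.sstables))
```

Logging before updating memory means a crash after step 1 can be recovered by replaying the log, which is why the order of the two steps matters.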
Write Operations
Read Repair
• On read, nodes are queried until the number of nodes which respond with
the most recent value meet a specified consistency level from ONE to
ALL
• If the consistency level is not met, nodes are updated with the most
recent value which is then returned
• If the consistency level is met, the value is returned and any nodes that
reported old values are then updated
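Read repair can be sketched as "pick the newest version, then fix the stale replicas". This simplified version queries every replica and ignores the consistency level; the `Replica` type and `read_with_repair` function are hypothetical, and the sketch assumes at least one replica holds the key.

```python
from typing import Dict, List, Tuple

# Each replica maps key -> (timestamp, value).
Replica = Dict[str, Tuple[int, str]]


def read_with_repair(replicas: List[Replica], key: str) -> str:
    # Collect every replica's version and pick the newest timestamp.
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions, key=lambda tv: tv[0])
    # Repair: overwrite replicas that miss the key or hold an older version.
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest
    return newest[1]


replicas: List[Replica] = [
    {"k": (2, "new")},
    {"k": (1, "old")},
    {},
]
print(read_with_repair(replicas, "k"))
```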
Read Repair
Delete Operations
• Tombstones
– On delete request, records are marked for deletion
– Similar to Recycle Bin
– Data is actually deleted on major compaction or configurable timer
Gossip Protocols
• Used to discover location and state information about the
other nodes participating in a Cassandra cluster
• Network communication protocols inspired by real-life
rumor spreading
• Periodic, Pairwise, inter-node communication
• Low frequency communication ensures low cost
Gossip Protocols
• Random selection of peers
• Example – Node A wishes to search for a pattern in data
– Round 1 – Node A searches locally and then gossips with node B
– Round 2 – Nodes A and B gossip with C and D
– Round 3 – Nodes A, B, C and D gossip with 4 other nodes ……
• Round by round doubling makes protocol very robust
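The round-by-round doubling can be checked with a small calculation. This is an idealized model, assumed for illustration: every informed node reaches one new node per round. With truly random peer selection some messages hit already-informed nodes, so real convergence takes somewhat more rounds while still growing logarithmically.

```python
def rounds_to_inform(num_nodes: int) -> int:
    informed, rounds = 1, 0
    while informed < num_nodes:
        # Every informed node tells one node that has not heard yet.
        informed = min(num_nodes, informed * 2)
        rounds += 1
    return rounds


for n in (2, 8, 1000):
    print(n, rounds_to_inform(n))
```

Under this model 1000 nodes are reached in 10 rounds, since 2^10 = 1024.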
Failure Detection
• Gossip process tracks heartbeats from other nodes both directly and indirectly
• Node Fail state is given by variable Φ
– tells how likely a node might fail (suspicion level) instead of simple binary value (up/down).
• This type of system is known as Accrual Failure Detector
• Takes into account network conditions, workload, or other conditions that might
affect perceived heartbeat rate
• A threshold for Φ is used to decide if a node is dead
– If node is correct, phi will be constant set by application.
– Generally Φ(t) = 0
Failure Detection
• Uses Scuttlebutt (a Gossip protocol) to manage nodes
• Uses gossip for node membership and to transmit system control state
• Lightweight with mathematically provable properties
• State disseminated in O(logN) rounds where N is the number of nodes in
the cluster.
• Every T seconds each member increments its heartbeat counter and
selects one other member to send its list to.
• A member merges the list with its own list.
Accrual Failure Detector
• Valuable for system management, replication, load balancing etc
• Node Fail state is given by variable ‘phi’ which tells how likely a
node might fail (suspicion level) instead of simple binary value
(up/down)
• Defined as a failure detector that outputs a value, PHI, associated
with each process.
• Also known as Adaptive Failure detectors - designed to adapt to
changing network conditions
Accrual Failure Detector
• The value output, PHI, represents a suspicion level
• Applications set an appropriate threshold, trigger suspicions
and perform appropriate actions
• In Cassandra the average time taken to detect a failure is 10-15
seconds with the PHI threshold set at 5
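The suspicion level can be sketched under one simplifying assumption: heartbeat gaps follow an exponential distribution whose mean is estimated from recent intervals. PHI is then the negative base-10 logarithm of the probability that a gap lasts longer than the time already elapsed. The function name and sample numbers are hypothetical, and this is not Cassandra's actual code.

```python
import math
from typing import List


def phi(time_since_last_heartbeat: float, intervals: List[float]) -> float:
    # Assumption: gaps are exponential with the observed mean, so
    # P(gap > t) = exp(-t / mean) and phi = -log10(P) = t / (mean * ln 10).
    mean = sum(intervals) / len(intervals)
    return time_since_last_heartbeat / (mean * math.log(10))


observed = [1.0, 1.1, 0.9, 1.0]      # seconds between recent heartbeats
threshold = 5.0                       # threshold value quoted above
for elapsed in (1.0, 5.0, 12.0):
    value = phi(elapsed, observed)
    print(elapsed, round(value, 2), "suspect" if value > threshold else "alive")
```

Because the mean is re-estimated from observed intervals, the same silence produces a lower PHI on a slow network than on a fast one, which is the adaptive behavior described above.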
Performance Benchmark
• Loading of data - limited by network bandwidth
• Read performance for Inbox Search in production
Statistic | Search Interactions | Term Search
Min | 7.69 ms | 7.78 ms
Median | 15.69 ms | 18.27 ms
Average | 26.13 ms | 44.41 ms
Throughput Benchmark
Data Model
• Column: smallest data element, a tuple with a name and a value
• Querying (:Rockets, '1') might return:
{
'name' => 'Rocket-Powered Roller Skates',
'toon' => 'Ready Set Zoom',
'inventoryQty' => '5',
'productUrl' => 'rockets1.gif'
}
Data Model
• ColumnFamily - There’s a single structure used to group both the
Columns and SuperColumns. Called a ColumnFamily (think table), it has
two types, Standard & Super.
– Column families must be defined at startup
• Key - the permanent name of the record
• Keyspace - the outer-most level of organization.This is usually the name
of the application. For example, ‘Acme' (think database name)
Data Model
• Optional super column: a named list. A super column contains standard columns,
stored in recent order
• Suppose OtherProducts has inventory in categories
• Querying (:OtherProducts, '174927') might return
– {'OtherProducts' => {'name' => 'Acme Instant Girl', ..}, 'foods' => {...}, 'martian' => {...},
'animals' => {...}}
• In the example, foods, martian, and animals are all super column names
• They are defined on the fly, and there can be any number of them per row.
:OtherProducts would be the name of the super column family
Data Model
• Columns and SuperColumns are both tuples with a name and a value. The key difference is that a standard Column's
value is a "string", whereas a SuperColumn's value is a map of Columns
• Columns are always sorted by their name. Sorting supports:
– BytesType
– UTF8Type
– LexicalUUIDType
– TimeUUIDType
– AsciiType
– LongType
• Each of these options treats the Columns' name as a different data type
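One way to picture the data model above is as nested maps: keyspace, then column family, then row key, then columns. The sketch below uses plain Python dictionaries with the Acme example data from the slides; the helper `get_row` is a made-up name, and ordering by Python string comparison only approximates a comparator such as UTF8Type.

```python
# keyspace -> column family -> row key -> {column name: value}
acme = {
    "Rockets": {                      # standard column family
        "1": {
            "name": "Rocket-Powered Roller Skates",
            "toon": "Ready Set Zoom",
            "inventoryQty": "5",
            "productUrl": "rockets1.gif",
        },
    },
    "OtherProducts": {                # super column family: one extra level of nesting
        "174927": {
            "OtherProducts": {"name": "Acme Instant Girl"},
            "foods": {},              # contents elided in the slide
            "martian": {},
            "animals": {},
        },
    },
}

def get_row(keyspace, column_family, key):
    """Return the row's columns ordered by column name, as a list of (name, value) pairs."""
    row = keyspace[column_family][key]
    return sorted(row.items())

for name, value in get_row(acme, "Rockets", "1"):
    print(name, "=>", value)
```

In a super column family the values at the row level are themselves maps of columns, which is the only structural difference from the standard case.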
Tunable Consistency
• Cassandra has tunable read and write consistency
• Any - Ensure that the write is written to at least 1 node (a hinted handoff counts)
• One - Ensure that the write is written to at least 1 node's commit log and memory
table before acknowledging the client
• Quorum - Ensure that the write goes to (N / 2) + 1 replicas, where N is the replication factor
• All - Ensure that the write goes to all replicas. An unresponsive replica would fail the write
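The arithmetic behind these levels is small enough to write out. The sketch below is a simplified model, not driver code: the function names are invented, and the ANY level is left out because a hint stored on a non-replica node can satisfy it.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1

def required_acks(level, replication_factor):
    """Replica acknowledgements needed before the operation is reported as successful."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level in this sketch: {level}")

def read_sees_latest_write(read_level, write_level, replication_factor):
    """R + W > N means the read set and the write set share at least one replica."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor

print(quorum(3), quorum(5))
print(read_sees_latest_write("QUORUM", "QUORUM", 3))
print(read_sees_latest_write("ONE", "ONE", 3))
```

Choosing QUORUM for both reads and writes is the usual way to get overlapping replica sets while still tolerating the loss of a minority of replicas.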
Consistent Hashing
[Figure: hash ring with positions labelled A, H, D, B, M, V, S, R, C]
• Partition using consistent hashing
– Keys hash to a point on a fixed circular space
– The ring is partitioned into a set of ordered slots,
and both servers and keys are hashed onto these slots
• Nodes take positions on the circle.
• A, B, and D exist.
– B is responsible for the AB range.
– D is responsible for the BD range.
– A is responsible for the DA range.
• C joins.
– The BD range is split between C and D.
– C gets the BC range from D.
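The join scenario above can be reproduced with a small ring. This is a sketch under stated assumptions: the `Ring` class, the use of MD5 truncated to 32 bits, and the `user…` key names are illustrative choices, not how Cassandra's partitioners are implemented.

```python
import bisect
import hashlib

def ring_position(name):
    """Hash a node name or key to a point on the circle (here a 32-bit integer space)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, nodes=()):
        self._points = []            # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._points, (ring_position(node), node))

    def owner(self, key):
        """First node at or after the key's position, wrapping around the circle."""
        i = bisect.bisect_left(self._points, (ring_position(key), ""))
        return self._points[i % len(self._points)][1]

keys = [f"user{i}" for i in range(1000)]
ring = Ring(["A", "B", "D"])
before = {k: ring.owner(k) for k in keys}
ring.add("C")                        # C joins the ring
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to C: {all(after[k] == 'C' for k in moved)}")
```

The point of the design is visible in the last line: when a node joins, only keys in the range it takes over change owner, and every other key stays where it was.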
Key-Value Model
• Cassandra is a column-oriented NoSQL system
• Column families: sets of key-value pairs
– a column family is like a table and its key-value pairs are like rows
(using the relational database analogy)
• A row is a collection of columns labeled with a name
Cassandra Row
• The value of a row is itself a sequence of key-value pairs
• Such nested key-value pairs are columns
• Key = column name
• A row must contain at least 1 column
Example of Columns
Column Names Storing Values
• Key: user ID
• Column names store tweet ID values
• Values of all column names are set to "-" (empty byte array) as
they are not used
Key Space
• A keyspace groups column families together. It is only a logical
grouping of column families and provides an isolated scope for
names
Comparison with RDBMS
• With RDBMS, a normalized data model is created without
considering the exact queries
– SQL can return almost anything through joins
• With C*, the data model is designed for specific queries
– the schema is adjusted as new queries are introduced
• C*: no joins, relationships, or foreign keys
– a separate table is used per query
– data required by multiple tables is denormalized across those tables
Compaction
• Compaction runs periodically to merge multiple SSTables
– Reclaims space
– Creates new index
– Merges keys
– Combines columns
– Discards tombstones
– Improves performance by minimizing disk seeks
• Types
– Major
– Read-only
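The merge step of compaction can be sketched as follows. This is a deliberately simplified model: each SSTable is a dictionary mapping a key to a single (timestamp, value) pair, the `TOMBSTONE` marker and the sample data are invented, and tombstones are dropped immediately, whereas a real system retains them for a grace period first.

```python
TOMBSTONE = object()   # marker written by a delete (assumption for this sketch)

def compact(sstables):
    """Merge SSTables given oldest first; each maps key -> (timestamp, value)."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            # The entry with the newest timestamp wins for each key.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop deleted keys and emit the result in key order, like a new SSTable.
    return {k: v for k, v in sorted(merged.items()) if v[1] is not TOMBSTONE}

old = {"alice": (1, "a@x.org"), "bob": (1, "b@x.org")}
new = {"bob": (2, TOMBSTONE), "carol": (2, "c@x.org")}
print(compact([old, new]))
```

After the merge, a read needs to consult one table instead of two, which is where the reduction in disk seeks comes from.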
Anti-Entropy
• Replica synchronization mechanism
• Ensures synchronization of data across nodes
• Compares data checksums against neighboring nodes
• Uses Merkle trees (Hash trees)
– Snapshot of data sent to neighboring nodes
– Created and broadcast on every major compaction
– If two nodes take snapshots within TREE_STORE_TIMEOUT of each other,
the snapshots are compared and the data is synced.
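To show why hash trees make the comparison cheap, here is a minimal sketch. It assumes both replicas hash the same non-empty, identically ordered list of rows; the function names, the use of SHA-256, and the sample rows are illustrative choices, not Cassandra's actual tree layout.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def merkle_levels(rows):
    """Build tree levels bottom-up from a list of (key, value) rows; the last level is [root]."""
    level = [_h(f"{k}={v}".encode()) for k, v in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last hash on odd-sized levels
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_leaves(a, b, level=None, index=0):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    if level is None:
        level = len(a) - 1
    if a[level][index] == b[level][index]:
        return []
    if level == 0:
        return [index]
    out = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[level - 1]):
            out += differing_leaves(a, b, level - 1, child)
    return out

replica_1 = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
replica_2 = [("k1", "v1"), ("k2", "v2"), ("k3", "stale")]
print(differing_leaves(merkle_levels(replica_1), merkle_levels(replica_2)))
```

If the two roots match, nothing else needs to be exchanged; otherwise only the rows under the differing leaves have to be synced.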
Anti-Entropy
Cassandra Query Language - CQL
• Creating a keyspace - a namespace of tables
CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
• To use the namespace:
USE demo;
CQL – Create Table
CREATE TABLE users (
    email varchar,
    bio varchar,
    birthday timestamp,
    active boolean,
    PRIMARY KEY (email));

CREATE TABLE tweets (
    email varchar,
    time_posted timestamp,
    tweet varchar,
    PRIMARY KEY (email, time_posted));
CQL
• Insert
– INSERT INTO users (email, bio, birthday, active) VALUES
('Tom.Stok@btx.com', 'StarTeammate', 516513612220, true);
– Timestamp fields are specified in milliseconds since epoch
• Query tables
– A SELECT expression reads one or more records from a Cassandra column family
and returns a result set of rows
– SELECT * FROM users;
– SELECT email FROM users WHERE active = true;
(active is not part of the primary key, so this needs a secondary index on active or ALLOW FILTERING)
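Since timestamp literals are milliseconds since the epoch, it can help to convert them when preparing or checking an INSERT. The helper names below are made up for this sketch; the value is the birthday used in the INSERT above.

```python
from datetime import datetime, timezone

def from_epoch_millis(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def to_epoch_millis(dt):
    """Convert a timezone-aware datetime back to milliseconds since the epoch."""
    return round(dt.timestamp() * 1000)

birthday = from_epoch_millis(516513612220)
print(birthday.isoformat())
```

The value from the example falls on 15 May 1986 (UTC).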
Cassandra Advantages
• Perfect for time-series data
• High performance
• Decentralization
• Nearly linear scalability
• Replication support
• No single points of failure
• MapReduce support
Cassandra Weaknesses
• No referential integrity
– no concept of JOIN
• Querying options for retrieving data are limited
• Sorting data is a design decision
– no GROUP BY
• No support for atomic operations
– if operation fails, changes can still occur
• First think about queries, then about data model
Key Points
• Cassandra is designed as a distributed database management system
– use it when you have a lot of data spread across multiple servers
• Cassandra write performance is always excellent, but read performance
depends on write patterns
– it is important to spend enough time to design proper schema around the
query pattern
• Having a high-level understanding of some internals is a plus
– it helps in designing a strong application on top of Cassandra
Hector – Java API for Cassandra
• Sits on top of Thrift
• Load balancing
• JMX monitoring
• Connection-pooling
• Failover
• JNDI integration with application servers
• Additional methods on top of the standard get, update, delete methods.
• Under discussion
– hooks into Spring declarative transactions
Memcached Database
• Key-Value Store
• Very easy to set up and use
• Consistent hashing
• Scales very well
• In memory caching, no persistence
• LRU eviction policy
• O(1) to set/get/delete
• Atomic operations set/get/delete
• No iterators, or iterating over keys is very difficult
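The combination of O(1) set/get/delete and LRU eviction listed above can be sketched with an ordered hash map. The `LRUCache` class is an illustration only; Memcached itself is written in C and manages memory in slabs, which this example ignores.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()              # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)             # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)      # evict the least recently used entry

    def delete(self, key):
        if key in self._items:
            del self._items[key]
            return True
        return False

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")          # "a" is now the most recently used entry
cache.set("c", 3)       # capacity exceeded, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Each operation touches only the hash table and the ends of the ordering, which keeps the cost constant regardless of how many items are cached.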
MONGODB
MongoDB
• Publicly released in 2009
• Allows data to persist in a nested state
• Query that nested data in an ad hoc fashion
• Enforces no schema
• Documents can optionally contain fields or types that no other document
in the collection contains
• NoSQL
MongoDB
MongoDB
• Document-oriented database
• Uses BSON format – Binary JSON
• An instance may have zero or more databases
• A database may have zero or more collections
• A collection may have zero or more documents
• A document may have one or more fields
• Indexes function like RDBMS counterparts
MongoDB
• Data types: bool, int, double, string, object(bson), oid, array, null, date
• Database and collections created automatically
• Language Drivers
• Capped collections are fixed-size collections that act as very fast FIFO buffers,
good for logs. No indexes
• Object IDs are generated by the client as 12 bytes of packed data: 4-byte time,
3-byte machine, 2-byte pid, 3-byte counter
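The 12-byte layout listed above can be packed and unpacked as shown below. This is a sketch of the layout as the slide describes it, not the driver's code: the function names and sample field values are invented, big-endian byte order for every field is an assumption, and newer MongoDB versions fill the machine and pid bytes with a random value instead.

```python
import struct

def pack_object_id(timestamp, machine, pid, counter):
    """Pack the four fields into 12 bytes: 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter."""
    return (
        struct.pack(">I", timestamp)
        + machine.to_bytes(3, "big")
        + struct.pack(">H", pid)
        + counter.to_bytes(3, "big")
    )

def unpack_object_id(raw):
    """Split a 12-byte id back into its fields."""
    return {
        "timestamp": struct.unpack(">I", raw[0:4])[0],
        "machine": int.from_bytes(raw[4:7], "big"),
        "pid": struct.unpack(">H", raw[7:9])[0],
        "counter": int.from_bytes(raw[9:12], "big"),
    }

oid = pack_object_id(timestamp=1_400_000_000, machine=0xABCDEF, pid=4242, counter=7)
print(oid.hex(), len(oid))
print(unpack_object_id(oid))
```

Putting the timestamp first means ids generated later sort after earlier ones, which is convenient for the always-present index on the id field.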
MongoDB
• Possible to refer to other documents in different collections, but more
efficient to embed documents
• Replication is easy to set up. Reads can be served from slaves
• Supports aggregation
– Map Reduce with JavaScript
• Indexes are B-Trees. Ids are always indexed
MongoDB
• Updates are atomic. Low contention locks
• Querying mongo is done with a document
– Lazy, returns a cursor
– Reducible to SQL: select, insert, update, limit, sort; upsert (either inserts or
updates)
– Operators - $ne, $and, $or, $lt, $gt, $inc (a negative amount decrements)
• Repository Pattern for easy development
MongoDB
• Full Index Support
• Replication & High Availability
• Auto-Sharding
• Querying
• Fast In-Place Updates
• Map/Reduce
Architecture
Comparison
RDBMS          MongoDB
Database       Database
Table, View    Collection
Row            Document (JSON, BSON)
Column         Field
Index          Index
Join           Embedded Document
Foreign Key    Reference
Partition      Shard
CRUD
• Create
– db.collection.insert( <document> )
– db.collection.save( <document> )
– db.collection.update( <query>, <update>, { upsert: true } )
• Read
– db.collection.find( <query>, <projection> )
– db.collection.findOne( <query>, <projection> )
• Update
– db.collection.update( <query>, <update>, <options> )
• Delete
– db.collection.remove( <query>, <justOne> )
Commands
// Create a doc and save it into a collection
p = {firstname:"Dave", lastname:"Ho"}
db.person.save(p)
db.person.insert({firstname:"Ricky", lastname:"Ho"})
// Show all docs within a collection
db.person.find()
// Iterate over the result using a cursor
var c = db.person.find()
p1 = c.next()
p2 = c.next()
Commands
// Query
p3 = db.person.findOne({lastname:"Ho"})
// Return a subset of fields (i.e. a projection)
db.person.find({lastname:"Ho"}, {firstname:true})
// Delete some records
db.person.remove({firstname:"Ricky"})
// Build an index for a collection
db.person.ensureIndex({firstname:1})
Commands
// Show all existing indexes
db.person.getIndexes()
// Remove an index
db.person.dropIndex({firstname:1})
// An index can be built on a path within the doc
db.person.ensureIndex({"address.city":1})
// A composite key can be used to build an index
db.person.ensureIndex({lastname:1, firstname:1})
Commands
// Data update and transaction: to update an existing doc, we can do the following
var p1 = db.person.findOne({lastname:"Ho"})
p1["address"] = "San Jose"
db.person.save(p1)
// Do the same in one command (third argument: upsert = false, fourth: multi = true)
db.person.update({lastname:"Ho"}, {$set:{address:"San Jose"}}, false, true)
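To make the idea of "querying with a document" concrete, the sketch below evaluates a query document against in-memory dictionaries and applies a `$set` update. It is a toy model, not MongoDB's query engine: the function names, the supported operators ($ne, $lt, $gt only), and the `age` field are assumptions made for the example.

```python
OPERATORS = {
    "$ne": lambda actual, expected: actual != expected,
    "$lt": lambda actual, expected: actual is not None and actual < expected,
    "$gt": lambda actual, expected: actual is not None and actual > expected,
}

def matches(doc, query):
    """True if doc satisfies every field condition in the query document."""
    for field, condition in query.items():
        actual = doc.get(field)
        if isinstance(condition, dict):
            if not all(OPERATORS[op](actual, arg) for op, arg in condition.items()):
                return False
        elif actual != condition:
            return False
    return True

def update(collection, query, changes, multi=False):
    """Apply a {"$set": {...}} update to matching documents; returns how many were changed."""
    count = 0
    for doc in collection:
        if matches(doc, query):
            doc.update(changes["$set"])
            count += 1
            if not multi:
                break
    return count

people = [
    {"firstname": "Dave", "lastname": "Ho", "age": 40},    # ages are made-up values
    {"firstname": "Ricky", "lastname": "Ho", "age": 30},
]
print(update(people, {"lastname": "Ho"}, {"$set": {"address": "San Jose"}}, multi=True))
print([p["firstname"] for p in people if matches(p, {"age": {"$gt": 35}})])
```

The `multi` flag mirrors the fourth positional argument of the shell's update call above: without it only the first matching document is changed.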
MongoDB Sharding
• Config servers: keep the mapping of data to shards
• Mongos: Routing servers
• Mongod: master-slave replicas
References
• NoSQL - Your Ultimate Guide to the Non-Relational Universe! http://nosql-database.org/links.html
• NoSQL (RDBMS) http://en.wikipedia.org/wiki/NoSQL
• PODC Keynote, July 19, 2000. Towards Robust Distributed Systems. Dr. Eric A. Brewer, Professor, UC Berkeley, Co-Founder
& Chief Scientist, Inktomi
www.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• http://planetcassandra.org/functional-use-cases/
• http://marsmedia.info/en/cassandra-pros-cons-and-model.php
• http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra
• http://wiki.apache.org/cassandra/CassandraLimitations
• "Brewer's CAP Theorem", posted by Julian Browne, January 11, 2009.
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
• "Scalable SQL", ACM Queue, Michael Rys, April 19, 2011
http://queue.acm.org/detail.cfm?id=1971597
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode

NoSql

  • 1. NoSQL Technologies HBase | Cassandra | MongoDB | Redis Girish Khanzode
  • 2. Contents • NoSQL – Horizontal Scalability – CAP Theorem – Gossip Protocol & Hinted Handoffs • Hbase – HBase Data Model – HBase Regions – Column Families – HBase API • Redis NoSQL Database • Cassandra – Architecture Overview – Partitioning – Write Properties – Gossip Protocols – Accrual Failure Detector – Data Model – Tunable Consistency • CQL • Memcached Database • MongoDB • References
  • 3. NoSQL • Not Only SQL • Class of non-relational data storage systems • Usually no fixed table schema • No concept of joins • Relax one or more of the ACID properties
  • 4. NoSQL • Column StoreType – Each storage block contains data from only one column – More efficient than row (or document) store if • Multiple row/record/documents are inserted at the same time so updates of column blocks can be aggregated • Retrievals access only some of the columns in a row/record/document • Document Store Type – stores documents made up of tagged elements • Key-Value Store Type – Hash table of keys • Graph DatabasesType
  • 5. Categories • Key-Value Store – Big HashTable of keys & values • Products – Memcached – Membase – Redis – Data structure server – Riak – Amazon Dynamo based – Amazon S3 (Dynamo)
  • 6. Categories • Schema-less - column-based, document-based, graph-based • Document-based Store - Stores documents made up of tagged elements (CouchDB, MongoDB) • Column-based Store - Each storage block contains data from only one column – Google BigTable – Cassandra – HBase • Graph-based - A network database that uses edges and nodes to represent and store data (Neo4J)
  • 8. RDBMS Scaling - Master-Slave • All writes are written to the master • All reads performed against the replicated slave databases • Critical reads may be incorrect as writes may not have been propagated down • Large data sets can pose problems as master needs to duplicate data to slaves
  • 9. RDBMS Scaling • Partition or Sharding – Scales well for both reads and writes – Not transparent, application needs to be partition-aware – Can no longer have relationships/joins across partitions – Loss of referential integrity across shards • Multi-Master replication • INSERT only, not UPDATES/DELETES • No JOINs, thereby reducing query time – Requires de-normalizing data • In-memory databases
  • 10. RDBMS Limitations • One size does not fit all • Impedance mismatch • Rigid schema design • Harder to scale • Replication • Difficult to join across multiple nodes • Can not easily handle data growth • Need a DBA
  • 11. RDBMS Limitations • Many issues while scaling up for massive datasets • Not designed for distributed computing • Expensive specialized hardware • Multi-node databases considered as solutions - Known as ‘scaling out’ or ‘horizontal scaling’ – Master-slave – Sharding
  • 12. Horizontal Scalability • Scale out • Easily add servers to existing system - Elastically scalable – Bugs, hardware errors, things fail all the time – Cost efficient • Non sharing • Use commodity/cheap hardware • Heterogeneous systems
  • 13. Horizontal Scalability • Controlled concurrency (avoids locks) • Service Oriented Architecture – Local states – Decentralized to reduce bottlenecks – Avoids single points of failure • Asynchronous • All nodes are symmetric
  • 15. NoSQL Database Features • Large data volumes • Scalable replication and distribution – Potentially thousands of machines – Potentially distributed around the world • Queries need to return answers quickly • CAP Theorem • Open source development • Key/Value
  • 16. NoSQL Database Features • Mostly query, few updates • Asynchronous Inserts & Updates • Schema-less • ACID transaction properties not needed – BASE • Schema-Less Stores – Richer model than key/value pairs – Eventual consistency – Distributed – Excellent performance and scalability – Downside - typically no ACID transactions or joins
  • 17. Key-Value Store • A simple Hash table • Read and write values using a key – Get(key), returns the value associated with the provided key – Put(key, value), associates the value with the key – Multi-get(key1, key2, .., keyN), returns the list of values associated with the list of keys – Delete(key), removes the entry for the key from the data store
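The four operations listed on this slide can be sketched with a plain Python dict acting as the hash table. This is a minimal illustration only: the class name `KeyValueStore` and the method names are assumptions chosen to mirror the slide, not the API of any particular product.

```python
from typing import Any, Dict, Iterable, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        # Return the value for the key, or None when the key is absent.
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        # Associate the value with the key, replacing any previous value.
        self._data[key] = value

    def multi_get(self, keys: Iterable[str]) -> List[Optional[Any]]:
        # Return the values in the same order as the requested keys.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Remove the entry; deleting a missing key is a no-op.
        self._data.pop(key, None)
```

A real store would add persistence and distribution of keys across nodes; the sketch only shows the access model.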
  • 18. Key-Value Store • Pros – Very fast – Scalable – Simple model – Distribute horizontally • Cons – Many data structures (objects) not easily modeled – As data volume rises, maintaining unique values as keys is difficult
  • 19. Document Store • The data is a collection of key-value pairs packaged as a document, in a store similar to a key-value store • The difference is that the stored values (documents) provide some structure and encoding of the managed data • XML, JSON (JavaScript Object Notation) and BSON (binary JSON objects) are some common standard encodings
  • 20. Column Store • Data stored in cells grouped in columns of data rather than rows • Columns logically grouped into column families • Families can contain a virtually unlimited number of columns that can be created at runtime or in the schema definition • Reads and writes are done using columns rather than rows • Benefit of storing data in columns is fast search/access and data aggregation • All the cells corresponding to a column are stored as a continuous disk entry, which makes search/access faster
  • 21. Column Store - Data Model • ColumnFamily - A single structure that can group Columns and SuperColumns • Key - permanent name of the record. Keys have different numbers of columns, so the database can scale in an irregular way • Key-space - Defines the outermost level of an organization, typically the name of the application • Column - Ordered list of elements - a tuple with a name and a value
  • 22. ACID Transactions - Atomic • Either the whole process is done or none • If transaction successful – commit • System responsible for saving all changes to database • If transaction unsuccessful - abort • System responsible for rollback of all changes
  • 23. ACID Transactions - Consistent • Database constraints preserved • Enterprise rules limit occurrence of some real-world events • Customer cannot withdraw if balance less than minimum • These limitations are integrity constraints: assertions that must be satisfied by all database states (state invariants) • Isolated - User sees as if only one process executes at a time - two concurrent transactions will not see one another’s changes while “in flight”
  • 24. ACID Transactions - Durable • Effects of a process not lost if the system crashes • System ensures that once a transaction commits, its effect on the database state is not lost despite subsequent failures • Database stored redundantly on mass storage devices to protect against media failure • Related to Availability - extent to which a (possibly distributed) system can provide service despite failure – Non-stop DBMS (mirrored disks) – Recovery based DBMS (log)
  • 25. CAP Theorem • Brewer’s Theorem, presented in 2000 by Prof. Eric Brewer of the University of California, Berkeley • Consistency: Every node in the system contains the same data • Replicas never out of date • Availability - Every request to a non-failing node in the system returns a response – System available during software and hardware upgrades and node failures – Traditionally thought of as server/process available for five 9’s (99.999 %) – For a system with many nodes, at any point there’s a good chance that a node is down or that there is a network disruption among the nodes • The system needs to be resilient during network disruption
  • 27. CAP Theorem • Partition Tolerance - System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost) • A system can continue to operate in the presence of network partitions • At most two of these three properties supported for any shared-data system • Scaling out requires partition • It leaves either consistency or availability to choose from • In almost all cases, availability chosen over consistency
  • 28. Eventual Consistency • BASE (Basically Available, Soft-state, Eventual consistency) • BASE is an alternative to ACID • Weak consistency – stale data OK • When no updates occur for a long period of time, eventually all updates propagate through the system and all the nodes are consistent • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service • Availability first • Approximate answers
  • 29. Eventual Consistency • Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent • Conflict resolution – Read repair - The correction is done when a read finds an inconsistency. This slows down the read operation – Write repair - The correction takes place during a write operation, if an inconsistency has been found, slowing down the write operation – Asynchronous repair - The correction is not part of a read or write operation
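Read repair, as described above, can be sketched in a few lines. The sketch assumes each replica is a dict mapping a key to a `(timestamp, value)` pair and that the highest timestamp wins; both the layout and the function name `read_with_repair` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumed replica layout: key -> (timestamp, value).
Replica = Dict[str, Tuple[int, str]]


def read_with_repair(replicas: List[Replica], key: str) -> str:
    """Return the newest value for key and overwrite stale or missing copies."""
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        raise KeyError(key)
    newest = max(versions)  # tuples compare by timestamp first
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # the repair work that slows down the read
    return newest[1]
```

Write repair and asynchronous repair would apply the same "newest version wins" rule at a different point in time.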
  • 30. NoSQL Advantages • Cheap - open source • Easy to implement • Data replicated to multiple nodes (identical and fault-tolerant) • Partitioned – Down nodes easily replaced – No single point of failure • Easy to distribute • No predefined schema • Scale up and down • Relax the data consistency requirement (CAP)
  • 31. NoSQL Downsides • Joins • Group by • Order by • ACID transactions • SQL frustrating but still a powerful query language • Easy integration with other applications that support SQL
  • 32. Gossip Protocol & Hinted Handoffs • Most preferred communication protocol in a distributed environment • All the nodes talk to each other peer to peer • No global state • No single coordinator • If one node goes down and there is a Quorum, the load for the down node is shared by the others • Self managing system • If a new node joins, load is also redistributed • Requests coming to node F are handled by node C. When F becomes available, it gets this information from C • Self healing property
  • 33. Gossip Protocol & Hinted Handoffs
  • 34. HBASE
  • 35. HBase • An open-source, distributed, column-oriented database built on top of HDFS based on BigTable • A distributed data store scalable horizontally to 1,000’s of commodity servers and petabytes of indexed storage • Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS - Cloudstore) for scalability, fault tolerance and high availability
  • 36. HBase History • Started by Chad Walters and Jim • 2006.11 - Google releases paper on BigTable • 2007.2 - Initial HBase prototype created as Hadoop contribution • 2007.10 - First usable HBase • 2008.1 - Hadoop becomes Apache top-level project and HBase becomes subproject • 2008.10 - HBase 0.18, 0.19 released
  • 37. A Big Map • Row Key + Column Key + timestamp => value
      Row Key | Column Key | Timestamp     | Value
      1       | Info:name  | 1273516197868 | Sakis
      1       | Info:age   | 1273871824184 | 21
      1       | Info:sex   | 1273746281432 | Male
      2       | Info:name  | 1273863723227 | Themis
      2       | Info:name  | 1273973134238 | Andreas
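The "big map" idea can be sketched as a nested dictionary keyed by (row key, column key) and then by timestamp. This is a simplified, in-memory illustration; the names `table`, `put` and `get` are assumptions and do not correspond to the HBase client API.

```python
from typing import Dict, Optional, Tuple

# (row key, column key) -> {timestamp: value}
BigMap = Dict[Tuple[str, str], Dict[int, str]]

table: BigMap = {}


def put(row: str, column: str, timestamp: int, value: str) -> None:
    # Every write adds a new version instead of overwriting in place.
    table.setdefault((row, column), {})[timestamp] = value


def get(row: str, column: str, timestamp: Optional[int] = None) -> Optional[str]:
    """Return the value at the given timestamp, or the newest version if none is given."""
    versions = table.get((row, column))
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)
    return versions.get(timestamp)


# Two versions of the same cell, taken from the slide's example rows.
put("2", "Info:name", 1273863723227, "Themis")
put("2", "Info:name", 1273973134238, "Andreas")
```

With these two writes, a read without a timestamp returns the newer version ("Andreas"), while the older one stays reachable by its timestamp.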
  • 38. Why BigTable? • RDBMS performance good for transaction processing • Very large scale analytic processing solutions are commercial, expensive, and specialized • Very large scale analytic processing – Big queries – typically range or table scans – Big databases (100s of TB)
  • 39. Why BigTable? • Map reduce on Bigtable with optional cascading on top to support some relational algebras - a cost effective solution • Sharding not a solution to scale open source RDBMS platforms – Application specific – Labor intensive (re)partitioning
  • 40. HBase as Hadoop Component • Hbase built on top of HDFS • HBase files internally stored in HDFS
  • 41. HBase Data Model • Based on Google’s Bigtable model - Key-Value pairs • HBase schema consists of several tables • Each table consists of a set of column families – Columns not part of schema • Tables sorted by row key • Each entry: Row key, Column Family, Value, TimeStamp
  • 42. HBase Data Model • Dynamic Columns – Because column names are encoded inside the cells – Different cells can have different columns • Table schema only defines its column families – Each family has any number of columns – Each column consists of any number of versions – Columns only exist when inserted, NULLs are free – Columns within a family sorted and stored together • Everything except table names is byte[] • (Row, Family:Column, Timestamp) => Value
  • 43. Components • Region – A subset of a table rows, like horizontal range partitioning – Automatic • RegionServer (many slaves) – Manages data regions – Serves data for reads and writes (using a log) • Master – Responsible for coordinating the slaves – Assigns regions, detects failures – Admin functions
  • 44. HBase Members • Master – Monitors region servers – Load balancing for regions – Redirects clients to the correct region servers – Current SPOF – Assigns regions, detects failures of Region Servers – Controls admin functions • Slaves – Region Servers – Region - A subset of a table's rows – Serve data for reads and writes – Send heartbeat to Master
  • 45. HBase Regions • Each HTable (column family) is partitioned horizontally into regions – Regions are counterpart to HDFS blocks
  • 46. Regions • Contain an in-memory data store (MemStore) and a persistent data store (HFile) • All regions on a region server share a reference to the write-ahead log (WAL) which is used to store new data that hasn't yet been persisted to permanent storage and to recover from region server crashes • Each region holds a specific range of row keys, and when a region exceeds a configurable size, HBase automatically splits the region into two child regions, which is the key to scaling HBase
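The automatic split described above can be sketched as follows. This is a simplification under stated assumptions: a region is modeled as a dict of row key to value, and the threshold counts rows, whereas HBase itself uses a configurable size. The function name `split_region` is hypothetical.

```python
from typing import Dict, List


def split_region(rows: Dict[str, str], max_rows: int) -> List[Dict[str, str]]:
    """Split a region into two child regions at its middle row key once it grows too large."""
    if len(rows) <= max_rows:
        return [rows]
    keys = sorted(rows)
    mid = keys[len(keys) // 2]
    # Each child keeps a contiguous range of row keys.
    left = {k: rows[k] for k in keys if k < mid}
    right = {k: rows[k] for k in keys if k >= mid}
    return [left, right]
```

Because each child holds a contiguous key range, the two halves can then be served by different region servers.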
  • 49. Column Families • Each row has a Key • Each record is divided into Column Families • Each column family consists of one or more Columns
  • 51. HBase vs. HDFS • Both distributed systems that scale to hundreds or thousands of nodes • HDFS is good for batch processing (scans over big files) – Not good for record lookup – Not good for incremental addition of small batches – Not good for updates
  • 52. HBase vs. HDFS • HBase is designed to efficiently address the above points – Fast record lookup – Support for record-level insertion – Support for updates (not in place) • HBase updates are done by creating new versions of values
  • 53. HBase vs. HDFS • If the application needs neither random reads nor random writes, stick to HDFS
  • 55. When to Use HBase • Random read, write or both are required • Need to do many thousands of operations per second on multiple TB of data • Access patterns are well-known and simple
  • 56. Row key | Time Stamp | Column “contents:” | Column “anchor:”
      “com.apache.www” | t12 | “<html>…” |
                       | t11 | “<html>…” |
                       | t10 |           | “anchor:apache.com” = “APACHE”
      “com.cnn.www”    | t15 |           | “anchor:cnnsi.com” = “CNN”
                       | t13 |           | “anchor:my.look.ca” = “CNN.com”
                       | t6  | “<html>…” |
                       | t5  | “<html>…” |
                       | t3  | “<html>…” |
    (Column family named “contents”, column family named “anchor”, column named “apache.com”) • Key – Byte array – Serves as the primary key for the table – Indexed for fast lookup • Column Family – Has a name (string) – Contains one or more related columns • Column – Belongs to one column family – Included inside the row • familyName:columnName
  • 58. Data Model • Version number can be user-supplied – Versions do not even have to be inserted in increasing order – Version numbers are unique within each key • Table can be very sparse – Many cells are empty • Keys are indexed as the primary key • In the example, the “anchor” family of row “com.cnn.www” has two columns [cnnsi.com & my.look.ca]
  • 59. Physical Model • Each column family is stored in separate files (HFiles) • Key & version numbers are replicated with each column family • Empty cells are not stored • HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
  • 61. Zookeeper and HBase • HBase depends on Zookeeper • Zookeeper is used to manage master election and server availability • Provides distributed coordination primitives for setting up a cluster • A tool for building cluster management systems
  • 62. Connecting to HBase • Java client – get(byte [] row, byte [] column, long timestamp, int versions); • Non-Java clients – Thrift server hosting HBase client instance • Sample ruby, C++, & java (via thrift) clients – REST server hosts HBase client • TableInput / OutputFormat for MapReduce – HBase as MR source or sink • HBase Shell – JRuby IRB with “DSL” to add get, scan, and admin – ./bin/hbase shell YOUR_SCRIPT
  • 63. Apache Thrift • $hbase-daemon.sh start thrift • $hbase-daemon.sh stop thrift • High performance, scalable, cross-language serialization and RPC framework • Created at Facebook along with Cassandra • A cross-language, service-generation framework • Binary Protocol (like Google Protocol Buffers) • Compiles to: C++, Java, Python, PHP, Ruby, Perl, …
  • 64. HBase API • get(key) – Extract value given a key – get(row) • put(key, value) - Create or update the value given its key – put(row, Map<column, value>) • delete(key) -- Remove the key and its associated value • execute(key, operation, parameters) – operate on value given a key – List, Set, Map…
  • 65. Hive HBase Integration • Reasons to use Hive on HBase – Large data in HBase for use in a real-time environment, but never used for analysis – Give people who don’t code (business analysts) access to data in HBase that is usually only queried through MapReduce – When needing a more flexible storage solution, so that rows can be updated live by either a Hive job or an application, and the changes are seen immediately by the other • Reasons not to do it – Run SQL queries on HBase to answer live user requests (it’s still a MR job) – Hoping to see interoperability with other SQL analytics systems
  • 67. HBase - Benefits • Distributed storage • Table-like in data structure - Multi-dimensional map • High scalability, availability and performance • No real indexes • Automatic partitioning • Scale linearly and automatically with new nodes • Commodity hardware • Fault tolerance • Batch processing
  • 68. HBase Limitations • Tables have one primary index / key , the row key • Each row can have any number of columns • Table schema only defines column families (column family can have any number of columns) • Each cell value has a timestamp • No join operators • Scans and queries can select a subset of available columns using a wildcard
  • 69. HBase Limitations • Lookups – Fast lookup using row key and optional timestamp – Full table scan – Range scan from region start to end • Limited atomicity and transaction support – Supports multiple batched mutations of single rows only – Data is unstructured and un-typed • No access via SQL – Programmatic access - Java, Thrift (Ruby, PHP, Python, Perl, C++, ..), HBase Shell
  • 70. REDIS
  • 71. Redis NoSQL Database • Redis is an open source, advanced key-value data store • Often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets • Redis works with an in-memory dataset • It is possible to persist dataset either by – dumping the dataset to disk every once in a while – or by appending each command to a log
  • 72. Redis NoSQL Database • Distributed data structure server • Consistent hashing at client • Non-blocking I/O, single threaded • Values are binary safe strings: byte strings • String: Key/Value pair, set/get. O(1) for many string operations • Lists: lpush, lpop, rpush, rpop - use as a stack or queue. O(1)
  • 73. Redis NoSQL Database • Publisher/Subscriber model • Set: collection of unique elements - add, pop, union, intersection - set operations • Sorted set: unique elements sorted by scores. O(log n). Range operations • Hash: multiple key/value pairs – HMSET user:1 username foo password bar age 30 – HGET user:1 age
  • 75. Redis Keys • Keys are binary safe - it is possible to use any binary sequence as a key • The empty string is also a valid key • Too long keys are not a good idea • Too short keys are often also not a good idea ("u:1000:pwd" versus "user:1000:password") • Nice idea is to use some kind of schema, like: "object-type:id:field"
  • 76. Redis Data Types • Redis is often referred to as a data structure server since keys can contain – Strings – Lists – Sets – Hashes – Sorted Sets
  • 77. Redis Strings • Most basic kind of Redis value • Binary safe - can contain any kind of data, for instance a JPEG image or a serialized Ruby object • Max 512 Megabytes in length • Can be used as atomic counters using commands in the INCR family • Can be appended with the APPEND command
  • 78. Redis Strings - Example
  • 79. Redis Lists • Lists of strings, sorted by insertion order • Add elements to a Redis List pushing new elements on the head (on the left) or on the tail (on the right) of the list • Max length: (2^32 - 1) elements • Model a timeline in a social network, using LPUSH to add new elements, and using LRANGE in order to retrieve recent items • Use LPUSH together with LTRIM to create a list that never exceeds a given number of elements
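The LPUSH plus LTRIM pattern for a capped timeline can be illustrated without a Redis server. The class below is a pure-Python stand-in that imitates the semantics of the three commands on a single list; `CappedList` and its method names are hypothetical and are not part of any Redis client library.

```python
from typing import List


class CappedList:
    """Pure-Python stand-in for LPUSH + LTRIM + LRANGE on one Redis list."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self.items: List[str] = []

    def lpush(self, value: str) -> None:
        self.items.insert(0, value)      # LPUSH: add at the head (left)
        del self.items[self.max_len:]    # LTRIM 0 max_len-1: drop the oldest items

    def lrange(self, start: int, stop: int) -> List[str]:
        # LRANGE treats stop as inclusive; negative indexes are not handled here.
        return self.items[start:stop + 1]
```

Pushing five posts into a `CappedList(3)` leaves only the three most recent ones, newest first, which is the behavior a "latest items" timeline needs.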
  • 80. Redis Lists - Example
  • 81. Redis Sorted Sets • Every member of a Sorted Set is associated with a score that is used to keep the sorted set ordered, from the smallest to the greatest score • You can do a lot of tasks with great performance that are really hard to model in other kinds of databases • Probably the most advanced Redis data type
  • 82. Redis Hashes • Map between string fields and string values • Perfect data type to represent objects – HMSET user:1000 username antirez password P1pp0 age 34 – HGETALL user:1000 – HSET user:1000 password 12345 – HGETALL user:1000
  • 83. Redis Operations • It is possible to run atomic operations on data types: • Appending to a string • Incrementing the value in a hash • Pushing to a list • Computing set intersection, union and difference • Getting the member with highest ranking in a sorted set
  • 85. Cassandra • Structured Storage System over a P2P Network • Was created to power the Facebook Inbox Search • Facebook open-sourced Cassandra in 2008, and it then became an Apache Incubator project • In 2010, Cassandra graduated to a top-level project; regular updates and releases followed
  • 86. Cassandra • High availability • Designed to handle large amounts of data across multiple servers • Eventual consistency - trades off strong consistency in favor of high availability • Incremental scalability • Optimistic Replication
  • 87. Cassandra • “Knobs” to tune tradeoffs between consistency, durability and latency • Low total cost of ownership • Minimal administration • Tunable consistency • Decentralized - No single point of failure • Writes faster than reads • Uses consistent hashing (logical partitioning) when clustered.
  • 88. Cassandra • Hinted handoffs • Peer to peer routing (ring) • Thrift API • Multi data center support • Mimics traditional relational database systems, but with triggers and lightweight transactions • Raw, simple data structures
  • 89. Features • Emphasis on performance over analysis – Still supports analysis tools like Hadoop • Organization – Rows are organized into tables – First component of a table’s primary key is the partition key – Rows clustered by the remaining columns of the key – Columns may be indexed separately from the primary key – Tables may be created, dropped, altered at runtime without blocking queries
  • 90. Features • Language – CQL (Cassandra Query Language) introduced, similar to SQL (flattened learning curve) • Peer-to-Peer cluster – Decentralized design • Each node has the same role – No single point of failure • Avoids issues of master-slave DBMS’s – No bottlenecking
  • 91. Comparisons
                          | Apache Cassandra | Google Big Table | Amazon DynamoDB
      Storage Type        | Column | Column | Key-Value
      Best Use            | Write often, read less | Designed for large scalability | Large database solution
      Concurrency Control | MVCC | Locks | ACID
      Characteristics     | High Availability, Partition Tolerance, Persistence | Consistency, High Availability, Partition Tolerance, Persistence | Consistency, High Availability
    • Key Point – Cassandra offers a healthy cross between BigTable and Dynamo.
  • 92. Cassandra History Google Bigtable (2006) • consistency model: strong • data model: sparse map • clones: hbase, hypertable Amazon Dynamo (2007) • O(1) dht • consistency model: client tune-able • clones: riak, voldemort Cassandra ~= Bigtable + Dynamo
  • 93. Architecture Overview • Cassandra was designed with the understanding that system/ hardware failures can and do occur • Peer-to-peer, distributed system • All nodes are the same • Data partitioned among all nodes in the cluster • Custom data replication to ensure fault tolerance • Read/Write-anywhere design
  • 95. Architecture Overview • Google BigTable - data model – Column Families – Memtables – SSTables • Amazon Dynamo - distributed systems technologies – Consistent hashing – Partitioning – Replication – One-hop routing
  • 97. Transparent Elasticity • Nodes can be added and removed from Cassandra online, with no downtime being experienced. [Figure: a 6-node ring grows to a 12-node ring]
  • 98. Transparent Scalability • Addition of Cassandra nodes increases performance linearly and the ability to manage TBs to PBs of data [Figure: 6-node ring, performance throughput = N; 12-node ring, performance throughput = N x 2]
  • 99. High Availability • Cassandra has no single point of failure due to peer-to-peer architecture
  • 100. Multi-Geography - Zone Aware • Cassandra allows a single logical database to span 1-N datacenters that are geographically dispersed • Also supports a hybrid on-premise/Cloud implementation
  • 101. Partitioning • Nodes are logically structured in a ring topology • Hashed value of the key associated with a data partition is used to assign it to a node in the ring • Hash values wrap around after a certain value to support the ring structure • Lightly loaded nodes move position to alleviate highly loaded nodes
  • 103. Data Redundancy • Cassandra allows for customizable data redundancy so that data is completely protected • Supports rack awareness (data can be replicated between different racks to guard against machine/rack failures) • Uses Zookeeper to choose a leader which tells nodes the range they are replicas for
  • 105. Operations • A client issues a write request to a random node in the Cassandra cluster • Partitioner determines the nodes responsible for the data • Locally, write operations are logged and then applied to an in-memory version • Commit log is stored on a dedicated disk local to the machine • Relies on local file system for data persistency
  • 106. Operations • Write operations happen in 2 steps – Write to commit log on the local disk of the node – Update in-memory data structure – Why 2 steps, or any preference to the order of execution? • Read operation – Looks up the in-memory data structure first before looking up files on disk – Uses a Bloom Filter (summarization of keys in a file, stored in memory) to avoid looking up files that do not contain the key
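How a Bloom filter lets a read skip files that cannot contain a key can be sketched as below. This is a minimal illustration, not Cassandra's implementation: the bit-array size, the number of hashes and the use of salted SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import List


class BloomFilter:
    """Compact in-memory summary of the keys stored in one on-disk file."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits: List[bool] = [False] * size_bits

    def _positions(self, key: str) -> List[int]:
        # Derive several bit positions by salting one hash function with an index.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.size
            for i in range(self.num_hashes)
        ]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key: str) -> bool:
        # False means the key is definitely absent; True means it may be present.
        return all(self.bits[p] for p in self._positions(key))
```

A read path would keep one filter per file and open a file only when `might_contain` returns True; false positives cost an unnecessary disk lookup but never a wrong answer.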
  • 107. Consistency • Read Consistency – Number of nodes that must agree before read request returns – ONE to ALL • Write Consistency – Number of nodes that must be updated before a write is considered successful – ANY to ALL – At ANY, a hinted handoff is all that is needed to return • QUORUM – Commonly used middle-ground consistency level – Defined as (replication_factor / 2) + 1
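The QUORUM formula uses integer division, which a short worked example makes explicit. The second helper encodes the commonly cited rule that reads see the latest write when read and write replica counts overlap (R + W > N); that rule is not stated on the slide and is included here as an assumption for illustration. Both function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """(replication_factor / 2) + 1, using integer division."""
    return replication_factor // 2 + 1


def replicas_overlap(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    # When R + W > N, every read set shares at least one replica with every write set.
    return read_nodes + write_nodes > replication_factor
```

With a replication factor of 3, a quorum is 2 nodes, so quorum reads combined with quorum writes overlap (2 + 2 > 3), while ONE/ONE does not (1 + 1 <= 3).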
  • 108. Hinted Handoff Write • Write intended for a node that is offline • An online node, processing the request, makes a note to carry out the write once the node comes back online
  • 109. Write Properties • No locks in the critical path • Sequential disk access • Behaves like a write back Cache • Append support without read ahead • Atomicity guarantee for a key • Always Writable – accept writes during failure scenarios
  • 110. Write Operations • Stages – Logging data in the commit log – Writing data to the memtable – Flushing data from the memtable – Storing data on disk in SSTables • Commit Log – First place a write is recorded – Crash recovery mechanism – Write not successful until recorded in commit log – Once recorded in commit log, data is written to Memtable
  • 111. Write Operations • Memtable – Data structure in memory – Once memtable size reaches a threshold, it is flushed (appended) to SSTable – Several may exist at once (1 current, any others waiting to be flushed) – First place read operations look for data • SSTable – Kept on disk – Immutable once written – Periodically compacted for performance
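The write stages above (commit log, memtable, flush to SSTable) and the read order can be sketched with in-memory stand-ins. Everything here is a simplification: the class `Node`, the row-count flush threshold and the use of dicts for SSTables are assumptions for illustration, and nothing is actually written to disk.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # append-only, used for crash recovery
        self.memtable: Dict[str, str] = {}           # in-memory and mutable
        self.sstables: List[Dict[str, str]] = []     # stand-ins for immutable on-disk files
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))         # step 1: record in the commit log
        self.memtable[key] = value                   # step 2: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Store the memtable contents, sorted by key, as a new SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:                     # the memtable is checked first
            return self.memtable[key]
        for sstable in reversed(self.sstables):      # then SSTables, newest first
            if key in sstable:
                return sstable[key]
        return None
```

Logging before updating memory means an acknowledged write can be replayed after a crash, which is one answer to the "why 2 steps" question raised earlier in the deck.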
  • 113. Read Repair • On read, nodes are queried until the number of nodes which respond with the most recent value meet a specified consistency level from ONE to ALL • If the consistency level is not met, nodes are updated with the most recent value which is then returned • If the consistency level is met, the value is returned and any nodes that reported old values are then updated
  • 115. Delete Operations • Tombstones – On delete request, records are marked for deletion – Similar to Recycle Bin – Data is actually deleted on major compaction or configurable timer
  • 116. Gossip Protocols • Used to discover location and state information about the other nodes participating in a Cassandra cluster • Network communication protocols inspired by real life rumor spreading • Periodic, pairwise, inter-node communication • Low frequency communication ensures low cost
  • 117. Gossip Protocols • Random selection of peers • Example – Node A wishes to search for a pattern in data – Round 1 – Node A searches locally and then gossips with node B – Round 2 – Nodes A, B gossip with C and D – Round 3 – Nodes A, B, C and D gossip with 4 other nodes …… • Round by round doubling makes the protocol very robust
  • 118. Failure Detection • Gossip process tracks heartbeats from other nodes both directly and indirectly • Node fail state is given by variable Φ – tells how likely a node might fail (suspicion level) instead of simple binary value (up/down) • This type of system is known as Accrual Failure Detector • Takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate • A threshold for Φ is used to decide if a node is dead – If node is correct, phi will be constant set by application. – Generally Φ(t) = 0
  • 119. Failure Detection • Uses Scuttlebutt (a Gossip protocol) to manage nodes • Uses gossip for node membership and to transmit system control state • Lightweight with mathematically provable properties • State disseminated in O(log N) rounds where N is the number of nodes in the cluster • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to • A member merges the received list with its own list
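The list merge performed on each gossip exchange can be sketched as "keep the highest heartbeat counter seen per member". This is an assumption-level illustration: the function name `merge_heartbeats` and the dict layout (member name to counter) are hypothetical, and real gossip state carries more than a single counter.

```python
from typing import Dict


def merge_heartbeats(local: Dict[str, int], received: Dict[str, int]) -> Dict[str, int]:
    """Merge a received membership list into the local one, keeping the freshest counters."""
    merged = dict(local)
    for member, counter in received.items():
        # A higher counter means more recent news about that member.
        if counter > merged.get(member, -1):
            merged[member] = counter
    return merged
```

Members whose counter stops increasing across merges are the ones a failure detector would start to suspect.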
  • 120. Accrual Failure Detector • Valuable for system management, replication, load balancing etc • Node Fail state is given by variable ‘phi’ which tells how likely a node might fail (suspicion level) instead of simple binary value (up/down) • Defined as a failure detector that outputs a value, PHI, associated with each process. • Also known as Adaptive Failure detectors - designed to adapt to changing network conditions
  • 121. Accrual Failure Detector • The value output, PHI, represents a suspicion level • Applications set an appropriate threshold, trigger suspicions and perform appropriate actions • In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5
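One way to turn heartbeat history into a suspicion level is sketched below. It assumes heartbeat inter-arrival times follow an exponential distribution, so the probability that the next heartbeat arrives later than t is exp(-t / mean) and PHI is the negative base-10 logarithm of that probability. The distribution choice and the function name `phi` are assumptions for illustration, not a statement of how any specific release computes it.

```python
import math
from typing import Sequence


def phi(time_since_last_heartbeat: float, intervals: Sequence[float]) -> float:
    """Suspicion level under an assumed exponential model of heartbeat intervals."""
    mean = sum(intervals) / len(intervals)  # requires at least one observed interval
    # P(later than t) = exp(-t / mean), so -log10(P) = t / (mean * ln 10).
    return time_since_last_heartbeat / (mean * math.log(10))
```

Under this model PHI grows linearly with silence: with a mean interval of 1 second, a threshold of 5 is crossed after about 11.5 seconds without a heartbeat, since 5 x ln 10 is roughly 11.5.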
  • 122. Performance Benchmark • Loading of data - limited by network bandwidth • Read performance for Inbox Search in production
              | Search Interactions | Term Search
      Min     | 7.69 ms             | 7.78 ms
      Median  | 15.69 ms            | 18.27 ms
      Average | 26.13 ms            | 44.41 ms
  • 124. Data Model • Column: smallest data element, a tuple with a name and a value • Querying (:Rockets, '1') might return: { 'name' => 'Rocket-Powered Roller Skates', 'toon' => 'Ready Set Zoom', 'inventoryQty' => '5', 'productUrl' => 'rockets1.gif' }
  • 125. Data Model • ColumnFamily - A single structure used to group both the Columns and SuperColumns. Called a ColumnFamily (think table), it has two types, Standard & Super. – Column families must be defined at startup • Key - the permanent name of the record • Keyspace - the outermost level of organization. This is usually the name of the application. For example, 'Acme' (think database name)
  • 126. Data Model • Optional super column: a named list. A super column contains standard columns, stored in recent order • Suppose OtherProducts has inventory in categories • Querying (:OtherProducts, '174927') might return – {'OtherProducts' => {'name' => 'Acme Instant Girl', ..}, 'foods' => {...}, 'martian' => {...}, 'animals' => {...}} • In the example, foods, martian, and animals are all super column names • They are defined on the fly, and there can be any number of them per row. :OtherProducts would be the name of the super column family
  • 127. Data Model • Columns and SuperColumns are both tuples with a name & value. The key difference is that a standard Column’s value is a “string” and in a SuperColumn the value is a Map of Columns • Columns are always sorted by their name. Sorting supports: – BytesType – UTF8Type – LexicalUUIDType – TimeUUIDType – AsciiType – LongType • Each of these options treats the Columns' name as a different data type
  • 128. Tunable Consistency • Cassandra has programmable read/write consistency • Any - Ensure that the write is written to at least 1 node • One - Ensure that the write is written to at least 1 node’s commit log and memory table before the acknowledgement is sent to the client • Quorum - Ensure that the write goes to (replication_factor / 2) + 1 nodes • All - Ensure that writes go to all nodes. An unresponsive node would fail the write
  • 129. Consistent Hashing • [Figure: hash ring with nodes A, H, D, B, M, V, S, R, C] • Partition using consistent hashing – Keys hash to a point on a fixed circular space – The ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots • Nodes take positions on the circle • A, B, and D exist – B is responsible for the AB range – D is responsible for the BD range – A is responsible for the DA range • C joins – B and D split ranges – C gets the BC range from D
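A minimal sketch of the ring described on this slide, assuming MD5 as the hash function and one position per node (both are choices made for the illustration; the slide does not specify them). `HashRing` is a hypothetical class. Each key is owned by the first node at or after the key's position, wrapping around at the end of the ring.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: one position per node, no virtual nodes."""

    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions of the nodes
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(name: str) -> int:
        # Assumption: MD5 of the name, read as a big integer, is the position.
        return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

    def add(self, node: str) -> None:
        pos = self._position(node)
        bisect.insort(self._points, pos)
        self._owner[pos] = node

    def lookup(self, key: str) -> str:
        # First node clockwise from the key, wrapping past the last node.
        pos = self._position(key)
        idx = bisect.bisect_left(self._points, pos) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["A", "B", "D"])
    keys = [f"key{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.add("C")
    moved = [k for k in keys if ring.lookup(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved, all of them to C")
```

When C joins, only keys in the range that C takes over change owner; every other key stays where it was, which is the property that makes adding nodes cheap.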
  • 130. Key-Value Model • Cassandra is a column-oriented NoSQL system • Column families: sets of key-value pairs – a column family is like a table and its key-value pairs are like rows (using the relational database analogy) • A row is a collection of columns labeled with a name
  • 131. Cassandra Row • The value of a row is itself a sequence of key-value pairs • Such nested key-value pairs are columns • Key = column name • A row must contain at least 1 column
  • 133. Column Names Storing Values • Key: user ID • Column names store tweet ID values • Values of all column names are set to "-" (empty byte array) as they are not used
  • 134. Key Space • A Key Space is a group of column families. It is only a logical grouping of column families and provides an isolated scope for names
  • 135. Comparison with RDBMS • With an RDBMS, a normalized data model is created without considering the exact queries – SQL can return almost anything through joins • With C*, the data model is designed for specific queries – the schema is adjusted as new queries are introduced • C*: NO joins, relationships, or foreign keys – a separate table is used per query – data required by multiple tables is denormalized across those tables
  • 136. Compaction • Compaction runs periodically to merge multiple SSTables – Reclaims space – Creates new index – Merges keys – Combines columns – Discards tombstones – Improves performance by minimizing disk seeks • Types – Major – Read-only
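The merge step on this slide can be sketched as follows. This is a simplified model, not Cassandra's implementation: each SSTable is assumed to be a dict mapping a key to a `(timestamp, value)` pair, the newest timestamp wins, and tombstones are dropped immediately (a real system keeps them for a grace period first). `compact` and `TOMBSTONE` are hypothetical names.

```python
TOMBSTONE = object()  # marker stored in place of a deleted value


def compact(sstables):
    """Merge several SSTables into one.

    Each SSTable is modelled as {key: (timestamp, value)}. For every key
    the entry with the newest timestamp is kept; keys whose newest entry
    is a tombstone are discarded from the result.
    """
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # SSTables are sorted by key, so the output is built in key order.
    return {
        key: entry
        for key, entry in sorted(merged.items())
        if entry[1] is not TOMBSTONE
    }


if __name__ == "__main__":
    older = {"a": (1, "x"), "b": (1, "y")}
    newer = {"a": (2, "z"), "b": (3, TOMBSTONE), "c": (2, "w")}
    print(compact([older, newer]))
```

After compaction a read touches one table instead of several, which is where the reduction in disk seeks comes from.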
  • 137. Anti-Entropy • Replica synchronization mechanism • Ensures synchronization of data across nodes • Compares data checksums against neighboring nodes • Uses Merkle trees (hash trees) – Snapshot of data sent to neighboring nodes – Created and broadcast on every major compaction – If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, the snapshots are compared and the data is synced
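A minimal sketch of how a hash tree narrows down which data differs between two replicas. It assumes both replicas hold the same number of rows in the same order and uses SHA-256; `tree_hash` and `diff` are hypothetical helpers. For brevity the hashes are recomputed on every call, whereas a real implementation would build the tree once and exchange it.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def tree_hash(rows, lo, hi):
    """Hash of the subtree covering rows[lo:hi] (requires hi > lo)."""
    if hi - lo == 1:
        return _h(rows[lo])
    mid = (lo + hi) // 2
    return _h(tree_hash(rows, lo, mid) + tree_hash(rows, mid, hi))


def diff(a, b, lo=0, hi=None):
    """Indexes of rows that differ between two equally sized replicas.

    Only subtrees whose hashes disagree are descended into, so identical
    ranges are skipped after a single hash comparison.
    """
    if hi is None:
        hi = len(a)
    if hi <= lo:
        return []
    if tree_hash(a, lo, hi) == tree_hash(b, lo, hi):
        return []
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return diff(a, b, lo, mid) + diff(a, b, mid, hi)


if __name__ == "__main__":
    replica_a = [b"row0", b"row1", b"row2", b"row3", b"row4"]
    replica_b = list(replica_a)
    replica_b[3] = b"stale"
    print(diff(replica_a, replica_b))
```

Only the rows reported by `diff` need to be streamed between the nodes; matching ranges cost nothing beyond the hash exchange.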
  • 139. Cassandra Query Language - CQL • Creating a keyspace (a namespace of tables): CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; • To use the namespace: USE demo;
  • 140. CQL – Create Table • CREATE TABLE users ( email varchar, bio varchar, birthday timestamp, active boolean, PRIMARY KEY (email) ); • CREATE TABLE tweets ( email varchar, time_posted timestamp, tweet varchar, PRIMARY KEY (email, time_posted) );
  • 141. CQL • Insert – INSERT INTO users (email, bio, birthday, active) VALUES ('Tom.Stok@btx.com', 'StarTeammate', 516513612220, true); – Timestamp fields are specified in milliseconds since epoch • Query tables – A SELECT expression reads one or more records from a Cassandra column family and returns a result set of rows – SELECT * FROM users; – SELECT email FROM users WHERE active = true; (filtering on a non-key column such as active needs a secondary index or ALLOW FILTERING)
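Since CQL timestamps are written as milliseconds since the Unix epoch, a small conversion helper is handy when preparing values such as the `birthday` above. This is a sketch using only the Python standard library; the function names are hypothetical and the datetimes are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Milliseconds since the Unix epoch for a timezone-aware datetime."""
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding of the milliseconds.
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000


def from_epoch_millis(millis: int) -> datetime:
    """UTC datetime for a millisecond epoch timestamp."""
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    birthday = from_epoch_millis(516513612220)  # value used in the INSERT
    print(birthday.isoformat())
    print(to_epoch_millis(birthday))
```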
  • 142. Cassandra Advantages • Perfect for time-series data • High performance • Decentralization • Nearly linear scalability • Replication support • No single points of failure • MapReduce support
  • 143. Cassandra Weaknesses • No referential integrity – no concept of JOIN • Querying options for retrieving data are limited • Sorting data is a design decision – no GROUP BY • No support for atomic operations – if an operation fails, changes can still occur • First think about queries, then about the data model
  • 144. Key Points • Cassandra is designed as a distributed database management system – use it when you have a lot of data spread across multiple servers • Cassandra write performance is always excellent, but read performance depends on write patterns – it is important to spend enough time designing a proper schema around the query pattern • Having a high-level understanding of some internals is a plus – it helps in designing a strong application built atop Cassandra
  • 145. Hector – Java API for Cassandra • Sits on top of Thrift • Load balancing • JMX monitoring • Connection pooling • Failover • JNDI integration with application servers • Additional methods on top of the standard get, update, delete methods • Under discussion – hooks into Spring declarative transactions
  • 146. Memcached Database • Key-Value Store • Very easy to set up and use • Consistent hashing • Scales very well • In-memory caching, no persistence • LRU eviction policy • O(1) to set/get/delete • Atomic operations set/get/delete • No iterators (iterating over keys is very difficult)
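The LRU eviction policy and O(1) set/get/delete listed on this slide can be illustrated with a small in-memory cache. This is a sketch of the policy only, not of Memcached itself: `LRUCache` is a hypothetical class, and capacity is counted in items, whereas Memcached limits memory.

```python
from collections import OrderedDict


class LRUCache:
    """In-memory key-value cache that evicts the least recently used item."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read marks the key as recently used
        return self._items[key]

    def set(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def delete(self, key) -> bool:
        if key in self._items:
            del self._items[key]
            return True
        return False


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")     # "a" is now more recent than "b"
    cache.set("c", 3)  # evicts "b"
    print(cache.get("a"), cache.get("b"), cache.get("c"))
```

All three operations are dictionary operations plus a constant-time reordering, which matches the O(1) behaviour the slide mentions.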
  • 148. MongoDB • Publicly released in 2009 • Allows data to persist in a nested state • Query that nested data in an ad hoc fashion • Enforces no schema • Documents can optionally contain fields or types that no other document in the collection contains • NoSQL
  • 150. MongoDB • Document-oriented database • Uses BSON format – Binary JSON • An instance may have zero or more databases • A database may have zero or more collections • A collection may have zero or more documents • A document may have one or more fields • Indexes function like RDBMS counterparts
  • 151. MongoDB • Data types: bool, int, double, string, object (BSON), oid, array, null, date • Databases and collections are created automatically • Language drivers • Capped collections are fixed-size collections: buffers, very fast, FIFO, good for logs. No indexes • Object IDs are generated by the client: 12 bytes of packed data - 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter
  • 152. MongoDB • It is possible to refer to documents in other collections, but it is more efficient to embed documents • Replication is easy to set up. Reads can be served from slaves • Supports aggregation – Map Reduce with JavaScript • Indexes are B-trees. IDs are always indexed
  • 153. MongoDB • Updates are atomic. Low-contention locks • Querying Mongo is done with a document – Lazy, returns a cursor – Reducible to SQL: select, insert, update, limit, sort – upsert (either inserts or updates) – Operators - $ne, $and, $or, $lt, $gt, $inc (a negative value decrements) • Repository pattern for easy development
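The 12-byte object ID layout on this slide (4-byte time, 3-byte machine, 2-byte pid, 3-byte counter) can be sketched with plain byte packing. The helper names are hypothetical and big-endian byte order for every field is an assumption of this sketch; it does not use or reproduce any MongoDB driver. Newer MongoDB versions replace the machine and pid fields with a random value, so treat this as an illustration of the layout described here.

```python
import struct


def pack_object_id(timestamp: int, machine: int, pid: int, counter: int) -> bytes:
    """Pack the four fields into 12 bytes: 4 + 3 + 2 + 3."""
    return (
        struct.pack(">I", timestamp)     # 4-byte seconds since epoch
        + machine.to_bytes(3, "big")     # 3-byte machine identifier
        + struct.pack(">H", pid)         # 2-byte process id
        + counter.to_bytes(3, "big")     # 3-byte counter
    )


def unpack_object_id(oid: bytes):
    """Split a 12-byte id back into (timestamp, machine, pid, counter)."""
    if len(oid) != 12:
        raise ValueError("an object id is exactly 12 bytes")
    (timestamp,) = struct.unpack(">I", oid[0:4])
    machine = int.from_bytes(oid[4:7], "big")
    (pid,) = struct.unpack(">H", oid[7:9])
    counter = int.from_bytes(oid[9:12], "big")
    return timestamp, machine, pid, counter


if __name__ == "__main__":
    oid = pack_object_id(1400000000, 0xABCDEF, 4242, 7)
    print(oid.hex(), unpack_object_id(oid))
```

Because the timestamp comes first, IDs generated later sort after earlier ones at second granularity, and clients can create IDs without contacting the server.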
  • 154. MongoDB • Full Index Support • Replication & High Availability • Auto-Sharding • Querying • Fast In-Place Updates • Map/Reduce
  • 156. Comparison (RDBMS -> MongoDB) • Database -> Database • Table, View -> Collection • Row -> Document (JSON, BSON) • Column -> Field • Index -> Index • Join -> Embedded Document • Foreign Key -> Reference • Partition -> Shard
  • 157. CRUD • Create – db.collection.insert( <document> ) – db.collection.save( <document> ) – db.collection.update( <query>, <update>, { upsert: true } ) • Read – db.collection.find( <query>, <projection> ) – db.collection.findOne( <query>, <projection> ) • Update – db.collection.update( <query>, <update>, <options> ) • Delete – db.collection.remove( <query>, <justOne> )
  • 158. Commands • Create a doc and save it into a collection – p = {firstname: "Dave", lastname: "Ho"} – db.person.save(p) – db.person.insert({firstname: "Ricky", lastname: "Ho"}) • Show all docs within a collection – db.person.find() • Iterate over the result using a cursor – var c = db.person.find() – p1 = c.next() – p2 = c.next()
  • 159. Commands • Query – p3 = db.person.findOne({lastname: "Ho"}) • Return a subset of fields (i.e. a projection) – db.person.find({lastname: "Ho"}, {firstname: true}) • Delete some records – db.person.remove({firstname: "Ricky"}) • Build an index for a collection – db.person.ensureIndex({firstname: 1})
  • 160. Commands • Show all existing indexes – db.person.getIndexes() • Remove an index – db.person.dropIndex({firstname: 1}) • An index can be built on a path of the doc – db.person.ensureIndex({"address.city": 1}) • A composite key can be used to build an index – db.person.ensureIndex({lastname: 1, firstname: 1})
  • 161. Commands • Data update and transaction: to update an existing doc, we can do the following – var p1 = db.person.findOne({lastname: "Ho"}) – p1["address"] = "San Jose" – db.person.save(p1) • Do the same in one command (the trailing arguments are upsert = false, multi = true) – db.person.update({lastname: "Ho"}, {$set: {address: "San Jose"}}, false, true)
  • 162. MongoDB Sharding • Config servers: keep the mapping of data to shards • Mongos: routing servers • Mongod: master-slave replicas
  • 163. References • "NoSQL - Your Ultimate Guide to the Non-Relational Universe!" http://nosql-database.org/links.html • "NoSQL (RDBMS)" http://en.wikipedia.org/wiki/NoSQL • "Towards Robust Distributed Systems", PODC Keynote, July 19, 2000, Dr. Eric A. Brewer, Professor, UC Berkeley, Co-Founder & Chief Scientist, Inktomi www.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • http://planetcassandra.org/functional-use-cases/ • http://marsmedia.info/en/cassandra-pros-cons-and-model.php • http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra • http://wiki.apache.org/cassandra/CassandraLimitations • "Brewer's CAP Theorem", Julian Browne, January 11, 2009 http://www.julianbrowne.com/article/viewer/brewers-cap-theorem • "Scalable SQL", Michael Rys, ACM Queue, April 19, 2011 http://queue.acm.org/detail.cfm?id=1971597
  • 164. Thank You • Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode