2. Contents
• NoSQL
– Horizontal Scalability
– CAP Theorem
– Gossip Protocol & Hinted Handoffs
• HBase
– HBase Data Model
– HBase Regions
– Column Families
– HBase API
• Redis NoSQL Database
• Cassandra
– Architecture Overview
– Partitioning
– Write Properties
– Gossip Protocols
– Accrual Failure Detector
– Data Model
– Tunable Consistency
• CQL
• Memcached Database
• MongoDB
• References
3. NoSQL
• Not Only SQL
• Class of non-relational data storage systems
• Usually no fixed table schema
• No concept of joins
• Relax one or more of the ACID properties
4. NoSQL
• Column Store Type
– Each storage block contains data from only one column
– More efficient than row (or document) store if
• Multiple row/record/documents are inserted at the same time so updates of column blocks can
be aggregated
• Retrievals access only some of the columns in a row/record/document
• Document Store Type – stores documents made up of tagged elements
• Key-Value Store Type – Hash table of keys
• Graph Database Type
5. Categories
• Key-Value Store
– Big HashTable of keys & values
• Products
– Memcached
– Membase
– Redis – Data structure server
– Riak – Amazon Dynamo based
– Amazon Dynamo
6. Categories
• Schema-less - column-based, document-based, graph-based
• Document-based Store - Stores documents made up of tagged elements (CouchDB, MongoDB)
• Column-based Store - Each storage block contains data from only one column
– Google BigTable
– Cassandra
– HBase
• Graph-based - A network database that uses edges and nodes to represent and store data (Neo4J)
8. RDBMS Scaling - Master-Slave
• All writes are written to the master
• All reads performed against the replicated slave databases
• Critical reads may be incorrect as writes may not have been propagated
down
• Large data sets can pose problems as master needs to duplicate data to
slaves
9. RDBMS Scaling
• Partition or Sharding
– Scales well for both reads and writes
– Not transparent, application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– Requires de-normalizing data
• In-memory databases
10. RDBMS Limitations
• One size does not fit all
• Impedance mismatch
• Rigid schema design
• Harder to scale
• Replication
• Difficult to join across multiple nodes
• Can not easily handle data growth
• Need a DBA
11. RDBMS Limitations
• Many issues while scaling up for
massive datasets
• Not designed for distributed
computing
• Expensive specialized hardware
• Multi-node databases considered
as solutions - Known as ‘scaling
out’ or ‘horizontal scaling’
– Master-slave
– Sharding
12. Horizontal Scalability
• Scale out
• Easily add servers to existing system - Elastically scalable
– Bugs, hardware errors, things fail all the time
– Cost efficient
• Non sharing
• Use commodity/cheap hardware
• Heterogeneous systems
13. Horizontal Scalability
• Controlled concurrency (avoids locks)
• Service Oriented Architecture
– Local states
– Decentralized to reduce bottlenecks
– Avoids single point of failures
• Asynchronous
• All nodes are symmetric
15. NoSQL Database Features
• Large data volumes
• Scalable replication and distribution
– Potentially thousands of machines
– Potentially distributed around the world
• Queries need to return answers quickly
• CAP Theorem
• Open source development
• Key/Value
16. NoSQL Database Features
• Mostly query, few updates
• Asynchronous Inserts & Updates
• Schema-less
• ACID transaction properties not needed – BASE
• Schema-Less Stores
– Richer model than key/value pairs
– Eventual consistency
– Distributed
– Excellent performance and scalability
– Downside - typically no ACID transactions or joins
17. Key-Value Store
• A simple Hash table
• Read and write values using a key
– Get(key), returns the value associated with the provided key
– Put(key, value), associates the value with the key
– Multi-get(key1, key2, .., keyN), returns the list of values associated with the
list of keys
– Delete(key), removes the entry for the key from the data store
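The four operations above can be sketched with an ordinary in-memory hash table. This is only an illustration: the class name `KeyValueStore` and the method names are hypothetical, and a real key-value store adds persistence, replication and distribution on top of this interface.

```python
from typing import Any, Dict, Iterable, List, Optional


class KeyValueStore:
    """In-memory stand-in for a key-value store (a plain hash table)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Associate the value with the key, overwriting any previous value.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Return the value for the key, or None if the key is absent.
        return self._data.get(key)

    def multi_get(self, keys: Iterable[str]) -> List[Optional[Any]]:
        # Return values in the same order as the requested keys.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Remove the entry; deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", "Sakis")
store.put("user:2", "Themis")
store.delete("user:2")
```

After these calls, `store.get("user:1")` returns the stored value, while the deleted key `user:2` is no longer found.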
18. Key-Value Store
• Pros
– Very fast
– Scalable
– Simple model
– Distribute horizontally
• Cons
– Many data structures (objects) not easily modeled
– As data volume rises, maintaining unique values as keys is difficult
19. Document Store
• The data is a collection of key-value pairs packaged as a document; in this
respect a document store is similar to a key-value store
• Difference is that the values stored (documents) provide some structure
and encoding of the managed data
• XML, JSON (JavaScript Object Notation) and BSON (binary-encoded JSON)
are some common standard encodings
20. Column Store
• Data stored in cells grouped in columns of data rather than rows
• Columns logically grouped into column families
• Families can contain a virtually unlimited number of columns, which can be created
at runtime or in the schema definition
• Reads and writes are done using columns rather than rows
• Benefit of storing data in columns is fast search/access and data aggregation
• All the cells corresponding to a column are stored as a contiguous disk entry, which
makes search/access faster
21. Column Store - Data Model
• ColumnFamily - A single structure that can group Columns and SuperColumns
• Key - permanent name of the record. Keys have different numbers of columns, so
the database can scale in an irregular way
• Key-space - Defines the outermost level of an organization, typically the name of
the application
• Column - Tuple with a name and a value; columns are kept as an ordered list of elements
22. ACID Transactions - Atomic
• Either the whole process is done or none
• If transaction successful – commit
• System responsible for saving all changes to database
• If transaction unsuccessful - abort
• System responsible for rollback of all changes
23. ACID Transactions - Consistent
• Database constraints preserved
• Enterprise rules limit occurrence of some real-world events
• Customer cannot withdraw if balance less than minimum
• These limitations are integrity constraints: assertions that must be
satisfied by all database states (state invariants)
• Isolated - User sees as if only one process executes at a time - two
concurrent transactions will not see one another’s changes while “in flight”
24. ACID Transactions - Durable
• Effects of a process not lost if the system crashes
• System ensures that once a transaction commits, its effect on the database state
is not lost despite subsequent failures
• Database stored redundantly on mass storage devices to protect against media
failure
• Related to Availability - extent to which a (possibly distributed) system can
provide service despite failure
– Non-stop DBMS (mirrored disks)
– Recovery based DBMS (log)
25. CAP Theorem
• Brewer’s Theorem, presented in 2000 by Prof. Eric Brewer of the University of
California, Berkeley
• Consistency: Every node in the system contains the same data
• Replicas never out of date
• Availability - Every request to a non-failing node in the system returns a
response
– System available during software and hardware upgrades and node failures
– Traditionally thought of as server/process available for five 9’s (99.999 %)
– For a large multi-node system, at any point there’s a good chance that a node is
down or that there is a network disruption among the nodes
• Need system resilience during network disruption
27. CAP Theorem
• Partition Tolerance - System properties (consistency and/or availability) hold even
when the system is partitioned (communication lost) and data is lost (node lost)
• A system can continue to operate in the presence of network partitions
• At most two of these three properties supported for any shared-data system
• Scaling out requires partition
• It leaves either consistency or availability to choose from
• In almost all cases, availability chosen over consistency
28. Eventual Consistency
• BASE (Basically Available, Soft state, Eventual consistency)
• BASE is an alternative to ACID
• Weak consistency – stale data OK
• When no updates occur for a long period of time, eventually all updates
propagate through the system and all the nodes are consistent
• For a given accepted update and a given node, eventually either the update
reaches the node or the node is removed from service
• Availability first
• Approximate answers
29. Eventual Consistency
• Given a sufficiently long period of time over which no changes are sent, all
updates can be expected to propagate eventually through the system and all the
replicas will be consistent
• Conflict resolution
– Read repair - The correction is done when a read finds an inconsistency. This slows
down the read operation
– Write repair - The correction takes place during a write operation, if an inconsistency
has been found, slowing down the write operation
– Asynchronous repair - The correction is not part of a read or write operation
30. NoSQL Advantages
• Cheap - open source
• Easy to implement
• Data replicated to multiple nodes (identical and fault-tolerant)
• Partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• No predefined schema
• Scale up and down
• Relax the data consistency requirement (CAP)
31. NoSQL Downsides
• Joins
• Group by
• Order by
• ACID transactions
• SQL frustrating but still a powerful query language
• Easy integration with other applications that support SQL
32. Gossip Protocol & Hinted Handoffs
• Most preferred communication protocol in a distributed environment
• All the nodes talk to each other peer wise
• No global state
• No single coordinator
• If one node goes down and there is a quorum, the load for the down node is
shared by the others
• Self managing system
• If a new node joins, load is also distributed
• Requests coming to node F are handled by node C. When F becomes available, it will get this
information from C
• Self healing property
35. HBase
• An open-source, distributed, column-oriented database built on top of
HDFS based on BigTable
• A distributed data store scalable horizontally to 1,000’s of commodity
servers and petabytes of indexed storage
• Designed to operate on top of the Hadoop distributed file system (HDFS)
or Kosmos File System (KFS - Cloudstore) for scalability, fault tolerance
and high availability
36. HBase History
• Started by Chad Walters and Jim
• 2006.11 - Google releases paper on BigTable
• 2007.2 - Initial HBase prototype created as Hadoop contribution
• 2007.10 - First usable HBase
• 2008.1 - Hadoop becomes Apache top-level project and HBase becomes subproject
• 2008.10 - HBase 0.18, 0.19 released
37. A Big Map
• Row Key + Column Key + timestamp => value
Row Key   Column Key   Timestamp       Value
1         Info:name    1273516197868   Sakis
1         Info:age     1273871824184   21
1         Info:sex     1273746281432   Male
2         Info:name    1273863723227   Themis
2         Info:name    1273973134238   Andreas
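The "big map" idea can be sketched as a nested dictionary keyed by row key, column key and timestamp. The function names `put` and `get` here are illustrative only and are not the HBase client API; the sketch assumes that a read without a timestamp returns the newest version.

```python
from typing import Dict, Optional, Tuple

# (row key, column key) -> {timestamp: value}
BigMap = Dict[Tuple[str, str], Dict[int, str]]

table: BigMap = {}


def put(row: str, column: str, timestamp: int, value: str) -> None:
    # A write never overwrites: it adds a new timestamped version.
    table.setdefault((row, column), {})[timestamp] = value


def get(row: str, column: str, timestamp: Optional[int] = None) -> Optional[str]:
    # Return the newest version at or before `timestamp` (latest if omitted).
    versions = table.get((row, column), {})
    eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
    return versions[max(eligible)] if eligible else None


put("1", "Info:name", 1273516197868, "Sakis")
put("2", "Info:name", 1273863723227, "Themis")
put("2", "Info:name", 1273973134238, "Andreas")
```

Row 2 holds two versions of `Info:name`: a plain `get("2", "Info:name")` sees the later one, while passing a timestamp between the two writes retrieves the earlier one.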
38. Why BigTable?
• RDBMS performance good for transaction processing
• Very large scale analytic processing solutions are commercial, expensive,
and specialized
• Very large scale analytic processing
– Big queries – typically range or table scans
– Big databases (100s of TB)
39. Why BigTable?
• Map reduce on Bigtable with optional cascading on top to support some
relational algebras - a cost effective solution
• Sharding not a solution to scale open source RDBMS platforms
– Application specific
– Labor intensive (re)partitioning
40. HBase as Hadoop Component
• HBase built on top of HDFS
• HBase files internally stored in HDFS
41. HBase Data Model
• Based on Google’s Bigtable model - Key-Value pairs
• HBase schema consists of several tables
• Each table consists of a set of column families
– Columns not part of schema
• Tables sorted by Row
[Figure labels: Row key, Column Family, value, TimeStamp]
42. HBase Data Model
• Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns
• Table schema only defines its column families
– Each family has any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free.
– Columns within a family sorted and stored together
• Everything except table names is byte[]
• (Row, Family:Column, Timestamp) => Value
43. Components
• Region
– A subset of a table's rows, like horizontal range partitioning
– Automatic
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
44. HBase Members
• Master
– Monitors region servers
– Load balancing for regions
– Redirect client to correct region servers
– Current SPOF
– Assigns regions, detects failures of Region Servers
– Control admin function
• Slaves – Region Servers
– Region - A subset of table's rows
– Serves data for reads and writes
– Send Heartbeat to Master
45. HBase Regions
• Each HTable (column family) is partitioned horizontally into regions
– Regions are counterpart to HDFS blocks
46. Regions
• Contain an in-memory data store (MemStore) and a persistent data store (HFile)
• All regions on a region server share a reference to the write-ahead log (WAL)
which is used to store new data that hasn't yet been persisted to permanent
storage and to recover from region server crashes
• Each region holds a specific range of row keys, and when a region exceeds a
configurable size, HBase automatically splits the region into two child regions,
which is the key to scaling HBase
51. HBase vs. HDFS
• Both distributed systems that scale to hundreds or thousands
of nodes
• HDFS is good for batch processing (scans over big files)
– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates
52. HBase vs. HDFS
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)
• HBase updates are done by creating new versions of values
53. HBase vs. HDFS
• If the application needs neither random reads nor random writes, stick to HDFS
55. When to Use HBase
• Random read, write or both are required
• Need to do many thousands of operations per second on multiple TB of
data
• Access patterns are well-known and simple
58. Data Model
• Version number can be user-supplied
– Does not even have to be inserted in increasing order
– Version numbers are unique within each key
• Table can be very sparse
– Many cells are empty
• Keys are indexed as the primary key
Has two columns
[cnnsi.com & my.look.ca]
59. Physical Model
• Each column family is stored in separate files (called HFiles)
• Key & version numbers are replicated with each column family
• Empty cells are not stored
HBase maintains a multi-level index on values:
<key, column family, column name, timestamp>
61. Zookeeper and HBase
• HBase depends on Zookeeper
• Zookeeper is used to manage master election and
server availability
• Set up a cluster, provides distributed
coordination primitives
• A tool for building cluster
management systems
62. Connecting to HBase
• Java client
– get(byte [] row, byte [] column, long timestamp, int versions);
• Non-Java clients
– Thrift server hosting HBase client instance
• Sample ruby, C++, & java (via thrift) clients
– REST server hosts HBase client
• TableInput / OutputFormat for MapReduce
– HBase as MR source or sink
• HBase Shell
– JRuby IRB with “DSL” to add get, scan, and admin
– ./bin/hbase shell YOUR_SCRIPT
63. Apache Thrift
• $ hbase-daemon.sh start thrift
• $ hbase-daemon.sh stop thrift
• High performance, scalable, cross-language serialization and RPC framework
• Created at Facebook along with Cassandra
• A cross-language, service-generation framework
• Binary Protocol (like Google Protocol Buffers)
• Compiles to: C++, Java, Python, PHP, Ruby, Perl, …
64. HBase API
• get(key) – Extract value given a key
– get(row)
• put(key, value) - Create or update the value given its key
– put(row, Map<column, value>)
• delete(key) -- Remove the key and its associated value
• execute(key, operation, parameters)
– operate on value given a key
– List, Set, Map…
65. Hive HBase Integration
• Reasons to use Hive on HBase
– Large data in HBase for use in a real-time environment, but never used for analysis
– Give access to data in HBase usually only queried through MapReduce to people
that don’t code (business analysts)
– When needing a more flexible storage solution, so that rows can be updated live
by either a Hive job or an application and can be seen immediately to the other
• Reasons not to do it
– Run SQL queries on HBase to answer live user requests (it’s still a MR job)
– Hoping to see interoperability with other SQL analytics systems
67. HBase - Benefits
• Distributed storage
• Table-like in data structure - Multi-dimensional map
• High scalability, availability and performance
• No real indexes
• Automatic partitioning
• Scale linearly and automatically with new nodes
• Commodity hardware
• Fault tolerance
• Batch processing
68. HBase Limitations
• Tables have one primary index / key , the row key
• Each row can have any number of columns
• Table schema only defines column families (column family can have any
number of columns)
• Each cell value has a timestamp
• No join operators
• Scans and queries can select a subset of available columns using a
wildcard
69. HBase Limitations
• Lookups
– Fast lookup using row key and optional timestamp
– Full table scan
– Range scan from region start to end
• Limited atomicity and transaction support
– Supports multiple batched mutations of single rows only
– Data is unstructured and un-typed
• No access via SQL
– Programmatic access - Java, Thrift (Ruby, PHP, Python, Perl, C++, ..), HBase Shell
71. Redis NoSQL Database
• Redis is an open source, advanced key-value data store
• Often referred to as a data structure server since keys can contain strings,
hashes, lists, sets and sorted sets
• Redis works with an in-memory dataset
• It is possible to persist dataset either by
– dumping the dataset to disk every once in a while
– or by appending each command to a log
72. Redis NoSQL Database
• Distributed data structure server
• Consistent hashing at client
• Non-blocking I/O, single threaded
• Values are binary safe strings: byte strings
• String : Key/Value Pair, set/get. O(1) for many string operations.
• Lists: lpush, lpop, rpush, rpop - use as stack or queue. O(1)
73. Redis NoSQL Database
• Publisher/Subscriber model
• Set: collection of unique elements - add, pop, union, intersection - set operations.
• Sorted set: unique elements sorted by scores. O(log n). Range operations
• Hash: multiple key/value pairs
– HMSET user:1 username foo password bar age 30
– HGET user:1 age
75. Redis Keys
• Keys are binary safe - it is possible to use any binary sequence as a key
• The empty string is also a valid key
• Too long keys are not a good idea
• Too short keys are often also not a good idea ("u:1000:pwd" versus
"user:1000:password")
• Nice idea is to use some kind of schema, like: "object-type:id:field"
76. Redis Data Types
• Redis is often referred to as a data structure server since keys
can contain
– Strings
– Lists
– Sets
– Hashes
– Sorted Sets
77. Redis Strings
• Most basic kind of Redis value
• Binary safe - can contain any kind of data, for instance a JPEG image or a
serialized Ruby object
• Max 512 Megabytes in length
• Can be used as atomic counters using commands in the INCR family
• Can be appended with the APPEND command
79. Redis Lists
• Lists of strings, sorted by insertion order
• Add elements to a Redis List pushing new elements on the head (on the left) or on
the tail (on the right) of the list
• Max length: (2^32 - 1) elements
• Model a timeline in a social network, using LPUSH to add new elements, and
using LRANGE in order to retrieve recent items
• Use LPUSH together with LTRIM to create a list that never exceeds a given
number of elements
81. Redis Sorted Sets
• Every member of a Sorted Set is associated with a score, which is used to
keep the sorted set ordered, from the smallest to the greatest
score
• You can do a lot of tasks with great performance that are really hard to
model in other kinds of databases
• Probably the most advanced Redis data type
82. Redis Hashes
• Map between string fields and string values
• Perfect data type to represent objects
HMSET user:1000 username antirez password P1pp0 age 34
HGETALL user:1000
HSET user:1000 password 12345
HGETALL user:1000
83. Redis Operations
• It is possible to run atomic operations on data types:
• Appending to a string
• Incrementing the value in a hash
• Pushing to a list
• Computing set intersection, union and difference
• Getting the member with highest ranking in a sorted set
85. Cassandra
• Structured Storage System over a P2P Network
• Was created to power the Facebook Inbox Search
• Facebook open-sourced Cassandra in 2008, and it became an Apache
Incubator project
• In 2010, Cassandra graduated to a top-level project; regular updates and
releases followed
86. Cassandra
• High availability
• Designed to handle large amount of data across multiple servers
• Eventual consistency - trade-off strong consistency in favor of high
availability
• Incremental scalability
• Optimistic Replication
87. Cassandra
• “Knobs” to tune tradeoffs between consistency, durability and latency
• Low total cost of ownership
• Minimal administration
• Tunable consistency
• Decentralized - No single point of failure
• Writes faster than reads
• Uses consistent hashing (logical partitioning) when clustered.
88. Cassandra
• Hinted handoffs
• Peer to peer routing(ring)
• Thrift API
• Multi data center support
• Mimics traditional relational database systems, but with triggers and
lightweight transactions
• Raw, simple data structures
89. Features
• Emphasis on performance over analysis
– Still supports analysis tools like Hadoop
• Organization
– Rows are organized into tables
– First component of a table’s primary key is the partition key
– Rows clustered by the remaining columns of the key
– Columns may be indexed separately from the primary key
– Tables may be created, dropped, altered at runtime without blocking queries
90. Features
• Language
– CQL (Cassandra Query Language) introduced, similar to SQL (flattened
learning curve)
• Peer-to-Peer cluster
– Decentralized design
• Each node has the same role
– No single point of failure
• Avoids issues of master-slave DBMS’s
– No bottlenecking
91. Comparisons
                    | Apache Cassandra | Google Big Table | Amazon DynamoDB
Storage Type        | Column | Column | Key-Value
Best Use            | Write often, read less | Designed for large scalability | Large database solution
Concurrency Control | MVCC | Locks | ACID
Characteristics     | High Availability, Partition Tolerance, Persistence | Consistency, High Availability, Partition Tolerance, Persistence | Consistency, High Availability
Key Point – Cassandra offers a healthy cross between BigTable and Dynamo.
92. Cassandra History
Google Bigtable (2006)
• consistency model: strong
• data model: sparse map
• clones: hbase, hypertable
Amazon Dynamo (2007)
• O(1) dht
• consistency model: client tune-able
• clones: riak, voldemort
Cassandra ~= Bigtable + Dynamo
93. Architecture Overview
• Cassandra was designed with the understanding that system/ hardware
failures can and do occur
• Peer-to-peer, distributed system
• All nodes are the same
• Data partitioned among all nodes in the cluster
• Custom data replication to ensure fault tolerance
• Read/Write-anywhere design
100. Multi-Geography - Zone Aware
Cassandra allows a single logical database to span 1-N datacenters that are
geographically dispersed. Also supports a hybrid on-premise/Cloud
implementation
101. Partitioning
• Nodes are logically structured in a ring topology
• Hashed value of the key associated with a data partition is used to assign it to a
node in the ring
• Hash values wrap around after a maximum value to support the ring structure
• Lightly loaded nodes move position to alleviate highly loaded nodes
103. Data Redundancy
• Cassandra allows for customizable data redundancy so that data is
completely protected
• Supports rack awareness (data can be replicated between different racks
to guard against machine/rack failures)
• Uses Zookeeper to choose a leader which tells nodes the range they are
replicas for
105. Operations
• A client issues a write request to a random node in the Cassandra cluster
• Partitioner determines the nodes responsible for the data
• Locally, write operations are logged and then applied to an in-memory
version
• Commit log is stored on a dedicated disk local to the machine
• Relies on local file system for data persistency
106. Operations
• Write operations happen in 2 steps
– Write to commit log in local disk of the node
– Update in-memory data structure
– Why 2 steps, and is there any preference for the order of execution?
• Read operation
– Looks up the in-memory data structure first before looking up files on disk
– Uses Bloom Filter (summarization of keys in file store in memory) to
avoid looking up files that do not contain the key
107. Consistency
• Read Consistency
– Number of nodes that must agree before read request returns
– ONE to ALL
• Write Consistency
– Number of nodes that must be updated before a write is considered successful
– ANY to ALL
– At ANY, a hinted handoff is all that is needed to return.
• QUORUM
– Commonly used middle-ground consistency level
– Defined as (replication_factor / 2) + 1
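The quorum formula above uses integer division. A minimal sketch follows; the helper names `quorum` and `overlaps` are illustrative, and the second function encodes the commonly used rule that reads are guaranteed to see the latest acknowledged write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    # Integer division, as in (replication_factor / 2) + 1.
    return replication_factor // 2 + 1


def overlaps(read_nodes: int, write_nodes: int, replication_factor: int) -> bool:
    # True when every read set must intersect every write set.
    return read_nodes + write_nodes > replication_factor


rf = 3
q = quorum(rf)  # 2 replicas out of 3
```

With a replication factor of 3, QUORUM reads combined with QUORUM writes overlap (2 + 2 > 3), whereas ONE reads combined with ONE writes do not (1 + 1 is not greater than 3).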
108. Hinted Handoff Write
• Write intended for a node
that is offline
• An online node, processing
the request, makes a note
to carry out the write once
the node comes back online
109. Write Properties
• No locks in the critical path
• Sequential disk access
• Behaves like a write back Cache
• Append support without read ahead
• Atomicity guarantee for a key
• Always Writable
– accept writes during failure scenarios
110. Write Operations
• Stages
– Logging data in the commit log
– Writing data to the memtable
– Flushing data from the memtable
– Storing data on disk in SSTables
• Commit Log
– First place a write is recorded
– Crash recovery mechanism
– Write not successful until recorded in commit log
– Once recorded in commit log, data is written to Memtable
111. Write Operations
• Memtable
– Data structure in memory
– Once memtable size reaches a threshold, it is flushed (appended) to SSTable
– Several may exist at once (1 current, any others waiting to be flushed)
– First place read operations look for data
• SSTable
– Kept on disk
– Immutable once written
– Periodically compacted for performance
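The write path described above (commit log, then memtable, then a flush to an immutable SSTable) can be sketched as follows. This is a simplified single-node model under stated assumptions: the class `Node` and the `memtable_limit` parameter are hypothetical, SSTables are represented by in-memory dictionaries rather than files, and the commit log is never truncated.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # append-only, for crash recovery
        self.memtable: Dict[str, str] = {}           # in-memory, mutable
        self.sstables: List[Dict[str, str]] = []     # stand-ins for on-disk files
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. record in the commit log first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Freeze the memtable into a new sorted SSTable that is never modified.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:                 # memtable is checked first
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # then newest SSTable to oldest
            if key in sstable:
                return sstable[key]
        return None


node = Node(memtable_limit=2)
node.write("a", "1")
node.write("b", "2")  # memtable reaches its limit and is flushed
node.write("a", "3")  # newer value lives in the fresh memtable
```

Because the memtable is consulted before any SSTable, the read of `a` returns the newer value even though an older one still sits in the flushed SSTable.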
113. Read Repair
• On read, nodes are queried until the number of nodes which respond with
the most recent value meets a specified consistency level from ONE to
ALL
• If the consistency level is not met, nodes are updated with the most
recent value which is then returned
• If the consistency level is met, the value is returned and any nodes that
reported old values are then updated
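A simplified sketch of read repair is shown below. It assumes each replica stores a `(timestamp, value)` pair per key and, unlike the consistency-level logic described above, it contacts every replica on each read; the function name `read_with_repair` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]  # (timestamp, value)


def read_with_repair(replicas: List[Dict[str, Versioned]], key: str) -> Optional[str]:
    # Collect each replica's version of the key (None if the replica lacks it).
    answers = [r.get(key) for r in replicas]
    present = [a for a in answers if a is not None]
    if not present:
        return None
    newest = max(present)  # tuples compare by timestamp first
    # Repair: push the newest version to every stale or missing replica.
    for replica, answer in zip(replicas, answers):
        if answer != newest:
            replica[key] = newest
    return newest[1]


r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}
r3: Dict[str, Versioned] = {}
replicas = [r1, r2, r3]
result = read_with_repair(replicas, "k")
```

After the call, all three replicas hold the most recent version, so a later read from any of them agrees.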
115. Delete Operations
• Tombstones
– On delete request, records are marked for deletion
– Similar to Recycle Bin
– Data is actually deleted on major compaction or configurable timer
116. Gossip Protocols
• Used to discover location and state information about the
other nodes participating in a Cassandra cluster
• Network communication protocols inspired by real-life
rumor spreading
• Periodic, Pairwise, inter-node communication
• Low frequency communication ensures low cost
117. Gossip Protocols
• Random selection of peers
• Example – Node A wishes to search for a pattern in data
– Round 1 – Node A searches locally and then gossips with node B
– Round 2 – Nodes A and B gossip with C and D
– Round 3 – Nodes A, B, C and D gossip with 4 other nodes ……
• Round by round doubling makes protocol very robust
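The round-by-round spreading can be simulated in a few lines. This sketch assumes a push-only model in which every informed node contacts one peer chosen uniformly at random each round (possibly itself or an already informed node), so real spreading is somewhat slower than perfect doubling; the function name `gossip_rounds` and the seed parameter are illustrative.

```python
import random
from typing import Set


def gossip_rounds(num_nodes: int, seed: int = 0) -> int:
    """Count rounds until a rumor started at node 0 reaches every node."""
    rng = random.Random(seed)
    informed: Set[int] = {0}
    rounds = 0
    while len(informed) < num_nodes:
        # Every currently informed node gossips with one random peer.
        for _ in list(informed):
            informed.add(rng.randrange(num_nodes))
        rounds += 1
    return rounds


rounds_needed = gossip_rounds(64)
```

Since the informed set can at most double per round, 64 nodes need at least 6 rounds; random peer selection typically adds a few more.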
118. Failure Detection
• Gossip process tracks heartbeats from other nodes both directly and indirectly
• Node Fail state is given by variable Φ
– tells how likely a node might fail (suspicion level) instead of simple binary value (up/down).
• This type of system is known as Accrual Failure Detector
• Takes into account network conditions, workload, or other conditions that might
affect perceived heartbeat rate
• A threshold for Φ is used to decide if a node is dead
– If node is correct, phi will be constant set by application.
– Generally Φ(t) = 0
119. Failure Detection
• Uses Scuttlebutt (a gossip protocol) to manage nodes
• Uses gossip for node membership and to transmit system control state
• Lightweight with mathematically provable properties
• State disseminated in O(log N) rounds where N is the number of nodes in
the cluster
• Every T seconds each member increments its heartbeat counter and
selects one other member to send its list to
• A member merges the list with its own list
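The list merge in the last step can be sketched as keeping, for each member, the highest heartbeat counter seen so far. The function name `merge_heartbeats` and the dictionary representation of the list are assumptions made for illustration.

```python
from typing import Dict


def merge_heartbeats(mine: Dict[str, int], received: Dict[str, int]) -> Dict[str, int]:
    # Keep the highest heartbeat counter seen for each member.
    merged = dict(mine)
    for member, counter in received.items():
        if counter > merged.get(member, -1):
            merged[member] = counter
    return merged


a_view = {"A": 12, "B": 7}
b_view = {"B": 9, "C": 3}
merged_view = merge_heartbeats(a_view, b_view)
```

Here node A learns a fresher counter for B and discovers member C, while its own entry is unchanged.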
120. Accrual Failure Detector
• Valuable for system management, replication, load balancing etc
• Node Fail state is given by variable ‘phi’ which tells how likely a
node might fail (suspicion level) instead of simple binary value
(up/down)
• Defined as a failure detector that outputs a value, PHI, associated
with each process.
• Also known as Adaptive Failure detectors - designed to adapt to
changing network conditions
121. Accrual Failure Detector
• The value output, PHI, represents a suspicion level
• Applications set an appropriate threshold, trigger suspicions
and perform appropriate actions
• In Cassandra the average time taken to detect a failure is 10-15 seconds
with the PHI threshold set at 5
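One way to compute a suspicion level is sketched below. It assumes heartbeat inter-arrival times follow an exponential distribution, so that PHI is the negative base-10 logarithm of the probability that a heartbeat would arrive even later than the current silence; the function name `phi` and this distributional choice are assumptions, not something stated on the slides.

```python
import math
from typing import Sequence


def phi(time_since_last: float, intervals: Sequence[float]) -> float:
    # Mean gap between past heartbeats, estimated from recent history.
    mean = sum(intervals) / len(intervals)
    # Under an exponential model, P(gap > t) = exp(-t / mean), so
    # -log10(P) simplifies to t / (mean * ln 10).
    return time_since_last / (mean * math.log(10))


history = [1.0, 1.0, 1.0]          # heartbeats arrived about once per second
suspicion_early = phi(2.0, history)
suspicion_late = phi(10.0, history)
```

The suspicion level grows with the length of the silence, and an application would compare it against its chosen threshold rather than treating the node as simply up or down.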
122. Performance Benchmark
• Loading of data - limited by network bandwidth
• Read performance for Inbox Search in production
Latency Stat   Search Interactions   Term Search
Min            7.69 ms               7.78 ms
Median         15.69 ms              18.27 ms
Average        26.13 ms              44.41 ms
124. Data Model
• Column: smallest data element, a tuple with a name and a value
• Querying (:Rockets, '1') might return:
{
'name' => 'Rocket-Powered Roller Skates',
'toon' => 'Ready Set Zoom',
'inventoryQty' => '5',
'productUrl' => 'rockets1.gif'
}
125. Data Model
• ColumnFamily - There’s a single structure used to group both the
Columns and SuperColumns. Called a ColumnFamily (think table), it has
two types, Standard & Super.
– Column families must be defined at startup
• Key - the permanent name of the record
• Keyspace - the outer-most level of organization. This is usually the name
of the application. For example, 'Acme' (think database name)
126. Data Model
• Optional super column: a named list. A super column contains standard columns,
stored in recent order
• Suppose OtherProducts has inventory in categories
• Querying (:OtherProducts, '174927') might return
– {'OtherProducts' => {'name' => 'Acme Instant Girl', ..}, 'foods' => {...}, 'martian' => {...},
'animals' => {...}}
• In the example, foods, martian, and animals are all super column names
• They are defined on the fly, and there can be any number of them per row.
:OtherProducts would be the name of the super column family
127. Data Model
• Columns and SuperColumns are both tuples with a name & value. The key difference is that a standard Column’s
value is a “string” and in a SuperColumn the value is a Map of Columns
• Columns are always sorted by their name. Sorting supports:
– BytesType
– UTF8Type
– LexicalUUIDType
– TimeUUIDType
– AsciiType
– LongType
• Each of these options treats the Columns' name as a different data type
128. Tunable Consistency
• Cassandra has programmable read/write consistency
• Any - Ensure that the write is written to at least 1 node
• One - Ensure that the write is written to at least 1 node’s commit log and memory
table before receipt to client
• Quorum - Ensure that the write goes to (N / 2) + 1 nodes, where N is the number of replicas
• All - Ensure that writes go to all nodes. An unresponsive node would fail the write
129. Consistent Hashing
[Figure: hash ring with positions labeled A, H, D, B, M, V, S, R, C]
• Partition using consistent hashing
– Keys hash to a point on a fixed circular space
– Ring is partitioned into a set of ordered slots
and servers and keys hashed over these slots
• Nodes take positions on the circle.
• A, B, and D exist.
– B responsible for AB range.
– D responsible for BD range.
– A responsible for DA range.
• C joins.
– B, D split ranges.
– C gets BC from D.
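The join scenario above can be sketched with a tiny ring. This assumes a node owns the range that ends at its own position, as in the A/B/D example; the `Ring` class, the toy hash, and the numeric positions are all hypothetical choices for illustration.

```javascript
// Toy 32-bit hash (FNV-1a style) that places a string on a ring of 2^32 slots.
function ringPosition(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class Ring {
  constructor() {
    this.nodes = []; // { name, pos }, kept sorted by position
  }

  addNode(name, pos = ringPosition(name)) {
    this.nodes.push({ name, pos });
    this.nodes.sort((a, b) => a.pos - b.pos);
  }

  // Owner of a ring position: the first node at or after it, wrapping around.
  ownerAt(pos) {
    if (this.nodes.length === 0) throw new Error("ring has no nodes");
    const hit = this.nodes.find((n) => n.pos >= pos);
    return (hit || this.nodes[0]).name;
  }

  ownerOf(key) {
    return this.ownerAt(ringPosition(key));
  }
}

// Explicit positions keep the example deterministic.
const ring = new Ring();
ring.addNode("A", 100);
ring.addNode("B", 200);
ring.addNode("D", 400);
const probes = [150, 250, 350, 450];
const before = probes.map((p) => ring.ownerAt(p));

ring.addNode("C", 300); // C joins between B and D
const after = probes.map((p) => ring.ownerAt(p));
console.log(before, after);
```

Only the probe that falls between B and C changes owner (from D to C); every other key stays where it was, which is the point of consistent hashing.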
130. Key-Value Model
• Cassandra is a column-oriented
NoSQL system
• Column families: sets of key-value
pairs
– column family as a table and key-
value pairs as a row (using relational
database analogy)
• A row is a collection of columns
labeled with a name
131. Cassandra Row
• Value of row is itself a sequence
of key-value pairs
• such nested key-value pairs are
columns
• key = column name
• A row must contain at least 1
column
133. Column Names Storing Values
• key: User ID
• column names store tweet ID
values
• values of all column names are
set to “-” (empty byte array) as
they are not used
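A rough in-memory model of this pattern, assuming numeric tweet IDs: the row key is the user ID, each column name is a tweet ID, and the value is an unused placeholder. The names `timeline`, `addTweet`, and `tweetIdsFor` are hypothetical.

```javascript
// Row key (user ID) -> Map of column name (tweet ID) -> placeholder value.
const timeline = new Map();

function addTweet(userId, tweetId) {
  if (!timeline.has(userId)) timeline.set(userId, new Map());
  // The column name carries the data; the value is an empty placeholder.
  timeline.get(userId).set(tweetId, "");
}

function tweetIdsFor(userId) {
  const columns = timeline.get(userId) || new Map();
  // Columns are kept sorted by name, so return the IDs in sorted order.
  return [...columns.keys()].sort((a, b) => a - b);
}

addTweet("user1", 30);
addTweet("user1", 10);
addTweet("user1", 30); // writing the same column name again is idempotent
console.log(tweetIdsFor("user1"));
```

Because column names are unique and sorted within a row, the row doubles as an ordered, duplicate-free set of tweet IDs.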
134. Key Space
• A Key Space is a group of column
families together. It is only a logical
grouping of column families and
provides an isolated scope for
names
135. Comparison with RDBMS
• With RDBMS, a normalized data model is created without
considering the exact queries
– SQL can return almost anything through joins
• With C*, the data model is designed for specific queries
– schema is adjusted as new queries are introduced
• C*: NO joins, relationships, or foreign keys
– a separate table is leveraged per query
– data required by multiple tables is denormalized across those tables
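The "one table per query" idea can be sketched with two in-memory maps standing in for tables. The table names, fields, and sample users are made up for illustration; the point is only that one logical write fans out to every table that serves a query.

```javascript
// Two hypothetical query tables, each keyed for exactly one lookup.
const usersByEmail = new Map(); // email -> user
const usersByCity = new Map(); // city  -> array of users

// One logical write is copied into every table that serves a query.
function saveUser(user) {
  usersByEmail.set(user.email, { ...user });
  if (!usersByCity.has(user.city)) usersByCity.set(user.city, []);
  usersByCity.get(user.city).push({ ...user });
}

saveUser({ email: "a@example.com", name: "Ann", city: "Oslo" });
saveUser({ email: "b@example.com", name: "Bob", city: "Oslo" });

// Each query reads a single table; no join is needed.
const byEmail = usersByEmail.get("a@example.com").name;
const byCity = usersByCity.get("Oslo").map((u) => u.name);
console.log(byEmail, byCity);
```

The cost is duplicated data and extra writes; the benefit is that each read is a single-key lookup.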
136. Compaction
• Compaction runs periodically to merge multiple SSTables
– Reclaims space
– Creates new index
– Merges keys
– Combines columns
– Discards tombstones
– Improves performance by minimizing disk seeks
• Types
– Major
– Read-only
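A simplified sketch of what a compaction pass does with several SSTables: keep the newest version of each key, drop tombstones, and write the result back in key order. The cell shape (`key`, `value`, `timestamp`, `deleted`) is an assumption for illustration, and real Cassandra only drops tombstones after a grace period (`gc_grace_seconds`), which is ignored here.

```javascript
// Each SSTable is modeled as an array of cells.
// A cell with deleted: true is a tombstone.
function compact(sstables) {
  const newest = new Map();
  for (const table of sstables) {
    for (const cell of table) {
      const seen = newest.get(cell.key);
      // Keep only the most recent version of each key.
      if (!seen || cell.timestamp > seen.timestamp) newest.set(cell.key, cell);
    }
  }
  return [...newest.values()]
    .filter((cell) => !cell.deleted) // discard tombstones
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0)); // key order
}

const older = [
  { key: "a", value: 1, timestamp: 1 },
  { key: "b", value: 2, timestamp: 1 },
];
const newer = [
  { key: "a", value: 9, timestamp: 2 },
  { key: "b", deleted: true, timestamp: 3 },
  { key: "c", value: 3, timestamp: 2 },
];
const merged = compact([older, newer]);
console.log(merged);
```

Two tables with five cells collapse into one table with two live cells, which is where the reclaimed space and fewer disk seeks come from.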
137. Anti-Entropy
• Replica synchronization mechanism
• Ensures synchronization of data across nodes
• Compares data checksums against neighboring nodes
• Uses Merkle trees (Hash trees)
– Snapshot of data sent to neighboring nodes
– Created and broadcast on every major compaction
– If two nodes take snapshots within TREE_STORE_TIMEOUT of each other,
snapshots are compared and data is synced.
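A minimal Merkle tree sketch, assuming each replica's data is a list of row strings in the same order. Equal roots mean the replicas agree; differing leaf hashes identify what needs syncing. The function names are hypothetical, and for brevity `differingLeaves` compares leaves directly, whereas a real implementation would walk down from the root and skip matching subtrees.

```javascript
const crypto = require("crypto");

function sha256(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Build the tree bottom-up; returns an array of levels, leaves first, root last.
function buildMerkleTree(rows) {
  let level = rows.map((r) => sha256(r));
  const levels = [level];
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += 2) {
      // An odd node at the end of a level is paired with itself.
      const right = i + 1 < level.length ? level[i + 1] : level[i];
      next.push(sha256(level[i] + right));
    }
    level = next;
    levels.push(level);
  }
  return levels;
}

function rootOf(levels) {
  return levels[levels.length - 1][0];
}

// Indexes of leaves whose hashes differ (assumes equal leaf counts).
function differingLeaves(a, b) {
  return a[0].map((h, i) => (h !== b[0][i] ? i : -1)).filter((i) => i >= 0);
}

const replicaA = buildMerkleTree(["k1=a", "k2=b", "k3=c", "k4=d"]);
const replicaB = buildMerkleTree(["k1=a", "k2=b", "k3=X", "k4=d"]);
console.log(rootOf(replicaA) === rootOf(replicaB), differingLeaves(replicaA, replicaB));
```

Comparing two root hashes is enough to detect that the replicas diverge; only the mismatching range then has to be transferred.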
139. Cassandra Query Language - CQL
• Creating a keyspace - namespace of tables
CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
• To use namespace:
USE demo;
141. CQL
• Insert
– INSERT INTO users (email, bio, birthday, active) VALUES
('Tom.Stok@btx.com', 'StarTeammate', 516513612220, true);
– Timestamp fields are specified in milliseconds since epoch
• Query tables
– SELECT expression reads one or more records from a Cassandra column family
and returns a result-set of rows
– SELECT * FROM users;
– SELECT email FROM users WHERE active = true;
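The birthday literal in the INSERT is a count of milliseconds since the Unix epoch. JavaScript's `Date` uses the same unit, so a short sketch can show the conversion in both directions; the second date is an arbitrary example.

```javascript
// The birthday value from the INSERT, in milliseconds since the Unix epoch.
const birthdayMs = 516513612220;

// Date takes milliseconds directly, so no scaling is needed.
const birthday = new Date(birthdayMs);
console.log(birthday.toISOString());

// Going the other way: a calendar date to a millisecond timestamp.
// Months are 0-based, so this is 2000-01-01T00:00:00Z.
const y2kMs = Date.UTC(2000, 0, 1);
console.log(y2kMs);
```

A value given in seconds instead of milliseconds would land in early 1970, which is a common source of wrong dates.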
142. Cassandra Advantages
• Perfect for time-series data
• High performance
• Decentralization
• Nearly linear scalability
• Replication support
• No single points of failure
• MapReduce support
143. Cassandra Weaknesses
• No referential integrity
– no concept of JOIN
• Querying options for retrieving data are limited
• Sorting data is a design decision
– no GROUP BY
• No support for atomic operations
– if operation fails, changes can still occur
• First think about queries, then about data model
144. Key Points
• Cassandra is designed as a distributed database management system
– use it when you have a lot of data spread across multiple servers
• Cassandra write performance is always excellent, but read performance
depends on write patterns
– it is important to spend enough time to design proper schema around the
query pattern
• Having a high-level understanding of some internals is a plus
– helps in designing a robust application on top of Cassandra
145. Hector – Java API for Cassandra
• Sits on top of Thrift
• Load balancing
• JMX monitoring
• Connection-pooling
• Failover
• JNDI integration with application servers
• Additional methods on top of the standard get, update, delete methods.
• Under discussion
– hooks into Spring declarative transactions
146. Memcached Database
• Key-Value Store
• Very easy to setup and use
• Consistent hashing
• Scales very well
• In memory caching, no persistence
• LRU eviction policy
• O(1) to set/get/delete
• Atomic operations set/get/delete
• No iterators, or iteration is very difficult
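The combination of O(1) set/get/delete and LRU eviction can be sketched with a `Map`, which remembers insertion order. This is an illustrative toy with a hypothetical `LruCache` class, not memcached's actual implementation.

```javascript
// A Map iterates in insertion order, so its first key is the least recently used.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Map();
  }

  get(key) {
    if (!this.items.has(key)) return undefined;
    const value = this.items.get(key);
    // Re-insert to mark the key as most recently used.
    this.items.delete(key);
    this.items.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.items.has(key)) this.items.delete(key);
    this.items.set(key, value);
    if (this.items.size > this.capacity) {
      // Evict the least recently used entry.
      const oldest = this.items.keys().next().value;
      this.items.delete(oldest);
    }
  }

  delete(key) {
    return this.items.delete(key);
  }
}

const cache = new LruCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a"); // touch "a" so "b" becomes the eviction candidate
cache.set("c", 3); // capacity exceeded: "b" is evicted
console.log([...cache.items.keys()]);
```

Nothing here is persisted: when the process exits, the cache is gone, which matches the "no persistence" bullet.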
148. MongoDB
• Publicly released in 2009
• Allows data to persist in a nested state
• Query that nested data in an ad hoc fashion
• Enforces no schema
• Documents can optionally contain fields or types that no other document
in the collection contains
• NoSQL
150. MongoDB
• Document-oriented database
• Uses BSON format – Binary JSON
• An instance may have zero or more databases
• A database may have zero or more collections
• A collection may have zero or more documents
• A document may have one or more fields
• Indexes function like RDBMS counterparts
151. MongoDB
• Data types: bool, int, double, string, object(bson), oid, array, null, date
• Database and collections created automatically
• Language Drivers
• Capped collections are fixed size collections, buffers, very fast, FIFO,
good for logs. No indexes
• Object IDs are generated by the client: 12 bytes of packed data (4-byte time, 3-byte
machine, 2-byte pid, 3-byte counter)
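A sketch of the 12-byte layout described above, packing the four fields into a buffer and reading the timestamp back. The function names and sample values are hypothetical; this only illustrates the byte layout stated on the slide and is not a driver's actual generator.

```javascript
// Pack the four fields into 12 bytes: 4 time + 3 machine + 2 pid + 3 counter.
function makeObjectId(seconds, machine, pid, counter) {
  const buf = Buffer.alloc(12);
  buf.writeUInt32BE(seconds >>> 0, 0); // bytes 0-3: seconds since epoch
  buf.writeUIntBE(machine & 0xffffff, 4, 3); // bytes 4-6: machine identifier
  buf.writeUInt16BE(pid & 0xffff, 7); // bytes 7-8: process id
  buf.writeUIntBE(counter & 0xffffff, 9, 3); // bytes 9-11: counter
  return buf.toString("hex");
}

// Recover the creation time (in seconds) from the first 4 bytes.
function timestampOf(objectIdHex) {
  return parseInt(objectIdHex.slice(0, 8), 16);
}

const id = makeObjectId(0x5f000000, 0xabcdef, 0x1234, 1);
console.log(id, timestampOf(id));
```

Because the timestamp occupies the leading bytes, IDs created later sort after earlier ones at one-second granularity.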
152. MongoDB
• Possible to refer to other documents in different collections, but more
efficient to embed documents
• Replication is easy to set up. Read from slaves
• Supports aggregation
– Map Reduce with JavaScript
• Indexes, B-Trees. Ids are always indexed
153. MongoDB
• Updates are atomic. Low contention locks
• Querying MongoDB is done with a document
– Lazy, returns a cursor
– Reducible to SQL: select, insert, update, limit, sort; upsert (either inserts or
updates)
– Operators - $ne, $and, $or, $lt, $gt, $inc (use a negative value to decrement)
• Repository Pattern for easy development
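To make "querying is done with a document" concrete, here is a simplified in-memory matcher that evaluates a query document against plain objects. It supports only equality, `$ne`, `$lt`, `$gt`, `$and`, and `$or`, and the `matches` and `find` helpers are hypothetical; it illustrates the idea rather than how MongoDB implements it.

```javascript
// Comparison operators supported by this sketch.
const operators = {
  $ne: (actual, expected) => actual !== expected,
  $lt: (actual, expected) => actual < expected,
  $gt: (actual, expected) => actual > expected,
};

// True when the document satisfies every condition in the query document.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    if (field === "$and") return condition.every((q) => matches(doc, q));
    if (field === "$or") return condition.some((q) => matches(doc, q));
    if (condition !== null && typeof condition === "object") {
      // Operator form, e.g. { age: { $gt: 30 } }
      return Object.entries(condition).every(([op, expected]) => {
        if (!operators[op]) throw new Error("unsupported operator: " + op);
        return operators[op](doc[field], expected);
      });
    }
    // Plain form, e.g. { lastname: "Ho" }
    return doc[field] === condition;
  });
}

function find(collection, query) {
  return collection.filter((doc) => matches(doc, query));
}

const sample = [
  { firstname: "Dave", lastname: "Ho", age: 40 },
  { firstname: "Ricky", lastname: "Ho", age: 25 },
  { firstname: "Ann", lastname: "Lee", age: 33 },
];
console.log(find(sample, { lastname: "Ho", age: { $gt: 30 } }));
```

Multiple fields in one query document are combined with an implicit AND, which is why `$and` is rarely needed for simple queries.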
154. MongoDB
• Full Index Support
• Replication & High Availability
• Auto-Sharding
• Querying
• Fast In-Place Updates
• Map/Reduce
158. Commands
// Create a doc and save it into a collection
p = {firstname:"Dave", lastname:"Ho"}
db.person.save(p)
db.person.insert({firstname:"Ricky", lastname:"Ho"})
// Show all docs within a collection
db.person.find()
// Iterate over the result using a cursor
var c = db.person.find()
p1 = c.next()
p2 = c.next()
159. Commands
// Query
p3 = db.person.findOne({lastname:"Ho"})
// Return a subset of fields (i.e., projection)
db.person.find({lastname:"Ho"}, {firstname:true})
// Delete some records
db.person.remove({firstname:"Ricky"})
// To build an index for a collection
db.person.ensureIndex({firstname:1})
160. Commands
// To show all existing indexes
db.person.getIndexes()
// To remove an index
db.person.dropIndex({firstname:1})
// An index can be built on a path of the doc
db.person.ensureIndex({"address.city":1})
// A composite key can be used to build an index
db.person.ensureIndex({lastname:1, firstname:1})
161. Commands
// Data update and transaction: to update an existing doc, we can do the following
var p1 = db.person.findOne({lastname:"Ho"})
p1["address"] = "San Jose"
db.person.save(p1)
// Do the same in one command (arguments: upsert = false, multi = true)
db.person.update({lastname:"Ho"}, {$set:{address:"San Jose"}}, false, true)
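The last two arguments of the update call are the upsert and multi flags. Their effect can be sketched against an in-memory array; this hypothetical `update` helper handles only equality queries and `$set`, and stands in for the real command purely to show the flag semantics.

```javascript
// In-memory stand-in for a collection.
const people = [
  { firstname: "Dave", lastname: "Ho" },
  { firstname: "Ricky", lastname: "Ho" },
  { firstname: "Ann", lastname: "Lee" },
];

// Simplified update: equality-only query, $set only.
function update(collection, query, change, upsert, multi) {
  const isMatch = (doc) =>
    Object.entries(query).every(([k, v]) => doc[k] === v);
  const targets = collection.filter(isMatch);
  if (targets.length === 0 && upsert) {
    // Nothing matched: insert a new document built from query + $set.
    collection.push({ ...query, ...change.$set });
    return 1;
  }
  // multi = false touches only the first match.
  const chosen = multi ? targets : targets.slice(0, 1);
  for (const doc of chosen) Object.assign(doc, change.$set);
  return chosen.length;
}

const changed = update(people, { lastname: "Ho" }, { $set: { address: "San Jose" } }, false, true);
console.log(changed, people);
```

With multi set to true, both "Ho" documents receive the address; with it set to false, only the first match would be changed.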