This document provides an overview of Apache Cassandra's history and architecture. It discusses how Cassandra was influenced by Amazon's Dynamo paper and Google's BigTable. Key aspects of Cassandra covered include its use of consistent hashing to organize data, replication for high availability, and best practices for modeling time series data to address limitations of row-level hashing.
6. Road to Cassandra
● 1999: Napster and other “questionable” P2P services
● 2006: Google Bigtable
○ C* has a similar data storage model.
● 2007: Amazon Dynamo (Avinash Lakshman)
○ C* has similar architecture
● 2008: Facebook Open Sourced C* (Avinash Lakshman)
7. CAP Theorem
● Consistency
○ All nodes see the same data at the same time
● Availability
○ A guarantee that every request receives a response about whether it succeeded or failed
● Partition Tolerance
○ The system continues to operate despite arbitrary message loss or failure of part of the system
e.g: Increasing availability (raising the replication factor) reduces consistency. You can only have two of the three!
8. Dynamo
The motivation:
● You must ALWAYS be able to add to your
shopping cart! (High Availability)
● Conflict resolution is done at the application:
○ merge conflicting shopping carts.
● Primary Key access to data store (RDB limitations)
○ e.g: best seller list, customer preferences, etc
9. Dynamo Architecture
Key principles:
1. Incremental scalability
○ Add nodes w/o disrupting system
2. Symmetry
○ Every node has same responsibility
3. Decentralization
○ peer-to-peer over centralized control
4. Heterogeneity
The work distribution must be proportional to the capabilities of the individual servers.
10. Distributed Hash Table
Data Organization
Distributed Hash Table (DHT) using Consistent Hashing: the keys are mapped to form a ring. The output range of the hash function is treated as a fixed circular “ring” (i.e. the largest hash value wraps around to the smallest hash value).
11. Inserting data: High Level
Hash(RowKey) = 4500
Walk the ring clockwise and insert at Node 5.
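The clockwise lookup above can be sketched in a few lines of Python. This is a toy ring with made-up positions; real Cassandra uses 128-bit Murmur3 tokens, not MD5 mod 10,000:

```python
import bisect
import hashlib

RING_SIZE = 10_000  # toy ring of positions 0..9999; wraps around

def ring_position(key: str) -> int:
    """Map a key to a position on the fixed circular ring."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

def find_node(key_pos: int, node_positions: list) -> int:
    """Walk clockwise: the first node at or after the key's position
    owns it. The largest position wraps around to the smallest."""
    nodes = sorted(node_positions)
    i = bisect.bisect_left(nodes, key_pos)
    return nodes[i % len(nodes)]  # wrap past the end back to the start

# Nodes at positions 1000..5000; a key hashing to 4500 lands
# clockwise on the node at 5000 ("Node 5" in the slide).
nodes = [1000, 2000, 3000, 4000, 5000]
assert find_node(4500, nodes) == 5000
assert find_node(5500, nodes) == 1000  # wraps around the ring
```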
13. Dynamo Architecture
Consistent Hashing
● Advantage:
○ Departure or arrival of a node only affects its immediate
neighbors. Each node is responsible for the range between the previous node's position and its own, moving clockwise.
○ Only K/N keys need to be remapped when a node
drops. K = #keys, N = #nodes
● Disadvantage:
○ ?
16. Dynamo Architecture
Consistent Hashing
● Disadvantage
○ Random node position assignment leads to non-uniform data and load distribution
○ Some nodes may simply be less capable than others
18. Virtual nodes to rescue!
● Instead of mapping a node to a single point in the ring, each node gets assigned to multiple locations in the ring… (what does that mean?)
Virtual Nodes!
20. Virtual Nodes
● V-Nodes look like nodes in the system
● Regular node can be responsible for more
than one V-Node
21. Virtual Nodes: Add Node
Adding a new Node:
● This will evenly balance the data in the
cluster. Server #4 will get data from all the
servers.
○ How?
■ Server 4 is next to 1,2 and 3
22. V-Nodes: Remove Node
When a node goes down the data is evenly
distributed.
When #1 went down, #2 and #3 took over the data.
If we didn’t have virtual nodes #2 would have been
overloaded.
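The rebalancing described above can be sketched as follows. Node names, the v-node count, and the toy MD5 ring are all illustrative assumptions, not Cassandra internals — the point is only that a departing node's ranges are absorbed by many different successors:

```python
import hashlib
from collections import Counter

RING_SIZE = 10_000  # toy ring, wraps around

def vnode_positions(node: str, vnodes: int):
    """Derive several pseudo-random ring positions (v-nodes) for one
    physical node."""
    return [
        int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16) % RING_SIZE
        for i in range(vnodes)
    ]

ring = {}  # ring position -> owning physical node
for node in ["node1", "node2", "node3"]:
    for pos in vnode_positions(node, vnodes=8):
        ring[pos] = node

# When node1 leaves, each of its v-node ranges is absorbed by the next
# *different* node clockwise -- node2 for some ranges, node3 for others,
# instead of one neighbor taking everything.
positions = sorted(ring)
takeover = Counter()
for i, pos in enumerate(positions):
    if ring[pos] == "node1":
        j = (i + 1) % len(positions)
        while ring[positions[j]] == "node1":  # skip node1's own v-nodes
            j = (j + 1) % len(positions)
        takeover[ring[positions[j]]] += 1

print(takeover)  # node1's load is split among the survivors
```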
23. Replication
Why?
To achieve high availability
e.g:
Replication Factor: 3
Hash(KEY1) = 500
Node #1 is the coordinator node
for values 0 to 999
Its job is to replicate it to TWO other nodes.
(In modern C*, coordination is the job of whichever node received the write.)
24. Replication
Server 1 copies the data to TWO other nodes
clockwise to satisfy Replication Factor: 3
If 1 goes down 2 will make sure to keep R.F=3
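The clockwise replica selection can be sketched like this (a toy ring with assumed node names; it mirrors the spirit of Cassandra's SimpleStrategy while ignoring rack and datacenter awareness):

```python
def replicas(key_pos, ring, rf=3):
    """Walk clockwise from the key's position until `rf` distinct
    physical nodes are found. `ring` is a list of (position, node)
    pairs; assumes rf <= number of distinct nodes."""
    positions = sorted(ring)
    # index of the first node at or after the key, wrapping to 0
    start = next((i for i, (p, _) in enumerate(positions) if p >= key_pos), 0)
    chosen = []
    i = start
    while len(chosen) < rf:
        node = positions[i % len(positions)][1]
        if node not in chosen:  # skip repeats (matters with v-nodes)
            chosen.append(node)
        i += 1
    return chosen

ring = [(999, "n1"), (1999, "n2"), (2999, "n3"), (3999, "n4")]
# Hash(KEY1) = 500 -> coordinator n1 (owns 0..999), replicated
# clockwise to n2 and n3 to satisfy RF = 3.
assert replicas(500, ring, rf=3) == ["n1", "n2", "n3"]
```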
25. Example Application
● Patient in critical care. Needs a vital sign
dashboard
● Arduino based Heart Rate and spO2
measuring device.
● Pretty graph and gain insight from the data
36. Table
Patient ID (Partition Key) | Event Time (Clustering Column)
1, 2015-02-17 | T:22:00:01, HR:71 | T:22:00:00, HR:72
2, 2015-02-17 | T:22:00:05, HR:90 | T:22:00:02, HR:95
Data is SORTED and stored sequentially on disk.
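A rough in-memory analogy of this layout (illustrative only; Cassandra's actual on-disk SSTable format is far more involved): rows are grouped by partition key and kept sorted by the clustering column. The sketch sorts ascending by event time; the slide shows newest-first, which corresponds to a descending clustering order:

```python
import bisect
from collections import defaultdict

# patient_id (partition key) -> rows kept sorted by event_time
# (clustering column), mimicking sequential on-disk order.
partitions = defaultdict(list)

def insert(patient_id, event_time, heart_rate):
    """Insert a reading, maintaining sorted order within the partition."""
    bisect.insort(partitions[patient_id], (event_time, heart_rate))

insert(1, "2015-02-17T22:00:00", 72)
insert(1, "2015-02-17T22:00:01", 71)
insert(2, "2015-02-17T22:00:02", 95)
insert(2, "2015-02-17T22:00:05", 90)

# Reading a time range for one patient is a sequential scan of
# one sorted partition -- no cross-node gather needed.
assert [hr for _, hr in partitions[1]] == [72, 71]
```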