Apache Cassandra 
Fundamentals 
or: 
How I stopped worrying and learned to love the CAP theorem 
Russell Spitzer 
@RussSpitzer 
Software Engineer in Test at DataStax
Who am I? 
• Former Bioinformatics Student 
at UCSF 
• Work on the integration of 
Cassandra (C*) with Hadoop, 
Solr, and Redacted! 
• I Spend a lot of time spinning up 
clusters on EC2, GCE, Azure, … 
http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make 
sure that C* Scales
Apache Cassandra is a Linearly Scaling
and Fault Tolerant NoSQL Database
Linearly Scaling: 
The power of the database 
increases linearly with the 
number of machines 
2x machines = 2x throughput 
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 
Fault Tolerant: 
Nodes down != Database Down 
Datacenter down != Database Down
CAP Theorem Limits What 
Distributed Systems can do 
Consistency 
When I ask the same question to any part of the system I should get the same answer 
How many planes do we have?
CAP Theorem Limits What 
Distributed Systems can do 
Consistency 
When I ask the same question to any part of the system I should get the same answer 
How many planes do we have? 
Consistent 
1 1 1 1 1 1 1
CAP Theorem Limits What 
Distributed Systems can do 
Consistency 
When I ask the same question to any part of the system I should get the same answer 
How many planes do we have? 
Not Consistent 
1 4 1 2 1 8 1
CAP Theorem Limits What 
Distributed Systems can do 
Availability
When I ask a question I will get an answer
How many planes do we have? 
Available 
1 zzzzz *snort* zzz
CAP Theorem Limits What 
Distributed Systems can do 
Availability 
When I ask a question I will get an answer 
How many planes do we have? 
I have to wait for major snooze to wake up 
zzzzz *snort* zzz 
Not Available
CAP Theorem Limits What 
Distributed Systems can do 
Partition Tolerance 
I can ask questions even when the system is having intra-system communication 
problems 
How many planes do we have? 
Team Edward Team Jacob 
1 
Tolerant
CAP Theorem Limits What 
Distributed Systems can do 
Partition Tolerance 
I can ask questions even when the system is having intra-system communication 
problems 
How many planes do we have? 
Not Tolerant 
Team Edward Team Jacob 
I’m not sure without asking those 
vampire lovers and we aren’t speaking
Cassandra is an AP System 
which is Eventually Consistent 
Eventually consistent: 
New information will make it to everyone eventually 
How many planes do we have? How many planes do we have? 
I don’t know without asking those 
vampire lovers and we aren’t speaking 
1 1 1 1 1 1 
I just heard we actually have 2!
2 2 2 2 2 2 2
Two knobs control fault tolerance in 
C*: Replication and Consistency Level 
Server Side - Replication: 
How many copies of the data should exist in the cluster?
[Diagram: a client sends a request to a coordinator node; with RF=3, the four nodes in the ring hold replica sets ABD, ABC, ACD, and BCD]
SimpleStrategy: Replicas 
NetworkTopologyStrategy: Replicas per Datacenter
Two knobs control fault tolerance in 
C*: Replication and Consistency Level 
Client Side - Consistency Level: 
How many replicas should we check before 
acknowledgment? 
[Diagram: with CL = One, the coordinator waits for an acknowledgment from a single replica before responding to the client]
Two knobs control fault tolerance in 
C*: Replication and Consistency Level 
Client Side - Consistency Level: 
How many replicas should we check before 
acknowledgment? 
[Diagram: with CL = Quorum, the coordinator waits for acknowledgments from a majority of the replicas before responding to the client]
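The arithmetic behind these consistency levels can be sketched in a few lines. This is an illustrative model, not driver code; the class and method names are invented for this example:

```java
// Illustrative sketch of consistency-level arithmetic; the names
// (QuorumMath, readsSeeLatestWrite) are invented for this example.
public class QuorumMath {

    // QUORUM = floor(RF / 2) + 1, i.e. a majority of the replicas.
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    // If read CL + write CL exceed RF, every read set overlaps every
    // write set, so at least one replica consulted is up to date.
    static boolean readsSeeLatestWrite(int readCl, int writeCl, int rf) {
        return readCl + writeCl > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("QUORUM for RF=3 is " + quorum(rf));              // 2
        System.out.println(readsSeeLatestWrite(quorum(rf), quorum(rf), rf)); // true
        System.out.println(readsSeeLatestWrite(1, 1, rf));                   // false
    }
}
```

With RF=3, pairing QUORUM reads with QUORUM writes means every read overlaps every write on at least one replica, while still tolerating one replica being down.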
Nodes own data whose primary key
hashes to their token ranges
[Diagram: four-node ring holding replica sets ABD, ABC, ACD, and BCD]
Every piece of data belongs on
the node that owns the
Murmur3 (C* 2.0) hash of its
partition key, plus (RF-1) other
nodes
Partition Key: ID: ICBM_432 | Clustering Key: Time: 30 | Rest of Data: Loc: SF, Status: Idle
Murmur3Hash(ID: ICBM_432) lands in node A's token range
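The ring placement above can be sketched with a toy model: small integer tokens and four nodes stand in for Murmur3's 64-bit token space, and the walk-clockwise placement mimics SimpleStrategy. All names here are invented for illustration:

```java
import java.util.*;

public class TokenRing {
    // node token -> node name; TreeMap keeps the ring sorted by token
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String name) { ring.put(token, name); }

    // SimpleStrategy-style placement: start at the first token >= the
    // hash and walk clockwise, collecting RF distinct nodes.
    List<String> replicas(long hash, int rf) {
        List<String> result = new ArrayList<>();
        Long start = ring.ceilingKey(hash);
        if (start == null) start = ring.firstKey();      // wrap past the top
        Iterator<String> it = ring.tailMap(start).values().iterator();
        while (result.size() < rf && result.size() < ring.size()) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around
            String node = it.next();
            if (!result.contains(node)) result.add(node);
        }
        return result;
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(0L, "node1");
        ring.addNode(100L, "node2");
        ring.addNode(200L, "node3");
        ring.addNode(300L, "node4");
        // A hash of 150 falls in the range owned by the node at token
        // 200; with RF=3 the next two nodes clockwise also hold copies.
        System.out.println(ring.replicas(150L, 3)); // [node3, node4, node1]
    }
}
```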
Cassandra writes are FAST 
due to log-append storage 
[Diagram: a write is appended to the commit log on disk and inserted into an in-memory memtable; full memtables are flushed to disk as immutable SSTables]
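A minimal model of this write path, with invented class names (the real memtable and SSTable formats are far more elaborate):

```java
import java.util.*;

public class WritePath {
    final List<String> commitLog = new ArrayList<>();         // append-only, on disk
    NavigableMap<String, String> memtable = new TreeMap<>();  // sorted, in memory
    final List<NavigableMap<String, String>> sstables = new ArrayList<>(); // immutable

    // A write is a sequential log append plus an in-memory insert:
    // no random disk I/O, which is why C* writes are fast.
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
    }

    // When the memtable fills up, it is flushed to disk as an
    // immutable, sorted SSTable and a fresh memtable is started.
    void flush() {
        sstables.add(Collections.unmodifiableNavigableMap(memtable));
        memtable = new TreeMap<>();
    }

    public static void main(String[] args) {
        WritePath wp = new WritePath();
        wp.write("icbm_432:30", "SF,Idle");
        wp.write("icbm_432:45", "SF,Idle");
        wp.flush();
        wp.write("icbm_900:30", "Boston,Idle");
        System.out.println(wp.sstables.size() + " sstable(s), "
                + wp.memtable.size() + " row(s) in memtable");
    }
}
```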
Deletes in a distributed 
System are Challenging 
We need to keep records of 
deletions in case of network 
partitions 
[Diagram: Node2 suffers a power outage; over time, Node1 records each delete as a tombstone so Node2 can learn about the deletions when it comes back]
Compactions merge and
unify data in our SSTables
[Diagram: SSTable 1 + SSTable 2 are compacted into SSTable 3]
Since SSTables are immutable 
this is our chance to 
consolidate rows and remove 
tombstones (After GC Grace)
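The merge-and-drop-tombstones step can be sketched like this. It is a toy model: `Cell`, `compact`, and the timestamp-based GC cutoff are invented for illustration:

```java
import java.util.*;

public class Compaction {
    // A cell is a value plus its write timestamp; value == null is a tombstone.
    static class Cell {
        final long ts; final String value;
        Cell(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // Merge SSTables last-write-wins, then drop tombstones older than
    // the GC grace cutoff (they are no longer needed for repair).
    static Map<String, Cell> compact(List<Map<String, Cell>> sstables, long gcGraceCutoff) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> sstable : sstables)
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                Cell seen = merged.get(e.getKey());
                if (seen == null || e.getValue().ts > seen.ts)
                    merged.put(e.getKey(), e.getValue());
            }
        merged.values().removeIf(c -> c.value == null && c.ts < gcGraceCutoff);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> t1 = new HashMap<>();
        t1.put("a", new Cell(10, "v1"));
        t1.put("b", new Cell(10, "v1"));
        Map<String, Cell> t2 = new HashMap<>();
        t2.put("a", new Cell(20, "v2"));
        t2.put("b", new Cell(20, null)); // delete of b
        Map<String, Cell> merged = compact(Arrays.asList(t1, t2), 30);
        System.out.println(merged.keySet()); // [a] : b's tombstone was past GC grace
    }
}
```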
Layout of Data Allows for Rapid 
Queries Along Clustering Columns 
ID: ICBM_432  | Time: 30, Loc: SF, Status: Idle     | Time: 45, Loc: SF, Status: Idle     | Time: 60, Loc: SF, Status: Idle
ID: ICBM_900  | Time: 30, Loc: Boston, Status: Idle | Time: 45, Loc: Boston, Status: Idle | Time: 60, Loc: Boston, Status: Idle
ID: ICBM_9210 | Time: 30, Loc: Tulsa, Status: Idle  | Time: 45, Loc: Tulsa, Status: Idle  | Time: 60, Loc: Tulsa, Status: Idle
Disclaimer: Not exactly like this (Use sstable2json to see real layout)
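This layout can be modeled as a sorted map per partition, which is exactly what makes clustering-column slices fast. A toy in-memory model, not the real storage engine; the names are invented:

```java
import java.util.*;

public class ClusteredPartition {
    // partition key -> (clustering key -> rest of row); clustering
    // keys are kept sorted, so a slice is a contiguous scan
    final Map<String, NavigableMap<Integer, String>> partitions = new HashMap<>();

    void insert(String id, int time, String row) {
        partitions.computeIfAbsent(id, k -> new TreeMap<>()).put(time, row);
    }

    // "WHERE id = ? AND time >= ? AND time <= ?" reads one contiguous
    // range inside a single partition.
    SortedMap<Integer, String> slice(String id, int from, int to) {
        return partitions.getOrDefault(id, new TreeMap<>()).subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        ClusteredPartition t = new ClusteredPartition();
        t.insert("ICBM_432", 30, "SF,Idle");
        t.insert("ICBM_432", 45, "SF,Idle");
        t.insert("ICBM_432", 60, "SF,Idle");
        System.out.println(t.slice("ICBM_432", 40, 60).keySet()); // [45, 60]
    }
}
```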
CQL allows easy definition 
of Table Structures 
ID: ICBM_432 | Time: 30, Loc: SF, Status: Idle | Time: 45, Loc: SF, Status: Idle | Time: 60, Loc: SF, Status: Idle
CREATE TABLE icbmlog ( 
name text, 
time timestamp, 
location text, 
status text, 
PRIMARY KEY (name,time) 
);
Reading data is FAST but 
limited by disk IO 
[Diagram: a read consults the memtables in memory and the SSTables on disk; the versions of a row are merged with last-write-wins (LWW) across replicas and returned to the client]
Reading data is FAST but 
limited by disk IO 
[Diagram: same read path as before, but after merging versions with last-write-wins (LWW), the coordinator performs a read repair, writing the winning version back to any stale replica]
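Last-write-wins merging plus read repair can be sketched as follows, with replicas modeled as plain maps (all names invented for illustration):

```java
import java.util.*;

public class ReadRepair {
    // A versioned value: the write with the newest timestamp wins (LWW).
    static class Versioned {
        final long ts; final String value;
        Versioned(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // The coordinator reads from every replica, picks the newest
    // version, and writes it back to any replica that was stale.
    static Versioned read(List<Map<String, Versioned>> replicas, String key) {
        Versioned newest = null;
        for (Map<String, Versioned> replica : replicas) {
            Versioned v = replica.get(key);
            if (v != null && (newest == null || v.ts > newest.ts)) newest = v;
        }
        if (newest != null)
            for (Map<String, Versioned> replica : replicas) {
                Versioned v = replica.get(key);
                if (v == null || v.ts < newest.ts) replica.put(key, newest); // read repair
            }
        return newest;
    }

    public static void main(String[] args) {
        Map<String, Versioned> r1 = new HashMap<>();
        Map<String, Versioned> r2 = new HashMap<>();
        r1.put("planes", new Versioned(1, "1"));
        r2.put("planes", new Versioned(2, "2")); // the newer write only reached r2
        Versioned answer = read(Arrays.asList(r1, r2), "planes");
        System.out.println(answer.value);           // 2
        System.out.println(r1.get("planes").value); // 2 : r1 was repaired
    }
}
```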
New Clients provide a 
holistic view of the C* cluster 
[Diagram: the client makes initial contact with a single node and from it discovers the whole ring of nodes holding ABD, ABC, ACD, and BCD]
Cluster.builder().addContactPoint("127.0.0.1").build()
Session Objects Are used 
for Executing Requests 
session = cluster.connect() 
session.execute("DROP KEYSPACE IF EXISTS icbmkey") 
session.execute("CREATE KEYSPACE icbmkey WITH replication =
    {'class':'SimpleStrategy', 'replication_factor':'1'}")
For highest throughput use asynchronous methods 
ResultSetFuture executeAsync(Query query) 
Then add a callback or Queue the ResultSetFutures 
Token-aware policies reduce the
number of intra-cluster requests
[Diagram: a token-aware client sends the request for partition A directly to a node that owns A, skipping an extra coordinator hop]
Prepared statements allow for 
sending less data over the wire 
The query is prepared on all nodes by the driver
Prepared batch statements 
can further improve throughput 
PreparedStatement ps = session.prepare("INSERT INTO messages (user_id, msg_id, title, body) VALUES (?, ?, ?, ?)"); 
BatchStatement batch = new BatchStatement(); 
batch.add(ps.bind(uid, mid1, title1, body1)); 
batch.add(ps.bind(uid, mid2, title2, body2)); 
batch.add(ps.bind(uid, mid3, title3, body3)); 
session.execute(batch);
Avoid 
• Preparing statements more than once 
• Creating batches which are too large 
• Running statements in serial 
• Using consistency-levels above your need 
• Secondary Indexes in your main queries 
• or really at all unless you are doing analytics
Have fun with C* 
Questions?

Cassandra Fundamentals - C* 2.0
