Cassandra - An Introduction


Mikio L. Braun
Leo Jugel

TU Berlin, twimpact

LinuxTag Berlin
13 May 2011

LinuxTag Berlin, 13. 5. 2011 (c) 2011 by Mikio L. Braun @mikiobraun, blog.mikiobraun.de

What is NoSQL
● For many web applications, “classical databases”
are not the right choice:
● Database is just used for storing objects.
● Consistency not essential.
● A lot of concurrent access.


NoSQL in comparison
Classical Databases                                   NoSQL
Powerful query language                               Very simple query language
Scales by using larger servers (“scaling up”)         Scales through clustering (“scaling out”)
Changes of database schema very costly                No fixed database schema
ACID: Atomicity, Consistency, Isolation, Durability   Typically only “eventually consistent”
Transactions, locking, etc.                           Typically no support for transactions etc.


Brewer's CAP Theorem
● CAP: Consistency, Availability, Partition
Tolerance
● Consistency: You never get old data.
● Availability: read/write operations always possible.
● Partition Tolerance: other guarantees hold even if
the network between servers breaks.
● You can only have two of these!

Gilbert, Lynch: Brewer's conjecture and the feasibility of consistent, available,
partition-tolerant web services, ACM SIGACT News, Volume 33, Issue 2, June 2002

Homepage: http://cassandra.apache.org
Language: Java
History:
● Developed at Facebook for inbox search,
released as Open Source in July 2008
● Apache Incubator since March 2009
● Apache Top-Level Project since February 2010
Main Properties:
● structured key value store
● “eventually consistent”
● fully equivalent nodes
● cluster can be modified without restarting
Support: DataStax (http://datastax.com)
Licence: Apache 2.0


Version 0.6.x and 0.7.x
● Most important changes in 0.7.x
● config file format changed from XML to YAML
● schema modification (ColumnFamilies) without
restart
● Initial support for secondary indexes
● However, also stability problems initially.


Inspirations for Cassandra
● Amazon Dynamo
● Clustering without dedicated master node
● Peer-to-peer discovery of nodes, Hinted Handoff, etc.
● Google BigTable
● data model
● requires central master node
● Provides much more fine grained control:
– which data should be stored together
– on-the-fly compression, etc.


Installation
● Download tar.gz from
http://cassandra.apache.org/download/
● Unpack
● ./conf contains config files
● ./bin/cassandra -f to start Cassandra, Ctrl-C to
stop


Configuration
● Database
● Version 0.6.x: conf/storage-conf.xml
● Version 0.7.x: conf/cassandra.yaml
● JVM Parameters
● Version 0.6.x: bin/cassandra.in.sh
● Version 0.7.x: conf/cassandra-env.sh


Cassandra's Data Model
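Keyspace (= database)
Column Family (= table)
Row: key → {name1: value1, name2: value2, name3: value3, ...}
● Each name: value pair is a column.
● Columns within a row are sorted by name!
● Rows are sorted according to partitioner.
● Diagram annotations on keys, names, and values: byte arrays, strings.

Super Column Family
key → key → {name1: value1, ...}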
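
The data model can be pictured as nested maps. The following is a minimal Python sketch of that picture, not Cassandra code: the helper names `insert` and `get_slice` only borrow the names of the Thrift calls, and storing each row as a sorted list of byte-string pairs is an assumption made for illustration.

```python
from bisect import insort

# column family -> row key -> list of (column name, value), kept sorted by name
keyspace = {"Person": {}}

def insert(cf, key, name, value):
    """Store one column; names and values are byte strings."""
    row = keyspace[cf].setdefault(key, [])
    # Drop an existing column with the same name, then insert in sorted position.
    row[:] = [(n, v) for (n, v) in row if n != name]
    insort(row, (name, value))

def get_slice(cf, key, start, finish):
    """Return all columns with start <= name <= finish, in name order."""
    return [(n, v) for (n, v) in keyspace[cf].get(key, []) if start <= n <= finish]

insert("Person", b"1", b"name", b"Mikio Braun")
insert("Person", b"1", b"affiliation", b"TU Berlin")
insert("Person", b"1", b"id", b"1")

# Columns come back ordered by name, regardless of insertion order.
columns = get_slice("Person", b"1", b"a", b"j")
print(columns)
```

Because columns are always kept in name order, a range query over names is a contiguous slice of the row.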


Example: Simple Object Store
class Person {
long id;
String name;
String affiliation;
}

Convert fields to byte arrays

Keyspace “MyDatabase”:
ColumnFamily “Person”:
“1”: {“id”: “1”, “name”: “Mikio Braun”, “affiliation”: “TU Berlin”}
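
The step "convert fields to byte arrays" could look roughly like this in Python. This is a sketch under assumptions: the helper names `to_columns` and `from_columns` are hypothetical, and the encodings (8-byte big-endian for the id, UTF-8 for strings) are one possible choice, not something the slides prescribe.

```python
import struct
from dataclasses import dataclass

@dataclass
class Person:
    id: int
    name: str
    affiliation: str

def to_columns(p):
    """Encode each field as a byte array (the encodings are assumptions)."""
    return {
        b"id": struct.pack(">q", p.id),                  # 8-byte big-endian long
        b"name": p.name.encode("utf-8"),
        b"affiliation": p.affiliation.encode("utf-8"),
    }

def from_columns(cols):
    """Rebuild the object from its columns."""
    return Person(
        id=struct.unpack(">q", cols[b"id"])[0],
        name=cols[b"name"].decode("utf-8"),
        affiliation=cols[b"affiliation"].decode("utf-8"),
    )

person = Person(1, "Mikio Braun", "TU Berlin")
row_key = str(person.id).encode("ascii")   # row key "1", as in the example
row = to_columns(person)                   # what would be stored under row_key
restored = from_columns(row)
```

The row key and the column map are all a client needs to store the object, and the same map is enough to read it back.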


Example: Index
class Page {
long id;
… Object data fields
List<Link> links;
}

class Link {
long id;
...
int numberOfHits;
}

Keyspace “MyDatabase”
ColumnFamily “Pages”
“3”: {“id”: 3, …}
“4”: {“id”: 4, …}
ColumnFamily “Links”
“1”: {“id”: 1, “url”: …}
“17”: {“id”: 17, “url”: …}
ColumnFamily “LinksPerPageByNumberOfHits”
“3”: { “00000132:00000001”: “t”, “000025: 00000017”: … }
“4”: { “00000044:00000024”: “t”, … }
Used for both linking and indexing!

Here we exploit that columns are sorted by their names.
Of course, everything is encoded in byte arrays, not ASCII.
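
One way to build such sortable column names is to pack the numbers as fixed-width big-endian integers. The sketch below assumes the first component is the number of hits and the second is the link id; the function names and the 32-bit widths are hypothetical choices for illustration.

```python
import struct

def column_name(number_of_hits, link_id):
    """Pack (hits, link id) as two unsigned big-endian 32-bit integers.

    Big-endian bytes compare in the same order as the numbers themselves,
    so columns sorted by name are sorted by number of hits.
    """
    return struct.pack(">II", number_of_hits, link_id)

def decode(name):
    """Inverse of column_name: returns (number_of_hits, link_id)."""
    return struct.unpack(">II", name)

# Index row for page 3: link 1 has 132 hits, link 17 has 25 hits.
row = {column_name(132, 1): b"t", column_name(25, 17): b"t"}

# Sorting the raw byte names yields ascending number of hits.
by_hits = [decode(name) for name in sorted(row)]
top_link = by_hits[-1]
```

A decimal string without zero padding would not work here, since "25" sorts after "132" as text.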


Are SuperColumnFamilies
necessary?

● Usually, you can replace a SuperColumnFamily
by several ColumnFamilies.
● Since SuperColumnFamilies make the
implementation and the protocol more complex,
there are also people advocating the removal of
SuperCFs.


Cassandra's Architecture

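(Diagram)
Memory: MemTable
Disk: Commit Log, SSTable, SSTable, SSTable
● Write Operation: goes to the Commit Log and the MemTable.
● Flush: the MemTable is written to disk as a new SSTable.
● Read Operation: reads from the MemTable and the SSTables.
● Compaction! SSTables are merged.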
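
The interplay of commit log, MemTable, SSTables, and compaction can be illustrated with a toy model. This is a deliberately simplified Python sketch, not Cassandra's implementation: the class `ToyStore`, its `memtable_limit` parameter, and the way the log is cleared on flush are all assumptions.

```python
class ToyStore:
    """Toy model of the write path; not Cassandra's actual implementation."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only log on disk (here: a list)
        self.memtable = {}        # in memory, holds the most recent writes
        self.sstables = []        # immutable sorted tables on disk, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. make the write durable
        self.memtable[key] = value             # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable table.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []   # simplification: flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:             # later tables override earlier ones
            merged.update(table)
        self.sstables = [sorted(merged.items())]

store = ToyStore()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.write(key, value)
tables_before = len(store.sstables)
store.compact()
```

The sketch also hints at why reads are comparatively expensive: a write only appends and updates memory, while a read may have to look through several SSTables until compaction merges them.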

Cassandra's API
● THRIFT-based API

Read operations
get                  single column
get_slice            range of columns
multiget_slice       range of columns in several rows
get_count            column count
get_range_slice      several columns from range of rows
get_indexed_slices   range of columns from index

Write operations
insert               single column
batch_mutate         several columns in several rows
remove               single column
truncate             whole ColumnFamily

Other
login, describe_*, add/drop column family/keyspace since 0.7.x


Cassandra Clustering
● Fully equivalent nodes, no master node.
● Bootstrapping requires seed node.
(Diagram: a query arrives at one node, whose “Storage Proxy” performs
reads/writes on the other nodes according to consistency level.)


Consistency Level and
Replication Factor
● Replication factor: On how many nodes is a
piece of data stored?

● Consistency level:
ANY            A node has received the operation, even a Hinted Handoff node.
ONE            One node has completed the request.
QUORUM         Operation has completed on majority of nodes / newest result is returned.
LOCAL_QUORUM   QUORUM in local data center
EACH_QUORUM    QUORUM in each data center
ALL            Wait till all nodes have completed the request
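
How replication factor and consistency level interact can be checked with a little arithmetic. The sketch below covers only ONE, QUORUM, and ALL in a single data center; the function names are hypothetical. It uses the standard quorum rule that a read is guaranteed to see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N).

```python
def required_replicas(level, replication_factor):
    """Number of replicas that must complete a request (single data center)."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError("level not covered by this sketch: " + level)

def read_sees_latest_write(read_level, write_level, replication_factor):
    """True if every read set must overlap every write set (R + W > N)."""
    r = required_replicas(read_level, replication_factor)
    w = required_replicas(write_level, replication_factor)
    return r + w > replication_factor

quorum_of_3 = required_replicas("QUORUM", 3)
strong = read_sees_latest_write("QUORUM", "QUORUM", 3)   # 2 + 2 > 3
weak = read_sees_latest_write("ONE", "ONE", 3)           # 1 + 1 <= 3
```

With replication factor 3, QUORUM reads and writes overlap in at least one node, while ONE/ONE is faster but may return old data.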


How to deal with failure
● As long as requirements of the consistency level can be
met, everything is fine.
● Hinted Handoff:
● A write operation for a faulty node is stored on another node and
pushed to the faulty node once it is available again.
● Until then, the hinted data is not readable!
● Read Repair:
● After read operation has completed, data will be compared and
updated on all nodes in the background.
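
Read repair can be sketched as "pick the newest copy, then update the stale ones". In the Python sketch below each replica is modelled as a plain dict and the function name `read_with_repair` is hypothetical. The repair runs inline here, whereas the slide describes it as happening in the background.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and bring stale replicas up to date.

    Each replica is a dict mapping key -> (timestamp, value); the copy with
    the highest timestamp is treated as the newest one.
    """
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    newest = max(versions, key=lambda tv: tv[0])
    for r in replicas:                  # inline here, in the background in Cassandra
        if r.get(key) != newest:
            r[key] = newest
    return newest[1]

node_a = {"user:1": (10, "old name")}
node_b = {"user:1": (20, "new name")}
node_c = {}                             # missed the write entirely
value = read_with_repair([node_a, node_b, node_c], "user:1")
```

After the call all three nodes hold the version with timestamp 20, so a later read at consistency level ONE returns the new value from any of them.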


Libraries
Python
Pycassa: http://github.com/pycassa/pycass
Telephus: http://github.com/driftx/Telephus
Java
Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
Hector: http://github.com/rantav/hector
Kundera: http://code.google.com/p/kundera/
Pelops: http://github.com/s7/scale7-pelops
Grails
grails-cassandra: https://github.com/wolpert/grails-cassandra
.NET
Aquiles: http://aquiles.codeplex.com/
FluentCassandra: http://github.com/managedfusion/fluentcassandra
Ruby
Cassandra: http://github.com/fauna/cassandra
PHP
phpcassa: http://github.com/thobbs/phpcassa
SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

Or roll your own based on THRIFT http://thrift.apache.org/ :)


TWIMPACT: An Application
● Real-time analysis of Twitter
● Trend analysis based on retweets
● Very high data rate (several million tweets per
day, about 50 per second)


TWIMPACT: twimpact.jp


TWIMPACT: twimpact.com


Application Profile
● Information about tweets, users, and retweets
● Text matching for non-API-retweets
● Retweet frequency and user impact
● Operation profile:
Operation                Fraction  Duration
get_slice (all)          50.1%     1.1ms
get                      6.0%      1.7ms
get_slice (range)        0.1%      0.8ms
batch_mutate (one row)   14.9%     0.9ms
insert                   21.5%     1.1ms
batch_mutate             6.8%      0.8ms
remove                   0.8%      1.2ms


Practical Experiences with
Cassandra
● Very stable
● Read operations relatively expensive
● Multithreading leads to a huge performance
increase
● Requires quite extensive tuning
● Clustering doesn't automatically lead to better
performance
● Compaction leads to performance decrease of
up to 50%


Performance through Multithreading
● Multithreading leads to much higher throughput
● How to achieve multithreading without locking
support?
(Plot; values 1, 2, 4, 8, 16, 32, 64; machine: Core i7, 4 cores (2 + 2 HT))


Cassandra Tuning
● Tuning opportunities:
● Size of memtables, thresholds for flushes
● Size of JVM Heap
● Frequency and depth of compaction
● Where?
● MemTableThresholds etc. in conf/cassandra.yaml
● JVM Parameters in conf/cassandra-env.sh


Overview of JVM GC
● Young Generation: “Eden” and “Survivors”, up to a few hundred MB
● Old Generation: dozens of GBs, CMSInitiatingOccupancyFraction
● Additional memory usage while GC is running


Cassandra's Memory Usage

(Plot of memory usage; annotations: Flush; Memtables, indexes, etc.; Compaction.
Settings: Size of Memtable: 128M, JVM Heap: 3G, #CF: 12)

Cassandra's Memory Usage
● Memtables may survive for a very long time (up
to several hours)
● They are placed in the old generation
● GC has to process several dozen GBs
● Heap too small, GC triggered too late
→ “GC storm”
● Trade-off:
● I/O load vs. memory usage
● Do not neglect compaction!


The Effects of GC and Compactions

(Plot; annotations: large GC, compaction)


Cluster vs Single Node
● Our set-up:
● 1-node cluster: six-core CPU and RAID 5 with 6 hard disks
● 4-node cluster: six-core CPU and RAID 0 with 2 hard disks per node
● Single node consistently performs 1.5-3 times better.
● Possible causes:
● Overhead through network communication/consistency levels, etc.
● Hard disk performance is significant
● Cluster still too small
● Effectively available disk space:
● 1-node cluster: 6 * 500 GB = 3 TB, with RAID 5 = 2.5 TB (83%)
● 4-node cluster: 4 * 1 TB = 4 TB, with replication factor 2 = 2 TB (50%)
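
The disk space figures follow from two simple rules: RAID 5 spends the capacity of one disk on parity, and a replication factor of 2 stores every piece of data twice. A short Python check, with hypothetical function names:

```python
def raid5_usable(disks, disk_size_tb):
    """RAID 5 spends the capacity of one disk on parity."""
    return (disks - 1) * disk_size_tb

def replicated_usable(nodes, node_size_tb, replication_factor):
    """Every piece of data is stored replication_factor times."""
    return nodes * node_size_tb / replication_factor

single_raw = 6 * 0.5                               # six 500 GB disks, in TB
single_usable = raid5_usable(6, 0.5)
cluster_raw = 4 * 1.0                              # four nodes with 1 TB each
cluster_usable = replicated_usable(4, 1.0, 2)

single_share = round(100 * single_usable / single_raw)     # percent usable
cluster_share = round(100 * cluster_usable / cluster_raw)  # percent usable
```

The cluster therefore offers less usable space than the single machine despite having more raw capacity.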


Alternatives
● MongoDB, CouchDB, redis, even
memcached...
● Persistence: Disk or RAM?
● Replication: Master/Slave or Peer-to-Peer?
● Sharding?
● Upcoming trend towards more complex query
languages (JavaScript), map-reduce operations,
etc.


Summary: Cassandra
● Platform which scales well
● Active user and developer community
● Read operations quite expensive
● For optimal performance, extensive tuning
necessary
● Depending on your application, eventual
consistency and the lack of transactions/locking might
be problematic.


Links
● Apache Cassandra http://cassandra.apache.org
● Apache Cassandra Wiki
http://wiki.apache.org/cassandra/FrontPage
● DataStax documentation for Cassandra
http://www.datastax.com/docs/0.7/index
● My Blog: http://blog.mikiobraun.de
● Twimpact: http://beta.twimpact.com

