Outside The Box With Apache Cassandra

Eric Evans
eevans@rackspace.com
@jericevans

Palmetto Open Source Software Conference
April 16, 2010

Cassandra is...

A massively scalable, decentralized, structured data store (aka
database).

Outline

1 Background

2 Project History

3 Description

4 Case Studies

5 Roadmap

CAP Theorem (aka Brewer’s Theorem)

Distributed systems cannot provide all three of:
• Consistency
• Availability
• Partition Tolerance

Influential Papers

Dynamo: Amazon’s Highly Available Key-value Store [1]

• Voldemort
• Riak

Bigtable: A Distributed Storage System for Structured Data [2]

• Hypertable
• HBase

[1] http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
[2] http://labs.google.com/papers/bigtable-osdi06.pdf

Project History

• 7 new committers added
• Dozens of contributors
• 200+ (!) people on IRC
• Hundreds of closed issues (bugs, features, etc)
• 4 major releases; a number of stable point releases
• Graduation to TLP

Cassandra is...

• O(1) DHT
• Eventual consistency
• Tunable trade-offs, consistency vs. availability

But...

• Values are structured, indexed
• Columns / column families
• Slicing w/ predicates (queries)
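
The "O(1) DHT" bullet above can be pictured as a token ring that every node knows in full, so a request is routed to the owning node in a single hop. The following is a minimal sketch under stated assumptions, not Cassandra's code: the names `TokenRing`, `token_for`, `node_for`, and `replicas_for` are hypothetical, and MD5 is used only as a convenient stand-in for whatever hash the partitioner applies.

```python
import bisect
import hashlib


def token_for(key):
    """Map a key to a ring position (MD5 is an illustrative assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class TokenRing:
    """Toy ring: every node knows every token, so routing takes one hop."""

    def __init__(self, node_tokens):
        # node_tokens maps node name -> token; sort pairs by token.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def node_for(self, key):
        """Return the node holding the first token >= the key's token."""
        i = bisect.bisect_left(self._tokens, token_for(key))
        return self._ring[i % len(self._ring)][1]  # wrap past the last token

    def replicas_for(self, key, n):
        """Return the owner plus the next n-1 nodes clockwise on the ring."""
        start = bisect.bisect_left(self._tokens, token_for(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + k) % len(self._ring)][1] for k in range(count)]


ring = TokenRing({"node-a": 0, "node-b": 2**126, "node-c": 2**127})
owner = ring.node_for("user:42")
replicas = ring.replicas_for("user:42", 2)
```

The lookup is a local binary search over the sorted tokens; the "O(1)" refers to network hops, since no intermediate node has to be asked where a key lives.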

Client API

• Thrift (12 different languages!) [3]
• High-level client libraries
  • Ruby
  • Perl
  • Python (Twisted too)
  • Scala
  • Java
  • PHP
  • Grails
  • C++

[3] http://incubator.apache.org/thrift

Querying

• get(): retrieve by column name
• multiget(): by column name for a set of keys
• get_slice(): by column name, or a range of names
  • returning columns
  • returning super columns
• multiget_slice(): a subset of columns for a set of keys
• get_count(): number of columns or sub-columns
• get_range_slice(): subset of columns for a range of keys
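
To make the calls above concrete, here is a toy in-memory model of one column family. It is only a sketch: the function names mirror the slide, but the signatures are simplified assumptions (the real Thrift calls take further arguments, such as a consistency level), and the `ROWS` data is invented for illustration.

```python
# Toy column family: row key -> {column name -> value}.
ROWS = {
    "alice": {"age": "31", "email": "alice@example.com", "name": "Alice"},
    "bob": {"email": "bob@example.com", "name": "Bob"},
}


def get(key, column):
    """Retrieve a single column by name."""
    return ROWS[key][column]


def get_slice(key, names=None, start="", finish="", count=100):
    """Return columns by explicit names, or by an inclusive name range."""
    row = ROWS.get(key, {})
    if names is not None:
        return [(n, row[n]) for n in names if n in row]
    selected = [
        (n, row[n])
        for n in sorted(row)  # columns are visited in sorted name order
        if n >= start and (finish == "" or n <= finish)
    ]
    return selected[:count]


def multiget_slice(keys, **predicate):
    """Apply one slice predicate to a set of row keys."""
    return {k: get_slice(k, **predicate) for k in keys}


def get_count(key):
    """Count the columns stored under a row key."""
    return len(ROWS.get(key, {}))


def get_range_slice(start_key, finish_key, **predicate):
    """Apply one slice predicate to every row in an inclusive key range."""
    return {
        k: get_slice(k, **predicate)
        for k in sorted(ROWS)  # assumes keys are stored in sorted order
        if start_key <= k <= finish_key
    }
```

For example, `get_slice("alice", start="b", finish="f")` selects only the `email` column, because it is the only name that sorts between "b" and "f".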

Updating

• insert(): add/update column (by key)
• batch_insert(): add/update multiple columns (by key)
• remove(): remove a column
• batch_mutate(): like batch_insert() but can also delete (new for 0.6, deprecates batch_insert())
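
The difference between the two batch calls can be sketched as follows: a batch of mutations may mix insertions and deletions, whereas a batch of inserts cannot delete. The `Insert` and `Delete` classes and the simplified `batch_mutate` signature below are hypothetical stand-ins; the real Thrift structures differ.

```python
from dataclasses import dataclass


@dataclass
class Insert:
    column: str
    value: str


@dataclass
class Delete:
    column: str


def batch_mutate(store, key, mutations):
    """Apply a mixed list of insertions and deletions to one row."""
    row = store.setdefault(key, {})
    for m in mutations:
        if isinstance(m, Insert):
            row[m.column] = m.value  # add or overwrite the column
        elif isinstance(m, Delete):
            row.pop(m.column, None)  # deleting a missing column is a no-op
        else:
            raise TypeError(f"unsupported mutation: {m!r}")
    return row


store = {"alice": {"email": "old@example.com", "tmp": "x"}}
batch_mutate(
    store,
    "alice",
    [Insert("email", "new@example.com"), Insert("name", "Alice"), Delete("tmp")],
)
```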

Column comparators

• TimeUUID
• LexicalUUID
• UTF8
• Long
• Bytes
• ...
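
The comparator decides how column names sort within a row, and therefore what a range slice returns. The sketch below emulates four of the comparators with Python sort keys; the `COMPARATORS` table and `sort_columns` helper are hypothetical illustrations, not Cassandra classes, and column names are assumed to arrive as raw bytes.

```python
import struct
import uuid

# Each entry turns a raw column name (bytes) into a sortable value.
COMPARATORS = {
    "Bytes": lambda name: name,                           # raw byte order
    "UTF8": lambda name: name.decode("utf-8"),            # compare as text
    "Long": lambda name: struct.unpack(">q", name)[0],    # 8-byte signed integer
    "TimeUUID": lambda name: uuid.UUID(bytes=name).time,  # embedded timestamp
}


def sort_columns(names, comparator):
    """Order column names the way the chosen comparator would."""
    return sorted(names, key=COMPARATORS[comparator])


names = [struct.pack(">q", n) for n in (1, -1, 256)]
as_bytes = [struct.unpack(">q", n)[0] for n in sort_columns(names, "Bytes")]
as_longs = [struct.unpack(">q", n)[0] for n in sort_columns(names, "Long")]
# as_bytes is [1, 256, -1]; as_longs is [-1, 1, 256]
```

The same three names come back in a different order because -1 encodes as eight 0xff bytes, which sorts last as raw bytes but first as a signed integer.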

Consistency

CAP Theorem: choose any two of Consistency, Availability, or
Partition tolerance.
• Zero
• One
• Quorum ((N / 2) + 1)
• All
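
The arithmetic behind these levels is small enough to sketch. With N replicas, a quorum is (N / 2) + 1 using integer division, and a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds N. The function names below are hypothetical helpers, not part of any client API.

```python
def quorum(n):
    """Smallest majority of n replicas: (N / 2) + 1 with integer division."""
    return n // 2 + 1


def replicas_required(level, n):
    """Replica acknowledgements needed at each level listed above."""
    return {"ZERO": 0, "ONE": 1, "QUORUM": quorum(n), "ALL": n}[level]


def read_sees_latest_write(n, r, w):
    """True when any r replicas read must overlap any w replicas written."""
    return r + w > n
```

With N = 3 the quorum is 2, so quorum reads combined with quorum writes always overlap (2 + 2 > 3), while reading and writing at One (1 + 1) does not guarantee it. That is the tunable trade-off: lower levels buy availability and latency at the cost of possibly stale reads.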

About writes...

• Atomic within a column family
• Any node
• Always writeable (hinted hand-off)
• Fast
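
Hinted hand-off is what keeps writes possible while a replica is down: the write is held as a hint and delivered once the replica is reachable again. The sketch below is a toy model under assumptions; `Node`, `write`, and `replay_hints` are hypothetical names, and where and how hints are really stored is not covered here.

```python
class Node:
    """Toy node with local data and a list of hints held for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node, key, value) for replicas that were down


def write(coordinator, replicas, key, value):
    """Write to live replicas; keep a hint for each replica that is down."""
    acked = 0
    for node in replicas:
        if node.up:
            node.data[key] = value
            acked += 1
        else:
            coordinator.hints.append((node, key, value))
    return acked


def replay_hints(coordinator):
    """Deliver stored hints to targets that are reachable again."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining
```

A write to three replicas with one of them down returns two acknowledgements immediately; calling `replay_hints` after the third node recovers brings it up to date.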

About reads...

• Any node
• Read repair
• Key cache
• Record cache
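
Read repair can be sketched in a few lines: the versions returned by the replicas are compared, the newest one is returned to the client, and stale replicas are overwritten with it. This sketch assumes each value carries a write timestamp and that the highest timestamp wins; `Versioned` and `read_with_repair` are hypothetical names, and replicas are modelled as plain dictionaries.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: int


def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies."""
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda v: v.timestamp)
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # repair a stale or missing copy
    return newest.value


replicas = [{"k": Versioned("old", 1)}, {"k": Versioned("new", 2)}, {}]
result = read_with_repair(replicas, "k")
```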

Case 1: Digg

Digg is a social news site that allows people to discover and share
content from anywhere on the Internet by submitting stories and
links, and voting and commenting on submitted stories and links.

Ranked 98th by Alexa.com.

Problem

• Terabytes of data; high transaction rate (read-dominated)
• Multiple clusters; heavily sharded
• Management nightmare (high effort, error prone)
• Unsatisfied availability requirements (geographic isolation)

Solution

• Currently in production for "Green Badges"
• Cassandra as primary data store RSN
• Datacenter and rack-aware replication

Case 2: Twitter

Twitter is a social networking and microblogging service that
enables its users to send and read tweets, text-based posts of up to
140 characters.

Ranked 12th by Alexa.com.

MySQL

• Terabytes of data, ~1,000,000 ops/s
• Calls for heavy sharding, light replication
• Schema changes are very difficult (if possible at all)
• Manual sharding is very high effort
• Automated sharding and replication is Hard

Case 3: Facebook

Facebook is a social networking site where users can create a
profile, add friends, and send them messages. Users can also join
groups organized by location or other points of common interest.

Ranked #2 by Alexa.com.

Inbox Search

• 100 TB
• 160 nodes
• 1/2 billion writes per day (2yr old number?)

0.6

• batch_mutate command
• authentication (basic)
• new consistency level, ANY
• fat client
• mmapped I/O reads (default on 64-bit JVM)
• improved write concurrency (HH)
• networking optimizations
• row caching
• improved management tools
• per-keyspace replication factor

0.7

• more efficient compactions (row sizes bigger than memory)
• easier (dynamic?) column family changes
• SSTable versioning
• SSTable compression
• support for column family truncation
• improved configuration handling
• remove key range command
• even more improved management tools
• vector clocks w/ server-side conflict resolution
