Cassandra: Open Source Bigtable + Dynamo

Cassandra is a highly scalable, eventually consistent, distributed, structured columnfamily store with no single points of failure, initially open-sourced by Facebook and now part of the Apache Incubator. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7975

Published in: Technology, News & Politics

Transcript of "Cassandra: Open Source Bigtable + Dynamo"

  1. Cassandra
      Jonathan Ellis
  2. Motivation
      ● Scaling reads to a relational database is hard
      ● Scaling writes to a relational database is virtually impossible
      ● … and when you do, it usually isn't relational anymore
  3. The new face of data
      ● Scale out, not up
      ● Online load balancing, cluster growth
      ● Flexible schema
      ● Key-oriented queries
      ● CAP-aware
  4. CAP theorem
      ● Pick two of Consistency, Availability, Partition tolerance
  5. Two famous papers
      ● Bigtable: A distributed storage system for structured data, 2006
      ● Dynamo: Amazon's highly available key-value store, 2007
  6. Two approaches
      ● Bigtable: “How can we build a distributed db on top of GFS?”
      ● Dynamo: “How can we build a distributed hash table appropriate for the data center?”
  7. 10,000 ft summary
      ● Dynamo partitioning and replication
      ● Log-structured ColumnFamily data model similar to Bigtable's
  8. Cassandra highlights
      ● High availability
      ● Incremental scalability
      ● Eventually consistent
      ● Tunable tradeoffs between consistency and latency
      ● Minimal administration
      ● No SPOF (single point of failure)
  9. Dynamo architecture & Lookup
  10. Architecture details
      ● O(1) node lookup
      ● Explicit replication
      ● Eventually consistent
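
The "O(1) node lookup" and "explicit replication" bullets can be illustrated with a toy token ring. This is a minimal sketch, not Cassandra's code: the `Ring` class, the MD5-based `token` function, and the node names are all assumptions made for illustration. The idea is that every node knows the full ring, so any node can work out a key's replicas locally, with no extra routing hops.

```python
import bisect
import hashlib

def token(name: str) -> int:
    """Map a key or node name to a position on the ring (assumed hash: MD5)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # Each node is assumed to hold this full sorted token list,
        # so routing a request needs no intermediate hops.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key: str, n: int = 3):
        """Return up to n distinct nodes for key, walking clockwise from its token."""
        start = bisect.bisect_right(self.tokens, (token(key), ""))
        count = min(n, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", n=3)
print(owners)
```

The local search is a binary search over the token list; the "O(1)" in the slide refers to network hops, not to this in-memory step.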
  11. Architecture layers
      Messaging service   Commit log   Tombstones
      Gossip              Memtable     Hinted handoff
      Failure detection   SSTable      Read repair
      Cluster state       Indexes      Bootstrap
      Partitioner         Compaction   Monitoring
      Replication                      Admin tools
  12. Writes
      ● Any node
      ● Partitioner
      ● Commitlog, memtable
      ● SSTable
      ● Compaction
      ● Wait for W responses
  13. Memtable / SSTable
      (diagram; labels: Disk, Commit log)
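
The write path on the last two slides (commit log, then memtable, then a flush to an SSTable) might look roughly like the following. This is a simplified sketch under stated assumptions: `ColumnFamilyStore`, `memtable_limit`, and the use of Python lists in place of on-disk files are all hypothetical stand-ins.

```python
class ColumnFamilyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # in memory: key -> (value, timestamp)
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        # Append to the log first (sequential I/O), then update memory.
        self.commit_log.append((key, value, timestamp))
        self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once and write out; the SSTable is never modified afterwards.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []   # flushed entries no longer need replay

store = ColumnFamilyStore(memtable_limit=2)
store.write("keyC", "v1", 1)
store.write("keyA", "v2", 2)   # triggers a flush
store.write("keyB", "v3", 3)
print(store.sstables, store.memtable)
```

Note that a write never reads existing data or seeks into an existing file, which is what the later "No reads, no seeks" slide relies on.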
  14. SSTable format
      ● Key / data
  15. SSTable Indexes
      ● Bloom filter
      ● Key
      ● Column
      (Similar to Hadoop MapFile / TFile)
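
A Bloom filter lets a read skip any SSTable that definitely does not contain a key. The sketch below shows the general technique only; the sizes, the SHA-256-based hashing, and the `BloomFilter` class are assumptions, not Cassandra's actual implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("keyA")
print(bf.might_contain("keyA"))
```

A `True` answer can be a false positive, so the key index still has to be consulted; a `False` answer avoids the disk access entirely.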
  16. Compaction
      ● Merge keys
      ● Combine columns
      ● Discard tombstones
  17. Remove
      ● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
      ● Read repair complicates things a little
      ● Eventually consistent complicates things more
      ● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
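
Compaction and tombstone handling can be sketched together. In this simplified model each SSTable is a dict of key to `(value, timestamp)`, and `TOMBSTONE`, `compact`, and `gc_grace` are hypothetical names; real compaction works on rows of columns and on sorted files, not dicts.

```python
TOMBSTONE = object()  # sentinel marking a deleted entry

def compact(sstables, now, gc_grace):
    """Merge SSTables (dicts of key -> (value, timestamp)) into one table."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)  # newest timestamp wins
    # Discard a tombstone only once it is older than the grace period;
    # younger tombstones must survive so they can still suppress stale data.
    return {
        key: (value, ts)
        for key, (value, ts) in sorted(merged.items())
        if not (value is TOMBSTONE and now - ts > gc_grace)
    }

old = {"keyA": ("v1", 10), "keyB": ("v1", 10)}
new = {"keyA": (TOMBSTONE, 20), "keyC": (TOMBSTONE, 95)}
result = compact([old, new], now=100, gc_grace=30)
print(sorted(result))
```

Here `keyA` disappears entirely (its tombstone shadowed the old value and then aged out), while the recent tombstone for `keyC` is kept.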
  18. Cassandra write properties
      ● No reads
      ● No seeks
      ● Fast
      ● Atomic within ColumnFamily
      ● Always writable
  19. Read path
      ● Any node
      ● Partitioner
      ● Wait for R responses
      ● Wait for N – R responses in the background and perform read repair
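
The read path with read repair might be sketched as follows. This is an illustration under assumptions: replicas are plain dicts, the `read` function is hypothetical, and the "background" step runs inline for simplicity.

```python
def read(replicas, key, r):
    """replicas: list of dicts key -> (value, timestamp).
    Returns the newest version among the first r replicas, then repairs all."""
    responses = [rep[key] for rep in replicas[:r] if key in rep]
    result = max(responses, key=lambda vt: vt[1]) if responses else None

    # Background step (inline here): consult the remaining replicas and
    # push the newest version to any replica that is missing it or stale.
    versions = [rep[key] for rep in replicas if key in rep]
    if versions:
        newest = max(versions, key=lambda vt: vt[1])
        for rep in replicas:
            if key not in rep or rep[key][1] < newest[1]:
                rep[key] = newest
    return result

replicas = [{"k": ("old", 1)}, {"k": ("new", 2)}, {}]
first = read(replicas, "k", r=1)    # may be stale: only one replica consulted
second = read(replicas, "k", r=1)   # repaired by the previous read
print(first, second)
```

With R=1 the first read can return a stale value, but it leaves every replica up to date for subsequent reads.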
  20. Cassandra read properties
      ● Read multiple SSTables
      ● Slower than writes (but still fast)
      ● Seeks can be mitigated with more RAM
      ● Scales to billions of rows
  21. Consistency in a BASE world
      ● If W + R > N, you will have consistency
      ● W=1, R=N
      ● W=N, R=1
      ● W=Q, R=Q where Q = N / 2 + 1
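
The W + R > N rule can be checked with a few lines of arithmetic. The function names below are illustrative; `quorum` uses integer division, which is how Q = N / 2 + 1 is read here.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (integer division)."""
    return n // 2 + 1

def is_consistent(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set."""
    return w + r > n

n = 3
settings = {
    "W=1, R=N": (1, n),
    "W=N, R=1": (n, 1),
    "W=Q, R=Q": (quorum(n), quorum(n)),
    "W=1, R=1": (1, 1),
}
for name, (w, r) in settings.items():
    print(name, is_consistent(w, r, n))
```

The first three settings guarantee that a read sees the latest acknowledged write; W=1, R=1 trades that guarantee for lower latency.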
  22. vs MySQL with 50GB of data
      ● MySQL
          ● ~300ms write
          ● ~350ms read
      ● Cassandra
          ● ~0.12ms write
          ● ~15ms read
      ● Achtung!
  23. Data model
      ● Rows, ColumnFamilies, Columns
  24. ColumnFamilies
      keyA   column1   column2   column3
      keyC   column1   column7   column11

      Column: byte[] name, byte[] value, i64 timestamp
  25. Super ColumnFamilies
      keyF   Super1                     Super2
             column  column  column     column  column  column
      keyJ   Super1                     Super5
             column  column  column     column  column  column
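
One way to picture the data model is as nested maps. The sketch below is only a mental model built from plain Python dicts; the values, timestamps, and the `get_column` helper are made up for illustration.

```python
# ColumnFamily: row key -> column name -> (value, timestamp)
standard_cf = {
    "keyA": {"column1": (b"a1", 1), "column2": (b"a2", 1), "column3": (b"a3", 1)},
    "keyC": {"column1": (b"c1", 2), "column7": (b"c7", 2), "column11": (b"c11", 2)},
}

# Super ColumnFamily: row key -> supercolumn -> column name -> (value, timestamp)
super_cf = {
    "keyF": {
        "Super1": {"column1": (b"x", 3), "column2": (b"y", 3)},
        "Super2": {"column1": (b"z", 4)},
    },
}

def get_column(column_family, key, *path):
    """Follow a column path: (column,) or (supercolumn, column)."""
    node = column_family[key]
    for name in path:
        node = node[name]
    return node

value, timestamp = get_column(super_cf, "keyF", "Super2", "column1")
print(value, timestamp)
```

Rows in the same ColumnFamily need not share column names, which is the "flexible schema" point from earlier in the talk.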
  26. Types of queries
      ● Single column
      ● Slice
          ● Set of names / range of names
          ● Simple slice -> columns
          ● Super slice -> supercolumns
      ● Key range
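
The two kinds of slice (by a set of names, or by a range of names) can be sketched against a single row. This assumes a row is a dict whose column names sort as plain strings; `slice_by_names` and `slice_by_range` are hypothetical helpers, not the Thrift API.

```python
import bisect

def slice_by_names(row, names):
    """Return only the requested columns that exist in the row."""
    return {n: row[n] for n in names if n in row}

def slice_by_range(row, start, finish):
    """Return columns whose names fall in [start, finish], in sorted order."""
    names = sorted(row)
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(n, row[n]) for n in names[lo:hi]]

row = {"2009-07-01": "post a", "2009-07-15": "post b", "2009-08-02": "post c"}
by_names = slice_by_names(row, ["2009-08-02", "2009-09-09"])
by_range = slice_by_range(row, "2009-07-01", "2009-07-31")
print(by_names, by_range)
```

Range slices are cheap when columns are stored sorted by name, which is why the column comparator matters in the blog example later in the talk.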
  27. Range queries
      ● Add “master” server
      ● Implement on top of K/V
      ● Order-preserving partitioning
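
The difference between hash partitioning and order-preserving partitioning for key-range scans can be sketched as below. The boundary keys, node numbering, and function names are assumptions chosen for illustration.

```python
import bisect
import hashlib

# Hypothetical split points: node 0 owns keys < "g", node 1 keys < "n",
# node 2 keys < "t", and node 3 everything else.
BOUNDARIES = ["g", "n", "t"]

def opp_node(key: str) -> int:
    """Order-preserving partitioner: adjacent keys land on the same or adjacent nodes."""
    return bisect.bisect_right(BOUNDARIES, key)

def hash_node(key: str, num_nodes: int = 4) -> int:
    """Hash partitioner: adjacent keys are scattered across the cluster."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_nodes

def nodes_for_range(start: str, stop: str):
    """With order preservation, a key range maps to a contiguous run of nodes."""
    return list(range(opp_node(start), opp_node(stop) + 1))

print(nodes_for_range("ha", "hz"))        # a narrow range touches one node
print(nodes_for_range("apple", "house"))  # a wider range touches two
```

With the hash partitioner the same scan would have to ask every node, since `hash_node` gives no ordering guarantee. The usual cost of order preservation is uneven load when keys cluster.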
  28. Modification
      ● Insert / update
      ● Remove
      ● Single column or batch
      ● Specify W, number of nodes to wait for
  29. Thrift
      struct Column {
         1: binary name,
         2: binary value,
         3: i64 timestamp,
      }
      struct SuperColumn {
         1: binary name,
         2: list<Column> columns,
      }
      Column get_column(table, key, column_path, block_for=1)
      list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
      void insert(table, key, column_path, value, timestamp, block_for=0)
      void remove(tablename, key, column_path_or_parent, timestamp)
  30. Honestly, Thrift kinda sucks
  31. Example: a multiuser blog
      Two queries
      - the most recent posts belonging to a given blog, in reverse chronological order
      - a single post and its comments, in chronological order
  32. First try
      JBE blog    Cassandra is teh awesome      BASE FTW
                  post  comment  comment        post  comment  comment
      Evan blog   I like kittens                And Ruby
                  post  comment  comment        post  comment  comment

      <ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>
  33. Second try
      Blog:
      JBE blog    Cassandra is teh awesome      BASE FTW
      Evan blog   I like kittens                And Ruby

      <ColumnFamily CompareWith="UUIDType" Name="Blog"/>

      Comment:
      Cassandra is teh awesome   comment  comment
      Base FTW                   comment  comment
      I like kittens             comment  comment
      And Ruby                   comment  comment

      <ColumnFamily CompareWith="UUIDType" Name="Comment"/>
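
The two-ColumnFamily layout of the second try can be mimicked in memory to show how it answers both queries. This is a sketch under assumptions: integer ids stand in for time-ordered UUID column names, comments are keyed by post id rather than by title, and the sample comment text is invented.

```python
# Hypothetical stand-ins for the Blog and Comment column families.
# Larger id = created later (the role a time-based UUID would play).
blog = {"JBE blog": {101: "Cassandra is teh awesome", 205: "BASE FTW"}}
comment = {101: {110: "first!", 150: "nice post"}, 205: {210: "+1"}}

def recent_posts(blog_key, limit=10):
    """Query 1: most recent posts of a blog, newest first."""
    row = blog[blog_key]
    return [(pid, row[pid]) for pid in sorted(row, reverse=True)[:limit]]

def post_with_comments(blog_key, post_id):
    """Query 2: a single post and its comments, oldest comment first."""
    comments = comment.get(post_id, {})
    return blog[blog_key][post_id], [comments[cid] for cid in sorted(comments)]

print(recent_posts("JBE blog"))
print(post_with_comments("JBE blog", 101))
```

Each query reads one row from one ColumnFamily (or one from each), and the column ordering supplies the chronological order without any sorting at query time in the real store.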
  34. Roadmap
  35. Cassandra 0.3
      ● Remove support
      ● OPP / Range queries
      ● Test suite
      ● Workarounds for JDK bugs
      ● Rudimentary multi-datacenter support
  36. Cassandra 0.4
      ● Branched May 18
      ● Data file format change to support billions of rows per node instead of millions
      ● API changes (no more colon delimiters)
      ● Multi-table (keyspace) support
      ● LRU key cache
      ● fsync support
      ● Bootstrap
      ● Web interface
  37. Cassandra 0.5
      ● Bootstrap
      ● Load balancing
          ● Closely related to “bootstrap done right”
      ● Merkle tree repair
      ● Millions of columns per row
          ● This will require another data format change
      ● Multiget
      ● Callout support
  38. Users
      Production: Facebook, RocketFuel
      Production RSN: Digg, Rackspace
      No date yet: IBM Research, Twitter
      Evaluating: 50+ in #cassandra on freenode
  39. More
      ● Eventual consistency: http://www.allthingsdistributed.com/2008/12/
      ● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059
      ● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndP
      ● #cassandra on irc.freenode.net
  40. Cassandra