Cassandra: Open Source Bigtable + Dynamo

Cassandra is a highly scalable, eventually consistent, distributed, structured columnfamily store with no single points of failure, initially open-sourced by Facebook and now part of the Apache Incubator. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7975

Transcript

  • 1. Cassandra Jonathan Ellis
  • 2. Motivation ● Scaling reads to a relational database is hard ● Scaling writes to a relational database is virtually impossible ● … and when you do, it usually isn't relational anymore
  • 3. The new face of data ● Scale out, not up ● Online load balancing, cluster growth ● Flexible schema ● Key-oriented queries ● CAP-aware
  • 4. CAP theorem ● Pick two of Consistency, Availability, Partition tolerance
  • 5. Two famous papers ● Bigtable: A Distributed Storage System for Structured Data, 2006 ● Dynamo: Amazon's Highly Available Key-value Store, 2007
  • 6. Two approaches ● Bigtable: “How can we build a distributed db on top of GFS?” ● Dynamo: “How can we build a distributed hash table appropriate for the data center?”
  • 7. 10,000 ft summary ● Dynamo partitioning and replication ● Log-structured ColumnFamily data model similar to Bigtable's
  • 8. Cassandra highlights ● High availability ● Incremental scalability ● Eventually consistent ● Tunable tradeoffs between consistency and latency ● Minimal administration ● No single point of failure (SPOF)
  • 9. Dynamo architecture & Lookup
  • 10. Architecture details ● O(1) node lookup ● Explicit replication ● Eventually consistent
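The Dynamo-style partitioning and explicit replication named above can be sketched as a consistent-hashing ring. This is a minimal illustration, not Cassandra's implementation: `token`, `Ring`, `nodes_for`, the node names, and the choice of MD5 are all assumptions made for the example. "O(1) lookup" is read here as "zero network hops": every node holds the full token list, so finding the owners is a purely local search.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key (or node name) to a position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replicas=3):
        # Every node knows every token, so no routing hops are needed.
        self.replicas = min(replicas, len(nodes))
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def nodes_for(self, key):
        """Return the N owners: the first node clockwise from the key, then its successors."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(self.replicas)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.nodes_for("user:42")  # three distinct nodes, stable across calls
```

Adding a node to such a ring only moves the keys between the new token and its predecessor, which is what makes incremental growth cheap.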
  • 11. Architecture layers ● Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication ● Commit log, Memtable, SSTable, Indexes, Compaction ● Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools
  • 12. Writes ● Any node ● Partitioner ● Commitlog, memtable ● SSTable ● Compaction ● Wait for W responses
  • 13. Memtable / SSTable Disk Commit log
  • 14. SSTable format ● Key / data
  • 15. SSTable Indexes ● Bloom filter ● Key ● Column (Similar to Hadoop MapFile / TFile)
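To illustrate why a Bloom filter is useful as an SSTable index, here is a small sketch. The class name, sizes, and the SHA-256-based hashing are assumptions for the example and say nothing about Cassandra's actual on-disk format. The point is the asymmetry: a "no" is definite, so the SSTable can be skipped without a disk seek, while a "yes" only means the key might be present.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add(b"keyA")
```

A read would consult `might_contain` for each SSTable and open only those that answer True.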
  • 16. Compaction ● Merge keys ● Combine columns ● Discard tombstones
  • 17. Remove ● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction ● Read repair complicates things a little ● Eventual consistency complicates things more ● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
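Compaction and tombstone handling can be sketched together. The model below is an assumption made for illustration: an SSTable is a plain dict, a value of `None` stands in for a tombstone, and `compact`, `now`, and `gc_grace` are hypothetical names. The newest timestamp wins per column, and a tombstone is discarded only once it is older than the configured delay.

```python
TOMBSTONE = None  # hypothetical marker: a value of None means "deleted"


def compact(sstables, now, gc_grace):
    """Merge SSTables into one, keeping the newest version of each column.

    Each SSTable is modeled as {row_key: {column_name: (value, timestamp)}}.
    """
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})
            for name, (value, ts) in columns.items():
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)

    result = {}
    for key, row in merged.items():
        # Drop tombstones only after the grace period has passed.
        kept = {name: (value, ts) for name, (value, ts) in row.items()
                if not (value is TOMBSTONE and now - ts > gc_grace)}
        if kept:
            result[key] = kept
    return result


old = {"keyA": {"column1": ("v1", 10), "column2": ("v2", 10)}}
new = {"keyA": {"column1": (TOMBSTONE, 20), "column3": ("v3", 30)}}
early = compact([old, new], now=25, gc_grace=10)   # tombstone still kept
late = compact([old, new], now=100, gc_grace=10)   # tombstone collected
```

Keeping the tombstone during the grace period is what lets it reach replicas that missed the delete before it disappears.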
  • 18. Cassandra write properties ● No reads ● No seeks ● Fast ● Atomic within ColumnFamily ● Always writable
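The write path from the preceding slides (commit log, memtable, SSTable) explains the "no reads, no seeks" properties. The sketch below is a single-node toy under stated assumptions: `Node`, `memtable_limit`, and the in-memory lists standing in for disk files are all hypothetical.

```python
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # in memory: {row_key: {column: (value, timestamp)}}
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log for durability (sequential, no seek).
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the memtable; data already on disk is never read.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out, sorted by key, as a new immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay


node = Node(memtable_limit=2)
node.write("keyC", "column1", "v", 1)
node.write("keyA", "column1", "v", 2)  # second row triggers a flush
```

Because a write only appends and updates memory, older SSTables are left untouched until compaction merges them.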
  • 19. Read path ● Any node ● Partitioner ● Wait for R responses ● Wait for N – R responses in the background and perform read repair
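Read repair can be sketched as follows, assuming each replica is a plain dict and that the newest timestamp wins. The function name and data layout are hypothetical. The caller is answered from the first R responses; the remaining N - R responses are then compared and stale replicas are updated, which in a real system would happen in the background.

```python
def read_with_repair(replicas, key, r):
    """Answer from the first R replicas, then repair all N replicas.

    Each replica is modeled as a dict: {key: (value, timestamp)}.
    """
    responses = [(rep, rep.get(key)) for rep in replicas]
    # The caller only waits for R responses; the newest timestamp wins.
    first_r = [v for _, v in responses[:r] if v is not None]
    answer = max(first_r, key=lambda v: v[1], default=None)

    # Background step: compare all N responses and push the newest
    # version to any replica that is missing it or holds a stale one.
    present = [v for _, v in responses if v is not None]
    newest = max(present, key=lambda v: v[1], default=None)
    if newest is not None:
        for rep, v in responses:
            if v is None or v[1] < newest[1]:
                rep[key] = newest
    return answer


a, b, c = {"k": ("old", 1)}, {"k": ("new", 2)}, {}
first = read_with_repair([a, b, c], "k", r=1)   # may be stale with R=1
second = read_with_repair([a, b, c], "k", r=1)  # replicas are repaired by now
```

The first call shows the eventual-consistency trade-off: a low R returns quickly but can serve an old value until repair catches up.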
  • 20. Cassandra read properties ● Read multiple SSTables ● Slower than writes (but still fast) ● Seeks can be mitigated with more RAM ● Scales to billions of rows
  • 21. Consistency in a BASE world ● If W + R > N, you will have consistency ● W=1, R=N ● W=N, R=1 ● W=Q, R=Q where Q = N / 2 + 1
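The rule W + R > N works because any W replicas and any R replicas must then share at least one node, so every read set contains a replica that saw the latest write. A small sketch of the arithmetic, with hypothetical function names and Q computed by integer division:

```python
def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set."""
    return w + r > n


n = 3
q = quorum(n)  # 2 for N = 3
```

With N = 3, the three configurations on the slide are (W=1, R=3), (W=3, R=1), and (W=2, R=2); W=1 with R=1 is faster but gives no overlap guarantee.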
  • 22. vs MySQL with 50GB of data ● MySQL: ~300ms write, ~350ms read ● Cassandra: ~0.12ms write, ~15ms read ● Achtung!
  • 23. Data model ● Rows, ColumnFamilies, Columns
  • 24. ColumnFamilies ● keyA: column1, column2, column3 ● keyC: column1, column7, column11 ● Column: byte[] name, byte[] value, i64 timestamp
  • 25. Super ColumnFamilies ● keyF: Super1, Super2, each holding several columns ● keyJ: Super1, Super5, each holding several columns
  • 26. Types of queries ● Single column ● Slice ● Set of names / range of names ● Simple slice -> columns ● Super slice -> supercolumns ● Key range
  • 27. Range queries ● Add “master” server ● Implement on top of K/V ● Order-preserving partitioning
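Of the three options, order-preserving partitioning makes a key range a contiguous scan, because keys stay in sorted order instead of being scattered by a hash. The single-node sketch below is an assumption-laden illustration: `OrderedStore` is hypothetical, the parameter names mirror `get_key_range` from the Thrift slide, and treating `stop_at` as inclusive is a guess rather than documented behavior.

```python
import bisect


class OrderedStore:
    """Keys kept in sorted order, as an order-preserving partitioner would place them."""

    def __init__(self):
        self.keys = []

    def insert(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)

    def key_range(self, start_with="", stop_at="", max_results=100):
        lo = bisect.bisect_left(self.keys, start_with)
        # An empty stop_at means "no upper bound".
        hi = bisect.bisect_right(self.keys, stop_at) if stop_at else len(self.keys)
        return self.keys[lo:min(hi, lo + max_results)]


store = OrderedStore()
for k in ["keyC", "keyA", "keyJ", "keyF"]:
    store.insert(k)
```

In a cluster each node would own one contiguous slice of the key space, so a range query touches only the nodes whose slices intersect the range. The cost is that skewed keys produce skewed load, which hashing would have smoothed out.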
  • 28. Modification ● Insert / update ● Remove ● Single column or batch ● Specify W, number of nodes to wait for
  • 29. Thrift
        struct Column {
            1: binary name,
            2: binary value,
            3: i64 timestamp,
        }
        struct SuperColumn {
            1: binary name,
            2: list<Column> columns,
        }
        Column get_column(table, key, column_path, block_for=1)
        list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
        void insert(table, key, column_path, value, timestamp, block_for=0)
        void remove(tablename, key, column_path_or_parent, timestamp)
  • 30. Honestly, Thrift kinda sucks
  • 31. Example: a multiuser blog Two queries - the most recent posts belonging to a given blog, in reverse chronological order - a single post and its comments, in chronological order
  • 32. First try ● [Diagram: one super ColumnFamily; each blog row (JBE, Evan) holds posts as supercolumns ("Cassandra is teh awesome", "BASE FTW"; "I like kittens", "And Ruby"), and each post holds its comments as columns] ● <ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>
  • 33. Second try ● [Diagram: two ColumnFamilies; Blog has one row per blog (JBE, Evan) with posts as columns; Comment has one row per post ("Cassandra is teh awesome", "BASE FTW", "I like kittens", "And Ruby") with comments as columns] ● <ColumnFamily CompareWith="UUIDType" Name="Blog"/> ● <ColumnFamily CompareWith="UUIDType" Name="Comment"/>
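The two-ColumnFamily layout maps directly onto the two queries from the example. The sketch below uses nested dicts as hypothetical in-memory stand-ins. It assumes that Comment rows are keyed by post, that `(timestamp, id)` tuples can stand in for time-ordered UUID column names, and the comment texts are placeholder data, not content from the slides.

```python
# Hypothetical in-memory stand-ins for the two ColumnFamilies.
# Column names are (timestamp, id) tuples standing in for time-based UUIDs.
blog = {       # row key: blog name -> {post column name: post title}
    "JBE": {(1, "p1"): "Cassandra is teh awesome", (2, "p2"): "BASE FTW"},
    "Evan": {(3, "p3"): "I like kittens", (4, "p4"): "And Ruby"},
}
comment = {    # row key: post id -> {comment column name: comment text}
    "p1": {(5, "c1"): "first comment", (6, "c2"): "second comment"},
}


def recent_posts(blog_name, count=10):
    """Query 1: newest posts first (a reversed slice over sorted columns)."""
    columns = blog.get(blog_name, {})
    return [columns[name] for name in sorted(columns, reverse=True)[:count]]


def post_comments(post_id):
    """Query 2: a post's comments in chronological order."""
    columns = comment.get(post_id, {})
    return [columns[name] for name in sorted(columns)]
```

Each query reads a single row and relies only on column ordering, which is the kind of key-oriented access the data model is built for.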
  • 34. Roadmap
  • 35. Cassandra 0.3 ● Remove support ● OPP / Range queries ● Test suite ● Workarounds for JDK bugs ● Rudimentary multi-datacenter support
  • 36. Cassandra 0.4 ● Branched May 18 ● Data file format change to support billions of rows per node instead of millions ● API changes (no more colon delimiters) ● Multi-table (keyspace) support ● LRU key cache ● fsync support ● Bootstrap ● Web interface
  • 37. Cassandra 0.5 ● Bootstrap ● Load balancing ● Closely related to “bootstrap done right” ● Merkle tree repair ● Millions of columns per row ● This will require another data format change ● Multiget ● Callout support
  • 38. Users ● Production: Facebook, RocketFuel ● Production RSN: Digg, Rackspace ● No date yet: IBM Research, Twitter ● Evaluating: 50+ in #cassandra on freenode
  • 39. More ● Eventual consistency: http://www.allthingsdistributed.com/2008/12/ ● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059 ● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndP ● #cassandra on irc.freenode.net
  • 40. Cassandra