Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

NOSQL and Cassandra

6,853 views

Published on

Published in: Technology, Education
  • Be the first to comment

NOSQL and Cassandra

  1. 1. Introduction to NOSQL And Cassandra @rantav  @outbrain
  2. 2. SQL is good <ul><ul><li>Rich language </li></ul></ul><ul><ul><li>Easy to use and integrate </li></ul></ul><ul><ul><li>Rich toolset  </li></ul></ul><ul><ul><li>Many vendors </li></ul></ul><ul><ul><li>The promise: ACID </li></ul></ul><ul><ul><ul><li>Atomicity </li></ul></ul></ul><ul><ul><ul><li>Consistency </li></ul></ul></ul><ul><ul><ul><li>Isolation </li></ul></ul></ul><ul><ul><ul><li>Durability </li></ul></ul></ul>
  3. 3. SQL Rules
  4. 4. BUT
  5. 5. The Challenge: Modern web apps <ul><ul><li>Internet-scale data size </li></ul></ul><ul><ul><li>High read-write rates </li></ul></ul><ul><ul><li>Frequent schema changes </li></ul></ul><ul><ul><li>&quot;social&quot; apps - not banks </li></ul></ul><ul><ul><ul><li>They don't need the same  level of ACID  </li></ul></ul></ul>SCALING
  6. 6. Scaling Solutions - Replication Scales Reads
  7. 7. Scaling Solutions - Sharding Scales also Writes
  8. 8. Brewer's CAP Theorem:  You can only choose two
  9. 9. CAP
  10. 10. Availability + Partition Tolerance (no Consistency)
  11. 11. Existing NOSQL Solutions
  12. 12. Taxonomy of NOSQL data stores <ul><ul><li>Document Oriented </li></ul></ul><ul><ul><ul><li>CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient </li></ul></ul></ul><ul><ul><li>Key-Value </li></ul></ul><ul><ul><ul><li>Voldemort, Dynamo, Riak (sort of), Redis, Tokyo  </li></ul></ul></ul><ul><ul><li>Column </li></ul></ul><ul><ul><ul><li>Cassandra, HBase, BigTable </li></ul></ul></ul><ul><ul><li>Graph Databases </li></ul></ul><ul><ul><ul><li>  Neo4J, FlockDB, DEX, AlegroGraph </li></ul></ul></ul><ul><li>http://en.wikipedia.org/wiki/NoSQL </li></ul>
  13. 13.   <ul><ul><li>Developed at facebook </li></ul></ul><ul><ul><li>Follows the BigTable Data Model - column oriented </li></ul></ul><ul><ul><li>Follows the Dynamo Eventual Consistency model </li></ul></ul><ul><ul><li>Opensourced at Apache </li></ul></ul><ul><ul><li>Implemented in Java </li></ul></ul>
  14. 14. N/R/W <ul><ul><li>N - Number of replicas (nodes) for any data item </li></ul></ul><ul><ul><li>W - Number or nodes a write operation blocks on </li></ul></ul><ul><ul><li>R - Number of nodes a read operation blocks on </li></ul></ul>CONSISTENCY DOWN TO EARTH
  15. 15. N/R/W - Typical Values <ul><ul><li>W=1 => Block until first node written successfully </li></ul></ul><ul><ul><li>W=N => Block until all nodes written successfully </li></ul></ul><ul><ul><li>W=0 => Async writes </li></ul></ul><ul><ul><li>R=1 => Block until the first node returns an answer </li></ul></ul><ul><ul><li>R=N => Block until all nodes return an answer </li></ul></ul><ul><ul><li>R=0 => Doesn't make sense </li></ul></ul><ul><ul><li>QUORUM: </li></ul></ul><ul><ul><ul><li>R = N/2+1 </li></ul></ul></ul><ul><ul><ul><li>W = N/2+1 </li></ul></ul></ul><ul><ul><ul><li>=> Fully consistent </li></ul></ul></ul>
  16. 16. Data Model - Forget SQL <ul><li>Do you know SQL? </li></ul>
  17. 17. Data Model - Vocabulary <ul><ul><li>Keyspace – like namespace for unique keys. </li></ul></ul><ul><ul><li>Column Family – very much like a table… but not quite. </li></ul></ul><ul><ul><li>Key – a key that represent row (of columns) </li></ul></ul><ul><ul><li>Column – representation of value with: </li></ul></ul><ul><ul><ul><li>Column name </li></ul></ul></ul><ul><ul><ul><li>Value </li></ul></ul></ul><ul><ul><ul><li>Timestamp </li></ul></ul></ul><ul><ul><li>Super Column – Column that holds list of columns inside </li></ul></ul>
  18. 18. Data Model - Columns <ul><li>struct Column { </li></ul><ul><li>    1: required binary name , </li></ul><ul><li>    2: optional binary value , </li></ul><ul><li>    3: optional i64 timestamp , </li></ul><ul><li>    4: optional i32 ttl , </li></ul><ul><li>} </li></ul><ul><li>JSON-ish notation: </li></ul><ul><li>{  </li></ul><ul><li>  &quot;name&quot;:      &quot;emailAddress&quot;,  </li></ul><ul><li>  &quot;value&quot;:     &quot;foo@bar.com&quot;,  </li></ul><ul><li>  &quot;timestamp&quot;: 123456789 } </li></ul>
  19. 19. Data Model - Column Family <ul><ul><li>Similar to SQL tables </li></ul></ul><ul><ul><li>Has many columns </li></ul></ul><ul><ul><li>Has many rows </li></ul></ul>
  20. 20. Data Model - Rows <ul><ul><li>Primary key for objects </li></ul></ul><ul><ul><li>All keys are arbitrary length binaries </li></ul></ul><ul><li>Users:                                 CF </li></ul><ul><li>     ran:                               ROW </li></ul><ul><li>         emailAddress: foo@bar.com,      COLUMN </li></ul><ul><li>         webSite: http://bar.com         COLUMN </li></ul><ul><li>     f.rat:                              ROW </li></ul><ul><li>         emailAddress: f.rat@mouse.com   COLUMN </li></ul><ul><li>Stats:                                  CF </li></ul><ul><li>     ran:                               ROW </li></ul><ul><li>         visits: 243                     COLUMN </li></ul>
  21. 21. Data Model - Songs example <ul><li>Songs:  </li></ul><ul><li>     Meir Ariel:  </li></ul><ul><li>         Shir Keev: 6:13,  </li></ul><ul><li>         Tikva: 4:11, </li></ul><ul><li>         Erol: 6:17 </li></ul><ul><li>         Suetz: 5:30 </li></ul><ul><li>         Dr Hitchakmut: 3:30 </li></ul><ul><li>     Mashina: </li></ul><ul><li>         Rakevet Layla: 3:02 </li></ul><ul><li>         Optikai: 5:40 </li></ul>
  22. 22. Data Model - Super Columns <ul><ul><li>Columns whose values are lists of columns </li></ul></ul>
  23. 23. Data Model - Super Columns <ul><li>Songs:  </li></ul><ul><li>     Meir Ariel: </li></ul><ul><li>         Shirey Hag : </li></ul><ul><li>             Shir Keev: 6:13,  </li></ul><ul><li>             Tikva: 4:11, </li></ul><ul><li>             Erol: 6:17 </li></ul><ul><li>         Vegluy Eynaim :  </li></ul><ul><li>             Suetz: 5:30 </li></ul><ul><li>             Dr Hitchakmut: 3:30 </li></ul><ul><li>     Mashina: </li></ul><ul><li>         ... </li></ul>
  24. 24. The API - Read <ul><li>get </li></ul><ul><li>get_slice </li></ul><ul><li>get_count </li></ul><ul><li>multiget </li></ul><ul><li>multiget_slice </li></ul><ul><li>get_ranage_slices </li></ul><ul><li>get_indexed_slices </li></ul>
  25. 25. The True API <ul><li>get(keyspace, key, column_path, consistency ) </li></ul><ul><li>get_slice( ks, key, column_parent, predicate,  consistency ) </li></ul><ul><li>multiget(ks, keys, column_path,  consistency ) </li></ul><ul><li>multiget_slice( ks, keys, column_parent, predicate,  consistency ) </li></ul><ul><li>... </li></ul>
  26. 26. The API - Write <ul><li>insert </li></ul><ul><li>add </li></ul><ul><li>remove </li></ul><ul><li>remove_counter </li></ul><ul><li>batch_mutate </li></ul>
  27. 27. The API - Meta <ul><li>describe_schema_versions </li></ul><ul><li>describe_keyspaces </li></ul><ul><li>describe_cluster_name </li></ul><ul><li>describe_version </li></ul><ul><li>describe_ring </li></ul><ul><li>describe_partitioner </li></ul><ul><li>describe_snitch </li></ul>
  28. 28. The API - DDL <ul><li>system_add_column_family </li></ul><ul><li>system_drop_column_family </li></ul><ul><li>system_add_keyspace </li></ul><ul><li>system_drop_keyspace </li></ul><ul><li>system_update_keyspace </li></ul><ul><li>system_update_column_family </li></ul>
  29. 29. The API - CQL <ul><li>execute_cql_query </li></ul><ul><li>cqlsh> SELECT key, state FROM users; </li></ul><ul><li>cqlsh> INSERT INTO users (key, full_name, birth_date, state) VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT'); </li></ul>
  30. 30. Consistency Model <ul><ul><li>N - per keyspace </li></ul></ul><ul><ul><li>R - per each read requests </li></ul></ul><ul><ul><li>W - per each write request </li></ul></ul>
  31. 31. Consistency Model <ul><li>Cassandra defines: </li></ul><ul><li>enum ConsistencyLevel { </li></ul><ul><li>    ONE, </li></ul><ul><li>    QUORUM, </li></ul><ul><li>    LOCAL_QUORUM, </li></ul><ul><li>    EACH_QUORUM, </li></ul><ul><li>    ALL, </li></ul><ul><li>    ANY, </li></ul><ul><li>    TWO, </li></ul><ul><li>    THREE, </li></ul><ul><li>} </li></ul>
  32. 32. Java Code <ul><li>TTransport tr = new TSocket(&quot;localhost&quot;, 9160);  </li></ul><ul><li>TProtocol proto = new TBinaryProtocol(tr);  </li></ul><ul><li>Cassandra.Client client = new Cassandra.Client(proto);  </li></ul><ul><li>tr.open();  </li></ul><ul><li>String key_user_id = &quot;1&quot;;  </li></ul><ul><li>long timestamp = System.currentTimeMillis();  </li></ul><ul><li>client. insert (&quot;Keyspace1&quot;,  </li></ul><ul><li>               key_user_id,  </li></ul><ul><li>               new ColumnPath(&quot;Standard1&quot;,  </li></ul><ul><li>                              null, </li></ul><ul><li>                              &quot;name&quot;.getBytes(&quot;UTF-8&quot;)),  </li></ul><ul><li>               &quot;Chris Goffinet&quot;.getBytes(&quot;UTF-8&quot;), </li></ul><ul><li>               timestamp,  </li></ul><ul><li>               ConsistencyLevel.ONE);  </li></ul>
  33. 33. Java Client - Hector http://github.com/rantav/hector <ul><ul><li>The de-facto java client for cassandra </li></ul></ul><ul><ul><li>Encapsulates thrift </li></ul></ul><ul><ul><li>Adds JMX (Monitoring) </li></ul></ul><ul><ul><li>Connection pooling </li></ul></ul><ul><ul><li>Failover </li></ul></ul><ul><ul><li>Open-sourced at github and has a growing community of developers and users. </li></ul></ul>
  34. 34. Java Client - Hector - cont <ul><li>  /** </li></ul><ul><li>   * Insert a new value keyed by key </li></ul><ul><li>   * </li></ul><ul><li>   * @param key   Key for the value </li></ul><ul><li>   * @param value the String value to insert </li></ul><ul><li>   */ </li></ul><ul><li>   public void insert (final String key, final String value) { </li></ul><ul><li>     Mutator m = createMutator(keyspaceOperator); </li></ul><ul><li>     m.insert(key,  </li></ul><ul><li>             CF_NAME,  </li></ul><ul><li>             createColumn(COLUMN_NAME, value)); </li></ul><ul><li>   } </li></ul>
  35. 35. Java Client - Hector - cont <ul><li>   /** </li></ul><ul><li>   * Get a string value. </li></ul><ul><li>   * </li></ul><ul><li>   * @return The string value; null if no value exists for the given key. </li></ul><ul><li>   */ </li></ul><ul><li>   public String get (final String key) throws HectorException { </li></ul><ul><li>     ColumnQuery<String, String> q = createColumnQuery(keyspaceOperator, serializer, serializer); </li></ul><ul><li>     Result<HColumn<String, String>> r = q.setKey(key). </li></ul><ul><li>         setName(COLUMN_NAME). </li></ul><ul><li>         setColumnFamily(CF_NAME). </li></ul><ul><li>         execute(); </li></ul><ul><li>     HColumn<String, String> c = r.get(); </li></ul><ul><li>     return c == null ? null : c.getValue(); </li></ul><ul><li>   } </li></ul>
  36. 36. Extra If you're not snoring yet...
  37. 37. Sorting <ul><li>Columns are sorted by their type  </li></ul><ul><ul><li>BytesType  </li></ul></ul><ul><ul><li>UTF8Type </li></ul></ul><ul><ul><li>AsciiType </li></ul></ul><ul><ul><li>LongType </li></ul></ul><ul><ul><li>LexicalUUIDType </li></ul></ul><ul><ul><li>TimeUUIDType </li></ul></ul><ul><li>Rows are sorted by their Partitioner </li></ul><ul><ul><li>RandomPartitioner </li></ul></ul><ul><ul><li>OrderPreservingPartitioner </li></ul></ul><ul><ul><li>CollatingOrderPreservingPartitioner </li></ul></ul>
  38. 38. Thrift <ul><li>Cross-language protocol </li></ul><ul><li>Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ... </li></ul><ul><li>struct UserProfile {  </li></ul><ul><li>     1: i32     uid ,  </li></ul><ul><li>     2: string name ,  </li></ul><ul><li>     3: string blurb   </li></ul><ul><li>}  </li></ul><ul><li>service UserStorage {  </li></ul><ul><li>     void          store (1: UserProfile user), </li></ul><ul><li>     UserProfile   retrieve (1: i32 uid)  </li></ul><ul><li>} </li></ul>
  39. 39. Thrift <ul><li>Generating sources: </li></ul><ul><li>thrift --gen java cassandra.thrift </li></ul><ul><li>thrift -- gen py cassandra.thrift </li></ul>
  40. 40. Internals
  41. 41. Agenda <ul><ul><li>Background and history </li></ul></ul><ul><ul><li>Architectural Layers </li></ul></ul><ul><ul><li>Transport: Thrift </li></ul></ul><ul><ul><li>Write Path (and sstables, memtables) </li></ul></ul><ul><ul><li>Read Path </li></ul></ul><ul><ul><li>Compactions </li></ul></ul><ul><ul><li>Bloom Filters </li></ul></ul><ul><ul><li>Gossip </li></ul></ul><ul><ul><li>Deletions </li></ul></ul><ul><ul><li>More... </li></ul></ul>
  42. 42. Required Reading ;-) <ul><li>BigTable  http://labs.google.com/papers/bigtable.html </li></ul><ul><li>Dynamo  http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html </li></ul>
  43. 43. From Dynamo: <ul><ul><li>Symmetric p2p architecture </li></ul></ul><ul><ul><li>Gossip based discovery and error detection </li></ul></ul><ul><ul><li>Distributed key-value store </li></ul></ul><ul><ul><ul><li>Pluggable partitioning  </li></ul></ul></ul><ul><ul><ul><li>Pluggable topology discovery </li></ul></ul></ul><ul><ul><li>Eventual consistent and Tunable per operation  </li></ul></ul>
  44. 44. From BigTable <ul><ul><li>Sparse Column oriented sparse array </li></ul></ul><ul><ul><li>SSTable disk storage </li></ul></ul><ul><ul><ul><li>Append-only commit log </li></ul></ul></ul><ul><ul><ul><li>Memtable (buffering and sorting) </li></ul></ul></ul><ul><ul><ul><li>Immutable sstable files </li></ul></ul></ul><ul><ul><ul><li>Compactions </li></ul></ul></ul><ul><ul><ul><li>High write performance  </li></ul></ul></ul>
  45. 45. Architecture Layers Cluster Management Messaging service  Gossip  Failure detection  Cluster state  Partitioner  Replication  Single Host Commit log  Memtable  SSTable  Indexes  Compaction 
  46. 46. Write Path
  47. 47. Writing
  48. 48. Writing
  49. 49. Writing
  50. 50. Writing
  51. 51. Memtables <ul><ul><li>In-memory representation of recently written data </li></ul></ul><ul><ul><li>When the table is full, it's sorted and then flushed to disk -> sstable </li></ul></ul>
  52. 52. SSTables <ul><li>Sorted Strings Tables </li></ul><ul><ul><li>Immutable </li></ul></ul><ul><ul><li>On-disk </li></ul></ul><ul><ul><li>Sorted by a string key </li></ul></ul><ul><ul><li>In-memory index of elements </li></ul></ul><ul><ul><li>Binary search (in memory) to find element location </li></ul></ul><ul><ul><li>Bloom filter to reduce number of unneeded binary searches. </li></ul></ul>
  53. 53. Write Properties <ul><ul><li>No Locks in the critical path </li></ul></ul><ul><ul><li>Always available to writes, even if there are failures. </li></ul></ul><ul><ul><li>No reads </li></ul></ul><ul><ul><li>No seeks  </li></ul></ul><ul><ul><li>Fast  </li></ul></ul><ul><ul><li>Atomic within a Row </li></ul></ul>
  54. 54. Read Path
  55. 55. Reads
  56. 56. Reading
  57. 57. Reading
  58. 58. Reading
  59. 59. Reading
  60. 60. Bloom Filters <ul><ul><li>Space efficient probabilistic data structure </li></ul></ul><ul><ul><li>Test whether an element is a member of a set </li></ul></ul><ul><ul><li>Allow false positive, but not false negative  </li></ul></ul><ul><ul><li>k hash functions </li></ul></ul><ul><ul><li>Union and intersection are implemented as bitwise OR, AND </li></ul></ul>
  61. 61. Read Properteis <ul><ul><li>Read multiple SSTables  </li></ul></ul><ul><ul><li>Slower than writes (but still fast)  </li></ul></ul><ul><ul><li>Seeks can be mitigated with more RAM </li></ul></ul><ul><ul><li>Uses probabilistic bloom filters to reduce lookups. </li></ul></ul><ul><ul><li>Extensive optional caching </li></ul></ul><ul><ul><ul><li>Key Cache </li></ul></ul></ul><ul><ul><ul><li>Row Cache </li></ul></ul></ul><ul><ul><ul><li>Excellent monitoring </li></ul></ul></ul>
  62. 62. Compactions <ul><li>  </li></ul>
  63. 63. Compactions <ul><ul><li>Merge keys  </li></ul></ul><ul><ul><li>Combine columns  </li></ul></ul><ul><ul><li>Discard tombstones </li></ul></ul><ul><ul><li>Use bloom filters bitwise OR operation </li></ul></ul>
  64. 64. Gossip <ul><ul><li>p2p </li></ul></ul><ul><ul><li>Enables seamless nodes addition. </li></ul></ul><ul><ul><li>Rebalancing of keys </li></ul></ul><ul><ul><li>Fast detection of nodes that goes down. </li></ul></ul><ul><ul><li>Every node knows about all others - no master. </li></ul></ul>
  65. 65. Deletions <ul><ul><li>Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction  </li></ul></ul><ul><ul><li>Read repair complicates things a little  </li></ul></ul><ul><ul><li>Eventually consistent complicates things more  </li></ul></ul><ul><ul><li>Solution: configurable delay before tombstone GC, after which tombstones are not repaired </li></ul></ul>
  66. 66. Extra Long list of subjects <ul><ul><li>SEDA (Staged Events Driven Architecture) </li></ul></ul><ul><ul><li>Anti Entropy and Merkle Trees </li></ul></ul><ul><ul><li>Hinted Handoff </li></ul></ul><ul><ul><li>repair on read </li></ul></ul>
  67. 67. SEDA <ul><ul><li>Mutate </li></ul></ul><ul><ul><li>Stream </li></ul></ul><ul><ul><li>Gossip </li></ul></ul><ul><ul><li>Response </li></ul></ul><ul><ul><li>Anti Entropy </li></ul></ul><ul><ul><li>Load Balance </li></ul></ul><ul><ul><li>Migration  </li></ul></ul>
  68. 68. Anti Entropy and Merkle Trees
  69. 69. Hinted Handoff
  70. 70. References <ul><ul><li>http://horicky.blogspot.com/2009/11/nosql-patterns.html </li></ul></ul><ul><ul><li>http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf </li></ul></ul><ul><ul><li>http://labs.google.com/papers/bigtable.html </li></ul></ul><ul><ul><li>http://bret.appspot.com/entry/how-friendfeed-uses-mysql </li></ul></ul><ul><ul><li>http://www.julianbrowne.com/article/viewer/brewers-cap-theorem </li></ul></ul><ul><ul><li>http://www.allthingsdistributed.com/2008/12/eventually_consistent.html </li></ul></ul><ul><ul><li>http://wiki.apache.org/cassandra/DataModel </li></ul></ul><ul><ul><li>http://incubator.apache.org/thrift/ </li></ul></ul><ul><ul><li>http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf </li></ul></ul>

×