Introduction to apache_cassandra_for_develope


A presentation for Data Day Austin on January 29th, 2011

An introduction for Java developers to using Apache Cassandra effectively with the Hector Java client.

  1. 1. Introduction to  Apache Cassandra (for Java developers!) Nate McCall [email_address] @zznate
  2. 2. Brief Intro  <ul><li>NOT a &quot;key/value store&quot; </li></ul><ul><li>Columns are dynamic inside a column family </li></ul><ul><li>SSTables are immutable  </li></ul><ul><li>SSTables merged on reads </li></ul><ul><li>All nodes share the same role (i.e. no single point of failure) </li></ul><ul><li>Trading ACID compliance for scalability is a fundamental design decision </li></ul>
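The "immutable SSTables, merged on reads" points above can be sketched with a small in-memory model. This is not Cassandra's storage engine: the `Cell` class, the `mergeOnRead` method and the column names are hypothetical, and the only rule modelled is that the version with the newest timestamp wins for each column.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model; not Cassandra's storage engine.
class SSTableMergeSketch {

    /** One version of a column: a value plus the timestamp it was written with. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Merges one row's cells from several immutable tables; the newest timestamp wins per column. */
    static Map<String, Cell> mergeOnRead(List<Map<String, Cell>> sstables) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                Cell current = merged.get(e.getKey());
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        return merged;
    }

    /** Builds two read-only "SSTables" for the same row and merges them. */
    static Map<String, Cell> sampleRead() {
        Map<String, Cell> older = new HashMap<>();
        older.put("lat", new Cell("30.27", 1L));
        older.put("state", new Cell("TX", 1L));

        Map<String, Cell> newer = new HashMap<>();
        newer.put("lat", new Cell("30.32", 2L));   // later write to an existing column
        newer.put("city", new Cell("Austin", 2L)); // column that only exists in the newer table

        List<Map<String, Cell>> tables = new ArrayList<>();
        tables.add(Collections.unmodifiableMap(older));
        tables.add(Collections.unmodifiableMap(newer));
        return mergeOnRead(tables);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Cell> e : sampleRead().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().value + " @" + e.getValue().timestamp);
        }
    }
}
```

Neither table is modified: the write at timestamp 2 simply lands in a newer table, and the read reconciles the two.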
  3. 3. How does this impact development? <ul><li>Substantially.  </li></ul><ul><li>For operations affecting the same data, that data will become consistent eventually as determined by the timestamps.  </li></ul><ul><li>But you can trade availability for consistency. (More on this later) </li></ul><ul><li>You can store whatever you want. It's all just bytes. </li></ul><ul><li>You need to think about how you will query the data before you write it. </li></ul>
  4. 4. Neat. So Now What? <ul><li>Like any database, you need a client! </li></ul><ul><ul><li>Python: </li></ul></ul><ul><ul><ul><li>Telephus:  (Twisted) </li></ul></ul></ul><ul><ul><ul><li>Pycassa: </li></ul></ul></ul><ul><ul><li>Java: </li></ul></ul><ul><ul><ul><li>Hector:  (Examples  ) </li></ul></ul></ul><ul><ul><ul><li>Pelops: </li></ul></ul></ul><ul><ul><ul><li>Kundera </li></ul></ul></ul><ul><ul><ul><li>Datanucleus JDO: </li></ul></ul></ul><ul><ul><li>Grails: </li></ul></ul><ul><ul><ul><li>grails-cassandra: </li></ul></ul></ul><ul><ul><li>.NET: </li></ul></ul><ul><ul><ul><li>FluentCassandra : </li></ul></ul></ul><ul><ul><ul><li>Aquiles: </li></ul></ul></ul><ul><ul><li>Ruby: </li></ul></ul><ul><ul><ul><li>Cassandra: </li></ul></ul></ul><ul><ul><li>PHP: </li></ul></ul><ul><ul><ul><li>phpcassa: </li></ul></ul></ul><ul><ul><ul><li>SimpleCassie : </li></ul></ul></ul>
  5. 5. ... but do not roll your own
  6. 6. Thrift <ul><ul><li>Fast, efficient serialization and network IO.  </li></ul></ul><ul><ul><li>Lots of clients available (you can probably use it in other places as well) </li></ul></ul><ul><li>Why you don't want to work with the Thrift API directly: </li></ul><ul><ul><li>SuperColumn </li></ul></ul><ul><ul><li>ColumnOrSuperColumn </li></ul></ul><ul><ul><li>ColumnParent.super_column </li></ul></ul><ul><ul><li>ColumnPath.super_column </li></ul></ul><ul><ul><li>Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap  </li></ul></ul>
  7. 7. Higher Level Client <ul><li>Hector </li></ul><ul><ul><li>JMX Counters </li></ul></ul><ul><ul><li>Add/remove hosts: </li></ul></ul><ul><ul><ul><li>automatically  </li></ul></ul></ul><ul><ul><ul><li>programmatically </li></ul></ul></ul><ul><ul><ul><li>via JMX </li></ul></ul></ul><ul><ul><li>Pluggable load balancing </li></ul></ul><ul><ul><li>Complete encapsulation of Thrift API </li></ul></ul><ul><ul><li>Type-safe approach to dealing with Apache Cassandra </li></ul></ul><ul><ul><li>Lightweight ORM (supports JPA 1.0 annotations) </li></ul></ul><ul><ul><li>Mavenized! </li></ul></ul>
  8. 8. &quot;CQL&quot; <ul><ul><li>Currently in Apache Cassandra trunk  </li></ul></ul><ul><ul><li>Experimental </li></ul></ul><ul><ul><li>Lots of possibilities </li></ul></ul><ul><li>from test/system/ </li></ul><ul><li>UPDATE StandardLong1 SET 1L=&quot;1&quot;, 2L=&quot;2&quot;, 3L=&quot;3&quot;, 4L=&quot;4&quot; WHERE KEY=&quot;aa&quot; </li></ul><ul><li>SELECT &quot;cd1&quot;, &quot;col&quot; FROM Standard1 WHERE KEY = &quot;kd&quot; </li></ul><ul><li>DELETE &quot;cd1&quot;, &quot;col&quot; FROM Standard1 WHERE KEY = &quot;kd&quot; </li></ul>
  9. 9. Avro?? <ul><li>Gone. Added too much complexity after Thrift caught up.   </li></ul><ul><li>&quot;None of the libraries distinguished themselves as being a particularly crappy choice for serialization.&quot;  </li></ul><ul><li>(See  CASSANDRA-1765 ) </li></ul>
  10. 10. Thrift API Methods <ul><li>Retrieving </li></ul><ul><li>Writing/Removing </li></ul><ul><li>Meta Information </li></ul><ul><li>Schema Manipulation </li></ul>
  11. 11. Thrift API Methods - Retrieving <ul><li>get: retrieve a single column for a key </li></ul><ul><li>get_slice: retrieve a &quot;slice&quot; of columns for a key </li></ul><ul><li>multiget_slice: retrieve a &quot;slice&quot; of columns for a list of keys </li></ul><ul><li>get_count: count the columns of a key (you have to deserialize the row to do it) </li></ul><ul><li>get_range_slices: retrieve a slice for a range of keys </li></ul><ul><li>get_indexed_slices (FTW!) </li></ul>
  12. 12. Thrift API Methods - Writing/Removing <ul><li>insert </li></ul><ul><li>batch_mutate (batch insertion AND deletion) </li></ul><ul><li>remove </li></ul><ul><li>truncate** </li></ul>
  13. 13. Thrift API Methods - Meta Information <ul><li>describe_cluster_name </li></ul><ul><li>describe_version </li></ul><ul><li>describe_keyspace </li></ul><ul><li>describe_keyspaces </li></ul>
  14. 14. Thrift API Methods - Schema <ul><li>system_add_keyspace </li></ul><ul><li>system_update_keyspace </li></ul><ul><li>system_drop_keyspace </li></ul><ul><li>system_add_column_family </li></ul><ul><li>system_update_column_family </li></ul><ul><li>system_drop_column_family </li></ul>
  15. 15. vs. RDBMS - Consistency Level <ul><li>Consistency is tunable per request! </li></ul><ul><li>Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor). </li></ul><ul><li>*** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK *** </li></ul><ul><li>Idempotent: an operation can be applied multiple times without changing the result </li></ul>
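The R + W > N rule above is plain arithmetic, so a minimal sketch may help. The class and method names are hypothetical; the only assumption beyond the slide is the usual definition of a quorum as a majority of the replicas, (N / 2) + 1.

```java
// Hypothetical helper; it only does the arithmetic behind R + W > N.
class ConsistencySketch {

    /** Replicas needed for a majority (quorum) of a given replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must overlap every write set, i.e. R + W > N. */
    static boolean overlapGuaranteed(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        int[][] combos = { {1, 1}, {1, n}, {quorum(n), quorum(n)} };
        for (int[] c : combos) {
            System.out.println("R=" + c[0] + " W=" + c[1] + " N=" + n
                    + " -> overlap guaranteed: " + overlapGuaranteed(c[0], c[1], n));
        }
    }
}
```

With N = 3, reading and writing at quorum (2 + 2 > 3) satisfies the rule, while reading and writing a single replica (1 + 1) does not.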
  16. 16. vs. RDBMS - Append Only <ul><li>Proper data modelling minimizes seeks  </li></ul><ul><li>(Go to Tyler's presentation for more!) </li></ul>
  17. 17. On to the Code... <ul><li> </li></ul><ul><li>Uses Maven.  </li></ul><ul><li>Really basic.  </li></ul><ul><li>Modify/abuse/alter as needed.  </li></ul><ul><li>Descriptions of what is going on and how to run each example are in the Javadoc comments.  </li></ul><ul><li>Sample data is based on North American Numbering Plan </li></ul><ul><li> </li></ul>
  18. 18. Data Shape <ul><li>512 202 30.27 097.74 W TX Austin </li></ul><ul><li>512 203 30.27 097.74 L TX Austin </li></ul><ul><li>512 204 30.32 097.73 W TX Austin </li></ul><ul><li>512 205 30.32 097.73 W TX Austin </li></ul><ul><li>512 206 30.32 097.73 L TX Austin </li></ul>
  19. 19. Get a Single Column for a Key <ul><li> </li></ul><ul><li>Retrieve a single column with: </li></ul><ul><li>Name </li></ul><ul><li>Value </li></ul><ul><li>Timestamp </li></ul><ul><li>TTL </li></ul>
  20. 20. Get the Contents of a Row <ul><li> </li></ul><ul><li>Retrieves a list of columns (Hector wraps these in a ColumnSlice) </li></ul><ul><li>&quot;SlicePredicate&quot; can be either an explicit set of columns OR a range (more on ranges soon) </li></ul><ul><li>Another messy either/or choice encapsulated by Hector </li></ul>
  21. 21. Get the (sorted!) Columns of a Row  <ul><li> </li></ul><ul><li>Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it) </li></ul><ul><li>Can be easily modified to return results in reverse order (but this is slightly slower) </li></ul>
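To see why the comparator matters, here is a minimal sketch that models a row as a sorted map of column names. It is an in-memory stand-in, not Hector or Cassandra code: `AS_TEXT`, `AS_LONG` and `slice` are hypothetical names, and the column names "9", "10" and "202" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of a row whose columns are kept sorted by a comparator.
class ColumnOrderSketch {

    /** Orders column names as text. */
    static final Comparator<String> AS_TEXT = Comparator.naturalOrder();

    /** Orders column names by their numeric value. */
    static final Comparator<String> AS_LONG = Comparator.comparingLong(Long::parseLong);

    /** Returns up to {@code count} column names in comparator order, optionally reversed. */
    static List<String> slice(Comparator<String> comparator, boolean reversed, int count, String... names) {
        NavigableMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, ""); // values are irrelevant for the ordering
        }
        NavigableMap<String, String> view = reversed ? row.descendingMap() : row;
        List<String> out = new ArrayList<>();
        for (String name : view.keySet()) {
            if (out.size() == count) {
                break;
            }
            out.add(name);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("text order:    " + slice(AS_TEXT, false, 3, "9", "10", "202"));
        System.out.println("numeric order: " + slice(AS_LONG, false, 3, "9", "10", "202"));
        System.out.println("reversed:      " + slice(AS_LONG, true, 2, "9", "10", "202"));
    }
}
```

The same three names come back in a different order depending on the comparator, which is why the choice has to match how the columns will be sliced.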
  22. 22. Get the Same Slice from Several Rows <ul><li> </li></ul><ul><li>Very similar to get_slice examples, except we provide a list of keys </li></ul>
  23. 23. Get Slices From a Range of Rows <ul><li> </li></ul><ul><li>Like multiget_slice, except we can specify a KeyRange </li></ul><ul><li>(encapsulated by RangeSlicesQuery#setKeys(start, end)) </li></ul><ul><li>The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!) </li></ul>
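The remark about OrderPreservingPartitioner can be illustrated with a rough sketch. The assumption here is that a hash-based partitioner places rows by an MD5-derived token while an order-preserving one places them by the key itself. The `md5Token` method is a simplified stand-in and may not match Cassandra's exact token computation, and all names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical comparison of row ordering under hash-based and key-based placement.
class PartitionerOrderSketch {

    /** Simplified stand-in for a hash token: the MD5 digest read as a positive integer. */
    static BigInteger md5Token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Order in which a range scan would walk the rows when they are placed by hash token. */
    static List<String> orderByHash(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(PartitionerOrderSketch::md5Token));
        return sorted;
    }

    /** Order in which a range scan would walk the rows when they are placed by key. */
    static List<String> orderByKey(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("512204");
        keys.add("512202");
        keys.add("512206");
        keys.add("512203");
        System.out.println("by hash token: " + orderByHash(keys));
        System.out.println("by key:        " + orderByKey(keys));
    }
}
```

Under key ordering, a start and end key bound a contiguous, meaningful set of rows. Under hash ordering the same bounds select rows whose tokens happen to fall between the two hashes.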
  24. 24. Get Slices From a Range of Rows - 2 <ul><li> </li></ul><ul><li>Compound column name for controlling ranges </li></ul><ul><li>Comparator at work on text field </li></ul>
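A compound column name lets a text comparator do the range work. The sketch below assumes a hypothetical "city:npanxx" naming scheme inside a single row. The scheme, the class and the Houston entry are illustrative and are not taken from the tutorial's schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model: one row whose column names are "city:npanxx", sorted as text.
class CompoundNameSketch {

    /** Builds a sample row; the Houston entry is made up for illustration. */
    static NavigableMap<String, String> sampleRow() {
        NavigableMap<String, String> row = new TreeMap<>();
        row.put("Austin:512202", "30.27,097.74");
        row.put("Austin:512203", "30.27,097.74");
        row.put("Austin:512204", "30.32,097.73");
        row.put("Houston:713200", "");
        return row;
    }

    /** Slices every column whose name starts with "city:" by using range bounds only. */
    static List<String> columnsFor(NavigableMap<String, String> row, String city) {
        String start = city + ":";
        String finish = city + ":" + Character.MAX_VALUE; // sorts after any realistic suffix
        return new ArrayList<>(row.subMap(start, true, finish, true).keySet());
    }

    public static void main(String[] args) {
        System.out.println(columnsFor(sampleRow(), "Austin"));
    }
}
```

Because the prefix comes first in the name, all columns for one city are adjacent in sort order and one start/finish pair selects them.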
  25. 25. Get Slices from Indexed Columns <ul><li> </li></ul><ul><li>You only need to index a single column to apply clauses on other columns </li></ul><ul><li>(BUT- the indexed column must be present with an EQUALS clause!) </li></ul><ul><li>(It's just another ColumnFamily maintained automatically) </li></ul>
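The rule that one indexed EQUALS clause is enough to filter on other columns can be modelled in memory. This is a sketch of the idea only: the `stateIndex` map stands in for the automatically maintained index, the class and method names are hypothetical, and re-indexing on update is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical in-memory model of a column family with one indexed column ("state").
class IndexedSliceSketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    private final Map<String, Set<String>> stateIndex = new HashMap<>(); // maintained on every insert

    void insert(String key, String state, String lat) {
        Map<String, String> columns = new HashMap<>();
        columns.put("state", state);
        columns.put("lat", lat);
        rows.put(key, columns);
        stateIndex.computeIfAbsent(state, s -> new TreeSet<>()).add(key);
    }

    /** The EQUALS clause on the indexed column picks candidates; other clauses only filter them. */
    List<String> query(String stateEquals, Predicate<Map<String, String>> otherClauses) {
        List<String> hits = new ArrayList<>();
        Set<String> candidates = stateIndex.getOrDefault(stateEquals, Collections.<String>emptySet());
        for (String key : candidates) {
            if (otherClauses.test(rows.get(key))) {
                hits.add(key);
            }
        }
        return hits;
    }

    static IndexedSliceSketch sample() {
        IndexedSliceSketch cf = new IndexedSliceSketch();
        cf.insert("512202", "TX", "30.27");
        cf.insert("512203", "TX", "30.27");
        cf.insert("512204", "TX", "30.32");
        return cf;
    }

    public static void main(String[] args) {
        // state = 'TX' (indexed) AND lat > 30.30 (not indexed)
        System.out.println(sample().query("TX", c -> Double.parseDouble(c.get("lat")) > 30.30));
    }
}
```

Without the equality clause there is no candidate set to start from, which is why the indexed column must appear in the query.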
  26. 26. Insert, Update and Delete <ul><li>... are effectively the same operation.  </li></ul><ul><li> </li></ul><ul><li> </li></ul><ul><li>Run each in succession (in whichever combination you like) and verify your results on the CLI </li></ul><ul><li>Hint: watch the timestamps </li></ul><ul><li>bin/cassandra-cli --host localhost </li></ul><ul><li>use Tutorial; </li></ul><ul><li>list AreaCode; </li></ul><ul><li>list Npanxx; </li></ul><ul><li>list StateCity; </li></ul>
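One way to read "effectively the same operation" is that each of the three is a timestamped write, with a delete writing a marker (a tombstone) instead of a value. The sketch below models only that idea. The class is hypothetical, and tie-breaking on equal timestamps is deliberately simplified to "first applied wins", which may differ from Cassandra's behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a single row; not Hector or Cassandra code.
class TombstoneSketch {

    /** A column version; a null value marks a tombstone. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    private final Map<String, Cell> columns = new HashMap<>();

    /** Insert and update are the same call: a write with a timestamp. */
    void write(String name, String value, long timestamp) {
        apply(name, new Cell(value, timestamp));
    }

    /** A delete is also a write: it stores a tombstone with a timestamp. */
    void delete(String name, long timestamp) {
        apply(name, new Cell(null, timestamp));
    }

    /** Keeps the incoming cell only if it is strictly newer than the stored one. */
    private void apply(String name, Cell incoming) {
        Cell existing = columns.get(name);
        if (existing == null || incoming.timestamp > existing.timestamp) {
            columns.put(name, incoming);
        }
    }

    /** Returns the live value, or null if the column is absent or deleted. */
    String read(String name) {
        Cell cell = columns.get(name);
        return (cell == null || cell.isTombstone()) ? null : cell.value;
    }

    public static void main(String[] args) {
        TombstoneSketch row = new TombstoneSketch();
        row.write("city", "Austin", 1L);
        row.delete("city", 2L);
        row.write("city", "Austin", 1L); // stale timestamp: the tombstone still wins
        System.out.println("after stale write: " + row.read("city"));
        row.write("city", "Austin", 3L); // newer timestamp: the column is visible again
        System.out.println("after newer write: " + row.read("city"));
    }
}
```

This is also why watching the timestamps matters: a write that carries an older timestamp than an existing delete never becomes visible.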
  27. 27. Stuff I Punted on for the Sake of Brevity <ul><li>meta_* methods </li></ul><ul><li> L43-81 @hector </li></ul><ul><li>system_* methods </li></ul><ul><li> @ hector-examples </li></ul><ul><li> L84-157 @hector </li></ul><ul><li>ORM (it works and is in production) </li></ul><ul><li>ORM Documentation </li></ul><ul><li>multiple nodes </li></ul><ul><li>failure scenarios </li></ul><ul><li>Data modelling (go see Tyler's presentation) </li></ul>
  28. 28. Things to Remember <ul><ul><li>deletes and timestamp granularity </li></ul></ul><ul><ul><li>&quot;range ghosts&quot; </li></ul></ul><ul><ul><li>using the wrong column comparator and InvalidRequestException </li></ul></ul><ul><ul><li>deletions actually write data </li></ul></ul><ul><ul><li>use column-level TTL to automate deletion </li></ul></ul><ul><ul><li>&quot;how do I iterate over all the rows in a column family&quot;? </li></ul></ul><ul><ul><ul><li>get_range_slices, but don't do that </li></ul></ul></ul><ul><ul><ul><li>a good sign your data model is wrong </li></ul></ul></ul>
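The "range ghosts" item refers to range scans returning the keys of deleted rows with no columns attached. A client-side filter is a common way to cope. The sketch below assumes a scan result shaped as a map from row key to live column names. The shape, the class and the method names are hypothetical and are not Hector's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of a range-scan result: row key -> names of live columns.
class RangeGhostSketch {

    /** Simulated scan in which row 512203 was deleted: its key comes back, its columns do not. */
    static Map<String, List<String>> rangeScan() {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("512202", Arrays.asList("city", "state"));
        result.put("512203", Collections.<String>emptyList()); // the "ghost"
        result.put("512204", Arrays.asList("city", "state"));
        return result;
    }

    /** Keeps only the keys that still have at least one live column. */
    static List<String> keysWithData(Map<String, List<String>> scan) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : scan.entrySet()) {
            if (!e.getValue().isEmpty()) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keysWithData(rangeScan()));
    }
}
```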
  29. 29. Dealing with *Lots* of Data (Briefly) <ul><li>Two biggest headaches have been addressed: </li></ul><ul><ul><li>Compaction pollutes the OS page cache ( CASSANDRA-1470 ) </li></ul></ul><ul><ul><li>More than 143 million keys in a single SSTable means more Bloom filter false positives ( CASSANDRA-1555 ) </li></ul></ul><ul><li>Hadoop integration: Yes. (Go see Jeremy's presentation) </li></ul><ul><li>Bulk loading: Yes.  CASSANDRA-1278 </li></ul><ul><li>For more information: </li></ul>