Introduction to apache_cassandra_for_develope

A presentation for Data Day Austin on January 29th, 2011

Introduces how to effectively use Apache Cassandra for Java developers using the Hector Java client: http://github.com/rantav/hector

Transcript

  • 1. Introduction to Apache Cassandra (for Java developers!) Nate McCall [email_address] @zznate
  • 2. Brief Intro
      • NOT a "key/value store"
      • Columns are dynamic inside a column family
      • SSTables are immutable
      • SSTables merged on reads
      • All nodes share the same role (i.e. no single point of failure)
      • Trading ACID compliance for scalability is a fundamental design decision
  • 3. How does this impact development?
      • Substantially.
      • For operations affecting the same data, that data will become consistent eventually as determined by the timestamps.
      • But you can trade availability for consistency. (More on this later)
      • You can store whatever you want. It's all just bytes.
      • You need to think about how you will query the data before you write it.
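To make "consistent eventually as determined by the timestamps" concrete, here is a minimal sketch of last-write-wins reconciliation between two replica copies of the same column. The `Version` class and `reconcile` method are hypothetical names made up for illustration, not Cassandra or Hector types, and the tie-breaking shown is a simplification.

```java
import java.nio.charset.StandardCharsets;

class LastWriteWins {
    /** Hypothetical stand-in for one replica's copy of a column; not a Cassandra or Hector type. */
    static final class Version {
        final byte[] value;   // opaque bytes, as the slide says
        final long timestamp; // supplied by the writing client

        Version(byte[] value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Keeps the copy with the higher timestamp; on a tie this sketch simply keeps the first argument. */
    static Version reconcile(Version a, Version b) {
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        Version older = new Version("Austin".getBytes(StandardCharsets.UTF_8), 1000L);
        Version newer = new Version("Round Rock".getBytes(StandardCharsets.UTF_8), 2000L);
        Version winner = reconcile(older, newer);
        System.out.println(new String(winner.value, StandardCharsets.UTF_8)); // prints "Round Rock"
    }
}
```

Because the outcome depends only on the timestamps, the replicas converge on the same value regardless of the order in which they see the two writes.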
  • 4. Neat. So Now What?
      • Like any database, you need a client!
          • Python:
              • Telephus: http://github.com/driftx/Telephus (Twisted)
              • Pycassa: http://github.com/pycassa/pycassa
          • Java:
              • Hector: http://github.com/rantav/hector (Examples: https://github.com/zznate/hector-examples)
              • Pelops: http://github.com/s7/scale7-pelops
              • Kundera: http://code.google.com/p/kundera/
              • Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
          • Grails:
              • grails-cassandra: https://github.com/wolpert/grails-cassandra
          • .NET:
              • FluentCassandra: http://github.com/managedfusion/fluentcassandra
              • Aquiles: http://aquiles.codeplex.com/
          • Ruby:
              • Cassandra: http://github.com/fauna/cassandra
          • PHP:
              • phpcassa: http://github.com/thobbs/phpcassa
              • SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie
  • 5. ... but do not roll your own
  • 6. Thrift
      • Fast, efficient serialization and network IO.
      • Lots of clients available (you can probably use it in other places as well)
      • Why you don't want to work with the Thrift API directly:
          • SuperColumn
          • ColumnOrSuperColumn
          • ColumnParent.super_column
          • ColumnPath.super_column
          • Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap
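The nested `mutationMap` type is a good illustration of why the raw API is awkward. The sketch below builds a map of that shape by hand, reading it as row key, then column family name, then list of mutations. The `Mutation` class here is a hypothetical placeholder (the real Thrift struct is not reproduced), and the `add` helper is an invented name; only the nesting is the point.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MutationMapSketch {
    /** Hypothetical placeholder for Thrift's Mutation struct; only the map nesting matters here. */
    static final class Mutation {
        final String columnName;
        final String columnValue;

        Mutation(String columnName, String columnValue) {
            this.columnName = columnName;
            this.columnValue = columnValue;
        }
    }

    /** Files one mutation under row key -> column family name -> list of mutations. */
    static void add(Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap,
                    String rowKey, String columnFamily, Mutation mutation) {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(mutation);
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        add(mutationMap, "512202", "Npanxx", new Mutation("city", "Austin"));
        add(mutationMap, "512202", "Npanxx", new Mutation("state", "TX"));
        System.out.println(mutationMap.size()); // prints 1 (both mutations share one row key)
    }
}
```

A higher-level client hides this bookkeeping behind a single call per column, which is the encapsulation the next slide refers to.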
  • 7. Higher Level Client
      • Hector
          • JMX Counters
          • Add/remove hosts:
              • automatically
              • programmatically
              • via JMX
          • Pluggable load balancing
          • Complete encapsulation of Thrift API
          • Type-safe approach to dealing with Apache Cassandra
          • Lightweight ORM (supports JPA 1.0 annotations)
          • Mavenized! http://repo2.maven.org/maven2/me/prettyprint/
  • 8. "CQL"
      • Currently in Apache Cassandra trunk
      • Experimental
      • Lots of possibilities
      • from test/system/test_cql.py:
          • UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"
          • SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"
          • DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"
  • 9. Avro??
      • Gone. Added too much complexity after Thrift caught up.
      • "None of the libraries distinguished themselves as being a particularly crappy choice for serialization."
      • (See CASSANDRA-1765)
  • 10. Thrift API Methods
      • Retrieving
      • Writing/Removing
      • Meta Information
      • Schema Manipulation
  • 11. Thrift API Methods - Retrieving
      • get: retrieve a single column for a key
      • get_slice: retrieve a "slice" of columns for a key
      • multiget_slice: retrieve a "slice" of columns for a list of keys
      • get_count: counts the columns of a key (you have to deserialize the row to do it)
      • get_range_slices: retrieve a slice for a range of keys
      • get_indexed_slices (FTW!)
  • 12. Thrift API Methods - Writing/Removing
      • insert
      • batch_mutate (batch insertion AND deletion)
      • remove
      • truncate**
  • 13. Thrift API Methods - Meta Information
      • describe_cluster_name
      • describe_version
      • describe_keyspace
      • describe_keyspaces
  • 14. Thrift API Methods - Schema
      • system_add_keyspace
      • system_update_keyspace
      • system_drop_keyspace
      • system_add_column_family
      • system_update_column_family
      • system_drop_column_family
  • 15. vs. RDBMS - Consistency Level
      • Consistency is tunable per request!
      • Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
      • *** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***
      • Idempotent: an operation can be applied multiple times without changing the result
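The R + W > N rule can be checked with a few lines of arithmetic. The sketch below is only an illustration of the inequality; `readsSeeLatestWrite` and `quorum` are hypothetical helper names, and the quorum size of N/2 + 1 is the usual majority definition.

```java
class QuorumMath {
    /** True when every read replica set must overlap every write replica set: R + W > N. */
    static boolean readsSeeLatestWrite(int r, int w, int n) {
        return r + w > n;
    }

    /** Majority of n replicas. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        int q = quorum(n);
        System.out.println(readsSeeLatestWrite(q, q, n)); // prints true  (2 + 2 > 3)
        System.out.println(readsSeeLatestWrite(1, 1, n)); // prints false (1 + 1 <= 3)
        System.out.println(readsSeeLatestWrite(1, n, n)); // prints true  (1 + 3 > 3)
    }
}
```

The last case shows the trade: writing to all replicas lets reads stay cheap, but a single unavailable replica then fails the write.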
  • 16. vs. RDBMS - Append Only
      • Proper data modelling will minimize seeks
      • (Go to Tyler's presentation for more!)
  • 17. On to the Code...
      • https://github.com/zznate/cassandra-tutorial
      • Uses Maven.
      • Really basic.
      • Modify/abuse/alter as needed.
      • Descriptions of what is going on and how to run each example are in the Javadoc comments.
      • Sample data is based on the North American Numbering Plan
      • http://en.wikipedia.org/wiki/North_American_Numbering_Plan
  • 18. Data Shape
      • 512 202 30.27 097.74 W TX Austin
      • 512 203 30.27 097.74 L TX Austin
      • 512 204 30.32 097.73 W TX Austin
      • 512 205 30.32 097.73 W TX Austin
      • 512 206 30.32 097.73 L TX Austin
  • 19. Get a Single Column for a Key
      • GetCityForNpanxx.java
      • Retrieve a single column with:
          • Name
          • Value
          • Timestamp
          • TTL
  • 20. Get the Contents of a Row
      • GetSliceForNpanxx.java
      • Retrieves a list of columns (Hector wraps these in a ColumnSlice)
      • "SlicePredicate" can either be an explicit set of columns OR a range (more on ranges soon)
      • Another messy either/or choice encapsulated by Hector
  • 21. Get the (sorted!) Columns of a Row
      • GetSliceForStateCity.java
      • Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)
      • Can be easily modified to return results in reverse order (but this is slightly slower)
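Why the comparator matters can be shown without a cluster. In this sketch a `TreeMap` stands in for the sorted columns of one row (an assumption for illustration, not how Hector or Cassandra store data), and the same three column names are ordered first as text and then as numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class ComparatorOrderSketch {
    /** Inserts the names into a sorted map and returns them in comparator order. */
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "value-" + name);
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        Comparator<String> asText = Comparator.naturalOrder();
        Comparator<String> asLong = Comparator.comparingLong(Long::parseLong);

        System.out.println(sortedNames(asText, "9", "10", "2"));            // prints [10, 2, 9]
        System.out.println(sortedNames(asLong, "9", "10", "2"));            // prints [2, 9, 10]
        System.out.println(sortedNames(asLong.reversed(), "9", "10", "2")); // prints [10, 9, 2]
    }
}
```

A range slice over "2" to "9" returns different columns under each ordering, which is why the comparator has to match how the column names will be queried.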
  • 22. Get the Same Slice from Several Rows
      • MultigetSliceForNpanxx.java
      • Very similar to get_slice examples, except we provide a list of keys
  • 23. Get Slices From a Range of Rows
      • GetRangeSlicesForStateCity.java
      • Like multiget_slice, except we can specify a KeyRange
      • (encapsulated by RangeSlicesQuery#setKeys(start, end))
      • The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
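The remark about OrderPreservingPartitioner is easier to see side by side. The sketch below orders the same hypothetical row keys two ways: by the key itself, and by an MD5-derived token in the spirit of a hash-based partitioner. The class and method names are invented for illustration and nothing here talks to Cassandra.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

class PartitionerOrderSketch {
    /** Token derived from an MD5 hash of the key (assumption: models a hash-based partitioner). */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Order-preserving view: a key range returns the keys that sort between start and end. */
    static List<String> keysInKeyOrder(List<String> keys, String start, String end) {
        TreeSet<String> sorted = new TreeSet<>(keys);
        return new ArrayList<>(sorted.subSet(start, true, end, true));
    }

    /** Hash-ordered view: rows are laid out by token, so neighbours are unrelated keys. */
    static List<String> keysInTokenOrder(List<String> keys) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        for (String key : keys) {
            ring.put(md5Token(key), key);
        }
        return new ArrayList<>(ring.values());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("512202", "512203", "512204", "512205", "512206");
        System.out.println(keysInKeyOrder(keys, "512203", "512205")); // prints [512203, 512204, 512205]
        System.out.println(keysInTokenOrder(keys)); // same keys, in token order
    }
}
```

With hash ordering, a start and end key still bound a range of tokens, but the rows that fall inside it have no meaningful relationship to the keys that were supplied.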
  • 24. Get Slices From a Range of Rows - 2
      • GetSliceForAreaCodeCity.java
      • Compound column name for controlling ranges
      • Comparator at work on text field
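A compound column name lets a text comparator do the filtering. In the sketch below the column names are built as city, separator, exchange (a hypothetical layout and separator chosen for illustration, not necessarily what the tutorial uses), and a `TreeMap` again stands in for the sorted row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class CompoundColumnSketch {
    static final String SEPARATOR = ":"; // hypothetical separator

    /** Returns the column names that start with "city:" by slicing a range of the sorted row. */
    static List<String> sliceByCity(TreeMap<String, String> row, String city) {
        String start = city + SEPARATOR;
        String end = start + Character.MAX_VALUE; // sorts after every name with this prefix
        return new ArrayList<>(row.subMap(start, true, end, true).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> areaCode512 = new TreeMap<>();
        areaCode512.put("Austin:202", "30.27,097.74");
        areaCode512.put("Austin:203", "30.27,097.74");
        areaCode512.put("Austin:204", "30.32,097.73");
        areaCode512.put("Buda:295", "");
        areaCode512.put("Round Rock:218", "");

        System.out.println(sliceByCity(areaCode512, "Austin")); // prints [Austin:202, Austin:203, Austin:204]
    }
}
```

Because the columns are already stored in this order, the slice is a contiguous read rather than a scan with a filter.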
  • 25. Get Slices from Indexed Columns
      • GetIndexedSlicesForCityState.java
      • You only need to index a single column to apply clauses on other columns
      • (BUT - the indexed column must be present with an EQUALS clause!)
      • (It's just another ColumnFamily maintained automatically)
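The two parenthetical notes fit together: if the index is just another lookup structure keyed by column value, an equality clause is what finds the candidate rows, and any further clauses are applied as filters. The sketch below models that idea with plain maps. All names are hypothetical, and index maintenance on update or delete is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

class IndexedSliceSketch {
    // row key -> (column name -> value)
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    // index on the "state" column: value -> row keys (the "other ColumnFamily")
    private final Map<String, Set<String>> stateIndex = new HashMap<>();

    void insert(String rowKey, String state, String city, String lat) {
        Map<String, String> columns = new HashMap<>();
        columns.put("state", state);
        columns.put("city", city);
        columns.put("lat", lat);
        rows.put(rowKey, columns);
        stateIndex.computeIfAbsent(state, s -> new TreeSet<>()).add(rowKey);
    }

    /** The equality on the indexed column picks candidates; other clauses only filter them. */
    List<String> query(String stateEquals, Predicate<Map<String, String>> otherClauses) {
        List<String> hits = new ArrayList<>();
        for (String rowKey : stateIndex.getOrDefault(stateEquals, Collections.<String>emptySet())) {
            if (otherClauses.test(rows.get(rowKey))) {
                hits.add(rowKey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexedSliceSketch npanxx = new IndexedSliceSketch();
        npanxx.insert("512202", "TX", "Austin", "30.27");
        npanxx.insert("512204", "TX", "Austin", "30.32");
        npanxx.insert("212555", "NY", "New York", "40.71");

        // state = 'TX' AND lat > 30.3
        System.out.println(npanxx.query("TX", c -> Double.parseDouble(c.get("lat")) > 30.3)); // prints [512204]
    }
}
```

Without the equality clause there is no entry point into the index, so the only option left would be a scan over every row.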
  • 26. Insert, Update and Delete
      • ... are effectively the same operation.
      • InsertRowsForColumnFamilies.java
      • DeleteRowsForColumnFamily.java
      • Run each in succession (in whichever combination you like) and verify your results on the CLI
      • Hint: watch the timestamps
      • bin/cassandra-cli --host localhost
      • use Tutorial;
      • list AreaCode;
      • list Npanxx;
      • list StateCity;
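One way to read "effectively the same operation" is that each of the three is a timestamped write, and a delete writes a marker instead of a value. The sketch below models that with a hypothetical `Cell` class; it is not Cassandra's storage code, and the real tie-breaking rule for equal timestamps is not modeled.

```java
import java.util.HashMap;
import java.util.Map;

class TombstoneSketch {
    /** Hypothetical cell: a value or a deletion marker, plus the write timestamp. */
    static final class Cell {
        final String value;      // null when this cell is a tombstone
        final long timestamp;
        final boolean tombstone;

        Cell(String value, long timestamp, boolean tombstone) {
            this.value = value;
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }
    }

    private final Map<String, Cell> columns = new HashMap<>();

    /** Single write path: the incoming cell replaces the stored one only if it is strictly newer. */
    private void write(String name, Cell incoming) {
        Cell existing = columns.get(name);
        if (existing == null || incoming.timestamp > existing.timestamp) {
            columns.put(name, incoming);
        }
    }

    void insert(String name, String value, long timestamp) {
        write(name, new Cell(value, timestamp, false)); // an update is the same call with a newer timestamp
    }

    void delete(String name, long timestamp) {
        write(name, new Cell(null, timestamp, true)); // a delete writes data too
    }

    String read(String name) {
        Cell cell = columns.get(name);
        return (cell == null || cell.tombstone) ? null : cell.value;
    }

    int storedCells() {
        return columns.size();
    }

    public static void main(String[] args) {
        TombstoneSketch row = new TombstoneSketch();
        row.insert("city", "Austin", 1L);
        row.delete("city", 2L);
        row.insert("city", "Austin", 1L);      // stale timestamp, loses to the tombstone
        System.out.println(row.read("city"));  // prints null
        System.out.println(row.storedCells()); // prints 1 (the tombstone is still stored)
        row.insert("city", "Buda", 3L);
        System.out.println(row.read("city"));  // prints Buda
    }
}
```

This is why the hint says to watch the timestamps: re-running an insert with an older timestamp after a delete appears to do nothing.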
  • 27. Stuff I Punted on for the Sake of Brevity
      • meta_* methods
          • CassandraClusterTest.java: L43-81 @hector
      • system_* methods
          • SchemaManipulation.java @ hector-examples
          • CassandraClusterTest.java: L84-157 @hector
      • ORM (it works and is in production)
          • ORM Documentation
      • multiple nodes
      • failure scenarios
      • Data modelling (go see Tyler's presentation)
  • 28. Things to Remember
      • deletes and timestamp granularity
      • "range ghosts"
      • using the wrong column comparator and InvalidRequestException
      • deletions actually write data
      • use column-level TTL to automate deletion
      • "how do I iterate over all the rows in a column family?"
          • get_range_slices, but don't do that
          • a good sign your data model is wrong
  • 29. Dealing with *Lots* of Data (Briefly)
      • Two biggest headaches have been addressed:
          • Compaction pollutes OS page cache (CASSANDRA-1470)
          • Greater than 143mil keys on a single SSTable means more BF false positives (CASSANDRA-1555)
      • Hadoop integration: Yes. (Go see Jeremy's presentation)
      • Bulk loading: Yes. CASSANDRA-1278
      • For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations
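The Bloom filter (BF) point follows from the standard approximation for the false-positive rate, p ≈ (1 - e^(-kn/m))^k, for n keys, m bits and k hash functions: once m cannot grow, adding keys raises p. The sketch below evaluates that formula. The bit cap of 2^31 and k = 10 are assumptions chosen purely for illustration and are not taken from the slides or from CASSANDRA-1555.

```java
class BloomFalsePositiveSketch {
    /** Standard approximation: p = (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long keys, long bits, int hashes) {
        return Math.pow(1.0 - Math.exp(-(double) hashes * keys / bits), hashes);
    }

    public static void main(String[] args) {
        long cappedBits = 1L << 31; // assumed fixed filter size (illustrative)
        int hashes = 10;            // assumed number of hash functions (illustrative)

        for (long keys : new long[] {50_000_000L, 143_000_000L, 286_000_000L}) {
            System.out.printf("%,d keys -> false-positive rate %.6f%n",
                    keys, falsePositiveRate(keys, cappedBits, hashes));
        }
    }
}
```

Each false positive costs a wasted SSTable lookup on read, so the effect shows up as extra disk seeks rather than as wrong results.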