Introduction to apache_cassandra_for_develope

A presentation for Data Day Austin on January 29th, 2011

Introduces how to effectively use Apache Cassandra for Java developers using the Hector Java client: http://github.com/rantav/hector

Introduction to apache_cassandra_for_develope Presentation Transcript

  • 1. Introduction to Apache Cassandra (for Java developers!) Nate McCall @zznate
  • 2. Brief Intro 
    • NOT a "key/value store"
    • Columns are dynamic inside a column family
    • SSTables are immutable 
    • SSTables merged on reads
    • All nodes share the same role (i.e. no single point of failure)
    • Trading ACID compliance for scalability is a fundamental design decision
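The two SSTable bullets above can be pictured with a small in-memory model. This is only a sketch under assumptions: `MergeOnRead`, `Cell`, `flush` and `read` are hypothetical names, and real SSTables live on disk with far more machinery. The idea it shows is that flushed tables are never modified, so a read has to merge them and keep the newest timestamp per column.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of merge-on-read; this is not Cassandra's storage engine.
final class MergeOnRead {

    // One stored column: a value plus the timestamp supplied with the write.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // "Flush" a batch of writes into an SSTable-like structure:
    // sorted by column name and never modified afterwards.
    static SortedMap<String, Cell> flush(Map<String, Cell> writes) {
        return Collections.unmodifiableSortedMap(new TreeMap<>(writes));
    }

    // A read visits every table and keeps the newest cell per column name.
    static SortedMap<String, Cell> read(List<SortedMap<String, Cell>> sstables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                Cell seen = merged.get(entry.getKey());
                if (seen == null || entry.getValue().timestamp > seen.timestamp) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> older = new HashMap<>();
        older.put("city", new Cell("Austin", 1L));
        older.put("state", new Cell("TX", 1L));
        Map<String, Cell> newer = new HashMap<>();
        newer.put("city", new Cell("Round Rock", 2L));

        SortedMap<String, Cell> row = read(Arrays.asList(flush(older), flush(newer)));
        for (Map.Entry<String, Cell> e : row.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().value);
        }
    }
}
```

The more tables a row is spread across, the more work each read does in this model, which is one way to see why merging tables in the background matters.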
  • 3. How does this impact development?
    • Substantially. 
    • For operations affecting the same data, that data will become consistent eventually as determined by the timestamps. 
    • But you can trade availability for consistency. (More on this later)
    • You can store whatever you want. It's all just bytes.
    • You need to think about how you will query the data before you write it.
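To make "it's all just bytes" concrete, here is a minimal sketch of the conversions a client has to perform on the way in and out. `JustBytes` and its methods are hypothetical helpers written for illustration; a real client such as Hector ships its own serializers, which this does not reproduce.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: hand-rolled stand-ins for what a client's serializers do.
final class JustBytes {

    static ByteBuffer fromString(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    static String asString(ByteBuffer bytes) {
        // duplicate() leaves the caller's buffer position untouched
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }

    static ByteBuffer fromLong(long v) {
        ByteBuffer bytes = ByteBuffer.allocate(8);
        bytes.putLong(v);
        bytes.flip(); // switch the buffer from writing to reading
        return bytes;
    }

    static long asLong(ByteBuffer bytes) {
        return bytes.duplicate().getLong();
    }

    public static void main(String[] args) {
        System.out.println(asString(fromString("Austin")));
        System.out.println(asLong(fromLong(512202L)));
        System.out.println(fromLong(512202L).remaining() + " bytes for a long");
    }
}
```

Nothing in the stored bytes says whether they were a long or a string, so the reader has to apply the same conversion the writer used.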
  • 4. Neat. So Now What?
    • Like any database, you need a client!
      • Python:
        • Telephus:  http://github.com/driftx/Telephus  (Twisted)
        • Pycassa:  http://github.com/pycassa/pycassa
      • Java:
        • Hector:  http://github.com/rantav/hector  (Examples  https://github.com/zznate/hector-examples  )
        • Pelops:  http://github.com/s7/scale7-pelops
        • Kundera  http://code.google.com/p/kundera/
        • Datanucleus JDO:  http://github.com/tnine/Datanucleus-Cassandra-Plugin
      • Grails:
        • grails-cassandra:  https://github.com/wolpert/grails-cassandra
      • .NET:
        • FluentCassandra :  http://github.com/managedfusion/fluentcassandra
        • Aquiles:  http://aquiles.codeplex.com/
      • Ruby:
        • Cassandra:  http://github.com/fauna/cassandra
      • PHP:
        • phpcassa:  http://github.com/thobbs/phpcassa
        • SimpleCassie :  http://code.google.com/p/simpletools-php/wiki/SimpleCassie
  • 5. ... but do not roll your own
  • 6. Thrift
      • Fast, efficient serialization and network IO. 
      • Lots of clients available (you can probably use it in other places as well)
    • Why you don't want to work with the Thrift API directly:
      • SuperColumn
      • ColumnOrSuperColumn
      • ColumnParent.super_column
      • ColumnPath.super_column
      • Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap 
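To give a feel for the last bullet, the sketch below fills a map of that shape by hand. It assumes the nesting is row key, then column family name, then the list of mutations. `Mutation` here is a hypothetical stand-in, much simpler than the Thrift-generated class, and the row key and column names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MutationMapSketch {

    // Hypothetical stand-in for the Thrift-generated Mutation class,
    // which in reality wraps a ColumnOrSuperColumn or a Deletion.
    static final class Mutation {
        final String columnName;
        final String value; // null means "delete this column" in this sketch

        Mutation(String columnName, String value) {
            this.columnName = columnName;
            this.value = value;
        }
    }

    // row key -> column family name -> mutations for that row in that column family
    static void add(Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap,
                    String rowKey, String columnFamily, Mutation mutation) {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(mutation);
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        add(mutationMap, "512202", "Npanxx", new Mutation("city", "Austin"));
        add(mutationMap, "512202", "Npanxx", new Mutation("state", "TX"));
        add(mutationMap, "512202", "Npanxx", new Mutation("obsolete", null));
        System.out.println(mutationMap.size() + " row(s) queued");
    }
}
```

Even with the helper method, every caller has to know the nesting order and the byte encoding of the key, which is the kind of bookkeeping a higher-level client hides.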
  • 7. Higher Level Client
    • Hector
      • JMX Counters
      • Add/remove hosts:
        • automatically 
        • programmatically
        • via JMX
      • Pluggable load balancing
      • Complete encapsulation of Thrift API
      • Type-safe approach to dealing with Apache Cassandra
      • Lightweight ORM (supports JPA 1.0 annotations)
      • Mavenized!  http://repo2.maven.org/maven2/me/prettyprint/
  • 8. "CQL"
      • Currently in Apache Cassandra trunk
      • Experimental
      • Lots of possibilities
    • from test/system/test_cql.py:
    • UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"
    • SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"
    • DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"
  • 9. Avro??
    • Gone. Added too much complexity after Thrift caught up.  
    • "None of the libraries distinguished themselves as being a particularly crappy choice for serialization."
    • (See  CASSANDRA-1765 )
  • 10. Thrift API Methods
    • Retrieving
    • Writing/Removing
    • Meta Information
    • Schema Manipulation
  • 11. Thrift API Methods - Retrieving
    • get: retrieve a single column for a key
    • get_slice: retrieve a "slice" of columns for a key
    • multiget_slice: retrieve a "slice" of columns for a list of keys
    • get_count: count the columns for a key (you have to deserialize the row to do it)
    • get_range_slices: retrieve a slice for a range of keys
    • get_indexed_slices (FTW!)
  • 12. Thrift API Methods - Writing/Removing
    • insert
    • batch_mutate (batch insertion AND deletion)
    • remove
    • truncate**
  • 13. Thrift API Methods - Meta Information
    • describe_cluster_name
    • describe_version
    • describe_keyspace
    • describe_keyspaces
  • 14. Thrift API Methods - Schema
    • system_add_keyspace
    • system_update_keyspace
    • system_drop_keyspace
    • system_add_column_family
    • system_update_column_family
    • system_drop_column_family
  • 15. vs. RDBMS - Consistency Level
    • Consistency is tunable per request!
    • Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
    • *** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***
    • Idempotent: an operation can be applied multiple times without changing the result
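The R + W > N rule above is simple arithmetic, sketched here. `TunableConsistency` is a hypothetical helper, not part of any client API; the quorum size is the usual majority, floor(N / 2) + 1.

```java
final class TunableConsistency {

    // Number of replicas that make up a majority for a given replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every set of read replicas must overlap every set of
    // write replicas, i.e. R + W > N.
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("quorum read + quorum write overlap: " + overlaps(q, q, n));
        System.out.println("read ONE + write ONE overlap: " + overlaps(1, 1, n));
        System.out.println("read ONE + write ALL overlap: " + overlaps(1, n, n));
    }
}
```

With a replication factor of 3, reading and writing at quorum touches 2 + 2 replicas, so at least one replica in every read has seen the latest write. Reading and writing at ONE gives no such guarantee.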
  • 16. vs. RDBMS - Append Only
    • Proper data modelling will minimize seeks
    • (Go to Tyler's presentation for more!)
  • 17. On to the Code...
    • https://github.com/zznate/cassandra-tutorial
    • Uses Maven. 
    • Really basic. 
    • Modify/abuse/alter as needed. 
    • Descriptions of what is going on and how to run each example are in the Javadoc comments. 
    • Sample data is based on North American Numbering Plan
    • http://en.wikipedia.org/wiki/North_American_Numbering_Plan
  • 18. Data Shape
    • 512 202 30.27 097.74 W TX Austin
    • 512 203 30.27 097.74 L TX Austin
    • 512 204 30.32 097.73 W TX Austin
    • 512 205 30.32 097.73 W TX Austin
    • 512 206 30.32 097.73 L TX Austin
  • 19. Get a Single Column for a Key
    • GetCityForNpanxx.java
    • Retrieve a single column with:
    • Name
    • Value
    • Timestamp
    • TTL
  • 20. Get the Contents of a Row
    • GetSliceForNpanxx.java
    • Retrieves a list of columns (Hector wraps these in a ColumnSlice)
    • "SlicePredicate" can be either an explicit set of columns OR a range (more on ranges soon)
    • Another messy either/or choice encapsulated by Hector
  • 21. Get the (sorted!) Columns of a Row 
    • GetSliceForStateCity.java
    • Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)
    • Can be easily modified to return results in reverse order (but this is slightly slower)
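Why the comparator matters can be shown without a cluster. The sketch below sorts the same column names with two comparators, standing in for a text comparator and a long comparator. `ComparatorOrder` and its fields are hypothetical, and a `TreeMap` is only an approximation of how a row keeps its columns ordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class ComparatorOrder {

    // Compare column names as text, character by character.
    static final Comparator<String> AS_TEXT = Comparator.naturalOrder();

    // Compare column names by their numeric value.
    static final Comparator<String> AS_LONG =
            Comparator.comparingLong((String s) -> Long.parseLong(s));

    // Put the names into a sorted "row" and return them in stored order.
    static List<String> storedOrder(List<String> columnNames, Comparator<String> comparator) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : columnNames) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("10", "9", "100");
        System.out.println("as text: " + storedOrder(names, AS_TEXT));
        System.out.println("as long: " + storedOrder(names, AS_LONG));
        System.out.println("reversed: " + storedOrder(names, AS_LONG.reversed()));
    }
}
```

A slice over a range of names only returns what you expect when the comparator matches how you think about the names, so it is worth deciding on it before the first write.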
  • 22. Get the Same Slice from Several Rows
    • MultigetSliceForNpanxx.java
    • Very similar to get_slice examples, except we provide a list of keys
  • 23. Get Slices From a Range of Rows
    • GetRangeSlicesForStateCity.java
    • Like multiget_slice, except we can specify a KeyRange
    • (encapsulated by RangeSlicesQuery#setKeys(start, end))
    • The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)
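The partitioner remark can be illustrated with a rough sketch. It assumes a hash-based partitioner places rows by an MD5-derived token, while an order-preserving one places them by the key itself. `PartitionerOrder` and the sample keys are hypothetical, and real token handling is more involved.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class PartitionerOrder {

    // Token for a key under a hash-based scheme: the MD5 digest as a positive number.
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should always be available", e);
        }
    }

    // Order in which a range scan would walk the keys if placement follows the key.
    static List<String> orderPreserving(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    // Order in which a range scan would walk the keys if placement follows the token.
    static List<String> hashed(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparing(PartitionerOrder::md5Token));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("TX Austin", "TX Dallas", "CA Sacramento", "NY Albany");
        System.out.println("by key:   " + orderPreserving(keys));
        System.out.println("by token: " + hashed(keys));
    }
}
```

In the token-ordered case a start and end key do not bound a meaningful set of neighbours, which is why range queries read so differently under the two partitioners.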
  • 24. Get Slices From a Range of Rows - 2
    • GetSliceForAreaCodeCity.java
    • Compound column name for controlling ranges
    • Comparator at work on text field
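A compound column name turns a range over names into a prefix query. The sketch below assumes a "city:npanxx" naming scheme with ":" as separator; both are hypothetical choices for illustration and may differ from what the tutorial code uses.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class CompoundColumnNames {

    // Return every column whose name starts with the given prefix, relying only
    // on the sorted order of the names (start inclusive, end exclusive).
    static SortedMap<String, String> slice(SortedMap<String, String> row, String prefix) {
        return row.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // One row for area code 512; column names are "city:npanxx" (made-up data).
        SortedMap<String, String> row = new TreeMap<>();
        row.put("Austin:512202", "W");
        row.put("Austin:512203", "L");
        row.put("Bastrop:512303", "L");
        row.put("Round Rock:512218", "W");

        System.out.println(slice(row, "Austin:").keySet());
    }
}
```

Because the names are compared as text, everything for one city sits next to each other in the row, so one range from the prefix to just past it returns the whole group.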
  • 25. Get Slices from Indexed Columns
    • GetIndexedSlicesForCityState.java
    • You only need to index a single column to apply clauses on other columns
    • (BUT: the indexed column must be present with an EQUALS clause!)
    • (It's just another ColumnFamily maintained automatically)
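The shape of an indexed query can be sketched in memory: an equality lookup on the indexed column narrows the candidates, and any further clauses are applied to those rows only. `IndexedSlicesSketch` is a hypothetical model with made-up rows, not Cassandra's index implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

final class IndexedSlicesSketch {

    private final String indexedColumn;
    // row key -> (column name -> value)
    private final Map<String, Map<String, String>> rows = new HashMap<>();
    // value of the indexed column -> row keys holding that value
    private final Map<String, Set<String>> index = new HashMap<>();

    IndexedSlicesSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Store (or replace) a row and keep the index in step with it.
    void insert(String rowKey, Map<String, String> columns) {
        Map<String, String> previous = rows.put(rowKey, new HashMap<>(columns));
        if (previous != null && previous.get(indexedColumn) != null) {
            Set<String> stale = index.get(previous.get(indexedColumn));
            if (stale != null) {
                stale.remove(rowKey);
            }
        }
        String value = columns.get(indexedColumn);
        if (value != null) {
            index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
        }
    }

    // The EQUALS clause on the indexed column is mandatory; other clauses only filter.
    List<String> query(String indexedValue, Predicate<Map<String, String>> otherClauses) {
        List<String> matches = new ArrayList<>();
        for (String rowKey : index.getOrDefault(indexedValue, Collections.<String>emptySet())) {
            if (otherClauses.test(rows.get(rowKey))) {
                matches.add(rowKey);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        IndexedSlicesSketch cf = new IndexedSlicesSketch("state");
        Map<String, String> austin = new HashMap<>();
        austin.put("state", "TX");
        austin.put("city", "Austin");
        cf.insert("512202", austin);
        System.out.println(cf.query("TX", r -> "Austin".equals(r.get("city"))));
    }
}
```

Without the equality clause there is nothing to look up in the index, and the only alternative in this model would be a scan over every row.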
  • 26. Insert, Update and Delete
    • ... are effectively the same operation. 
    • InsertRowsForColumnFamilies.java
    • DeleteRowsForColumnFamily.java
    • Run each in succession (in whichever combination you like) and verify your results on the CLI
    • Hint: watch the timestamps
    • bin/cassandra-cli --host localhost
    • use Tutorial;
    • list AreaCode;
    • list Npanxx;
    • list StateCity;
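The claim that insert, update and delete are effectively the same operation can be modelled as follows: each one writes a timestamped cell, a delete writes a marker instead of a value, and the newest timestamp wins. `WritesAndTombstones` is a hypothetical in-memory sketch. It keeps the existing cell on a timestamp tie, which is a simplification and not a statement about Cassandra's own tie-breaking.

```java
import java.util.HashMap;
import java.util.Map;

final class WritesAndTombstones {

    // A cell is either a value or a deletion marker (tombstone), plus a timestamp.
    static final class Cell {
        final String value; // null marks a tombstone in this sketch
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> columns = new HashMap<>();

    // Insert and update are literally the same call.
    void insert(String name, String value, long timestamp) {
        apply(name, new Cell(value, timestamp));
    }

    // A delete is a write too: it stores a tombstone.
    void delete(String name, long timestamp) {
        apply(name, new Cell(null, timestamp));
    }

    private void apply(String name, Cell incoming) {
        Cell existing = columns.get(name);
        if (existing == null || incoming.timestamp > existing.timestamp) {
            columns.put(name, incoming);
        }
    }

    // Readers see nothing for a column whose newest cell is a tombstone.
    String get(String name) {
        Cell cell = columns.get(name);
        return (cell == null || cell.value == null) ? null : cell.value;
    }

    int storedCells() {
        return columns.size();
    }

    public static void main(String[] args) {
        WritesAndTombstones row = new WritesAndTombstones();
        row.insert("city", "Austin", 1L);
        row.delete("city", 2L);
        row.insert("city", "Austin", 1L); // stale write, loses to the tombstone
        System.out.println("city = " + row.get("city") + ", cells stored = " + row.storedCells());
    }
}
```

This is also why watching the timestamps pays off: a write carrying an older timestamp than an earlier delete silently loses.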
  • 27. Stuff I Punted on for the Sake of Brevity
    • meta_* methods
    • CassandraClusterTest.java: L43-81 @hector
    • system_* methods
    • SchemaManipulation.java @ hector-examples
    • CassandraClusterTest.java: L84-157 @hector
    • ORM (it works and is in production)
    • ORM Documentation
    • multiple nodes
    • failure scenarios
    • Data modelling (go see Tyler's presentation)
  • 28. Things to Remember
      • deletes and timestamp granularity
      • "range ghosts"
      • using the wrong column comparator and InvalidRequestException
      • deletions actually write data
      • use column-level TTL to automate deletion
      • "how do I iterate over all the rows in a column family?"
        • get_range_slices, but don't do that
        • a good sign your data model is wrong
  • 29. Dealing with *Lots* of Data (Briefly)
    • Two biggest headaches have been addressed:
      • Compaction pollutes the OS page cache ( CASSANDRA-1470 )
      • More than 143 million keys in a single SSTable means more Bloom filter (BF) false positives ( CASSANDRA-1555 )
    • Hadoop integration: Yes. (Go see Jeremy's presentation)
    • Bulk loading: Yes.  CASSANDRA-1278
    • For more information:  http://wiki.apache.org/cassandra/LargeDataSetConsiderations
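Why more keys in one SSTable mean more Bloom filter false positives follows from the standard approximation for a filter with m bits, k hash functions and n keys: p ≈ (1 - e^(-k·n/m))^k. The sketch below evaluates it. `BloomFilterMath` is a hypothetical helper, and the sizes in `main` are arbitrary illustration values, not Cassandra's actual limits.

```java
final class BloomFilterMath {

    // Approximate false-positive probability for a Bloom filter with the given
    // number of bits, hash functions and inserted keys: (1 - e^(-k*n/m))^k.
    static double falsePositiveRate(long bits, int hashes, long keys) {
        double exponent = -((double) hashes * keys) / bits;
        return Math.pow(1.0 - Math.exp(exponent), hashes);
    }

    public static void main(String[] args) {
        long bits = 10_000_000L; // fixed filter size (arbitrary)
        int hashes = 7;
        for (long keys = 1_000_000L; keys <= 8_000_000L; keys *= 2) {
            System.out.println(keys + " keys -> " + falsePositiveRate(bits, hashes, keys));
        }
    }
}
```

With the filter size capped, the rate climbs steeply as keys are added, and every false positive sends a read to an SSTable that does not hold the row.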