Introduction to Apache Cassandra for Java Developers (JavaOne)

The database industry has been abuzz over the past year about NoSQL databases. Apache Cassandra, which has quickly emerged as a best-of-breed solution in this space, is used at many companies to achieve unprecedented scale while maintaining streamlined operations.

This presentation goes beyond the hype, buzzwords, and rehashed slides and actually presents the attendees with a hands-on, step-by-step tutorial on how to write a Java application on top of Apache Cassandra. It focuses on concepts such as idempotence, tunable consistency, and shared-nothing clusters to help attendees get started with Apache Cassandra quickly while avoiding common pitfalls.

Published in: Technology, Education
  1. 1. <ul>Apache Cassandra </ul><ul><li>An Introduction for Java Developers </li></ul><ul>Nate McCall [email_address] @zznate </ul>
  2. 2. <ul>What is Apache Cassandra? </ul>
  3. 3. CAP Theorem: Consistency, Availability, Partition Tolerance. “Thou shalt have but 2” - Conjecture made by Eric Brewer in 2000 - Published as formal proof in 2002 - See: http://en.wikipedia.org/wiki/CAP_theorem for more
  4. 4. <ul>Apache Cassandra Concepts </ul>- Explicit choice of partition tolerance and availability. Consistency is tunable. - No read before write - Merge on read - Idempotent - Schema Optional - All nodes share the same role - Still performs well with larger-than-memory data sets
  5. 5. Generally complements other systems (Not intended to be one-size-fits-all) *** You should always use the right tool for the right job anyway
  6. 6. How does this differ from an RDBMS?
  7. 7. How does this differ from an RDBMS? Substantially.
  8. 8. vs. RDBMS - No Joins Unless: - you do them on the client - you do them via Map/Reduce
  9. 9. vs. RDBMS - Schema Optional (Though you can add meta information for validation and type checking) *** Supports secondary indexes too: “ … WHERE state = 'TX' ”
  10. 10. vs. RDBMS - Prematerialized and Transaction-less - No ACID transactions - Limited support for ad-hoc queries
  11. 11. vs. RDBMS - Prematerialized and Transaction-less - No ACID transactions - Limited support for ad-hoc queries *** You are going to give up both of these anyway when you shard an RDBMS ***
  12. 12. <ul>vs. RDBMS - Facilitates Consolidation </ul>It can be your caching layer * Off-heap cache (provided you install JNA) It can be your analytics infrastructure * true map/reduce * pig driver * hive driver coming soon
  13. 13. vs. RDBMS - Shared-Nothing Architecture Every node plays the same role: no masters, no slaves, no special nodes *** No single point of failure
  14. 14. <ul>vs. RDBMS - Real Linear Scalability </ul>Want 2x performance? Add 2x nodes. *** 'No downtime' included!
  15. 15. <ul>vs. RDBMS - Performance </ul>Reads on par with writes
  16. 16. <ul>Clustering </ul>
  17. 17. <ul>Clustering </ul>Single node cluster (easy development setup) - one node owns the whole hash range
  18. 18. <ul>Clustering </ul>Two node cluster - Key range divided between nodes
  19. 19. <ul>Clustering </ul>Consistent Hashing: md5(“zznate”) = “C”
  20. 20. <ul>Clustering </ul>Consistent Hashing FTW: - Ring ownership continuously “gossiped” between nodes - Any node can act as a “coordinator” to service client requests for any key * requests forwarded to the appropriate nodes by coordinator transparently to the client
  21. 21. <ul>Clustering </ul>Client Read: get(“zznate”) md5 = “C”
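A minimal sketch of the ring lookup described on the clustering slides, using only the JDK. The class name `TokenRing`, the node names, and the way node tokens are assigned are assumptions made for illustration; only the use of MD5 to place a key on the ring comes from the slides. Replication and gossip are left out.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring (hypothetical): each node owns the key range that ends at its token. */
class TokenRing {
    // token -> node name, kept sorted so the ring can be walked clockwise
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    /** Hash a key to a non-negative 128-bit token using MD5. */
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1 keeps the token non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Place a node on the ring at the given token. */
    void addNode(String name, BigInteger nodeToken) {
        ring.put(nodeToken, name);
    }

    /** Owner is the first node at or after the key's token, wrapping to the start of the ring. */
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(token(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

With a single node, `ownerOf` returns that node for every key. Adding a node only reassigns the keys that fall in the new node's range, which is why scaling out does not reshuffle the whole data set.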
  22. 22. <ul>Clustering – Scale Out </ul>
  23. 23. <ul>Clustering – Scale Out </ul>
  24. 24. <ul>Clustering – Scale Out </ul>
  25. 25. <ul>Clustering - Multi-DC </ul>
  26. 26. <ul>Clustering - Reliability </ul>
  27. 27. <ul>Clustering - Reliability </ul>
  28. 28. <ul>Clustering - Reliability </ul>
  29. 29. <ul>Clustering - Reliability </ul>
  30. 30. <ul>Clustering - Multi-Datacenter </ul>
  31. 31. <ul>Clustering – Multi-DC Reliability </ul>
  32. 32. <ul>Storage (Briefly) </ul>
  33. 33. <ul>Storage (Briefly) </ul><ul>Understanding the on-disk format is extremely helpful in designing your data model correctly </ul>
  34. 34. <ul>Storage - SSTable </ul><ul>- SSTables are immutable (“Merge on read”) <li>- Newest timestamp wins </li></ul>
  35. 35. <ul>Storage – Compaction </ul><ul>Merges SSTables, keeping their count down so that merge on read stays efficient <li>Discards tombstones (more on this later!) </li></ul>
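The storage slides can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Cassandra's implementation: `Cell` and `MergeOnRead` are hypothetical names, an SSTable is reduced to a map of column name to cell, and a `null` value stands in for a tombstone. Real compaction only drops tombstones once a grace period has passed; that detail is omitted here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One stored version of a column (hypothetical); a null value marks a tombstone. */
final class Cell {
    final String value;
    final long timestamp;

    Cell(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }

    boolean isTombstone() {
        return value == null;
    }
}

final class MergeOnRead {
    /** Merge immutable tables (column name -> cell); the newest timestamp wins per column. */
    static TreeMap<String, Cell> merge(List<Map<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (kept, candidate) -> candidate.timestamp > kept.timestamp ? candidate : kept);
            }
        }
        return merged;
    }

    /** What a reader sees: the merged cells with tombstoned columns hidden. */
    static TreeMap<String, String> read(List<Map<String, Cell>> sstables) {
        TreeMap<String, String> visible = new TreeMap<>();
        for (Map.Entry<String, Cell> e : merge(sstables).entrySet()) {
            if (!e.getValue().isTombstone()) {
                visible.put(e.getKey(), e.getValue().value);
            }
        }
        return visible;
    }

    /** Compaction: rewrite many tables as one and discard tombstones. */
    static Map<String, Cell> compact(List<Map<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = merge(sstables);
        merged.values().removeIf(Cell::isTombstone);
        return merged;
    }
}
```

Because the winner is chosen by timestamp alone, the order in which tables are merged does not change the result, and none of the input tables is modified.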
  36. 36. <ul>Data Model </ul>
  37. 37. <ul>Data Model </ul><ul>&quot;...sparse, persistent, distributed, multi-dimensional sorted map.&quot; <li>(The “Bigtable” paper) </li></ul>
  38. 38. <ul>Data Model </ul><ul>Keyspace <li>- Collection of Column Families
  39. 39. - Controls replication
  40. 40. Column Family
  41. 41. - Similar to a table
  42. 42. - Columns ordered by name </li></ul>
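One way to picture "columns ordered by name" is a map of row keys to sorted maps of columns. The sketch below uses only the JDK; `ToyColumnFamily` and its methods are hypothetical names, not part of Cassandra or Hector, and the comparator is simply Java's natural `String` ordering.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy column family (hypothetical): row key -> columns kept sorted by column name. */
class ToyColumnFamily {
    // Rows are found by key; only the columns inside a row are ordered.
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    /** Insert or overwrite one column of one row. */
    void insert(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    /** Columns of one row with names in [from, to], returned in name order. */
    NavigableMap<String, String> slice(String rowKey, String from, String to) {
        TreeMap<String, String> columns = rows.get(rowKey);
        if (columns == null) {
            return Collections.emptyNavigableMap();
        }
        return columns.subMap(from, true, to, true);
    }
}
```

Because the columns are already sorted, a range of column names within a row is a contiguous slice. That is what makes the column name a useful place to encode the order in which you want to read data back.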
  43. 43. <ul>Data Model – Column Family </ul><ul>Static Column Family <li>- Model my object data
  44. 44. Dynamic Column Family
  45. 45. - Pre-calculated query results
  46. 46. Nothing stopping you from mixing them! </li></ul>
  47. 47. <ul>Data Model – Static CF </ul><ul>zznate </ul><ul>driftx </ul><ul>thobbs </ul><ul>jbellis </ul><ul><li>password : * </li></ul><ul>password : * </ul><ul>password : * </ul><ul>name : Nate </ul><ul>name : Brandon </ul><ul>name : Tyler </ul><ul>password : * </ul><ul>name : Jonathan </ul><ul>site : datastax.com </ul><ul>Users </ul>
  48. 48. <ul>Data Model – Prematerialized Query </ul><ul>Following </ul><ul>zznate </ul><ul>driftx </ul><ul>thobbs </ul><ul>jbellis </ul><ul>driftx: </ul><ul>thobbs: </ul><ul>driftx: </ul><ul>thobbs: </ul><ul>mdennis: </ul><ul>zznate </ul><ul>zznate: </ul><ul>pcmanus </ul><ul>xedin: </ul>
  49. 49. Data Model – Prematerialized Query Additional examples: Timeline of tweets by a user Timeline of tweets by all of the people a user is following List of comments sorted by score List of friends grouped by state
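As a rough illustration of the "timeline of tweets by all of the people a user is following" example, the sketch below computes the query result at write time. Everything here is hypothetical (the `Timelines` class, its method names, and the use of in-memory maps in place of column families); it only shows the shape of the idea. Two tweets delivered to the same reader with the same timestamp would overwrite each other in this simplified version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Fan-out on write (hypothetical): the timeline is built when a tweet is stored. */
class Timelines {
    // user -> users who follow that user
    private final Map<String, Set<String>> followers = new HashMap<>();
    // user -> wide row of (tweet timestamp -> "author: text"), sorted by timestamp
    private final Map<String, TreeMap<Long, String>> timeline = new HashMap<>();

    void follow(String follower, String followed) {
        followers.computeIfAbsent(followed, k -> new HashSet<>()).add(follower);
    }

    /** Write one column into the timeline row of every follower of the author. */
    void tweet(String author, long timestamp, String text) {
        for (String reader : followers.getOrDefault(author, Collections.emptySet())) {
            timeline.computeIfAbsent(reader, k -> new TreeMap<>())
                    .put(timestamp, author + ": " + text);
        }
    }

    /** Reading is a single-row lookup, newest first; nothing is joined at read time. */
    List<String> read(String user) {
        TreeMap<Long, String> row = timeline.get(user);
        return row == null ? new ArrayList<>() : new ArrayList<>(row.descendingMap().values());
    }
}
```

The trade is more writes (one per follower) in exchange for a read that touches a single row.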
  50. 50. <ul>API Operations </ul>
  51. 51. Five general categories <ul>Retrieving Writing/Updating/Removing (all the same op!) <ul>Increment counters </ul>Meta Information Schema Manipulation CQL Execution </ul>
  52. 52. Using a Client Hector Client: http://hector-client.org - Most popular Java client - In use at very large installations - A number of tools and utilities built on top - Very active community - MIT Licensed *** like any open source project fully dependent on another open source project, it has its warts
  53. 53. <ul>Sample Project for Experimenting </ul>https://github.com/zznate/cassandra-tutorial https://github.com/zznate/hector-examples Built using Hector Really basic – designed to be beginner level w/ very few moving parts Modify/abuse/alter as needed *** Descriptions of what is going on and how to run each example are in the Javadoc comments. 
  54. 54. <ul>ColumnFamilyTemplate </ul>Familiar, type-safe approach - based on template-method design pattern - generic: ColumnFamilyTemplate<K,N> (K is the key type, N the column name type) ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get()); *** (no generics for clarity)
  55. 55. <ul>ColumnFamilyTemplate </ul>new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get()); Key Format Column Name Format - Cassandra calls this a “comparator” - Remember: defines column order in on-disk format
  56. 56. <ul>ColumnFamilyTemplate </ul>ColumnFamilyResult<String, String> res = template.queryColumns(&quot;zznate&quot;); String value = res.getString(&quot;email&quot;); Date startDate = res.getDate(&quot;startDate&quot;); Key Format Column Name Format
  57. 57. <ul>ColumnFamilyTemplate </ul>ColumnFamilyResult wrapper = template.queryColumns(&quot;zznate&quot;, &quot;patricioe&quot;, &quot;thobbs&quot;); String nateEmail = wrapper.getString(&quot;email&quot;); wrapper.next(); String patoEmail = wrapper.getString(&quot;email&quot;); wrapper.next(); String tylerEmail = wrapper.getString(&quot;email&quot;); Querying multiple rows and iterating over results
  58. 58. <ul>ColumnFamilyTemplate </ul>ColumnFamilyUpdater updater = template.createUpdater(&quot;zznate&quot;); updater.setString(&quot;companyName&quot;,&quot;DataStax&quot;); updater.addKey(&quot;sergek&quot;); updater.setString(&quot;companyName&quot;,&quot;PrestoSports&quot;); template.update(updater); Inserting data with ColumnFamilyUpdater
  59. 59. <ul>ColumnFamilyTemplate </ul>template.deleteColumn(&quot;zznate&quot;, &quot;notNeededStuff&quot;); template.deleteColumn(&quot;zznate&quot;, &quot;somethingElse&quot;); template.deleteColumn(&quot;patricioe&quot;, &quot;aDifferentColumnName&quot;); ... template.deleteRow(&quot;someuser&quot;); template.executeBatch(); Deleting Data with ColumnFamilyTemplate
  60. 60. <ul>Deletion </ul>
  61. 61. <ul>Deletion </ul><ul><li>Again: Every mutation is an insert!
  62. 62. - Merge on read
  63. 63. - SSTables are immutable
  64. 64. - Highest timestamp wins </li></ul>
  65. 65. <ul>Deletion – As Seen by CLI </ul><ul>[default@Tutorial] list StateCity; <li>Using default limit of 100
  66. 66. -------------------
  67. 67. RowKey: CA Burlingame
  68. 68. => (column=650, value=33372e3537783132322e3334, timestamp=1310340410528000)
  69. 69. -------------------
  70. 70. RowKey: TX Austin
  71. 71. => (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)
  72. 72. => (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)
  73. 73. => (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)
  74. 74. => (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)
  75. 75. => (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000) </li></ul>
  76. 76. <ul>Deletion – As Seen by CLI </ul><ul>[default@Tutorial] list StateCity; <li>Using default limit of 100
  77. 77. -------------------
  78. 78. RowKey: CA Burlingame
  79. 79. -------------------
  80. 80. RowKey: TX Austin
  81. 81. => (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)
  82. 82. => (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)
  83. 83. => (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)
  84. 84. => (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)
  85. 85. => (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000) </li></ul>
  86. 86. <ul>Deletion – FYI </ul><ul>Sending a deletion for a non-existing row: </ul><ul>mutator.addDeletion(&quot;202230&quot;, &quot;Npanxx&quot;, &quot;city&quot;, stringSerializer); </ul><ul>Does not exist? You just inserted a tombstone! </ul><ul>[default@Tutorial] list Npanxx; <li>Using default limit of 100
  87. 87. . . .
  88. 88. -------------------
  89. 89. RowKey: 202230
  90. 90. -------------------
  91. 91. . . . </li></ul>
  92. 92. <ul>Integrating with existing patterns </ul>
  93. 93. <ul>Integrating with existing patterns </ul><ul><li>“ Yes.” </li></ul>
  94. 94. <ul>Integrating with existing patterns </ul><ul><li><bean id=&quot;cassandraHostConfigurator&quot;
  95. 95. class=&quot;me.prettyprint.cassandra.service.CassandraHostConfigurator&quot;>
  96. 96. <constructor-arg value=&quot;localhost:9170&quot;/>
  97. 97. </bean>
  98. 98. <bean id=&quot;cluster&quot; class=&quot;me.prettyprint.cassandra.service.ThriftCluster&quot;>
  99. 99. <constructor-arg value=&quot;TestCluster&quot;/>
  100. 100. <constructor-arg ref=&quot;cassandraHostConfigurator&quot;/>
  101. 101. </bean>
  102. 102. <bean id=&quot;consistencyLevelPolicy&quot; class=&quot;me.prettyprint.cassandra.model.ConfigurableConsistencyLevel&quot;>
  103. 103. <property name=&quot;defaultReadConsistencyLevel&quot; value=&quot;ONE&quot;/>
  104. 104. </bean>
  105. 105. <bean id=&quot;keyspaceOperator&quot; class=&quot;me.prettyprint.hector.api.factory.HFactory&quot;
  106. 106. factory-method=&quot;createKeyspace&quot;>
  107. 107. <constructor-arg value=&quot;Keyspace1&quot;/>
  108. 108. <constructor-arg ref=&quot;cluster&quot;/>
  109. 109. <constructor-arg ref=&quot;consistencyLevelPolicy&quot;/>
  110. 110. </bean>
  111. 111. <bean id=&quot;simpleCassandraDao&quot; class=&quot;me.prettyprint.cassandra.dao.SimpleCassandraDao&quot;>
  112. 112. <property name=&quot;keyspace&quot; ref=&quot;keyspaceOperator&quot;/>
  113. 113. <property name=&quot;columnFamilyName&quot; value=&quot;Standard1&quot;/>
  114. 114. </bean> </li></ul>
  115. 115. <ul>Integrating with existing patterns </ul><ul><li>Hector Object Mapper:
  116. 116. https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29
  117. 117. Hector JPA:
  118. 118. https://github.com/riptano/hector-jpa </li></ul>
  119. 119. <ul>Integrating with existing patterns </ul><ul><li>CQL: JDBC Driver and Pool in 1.0!
  120. 120. JdbcTemplate FTW! </li></ul>
  121. 121. <ul>Development Resources </ul>Hector Documentation http://hector-client.org <ul><li>Cassandra Maven Plugin http://mojo.codehaus.org/cassandra-maven-plugin/
  122. 122. CCM localhost cassandra cluster https://github.com/pcmanus/ccm
  123. 123. OpsCenter http://www.datastax.com/products/opscenter </li></ul><ul>Cassandra AMIs https://github.com/riptano/CassandraClusterAMI </ul>
  124. 124. <ul>Putting it Together </ul>
  125. 125. <ul>Take control of consistency </ul><ul><li>If you do need a high degree of consistency, use thresholds to trigger different behavior
  126. 126. - Bank account:
  127. 127. “on values over $10,000, wait to hear from all replicas”
  128. 128. - Distributed Shopping Cart:
  129. 129. Show a confirmation page to verify order resolution
  130. 130. *** What is your appetite for risk? </li></ul>
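The bank-account rule above can be sketched with a little arithmetic. The level names ONE, QUORUM and ALL match Cassandra's consistency levels, and a quorum is a majority of the replicas. The types `Level` and `ConsistencyChooser`, and the choice of QUORUM for amounts at or below the threshold, are assumptions made for this illustration; the slide only says what to do above $10,000.

```java
/** Toy consistency levels (hypothetical type) and the replica counts they wait for. */
enum Level {
    ONE, QUORUM, ALL;

    /** Number of replicas that must respond, given the replication factor. */
    int required(int replicationFactor) {
        switch (this) {
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1; // a majority of replicas
            default:
                return replicationFactor;         // ALL
        }
    }
}

class ConsistencyChooser {
    /** A read is guaranteed to overlap the latest write when R + W > N. */
    static boolean readsSeeLatestWrite(Level read, Level write, int replicationFactor) {
        return read.required(replicationFactor) + write.required(replicationFactor)
                > replicationFactor;
    }

    /** Threshold rule: wait for every replica on large amounts (QUORUM otherwise is assumed). */
    static Level forTransfer(long amountInDollars) {
        return amountInDollars > 10_000 ? Level.ALL : Level.QUORUM;
    }
}
```

With a replication factor of 3, reading and writing at QUORUM waits for 2 replicas each, so every read overlaps the latest write; reading and writing at ONE does not give that guarantee.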
  131. 131. Uniquely identify operations in the application <ul><li>Facilitates idempotent behavior and out-of-order execution </li></ul>
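A small sketch of why a unique operation id helps. `IdempotentLedger` is a hypothetical class, and the in-memory map stands in for a row whose column names are operation ids. Each deposit is stored under its id, so a retry overwrites the same entry with the same value instead of adding a second one.

```java
import java.util.HashMap;
import java.util.Map;

/** Account whose deposits are keyed by an operation id chosen by the application (hypothetical). */
class IdempotentLedger {
    // operation id -> amount; re-inserting the same id overwrites it with the same value
    private final Map<String, Long> deposits = new HashMap<>();

    /** Safe to retry and to apply in any order: the id, not the arrival order, identifies the write. */
    void applyDeposit(String operationId, long amountInCents) {
        deposits.put(operationId, amountInCents);
    }

    /** The balance is derived by summing the recorded operations. */
    long balance() {
        long total = 0;
        for (long amount : deposits.values()) {
            total += amount;
        }
        return total;
    }
}
```

Compare this with a bare "add 500 to the balance" operation: if the client times out and retries, that version counts the deposit twice, while the keyed version ends up in the same state.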
  132. 132. <ul>Denormalization </ul><ul><li>The point of normalization is to avoid update anomalies
  133. 133. *** But in an append-only system, we don't do updates </li></ul>
  134. 134. <ul>Summary </ul><ul><li>- Take advantage of strengths
  135. 135. - Look for idempotence and asynchronicity in your business processes
  136. 136. - If it's not in the API, you are probably doing it wrong
  137. 137. - Seek death is still possible if you model incorrectly </li></ul>
  138. 138. <ul>Questions </ul><ul>Nate McCall [email_address] @zznate </ul>
  139. 139. <ul>Additional Resources </ul><ul><li>DataStax Documentation: http://www.datastax.com/docs/0.8/index
  140. 140. Apache Cassandra project wiki: http://wiki.apache.org/cassandra/
  141. 141. “ The Dynamo Paper”
  142. 142. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  143. 143. P. Helland. Building on Quicksand
  144. 144. http://arxiv.org/pdf/0909.1788
  145. 145. P. Helland. Life Beyond Distributed Transactions
  146. 146. http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
  147. 147. S. Anand. “Netflix's Transition to High-Availability Storage Systems”
  148. 148. http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf
  149. 149. “ The Megastore Paper”
  150. 150. http://research.google.com/pubs/archive/36971.pdf </li></ul>