Nyc summit intro_to_cassandra
Introduction to Apache Cassandra from Cassandra NYC summit

Published in: Technology, Education
  • Transcript

    • 1.
        Java, Big Data and Apache Cassandra
        Nate McCall [email_address] @zznate
    • 2.
        Apache Cassandra: Origins in big data
    • 3.
        Apache Cassandra: Origins in big data
    • 4. But first... the CAP Theorem
      Consistency, Availability, Partition Tolerance
      “Thou shalt have but 2”
      - Conjecture made by Eric Brewer in 2000
      - Published as formal proof in 2002
      - See: http://en.wikipedia.org/wiki/CAP_theorem for more
    • 5. CAP Theorem: Cassandra Style
      - Explicit choice of partition tolerance and availability
      - Opt for more consistency at the cost of availability
      Consistency is tunable (per operation)
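A minimal sketch of what "tunable per operation" can mean in practice. It relies on the general quorum rule that a read overlaps the latest write when the number of replicas read plus the number written exceeds the replication factor. That rule is not stated on the slide, and the class and method names are hypothetical.

```java
// Sketch only: illustrates the replica-overlap rule, not Cassandra's implementation.
final class TunableConsistency {
    /** Majority of replicas, as a QUORUM-style level would require. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every set of read replicas must share a node with every set of write replicas. */
    static boolean readsOverlapWrites(int replicationFactor, int writeReplicas, int readReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3; // assumed replication factor
        System.out.println("ONE/ONE overlap: " + readsOverlapWrites(rf, 1, 1));
        System.out.println("QUORUM/QUORUM overlap: "
                + readsOverlapWrites(rf, quorum(rf), quorum(rf)));
    }
}
```

Waiting for fewer replicas keeps an operation available when nodes are down; waiting for more trades that availability for consistency.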
    • 6.
        Apache Cassandra Concepts
      - No read before write
      - Merge on read
      - Idempotent
      - Schema optional
      - All nodes share the same role
      - Still performs well with larger-than-memory data sets
    • 7. Generally complements other systems
      (Not intended to be one-size-fits-all)
      *** You should always use the right tool for the job anyway
    • 8. How does this differ from an RDBMS?
    • 9. How does this differ from an RDBMS? Substantially.
    • 10. vs. RDBMS - No Joins
      Unless:
      - you do them on the client
      - you do them via Map/Reduce
    • 11. vs. RDBMS - Schema Optional
      (Though you can add meta information for validation and type checking)
      *** Supports secondary indexes too: “… WHERE state = 'TX'”
    • 12. vs. RDBMS - Prematerialized and Transaction-less
      - No ACID transactions
      - Limited support for ad-hoc queries
    • 13. vs. RDBMS - Prematerialized and Transaction-less
      - No ACID transactions
      - Limited support for ad-hoc queries
      *** You are going to give up both of these anyway when you shard an RDBMS ***
    • 14.
        vs. RDBMS - Facilitates Consolidation
      It can be your caching layer
      * Off-heap cache (provided you install JNA)
      It can be your analytics infrastructure
      * true map/reduce
      * pig driver
      * hive driver coming soon
    • 15. vs. RDBMS - Shared-Nothing Architecture
      Every node plays the same role: no masters, no slaves, no special nodes
      *** No single point of failure
    • 16.
        vs. RDBMS - Real Linear Scalability
      Want 2x performance? Add 2x nodes (with no downtime!)
    • 17.
        vs. RDBMS - Performance
      Reads on par with writes
    • 18.
        Clustering
    • 19.
        Clustering
      Consistent Hashing FTW:
      - No fancy shard logic or tedious management of such required
      - Ring ownership continuously “gossiped” between nodes
      - Any node can act as a “coordinator” to service client requests for any key
        * requests forwarded to the appropriate nodes by coordinator transparently to the client
    • 20.
        Clustering
      Single node cluster (easy development setup) - one node owns the whole hash range
    • 21.
        Clustering
      Two node cluster - Key range divided between nodes
    • 22.
        Clustering
      Consistent Hashing: md5(“zznate”) = “C”
    • 23.
        Clustering: The Client's Perspective
      Client Read: get(“zznate”) md5 = “C”
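The clustering slides above can be sketched as a small hash ring. This is an illustration under stated assumptions, not Cassandra's partitioner: nodes sit at evenly spaced tokens, the MD5 digest is read as an unsigned 128-bit number, and the class and method names are hypothetical. Which node a particular key lands on depends on those assumptions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class HashRing {
    // Token space of an MD5 digest read as an unsigned number: [0, 2^128).
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(128);

    private final TreeMap<BigInteger, String> nodesByToken = new TreeMap<>();

    /** Places the nodes at evenly spaced tokens (an assumption made for this sketch). */
    HashRing(String... nodeNames) {
        if (nodeNames.length == 0) {
            throw new IllegalArgumentException("a ring needs at least one node");
        }
        for (int i = 0; i < nodeNames.length; i++) {
            BigInteger token = RING_SIZE.multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(nodeNames.length));
            nodesByToken.put(token, nodeNames[i]);
        }
    }

    /** Hashes a row key onto the ring. */
    static BigInteger token(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat the digest as unsigned
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** The owner is the first node at or after the key's token, wrapping around the ring. */
    String ownerOf(String key) {
        Map.Entry<BigInteger, String> entry = nodesByToken.ceilingEntry(token(key));
        return (entry != null ? entry : nodesByToken.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing("A", "B", "C", "D");
        System.out.println("zznate is owned by node " + ring.ownerOf("zznate"));
    }
}
```

Because every node can compute the same mapping, any node can act as coordinator and forward a request to the owner without the client knowing the ring layout.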
    • 24.
        Clustering – Scale Out
    • 25.
        Clustering – Scale Out
    • 26.
        Clustering – Scale Out
    • 27.
        Clustering - Multi-DC
    • 28.
        Clustering - Reliability
    • 29.
        Clustering - Reliability
    • 30.
        Clustering - Reliability
    • 31.
        Clustering - Reliability
    • 32.
        Clustering - Multi-Datacenter
    • 33.
        Clustering – Multi-DC Reliability
    • 34.
        Storage (Briefly)
    • 35.
        Storage (Briefly)
        Understanding the on-disk format is extremely helpful in designing your data model correctly
    • 36.
        Storage - SSTable
        - SSTables are immutable (“Merge on read”)
        - Newest timestamp wins
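A rough sketch of "merge on read" with "newest timestamp wins". It models each SSTable as an in-memory map of column name to cell, which is a simplification for illustration; the `Cell` type and `readRow` method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; real SSTables are on-disk files, not in-memory maps.
final class MergeOnRead {
    /** One stored version of a column: its value and the timestamp of the write. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Merges one row's columns from several immutable tables; the newest timestamp wins. */
    static Map<String, Cell> readRow(List<Map<String, Cell>> sstables) {
        Map<String, Cell> merged = new TreeMap<>(); // columns come back ordered by name
        for (Map<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> column : sstable.entrySet()) {
                merged.merge(column.getKey(), column.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> older = new HashMap<>();
        older.put("price", new Cell("589.55", 1L));
        older.put("name", new Cell("Google", 1L));
        Map<String, Cell> newer = new HashMap<>();
        newer.put("price", new Cell("590.49", 2L));
        Map<String, Cell> row = readRow(Arrays.asList(newer, older));
        System.out.println("price = " + row.get("price").value);
    }
}
```

The result does not depend on the order in which the tables are visited, which is why a write never needs to read or modify existing files.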
    • 37.
        Storage – Compaction
        - Merges SSTables, keeping their count down so that merge on read stays efficient
        - Discards tombstones (more on this later!)
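A sketch of the compaction idea under simplifying assumptions: tables are in-memory maps, a `null` value stands for a tombstone, and the `purgeBefore` cutoff is a hypothetical stand-in for the grace period Cassandra applies before a tombstone may be dropped (`gc_grace_seconds`). All names are illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; not Cassandra's compaction code.
final class Compaction {
    /** A column version; a null value marks a tombstone (a recorded deletion). */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /**
     * Rewrites several tables as one. For each column only the newest version
     * survives, and tombstones written before purgeBefore are dropped entirely.
     */
    static Map<String, Cell> compact(List<Map<String, Cell>> sstables, long purgeBefore) {
        Map<String, Cell> newest = new TreeMap<>();
        for (Map<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> column : sstable.entrySet()) {
                newest.merge(column.getKey(), column.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        newest.values().removeIf(cell -> cell.isTombstone() && cell.timestamp < purgeBefore);
        return newest;
    }

    public static void main(String[] args) {
        Map<String, Cell> first = new HashMap<>();
        first.put("GOOG", new Cell("30", 1L));
        Map<String, Cell> second = new HashMap<>();
        second.put("GOOG", new Cell(null, 5L)); // deletion recorded later
        Map<String, Cell> compacted = compact(Arrays.asList(first, second), 10L);
        System.out.println("columns left after compaction: " + compacted.keySet());
    }
}
```

After compaction a read has one table to consult instead of several, and the deleted column no longer occupies space.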
    • 38.
        Data Model
    • 39.
        Data Model
        "...sparse, persistent, distributed, multi-dimensional sorted map."
        (The “Bigtable” paper)
    • 40.
        Data Model
        Keyspace
        - Collection of Column Families
        - Controls replication
        Column Family
        - Similar to a table
        - Columns ordered by name
    • 45.
        Data Model – Column Family
        Static Column Family
        - Model my object data
        Dynamic Column Family
        - Pre-calculated query results
        Nothing stopping you from mixing them!
    • 49.
        Data Model – Static CF
        Stocks
        GOOG    name: Google     price: 589.55
        AAPL    name: Apple      price: 401.76
        NFLX    name: Netflix    price: 78.73
        NOK     exchange: NYSE   name: Nokia   price: 6.90
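One way to picture a static column family is as a map of row keys to sorted maps of columns. The sketch below is an in-memory analogy only, with hypothetical class and method names; it shows columns coming back ordered by name and rows carrying different column sets.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory analogy of a column family; not a client API.
final class ColumnFamilySketch {
    // row key -> (column name -> value); the TreeMap keeps columns ordered by name
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void insert(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, String>()).put(columnName, value);
    }

    /** Returns the row's columns in name order, or an empty map for an unknown row. */
    SortedMap<String, String> get(String rowKey) {
        SortedMap<String, String> columns = rows.get(rowKey);
        return columns != null ? columns : new TreeMap<String, String>();
    }

    public static void main(String[] args) {
        ColumnFamilySketch stocks = new ColumnFamilySketch();
        stocks.insert("NOK", "price", "6.90");
        stocks.insert("NOK", "name", "Nokia");
        stocks.insert("NOK", "exchange", "NYSE");
        stocks.insert("GOOG", "price", "589.55");
        stocks.insert("GOOG", "name", "Google");
        System.out.println("NOK columns: " + stocks.get("NOK").keySet());
        System.out.println("GOOG columns: " + stocks.get("GOOG").keySet());
    }
}
```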
    • 50.
        Data Model – Prematerialized Query
        StockHist
        GOOG    10/25/2011: 583.16   10/24/2011: 596.42   10/21/2011: 590.49
        AAPL    10/25/2011: 397.77   10/24/2011: 405.77   10/21/2011: 392.87
        NFLX    10/25/2011: 77.37    10/24/2011: 118.84   10/21/2011: 117.04
        NOK     10/25/2011: 6.71     10/24/2011: 6.76     10/21/2011: 6.61
    • 51. Data Model – Prematerialized Query
      Additional examples:
      - Timeline of tweets by a user
      - Timeline of tweets by all of the people a user is following
      - List of comments sorted by score
      - List of friends grouped by state
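A sketch of the prematerialized-query idea, using a price history keyed by ticker as the row and by date as the column name. The answer to "prices for this ticker between two dates" is laid out at write time, so the read is a single ordered slice of one row. This is an in-memory analogy with hypothetical names, not a client API.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory analogy of a dynamic column family; not a client API.
final class StockHistory {
    // row key (ticker) -> columns keyed by date, kept sorted by the map
    private final Map<String, NavigableMap<LocalDate, Double>> rows = new HashMap<>();

    /** Write path: each closing price becomes one more column in the ticker's row. */
    void recordClose(String ticker, LocalDate day, double price) {
        rows.computeIfAbsent(ticker, k -> new TreeMap<LocalDate, Double>()).put(day, price);
    }

    /** Read path: one contiguous slice of a single row, newest day first. */
    NavigableMap<LocalDate, Double> slice(String ticker, LocalDate from, LocalDate to) {
        NavigableMap<LocalDate, Double> columns = rows.get(ticker);
        if (columns == null) {
            return new TreeMap<LocalDate, Double>();
        }
        return columns.subMap(from, true, to, true).descendingMap();
    }

    public static void main(String[] args) {
        StockHistory history = new StockHistory();
        history.recordClose("GOOG", LocalDate.of(2011, 10, 21), 590.49);
        history.recordClose("GOOG", LocalDate.of(2011, 10, 24), 596.42);
        history.recordClose("GOOG", LocalDate.of(2011, 10, 25), 583.16);
        System.out.println(history.slice("GOOG",
                LocalDate.of(2011, 10, 24), LocalDate.of(2011, 10, 25)));
    }
}
```

The same layout fits the timeline and sorted-list examples: pick the row key to match the query and the column name to match the sort order.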
    • 52.
        API Operations
    • 53. Five general categories
        - Retrieving
        - Writing/Updating/Removing (all the same op!)
          * Increment counters
        - Meta Information
        - Schema Manipulation
        - CQL Execution
    • 54. Big Data Fun and Hijinks
        - Hadoop integration
        - Pig integration
        - Hive integration
          * open source version coming soon
          * available in DataStax Enterprise
    • 55. Big Data: Map/Reduce Integration
      Cassandra implementations of:
      - InputFormat and OutputFormat
      - RecordReader and RecordWriter
      - InputSplit for Column Families
      *** See org.apache.cassandra.hadoop package and examples for more
    • 56. Big Data: Pig Integration
      grunt> name_group = GROUP score_data BY name PARALLEL 3;
      grunt> name_total = FOREACH name_group GENERATE group, COUNT(score_data.name), LongSum(score_data.score) AS total_score;
      grunt> ordered_scores = ORDER name_total BY total_score DESC PARALLEL 3;
      grunt> DUMP ordered_scores;
    • 57. Using a Client
      Hector Client: http://hector-client.org
      - Most popular Java client
      - In use at very large installations
      - A number of tools and utilities built on top
      - Very active community
      - MIT Licensed
      *** like any open source project fully dependent on another open source project, it has its warts
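For readers who do not know Pig, the script groups score records by name, counts and sums each group, and orders the groups by total score, highest first. The sketch below performs the same computation in memory with Java streams. It only illustrates what the script computes, not how Pig or Hadoop execute it, and the `Score` and `NameTotal` types are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of what the Pig script computes; not how Pig or Hadoop run it.
final class ScoreTotals {
    static final class Score {
        final String name;
        final long score;

        Score(String name, long score) {
            this.name = name;
            this.score = score;
        }
    }

    static final class NameTotal {
        final String name;
        final long count;
        final long totalScore;

        NameTotal(String name, long count, long totalScore) {
            this.name = name;
            this.count = count;
            this.totalScore = totalScore;
        }
    }

    static List<NameTotal> orderedScores(List<Score> scoreData) {
        // GROUP score_data BY name, then COUNT and LongSum per group
        Map<String, LongSummaryStatistics> statsByName = scoreData.stream()
                .collect(Collectors.groupingBy(
                        (Score s) -> s.name,
                        Collectors.summarizingLong((Score s) -> s.score)));
        // ORDER name_total BY total_score DESC
        return statsByName.entrySet().stream()
                .map(e -> new NameTotal(e.getKey(), e.getValue().getCount(), e.getValue().getSum()))
                .sorted(Comparator.comparingLong((NameTotal t) -> t.totalScore).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Score> data = Arrays.asList(
                new Score("ann", 5), new Score("bob", 10), new Score("ann", 7));
        for (NameTotal total : orderedScores(data)) {
            System.out.println(total.name + " " + total.count + " " + total.totalScore);
        }
    }
}
```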
    • 58.
        Sample Project for Experimenting
      https://github.com/zznate/cassandra-tutorial
      https://github.com/zznate/hector-examples
      - Built using Hector
      - Really basic: designed to be beginner level w/ very few moving parts
      - Modify/abuse/alter as needed
      *** Descriptions of what is going on and how to run each example are in the Javadoc comments.
    • 59.
        Hector: ColumnFamilyTemplate
      Familiar, type-safe approach
      - based on template-method design pattern
      - generic: ColumnFamilyTemplate<K,N> (K is the key type, N the column name type)
      ColumnFamilyTemplate template =
          new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName,
                                         StringSerializer.get(), StringSerializer.get());
      *** (no generics for clarity)
    • 60.
        Hector: ColumnFamilyTemplate
      new ThriftColumnFamilyTemplate(keyspaceName,
                                     columnFamilyName,
                                     StringSerializer.get(),  // key format
                                     StringSerializer.get()); // column name format
      Column name format:
      - Cassandra calls this a “comparator”
      - Remember: defines column order in on-disk format
    • 61. Hector: ColumnFamilyTemplate
      ColumnFamilyResult<String, String> res = cft.queryColumns("zznate");
      String value = res.getString("email");
      Date startDate = res.getDate("startDate");
      Type parameters: key format, column name format
    • 62. Hector: ColumnFamilyTemplate
      Querying multiple rows
      ColumnFamilyResult wrapper = template.queryColumns("zznate", "patricioe", "thobbs");
      while (wrapper.hasNext()) {
        emails.put(wrapper.getKey(), wrapper.getString("email"));
        ...
    • 63. Hector: ColumnFamilyTemplate
      Iterating over results
      ColumnFamilyResult wrapper = template.queryColumns("zznate", "patricioe", "thobbs");
      while (wrapper.hasNext()) {
        emails.put(wrapper.getKey(), wrapper.getString("email"));
        ...
    • 64. Hector: ColumnFamilyTemplate
      Insert: Creating an updater for a key
      ColumnFamilyUpdater updater = template.createUpdater("zznate");
      updater.setString("companyName", "DataStax");
      updater.addKey("sergek");
      updater.setString("companyName", "PrestoSports");
      template.update(updater);
    • 65. Hector: ColumnFamilyTemplate
      Insert: Adding Multiple Rows
      ColumnFamilyUpdater updater = template.createUpdater("zznate");
      updater.setString("companyName", "DataStax");
      updater.addKey("sergek");
      updater.setString("companyName", "PrestoSports");
      template.update(updater);
    • 66. Hector: ColumnFamilyTemplate
      Insert: Invoking Batch Execution
      ColumnFamilyUpdater updater = template.createUpdater("zznate");
      updater.setString("companyName", "DataStax");
      updater.addKey("sergek");
      updater.setString("companyName", "PrestoSports");
      template.update(updater);
    • 67. Hector: ColumnFamilyTemplate
      Deleting Data: Single Column
      template.deleteColumn("zznate", "notNeededStuff");
      template.deleteColumn("zznate", "somethingElse");
      template.deleteColumn("patricioe", "aDifferentColumnName");
      ...
      template.deleteRow("someuser");
      template.executeBatch();
    • 68. Hector: ColumnFamilyTemplate
      Deleting Data: Whole Row
      template.deleteColumn("zznate", "notNeededStuff");
      template.deleteColumn("zznate", "somethingElse");
      template.deleteColumn("patricioe", "aDifferentColumnName");
      ...
      template.deleteRow("someuser");
      template.executeBatch();
    • 69.
        Deletion
    • 70.
        Deletion
        Again: Every mutation is an insert!
        - Merge on read
        - SSTables are immutable
        - Highest timestamp wins
    • 74.
        Deletion – As Seen by CLI
        [default@Tutorial] list Portfolio;
        Using default limit of 100
        -------------------
        RowKey: 12783
        => (column=GOOG, value=30, timestamp=1310340410528000)
        -------------------
        RowKey: 15736
        => (column=AAPL, value=20, timestamp=1310143852392000)
        => (column=NOK, value=90, timestamp=1310143852444000)
        => (column=IBM, value=50, timestamp=1310143852448000)
        => (column=GOOG, value=5, timestamp=1310143852453000)
        => (column=INTC, value=200, timestamp=1310143852457000)
    • 85.
        Deletion – As Seen by CLI
        [default@Tutorial] list Portfolio;
        Using default limit of 100
        -------------------
        RowKey: 12783
        -------------------
        RowKey: 15736
        => (column=AAPL, value=20, timestamp=1310143852392000)
        => (column=NOK, value=90, timestamp=1310143852444000)
        => (column=IBM, value=50, timestamp=1310143852448000)
        => (column=GOOG, value=5, timestamp=1310143852453000)
        => (column=INTC, value=200, timestamp=1310143852457000)
    • 95.
        Deletion – FYI
        Sending a deletion for a non-existing row:
        mutator.addDeletion("14100", "INTC", 75, stringSerializer);
        Does not exist? You just inserted a tombstone!
        [default@Tutorial] list Portfolio;
        Using default limit of 100
        . . .
        -------------------
        RowKey: 14100
        -------------------
        . . .
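The behavior on the deletion slides can be sketched with a small in-memory store in which a delete is just another timestamped write. This is a simplified model with hypothetical names, not Hector or Cassandra code. It shows why deleting from a row that never existed still leaves a row key that a listing will show.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified in-memory model of deletes as writes; not Hector or Cassandra code.
final class TombstoneSketch {
    /** A column version; a null value marks a tombstone. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> column name -> winning cell
    private final Map<String, Map<String, Cell>> rows = new TreeMap<>();

    void insert(String rowKey, String column, String value, long timestamp) {
        write(rowKey, column, new Cell(value, timestamp));
    }

    /** A delete stores a tombstone, whether or not any data existed beforehand. */
    void delete(String rowKey, String column, long timestamp) {
        write(rowKey, column, new Cell(null, timestamp));
    }

    private void write(String rowKey, String column, Cell cell) {
        // Highest timestamp wins; ties go to the later write here, a simplification.
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, Cell>())
                .merge(column, cell, (old, neu) -> neu.timestamp >= old.timestamp ? neu : old);
    }

    /** Columns a reader sees: tombstoned columns are hidden. */
    Map<String, String> liveColumns(String rowKey) {
        Map<String, String> live = new TreeMap<>();
        Map<String, Cell> columns = rows.get(rowKey);
        if (columns != null) {
            for (Map.Entry<String, Cell> e : columns.entrySet()) {
                if (e.getValue().value != null) {
                    live.put(e.getKey(), e.getValue().value);
                }
            }
        }
        return live;
    }

    /** Row keys a listing would show, including rows that hold only tombstones. */
    Set<String> listedRowKeys() {
        return rows.keySet();
    }

    public static void main(String[] args) {
        TombstoneSketch portfolio = new TombstoneSketch();
        portfolio.delete("14100", "INTC", 10L); // the row did not exist before
        System.out.println("rows listed: " + portfolio.listedRowKeys());
        System.out.println("live columns: " + portfolio.liveColumns("14100"));
    }
}
```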
    • 101.
        Integrating with existing patterns
    • 102.
        Integrating with existing patterns
        “Yes.”
    • 103.
        Integrating with existing patterns
        <bean id="cassandraHostConfigurator"
              class="me.prettyprint.cassandra.service.CassandraHostConfigurator">
          <constructor-arg value="localhost:9170"/>
        </bean>
        <bean id="cluster" class="me.prettyprint.cassandra.service.ThriftCluster">
          <constructor-arg value="TestCluster"/>
          <constructor-arg ref="cassandraHostConfigurator"/>
        </bean>
        <bean id="consistencyLevelPolicy" class="me.prettyprint.cassandra.model.ConfigurableConsistencyLevel">
          <property name="defaultReadConsistencyLevel" value="ONE"/>
        </bean>
        <bean id="keyspaceOperator" class="me.prettyprint.hector.api.factory.HFactory"
              factory-method="createKeyspace">
          <constructor-arg value="Keyspace1"/>
          <constructor-arg ref="cluster"/>
          <constructor-arg ref="consistencyLevelPolicy"/>
        </bean>
        <bean id="simpleCassandraDao" class="me.prettyprint.cassandra.dao.SimpleCassandraDao">
          <property name="keyspace" ref="keyspaceOperator"/>
          <property name="columnFamilyName" value="Standard1"/>
        </bean>
    • 124.
        Integrating with existing patterns
        Hector Object Mapper (simple, JPA 1.0-style annotations):
        https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29
        Hector JPA (experimental open-jpa implementation):
        https://github.com/riptano/hector-jpa
    • 128.
        Integrating with existing patterns
        private static final String STOCK_CQL =
            "select price FROM Stocks WHERE KEY = ?";
        jdbcTemplate.query(STOCK_CQL, stockTicker,
            new RowMapper<Stock>() {
              public Stock mapRow(ResultSet rs, int row) throws SQLException {
                CassandraResultSet crs = (CassandraResultSet)rs;
                Stock stock = new Stock();
                stock.setTicker(new String(crs.getKey()));
                stock.setPrice(crs.getDouble("price"));
                return stock;
              }
            });
    • 140.
        Integrating with existing patterns
        private static String UPDATE_PORTFOLIO_CQL =
            "update Portfolios set ? = ? where KEY = ?";
        jdbcTemplate.update(UPDATE_PORTFOLIO_CQL,
            new Object[] {position.getTicker(),
                          position.getCount(),
                          portfolio.getName()});
    • 146.
        Integrating with existing patterns
        private static final String UPDATE_PORT_CQL =
            "update Portfolios set ? = ? where KEY = ?";
        jdbcTemplate.batchUpdate(UPDATE_PORT_CQL,
            new BatchPreparedStatementSetter() {
              public void setValues(PreparedStatement ps, int index) throws SQLException {
                Position pos = portfolio.getConstituents().get(index);
                ps.setString(1, pos.getTicker());
                ps.setLong(2, pos.getShares());
                ps.setString(3, portfolio.getName());
              }
              public int getBatchSize() {
                return portfolio.getConstituents().size();
              }
            });
    • 160.
        Putting it Together
    • 161.
        Take control of consistency
        If you do need a high degree of consistency, use thresholds to trigger different behavior
        - Bank account:
          “on values over $10,000, wait to hear from all replicas”
        - Distributed Shopping Cart:
          Show a confirmation page to verify order resolution
        *** What is your appetite for risk?
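The bank-account rule can be sketched as a function that picks a consistency level per operation. The $10,000 threshold comes from the slide; using QUORUM below the threshold is an assumption made for illustration, and the enum and method names are hypothetical rather than part of any client API.

```java
import java.math.BigDecimal;

// Sketch of threshold-driven consistency; the level below the threshold is an assumption.
final class ConsistencyThreshold {
    enum Level { QUORUM, ALL }

    static final BigDecimal ALL_REPLICAS_ABOVE = new BigDecimal("10000");

    /** Values over the threshold wait for all replicas; everything else uses a quorum. */
    static Level levelFor(BigDecimal amountInDollars) {
        return amountInDollars.compareTo(ALL_REPLICAS_ABOVE) > 0 ? Level.ALL : Level.QUORUM;
    }

    public static void main(String[] args) {
        System.out.println("$50 transfer: " + levelFor(new BigDecimal("50")));
        System.out.println("$25,000 transfer: " + levelFor(new BigDecimal("25000")));
    }
}
```

The application, not the database, decides how much risk each operation can carry.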
    • 167. Uniquely identify operations in the application
      • Facilitates idempotent behavior and out-of-order execution
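A minimal sketch of why unique operation identifiers help: if every operation is stored under its own ID, replaying it is harmless and the order of arrival does not matter. The ledger below is a hypothetical in-memory example, not a pattern taken from the slides' code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory ledger illustrating idempotent, order-independent operations.
final class IdempotentLedger {
    // operation ID -> amount; one entry per operation, however often it is delivered
    private final Map<UUID, Long> appliedOperations = new HashMap<>();

    /** Records the operation once; a replay of the same ID changes nothing. */
    boolean apply(UUID operationId, long amountInCents) {
        return appliedOperations.putIfAbsent(operationId, amountInCents) == null;
    }

    long balanceInCents() {
        long total = 0;
        for (long amount : appliedOperations.values()) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        IdempotentLedger ledger = new IdempotentLedger();
        UUID deposit = UUID.randomUUID();
        ledger.apply(deposit, 5_000);
        ledger.apply(deposit, 5_000); // retried after a timeout; no double credit
        System.out.println("balance in cents: " + ledger.balanceInCents());
    }
}
```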
    • 168.
        Denormalization
        The point of normalization is to avoid update anomalies
        *** But in an append-only system, we don't do updates
    • 170.
        Summary
        - Take advantage of strengths
        - Look for idempotence and asynchronicity in your business processes
        - If it's not in the API, you are probably doing it wrong
        - Seek death is still possible if you model incorrectly
    • 174.
        Questions
        Nate McCall [email_address] @zznate
    • 175.
        Development Resources
      • Hector Documentation: http://hector-client.org
      • Cassandra Maven Plugin: http://mojo.codehaus.org/cassandra-maven-plugin/
      • CCM localhost cassandra cluster: https://github.com/pcmanus/ccm
      • OpsCenter: http://www.datastax.com/products/opscenter
      • Cassandra AMIs: https://github.com/riptano/CassandraClusterAMI
    • 178.
        Additional Resources
      • DataStax Documentation: http://www.datastax.com/docs/0.8/index
      • Apache Cassandra project wiki: http://wiki.apache.org/cassandra/
      • “The Dynamo Paper”: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
      • P. Helland. Building on Quicksand: http://arxiv.org/pdf/0909.1788
      • P. Helland. Life Beyond Distributed Transactions: http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
      • S. Anand. “Netflix's Transition to High-Availability Storage Systems”: http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf
      • “The Megastore Paper”: http://research.google.com/pubs/archive/36971.pdf
