Meetup cassandra for_java_cql
Slides from 10/26/2011 Cassandra Austin Meetup group


      • Building Java Applications with Apache Cassandra
        Nate McCall [email_address] @zznate
      • What is Apache Cassandra?
    • CAP Theorem: Consistency, Availability, Partition Tolerance. “Thou shalt have but 2” - Conjecture made by Eric Brewer in 2000 - Published as formal proof in 2002 - See: http://en.wikipedia.org/wiki/CAP_theorem for more
      • Apache Cassandra Concepts
      - Explicit choice of partition tolerance and availability. Consistency is tunable. - No read before write - Merge on read - Idempotent - Schema Optional - All nodes share the same role - Still performs well with larger-than-memory data sets
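      Tunable consistency is often reasoned about with the Dynamo-style rule of thumb that a read overlaps the latest write when R + W > N (N replicas, W write acknowledgements, R read responses). The sketch below only does that arithmetic; the class and method names are hypothetical and it does not call Cassandra.

```java
/** Arithmetic behind tunable consistency; illustrative only, no Cassandra calls. */
final class QuorumSketch {

    /** Smallest majority of the replicas, e.g. 2 of 3. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must overlap every write set (R + W > N). */
    static boolean readOverlapsLatestWrite(int replicas, int writeAcks, int readResponses) {
        return readResponses + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("quorum of " + n + " = " + q);
        System.out.println("QUORUM/QUORUM overlaps: " + readOverlapsLatestWrite(n, q, q));
        System.out.println("ONE/ONE overlaps: " + readOverlapsLatestWrite(n, 1, 1));
    }
}
```

      Writing and reading at ONE trades that overlap for lower latency, which is the kind of choice the "tunable" bullet refers to.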
    • Generally complements other systems (not intended to be one-size-fits-all) *** You should always use the right tool for the right job anyway
    • How does this differ from an RDBMS? Substantially.
    • vs. RDBMS - No Joins Unless: - you do them on the client - you do them via Map/Reduce
    • vs. RDBMS - Schema Optional (Though you can add meta information for validation and type checking) *** Supports secondary indexes too: “… WHERE state = 'TX'”
    • vs. RDBMS - Prematerialized and Transaction-less - No ACID transactions - Limited support for ad-hoc queries *** You are going to give up both of these anyway when you shard an RDBMS ***
      • vs. RDBMS - Facilitates Consolidation
      It can be your caching layer
        * Off-heap cache (provided you install JNA)
      It can be your analytics infrastructure
        * true map/reduce
        * Pig driver
        * Hive driver coming soon
    • vs. RDBMS - Shared-Nothing Architecture Every node plays the same role: no masters, no slaves, no special nodes *** No single point of failure
      • vs. RDBMS - Real Linear Scalability
      Want 2x performance? Add 2x nodes. *** 'No downtime' included!
      • vs. RDBMS - Performance
      Reads on par with writes
      • Storage (Briefly)
        Understanding the on-disk format is extremely helpful in designing your data model correctly
      • Storage - SSTable
        - SSTables are immutable (“Merge on read”)
        - Newest timestamp wins
      • Storage – Compaction
        - Merges SSTables, keeping the count down and making merge on read more efficient
        - Discards tombstones (more on this later!)
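      The storage bullets above can be sketched as a toy model: each SSTable is an immutable map of column versions, a read merges them with the newest timestamp winning, and compaction drops tombstones after merging. Everything here (class, field, and method names) is hypothetical, and the sketch ignores details of real compaction such as the grace period Cassandra waits before purging tombstones.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of merge-on-read; illustrative only, not Cassandra's storage code. */
final class MergeOnReadSketch {

    /** One column version inside an immutable SSTable. A null value marks a tombstone. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /** Merges one row across SSTables: per column name, the newest timestamp wins. */
    static TreeMap<String, Cell> merge(List<Map<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>(); // columns stay ordered by name
        for (Map<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> entry : sstable.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }

    /** Compaction in this toy model: merge everything, then discard tombstones. */
    static TreeMap<String, Cell> compact(List<Map<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = merge(sstables);
        merged.values().removeIf(Cell::isTombstone);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> older = Map.of(
                "price", new Cell("401.76", 1L),
                "notNeededStuff", new Cell("x", 1L));
        Map<String, Cell> newer = Map.of(
                "price", new Cell("397.77", 2L),
                "notNeededStuff", new Cell(null, 2L)); // tombstone written later
        TreeMap<String, Cell> row = compact(List.of(older, newer));
        System.out.println(row.keySet() + " price=" + row.get("price").value);
    }
}
```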
      • Data Model
        "...sparse, persistent, distributed, multi-dimensional sorted map." (The “Bigtable” paper)
      • Data Model
        Keyspace
        - Collection of Column Families
        - Controls replication
        Column Family
        - Similar to a table
        - Columns ordered by name
      • Data Model – Column Family
        Static Column Family
        - Model my object data
        Dynamic Column Family
        - Pre-calculated query results
        Nothing stopping you from mixing them!
      • Data Model – Static CF
        Stocks (row key, then columns ordered by name)
        GOOG   name: Google, price: 589.55
        AAPL   name: Apple, price: 401.76
        NFLX   name: Netflix, price: 78.73
        NOK    exchange: NYSE, name: Nokia, price: 6.90
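      One way to picture the static column family above is a map from row key to a sorted map of columns, which is the "sorted map" idea from the Bigtable quote. This is a minimal in-memory sketch with hypothetical names; it is not how a client talks to Cassandra.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory picture of a static column family; illustrative only. */
final class StaticColumnFamilySketch {

    // row key -> (column name -> value); TreeMap keeps columns ordered by name
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Returns the row's columns, or an empty map for an unknown key. */
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    /** Builds the Stocks example from the slide. */
    static StaticColumnFamilySketch stocks() {
        StaticColumnFamilySketch cf = new StaticColumnFamilySketch();
        cf.put("GOOG", "price", "589.55");
        cf.put("GOOG", "name", "Google");
        cf.put("AAPL", "price", "401.76");
        cf.put("AAPL", "name", "Apple");
        cf.put("NFLX", "price", "78.73");
        cf.put("NFLX", "name", "Netflix");
        cf.put("NOK", "price", "6.90");
        cf.put("NOK", "name", "Nokia");
        cf.put("NOK", "exchange", "NYSE"); // rows need not share the same columns
        return cf;
    }

    public static void main(String[] args) {
        System.out.println(stocks().row("NOK"));
    }
}
```

      Note that NOK carries an extra column without any schema change, which is what "schema optional" means in practice.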
      • Data Model – Prematerialized Query
        StockHist (row key, then one column per trading day)
        GOOG   10/25/2011: 583.16, 10/24/2011: 596.42, 10/21/2011: 590.49
        AAPL   10/25/2011: 397.77, 10/24/2011: 405.77, 10/21/2011: 392.87
        NFLX   10/25/2011: 77.37, 10/24/2011: 118.84, 10/21/2011: 117.04
        NOK    10/25/2011: 6.71, 10/24/2011: 6.76, 10/21/2011: 6.61
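      Because columns within a row are kept sorted, a prematerialized row like the ones above can answer a date-range query by slicing a contiguous run of columns. The sketch below assumes the column names compare as dates (the slide does not say which comparator is used) and models one row with a `NavigableMap`; all names are hypothetical.

```java
import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.TreeMap;

/** One StockHist row as a sorted map; illustrative only. */
final class StockHistSketch {

    /** The NOK row from the slide, keyed by trading day. */
    static NavigableMap<LocalDate, Double> nokHistory() {
        NavigableMap<LocalDate, Double> row = new TreeMap<>();
        row.put(LocalDate.of(2011, 10, 25), 6.71);
        row.put(LocalDate.of(2011, 10, 24), 6.76);
        row.put(LocalDate.of(2011, 10, 21), 6.61);
        return row;
    }

    /** A column slice: every column whose name falls in [from, to]. */
    static NavigableMap<LocalDate, Double> slice(
            NavigableMap<LocalDate, Double> row, LocalDate from, LocalDate to) {
        return row.subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        System.out.println(slice(nokHistory(),
                LocalDate.of(2011, 10, 24), LocalDate.of(2011, 10, 25)));
    }
}
```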
      • API Operations
    • Five general categories
        - Retrieving
        - Writing/Updating/Removing (all the same op!)
          * Increment counters
        - Meta Information
        - Schema Manipulation
        - CQL Execution
    • Using a Client
      Hector Client: http://hector-client.org
      - Most popular Java client
      - In use at very large installations
      - A number of tools and utilities built on top
      - Very active community
      - MIT Licensed
      *** Like any open source project fully dependent on another open source project, it has its warts
      • Sample Project for Experimenting
      https://github.com/zznate/cassandra-tutorial
      https://github.com/zznate/hector-examples
      - Built using Hector
      - Really basic: designed to be beginner level w/ very few moving parts
      - Modify/abuse/alter as needed
      *** Descriptions of what is going on and how to run each example are in the Javadoc comments.
      • ColumnFamilyTemplate
      Familiar, type-safe approach
      - based on template-method design pattern
      - generic: ColumnFamilyTemplate<K,N> (K is the key type, N the column name type)
        ColumnFamilyTemplate template =
            new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName,
                StringSerializer.get(), StringSerializer.get());
      *** (no generics for clarity)
      • ColumnFamilyTemplate
        new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName,
            StringSerializer.get(),    // key format
            StringSerializer.get());   // column name format
      Column name format
      - Cassandra calls this a “comparator”
      - Remember: defines column order in on-disk format
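      Why the comparator matters can be seen without Cassandra at all: the same column names sort differently as text and as numbers. The following sketch uses plain Java comparators as stand-ins for a text comparator and a long comparator; the class and constant names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Shows how the choice of comparator changes column order; illustrative only. */
final class ComparatorSketch {

    /** Stand-in for comparing column names as text. */
    static final Comparator<String> AS_TEXT = Comparator.naturalOrder();

    /** Stand-in for comparing column names as 64-bit integers. */
    static final Comparator<String> AS_LONG = Comparator.comparingLong(Long::parseLong);

    /** Returns the names in the order the given comparator would store them. */
    static List<String> sortColumnNames(List<String> names, Comparator<String> comparator) {
        TreeSet<String> sorted = new TreeSet<>(comparator);
        sorted.addAll(names);
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("9", "10", "202");
        System.out.println("as text: " + sortColumnNames(names, AS_TEXT));
        System.out.println("as long: " + sortColumnNames(names, AS_LONG));
    }
}
```

      A range query over columns follows whichever order was chosen, so picking the comparator is part of designing the data model.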
      • ColumnFamilyTemplate
        ColumnFamilyResult<String, String> res = cft.queryColumns("zznate");
        String value = res.getString("email");
        Date startDate = res.getDate("startDate");
      <String, String>: key format, column name format
      • ColumnFamilyTemplate
        ColumnFamilyResult wrapper = template.queryColumns("GOOG", "AAPL", "NFLX");
        String googName = wrapper.getString("name");
        wrapper.next();
        String aaplName = wrapper.getString("name");
        wrapper.next();
        String nflxName = wrapper.getString("name");
      Querying multiple rows and iterating over results
      • ColumnFamilyTemplate
        ColumnFamilyUpdater updater = template.createUpdater("AAPL");
        updater.setString("exchange", "NASDAQ");
        updater.addKey("GOOG");
        updater.setString("exchange", "NASDAQ");
        template.update(updater);
      Inserting data with ColumnFamilyUpdater
      • ColumnFamilyTemplate
        template.deleteColumn("AAPL", "notNeededStuff");
        template.deleteColumn("GOOG", "somethingElse");
        template.deleteColumn("GOOG", "aDifferentColumnName");
        ...
        template.deleteRow("NOK");
        template.executeBatch();
      Deleting Data with ColumnFamilyTemplate
      • Deletion
      • Again: Every mutation is an insert!
        - Merge on read
        - SSTables are immutable
        - Highest timestamp wins
      • Deletion – As Seen by CLI
        [default@Tutorial] list StateCity;
        Using default limit of 100
        -------------------
        RowKey: CA Burlingame
        => (column=650, value=33372e3537783132322e3334, timestamp=1310340410528000)
        -------------------
        RowKey: TX Austin
        => (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)
        => (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)
        => (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)
        => (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)
        => (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000)
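      The value fields in the CLI listing are raw bytes printed as hex. Assuming they were written as UTF-8 strings (the slides do not say so), a small helper can turn them back into readable text. The helper name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Decodes the hex values shown by the CLI; assumes the bytes are UTF-8 text. */
final class HexValueSketch {

    static String hexToUtf8(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Two hex digits encode one byte.
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hexToUtf8("33372e3537783132322e3334")); // prints 37.57x122.34
        System.out.println(hexToUtf8("33302e3237783039372e3734")); // prints 30.27x097.74
    }
}
```

      The decoded strings look like latitude and longitude pairs for Burlingame and Austin, though that reading is an inference from the values rather than something the slides state.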
      • Deletion – As Seen by CLI
        [default@Tutorial] list StateCity;
        Using default limit of 100
        -------------------
        RowKey: CA Burlingame
        -------------------
        RowKey: TX Austin
        => (column=202, value=33302e3237783039372e3734, timestamp=1310143852392000)
        => (column=203, value=33302e3237783039372e3734, timestamp=1310143852444000)
        => (column=204, value=33302e3332783039372e3733, timestamp=1310143852448000)
        => (column=205, value=33302e3332783039372e3733, timestamp=1310143852453000)
        => (column=206, value=33302e3332783039372e3733, timestamp=1310143852457000)
      • Deletion – FYI
        Sending a deletion for a non-existing row:
        mutator.addDeletion("202230", "Npanxx", "city", stringSerializer);
        Does not exist? You just inserted a tombstone!
        [default@Tutorial] list Npanxx;
        Using default limit of 100
        . . .
        -------------------
        RowKey: 202230
        -------------------
        . . .
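      The behaviour shown above follows from "no read before write": a delete is recorded as a marker without checking whether anything exists. A minimal in-memory sketch of that idea, with hypothetical names and no relation to Hector's `Mutator`:

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy store where a delete is just another write; illustrative only. */
final class TombstoneSketch {

    private static final Object TOMBSTONE = new Object();

    // row key -> (column name -> value or TOMBSTONE)
    private final Map<String, Map<String, Object>> rows = new TreeMap<>();

    void insert(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Writes a marker; it never checks whether the row or column exists. */
    void delete(String rowKey, String column) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, TOMBSTONE);
    }

    /** True if a listing would show this row key, even with no live columns. */
    boolean listsRowKey(String rowKey) {
        return rows.containsKey(rowKey);
    }

    /** Number of columns in the row that are not tombstones. */
    int liveColumnCount(String rowKey) {
        Map<String, Object> row = rows.get(rowKey);
        if (row == null) {
            return 0;
        }
        int live = 0;
        for (Object value : row.values()) {
            if (value != TOMBSTONE) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        TombstoneSketch npanxx = new TombstoneSketch();
        npanxx.delete("202230", "city"); // the row never existed
        System.out.println("listed: " + npanxx.listsRowKey("202230")
                + ", live columns: " + npanxx.liveColumnCount("202230"));
    }
}
```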
      • Integrating with existing patterns
        <bean id="cassandraHostConfigurator"
              class="me.prettyprint.cassandra.service.CassandraHostConfigurator">
          <constructor-arg value="localhost:9170"/>
        </bean>
        <bean id="cluster" class="me.prettyprint.cassandra.service.ThriftCluster">
          <constructor-arg value="TestCluster"/>
          <constructor-arg ref="cassandraHostConfigurator"/>
        </bean>
        <bean id="consistencyLevelPolicy"
              class="me.prettyprint.cassandra.model.ConfigurableConsistencyLevel">
          <property name="defaultReadConsistencyLevel" value="ONE"/>
        </bean>
        <bean id="keyspaceOperator" class="me.prettyprint.hector.api.factory.HFactory"
              factory-method="createKeyspace">
          <constructor-arg value="Keyspace1"/>
          <constructor-arg ref="cluster"/>
          <constructor-arg ref="consistencyLevelPolicy"/>
        </bean>
        <bean id="simpleCassandraDao" class="me.prettyprint.cassandra.dao.SimpleCassandraDao">
          <property name="keyspace" ref="keyspaceOperator"/>
          <property name="columnFamilyName" value="Standard1"/>
        </bean>
      • Integrating with existing patterns
      • Hector Object Mapper:
      • https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29
      • Hector JPA:
      • https://github.com/riptano/hector-jpa
      • CQL via JDBC
      • - Integrate with existing tools (Spring Framework's JdbcTemplate in this case)
      • *** Still some caveats and missing features
      • CQL via JDBC
      • https://github.com/riptano/jdbc-conn-pool
      • - see portfolio_example sub project
      • CQL via JDBC: Components
      • - HCQLDataSource (from jdbc-pool)
      • - Spring Framework's JdbcTemplate
      • - DAO class with associated domain objects
      • - JUnit
      • - Spring Framework's SpringJUnit4ClassRunner (context setup and injection)
      • - EmbeddedServerHelper from hector-test (manages Cassandra lifecycle, directories and configuration)
      • CQL via JDBC: Configuration
      • Pool Configuration
      • - Cluster name, keyspace and at least 1 host required
      • - Additional settings for:
      • * fail over semantics
      • * automatic host discovery
      • * timeout counters and thresholds
      • CQL via JDBC: Configuration (JNDI)
        <Resource name="cassandra/CassandraClientFactory"
                  auth="Container"
                  type="me.prettyprint.cassandra.api.Keyspace"
                  factory="me.prettyprint.cassandra.jndi.CassandraClientJndiResourceFactory"
                  hosts="cass1:9160,cass2:9160,cass3:9160"
                  user="user"
                  password="passwd"
                  keyspace="Keyspace1"
                  clusterName="Test Cluster"
                  maxActive="20"
                  maxWaitTimeWhenExhausted="10"
                  failoverPolicy="ON_FAIL_TRY_ALL_AVAILABLE"
                  autoDiscoverHosts="true"
                  runAutoDiscoveryAtStartup="true" />
      • CQL via JDBC: Configuration (Spring)
        <bean class="com.datastax.drivers.jdbc.pool.cassandra.jdbc.HCQLDataSource"
              id="ds">
          <property name="clusterName" value="TestCluster" />
          <property name="keyspaceName" value="PortfolioDemo" />
          <property name="hosts" value="127.0.0.1:9170" />
        </bean>
        <bean class="org.springframework.jdbc.core.JdbcTemplate"
              id="jdbcTemplate">
          <constructor-arg ref="ds" />
        </bean>
      • CQL via JDBC: Components
        Inserting Test Data
        private static final String PORTFOLIOS_INSERT =
            "BEGIN BATCH "
            + "INSERT INTO Portfolios (KEY, BLU, CJS, DAL) VALUES (168,'19', '7', '38') "
            + "INSERT INTO Portfolios (KEY, BSX, CHK, DNB, MCI, SR) VALUES (236,'32', '27', '7','8','3') "
            + "APPLY BATCH";
        ...
        jdbcTemplate.execute(PORTFOLIOS_INSERT);
      • CQL via JDBC: Components
        Reading Data via RowMapper
        public Stock mapRow(ResultSet rs, int row) throws SQLException {
            CassandraResultSet crs = (CassandraResultSet) rs;
            Stock stock = new Stock();
            stock.setTicker(new String(crs.getKey()));
            stock.setPrice(crs.getDouble("price"));
            return stock;
        }
        See PortfolioDao#loadStocks
      • Development Resources
      • CQL Documentation (and CQL Shell) http://www.datastax.com/docs/1.0/dml/using_cql
      • Hector Documentation http://hector-client.org
      • Cassandra Maven Plugin (exec-cql goal) http://mojo.codehaus.org/cassandra-maven-plugin/
      • CCM localhost cassandra cluster https://github.com/pcmanus/ccm
      • OpsCenter http://www.datastax.com/products/opscenter
      • Cassandra AMIs https://github.com/riptano/CassandraClusterAMI
      • Putting it Together
      • Take control of consistency
      • If you do need a high degree of consistency, use thresholds to trigger different behavior
      • - Bank account:
      • “on values over $10,000, wait to hear from all replicas”
      • - Distributed Shopping Cart:
      • Show a confirmation page to verify order resolution
      • *** What is your appetite for risk?
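      The bank account example amounts to picking a consistency level per operation. A minimal sketch of that decision follows; the enum mirrors Cassandra's ONE/QUORUM/ALL level names, but the class, the threshold constant, and the choice of QUORUM below the threshold are assumptions made for illustration.

```java
import java.math.BigDecimal;

/** Picks a consistency level from the amount at stake; illustrative only. */
final class ConsistencyThresholdSketch {

    /** Hypothetical stand-in for a client library's consistency level type. */
    enum Level { ONE, QUORUM, ALL }

    /** Threshold from the slide: values over $10,000 wait for all replicas. */
    static final BigDecimal THRESHOLD = new BigDecimal("10000");

    static Level levelFor(BigDecimal amount) {
        // Assumption: anything at or below the threshold settles for QUORUM.
        return amount.compareTo(THRESHOLD) > 0 ? Level.ALL : Level.QUORUM;
    }

    public static void main(String[] args) {
        System.out.println(levelFor(new BigDecimal("250.00")));
        System.out.println(levelFor(new BigDecimal("25000.00")));
    }
}
```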
    • Uniquely identify operations in the application
      • Facilitates idempotent behavior and out-of-order execution
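      One way to read this advice: if each operation carries its own unique id and is stored under that id, replaying it or receiving it out of order cannot change the outcome. A minimal in-memory sketch under that assumption, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Ledger keyed by operation id, so retries are harmless; illustrative only. */
final class IdempotentLedgerSketch {

    // operation id -> amount in cents
    private final Map<UUID, Long> operations = new HashMap<>();

    /** Records the operation; returns false if this id was already applied. */
    boolean apply(UUID operationId, long amountCents) {
        return operations.putIfAbsent(operationId, amountCents) == null;
    }

    /** The balance is derived from the set of operations, so arrival order is irrelevant. */
    long balanceCents() {
        long total = 0;
        for (long amount : operations.values()) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        IdempotentLedgerSketch ledger = new IdempotentLedgerSketch();
        UUID deposit = UUID.randomUUID();
        ledger.apply(deposit, 5_000);
        ledger.apply(deposit, 5_000); // a client retry of the same operation
        System.out.println(ledger.balanceCents());
    }
}
```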
      • Denormalization
      • The point of normalization is to avoid update anomalies
      • *** But in an append-only system, we don't do updates
      • Summary
      • - Take advantage of strengths
      • - Look for idempotence and asynchronicity in your business processes
      • - If it's not in the API, you are probably doing it wrong
      • - Seek death is still possible if you model incorrectly
      • Questions
        Nate McCall [email_address] @zznate
      • Additional Resources
      • DataStax Documentation: http://www.datastax.com/docs/0.8/index
      • Apache Cassandra project wiki: http://wiki.apache.org/cassandra/
      • “The Dynamo Paper”
      • http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
      • P. Helland. Building on Quicksand
      • http://arxiv.org/pdf/0909.1788
      • P. Helland. Life Beyond Distributed Transactions
      • http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
      • S. Anand. “Netflix's Transition to High-Availability Storage Systems”
      • http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf
      • “The Megastore Paper”
      • http://research.google.com/pubs/archive/36971.pdf