Cassandra Hands On

A presentation delivered to the Dublin Cassandra User Group on 29 May 2014. It covers use cases written by Patrick Callaghan of DataStax, interpreted by Niall Milton of DigBigData.

  1. 1. Cassandra Hands On Niall Milton, CTO, DigBigData Examples courtesy of Patrick Callaghan, DataStax Sponsored By
  2. 2. Introduction —  We will be walking through Cassandra use cases from Patrick Callaghan on GitHub. —  https://github.com/PatrickCallaghan/ —  Patrick sends his apologies: due to the Aer Lingus strike on Friday he couldn’t get a flight back to the UK —  This presentation will cover the important points from each sample application
  3. 3. Agenda —  Transactions Example —  Paging Example —  Analytics Example —  Risk Sensitivity Example
  4. 4. Transactions Example
  5. 5. Scenario —  We want to add products, each with a quantity, to an order —  Orders come in concurrently from random buyers —  Products that have sold out will return “OUT OF STOCK” —  We want to use lightweight transactions to guarantee that we do not allow orders to complete when no stock is available
  6. 6. Lightweight Transactions —  Guarantee a serializable isolation level (the I in ACID) —  Use the Paxos consensus algorithm to achieve this in a distributed system. See: —  http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf —  Every node is still equal, no master or locks —  Allows for conditional inserts & updates —  The cost of linearizable consistency is higher latency, so LWTs are not suitable for high-volume writes where low latency is required
  7. 7. Retrieve & Run the Code 1.  git clone https://github.com/PatrickCallaghan/datastax-transaction-demo.git 2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup" 3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.transactions.Main" -Dload=true -DcontactPoints=127.0.0.1 -DnoOfThreads=10
  8. 8. Schema 1.  create keyspace if not exists datastax_transactions_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1' }; 2.  create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId)); 3.  create table if not exists buyers_orders(buyerId text, orderId text, productId text, PRIMARY KEY(buyerId, orderId));
  9. 9. Model public class Order { private String orderId; private String productId; private String buyerId; … }
  10. 10. Method —  Find the current product quantity at CL.SERIAL —  This allows us to execute a Paxos query without proposing an update, i.e. read the current value SELECT capacityLeft from products WHERE productId = '1234' e.g. capacityLeft = 5
  11. 11. Method Contd. —  Do a conditional update using the IF operator to make sure the product quantity has not changed since the last check —  Note the use of the set collection type here —  This statement will only succeed if the IF condition is met UPDATE products SET orderIds = orderIds + {'3'}, capacityleft = 4 WHERE productId = '1234' IF capacityleft = 5;
  12. 12. Method Contd. —  If the last query succeeds, simply insert the order. INSERT INTO buyers_orders (buyerId, orderId, productId) VALUES ('1', '3', '1234'); —  This guarantees that no order will be placed where there is insufficient quantity to fulfill it.
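  A minimal sketch tying slides 10-12 together with the DataStax Java driver (2.0-era API). The contact point, keyspace and the hard-coded buyer/order/product ids are illustrative, not taken from the demo:

      import com.datastax.driver.core.*;

      public class OrderWithLwt {
          public static void main(String[] args) {
              Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
              Session session = cluster.connect("datastax_transactions_demo");

              // 1. Read the current quantity at CL.SERIAL (sees any in-flight Paxos state).
              Statement read = new SimpleStatement(
                      "SELECT capacityleft FROM products WHERE productId = '1234'");
              read.setConsistencyLevel(ConsistencyLevel.SERIAL);
              int capacity = session.execute(read).one().getInt("capacityleft");

              if (capacity <= 0) {
                  System.out.println("OUT OF STOCK");
              } else {
                  // 2. Conditional update: only applies if capacityleft is unchanged.
                  ResultSet rs = session.execute(
                          "UPDATE products SET orderIds = orderIds + {'3'}, capacityleft = "
                          + (capacity - 1) + " WHERE productId = '1234' IF capacityleft = " + capacity);
                  // The first row of an LWT result carries the [applied] flag.
                  if (rs.one().getBool("[applied]")) {
                      // 3. Stock was reserved atomically; record the order.
                      session.execute("INSERT INTO buyers_orders (buyerId, orderId, productId) "
                              + "VALUES ('1', '3', '1234')");
                  } else {
                      System.out.println("Quantity changed under us - retry");
                  }
              }
              cluster.close();
          }
      }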
  13. 13. Comments —  Using LWT incurs higher latency because all replicas must be consulted before a value is committed / returned —  CL.SERIAL does not propose a new value but is used to read the possibly uncommitted Paxos state —  The IF operator can also be used as IF NOT EXISTS, which is useful for user creation, for example (sketch below)
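  A small illustration of IF NOT EXISTS for user creation; the users table here is hypothetical, not part of the demo schema:

      import com.datastax.driver.core.ResultSet;
      import com.datastax.driver.core.Session;

      public class UserDao {
          // Hypothetical table: CREATE TABLE users (username text PRIMARY KEY, email text);
          // Returns true only if the username was free and the row was created.
          static boolean createUser(Session session, String username, String email) {
              ResultSet rs = session.execute(
                      "INSERT INTO users (username, email) VALUES (?, ?) IF NOT EXISTS",
                      username, email);
              return rs.one().getBool("[applied]");
          }
      }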
  14. 14. Paging Example
  15. 15. Scenario —  We have 1000s of products in our product catalogue —  We want to browse these using a simple select —  We don’t want to retrieve them all at once!
  16. 16. Cursors —  We are often dealing with wide rows in Cassandra —  Reading entire rows or multiple rows at once could lead to OOM errors —  Traditionally this meant using range queries to retrieve content —  Cassandra 2.0 (and the Java driver) introduce cursors —  Makes row-based queries more efficient (no need to use the token() function) —  This simplifies client code
  17. 17. Retrieve & Run the Code 1.  git clone https://github.com/PatrickCallaghan/datastax-paging-demo.git 2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup" 3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.paging.Main"
  18. 18. Schema create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId)); —  N.B. With the default partitioner, products are ordered by their Murmur3 hash value. Previously we would have needed the token() function to page through them in order (sketch below)
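  For contrast, the pre-cursor "old way" mentioned above: manual paging with token(). A rough sketch, assuming the products table from this demo:

      import com.datastax.driver.core.*;

      public class TokenPaging {
          // Walk the whole table in token order, 100 rows at a time,
          // carrying the last-seen key into the next query by hand.
          static void pageWithToken(Session session) {
              Row last = null;
              while (true) {
                  String cql = (last == null)
                          ? "SELECT productId FROM products LIMIT 100"
                          : "SELECT productId FROM products WHERE token(productId) > token('"
                              + last.getString("productId") + "') LIMIT 100";
                  last = null;
                  for (Row row : session.execute(cql)) {
                      last = row;   // remember where this page ended
                  }
                  if (last == null) break;   // empty page: end of table
              }
          }
      }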
  19. 19. Model public class Product { private String productId; private int capacityLeft; private Set<String> orderIds; … }
  20. 20. Method 1.  Create a simple select query for the products table. 2.  Set the fetch size parameter 3.  Execute the statement Statement stmt = new SimpleStatement("Select * from products"); stmt.setFetchSize(100); ResultSet resultSet = this.session.execute(stmt);
  21. 21. Method Contd. 1.  Get an iterator for the result set 2.  Use a while loop to iterate over the result set Iterator<Row> iterator = resultSet.iterator(); while (iterator.hasNext()){ Row row = iterator.next(); // do stuff with the row }
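  Put together, a runnable version of slides 20-21; the keyspace name is an assumption based on the demo's naming convention:

      import com.datastax.driver.core.*;

      public class PagingDemo {
          public static void main(String[] args) {
              Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
              Session session = cluster.connect("datastax_paging_demo");  // assumed keyspace name

              Statement stmt = new SimpleStatement("SELECT * FROM products");
              stmt.setFetchSize(100);   // rows fetched per page from the cluster

              // The driver fetches further pages transparently as we iterate,
              // so only about one page of rows is held in memory at a time.
              for (Row row : session.execute(stmt)) {
                  System.out.println(row.getString("productId"));
              }
              cluster.close();
          }
      }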
  22. 22. Comments —  Very easy to transparently iterate over a large result set in a memory-efficient way —  Cursor state is maintained by the driver —  Allows for failover between page requests: if a page fails to load from one node in the replica set, it is requested from another, so the state is not lost —  See: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
  23. 23. Analytics Example
  24. 24. Scenario —  Don’t have Hadoop but want to run some Hive-style analytics on our large dataset —  Example: Get the Top 10 financial transactions ordered by monetary value for each user —  May want to add more complex filtering later (where value > 1000) or even do mathematical groupings, percentiles, means, min, max
  25. 25. Cassandra for Analytics —  Useful for many scenarios when no other analytics solution is available —  Using cursors, queries are bounded & memory efficient depending on the operation —  Can be applied anywhere we can do iterative or recursive processing, SUM, AVG, MIN, MAX etc. —  NB: The example code also includes a CQLSSTableWriter, which is fast & convenient if we want to build SSTables from large datasets manually rather than sending millions of insert queries to Cassandra (sketch below)
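  A sketch of CQLSSTableWriter usage based on its documented Cassandra 2.0 API; the keyspace name "ks" and the output directory are placeholders, and the directory must already exist:

      import java.util.Date;
      import java.util.UUID;
      import org.apache.cassandra.io.sstable.CQLSSTableWriter;

      public class BulkWrite {
          public static void main(String[] args) throws Exception {
              String schema = "CREATE TABLE ks.transactions ("
                      + "accid text, txtnid uuid, txtntime timestamp, "
                      + "amount double, type text, reason text, "
                      + "PRIMARY KEY (accid, txtntime))";
              String insert = "INSERT INTO ks.transactions "
                      + "(accid, txtnid, txtntime, amount, type, reason) VALUES (?, ?, ?, ?, ?, ?)";

              // Writes SSTable files straight to disk; stream them into the
              // cluster afterwards with sstableloader.
              CQLSSTableWriter writer = CQLSSTableWriter.builder()
                      .inDirectory("/tmp/ks/transactions")   // placeholder path
                      .forTable(schema)
                      .using(insert)
                      .build();

              writer.addRow("acc1", UUID.randomUUID(), new Date(), 100.0, "DEBIT", "coffee");
              writer.close();
          }
      }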
  26. 26. Retrieve & Run the Code 1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git 2.  export MAVEN_OPTS=-Xmx512M (up the memory) 3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main" 4.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.analytics.TopTransactionsByAmountForUserRunner"
  27. 27. Schema create table IF NOT EXISTS transactions ( accid text, txtnid uuid, txtntime timestamp, amount double, type text, reason text, PRIMARY KEY(accid, txtntime) );
  28. 28. Model public class Transaction { private String txtnId; private String accountId; private double amount; private Date txtnDate; private String reason; private String type; … }
  29. 29. Method —  Pass a blocking queue into the DAO method, which cursors over the data, allowing us to pop items off as they are added —  NB: Could also use a callback here to update the queue public void getAllProducts(BlockingQueue<Transaction> processorQueue) { Statement stmt = new SimpleStatement("SELECT * FROM transactions"); stmt.setFetchSize(2500); ResultSet resultSet = this.session.execute(stmt);
  30. 30. Method Contd. 1.  Get an iterator for the result set 2.  Use a while loop to iterate over the result set, adding each row to the queue while (iterator.hasNext()) { Row row = iterator.next(); Transaction transaction = createTransactionFromRow(row); // convenience method processorQueue.offer(transaction); }
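  The consuming side of this producer/consumer pattern is not shown on the slides; a minimal sketch, assuming the Transaction model from slide 28 and a producer running on its own thread:

      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.TimeUnit;

      public class TransactionConsumer {
          // Pops transactions off the queue as the DAO adds them.
          static void consume(BlockingQueue<Transaction> queue) throws InterruptedException {
              while (true) {
                  // Poll with a timeout so the loop can end once the producer is done;
                  // a poison-pill object would be a more robust termination signal.
                  Transaction t = queue.poll(5, TimeUnit.SECONDS);
                  if (t == null) break;
                  process(t);   // e.g. feed the bounded Top-10 set on slide 31
              }
          }

          static void process(Transaction t) { /* application logic */ }
      }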
  31. 31. Method Contd. 1.  Use Java Collections & a Transaction comparator to track the Top 10 results private Set<Transaction> orderedSet = new BoundedTreeSet<Transaction>(10, new TransactionAmountComparator());
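  Neither BoundedTreeSet nor the comparator is shown on the slides; a minimal version consistent with how they are used might look like this (the demo's actual classes may differ, and the getters on Transaction are assumed):

      import java.util.Comparator;
      import java.util.TreeSet;

      // A TreeSet that keeps only the largest maxSize elements under its comparator.
      public class BoundedTreeSet<E> extends TreeSet<E> {
          private final int maxSize;

          public BoundedTreeSet(int maxSize, Comparator<? super E> comparator) {
              super(comparator);
              this.maxSize = maxSize;
          }

          @Override
          public boolean add(E e) {
              boolean added = super.add(e);
              if (size() > maxSize) {
                  remove(first());   // evict the smallest so only the top N remain
              }
              return added;
          }
      }

      // Orders by amount, smallest first, so first() is the eviction candidate.
      // Ties are broken on txtnId, since a TreeSet drops comparator-equal elements.
      class TransactionAmountComparator implements Comparator<Transaction> {
          public int compare(Transaction a, Transaction b) {
              int byAmount = Double.compare(a.getAmount(), b.getAmount());
              return byAmount != 0 ? byAmount : a.getTxtnId().compareTo(b.getTxtnId());
          }
      }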
  32. 32. Comments —  Entirely possible, but probably not to be thought of as a complete replacement for dedicated analytics solutions —  Issues are token distribution across replicas and mixed write and read patterns —  Running analytics or MR operations can be read-heavy (as well as memory and I/O intensive) —  Transaction logging tends to be write-heavy —  Cassandra can handle it, but in practice it is better to split workloads, except for smaller cases where latency doesn’t matter or the cluster is not generally under significant load —  Consider DSE Hadoop, Spark, Storm as alternatives
  33. 33. Risk Sensitivity Example
  34. 34. Scenario —  In financial risk systems, positions have sensitivity to certain variables —  Positions are hierarchical: each is associated with a trader at a desk, which is part of an asset type in a certain location —  E.g. Frankfurt/FX/desk10/trader7/position23 —  Sensitivity values are inserted for each position. We need to aggregate them for each level in the hierarchy —  Since values are represented as deltas, the sum of all sensitivities over time gives the current sensitivity
  35. 35. Scenario —  E.g. Aggregations for: —  Frankfurt/FX/desk10/trader7 —  Frankfurt/FX/desk10 —  Frankfurt/FX —  As new positions are entered the risk sensitivities will change and will need to be aggregated for each level for the new value to be available
  36. 36. Queries select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX'; select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3'; select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3' and risk_sens_name='irDelta';
  37. 37. Retrieve & Run the Code 1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git 2.  export MAVEN_OPTS=-Xmx512M (up the memory) 3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main" 4.  mvn clean compile exec:java -Dexec.mainClass="com.heb.finance.analytics.Main" -DstopSize=1000000
  38. 38. Schema create table if not exists risk_sensitivities_hierarchy ( hier_path text, sub_hier_path text, risk_sens_name text, value double, PRIMARY KEY (hier_path, sub_hier_path, risk_sens_name) ) WITH compaction={'class': 'LeveledCompactionStrategy'}; NB: Notice the use of LCS as we want the table to be efficient for reads also
  39. 39. Model public class RiskSensitivity { public final String name; public final String path; public final String position; public final BigDecimal value; … }
  40. 40. Method —  Write a service that periodically writes new sensitivities to Cassandra (sketch below). insert into risk_sensitivities_hierarchy (hier_path, sub_hier_path, risk_sens_name, value) VALUES (?, ?, ?, ?)
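  A minimal sketch of such a writer using a prepared statement (prepared once, bound per write); the class and method names are illustrative:

      import com.datastax.driver.core.PreparedStatement;
      import com.datastax.driver.core.Session;

      public class SensitivityWriter {
          private final Session session;
          private final PreparedStatement insert;

          public SensitivityWriter(Session session) {
              this.session = session;
              // Preparing once avoids re-parsing the CQL on every periodic write.
              this.insert = session.prepare(
                      "INSERT INTO risk_sensitivities_hierarchy "
                      + "(hier_path, sub_hier_path, risk_sens_name, value) VALUES (?, ?, ?, ?)");
          }

          public void write(String hierPath, String subHierPath, String name, double value) {
              session.execute(insert.bind(hierPath, subHierPath, name, value));
          }
      }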
  41. 41. Method Contd. —  In our aggregator do the following periodically —  Select data for the hierarchies we wish to aggregate select * from risk_sensitivities_hierarchy where hier_path = 'Frankfurt/FX/desk10/trader4' —  Will get all positions related to this hierarchy —  Add the values (represented as deltas) to each other to get the new sensitivity —  E.g. S1 = -3, S2 = 2, S3 = -1, so the new sensitivity is -2 —  Write it back for 'Frankfurt/FX/desk10/trader4'
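  A simplified aggregator sketch: it sums every delta under one path and writes the result back one level up. The real demo presumably aggregates per sensitivity name; the "aggregate" row key and the path-splitting convention here are assumptions:

      import com.datastax.driver.core.Row;
      import com.datastax.driver.core.Session;

      public class SensitivityAggregator {
          // Assumes a multi-level path such as "Frankfurt/FX/desk10/trader4".
          static void aggregate(Session session, String hierPath) {
              double sum = 0;
              for (Row row : session.execute(
                      "SELECT value FROM risk_sensitivities_hierarchy WHERE hier_path = ?",
                      hierPath)) {
                  sum += row.getDouble("value");   // e.g. -3 + 2 + (-1) = -2
              }
              // Write the aggregate back under the parent path, keyed by the leaf name.
              int cut = hierPath.lastIndexOf('/');
              session.execute(
                      "INSERT INTO risk_sensitivities_hierarchy "
                      + "(hier_path, sub_hier_path, risk_sens_name, value) VALUES (?, ?, ?, ?)",
                      hierPath.substring(0, cut), hierPath.substring(cut + 1), "aggregate", sum);
          }
      }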
  42. 42. Comments —  Simple way to maintain an up-to-date risk sensitivity on an ongoing basis based on previous data —  Will mean (X hierarchies) * (Y variables) queries are executed periodically (keep an eye on this) —  Cursors, a blocking queue and bounded collections help us achieve the same result without reading entire rows —  Has other applications, such as roll-ups for stream data, provided you have reasonably low cardinality in terms of (time resolution) * (number of variables)
  43. 43. —  Thanks to Patrick Callaghan for the hard work coding the examples! — Questions?
