Mixing Batch and Real-time: Cassandra with Shark

Richard Low | @richardalow

#CASSANDRAEU

CASSANDRASUMMITEU
C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark

Speaker: Richard Low, Analytics Tech Lead at SwiftKey
Video: http://www.youtube.com/watch?v=QTb4HTwVMq0&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=2
Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but running long ad-hoc queries with these tools either impacts real-time performance or requires duplicate clusters. This talk will explain how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AMPLab. The integration is designed to give fine-grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.

Published in: Technology, Business

  1. Mixing Batch and Real-time: Cassandra with Shark
     Richard Low | @richardalow
     #CASSANDRAEU CASSANDRASUMMITEU
  2. About me
     * Analytics tech lead at SwiftKey
     * Cassandra freelancer
     * Previous: lead Cassandra and Analytics dev at Acunu
  3. Outline
     * Batch analytics on real-time databases
     * Current solutions
     * Spark and Shark
     * My solution
     * Performance results
     * Summary & future work
  4. Batch analytics on real-time databases
  5. Batch and real-time analytics
     * Wherever there is data there are unforeseeable queries
     * Real-time databases are optimized for real-time queries
     * Large queries may not be possible
     * Or will impact your real-time SLA
  6. Example
     * User accounts database
     * Read-heavy
     * Must be low latency
     * Other tables on same database
     * Some are write heavy
     * A good fit for Cassandra!
  7. Example data model
     CREATE TABLE user_accounts (
       userid uuid PRIMARY KEY,
       username text,
       email text,
       password text,
       last_visited timestamp,
       country text
     );
  8. Example data model
     SELECT * FROM user_accounts LIMIT 2;

     userid   | country | email               | last_visited        | password | username
     ---------+---------+---------------------+---------------------+----------+---------
     a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
     b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
  9. Marketing walks in
  10. Ad-hoc query
     “Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”
  11. Ad-hoc query observations
     * We have 500k users from Brazil
     * 60MB of raw data
     * No way to extract by country from data model
     * It’s on unchanging data (mostly: some of the people who haven’t visited for a while may suddenly come back)
     * Can take hours, not days
     * No expectation this query will need rerunning
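
The marketing request boils down to a per-row predicate over `user_accounts`. Because `userid` is the only key, nothing in the data model narrows the search by country, so every row has to be examined. A minimal Python sketch of that predicate follows; the `'BR'` country code matches the later slides, while the 1 August 2013 cutoff (reading "since July" against the 2013 dates in the sample rows) and the two Brazilian rows are illustrative assumptions, not data from the talk.

```python
from datetime import datetime

# Assumed reading of "haven't logged in since July": last visit before 1 Aug 2013.
CUTOFF = datetime(2013, 8, 1)


def matches_request(row, cutoff=CUTOFF):
    """Return True if the row satisfies marketing's ad-hoc query."""
    return (
        row["country"] == "BR"
        and row["last_visited"] < cutoff
        and row["email"].lower().endswith("@yahoo.com")
    )


# The first two rows mirror the sample output on slide 8; the Brazilian rows
# are hypothetical and exist only to exercise the predicate.
rows = [
    {"username": "rlow", "country": "UK", "email": "richard@wentnet.com",
     "last_visited": datetime(2013, 10, 7, 9, 7, 36)},
    {"username": "jean88", "country": "FR", "email": "jean@yahoo.com",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
    {"username": "ana", "country": "BR", "email": "ana@yahoo.com",
     "last_visited": datetime(2013, 6, 2, 18, 0, 0)},
    {"username": "bruno", "country": "BR", "email": "bruno@yahoo.com",
     "last_visited": datetime(2013, 9, 30, 8, 30, 0)},
]

# A full scan: every row is tested, which is what makes this a batch job.
selected = [r["username"] for r in rows if matches_request(r)]
print(selected)
```

Only the hypothetical user `ana` passes all three conditions; `bruno` is from Brazil but visited after the cutoff.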
  12. Why?
     * Underrepresented use case in plethora of tools
     * Seen days of dev time wasted
     * Want to see what can be done
  13. Current solutions
  14. Options
     * Run Hive query on top of Cassandra
  15. Options
     * Run Hive query on top of Cassandra
     * Will compete with Cassandra for
       * I/O
       * Memory
       * CPU
       * Network
     * Will cause extra GC pressure on Cassandra
     * Could flush filesystem cache
  16. Options
     * Write ETL script and load into another DB
  17. Options
     * Write ETL script and load into another DB
     * All custom code
     * Single threaded
     * Unreliable
     * Will still flush cache on Cassandra nodes
  18. Options
     * Clone the cluster
  19. Options
     * Clone the cluster
     * Worst possible network load
     * Manual import each time
     * No incremental update
     * Need duplicate hardware
  20. Options
     * Add ‘batch analytics’ DC and run Hive there
  21. Options
     * Add ‘batch analytics’ DC and run Hive there
     * Initial copy slow and affects real-time performance
     * Need duplicate hardware
     * Will drop writes when really busy
  22. Spark and Shark
  23. Spark
     * Developed by AMPLab
     * Distributed computation, like Hadoop
     * Designed for iterative algorithms
     * Much faster for queries with working sets that fit in RAM
     * Reliability from storing lineage rather than intermediate results
     * Runs on Mesos or YARN
  24. Spark is used by
     Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
  25. Shark
     * Hive on Spark
     * Completely compatible with Hive
     * Same QL, UDFs and storage handlers
     * Can cache tables
  26. Shark
     * Hive on Spark
     * Completely compatible with Hive
     * Same QL, UDFs and storage handlers
     * Can cache tables
     CREATE TABLE user_accounts_cached AS
     SELECT * FROM user_accounts WHERE country = 'BR';
  27. Shark on Cassandra
  28. Shark on Cassandra
     * CqlStorageHandler
     * Can use existing hive-cassandra storage handler
     * Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
     * But suffers from same problems as Hive+Hadoop on Cassandra
  29. Shark on Cassandra direct
     * SSTableStorageHandler
     * Run Spark workers on the Cassandra nodes
     * Read directly from SSTables in separate JVM
     * Limit CPU and memory through Spark/Mesos/YARN
     * Limit I/O by rate limiting raw disk access
     * Skip filesystem cache
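
The slides do not say how the SSTableStorageHandler throttles its disk reads, so the following is only a generic illustration of "limit I/O by rate limiting raw disk access": a pacing loop that sleeps whenever the reader gets ahead of a bytes-per-second budget. The `ThrottledReader` class, its parameters, and the fake clock in the demo are all hypothetical names invented for this sketch. Bypassing the filesystem cache is a separate concern and is not shown.

```python
import io
import time


class ThrottledReader:
    """Wrap a binary file object and cap its average read rate (bytes/second)."""

    def __init__(self, raw, bytes_per_sec, chunk_size=64 * 1024,
                 clock=time.monotonic, sleep=time.sleep):
        self.raw = raw
        self.bytes_per_sec = bytes_per_sec
        self.chunk_size = chunk_size
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.total = 0  # bytes read so far

    def read_chunk(self):
        """Read up to chunk_size bytes, sleeping first if ahead of budget."""
        # Earliest time at which the bytes already read fit within the budget.
        earliest = self.start + self.total / self.bytes_per_sec
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
        data = self.raw.read(self.chunk_size)
        self.total += len(data)
        return data

    def __iter__(self):
        while True:
            data = self.read_chunk()
            if not data:
                return
            yield data


class FakeClock:
    """Deterministic clock for the demo: sleeping just advances the time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Demo: 4 MiB of in-memory data read at a 1 MiB/s budget in 1 MiB chunks.
MIB = 1024 * 1024
fake = FakeClock()
reader = ThrottledReader(io.BytesIO(b"\0" * (4 * MIB)), bytes_per_sec=MIB,
                         chunk_size=MIB, clock=fake.time, sleep=fake.sleep)
chunks = list(reader)
print(len(chunks), fake.now)
```

With the real `time.monotonic` and `time.sleep` defaults the same loop would pace reads from an actual file. The budget could then be lowered until the impact on real-time latency is acceptable, which is the trade-off the performance slides explore.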
  30. Cassandra on Spark: through CQL interface
     [Diagram. Labels: Spark worker JVM; FS Cache; Cassandra JVM; Deserialize; Merge; Serialize; SSTables; Deserialize; Process; Remote client. Annotation: Latency spikes!]
  31. Cassandra on Spark: SSTables direct
     [Diagram. Labels: Spark worker JVM; Deserialize; Process; SSTables; Remote client; Deserialize; Merge; Serialize; FS Cache; Cassandra JVM. Annotation: Constant latency]
  32. Disadvantages
     * Equivalent to CL.ONE
     * Always runs task local with the data
     * Doesn’t read data in memtables
  33. Performance results
  34. Testing
     * 4 node Cassandra cluster on m1.large
     * 2 cores, 7.5 GB RAM, 2 ephemeral disks
     * 1 Spark master
     * Spark running on Cassandra nodes
     * Limited to 1 core, 1 GB RAM
     * Compare CQLStorageHandler with SSTableStorageHandler
  35. Setup
     * Cassandra 1.2.10
     * 3 GB heap
     * 256 tokens per node
     * RF 3
     * Preloaded 100M randomly generated records
     * Each node started with 9GB of data
     * No optimization or tuning
  36. Tools
     * codahale Metrics
     * Ganglia
     * Load generator using DataStax Java driver
     * Google spreadsheet
  37. Result 1
     * No Cassandra load
     * Run caching query:
       CREATE TABLE user_accounts_cached AS
       SELECT * FROM user_accounts WHERE country = 'BR';
     * Takes 33 mins through CQL
     * Takes 13 mins through SSTables
     * 130k records/s
     * => SSTables is 2.5x faster
     * Even better since CQL has access to both cores
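
The headline figures on the Result 1 slide can be checked with a few lines of arithmetic. This sketch assumes the 130k records/s figure is the 100M preloaded records from the Setup slide divided by the 13-minute SSTable run; the slides do not state that explicitly, and the CQL rate printed below is derived here rather than quoted from the talk.

```python
cql_minutes = 33            # caching query through CQL (slide 37)
sstable_minutes = 13        # caching query through SSTables (slide 37)
records = 100_000_000       # preloaded records (slide 35)

# Ratio of wall-clock times.
speedup = cql_minutes / sstable_minutes

# Scan rates, assuming all preloaded records are scanned once per run.
sstable_rate = records / (sstable_minutes * 60)
cql_rate = records / (cql_minutes * 60)

print(f"speedup: {speedup:.2f}x")
print(f"SSTable rate: {sstable_rate:,.0f} records/s")
print(f"CQL rate (derived): {cql_rate:,.0f} records/s")
```

Under that assumption the numbers line up with the slide: roughly 2.5x, and about 128k records/s, which rounds to the quoted 130k.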
  38. Using cached results
     * Now have results cached, can run super fast queries
     * No I/O or extra memory
     * Bounded number of cores
     SELECT count(*) FROM user_accounts_cached
     WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
     AND email LIKE '%@c9%';
     * Took 18 seconds
  39. Result 2
     * Add read load
     * Read-modify-write of accounts info
     * 200 ops/s
     * Measure latency
     * Slow down SSTable loader to same rate as CQL
  40. [Latency chart. Labels: 95%ile, base; mean, base]
  41. Analysis
     * Average latency 17% lower
     * Probably due to less CPU used by query
     * Max 95th %ile latency 33% lower and much more predictable
     * Possibly due to less GC pressure
     * Still have a latency increase over base
     * Probably due to I/O use
  42. Result 3
     * Keep read workload
     * Measure same latency
     * Add insert workload
     * Insert into separate table
     * 2500 ops/s
  43. [Latency charts. Labels: CQL loader; SSTable loader]
  44. Analysis
     * Lots of latency, but there is anyway
  45. Performance wrap up
     * 2.5x faster with less CPU
       => uses fewer resources to do the same thing
     * Lower, more predictable latencies when at same speed
       => controlled resource usage lowers latency impact
     * Could limit further to make impact unnoticeable
  46. Summary
  47. Summary
     * Discussed analytics use case not well served by current tools
     * Spark, Shark
     * SSTableStorageHandler
     * Performance results
  48. Future
     * Needs a name
     * Github
     * Speak to me if you want to use it
     * Speak to me if you want to contribute
  49. Thank you!
     Richard Low | @richardalow
