Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)

Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but performing long-running ad-hoc queries with these tools is difficult without impacting real-time performance or requiring duplicate clusters. This talk explains how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AMPLab. The integration is designed to give fine-grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.

1. Mixing Batch and Real-time: Cassandra with Shark
Richard Low | @richardalow

2. About me
* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

3. Outline
* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

4. Batch analytics on real-time databases

5. Batch and real-time analytics
* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

6. Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!

7. Example data model
CREATE TABLE user_accounts (
  userid uuid PRIMARY KEY,
  username text,
  email text,
  password text,
  last_visited timestamp,
  country text
);

8. Example data model
SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88

9. Marketing walks in

10. Ad-hoc query
“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”

11. Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
* (* Mostly; some of the people who haven’t visited for a while may suddenly come back)

12. Why?
* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

13. Current solutions

14. Options
* Run Hive query on top of Cassandra

15. Options
* Run Hive query on top of Cassandra
* Will compete with Cassandra for
  * I/O
  * Memory
  * CPU
  * Network
* Will cause extra GC pressure on Cassandra
* Could flush filesystem cache

16. Options
* Write ETL script and load into another DB

17. Options
* Write ETL script and load into another DB
* All custom code
* Single threaded
* Unreliable
* Will still flush cache on Cassandra nodes

18. Options
* Clone the cluster

19. Options
* Clone the cluster
* Worst possible network load
* Manual import each time
* No incremental update
* Need duplicate hardware

20. Options
* Add ‘batch analytics’ DC and run Hive there

21. Options
* Add ‘batch analytics’ DC and run Hive there
* Initial copy slow and affects real-time performance
* Need duplicate hardware
* Will drop writes when really busy

22. Spark and Shark

23. Spark
* Developed by Berkeley’s AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN

24. Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

25. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables

26. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE country = 'BR';

27. Shark on Cassandra

28. Shark on Cassandra
* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra
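
For concreteness, mapping the Cassandra table into Shark through a CQL storage handler uses standard Hive external-table DDL, roughly as below. The handler class name and the SERDEPROPERTIES keys vary between hive-cassandra builds, so treat them as placeholders rather than the exact API used in the talk:

-- Sketch only: the STORED BY class and property keys are placeholders;
-- check the hive-cassandra integration you actually deploy.
CREATE EXTERNAL TABLE user_accounts (
  userid string,
  username string,
  email string,
  password string,
  last_visited timestamp,
  country string
)
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'  -- placeholder class name
WITH SERDEPROPERTIES (
  'cassandra.host' = '10.0.0.1',          -- placeholder
  'cassandra.ks.name' = 'my_keyspace'     -- placeholder
);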

29. Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
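
The SSTableStorageHandler is the speaker’s own (as-yet-unnamed) code, so the DDL below is only a hypothetical sketch of what wiring it up could look like; the class name, property keys and data-directory path are all invented for illustration and are not a published API:

-- Hypothetical sketch: handler name, property keys and path are invented.
CREATE EXTERNAL TABLE user_accounts (
  userid string,
  username string,
  email string,
  password string,
  last_visited timestamp,
  country string
)
STORED BY 'SSTableStorageHandler'  -- the handler described in this talk (name TBD)
WITH SERDEPROPERTIES (
  'sstable.data.dir' = '/var/lib/cassandra/data/my_keyspace/user_accounts',  -- invented property
  'sstable.io.rate.limit.mb' = '10'                                          -- invented property
);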

30. Cassandra on Spark: through CQL interface
[Diagram: SSTables are read through the Cassandra JVM (deserialize, merge, serialize, backed by the FS cache) and handed to the Spark worker JVM (deserialize, process), the same path a remote client uses. Annotation: latency spikes!]

31. Cassandra on Spark: SSTables direct
[Diagram: the Spark worker JVM reads SSTables directly (deserialize, process), while remote clients still go through the Cassandra JVM (deserialize, merge, serialize, FS cache). Annotation: constant latency.]

32. Disadvantages
* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables

33. Performance results

34. Testing
* 4 node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CQLStorageHandler with SSTableStorageHandler

35. Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning

36. Tools
* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

37. Result 1
* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE country = 'BR';
* Takes 33 mins through CQL
* Takes 13 mins through SSTables
  * 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores

38. Using cached results
* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores
SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
  AND email LIKE '%@c9%';
* Took 18 seconds

39. Result 2
* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL

40. [Latency graph; series include 95%ile base and mean base]

41. Analysis
* Average latency 17% lower
  * Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use

42. Result 3
* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s

43. [Latency graphs: CQL loader vs SSTable loader]

44. Analysis
* Lots of latency, but there is latency anyway (even without the loader)

45. Performance wrap up
* 2.5x faster with less CPU
  => uses fewer resources to do the same thing
* Lower, more predictable latencies when running at the same speed
  => controlled resource usage lowers the latency impact
* Could limit further to make the impact unnoticeable

46. Summary

47. Summary
* Discussed an analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

48. Future
* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute

49. Thank you!
Richard Low | @richardalow