Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)

Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but running long ad-hoc queries with these tools either hurts real-time performance or requires a duplicate cluster. This talk explains how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by UC Berkeley's AMPLab. It's designed to give fine-grained control over all resource usage, so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.

Transcript

  • 1. Mixing Batch and Real-time: Cassandra with Shark Richard Low | @richardalow #CASSANDRAEU CASSANDRASUMMITEU
  • 2. About me * Analytics tech lead at SwiftKey * Cassandra freelancer * Previous: lead Cassandra and Analytics dev at Acunu
  • 3. Outline * Batch analytics on real-time databases * Current solutions * Spark and Shark * My solution * Performance results * Summary & future work
  • 4. Batch analytics on real-time databases
  • 5. Batch and real-time analytics * Wherever there is data there are unforeseeable queries * Real-time databases are optimized for real-time queries * Large queries may not be possible * Or will impact your real-time SLA
  • 6. Example * User accounts database * Read-heavy * Must be low latency * Other tables on same database * Some are write heavy * A good fit for Cassandra!
  • 7. Example data model
      CREATE TABLE user_accounts (
        userid uuid PRIMARY KEY,
        username text,
        email text,
        password text,
        last_visited timestamp,
        country text
      );
  • 8. Example data model
      SELECT * FROM user_accounts LIMIT 2;

       userid   | country | email               | last_visited        | password | username
      ----------+---------+---------------------+---------------------+----------+----------
       a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
       b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
  • 9. Marketing walks in
  • 10. Ad-hoc query “Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”
  • 11. Ad-hoc query observations * We have 500k users from Brazil * 60MB of raw data * No way to extract by country from data model * It’s on unchanging data (mostly: some of the people who haven’t visited for a while may suddenly come back) * Can take hours, not days * No expectation this query will need rerunning
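
To make the marketing request concrete, here is a minimal Python sketch of the filter it describes, applied to rows shaped like the `user_accounts` table from slide 7. The cutoff date is an assumption: "since July" is read as "last visit before 1 August 2013", matching the timestamp used in the query on slide 38. The third row and the `matches` helper are hypothetical, added only for illustration.

```python
from datetime import datetime

# Assumption: "haven't logged in since July" means last_visited < 2013-08-01.
CUTOFF = datetime(2013, 8, 1)

def matches(row, country="BR", domain="@yahoo.com", cutoff=CUTOFF):
    """Return True if the row satisfies all three conditions of the request."""
    return (
        row["country"] == country
        and row["last_visited"] < cutoff
        and row["email"].endswith(domain)
    )

rows = [
    # The first two rows come from slide 8; the third is a made-up example.
    {"username": "rlow", "country": "UK", "email": "richard@wentnet.com",
     "last_visited": datetime(2013, 10, 7, 9, 7, 36)},
    {"username": "jean88", "country": "FR", "email": "jean@yahoo.com",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
    {"username": "maria_br", "country": "BR", "email": "maria@yahoo.com",
     "last_visited": datetime(2013, 6, 30, 18, 0, 0)},
]

selected = [r["username"] for r in rows if matches(r)]

# Rough size per matching user implied by slide 11: 60 MB over 500k users.
bytes_per_user = 60 * 10**6 / 500_000
```

The predicate itself is trivial. The difficulty the talk addresses is that the table is keyed by `userid`, so evaluating it means scanning every row.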
  • 12. Why? * Underrepresented use case in plethora of tools * Seen days of dev time wasted * Want to see what can be done
  • 13. Current solutions
  • 14. Options * Run Hive query on top of Cassandra
  • 15. Options * Run Hive query on top of Cassandra * Will compete with Cassandra for * I/O * Memory * CPU * Network * Will cause extra GC pressure on Cassandra * Could flush filesystem cache
  • 16. Options * Write ETL script and load into another DB
  • 17. Options * Write ETL script and load into another DB * All custom code * Single threaded * Unreliable * Will still flush cache on Cassandra nodes
  • 18. Options * Clone the cluster
  • 19. Options * Clone the cluster * Worst possible network load * Manual import each time * No incremental update * Need duplicate hardware
  • 20. Options * Add ‘batch analytics’ DC and run Hive there
  • 21. Options * Add ‘batch analytics’ DC and run Hive there * Initial copy slow and affects real-time performance * Need duplicate hardware * Will drop writes when really busy
  • 22. Spark and Shark
  • 23. Spark * Developed by AMPLab * Distributed computation, like Hadoop * Designed for iterative algorithms * Much faster for queries with working sets that fit in RAM * Reliability from storing lineage rather than intermediate results * Runs on Mesos or YARN
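
The "lineage rather than intermediate results" point can be illustrated with a toy model in Python. This is not Spark's API: the `Dataset` class and its methods are hypothetical names for a sketch in which each derived dataset remembers only its parent and the transformation applied, so a lost result can be rebuilt by replaying the chain from the source.

```python
class Dataset:
    """Toy dataset that records how it was derived instead of storing results."""

    def __init__(self, source=None, parent=None, fn=None, label="source"):
        self._source = source    # only set on the root dataset
        self._parent = parent    # the dataset this one was derived from
        self._fn = fn            # transformation applied to the parent's items
        self._label = label

    def map(self, f):
        return Dataset(parent=self, fn=lambda items: [f(x) for x in items],
                       label="map")

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda items: [x for x in items if pred(x)],
                       label="filter")

    def lineage(self):
        """Return the chain of steps from the source to this dataset."""
        if self._parent is None:
            return [self._label]
        return self._parent.lineage() + [self._label]

    def compute(self):
        """Rebuild the result from the source by replaying the lineage."""
        if self._parent is None:
            return list(self._source)
        return self._fn(self._parent.compute())


base = Dataset(source=range(10))
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.compute()
# Nothing was stored between steps, so "losing" the result costs only a replay.
second = even_squares.compute()
```

The sketch recomputes everything on every call. It shows only the recovery idea, not caching or distribution.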
  • 24. Spark is used by: [slide image] Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
  • 25. Shark * Hive on Spark * Completely compatible with Hive * Same QL, UDFs and storage handlers * Can cache tables
  • 26. Shark * Hive on Spark * Completely compatible with Hive * Same QL, UDFs and storage handlers * Can cache tables
      CREATE TABLE user_accounts_cached AS
        SELECT * FROM user_accounts WHERE country = 'BR';
  • 27. Shark on Cassandra
  • 28. Shark on Cassandra * CqlStorageHandler * Can use existing hive-cassandra storage handler * Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13 * But suffers from same problems as Hive+Hadoop on Cassandra
  • 29. Shark on Cassandra direct * SSTableStorageHandler * Run Spark workers on the Cassandra nodes * Read directly from SSTables in separate JVM * Limit CPU and memory through Spark/Mesos/YARN * Limit I/O by rate limiting raw disk access * Skip filesystem cache
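
The slides do not show how the disk rate limiting is implemented. One common way to cap read throughput is a token bucket, sketched below in Python under that assumption. `TokenBucket`, `read_throttled`, and `FakeClock` are hypothetical names, and the demo uses a simulated clock so that it runs instantly instead of actually sleeping.

```python
import io
import time


class TokenBucket:
    """Allow at most `bytes_per_sec` bytes per second on average."""

    def __init__(self, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(bytes_per_sec)
        self.clock = clock
        self.sleep = sleep
        self.allowance = 0.0          # bytes we may consume right now
        self.last = clock()

    def acquire(self, nbytes):
        """Charge `nbytes` to the bucket, sleeping if the budget is exhausted."""
        now = self.clock()
        # Refill for the elapsed time, capped at one second's worth of burst.
        self.allowance = min(self.rate,
                             self.allowance + (now - self.last) * self.rate)
        self.last = now
        if self.allowance >= nbytes:
            self.allowance -= nbytes
            return
        # Not enough budget: wait until the deficit has been refilled.
        self.sleep((nbytes - self.allowance) / self.rate)
        self.allowance = 0.0
        self.last = self.clock()


def read_throttled(stream, chunk_size, bucket):
    """Yield chunks from `stream`, charging each chunk to the bucket after reading."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        bucket.acquire(len(chunk))
        yield chunk


class FakeClock:
    """Simulated clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
bucket = TokenBucket(bytes_per_sec=1024, clock=clock.time, sleep=clock.sleep)
data = io.BytesIO(b"x" * 4096)
total = sum(len(chunk) for chunk in read_throttled(data, 1024, bucket))
elapsed = clock.now
```

Charging after each read keeps the average rate bounded without paying for a final empty read. A real reader would also need to bypass the filesystem cache, which this sketch does not attempt.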
  • 30. Cassandra on Spark: through CQL interface [Diagram labels: SSTables; FS cache; Cassandra JVM (deserialize, merge, serialize); Spark worker JVM (deserialize, process); remote client. Annotation: latency spikes!]
  • 31. Cassandra on Spark: SSTables direct [Diagram labels: SSTables; Spark worker JVM (deserialize, process); FS cache; Cassandra JVM (deserialize, merge, serialize); remote client. Annotation: constant latency]
  • 32. Disadvantages * Equivalent to CL.ONE * Always runs task local with the data * Doesn’t read data in memtables
  • 33. Performance results
  • 34. Testing * 4 node Cassandra cluster on m1.large * 2 cores, 7.5 GB RAM, 2 ephemeral disks * 1 Spark master * Spark running on Cassandra nodes * Limited to 1 core, 1 GB RAM * Compare CqlStorageHandler with SSTableStorageHandler
  • 35. Setup * Cassandra 1.2.10 * 3 GB heap * 256 tokens per node * RF 3 * Preloaded 100M randomly generated records * Each node started with 9GB of data * No optimization or tuning
  • 36. Tools * Codahale Metrics * Ganglia * Load generator using DataStax Java driver * Google spreadsheet
  • 37. Result 1 * No Cassandra load * Run caching query:
      CREATE TABLE user_accounts_cached AS
        SELECT * FROM user_accounts WHERE country = 'BR';
    * Takes 33 mins through CQL * Takes 13 mins through SSTables * 130k records/s * => SSTables is 2.5x faster * Even better than it looks, since the CQL path has access to both cores
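
The headline numbers on this slide can be checked with a few lines of Python. The sketch assumes the 130k records/s figure comes from scanning the 100M preloaded records (slide 35) in 13 minutes. That derivation is an assumption, not something the slides state.

```python
RECORDS = 100_000_000          # preloaded records, from slide 35
CQL_MINUTES = 33               # caching query through CQL
SSTABLE_MINUTES = 13           # caching query reading SSTables directly

# Throughput in records per second for each path.
cql_rate = RECORDS / (CQL_MINUTES * 60)
sstable_rate = RECORDS / (SSTABLE_MINUTES * 60)

# Speedup of the SSTable path over the CQL path.
speedup = CQL_MINUTES / SSTABLE_MINUTES

print(f"CQL:     {cql_rate:,.0f} records/s")
print(f"SSTable: {sstable_rate:,.0f} records/s")
print(f"Speedup: {speedup:.2f}x")
```

Under that assumption the SSTable path works out to roughly 128k records/s, which rounds to the 130k on the slide, and the ratio of 33 to 13 minutes gives the quoted 2.5x.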
  • 38. Using cached results * Now have results cached, can run super fast queries * No I/O or extra memory * Bounded number of cores
      SELECT count(*) FROM user_accounts_cached
        WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
          AND email LIKE '%@c9%';
    * Took 18 seconds
  • 39. Result 2 * Add read load * Read-modify-write of accounts info * 200 ops/s * Measure latency * Slow down SSTable loader to same rate as CQL
  • 40. [Latency chart; labels: 95%ile, mean, base]
  • 41. Analysis * Average latency 17% lower * Probably due to less CPU used by query * Max 95th %ile latency 33% lower and much more predictable * Possibly due to less GC pressure * Still have a latency increase over base * Probably due to I/O use
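
For readers who want to run the same kind of comparison on their own measurements, the Python sketch below computes a mean, a nearest-rank 95th percentile, and the relative reduction between two runs. The latency samples are made up for illustration and are not the talk's data. The function names are hypothetical, and the talk's own numbers came from Codahale Metrics, which may compute percentiles differently.

```python
def mean(samples):
    """Arithmetic mean of a non-empty list of latencies."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before` (0.33 means 33% lower)."""
    return (before - after) / before


# Made-up read latencies in milliseconds, NOT measurements from the talk.
cql_ms = [4, 5, 5, 6, 7, 9, 12, 15, 30, 60]
sstable_ms = [4, 4, 5, 5, 6, 7, 9, 12, 20, 40]

mean_drop = relative_reduction(mean(cql_ms), mean(sstable_ms))
p95_drop = relative_reduction(percentile(cql_ms, 95), percentile(sstable_ms, 95))
```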
  • 42. Result 3 * Keep read workload * Measure same latency * Add insert workload * Insert into separate table * 2500 ops/s
  • 43. [Latency charts: CQL loader and SSTable loader]
  • 44. Analysis * Lots of latency, but there is anyway
  • 45. Performance wrap up * 2.5x faster with less CPU => uses less resources to do the same thing * Lower, more predictable latencies when at same speed => controlled resource usage lowers latency impact * Could limit further to make impact unnoticeable
  • 46. Summary
  • 47. Summary * Discussed analytics use case not well served by current tools * Spark, Shark * SSTableStorageHandler * Performance results
  • 48. Future * Needs a name * GitHub * Speak to me if you want to use it * Speak to me if you want to contribute
  • 49. Thank you! Richard Low | @richardalow