Mixing Batch and Real-time: Cassandra with Shark (Cassandra Europe 2013)

Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but performing long-running ad-hoc queries with these tools is difficult without impacting real-time performance or requiring duplicate clusters. This talk explains how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AMPLab. The integration is designed to give fine-grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.

1. Mixing Batch and Real-time: Cassandra with Shark
Richard Low | @richardalow

2. About me
* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

3. Outline
* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

4. Batch analytics on real-time databases

5. Batch and real-time analytics
* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

6. Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!

7. Example data model
CREATE TABLE user_accounts (
  userid uuid PRIMARY KEY,
  username text,
  email text,
  password text,
  last_visited timestamp,
  country text
);

8. Example data model
SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88

9. Marketing walks in

10. Ad-hoc query
“Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”

11. Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
* (* Mostly; some of the people who haven’t visited for a while may suddenly come back)

12. Why?
* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

13. Current solutions

14. Options
* Run Hive query on top of Cassandra

15. Options
* Run Hive query on top of Cassandra
* Will compete with Cassandra for
  * I/O
  * Memory
  * CPU
  * Network
* Will cause extra GC pressure on Cassandra
* Could flush filesystem cache

16. Options
* Write ETL script and load into another DB

17. Options
* Write ETL script and load into another DB
* All custom code
* Single threaded
* Unreliable
* Will still flush cache on Cassandra nodes

18. Options
* Clone the cluster

19. Options
* Clone the cluster
* Worst possible network load
* Manual import each time
* No incremental update
* Need duplicate hardware

20. Options
* Add ‘batch analytics’ DC and run Hive there

21. Options
* Add ‘batch analytics’ DC and run Hive there
* Initial copy slow and affects real-time performance
* Need duplicate hardware
* Will drop writes when really busy

22. Spark and Shark

23. Spark
* Developed by Berkeley’s AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN

24. Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

25. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables

26. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE country = 'BR';

27. Shark on Cassandra

28. Shark on Cassandra
* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra
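
For concreteness, mapping the Cassandra table into Shark through a CQL storage handler uses standard Hive external-table DDL, roughly as below. The handler class name and the SERDEPROPERTIES keys vary between hive-cassandra builds, so treat them as placeholders rather than the exact API used in the talk:

-- Sketch only: the STORED BY class and property keys are placeholders;
-- check the hive-cassandra integration you actually deploy.
CREATE EXTERNAL TABLE user_accounts (
  userid string,
  username string,
  email string,
  password string,
  last_visited timestamp,
  country string
)
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'  -- placeholder class name
WITH SERDEPROPERTIES (
  'cassandra.host' = '10.0.0.1',          -- placeholder
  'cassandra.ks.name' = 'my_keyspace'     -- placeholder
);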

29. Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
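
The SSTableStorageHandler is the speaker’s own (as-yet-unnamed) code, so the DDL below is only a hypothetical sketch of what wiring it up could look like; the class name, property keys and data-directory path are all invented for illustration and are not a published API:

-- Hypothetical sketch: handler name, property keys and path are invented.
CREATE EXTERNAL TABLE user_accounts (
  userid string,
  username string,
  email string,
  password string,
  last_visited timestamp,
  country string
)
STORED BY 'SSTableStorageHandler'  -- the handler described in this talk (name TBD)
WITH SERDEPROPERTIES (
  'sstable.data.dir' = '/var/lib/cassandra/data/my_keyspace/user_accounts',  -- invented property
  'sstable.io.rate.limit.mb' = '10'                                          -- invented property
);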

30. Cassandra on Spark: through CQL interface
[Diagram: SSTables are read through the Cassandra JVM (deserialize, merge, serialize, backed by the FS cache) and handed to the Spark worker JVM (deserialize, process), the same path a remote client uses. Annotation: latency spikes!]

31. Cassandra on Spark: SSTables direct
[Diagram: the Spark worker JVM reads SSTables directly (deserialize, process), while remote clients still go through the Cassandra JVM (deserialize, merge, serialize, FS cache). Annotation: constant latency.]

32. Disadvantages
* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables

33. Performance results

34. Testing
* 4 node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CQLStorageHandler with SSTableStorageHandler

35. Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning

36. Tools
* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

37. Result 1
* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE country = 'BR';
* Takes 33 mins through CQL
* Takes 13 mins through SSTables
  * 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores

38. Using cached results
* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores
SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
  AND email LIKE '%@c9%';
* Took 18 seconds

39. Result 2
* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL

40. [Latency graph; series include 95%ile base and mean base]

41. Analysis
* Average latency 17% lower
  * Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use

42. Result 3
* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s

43. [Latency graphs: CQL loader vs SSTable loader]

44. Analysis
* Lots of latency, but there is latency anyway (even without the loader)

45. Performance wrap up
* 2.5x faster with less CPU
  => uses fewer resources to do the same thing
* Lower, more predictable latencies when running at the same speed
  => controlled resource usage lowers the latency impact
* Could limit further to make the impact unnoticeable

46. Summary

47. Summary
* Discussed an analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

48. Future
* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute

49. Thank you!
Richard Low | @richardalow