Mixing Batch and Real-time: Cassandra with Shark

Richard Low | @richardalow

#CASSANDRAEU

CASSANDRASUMMITEU
About me

* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

Outline

* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

Batch analytics on real-time databases

Batch and real-time analytics

* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

Example

* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!

Example data model
CREATE TABLE user_accounts (
    userid uuid PRIMARY KEY,
    username text,
    email text,
    password text,
    last_visited timestamp,
    country text
);

Example data model
SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88

Marketing walks in

Ad-hoc query

“Please can you find all users from Brazil who haven’t
logged in since July and have an email @yahoo.com.
I need the answer by Monday.”
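
The request boils down to a three-part filter over the user_accounts columns. A minimal Python sketch of that filter, assuming the country is stored as the code 'BR', that "since July" means no visit on or after 2013-08-01, and that rows arrive as plain dicts. The function name matches_marketing_query and the sample rows are hypothetical, used only for illustration:

```python
from datetime import datetime

# Assumption: "haven't logged in since July" = last visit before 1 Aug 2013.
CUTOFF = datetime(2013, 8, 1)

def matches_marketing_query(row):
    """Return True for Brazilian yahoo.com users not seen since the cutoff."""
    return (
        row["country"] == "BR"
        and row["last_visited"] < CUTOFF
        and row["email"].lower().endswith("@yahoo.com")
    )

# Hypothetical sample rows shaped like the user_accounts table.
rows = [
    {"username": "ana", "country": "BR", "email": "ana@yahoo.com",
     "last_visited": datetime(2013, 6, 2, 10, 0, 0)},
    {"username": "jean88", "country": "FR", "email": "jean@yahoo.com",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
    {"username": "rui", "country": "BR", "email": "rui@yahoo.com",
     "last_visited": datetime(2013, 9, 1, 8, 0, 0)},
]

matched = [r["username"] for r in rows if matches_marketing_query(r)]
print(matched)
```

None of the three conditions touches the primary key (userid), so the database cannot answer this without scanning every row.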

Ad-hoc query observations

* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
(*) Mostly unchanging: some of the people who haven’t visited for a while may suddenly come back

Why?

* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

Current solutions

Options

* Run Hive query on top of Cassandra
* Will compete with Cassandra for I/O, memory, CPU and network
* Will cause extra GC pressure on Cassandra
* Could flush filesystem cache

Options

* Write ETL script and load into another DB
* All custom code
* Single threaded
* Unreliable
* Will still flush cache on Cassandra nodes

Options

* Clone the cluster
* Worst possible network load
* Manual import each time
* No incremental update
* Need duplicate hardware

Options

* Add ‘batch analytics’ DC and run Hive there
* Initial copy slow and affects real-time performance
* Need duplicate hardware
* Will drop writes when really busy

Spark and Shark

Spark

* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN

Spark is used by

Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Shark

* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';

Shark on Cassandra

* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra

Shark on Cassandra direct

* SSTableStorageHandler
* Run spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
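
The I/O limit can be pictured as a throttle wrapped around the file reads. The following is only a sketch of that general idea, not the SSTableStorageHandler implementation: the class ThrottledReader, its parameters, and the FakeClock helper are all hypothetical names, and bypassing the filesystem cache is not shown. The clock and sleep functions are injectable so the demo finishes instantly.

```python
import io
import time

class ThrottledReader:
    """Wrap a binary file object and cap read throughput at bytes_per_sec."""

    def __init__(self, raw, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.raw = raw
        self.bytes_per_sec = bytes_per_sec
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.total = 0  # bytes handed out so far

    def read(self, size):
        data = self.raw.read(size)
        self.total += len(data)
        # Earliest moment at which `total` bytes may have been read at the cap.
        earliest = self.start + self.total / self.bytes_per_sec
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
        return data

class FakeClock:
    """Test double: advances time on sleep() instead of really waiting."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds

clock = FakeClock()
source = io.BytesIO(b"x" * 4096)  # stands in for an SSTable file
reader = ThrottledReader(source, bytes_per_sec=1024,
                         clock=clock.time, sleep=clock.sleep)

chunks = 0
while reader.read(1024):
    chunks += 1

print(chunks, clock.now)
```

With a 1024 bytes/s cap, reading 4096 bytes advances the fake clock by 4 seconds. Lowering the cap trades batch speed for less disk contention with the Cassandra process.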

Cassandra on Spark: through CQL interface

[Diagram. Components: SSTables, FS cache, Cassandra JVM (deserialize, merge, serialize), Spark worker JVM (deserialize, process), remote client. Annotation: latency spikes!]

Cassandra on Spark: SSTables direct

[Diagram. Components: SSTables, Spark worker JVM (deserialize, process), FS cache, Cassandra JVM (deserialize, merge, serialize), remote client. Annotation: constant latency.]

Disadvantages

* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables

Performance results

Testing

* 4 node Cassandra cluster on m1.large
* 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 spark master
* Spark running on Cassandra nodes
* Limited to 1 core, 1 GB RAM
* Compare CqlStorageHandler with SSTableStorageHandler

Setup

* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning
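
As a rough cross-check of these numbers, the per-record size implied by the setup can be compared with the one implied by the earlier Brazil estimate (500k users, 60MB). This sketch assumes replicas are spread evenly over the nodes and that 1 GB = 10^9 bytes and 1 MB = 10^6 bytes; the variable names are illustrative only.

```python
records = 100_000_000        # preloaded records
replication_factor = 3       # RF 3
nodes = 4                    # 4 node cluster
data_per_node_bytes = 9e9    # 9 GB per node (assuming 1 GB = 10^9 bytes)

# Each record is stored RF times, spread (by assumption) evenly over the nodes.
replicas_per_node = records * replication_factor / nodes
bytes_per_record = data_per_node_bytes / replicas_per_node

# Earlier estimate from the ad-hoc query slide: 500k users in 60 MB.
brazil_bytes_per_record = 60e6 / 500_000

print(replicas_per_node)
print(bytes_per_record)
print(brazil_bytes_per_record)
```

Both estimates come out at about 120 bytes per record, so the test data set is consistent with the example use case.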

Tools

* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

Result 1

* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';

* Takes 33 mins through CQL
* Takes 13 mins through SSTables
* 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
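
The quoted speedup and throughput follow from the two run times. A short check, assuming the 130k records/s figure refers to the SSTables run and counts all 100M preloaded records:

```python
records = 100_000_000   # preloaded records scanned by the caching query
cql_minutes = 33
sstable_minutes = 13

speedup = cql_minutes / sstable_minutes
sstable_rate = records / (sstable_minutes * 60)   # records per second
cql_rate = records / (cql_minutes * 60)

print(f"speedup: {speedup:.1f}x")
print(f"SSTables: {sstable_rate / 1000:.0f}k records/s")
print(f"CQL: {cql_rate / 1000:.0f}k records/s")
```

Under that assumption the SSTables run works out to roughly 128k records/s, which rounds to the quoted 130k, and the ratio of run times is about 2.5.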

Using cached results

* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores

SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited) <
unix_timestamp('2013-08-01 00:00:00') AND
email LIKE '%@c9%';

* Took 18 seconds

Result 2

* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL
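
The two latency statistics reported for this test are the mean and the 95th percentile. A minimal sketch of how they can be computed from raw samples, using the nearest-rank percentile definition. The sample values are made up for illustration, and this is not how the metrics library used in the test computes them:

```python
import math

def mean(samples):
    """Arithmetic mean of the samples."""
    return sum(samples) / len(samples)

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical read latencies in milliseconds, with one slow outlier.
latencies_ms = [4, 5, 5, 6, 6, 7, 7, 8, 9, 40]

avg = mean(latencies_ms)
p95 = percentile(latencies_ms, 95)
print(avg, p95)
```

A single slow request barely moves the mean but dominates the 95th percentile, which is why both are tracked when judging the impact of a batch query on real-time reads.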

[Chart: latency plot with "95%ile base" and "mean base" reference levels]

Analysis

* Average latency 17% lower
* Probably due to less CPU used by query
* Max 95th percentile latency 33% lower and much more predictable
* Possibly due to less GC pressure
* Still have a latency increase over base
* Probably due to I/O use

Result 3

* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s

[Charts: latency with the CQL loader and with the SSTable loader]

Analysis

* Latency is high with both loaders, but it is high anyway under this workload

Performance wrap up

* 2.5x faster with less CPU
=> uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
=> controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable

Summary

* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

Future

* Needs a name
* Github
* Speak to me if you want to use it
* Speak to me if you want to contribute

Thank you!
Richard Low | @richardalow

All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Mixing Batch and Real-time: Cassandra with Shark

  • 1. Mixing Batch and Real-time: Cassandra with Shark Richard Low | @richardalow #CASSANDRAEU CASSANDRASUMMITEU
  • 2. About me * Analytics tech lead at SwiftKey * Cassandra freelancer * Previous: lead Cassandra and Analytics dev at Acunu
  • 3. Outline * Batch analytics on real-time databases * Current solutions * Spark and Shark * My solution * Performance results * Summary & future work
  • 4. Batch analytics on real-time databases
  • 5. Batch and real-time analytics * Wherever there is data there are unforeseeable queries * Real-time databases are optimized for real-time queries * Large queries may not be possible * Or will impact your real-time SLA
  • 6. Example * User accounts database * Read-heavy * Must be low latency * Other tables on same database * Some are write heavy * A good fit for Cassandra!
  • 7. Example data model
      CREATE TABLE user_accounts (
        userid uuid PRIMARY KEY,
        username text,
        email text,
        password text,
        last_visited timestamp,
        country text
      );
  • 8. Example data model
      SELECT * FROM user_accounts LIMIT 2;

       userid   | country | email               | last_visited        | password | username
      ----------+---------+---------------------+---------------------+----------+----------
       a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
       b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
  • 10. Ad-hoc query “Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”
  • 11. Ad-hoc query observations * We have 500k users from Brazil * 60MB of raw data * No way to extract by country from data model * It’s on mostly unchanging data (some of the people who haven’t visited for a while may suddenly come back) * Can take hours, not days * No expectation this query will need rerunning
  • 12. Why? * Underrepresented use case in the plethora of tools * Seen days of dev time wasted * Want to see what can be done
  • 14. Options * Run Hive query on top of Cassandra
  • 15. Options * Run Hive query on top of Cassandra * Will compete with Cassandra for I/O, memory, CPU and network * Will cause extra GC pressure on Cassandra * Could flush filesystem cache
  • 16. Options * Write ETL script and load into another DB
  • 17. Options * Write ETL script and load into another DB * All custom code * Single threaded * Unreliable * Will still flush cache on Cassandra nodes
  • 18. Options * Clone the cluster
  • 19. Options * Clone the cluster * Worst possible network load * Manual import each time * No incremental update * Need duplicate hardware
  • 20. Options * Add ‘batch analytics’ DC and run Hive there
  • 21. Options * Add ‘batch analytics’ DC and run Hive there * Initial copy slow and affects real-time performance * Need duplicate hardware * Will drop writes when really busy
  • 23. Spark * Developed by AMPLab * Distributed computation, like Hadoop * Designed for iterative algorithms * Much faster for queries with working sets that fit in RAM * Reliability from storing lineage rather than intermediate results * Runs on Mesos or YARN
  • 24. Spark is used by Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
  • 25. Shark * Hive on Spark * Completely compatible with Hive * Same QL, UDFs and storage handlers * Can cache tables
  • 26. Shark * Hive on Spark * Completely compatible with Hive * Same QL, UDFs and storage handlers * Can cache tables: CREATE TABLE user_accounts_cached as SELECT * FROM user_accounts WHERE country = 'BR';
  • 28. Shark on Cassandra * CqlStorageHandler * Can use existing hive-cassandra storage handler * Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13 * But suffers from same problems as Hive+Hadoop on Cassandra
  • 29. Shark on Cassandra direct * SSTableStorageHandler * Run Spark workers on the Cassandra nodes * Read directly from SSTables in separate JVM * Limit CPU and memory through Spark/Mesos/YARN * Limit I/O by rate limiting raw disk access * Skip filesystem cache
  • 30. Cassandra on Spark: through CQL interface [Diagram labels: Spark worker JVM; FS Cache; Cassandra JVM; Deserialize, Merge, Serialize; SSTables; Deserialize, Process; Remote client. Annotation: Latency spikes!]
  • 31. Cassandra on Spark: SSTables direct [Diagram labels: Spark worker JVM; Deserialize, Process; SSTables; Remote client; Deserialize, Merge, Serialize; FS Cache; Cassandra JVM. Annotation: Constant latency]
  • 32. Disadvantages * Equivalent to CL.ONE * Always runs task local with the data * Doesn’t read data in memtables
  • 34. Testing * 4 node Cassandra cluster on m1.large * 2 cores, 7.5 GB RAM, 2 ephemeral disks * 1 Spark master * Spark running on Cassandra nodes * Limited to 1 core, 1 GB RAM * Compare CqlStorageHandler with SSTableStorageHandler
  • 35. Setup * Cassandra 1.2.10 * 3 GB heap * 256 tokens per node * RF 3 * Preloaded 100M randomly generated records * Each node started with 9GB of data * No optimization or tuning
  • 36. Tools * Codahale Metrics * Ganglia * Load generator using DataStax Java driver * Google spreadsheet
  • 37. Result 1 * No Cassandra load * Run caching query: CREATE TABLE user_accounts_cached as SELECT * FROM user_accounts WHERE country = 'BR'; * Takes 33 mins through CQL * Takes 13 mins through SSTables (130k records/s) * => SSTables is 2.5x faster * Even better since CQL has access to both cores
  • 38. Using cached results * Now have results cached, can run super fast queries * No I/O or extra memory * Bounded number of cores * Query: SELECT count(*) FROM user_accounts_cached WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00') AND email LIKE '%@c9%'; * Took 18 seconds
  • 39. Result 2 * Add read load * Read-modify-write of accounts info * 200 ops/s * Measure latency * Slow down SSTable loader to same rate as CQL
  • 41. Analysis * Average latency 17% lower * Probably due to less CPU used by query * Max 95th percentile latency 33% lower and much more predictable * Possibly due to less GC pressure * Still have a latency increase over base * Probably due to I/O use
  • 42. Result 3 * Keep read workload * Measure same latency * Add insert workload * Insert into separate table * 2500 ops/s
  • 44. Analysis * Lots of latency, but there is anyway
  • 45. Performance wrap up * 2.5x faster with less CPU => uses fewer resources to do the same thing * Lower, more predictable latencies when at same speed => controlled resource usage lowers latency impact * Could limit further to make impact unnoticeable
  • 47. Summary * Discussed analytics use case not well served by current tools * Spark, Shark * SSTableStorageHandler * Performance results
  • 48. Future * Needs a name * GitHub * Speak to me if you want to use it * Speak to me if you want to contribute
  • 49. Thank you! Richard Low | @richardalow
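
The throughput and speedup figures quoted in Result 1 (slide 37) can be sanity-checked with a little arithmetic. The sketch below is not from the talk. It assumes the caching query scans each of the 100M preloaded records exactly once, and the function names are made up for illustration.

```python
# Figures quoted in the Setup and Result 1 slides.
RECORDS = 100_000_000   # preloaded records (assumed to be scanned once each)
CQL_MINUTES = 33        # caching query through CQL
SSTABLE_MINUTES = 13    # caching query through SSTables


def records_per_second(records: int, minutes: float) -> float:
    """Average throughput of a full scan that takes `minutes`."""
    return records / (minutes * 60)


def speedup(slow_minutes: float, fast_minutes: float) -> float:
    """How many times faster the quicker run is."""
    return slow_minutes / fast_minutes


sstable_rate = records_per_second(RECORDS, SSTABLE_MINUTES)
cql_rate = records_per_second(RECORDS, CQL_MINUTES)
ratio = speedup(CQL_MINUTES, SSTABLE_MINUTES)

print(f"SSTables: {sstable_rate:,.0f} records/s")  # about 128k, quoted as 130k
print(f"CQL:      {cql_rate:,.0f} records/s")
print(f"Speedup:  {ratio:.1f}x")                   # quoted as 2.5x
```

Under that assumption the SSTable path works out to roughly 128k records/s, which is consistent with the 130k records/s and 2.5x quoted on the slide.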