Sorry for the Delay
• There were some technical difficulties, so we are giving folks a
few more minutes to join
• Again – sorry for the delay
© 2014 DataStax, All Rights Reserved. Company Confidential 1
Webinar Housekeeping
• All attendees placed on mute
• Input questions at any time using the online interface
Big Data Analytics with
Cassandra and Spark
Brian Hess
Sr. Product Manager for Analytics
DataStax
Willie Sutton
Bank Robber in the 1930s-1950s
FBI Most Wanted List 1950
Captured in 1952
Willie Sutton
When asked
“Why do you rob banks?”
“Because that’s where the
money is.”
Motivating Use Case
Internet of Things
Your System
FAULT
Cassandra
Spark
Spark + Cassandra
Apache Cassandra
• Distributed NoSQL database
– BigTable meets Dynamo
• All nodes are equal
– Always on
– Linear scale out - a lot
• More data
• More transactions
• Multi-Datacenter
– Geographic or Workload
• Cassandra Query Language
– SQL-like
[Chart: 100,000 txns/sec, 200,000 txns/sec, 400,000 txns/sec]
How Cassandra Works – Writes
It’s 72°
Done
Tunable Consistency
• Relax the Consistency in ACID
– Isn’t always needed – and isn’t guaranteed anyway (in distributed DBs)
– Reads may not get the most up-to-date data – but almost always will
• All data is replicated
– Set in the schema
– Distributed to nodes by Token Range
• Options:
– QUORUM, ONE, ALL
• Can ensure reads get most up-to-date value
– E.g. – read/write at QUORUM
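Why read/write at QUORUM gets the most up-to-date value can be shown with a little arithmetic: a read overlaps the latest acknowledged write whenever replicas written plus replicas read exceed the replication factor. The sketch below is illustrative only. The object and method names are hypothetical and are not part of any Cassandra API.

```scala
// Illustrative only: these names are hypothetical, not a Cassandra API.
object ConsistencySketch {
  // Number of replicas that form a majority (QUORUM) for a replication factor.
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  // A read is guaranteed to touch at least one replica holding the latest
  // acknowledged write when written + read replicas exceed the replication factor.
  def readSeesLatestWrite(replicationFactor: Int, writeReplicas: Int, readReplicas: Int): Boolean =
    writeReplicas + readReplicas > replicationFactor

  def main(args: Array[String]): Unit = {
    val rf = 3
    val q = quorum(rf)
    println(s"RF=$rf, QUORUM=$q")
    println(s"QUORUM write + QUORUM read overlap: ${readSeesLatestWrite(rf, q, q)}")
    println(s"ONE write + ONE read overlap: ${readSeesLatestWrite(rf, 1, 1)}")
  }
}
```

With a replication factor of 3, QUORUM is 2 replicas, so a QUORUM write and a QUORUM read always share at least one replica. ONE/ONE gives no such guarantee, which is the "almost always will" case.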
How Cassandra Works – Tunable Consistency
“You got it. I’ll make sure everyone gets it.”
“You got it. A majority got it. The rest will.” (QUORUM)
“You got it. One guy got it. The rest will.” (ONE)
“You got it. Everyone has it.” (ALL)
How Cassandra Works – Query
SELECT user_id
FROM users
WHERE name = 'PBCupFan';
“Sure thing, let me get that for you.”
“What do you guys have for PBCup?”
“Here’s what I have:” (from each replica)
“Let me resolve any conflicts.”
“Here ya go!”
user_id
---------
1234
(1 rows)
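The "resolve any conflicts" step can be sketched as last-write-wins: when replicas disagree, the answer with the newest write timestamp is returned. This is a simplified model under stated assumptions. The `ReplicaResponse` type and `resolve` function are hypothetical names for illustration, and real Cassandra reconciles per column rather than per whole response.

```scala
// Simplified model of read reconciliation; names are hypothetical.
object ReadResolutionSketch {
  // One replica's answer: the value it holds and when that value was written.
  final case class ReplicaResponse(replica: String, value: String, writeTimestampMicros: Long)

  // Last-write-wins: keep the response carrying the newest write timestamp.
  def resolve(responses: Seq[ReplicaResponse]): Option[ReplicaResponse] =
    if (responses.isEmpty) None
    else Some(responses.maxBy(_.writeTimestampMicros))

  def main(args: Array[String]): Unit = {
    val responses = Seq(
      ReplicaResponse("node1", "1234", writeTimestampMicros = 100L),
      ReplicaResponse("node2", "1200", writeTimestampMicros = 50L) // stale copy
    )
    println(resolve(responses).map(_.value))
  }
}
```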
Cassandra for Internet of Things
It’s all about scaling
Cassandra
• Always On
– No down time
• Linear Scalability
– For writes or reads
– For data size
• Terrific choice for Internet of Things, Web, Mobile, etc.
– British Gas, Nike, etc – Thermostats, Manufacturing, Oil/Gas, etc
It’s where the data is!
Cassandra Limitations
• No aggregations
– Optimized for lookups & writes
– No GROUP BYs
– No Windowed Aggregates
• No Joins
– Model your data to avoid them
• Must select by partition key
– There are secondary indexes
• But they are an antipattern
• Not optimized for full-table scans
It actually can’t do everything
Apache Spark
• Distributed computing framework
• Generalized DAG execution
• Easy Abstraction for Datasets
• Integrated SQL Queries
• Streaming
• Machine Learning Library
Spark Components
Spark Core Engine
Spark SQL
Spark Streaming
MLlib
GraphX
SparkR
Spark Provides a Simple and Efficient
Framework for Distributed Computations
Node Roles: 2
In-Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction for Datasets? DataFrame! (previously Resilient Distributed Dataset (RDD))
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by Worker - workhorse of the Spark application
[Diagram: Spark Master, Spark Workers, Spark Executor, Spark Partition, DataFrame (or RDD)]
RDDs Can be Generated from a
Variety of Sources
Textfiles
Parallelized Collections
Spark on Cassandra
Spark Core Engine
Spark SQL
Spark Streaming
MLlib
GraphX
SparkR
Cassandra
DataStax Spark-Cassandra Connector
Spark Cassandra Connector uses the DataStax
Java Driver to Read from and Write to Cassandra
Each Executor Maintains a
connection to the C* Cluster
RDDs are read into different splits based on sets of tokens:
Tokens 1-1000
Tokens 1001-2000
Tokens …
[Diagram: Spark Executor, DataStax Java Driver, C* Full Token Range]
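The idea of reading an RDD as splits over sets of tokens can be sketched by cutting a token range into contiguous sub-ranges, one per Spark partition. This is a toy model using the small illustrative token numbers from the slide. The `split` function is hypothetical and is not how the connector is actually implemented; the real connector works over the cluster's full token range and its own sizing logic.

```scala
// Toy model of token-range splitting; not the connector's real algorithm.
object TokenSplitSketch {
  // Split the inclusive range [first, last] into at most `parts` contiguous,
  // inclusive sub-ranges of near-equal size.
  def split(first: Long, last: Long, parts: Int): Seq[(Long, Long)] = {
    require(parts > 0 && last >= first, "need a non-empty range and at least one part")
    val total = last - first + 1
    val base = total / parts
    val extra = total % parts // the first `extra` sub-ranges get one more token
    val sizes = (0 until parts).map(i => base + (if (i < extra) 1 else 0)).filter(_ > 0)
    val starts = sizes.scanLeft(first)(_ + _) // start token of each sub-range
    starts.zip(sizes).map { case (start, size) => (start, start + size - 1) }
  }

  def main(args: Array[String]): Unit =
    split(1L, 3000L, 3).foreach { case (from, to) => println(s"Tokens $from-$to") }
}
```

Each sub-range would then be read by one Spark partition, which is why a Spark partition is not the same thing as a Cassandra partition.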
Co-locate Spark and C* for Best Performance
• Run Cassandra and
Spark on same nodes
• Local reads/writes
• Increased performance
Things you can’t do in Cassandra
– Using SparkSQL
• JOINs
// csc is assumed to be a CassandraSQLContext; a plain SparkContext (sc) has no sql method
csc.sql("""SELECT t.sensor_id, t.temp, m.location
FROM ks.temperatures t JOIN ks.metadata m
ON t.sensor_id = m.sensor_id
WHERE t.sensor_id = 12345""")
• Aggregates
csc.sql("""SELECT sensor_id, year, month, MAX(temp) mtemp
FROM ks.temperatures
GROUP BY sensor_id, year, month""")
Things you can’t do in Cassandra
– External Data
• JOIN with HDFS data
// cassandraTable comes from the connector's implicits
import com.datastax.spark.connector._
val temp2014 = sc.textFile("webhdfs://myhadoop/data/temp2014.csv").
map(x=>x.split(",")).
map(x=>((x(0).toInt, x(1).toInt, x(2).toInt),
x(3).toDouble))
val temp2015 = sc.cassandraTable("ks", "temperatures").
map(x=>((x.getInt("sensor_id"), x.getInt("year"), x.getInt("month")),
x.getDouble("avgTemp")))
// join yields (key, (temp2015 value, temp2014 value))
val hotter = temp2015.join(temp2014).filter(x => x._2._1 > x._2._2)
• Non-Partition Key Predicates
csc.sql("SELECT * FROM ks.temperatures WHERE temp > 100")
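The shape of a keyed join result is easy to get wrong, so here is a plain-Scala sketch (no Spark required) of what the HDFS join example computes: an inner join on the key produces `(key, (leftValue, rightValue))`, and the filter compares the two values. The `join` and `hotter` helpers and the sample data are hypothetical, written only to mirror the pair-RDD semantics on ordinary collections.

```scala
// Plain-collections model of a pair-RDD inner join; helper names are hypothetical.
object JoinShapeSketch {
  // Inner join on key: one output row per matching (left, right) pair.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val rightByKey: Map[K, Seq[B]] =
      right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    for {
      (k, a) <- left
      b <- rightByKey.getOrElse(k, Seq.empty[B])
    } yield (k, (a, b))
  }

  // Keep keys whose current value exceeds the previous one.
  def hotter[K](current: Seq[(K, Double)], previous: Seq[(K, Double)]): Seq[(K, (Double, Double))] =
    join(current, previous).filter(x => x._2._1 > x._2._2)

  def main(args: Array[String]): Unit = {
    // Made-up sample readings keyed by (sensor_id, month).
    val current = Seq(((1, 7), 31.5), ((2, 7), 18.0))
    val previous = Seq(((1, 7), 29.0), ((2, 7), 19.5))
    hotter(current, previous).foreach(println)
  }
}
```

Note that both sides must share the same key for any rows to match, so a key that includes the year only joins rows from the same year.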
Tools
• ODBC and JDBC tools via SparkSQL
– Tableau, Pentaho, R, etc
• Apache Zeppelin (incubating)
– A web-based notebook that enables interactive data analytics
Quick word on Spark Streaming and Cassandra
• Very good combination
– Simple, powerful, useful, scalable, etc, etc, etc.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()
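What the streaming job computes for each 1-second batch can be sketched on an ordinary collection, with no Spark or Cassandra needed. The `countWords` helper is a hypothetical stand-in that applies the same flatMap / map / reduce-by-key steps to one batch of lines. In the real job, each batch's result would be written to the "test"."words" table.

```scala
// Per-batch word count on plain collections; a stand-in for the DStream pipeline.
object BatchWordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // split each line into words
      .map(word => (word, 1))         // pair each word with a count of 1
      .groupBy(_._1)                  // gather pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum counts per word

  def main(args: Array[String]): Unit =
    countWords(Seq("spark and cassandra", "spark streaming")).foreach(println)
}
```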
DataStax Enterprise
Combines Cassandra,
Spark, and Solr (and more!)
- Fault Tolerance
- Management
- Visual Monitoring
- Security
- ETC!
Motivating Use Case
Internet of Things
Cassandra + Spark
• Unleash the power of analytics
• On your operational data
– IoT, Web, Mobile, etc
“Because that’s where
the Data is.”
Contacts and Links
• Links
– Cassandra Summit: http://cassandrasummit-datastax.com/
– DataStax Academy: https://academy.datastax.com/
• Contacts
– Kevin Pardue, Regional Channel Manager: kevin.pardue@datastax.com
– Brian Hess, Sr Product Manager for Analytics: brian.hess@datastax.com
– Devin Saxon, Marketing Specialist: dsaxon@datastax.com


Editor's Notes

  • #34 Spark has a very simple architecture (see chart). The basic model for RDDs is really nice: easy to grok, and there are many sources you can get an RDD from. Lots of fun languages supported.
  • #35 Spark Master: analogous to the Job Tracker. Initial contact point for applications. Keeps track of the state of the system.
  • #36 Spark Worker: Task Tracker … Manages starting "executors" on machines. Reports, and sets up the env for executors.
  • #37 Spark Executor: actually does the work. Process started by the worker; communicates directly with the driving application and master. 1 Spark Partition per executor … KEY: Spark Partition != Cassandra Partition
  • #38 RDDs: where do they come from? All sorts of great places.
  • #39 So how do we act on these RDDs?
  • #42 Basics of how the OSS Connector works and how an RDD is split up.