Big Data Analytics with Spark

Sorry for the Delay
• There were some technical difficulties, so we are giving folks a few more minutes to join
• Again, sorry for the delay
© 2014 DataStax, All Rights Reserved. Company Confidential

Webinar Housekeeping
• All attendees placed on mute
• Input questions at any time using the online interface

Big Data Analytics with Cassandra and Spark
Brian Hess
Sr. Product Manager for Analytics
DataStax

Willie Sutton
Bank Robber in the 1930s-1950s
FBI Most Wanted List 1950
Captured in 1952

Willie Sutton
When asked “Why do you rob banks?”
“Because that’s where the money is.”

Motivating Use Case
Internet of Things
Your System

Motivating Use Case
Internet of Things
Your System: FAULT

Cassandra
Spark
Spark + Cassandra

Apache Cassandra
• Distributed NoSQL database
– BigTable meets Dynamo
• All nodes are equal
– Always on
– Linear scale out - a lot
• More data
• More transactions
• Multi-Datacenter
– Geographic or Workload
• Cassandra Query Language
– SQL-like
[Figure: linear scale-out, shown as clusters handling 100,000, 200,000, and 400,000 txns/sec]

How Cassandra Works – Writes
It’s 72°

Done

Tunable Consistency
• Relax the Consistency in ACID
– Isn’t always needed – and isn’t guaranteed anyway (in distributed DBs)
– Reads may not get the most up-to-date data, but almost always will
• All data is replicated
– Set in the schema
– Distributed to nodes by Token Range
• Options:
– QUORUM, ONE, ALL
• Can ensure reads get most up-to-date value
– E.g. – read/write at QUORUM
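
The QUORUM example works because a read set and a write set that together exceed the replication factor must share at least one replica. A minimal sketch of that arithmetic in plain Scala, assuming a single datacenter; the names `ConsistencySketch`, `replicasRequired`, and `readSeesLatestWrite` are illustrative and not part of any Cassandra API:

```scala
object ConsistencySketch {
  // The three consistency levels named on the slide.
  sealed trait Level
  case object One extends Level
  case object Quorum extends Level
  case object All extends Level

  // Number of replicas that must acknowledge an operation
  // for a given replication factor (rf).
  def replicasRequired(level: Level, rf: Int): Int = level match {
    case One    => 1
    case Quorum => rf / 2 + 1 // a strict majority
    case All    => rf
  }

  // A read is guaranteed to touch at least one replica that has the
  // latest acknowledged write when the two replica sets must overlap.
  def readSeesLatestWrite(write: Level, read: Level, rf: Int): Boolean =
    replicasRequired(write, rf) + replicasRequired(read, rf) > rf
}
```

With a replication factor of 3, QUORUM means 2 replicas, so writing and reading at QUORUM overlap (2 + 2 > 3), while writing and reading at ONE do not (1 + 1 is not greater than 3).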

How Cassandra Works – Tunable Consistency
“You got it. I’ll make sure everyone gets it.”
QUORUM: “You got it. A majority got it. The rest will.”
ONE: “You got it. One guy got it. The rest will.”
ALL: “You got it. Everyone has it.”

How Cassandra Works – Query
SELECT user_id FROM users WHERE name = 'PBCupFan';

“Sure thing, let me get that for you.”
“What do you guys have for PBCup?”
“Here’s what I have.”
“Let me resolve any conflicts.”
“Here ya go!”

 user_id
---------
    1234
(1 rows)
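
The "resolve any conflicts" step can be pictured as last-write-wins: when replicas return different versions of a value, the one with the newest write timestamp is returned. A simplified sketch in plain Scala; the `Versioned` type and `resolve` function are hypothetical, and real Cassandra reconciles each column separately and has tie-breaking rules that this sketch ignores:

```scala
object ReadResolutionSketch {
  // One replica's answer: a value plus the timestamp of the write that produced it.
  final case class Versioned(value: String, writeTimestampMicros: Long)

  // Pick the response with the newest write timestamp, if any replica answered.
  def resolve(responses: Seq[Versioned]): Option[Versioned] =
    if (responses.isEmpty) None
    else Some(responses.maxBy(_.writeTimestampMicros))
}
```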

Cassandra for Internet of Things
It’s all about scaling

Cassandra
• Always On
– No down time
• Linear Scalability
– For writes or reads
– For data size
• Terrific choice for Internet of Things, Web, Mobile, etc.
– British Gas, Nike, etc – Thermostats, Manufacturing, Oil/Gas, etc
It’s where the data is!

Cassandra Limitations
• No aggregations
– Optimized for lookups & writes
– No GROUP BYs
– No Windowed Aggregates
• No Joins
– Design the data model to avoid them
• Must select by partition key
– There are secondary indexes
• But they are an antipattern
• Not optimized for full-table scans
It actually can’t do everything

Apache Spark
• Distributed computing framework
• Generalized DAG execution
• Easy Abstraction for Datasets
• Integrated SQL Queries
• Streaming
• Machine Learning Library

Spark Components
Spark Core Engine
Spark SQL, Spark Streaming, MLlib, GraphX, SparkR

Spark Provides a Simple and Efficient Framework for Distributed Computations
Node Roles: 2
In-Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction for Datasets? DataFrame! (previously Resilient Distributed Dataset (RDD))
[Diagram: one Spark Master and three Spark Workers; labels: Spark Executor, Spark Partition, DataFrame (or RDD)]

Spark Provides a Simple and Efficient
framework for Distributed Computations
Spark Master: Assigns cluster resources to applications
Spark Worker: Manages executors running on a machine
Spark Executor: Started by the Worker; the workhorse of the Spark application

RDDs Can Be Generated from a Variety of Sources
• Text files
• Parallelized collections

Spark on Cassandra
Spark Core Engine
Spark SQL, Spark Streaming, MLlib, GraphX, SparkR
Cassandra
DataStax Spark-Cassandra Connector

Spark Cassandra Connector Uses the DataStax Java Driver to Read from and Write to Cassandra
• Each Executor maintains a connection to the C* cluster
• RDDs are read into different splits based on sets of tokens
[Diagram: Spark Executor with the DataStax Java Driver; the full C* token range divided into Tokens 1-1000, Tokens 1001-2000, Tokens …]
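
Reading into splits based on sets of tokens can be illustrated by cutting a token range into contiguous sub-ranges, one per split. A sketch in plain Scala that reproduces the slide's small numbers; `TokenSplitSketch` and `splits` are hypothetical names, the fixed split size is an assumption, and this is not how the connector itself computes its splits:

```scala
object TokenSplitSketch {
  // Cut the inclusive range [first, last] into contiguous inclusive
  // sub-ranges of at most `size` tokens each.
  def splits(first: Long, last: Long, size: Long): Vector[(Long, Long)] = {
    require(size > 0, "size must be positive")
    require(first <= last, "range must not be empty")
    Iterator
      .iterate(first)(_ + size)   // start token of each split
      .takeWhile(_ <= last)
      .map(lo => (lo, math.min(lo + size - 1, last)))
      .toVector
  }
}
```

For example, `TokenSplitSketch.splits(1L, 3000L, 1000L)` yields the ranges 1-1000, 1001-2000, and 2001-3000, each of which could be read by a different executor.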

Co-locate Spark and C* for Best Performance
• Run Cassandra and
Spark on same nodes
• Local reads/writes
• Increased performance

Things you can’t do in Cassandra – Using SparkSQL
• JOINs
// csc is assumed to be a Cassandra-aware SQL context; a SparkContext (sc) has no sql method
csc.sql("""SELECT t.sensor_id, t.temp, m.location
           FROM ks.temperatures t JOIN ks.metadata m
           ON t.sensor_id = m.sensor_id
           WHERE t.sensor_id = 12345""")
• Aggregates
csc.sql("""SELECT sensor_id, year, month, MAX(temp) AS mtemp
           FROM ks.temperatures
           GROUP BY sensor_id, year, month""")
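
For readers less familiar with SQL, the aggregate above groups readings by sensor, year, and month and keeps the highest temperature in each group. A sketch of the same computation over an in-memory collection in plain Scala, with no Spark or Cassandra involved; the `Reading` type and `maxTempPerMonth` function are hypothetical:

```scala
object AggregateSketch {
  // One temperature reading; field names mirror the columns in the query.
  final case class Reading(sensorId: Int, year: Int, month: Int, temp: Double)

  // Equivalent of: SELECT sensor_id, year, month, MAX(temp) ... GROUP BY sensor_id, year, month
  def maxTempPerMonth(readings: Seq[Reading]): Map[(Int, Int, Int), Double] =
    readings
      .groupBy(r => (r.sensorId, r.year, r.month))
      .map { case (key, group) => key -> group.map(_.temp).max }
}
```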

Things you can’t do in Cassandra – External Data
• JOIN with HDFS data
import com.datastax.spark.connector._  // adds sc.cassandraTable
// temperatures from HDFS, keyed by (sensor_id, year, month)
val temp2014 = sc.textFile("webhdfs://myhadoop/data/temp2014.csv").
  map(x => x.split(",")).
  map(x => ((x(0).toInt, x(1).toInt, x(2).toInt), x(3).toDouble))
// temperatures from Cassandra, with the same key layout
val temp2015 = sc.cassandraTable("ks", "temperatures").
  map(x => ((x.getInt("sensor_id"), x.getInt("year"), x.getInt("month")),
            x.getDouble("avgTemp")))
// join matches equal keys; each joined value is (Cassandra temp, HDFS temp)
val hotter = temp2015.join(temp2014).filter(x => x._2._1 > x._2._2)
• Non-Partition Key Predicates
csc.sql("SELECT * FROM ks.temperatures WHERE temp > 100")
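
The pair join in the HDFS example is an inner join on the key: only keys present on both sides survive, and each result value is a pair of (left value, right value). A sketch of that shape using plain Scala maps rather than RDDs; `innerJoin` and `leftIsHotter` are hypothetical helpers, and a map holds one value per key whereas an RDD may hold several:

```scala
object JoinSketch {
  type Key = (Int, Int, Int) // (sensor_id, year, month)

  // Keep only keys found on both sides, pairing up their values.
  def innerJoin(left: Map[Key, Double], right: Map[Key, Double]): Map[Key, (Double, Double)] =
    left.collect { case (k, l) if right.contains(k) => k -> (l, right(k)) }

  // Keep the joined entries whose left value exceeds the right value.
  def leftIsHotter(joined: Map[Key, (Double, Double)]): Map[Key, (Double, Double)] =
    joined.filter { case (_, (l, r)) => l > r }
}
```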

Tools
• ODBC and JDBC tools via SparkSQL
– Tableau, Pentaho, R, etc
• Apache Zeppelin (incubating)
– A web-based notebook that enables interactive data analytics

Quick word on Spark Streaming and Cassandra
• Very good combination
– Simple, powerful, useful, scalable, etc, etc, etc.

Quick word on Spark Streaming and Cassandra
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf(true)...
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
// stream input (serverIP and serverPort must be defined by the application)
val lines = ssc.socketTextStream(serverIP, serverPort)
// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// stream output
wordCounts.saveToCassandra("test", "words")
// start processing
ssc.start()
ssc.awaitTermination()
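
Within each one-second batch, the flatMap / map / reduceByKey chain above amounts to an ordinary word count. A sketch of what one batch computes, using plain Scala collections and no Spark; `countWords` is a hypothetical helper, and grouping then summing stands in for `reduceByKey(_ + _)`:

```scala
object WordCountSketch {
  // Count occurrences of each space-separated word in one batch of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)      // lines -> words
      .map(word => (word, 1))           // word -> (word, 1)
      .groupBy(_._1)                    // gather the pairs for each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }
}
```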

DataStax Enterprise
Combines Cassandra, Spark, and Solr (and more!)
- Fault Tolerance
- Management
- Visual Monitoring
- Security
- ETC!

Motivating Use Case
Internet of Things

Cassandra + Spark
• Unleash the power of analytics
• On your operational data
– IoT, Web, Mobile, etc
“Because that’s where the Data is.”

Contacts and Links
• Links
– Cassandra Summit: http://cassandrasummit-datastax.com/
– DataStax Academy: https://academy.datastax.com/
• Contacts
– Kevin Pardue, Regional Channel Manager: kevin.pardue@datastax.com
– Brian Hess, Sr Product Manager for Analytics: brian.hess@datastax.com
– Devin Saxon, Marketing Specialist: dsaxon@datastax.com
