Apache Spark and Cassandra
Patrick McFadin

Chief Evangelist for Apache Cassandra, DataStax
@PatrickMcFadin
About me
• Chief Evangelist for Apache Cassandra
• Senior Solution Architect at DataStax
• Chief Architect, Hobsons
• Web applications and performance since 1996
Apache Cassandra
Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Great scaling story
• Resilient to failure
Cassandra for Applications
Example: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
Queries supported
CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Get weather data given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
Aggregation Queries
CREATE TABLE daily_aggregate_temperature (
   wsid text,
   year int,
   month int,
   day int,
   high double,
   low double,
   mean double,
   variance double,
   stdev double,
   PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
Get temperature stats given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
Windsor California
July 1, 2014
High: 73.4
Low: 51.4
How do we compute these rollups?
Apache Spark
MapReduce
[Diagram: Input Data → Map → Intermediate Data → Reduce → Output Data, with Intermediate Data written to Disk]
Data Science at Scale
2009
In memory
[Diagram: the same Map/Reduce flow, but Intermediate Data is held in memory]
[Diagram: Spark keeps Input, Intermediate, and Output Data in Memory instead of on Disk]
Resilient Distributed Dataset
RDD
RDDs are:
•Immutable
•Partitioned
•Reusable
and have:
Transformations
•Produce a new RDD
•Calls: filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
Actions
•Start cluster computing operations
•Calls: collect: Array[T], count, fold, reduce, ...
(A short sketch follows the API list below.)
API
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save 

...
reduce
map
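
To make the transformation/action split concrete, here is a minimal sketch in plain Spark (no Cassandra yet), assuming a SparkContext sc such as the one the spark-shell provides; the numbers are illustrative:

// Transformations are lazy and each returns a new RDD;
// nothing runs on the cluster until an action is called.
val rdd     = sc.parallelize(1 to 100)   // source RDD
val evens   = rdd.filter(_ % 2 == 0)     // transformation
val doubled = evens.map(_ * 2)           // transformation
val total   = doubled.reduce(_ + _)      // action: triggers the computation
println(total)                           // 5100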
Spark Streaming
Near Real-time
SparkSQL
Structured Data
MLlib
Machine Learning
GraphX
Graph Analysis
Cassandra and Spark
Great combo
Store a ton of data. Analyze a ton of data.
Great combo
Spark Streaming
Near Real-time
SparkSQL
Structured Data
MLlib
Machine Learning
GraphX
Graph Analysis
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Spark Connector
[Diagram: a Spark Master with Workers colocated with Cassandra on each Server; each Worker runs Executors that talk to the local node through the Spark Connector]
Token Ranges
[Diagram: the full token range 0-100 split across four nodes: 0-24, 25-49, 50-74, 75-99]
Each worker reads only its node's token ranges: "I will only analyze 25% of the data."
[Diagram: the same nodes serving both Transactional (Cassandra) and Analytics (Spark) workloads]
The executor that owns token range 75-99 issues:
SELECT *
FROM keyspace.table
WHERE token(pk) > 75
AND token(pk) <= 99
[Diagram: the results flow through the Spark Connector into a Spark RDD made up of Spark Partitions]
                           Cassandra   Cassandra + Spark
Joins and Unions           No          Yes
Transformations            Limited     Yes
Outside Data Integration   No          Yes
Aggregations               Limited     Yes
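
As a hedged illustration of the first row: a join CQL alone cannot express becomes an ordinary RDD join in Spark. The table and column names (weather_station, id, name) come from the join slide later in this deck:

import com.datastax.spark.connector._

// Key each table by station id, then join in Spark.
val stationNames = sc.cassandraTable("isd_weather_data", "weather_station")
  .map(row => (row.getString("id"), row.getString("name")))
val temps = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .map(row => (row.getString("wsid"), row.getDouble("temperature")))

val joined = temps.join(stationNames)   // (wsid, (temperature, name))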
Spark Reads on Cassandra
Awesome animation by DataStax’s own Russell Spitzer
Spark RDDs
Represent a Large Amount of Data
Partitioned into Chunks
[Diagram: an RDD of nine chunks (1-9) spread across Node 1 through Node 4]
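
A quick way to see this chunking from the shell, assuming the deck's isd_weather_data keyspace is loaded:

val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
// One Spark partition per chunk; the count depends on data size and split settings
println(s"RDD is split into ${tableRDD.partitions.length} partitions")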
Cassandra Data is Distributed By Token Range
[Diagram: a token ring labeled 0, 500, 999, divided among Node 1 through Node 4]
Without vnodes: each node owns one contiguous slice of the ring
With vnodes: each node owns many small, scattered slices
The Connector Uses Information on the Node to Make Spark Partitions
spark.cassandra.input.split.size 50
Reported density is 0.5
[Animation: Node 1 owns token ranges 0-50, 120-220, 300-500, and 780-830. Using the split size and the reported density (estimated CQL rows per token), the connector splits ranges that are too big (300-500 becomes 300-400 and 400-500) and groups small ones together (780-830 with 0-50) until each group covers roughly split.size CQL rows, producing Spark partitions 1 through 4. A sketch of this grouping follows.]
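
The following is an illustrative sketch of that grouping logic, not the connector's actual code; the ranges, split size, and density are the ones on the slide:

// Illustrative only: split token ranges by estimated row count, then
// greedily pack them into partitions of roughly `splitSize` CQL rows.
case class TokenRange(start: Long, end: Long) { def tokens: Long = end - start }

val splitSize = 50L   // spark.cassandra.input.split.size
val density   = 0.5   // reported density: estimated CQL rows per token

val node1Ranges = Seq(
  TokenRange(120, 220), TokenRange(300, 500),
  TokenRange(780, 830), TokenRange(0, 50))

// 1. Split any range that alone exceeds the target row count.
val pieces = node1Ranges.flatMap { r =>
  val parts = math.max(1, ((r.tokens * density) / splitSize).toLong).toInt
  val step  = r.tokens / parts
  (0 until parts).map { i =>
    TokenRange(r.start + i * step,
               if (i == parts - 1) r.end else r.start + (i + 1) * step)
  }
}

// 2. Greedily pack pieces until a partition reaches ~splitSize rows.
val partitions = pieces.foldLeft(Vector(Vector.empty[TokenRange])) { (acc, r) =>
  val rows = (acc.last.map(_.tokens).sum + r.tokens) * density
  if (rows > splitSize) acc :+ Vector(r)
  else acc.init :+ (acc.last :+ r)
}
// Yields 4 partitions, matching the animation
partitions.zipWithIndex.foreach { case (p, i) => println(s"Partition ${i + 1}: $p") }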
spark.cassandra.input.page.row.size 50
Data is Retrieved Using the DataStax Java Driver
[Animation: the executor that owns Node 1's ranges 780-830 and 0-50 issues one query per token range and pages through the results 50 CQL Rows at a time:]
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50
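
Outside of Spark, the same paged read can be sketched directly against the DataStax Java driver (3.x API); the host is local and keyspace.table is the slide's placeholder:

import com.datastax.driver.core.{Cluster, SimpleStatement}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

val stmt = new SimpleStatement(
  "SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830")
stmt.setFetchSize(50)   // driver-side analog of spark.cassandra.input.page.row.size

// The driver fetches the next 50-row page transparently as the iterator advances.
session.execute(stmt).iterator().asScala.foreach(println)

cluster.close()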
Attaching to Spark and Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

/** The setMaster("local") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally, when authentication is enabled:
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
Weather station example
CREATE TABLE raw_weather_data (
   wsid text,
   year int,
   month int,
   day int,
   hour int,
   temperature double,
   dewpoint double,
   pressure double,
   wind_direction int,
   wind_speed double,
   sky_condition int,
   sky_condition_text text,
   one_hour_precip double,
   six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()

println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()
[Diagram: an Executor runs SELECT * FROM isd_weather_data.raw_weather_data through the Spark Connector, materializing the result as Spark Partitions in a Spark RDD]
Using CQL
SELECT temperature
FROM raw_weather_data
WHERE wsid = '724940:23234'
  AND year = 2008
  AND month = 12
  AND day = 1;

val cqlRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", 2008, 12, 1)
Using SQL!
spark-sql> SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
           FROM raw_weather_data
           WHERE month = 6
           AND temperature != 0.0
           GROUP BY wsid, year, month, day;
724940:23234 2008 6 1 15.6 10.0
724940:23234 2008 6 2 15.6 10.0
724940:23234 2008 6 3 17.2 11.7
724940:23234 2008 6 4 17.2 10.0
724940:23234 2008 6 5 17.8 10.0
724940:23234 2008 6 6 17.2 10.0
724940:23234 2008 6 7 20.6 8.9
SQL with a Join
spark-sql> SELECT ws.name, raw.hour, raw.temperature
           FROM raw_weather_data raw
           JOIN weather_station ws
           ON raw.wsid = ws.id
           WHERE raw.wsid = '724940:23234'
           AND raw.year = 2008 AND raw.month = 6 AND raw.day = 1;
SAN FRANCISCO INTL AP 23 15.0
SAN FRANCISCO INTL AP 22 15.0
SAN FRANCISCO INTL AP 21 15.6
SAN FRANCISCO INTL AP 20 15.0
SAN FRANCISCO INTL AP 19 15.0
SAN FRANCISCO INTL AP 18 14.4
Python
from pyspark_cassandra import CassandraSparkContext
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf

conf = SparkConf() \
    .setAppName("PySpark Cassandra Test") \
    .setMaster("spark://127.0.0.1:7077") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

sc = CassandraSparkContext(conf=conf)
sql = SQLContext(sc)

rows = sql.sql('''SELECT max(temperature) AS high, wsid, year, month, day
                  FROM raw_weather_data
                  WHERE wsid = '724940:23234'
                  AND month = 6
                  AND day = 1
                  AND temperature != 0.0
                  GROUP BY wsid, year, month, day''').collect()

for row in rows:
    print(row.wsid, row.day, row.high)
Analyzing large data sets
val spanRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("wsid", "temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", 2008, 12, 1)
  .spanBy(row => row.getString("wsid"))
•Specify partition grouping
•Use with large partitions
•Perfect for time series (usage sketch below)
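
A hedged usage sketch: each span is the ordered run of rows for one partition key, so per-station aggregates fall out in a single pass over the spanRDD above:

// (wsid, Iterable[CassandraRow]) → (wsid, high, low) per station
spanRDD.map { case (wsid, rows) =>
  val temps = rows.map(_.getDouble("temperature"))
  (wsid, temps.max, temps.min)
}.collect().foreach(println)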
Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new
tables
Windsor California
July 1, 2014
High: 73.4
Low: 51.4
Saving back the weather data
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
    SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
    FROM raw_weather_data
    WHERE month = 6
    AND temperature != 0.0
    GROUP BY wsid, year, month, day
  """)
  .map { row => (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3), row.getDouble(4), row.getDouble(5)) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")
What just happened
• Data is read from the raw_weather_data table
• Transformed
• Inserted into the daily_aggregate_temperature table
[Diagram: Read data from table → Transform → Insert data into table]
Spark Streaming
The problem domain
[Diagram: petabytes of stored data arriving at gigabytes per second, feeding analytics and search]
Spark Streaming
[Diagram: sources such as Kinesis and S3 feeding Spark Streaming]
DStream - Micro Batches
[Diagram: a DStream as a sequence of μBatches, each an ordinary RDD]
Processing of DStream = Processing of μBatches, RDDs
DStream
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals (see the sketch below)
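
A minimal sketch of the pattern, assuming the SparkContext sc from earlier, a 10-second batch interval, and an illustrative socket source; the word_count table is hypothetical, not part of this deck's schema:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(10))

ssc.socketTextStream("127.0.0.1", 9999)   // illustrative source
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                     // each μBatch is an ordinary RDD
  // hypothetical table: word_count(word text PRIMARY KEY, count int)
  .saveToCassandra("isd_weather_data", "word_count", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()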
Now what?
[Diagram: a transactional "Cassandra Only" DC replicating to a "Cassandra + Spark" DC that runs Spark Jobs and Spark Streaming]
You can do this at home!
https://github.com/killrweather/killrweather
PatrickM50: 50% off Priority Pass
PatrickMCert: 25% off Certification
Thank you!
Bring the questions
Follow me on twitter
@PatrickMcFadin
