©2013 DataStax Confidential. Do not distribute without consent.
@PatrickMcFadin
Patrick McFadin

Chief Evangelist for Apache Cassandra
Apache Cassandra and Spark
You got the lighter, let’s spark the fire
6 years. How’s it going?
Cassandra 3.0 & 3.1
Spring and Fall
Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Great scaling story
• Resilient to failure
Cassandra for Applications
A Data Ocean, Lake, or Pond
An In-Memory Database
A Key-Value Store
A magical database unicorn that farts rainbows
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
Up to 100× faster
(2-10× on disk)
2-5× less code
Spark Components
• Spark Core
• Spark SQL (structured data)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph processing)
org.apache.spark.rdd.RDD
Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, ...) or from other RDDs
• Immutable
• Partitioned
• Reusable
RDD Operations
• Transformations - similar to the Scala collections API
  • Produce new RDDs
  • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
  • Require materialization of the records to generate a value
  • collect: Array[T], count, fold, reduce, ...
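A minimal Scala sketch of the distinction, assuming a running SparkContext sc (the data here is illustrative):

// Pair-RDD implicits for reduceByKey (Spark 1.x)
import org.apache.spark.SparkContext._

// Transformations are lazy: they only describe new RDDs
val words = sc.parallelize(Seq("spark", "cassandra", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // nothing runs yet

// Actions materialize records and return a value to the driver
counts.count()   // 2
counts.collect() // e.g. Array((spark,2), (cassandra,1))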
[Diagram: RDD operations - a chain of transformations feeding into an action]
Cassandra and Spark
Cassandra & Spark: A Great Combo
DataStax: spark-cassandra-connector
https://github.com/datastax/spark-cassandra-connector
• Both are easy to use
• Spark can help you bridge your Hadoop and Cassandra systems
• Use Spark libraries and caching on top of Cassandra-stored data
• Combine Spark Streaming with Cassandra storage
Spark On Cassandra
• Server-side filters (WHERE clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural time series integration
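A hedged sketch of a server-side filter using the connector's .where, which pushes a CQL predicate down to Cassandra (the newer package name below is an assumption; the table is the weather example used later in this deck):

import com.datastax.spark.connector._ // newer releases; early docs used com.datastax.driver.spark._

// Only rows matching the predicate leave the Cassandra nodes
val station = sc.cassandraTable("test", "temperature")
  .where("weather_station = ?", "10010:99999")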
Apache Spark and Cassandra Open Source Stack
[Diagram: the Spark Cassandra Connector sitting between Spark and Cassandra]
Spark Cassandra Connector
• Cassandra tables exposed as Spark RDDs
• Read from and write to Cassandra
• Mapping of C* tables and rows to Scala objects
• All Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual nodes support
• Use with Scala or Java
• Compatible with Spark 1.1.0 and Cassandra 2.1 & 2.0
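A minimal build.sbt line to pull the connector in, as a sketch (the 1.1.0 version is an assumption matched to the Spark 1.1.0 compatibility above; verify the exact artifact version for your setup):

// build.sbt
libraryDependencies +=
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"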
Type Mapping
CQL Type       Scala Type
ascii          String
bigint         Long
boolean        Boolean
counter        Long
decimal        BigDecimal, java.math.BigDecimal
double         Double
float          Float
inet           java.net.InetAddress
int            Int
list           Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map            Map, TreeMap, java.util.HashMap
set            Set, TreeSet, java.util.HashSet
text, varchar  String
timestamp      Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid       java.util.UUID
uuid           java.util.UUID
varint         BigInt, java.math.BigInteger
*Nullable values map to Option
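In practice the mapping looks like this; a sketch against the test.words table used on the next slides (the Option accessor for nullable columns is my reading of the API, so verify it against your connector version):

val row = sc.cassandraTable("test", "words").first
val word: String = row.getString("word")   // text -> String
val count: Int   = row.getInt("count")     // int  -> Int
val maybe: Option[Int] = row.get[Option[Int]]("count") // nullable -> Option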
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
Accessing Data
CREATE TABLE test.words (word text PRIMARY KEY, count int);
INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

// Accessing the table above as an RDD:
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size        // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30
Saving Data
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

// The RDD above saved to Cassandra:
newRdd.saveToCassandra("test", "words", Seq("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)
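The connector can also map case classes to columns by field name; a hedged sketch against the same table (the case class and row are illustrative):

case class WordCount(word: String, count: Int)

// word -> word, count -> count, mapped by name
sc.parallelize(Seq(WordCount("owl", 60)))
  .saveToCassandra("test", "words")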
Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector
[Diagram: a Cassandra keyspace/table mapped to a Spark RDD[CassandraRow] / RDD[Tuples]]
Bundled and supported with DSE 4.5!
Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*
[Diagram: Spark Executors connecting to the C* cluster]
• Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver
• The full token range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, ...)
• RDDs are read in different splits based on those sets of tokens
Co-locate Spark and C* for Best Performance
[Diagram: a Spark Worker running alongside each C* node, plus a Spark Master]
Running Spark Workers on the same nodes as your C* cluster saves network hops when reading and writing.
Analytics Workload Isolation
Mixed-load Cassandra cluster: the online app talks to a Cassandra-only DC, while the analytical app talks to a Cassandra + Spark DC.
Data Locality
Example 1: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
Data Model
• Weather Station Id and Time are unique
• Store as many as needed
CREATE TABLE temperature (
weather_station text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY (weather_station,year,month,day,hour)
);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);
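Because rows within a partition come back in clustering order, one day's readings can be read in sequence; a sketch with the connector (the test keyspace is an assumption):

// Partition key plus leading clustering columns, evaluated server side
val day = sc.cassandraTable("test", "temperature")
  .where("weather_station = ? AND year = ? AND month = ? AND day = ?",
         "10010:99999", 2005, 12, 1)
day.toArray.foreach(println) // hours 7..10, in clustering order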
Primary key relationship
PRIMARY KEY (weather_station,year,month,day,hour)
• Partition Key: weather_station
• Clustering Columns: year, month, day, hour
Partition 10010:99999 stores its readings in clustering order:
2005:12:1:7 -> -5.6 | 2005:12:1:8 -> -5.1 | 2005:12:1:9 -> -4.9 | 2005:12:1:10 -> -5.3
Partition keys
10010:99999  Murmur3 Hash  Token = 7224631062609997448
722266:13850 Murmur3 Hash  Token = -6804302034103043898
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('722266:13850',2005,12,1,7,-5.6);
Consistent hash (Murmur3): Cassandra uses a 64-bit token between -2^63 and 2^63-1.
Partition keys
10010:99999  Murmur3 Hash  Token = 15
722266:13850 Murmur3 Hash  Token = 77
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('722266:13850',2005,12,1,7,-5.6);
For this example, let's use small, readable token numbers.
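For intuition only, a sketch of hashing a partition key down to a 64-bit value with Guava's MurmurHash3 (an assumed dependency, and not byte-identical to Cassandra's Murmur3Partitioner):

import com.google.common.hash.Hashing
import java.nio.charset.StandardCharsets

// Illustrative: a 128-bit Murmur3 hash truncated to 64 bits,
// roughly how a partition key becomes a token
val token: Long = Hashing.murmur3_128()
  .hashString("10010:99999", StandardCharsets.UTF_8)
  .asLong()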
Writes & WAN replication
DC1 (RF=3), nodes and primary token ranges:
10.0.0.1: 00-25, 10.0.0.2: 26-50, 10.0.0.3: 51-75, 10.0.0.4: 76-100

Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50

DC2 (RF=3), nodes and primary token ranges:
10.10.0.1: 00-25, 10.10.0.2: 26-50, 10.10.0.3: 51-75, 10.10.0.4: 76-100

Node       Primary  Replica  Replica
10.10.0.1  00-25    76-100   51-75
10.10.0.2  26-50    00-25    76-100
10.10.0.3  51-75    26-50    00-25
10.10.0.4  76-100   51-75    26-50

A client inserts data with partition key = 15: the write lands on the node owning that token, is replicated asynchronously within DC1, and is replicated asynchronously over the WAN to DC2.
Locality
Same topology as above: DC1 and DC2, each RF=3, with the same token assignments.

A client reading data with partition key = 15 is served by a replica that owns token 15 - in either DC, the request goes to the nodes holding that partition.
Data Locality
weather_station = '10010:99999' - which node holds it?
1000-node cluster: you are here!
Spark Reads on Cassandra
Awesome animation by DataStax's own Russell Spitzer
Spark RDDs
Represent a large amount of data partitioned into chunks
[Animation: an RDD's partitions 1-9 being distributed across Nodes 1-4]
Cassandra Data is Distributed By Token Range
[Animation: a token ring running from 0 through 500 to 999]
• Without vnodes: each node (Node 1-4) owns one contiguous slice of the ring
• With vnodes: each node owns many small, scattered ranges
The Connector Uses Information on the Node to Make Spark Partitions
spark.cassandra.input.split.size 50
Reported density is 0.5

[Animation: Node 1 owns token ranges 120-220, 300-500, 780-830, and 0-50]
• With split size 50 and density 0.5, each split should cover roughly 100 tokens' worth of data
• A right-sized range becomes one split (120-220 becomes split 1)
• A large range is cut up (300-500 becomes splits 2 and 3: 300-400 and 400-500)
• Small ranges are merged (780-830 and 0-50 together become split 4)
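A sketch of setting both knobs on the SparkConf (the property names come from these slides; defaults and exact behavior vary by connector version):

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // target CQL rows per Spark partition: drives how token ranges are split/merged
  .set("spark.cassandra.input.split.size", "50")
  // CQL rows fetched per page while reading a split
  .set("spark.cassandra.input.page.row.size", "50")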
Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.page.row.size 50

[Animation: Spark partition 4 on Node 1 covers token ranges 780-830 and 0-50]
Each token range is read with its own range query:

SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830

SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50

The driver pages results back 50 CQL rows at a time until each range is exhausted.
Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables

Windsor, California
July 1, 2014
High: 73.4
Low: 51.4
Roll-up table (SparkSQL example)
CREATE TABLE daily_high_low (
  weatherstation text,
  date text,
  high_temp double,
  low_temp double,
  PRIMARY KEY ((weatherstation,date))
);
• Weather Station Id and Date are unique
• High and low temp for each day

SparkSQL> INSERT INTO TABLE
        >   daily_high_low
        > SELECT
        >   weatherstation, to_date(year, month, day) date, max(temperature) high_temp, min(temperature) low_temp
        > FROM temperature
        > GROUP BY weatherstation, year, month, day;
OK
Time taken: 2.345 seconds

(to_date is a function; max and min are aggregations)
What just happened
• Data is read from the temperature table
• Transformed
• Inserted into the daily_high_low table

temperature table -> read -> transform -> insert -> daily_high_low table
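The same roll-up can be sketched with the plain RDD API and the connector (the package name and the test keyspace are assumptions; the deck's SparkSQL version is above):

import com.datastax.spark.connector._   // connector functions, e.g. saveToCassandra
import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

// Key readings by (station, date), keep a (high, low) pair per key
val dailyHighLow = sc.cassandraTable("test", "temperature")
  .map { row =>
    val station = row.getString("weather_station")
    val y = row.getInt("year"); val m = row.getInt("month"); val d = row.getInt("day")
    val t = row.getDouble("temperature")
    ((station, f"$y-$m%02d-$d%02d"), (t, t))
  }
  .reduceByKey { case ((h1, l1), (h2, l2)) => (math.max(h1, h2), math.min(l1, l2)) }
  .map { case ((station, date), (high, low)) => (station, date, high, low) }

dailyHighLow.saveToCassandra("test", "daily_high_low",
  SomeColumns("weatherstation", "date", "high_temp", "low_temp"))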
Spark Streaming
Spark versus Spark Streaming: batch Spark crunches zillions of bytes at rest; Spark Streaming processes gigabytes per second as they arrive.
[Diagram: streaming sources such as Kinesis and S3 feeding analytics and search]
DStream - Micro Batches
μBatch (ordinary RDD) | μBatch (ordinary RDD) | μBatch (ordinary RDD)
Processing of a DStream = processing of its μBatches (RDDs)
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations run as a series of deterministic batch computations on small time intervals
Spark Streaming Example
// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(conf)

// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")

ssc.start()
ssc.awaitTermination()
Now what?
[Diagram: run Spark Jobs and Spark Streaming against the Cassandra + Spark DC, keeping the Cassandra-only DC for the online app]
You can do this at home!
https://github.com/killrweather/killrweather
On your USB!
Thank you!
Bring the questions
Follow me on Twitter: @PatrickMcFadin
