Apache Cassandra & Apache Spark for Time Series Data

Apache Cassandra & Apache Spark
for Time Series Data
Patrick McFadin
Chief Evangelist for Apache Cassandra, DataStax
@PatrickMcFadin
©2013 DataStax Confidential. Do not distribute without consent.

Cassandra for Applications

Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Based on Dynamo

Scaling
• Add nodes to scale
• Millions of ops/s
[Chart: throughput (ops/sec) for Cassandra, HBase, Redis, and MySQL]

Uptime
• Built to replicate
• Resilient to failure
• Always on
NONE

Replication
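• Client inserts data
• Asynchronous local replication within DC1
• Asynchronous WAN replication from DC1 to DC2

DC1: RF=3
node | token range
----------+-------------
10.0.0.1 | 00-25
10.0.0.2 | 26-50
10.0.0.3 | 51-75
10.0.0.4 | 76-100

DC2: RF=3
node | token range
-----------+-------------
10.10.0.1 | 00-25
10.10.0.2 | 26-50
10.10.0.3 | 51-75
10.10.0.4 | 76-100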
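The slide shows which node owns each token range but not how the other replicas are chosen. As a rough sketch only, the Scala below assumes the simplest placement rule (the owner plus the next nodes clockwise on the ring, as in Cassandra's SimpleStrategy); real multi-datacenter clusters use NetworkTopologyStrategy, which also considers racks. The object name, the `replicas` helper, and the 0-100 token space are illustrative, taken from or invented around the slide's numbers.

```scala
object ReplicaPlacementSketch {
  // Token ranges as drawn on the slide for DC1: (inclusive upper bound, node address).
  val ring: Vector[(Int, String)] = Vector(
    25  -> "10.0.0.1",
    50  -> "10.0.0.2",
    75  -> "10.0.0.3",
    100 -> "10.0.0.4"
  )

  // Assumption: replicas are the owning node plus the next (rf - 1) nodes clockwise.
  def replicas(token: Int, rf: Int): Vector[String] = {
    val start = ring.indexWhere { case (upperBound, _) => token <= upperBound }
    require(start >= 0, s"token $token is outside the 0-100 range used on the slide")
    require(rf <= ring.size, "replication factor cannot exceed the number of nodes")
    Vector.tabulate(rf)(i => ring((start + i) % ring.size)._2)
  }
}
```

Under these assumptions, `ReplicaPlacementSketch.replicas(30, 3)` returns the nodes 10.0.0.2, 10.0.0.3 and 10.0.0.4, and a token in the last range wraps around to the start of the ring.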

Data Model
• Familiar syntax
• Collections
• PRIMARY KEY for uniqueness
CREATE TABLE videos (
videoid uuid,
userid uuid,
name varchar,
description varchar,
location text,
location_type int,
preview_thumbnails map<text,text>,
tags set<varchar>,
added_date timestamp,
PRIMARY KEY (videoid)
);

Data Model - User Defined Types
• Complex data in one place
• No multi-gets (multi-partitions)
• Nesting!
CREATE TYPE address (
street text,
city text,
zip_code int,
country text,
cross_streets set<text>
);

Data Model - Updated
• Now video_metadata is embedded in videos
CREATE TYPE video_metadata (
height int,
width int,
video_bit_rate set<text>,
encoding text
);
CREATE TABLE videos (
videoid uuid,
userid uuid,
name varchar,
description varchar,
location text,
location_type int,
preview_thumbnails map<text,text>,
tags set<varchar>,
metadata set <frozen<video_metadata>>,
added_date timestamp,
PRIMARY KEY (videoid)
);

Data Model - Storing JSON
{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": {
    "units": "inches",
    "length": 50.0,
    "width": 66.0,
    "height": 32
  },
  "categories": {
    "Home Furnishings": {
      "catalogPage": 45,
      "url": "/home/furnishings"
    },
    "Kitchen Furnishings": {
      "catalogPage": 108,
      "url": "/kitchen/furnishings"
    }
  }
}
CREATE TYPE dimensions (
units text,
length float,
width float,
height float
);
CREATE TYPE category (
catalogPage int,
url text
);
CREATE TABLE product (
productId int,
name text,
price float,
description text,
dimensions frozen <dimensions>,
categories map <text, frozen <category>>,
PRIMARY KEY (productId)
);

Why…
Cassandra for Time Series?
Spark as a great addition to Cassandra?

Example 1: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Use case
Needed queries
• Get all data for one weather station
• Get data for a single date and time
• Get data for a range of dates and times
Data model to support queries
• Store data per weather station
• Store time series in order: first to last

Data Model
• Weather Station Id and Time are unique
• Store as many as needed
CREATE TABLE temperature (
weather_station text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY ((weather_station),year,month,day,hour)
);
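INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);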

Storage Model - Logical View
SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
weather_station | hour | temperature
-----------------+--------------+-------------
10010:99999 | 2005:12:1:7 | -5.6
10010:99999 | 2005:12:1:8 | -5.1
10010:99999 | 2005:12:1:9 | -4.9
10010:99999 | 2005:12:1:10 | -5.3

Storage Model - Disk Layout
SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12
            | -5.6 | -5.1 | -4.9 | -5.3 | -4.9 | -5.4
Merged, Sorted and Stored Sequentially

Primary key relationship
PRIMARY KEY (weather_station,year,month,day,hour)
• Partition key: weather_station
• Clustering columns: year, month, day, hour
10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10
            | -5.6 | -5.1 | -4.9 | -5.3

Data Locality
weather_station='10010:99999' ?
1000 Node Cluster
You are here!

Query patterns
• Range queries
• “Slice” operation on disk
SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
• Partition key for locality
• Single seek on disk
10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12
            | -5.6 | -5.1 | -4.9 | -5.3 | -4.9 | -5.4
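To make the partition key and clustering column idea concrete, here is a small in-memory sketch in plain Scala (no Cassandra involved). It assumes a partition can be pictured as a sorted map keyed by (year, month, day, hour), so the range query above becomes a slice of that map. The names `PartitionSliceSketch` and `hourSlice` are hypothetical; the values are the four rows inserted earlier.

```scala
import scala.collection.immutable.TreeMap

object PartitionSliceSketch {
  // Clustering columns in primary-key order: (year, month, day, hour).
  type Clustering = (Int, Int, Int, Int)

  // Partition key -> rows kept sorted by clustering columns, whatever the insert order.
  val table: Map[String, TreeMap[Clustering, Double]] = Map(
    "10010:99999" -> TreeMap(
      (2005, 12, 1, 10) -> -5.3,
      (2005, 12, 1, 7)  -> -5.6,
      (2005, 12, 1, 9)  -> -4.9,
      (2005, 12, 1, 8)  -> -5.1
    )
  )

  // Find the partition first, then take one contiguous slice of the sorted rows.
  def hourSlice(station: String, year: Int, month: Int, day: Int,
                fromHour: Int, toHour: Int): Vector[(Int, Double)] =
    table.get(station).toVector.flatMap { rows =>
      rows
        .range((year, month, day, fromHour), (year, month, day, toHour + 1)) // upper bound is exclusive
        .toVector
        .map { case ((_, _, _, hour), temperature) => (hour, temperature) }
    }
}
```

Calling `PartitionSliceSketch.hourSlice("10010:99999", 2005, 12, 1, 7, 10)` returns the four readings in hour order, even though they were added out of order.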

Query patterns
• Range queries
• “Slice” operation on disk
Sorted by event_time
Programmers like this
SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
weather_station | hour | temperature
-----------------+--------------+-------------
10010:99999 | 2005:12:1:7 | -5.6
10010:99999 | 2005:12:1:8 | -5.1
10010:99999 | 2005:12:1:9 | -4.9
10010:99999 | 2005:12:1:10 | -5.3

Apache Spark
• Up to 10x faster on disk and up to 100x faster in memory than Hadoop MapReduce
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
Up to 100× faster (2-10× on disk), 2-5× less code

Spark Components
• Spark Core
• Spark SQL: structured
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph

org.apache.spark.rdd.RDD
Resilient Distributed Dataset (RDD)
•Created through transformations on data (map,filter..) or other RDDs
•Immutable
•Partitioned
•Reusable

RDD Operations
• Transformations: similar to the Scala collections API
  • Produce new RDDs
  • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
  • Require materialization of the records to generate a value
  • collect: Array[T], count, fold, reduce, ...
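Since the slide compares transformations to the Scala collections API, the split between transformations and actions can be sketched with a lazy collection view. This is an analogy only, not Spark: the view stands in for an RDD, and `LazyPipelineSketch`, `evaluated`, and `sumOfOddSquares` are names made up for the example.

```scala
object LazyPipelineSketch {
  // Counts how many elements the map step has actually processed.
  var evaluated = 0

  val numbers = Seq(1, 2, 3, 4, 5)

  // "Transformations": only describe the computation; nothing runs yet.
  val pipeline = numbers.view
    .map { n => evaluated += 1; n * n }
    .filter(_ % 2 == 1)

  // "Action": forces the pipeline and returns an ordinary value.
  def sumOfOddSquares: Int = pipeline.sum
}
```

Before the action is called, `evaluated` is still 0; calling `sumOfOddSquares` runs the whole chain and yields 35. As with an uncached RDD, calling the action a second time recomputes the chain.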

RDD Operations
[Diagram: Transformation, Action]

Collections and Files To RDD
scala> val distData = sc.parallelize(Seq(1,2,3,4,5))
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
val distFile: RDD[String] = sc.textFile("directory/*.txt")
val distFile = sc.textFile("hdfs://namenode:9000/path/file")
// sequenceFile needs explicit key and value types; String is assumed here
val distFile = sc.sequenceFile[String, String]("hdfs://namenode:9000/path/file")

Spark on Cassandra
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration

Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams

Spark Cassandra Connector
[Diagram: User Application, Spark-Cassandra Connector, Spark Executor, and C* Java (Soon Scala) Driver, connected to a ring of Cassandra (C*) nodes]
https://github.com/datastax/spark-cassandra-connector

Spark Cassandra Example
// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(conf)
// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")
// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")
ssc.start()
ssc.awaitTermination()

Weather Station Analysis
• Cassandra stores in sequence
• Spark rolls up data into new tables
Windsor, California
July 1, 2014
High: 73.4F
Low: 51.4F

Roll-up table
CREATE TABLE daily_aggregate_temperature (
wsid text,
year int,
month int,
day int,
high double,
low double,
PRIMARY KEY ((wsid), year, month, day)
);
• Weather Station Id(wsid) is unique
• High and low temp for each day

Setup connection
def main(args: Array[String]): Unit = {
// the setMaster("local") lets us run & test the job right in our IDE
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").setMaster("local")
// "local" here is the master, meaning we don't explicitly have a spark master set up
val sc = new SparkContext("local", "weather", conf)
val connector = CassandraConnector(conf)
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")

Get data and aggregate
// Case class to store row data
case class daily_aggregate_temperature (wsid: String, year: Int, month: Int, day: Int, high:Double, low:Double)
// Create SparkSQL statement
val aggregationSql = "SELECT wsid, year, month, day, max(temperature) high, min(temperature) low " +
"FROM raw_weather_data " +
"WHERE month = 6 " +
"GROUP BY wsid, year, month, day;"
val srdd: SchemaRDD = cc.sql(aggregationSql);
val resultSet = srdd.map(row => (
new daily_aggregate_temperature(
row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3), row.getDouble(4), row.getDouble(5))))
.collect()

Store back into Cassandra
connector.withSessionDo(session => {
// Create a single prepared statement
val prepared = session.prepare(insertStatement)
val bound = prepared.bind
// Iterate over result set and bind variables
for (row <- resultSet) {
bound.setString("wsid", row.wsid)
bound.setInt("year", row.year)
bound.setInt("month", row.month)
bound.setInt("day", row.day)
bound.setDouble("high", row.high)
bound.setDouble("low", row.low)
// Insert new row in database
session.execute(bound)
}
})

Result
wsid | year | month | day | high | low
--------------+------+-------+-----+------+------
725300:94846 | 2012 | 9 | 30 | 18.9 | 10.6
725300:94846 | 2012 | 9 | 29 | 25.6 | 9.4
725300:94846 | 2012 | 9 | 28 | 19.4 | 11.7
725300:94846 | 2012 | 9 | 27 | 17.8 | 7.8
725300:94846 | 2012 | 9 | 26 | 22.2 | 13.3
725300:94846 | 2012 | 9 | 25 | 25 | 11.1
725300:94846 | 2012 | 9 | 24 | 21.1 | 4.4
725300:94846 | 2012 | 9 | 23 | 15.6 | 5
725300:94846 | 2012 | 9 | 22 | 15 | 7.2
725300:94846 | 2012 | 9 | 21 | 18.3 | 9.4
725300:94846 | 2012 | 9 | 20 | 21.7 | 11.7
725300:94846 | 2012 | 9 | 19 | 22.8 | 5.6
725300:94846 | 2012 | 9 | 18 | 17.2 | 9.4
725300:94846 | 2012 | 9 | 17 | 25 | 12.8
725300:94846 | 2012 | 9 | 16 | 25 | 10.6
725300:94846 | 2012 | 9 | 15 | 26.1 | 11.1
725300:94846 | 2012 | 9 | 14 | 23.9 | 11.1
725300:94846 | 2012 | 9 | 13 | 26.7 | 13.3
725300:94846 | 2012 | 9 | 12 | 29.4 | 17.2
725300:94846 | 2012 | 9 | 11 | 28.3 | 11.7
725300:94846 | 2012 | 9 | 10 | 23.9 | 12.2
725300:94846 | 2012 | 9 | 9 | 21.7 | 12.8
725300:94846 | 2012 | 9 | 8 | 22.2 | 12.8
725300:94846 | 2012 | 9 | 7 | 25.6 | 18.9
725300:94846 | 2012 | 9 | 6 | 30 | 20.6
725300:94846 | 2012 | 9 | 5 | 30 | 17.8
725300:94846 | 2012 | 9 | 4 | 32.2 | 21.7
725300:94846 | 2012 | 9 | 3 | 30.6 | 21.7
725300:94846 | 2012 | 9 | 2 | 27.2 | 21.7
725300:94846 | 2012 | 9 | 1 | 27.2 | 21.7
SELECT wsid, year, month, day, high, low
FROM daily_aggregate_temperature
WHERE wsid = '725300:94846'
AND year=2012 AND month=9 ;

What just happened?
• Data is read from raw_weather_data table
• Transformed
• Inserted into the daily_aggregate_temperature table
[Diagram: Table raw_weather_data -> Read data from table -> Transform -> Insert data into table -> Table daily_aggregate_temperature]
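The read, transform, insert flow can also be pictured without Spark or Cassandra. The sketch below uses plain Scala collections to compute the same daily high and low that the SparkSQL statement produces with max, min and GROUP BY. The `RawReading` fields are an assumption, since the slides do not show the raw_weather_data schema; `DailyRollupSketch` and `rollUp` are hypothetical names.

```scala
object DailyRollupSketch {
  // Assumed shape of a raw reading; the real raw_weather_data table may have more columns.
  case class RawReading(wsid: String, year: Int, month: Int, day: Int, hour: Int, temperature: Double)

  // Mirrors the columns of daily_aggregate_temperature.
  case class DailyAggregate(wsid: String, year: Int, month: Int, day: Int, high: Double, low: Double)

  // Group readings by station and calendar day, then keep the max and min temperature.
  def rollUp(readings: Seq[RawReading]): Seq[DailyAggregate] =
    readings
      .groupBy(r => (r.wsid, r.year, r.month, r.day))
      .map { case ((wsid, year, month, day), dayReadings) =>
        val temps = dayReadings.map(_.temperature)
        DailyAggregate(wsid, year, month, day, temps.max, temps.min)
      }
      .toSeq
      .sortBy(a => (a.wsid, a.year, a.month, a.day))
}
```

Feeding in the four hourly readings from the earlier temperature example (-5.6, -5.1, -4.9, -5.3 on 2005-12-01) gives a single aggregate with high -4.9 and low -5.6.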

Weather Station Stream Analysis
• Data processed in stream
• Data stored in Cassandra
Windsor, California
Today
Rainfall total: 1.2cm
High: 73.4F
Low: 51.4F

Spark Versus Spark Streaming
• Spark: zillions of bytes
• Spark Streaming: gigabytes per second

Spark Streaming
[Diagram: sources shown include Kinesis and S3]

DStream - Micro Batches
• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals
DStream: μBatch (ordinary RDD) | μBatch (ordinary RDD) | μBatch (ordinary RDD)
Processing of DStream = Processing of μBatches, RDDs
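The micro-batch idea can be sketched in plain Scala: cut a sequence of events into fixed time intervals and run the same deterministic batch function on each interval. This is a simplification, not Spark Streaming itself; it buckets by an event timestamp, whereas Spark Streaming forms batches from data as it arrives. `Event`, `microBatches`, `countWords`, and `process` are illustrative names.

```scala
object MicroBatchSketch {
  case class Event(timestampMs: Long, word: String)

  // Assign each event to the interval that contains its timestamp, keyed by interval start.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .toSeq
      .sortBy { case (intervalStart, _) => intervalStart }

  // An ordinary batch computation, applied unchanged to every micro-batch.
  def countWords(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.word).map { case (word, occurrences) => word -> occurrences.size }

  // Processing the stream = processing each micro-batch in time order.
  def process(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Int])] =
    microBatches(events, intervalMs).map { case (start, batch) => start -> countWords(batch) }
}
```

With a 5-second interval, events at 1000, 4000 and 4500 ms fall into the batch starting at 0, and an event at 6000 ms falls into the batch starting at 5000.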

Spark Streaming Reduce Example
val sc = new SparkContext(..)
val ssc = new StreamingContext(sc, Seconds(5))
val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2)
val transform = (cruft: String) =>
Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#"))
/** Note that Cassandra is doing the sorting for you here. */
stream.flatMap(_.getText.toLowerCase.split("""\s+"""))
.map(transform)
.countByValueAndWindow(Seconds(5), Seconds(5))
.transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time))})
.saveToCassandra(keyspace, suspicious, SomeColumns("suspicious", "count", "timestamp"))
Even Machine Learning!

Temperature High/Low Stream
[Diagram: Weather Stations, Receive API, Producer, Apache Kafka, Consumer, and three TemperatureActor instances]

You can do this at home!
https://github.com/killrweather/killrweather

Databricks & DataStax
Apache Spark is packaged as part of DataStax Enterprise Analytics 4.5
Databricks & DataStax have partnered for Apache Spark engineering and support
http://www.datastax.com/

Resources
• Spark Cassandra Connector https://github.com/datastax/spark-cassandra-connector
• Apache Cassandra http://cassandra.apache.org
• Apache Spark http://spark.apache.org
• Apache Kafka http://kafka.apache.org
• Akka http://akka.io

FREE tickets to our Annual Cassandra Summit Europe taking place in London in early December (3rd
and 4th). The 4th is a full conference day with free admission to all attendees and will feature
presentations by companies like ING, Credit Suisse, Target, UBS, The Noble Group as well as other top
Cassandra experts in the world.
There will be content for those entirely new to Cassandra all the way to the most seasoned Cassandra
veteran, spanning development, architecture, and operations as well as how to integrate Cassandra with
analytics and search technologies like Apache Spark and Apache Solr.
December 3rd is a paid training day. If you are interested in getting a discount on paid training, please
speak with Diego - dferreira@datastax.com

Thank you
Follow me on twitter for more updates
@PatrickMcFadin
