Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any similar analysis, you will need to work with large sequences of measurements over time. And what better way to do that than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.
Analyzing Time-Series Data with Apache Spark and Cassandra
What is Time-Series Data?
Time-series data consists of sequences of measurements, each occurring at a point in time.
A variety of terms are used to describe time-series data, and many of them refer to conflicting or overlapping concepts. In the interest of clarity, in spark-ts, we stick to a particular vocabulary:
A time series is a sequence of floating-point values, each linked to a timestamp.
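To make the vocabulary concrete, here is a minimal Scala sketch; the case class and sample data are illustrative assumptions, not the actual spark-ts types:

```scala
import java.time.Instant

// A time series: a sequence of floating-point values, each linked
// to a timestamp (illustrative type only, not the spark-ts API).
case class TimeSeries(key: String, points: Seq[(Instant, Double)])

val cpuLoad = TimeSeries(
  key = "host-42.cpu",
  points = Seq(
    (Instant.parse("2016-05-26T10:00:00Z"), 0.71),
    (Instant.parse("2016-05-26T10:00:01Z"), 0.74),
    (Instant.parse("2016-05-26T10:00:02Z"), 0.69)
  )
)
```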
Consistent hashing maps keys into the token range -2^63 to 2^63-1
• Each node owns a range of those token values
• A node's token is the beginning of its range, which extends to the next node's token value
• Virtual nodes break these ranges down further
Each partition key is hashed to a single token in that range
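To make the token-range idea concrete, here is a toy consistent-hash ring in Scala; the node names and tokens are made up, and hashCode stands in for Cassandra's Murmur3 hash:

```scala
// Toy ring: each node owns the range starting at its token and
// extending up to (but not including) the next node's token.
val ring: Seq[(Long, String)] = Seq(
  (Long.MinValue,         "node-a"),
  (-3000000000000000000L, "node-b"),
  (0L,                    "node-c"),
  (3000000000000000000L,  "node-d")
).sortBy(_._1)

def nodeFor(partitionKey: String): String = {
  val token = partitionKey.hashCode.toLong // stand-in for Murmur3
  // The owner is the node with the greatest token <= the key's token;
  // node-a's token is Long.MinValue, so a match always exists.
  ring.reverse.find(_._1 <= token).get._2
}

println(nodeFor("sensor-1234")) // the node owning this key's range
```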
Since Spark Streaming really just creates a series of discrete RDDs, we will see how we have the opportunity to combine streaming with the rest of the Spark stack.
If you ever wanted to…
• Build models over measurements coming in every second from sensors across the world?
• Dig into intra-day trading prices of millions of financial instruments?
• Compare hourly page view statistics across every page on Wikipedia?
You need to do it over a large sequence of measurements
Cassandra architecture is based on Amazon Dynamo and Google BigTable
Michael Stonebraker's 1986 paper "The Case for Shared Nothing" - http://db.cs.berkeley.edu/papers/hpts85-
What is Spark used for?
Fast and general-purpose engine for large-scale data processing
Provides a framework that supports in-memory computation
Designed for iterative computations and interactive analytics
Resilient Distributed Dataset (RDD)
• Created through transformations on data (map, filter, …) or other RDDs
• The number of RDD partitions controls how many parallel tasks can be run against the data stored in the RDD
Hint: in general, make the partition count at least as large as the number of CPU cores in your cluster (see the sketch below)
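As a concrete sketch, assuming a SparkContext configured for a local Cassandra node (the host and the partition count of 8 are arbitrary choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed configuration; spark.cassandra.connection.host is only
// needed for the Cassandra examples further below.
val conf = new SparkConf()
  .setAppName("time-series-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Ask for 8 partitions explicitly: each partition can be processed
// by a separate task, so this bounds the available parallelism.
val readings = sc.parallelize(1 to 1000000, numSlices = 8)
println(readings.partitions.length) // 8
```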
Transformations - similar to the Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, …
Actions - require materialization of the records to generate a result
• collect: Array[T], count, fold, reduce, …
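A quick sketch of the distinction, reusing the SparkContext sc from above (the numbers are arbitrary): transformations are lazy and only describe new RDDs, while actions force computation and return a result to the driver:

```scala
// Transformations are lazy: nothing executes yet.
val values  = sc.parallelize(Seq(1, 3, -5, 7))
val cleaned = values.filter(_ >= 0) // new RDD, not yet computed
val scaled  = cleaned.map(_ * 100)  // new RDD, not yet computed

// Actions materialize the records and return results to the driver.
println(scaled.count())                  // 3
println(scaled.collect().mkString(", ")) // 100, 300, 700
```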
Spark asks an RDD for a list of its partitions (splits)
Each split consists of one or more token ranges
For every partition:
• Spark gets a list of preferred nodes to process it on
• Spark creates a task and sends it to one of those nodes for execution
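A sketch of inspecting this from the driver, reusing sc; partitions and preferredLocations are standard RDD methods, while the keyspace and table names are assumptions:

```scala
import com.datastax.spark.connector._

// Assumed table; requires spark.cassandra.connection.host to be set.
val rdd = sc.cassandraTable("sensors", "raw_measurements")

// One Spark partition per group of Cassandra token ranges.
println(s"partitions: ${rdd.partitions.length}")

// The nodes Spark prefers for each partition's task: the replicas
// that own the underlying token ranges.
rdd.partitions.take(3).foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}
```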
What is Spark Streaming?
• Provides efficient, fault-tolerant, stateful stream processing
• Provides a simple API for implementing complex algorithms
• Integrates with Spark's batch and interactive processing
• Integrates with other Spark extensions
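A minimal Spark Streaming sketch, reusing sc; the socket source on localhost:9999 and the 1-second batch interval are illustrative choices:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 1-second batch of the stream becomes one RDD in the DStream.
val ssc = new StreamingContext(sc, Seconds(1))

// Assumed source: newline-separated numeric readings on a socket.
val lines    = ssc.socketTextStream("localhost", 9999)
val readings = lines.map(_.toDouble)

// Reuse ordinary RDD operations on every batch.
readings.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println(s"batch mean: ${rdd.mean()}")
}

ssc.start()
ssc.awaitTermination()
```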
Spark on Cassandra
• Server-side filters (WHERE clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration
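For example, a sketch of a server-side filter; the keyspace, table, and column names are assumptions. The where clause is pushed down and executed by Cassandra itself, so only matching rows ever reach Spark:

```scala
import com.datastax.spark.connector._

// Assumed schema: sensors.raw_measurements(sensor_id, day, ts, value)
val oneDay = sc.cassandraTable("sensors", "raw_measurements")
  .where("sensor_id = ? AND day = ?", "sensor-1234", "2016-05-26")

println(oneDay.count())
```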
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra tables as Spark RDDs
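Putting it together, a minimal sketch of the round trip; the keyspace, tables, and column names are assumptions:

```scala
import com.datastax.spark.connector._

// Read: assumed table sensors.raw_measurements(sensor_id, ts, value).
val raw = sc.cassandraTable("sensors", "raw_measurements")

// Aggregate per sensor with ordinary Spark transformations.
val maxBySensor = raw
  .map(row => (row.getString("sensor_id"), row.getDouble("value")))
  .reduceByKey((a, b) => math.max(a, b))

// Write: save the results back to an assumed results table.
maxBySensor.saveToCassandra(
  "sensors", "daily_max",
  SomeColumns("sensor_id", "max_value"))
```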