Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016

Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any similar analysis, you need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and then store the results in Apache Cassandra.


  1. Analyzing Time-Series Data with Apache Spark and Cassandra. Andrew Psaltis, HDF / IoT Product Solution Architect, @itmdata. StampedeCon 2016.
  2. Have you ever wanted to:
     • Build models over measurements coming in every second from sensors across the world?
     • Dig into intra-day trading prices of millions of financial instruments?
     • Compare hourly page view statistics across every page on Wikipedia?
  3. You need to do it over a large sequence of measurements over time.
  4. A problem perfect for Cassandra and Spark.
  5. Time-series data consists of sequences of measurements, each occurring at a point in time.
  6. Example: Weather Station
     • Weather station collects data
     • Cassandra stores in sequence
     • Application reads in sequence
  7. Query use cases. Get weather data given:
     • Weather Station ID
     • Weather Station ID and Time
     • Weather Station ID and Range of Time
  8. Aggregation use cases. Get temperature stats given:
     • Weather Station ID
     • Weather Station ID and Time
     • Weather Station ID and Range of Time
  9. Cassandra Overview
  10. Cassandra architecture is:
     • Shared nothing [1]
     • Masterless peer-to-peer
     • Shard free
     • Based on Amazon Dynamo and Google BigTable
     [1] 1986 paper "The Case for Shared Nothing": http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
  11. Row
  12. Partition
  13. Table
  14. Keyspace: Table 1, Table 2
  15. Partition Key, Clustering Columns, Order Override
  16. Primary key relationship
     Partition Key: 10010:99999
     Clustering Columns:
       2016:07:28:12   -5.6
       2016:07:28:11   -5.1
       2016:07:28:10   -4.9
       2016:07:28:09   -5.3
  17. Tokens
     • Consistent hash between -2^63 and 2^63 - 1 (the Murmur3Partitioner token range)
     • Each node owns a range of those values
     • A range runs from one node's token to the next node's token
     • Virtual nodes break these ranges down further
  18. Replication (DC 1, RF: 1)
     Node       Primary
     10.0.0.1   00-25
     10.0.0.2   26-50
     10.0.0.3   51-75
     10.0.0.4   76-100
  19. Replication (DC 1, RF: 2)
     Node       Primary   Replica
     10.0.0.1   00-25     76-100
     10.0.0.2   26-50     00-25
     10.0.0.3   51-75     26-50
     10.0.0.4   76-100    51-75
  20. Replication (DC 1, RF: 3)
     Node       Primary   Replica   Replica
     10.0.0.1   00-25     76-100    51-75
     10.0.0.2   26-50     00-25     76-100
     10.0.0.3   51-75     26-50     00-25
     10.0.0.4   76-100    51-75     26-50
  21. Replication (DC 1, RF: 3). Client: write to partition 15.
     10.0.0.1   00-25     76-100    51-75
     10.0.0.2   26-50     00-25     76-100
     10.0.0.3   51-75     26-50     00-25
     10.0.0.4   76-100    51-75     26-50
  22. Multi-Datacenter. Client: write to partition 15. DC 1 (RF: 3) and DC 2 (RF: 3) each hold the same ring:
     10.0.0.1   00-25     76-100    51-75
     10.0.0.2   26-50     00-25     76-100
     10.0.0.3   51-75     26-50     00-25
     10.0.0.4   76-100    51-75     26-50
  23. Query use cases. Get weather data given:
     • Weather Station ID
     • Weather Station ID and Time
     • Weather Station ID and Range of Time
  24. Spark Overview
  25. What is Spark used for?
     • Fast and general-purpose engine for large-scale data processing
     • Provides a framework that supports in-memory cluster computing
     • Designed for iterative computations and interactive data mining
  26. Resilient Distributed Dataset (RDD)
     • Created through transformations on data (map, filter, ...) or other RDDs
     • Immutable
     • Partitioned
     • Reusable
  27. RDD Partitioning. The number of RDD partitions controls how many parallel tasks can run against the data stored in the RDD. Hint: in general, make it at least as large as the number of CPU cores in your cluster.
  28. RDD Operations
     Transformations (similar to the Scala collections API)
     • Produce new RDDs
     • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
     Actions
     • Require materialization of the records to generate a value
     • collect: Array[T], count, fold, reduce, ...
  29. Data Locality
     • Spark asks an RDD for a list of its partitions (splits)
     • Each split consists of one or more token ranges
     • For every partition, Spark gets a list of preferred nodes to process on from the RDD
     • Spark creates a task and sends it to one of those nodes for execution
  30. What is Spark Streaming?
     • Provides efficient, fault-tolerant, stateful stream processing
     • Provides a simple API for implementing complex algorithms
     • Integrates with Spark's batch and interactive processing
     • Integrates with other Spark extensions
  31. Spark Streaming Overview
  32. Discretized Streams (DStreams)
     • The basic abstraction provided by Spark Streaming
     • Continuous series of RDDs
  33. Spark on Cassandra
  34. Spark on Cassandra
     • Server-side filters (where clauses)
     • Cross-table operations (JOIN, UNION, etc.)
     • Data locality-aware (speed)
     • Data transformation, aggregation, etc.
     • Natural time-series integration
  35. Spark Cassandra Connector
     • Loads data from Cassandra to Spark
     • Writes data from Spark to Cassandra
     • Implicit type conversions and object mapping
     • Implemented in Scala (offers a Java API)
     • Open source
     • Exposes Cassandra tables as Spark RDDs and Spark DStreams
  36. Spark Cassandra Connector
  37. Spark Cassandra Example
  38. Locating a Row
  39. Cassandra RDDs use the token range to create node-local Spark partitions.
  40. The Spark executor uses the Java driver to pull rows from the local Cassandra instance.
  41. Transactional and Analytics rings, each with the same layout:
     10.0.0.1   00-25
     10.0.0.2   26-50
     10.0.0.3   51-75
     10.0.0.4   76-100
  42. Batch Weather Station Analysis
  43. Weather Station Analysis
     • Weather station collects data
     • Cassandra stores in sequence
     • Spark rolls up data into new tables
  44. Setup Connection
  45. Get data and aggregate
  46. Store back into Cassandra
  47. Aggregation use cases. Get temperature stats given:
     • Weather Station ID
     • Weather Station ID and Time
     • Weather Station ID and Range of Time
  48. Weather Station Stream Analysis
     • Weather station collects data
     • Data processed in stream
     • Cassandra stores in sequence
  49. Weather Station Stream Analysis: Counter
  50. To explore at home: https://github.com/killrweather/killrweather
  51. Thank You
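
The replica placement shown on the Replication slides (18 to 21) can be illustrated with a small Python sketch. It is a toy model, not Cassandra code: it assumes the simplified 0 to 100 token space and the four nodes from the slides, treats each node's primary range as ending at the listed upper bound, and places additional replicas on the next nodes clockwise around the ring. The names `RING` and `replicas_for` are hypothetical.

```python
from bisect import bisect_left

# Toy ring from the Replication slides: (upper bound of primary range, node).
# The 0-100 token space is the slides' simplification, not a real token range.
RING = [(25, "10.0.0.1"), (50, "10.0.0.2"), (75, "10.0.0.3"), (100, "10.0.0.4")]


def replicas_for(token, rf, ring=RING):
    """Return the rf nodes holding `token`: the primary owner first,
    then the next nodes clockwise around the ring."""
    if not 0 <= token <= ring[-1][0]:
        raise ValueError("token is outside the toy ring")
    if not 1 <= rf <= len(ring):
        raise ValueError("rf must be between 1 and the number of nodes")
    upper_bounds = [bound for bound, _ in ring]
    # Index of the first range whose upper bound is >= token.
    owner = bisect_left(upper_bounds, token)
    return [ring[(owner + i) % len(ring)][1] for i in range(rf)]


# The "write to partition 15" case from the slides, with RF: 3.
print(replicas_for(15, rf=3))
```

With RF: 3, token 15 falls in the 00-25 range, so the write goes to the three nodes that list 00-25 in the RF: 3 table.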

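
The weather station data model and the aggregation use cases (slides 7, 8, and 16) can be sketched in plain Python. This is an in-memory stand-in under stated assumptions, not the Cassandra or Spark API: `WeatherStore`, `read_range`, and `temperature_stats` are hypothetical names, the partition key is the station ID, rows within a partition are returned newest first, and the sample values are the four readings from slide 16 with the hour digits rejoined.

```python
from collections import defaultdict
from statistics import mean


class WeatherStore:
    """Toy table: one partition per station ID, rows keyed by time."""

    def __init__(self):
        # station_id -> {time: temperature}; a repeated time overwrites the row.
        self._partitions = defaultdict(dict)

    def insert(self, station_id, time, temperature):
        self._partitions[station_id][time] = temperature

    def read_range(self, station_id, start, end):
        """Return (time, temperature) rows with start <= time <= end, newest first."""
        rows = self._partitions.get(station_id, {})
        selected = [(t, v) for t, v in rows.items() if start <= t <= end]
        return sorted(selected, reverse=True)


def temperature_stats(rows):
    """Roll a list of (time, temperature) rows up into min, max, and mean."""
    temps = [v for _, v in rows]
    if not temps:
        return None
    return {"min": min(temps), "max": max(temps), "mean": mean(temps)}


# Zero-padded time strings sort lexicographically in chronological order.
store = WeatherStore()
for time, temp in [("2016:07:28:09", -5.3), ("2016:07:28:10", -4.9),
                   ("2016:07:28:11", -5.1), ("2016:07:28:12", -5.6)]:
    store.insert("10010:99999", time, temp)

rows = store.read_range("10010:99999", "2016:07:28:10", "2016:07:28:12")
print(rows)
print(temperature_stats(rows))
```

Keeping all readings for one station in a single partition, ordered by time, is what makes the "station ID and range of time" query a contiguous read; the rollup step then only needs the rows that read returns.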