The New Analytics Toolbox
Going beyond Hadoop
whoami
Robbie Strickland
Software Dev Manager
linkedin.com/in/robbiestrickland
rostrickland@gmail.com
Hadoop works fine, right?
● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart
Landscape has changed ...
Lots of new tools available:
● Cloudera Impala - MPP queries for Hadoop data
● Apache Drill - MPP queries for Hadoop data
● Proprietary (Splunk, keen.io, etc.) - varies by vendor
● Spark / Spark SQL - generic in-memory analysis
● Shark - Hive queries on Spark, replaced by Spark SQL
Spark wins?
● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration
Spark vs. Hadoop
MapReduce: [six slides of MapReduce job code, not captured in the transcript]
Spark (angels sing!): [Spark code for the same task, not captured in the transcript]
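Since the original code slides don't survive the transcript, here is a stand-in: a minimal word-count job in Spark's Scala API (the word-count task, app name, and paths are assumptions, not the original slides' code). The MapReduce equivalent typically needs separate mapper, reducer, and driver classes plus job configuration.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount")    // hypothetical app name
    val sc   = new SparkContext(conf)

    sc.textFile("hdfs:///data/input.txt")                 // read lines (hypothetical path)
      .flatMap(_.split("\\s+"))                           // split into words
      .map(word => (word, 1))                             // pair each word with a count of 1
      .reduceByKey(_ + _)                                  // sum the counts per word
      .saveAsTextFile("hdfs:///data/wordcounts")           // write the results

    sc.stop()
  }
}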
What is Spark?
● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input/output format
What is Spark?
● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via SparkSQL
● Works out of the box on EMR
● Easily join datasets from disparate sources
Spark on Cassandra
● Direct integration via DataStax driver - cassandra-driver-spark on GitHub (sketch below)
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing
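A hedged sketch of reading and writing Cassandra data through the connector; the keyspace, table, and column names ("weather", "observations", etc.) are hypothetical, and the exact API surface varies by connector version.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._          // adds cassandraTable / saveToCassandra

val conf = new SparkConf()
  .setAppName("cassandra-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val hot = sc.cassandraTable("weather", "observations")    // RDD[CassandraRow]
  .where("date = ?", "2014-07-01")                        // server-side filter (CQL WHERE)
  .filter(_.getDouble("temperature") > 90.0)              // client-side transformation

hot.map(row => (row.getString("station_id"), row.getDouble("temperature")))
   .saveToCassandra("weather", "hot_observations")        // write the results back to Cassandra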
Cassandra with Hadoop
[Diagram: three nodes, each co-locating Cassandra, a DataNode, and a TaskTracker, alongside the NameNode, Secondary NameNode, and JobTracker]
Cassandra with Spark (using HDFS)
[Diagram: three nodes, each co-locating Cassandra, a Spark Worker, and a DataNode, alongside the Spark Master, NameNode, and Secondary NameNode]
A Typical Spark Application
● SparkContext + SparkConf
● Data source to RDD[T]
● Transformations/Actions
● Saving/Displaying
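A minimal skeleton following these four steps, assuming a hypothetical CSV of "name,age" records; paths and names are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

object TypicalApp {
  def main(args: Array[String]): Unit = {
    // 1. SparkContext + SparkConf
    val conf = new SparkConf().setAppName("typical-app")
    val sc   = new SparkContext(conf)

    // 2. Data source to RDD[T]
    val people = sc.textFile("hdfs:///data/people.csv")
                   .map(_.split(","))
                   .map(f => (f(0), f(1).trim.toInt))             // RDD[(String, Int)]

    // 3. Transformations / Actions
    val adults     = people.filter { case (_, age) => age > 17 }  // transformation (lazy)
    val adultCount = adults.count()                               // action (triggers execution)

    // 4. Saving / Displaying
    println(s"adults: $adultCount")
    adults.saveAsTextFile("hdfs:///data/adults")

    sc.stop()
  }
}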
Resilient Distributed Dataset (RDD)
● A distributed collection of items
● Transformations
○ Similar to those found in Scala collections
○ Lazily processed
● Can recalculate from any point of failure
RDD Transformations vs Actions
Transformations: produce new RDDs
Actions: require the materialization of the records to produce a value
RDD Transformations/Actions
Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...
Actions: collect:Array[T], count, fold, reduce ...
Resilient Distributed Dataset (RDD)
val numOfAdults = persons.filter(_.age > 17).count()
(filter is the transformation; count is the action)
Example
Demo
Spark SQL Example
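The Spark SQL example slide showed code that isn't captured here; what follows is a hedged reconstruction in roughly the Spark 1.x API of the time (the Person case class, table name, and file path are assumptions).

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)           // hypothetical schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                    // implicit RDD[Person] -> SchemaRDD (Spark 1.x)

val people = sc.textFile("hdfs:///data/people.txt")
               .map(_.split(","))
               .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")                   // expose the RDD as a SQL table

val adults = sqlContext.sql("SELECT name FROM people WHERE age > 17")
adults.map(_.getString(0)).collect().foreach(println)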
Spark Streaming
● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
○ Queues
○ Twitter
○ Flume
○ etc.
Spark Streaming
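The streaming slide itself showed code; a minimal sketch of the pattern, assuming a hypothetical TCP socket source on localhost:9999 and a 10-second batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-example")
val ssc  = new StreamingContext(conf, Seconds(10))     // a new batch of RDDs every 10 seconds

val lines  = ssc.socketTextStream("localhost", 9999)   // hypothetical stream source
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                   // same ops as “normal” RDDs

counts.print()          // display each batch's counts
ssc.start()
ssc.awaitTermination()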
The Output
Real World Task Distribution
Transformations and Actions:
Similar to Scala, but your choice of transformations needs to be made with task distribution and memory in mind.
Real World Task Distribution
How do we do that?
● Partition the data appropriately for the number of cores (at least 2 x number of cores; see the sketch below)
● Filter early and often (queries, filters, distinct ...)
● Use pipelines
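A short sketch of the first two points, assuming a hypothetical 8-core cluster and log file path:

// ~2 x number of cores: ask for 16 partitions on a hypothetical 8-core cluster
val events = sc.textFile("hdfs:///data/events.log", 16)

// Filter early and often: shrink the data before the expensive work
val errors = events.filter(_.contains("ERROR"))   // cheap, pipelined with the read
                   .map(_.split("\t"))            // only surviving records get parsed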
Real World Task Distribution
How do we do that?
● Partial aggregation (sketch below)
● Create an algorithm that is as simple/efficient as it can be to appropriately answer the question
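Partial aggregation in practice: reduceByKey combines values within each partition before shuffling, while groupByKey ships every record across the network first. A sketch with a hypothetical RDD of (word, 1) pairs:

val pairs = sc.textFile("hdfs:///data/docs")       // hypothetical input
              .flatMap(_.split("\\s+"))
              .map(word => (word, 1))

// groupByKey moves every (word, 1) pair across the shuffle, then sums
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first (partial aggregation),
// so far less data crosses the shuffle boundary
val countsFast = pairs.reduceByKey(_ + _)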
Real World Task Distribution
Some Common Costly Transformations:
● sorting
● groupByKey
● reduceByKey
● ...
Thank you!
Robbie Strickland
rostrickland@gmail.com
https://github.com/rstrickland/sparkdemo
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But there are serious advantages to many of the new tools, and this presentation will give an analysis of the current state–including pros and cons as well as what’s needed to bootstrap and operate the various options.

About Robbie Strickland, Software Development Manager at The Weather Channel

Robbie works for The Weather Channel’s digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.
