Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop

2,061 views

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But there are serious advantages to many of the new tools, and this presentation will give an analysis of the current state, including pros and cons as well as what’s needed to bootstrap and operate the various options.

About Robbie Strickland, Software Development Manager at The Weather Channel

Robbie works for The Weather Channel’s digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.



Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - Going Beyond Hadoop Presentation Transcript

  • 1. The New Analytics Toolbox Going beyond Hadoop
  • 2. whoami Robbie Strickland Software Dev Manager linkedin.com/in/robbiestrickland rostrickland@gmail.com
  • 3-10. Hadoop works fine, right?
    ● It’s 11 years old (!!)
    ● It’s relatively slow and inefficient
    ● … because everything gets written to disk
    ● Writing MapReduce code sucks
    ● Lots of code for even simple tasks
    ● Boilerplate
    ● Configuration isn’t for the faint of heart
  • 11-15. Landscape has changed ... Lots of new tools available:
    ● Cloudera Impala - MPP queries for Hadoop data
    ● Apache Drill - MPP queries for Hadoop data
    ● Proprietary (Splunk, keen.io, etc.) - varies by vendor
    ● Spark / Spark SQL - generic in-memory analysis
    ● Shark - Hive queries on Spark, replaced by Spark SQL
  • 16-22. Spark wins?
    ● Can be a Hadoop replacement
    ● Works with any existing InputFormat
    ● Doesn’t require Hadoop
    ● Supports batch & streaming analysis
    ● Functional programming model
    ● Direct Cassandra integration
  • 23. Spark vs. Hadoop
  • 24-29. MapReduce: [six slides of MapReduce code; screenshots not captured in the transcript]
  • 30. Spark (angels sing!): [the equivalent Spark code; screenshot not captured in the transcript]
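The code screenshots on slides 24-30 did not survive the transcript, but the contrast they draw is between pages of MapReduce boilerplate and a few lines of Spark. As a sketch of the Spark side: since Spark's RDD API deliberately mirrors Scala's collection API, the classic word-count chain can be shown on a plain local collection (the input lines are invented; on an RDD you would start from `sc.textFile(...)` and collapse the `groupBy`/`map` pair into a single `reduceByKey(_ + _)`):

```scala
// Word count in the functional style the slides contrast with MapReduce.
val lines = Seq("spark beats hadoop", "spark is fast")

val counts = lines
  .flatMap(_.split("\\s+"))                    // split each line into words
  .groupBy(identity)                           // group identical words together
  .map { case (word, ws) => word -> ws.size }  // count each group

println(counts("spark")) // 2
```

The entire job is one expression; the MapReduce version of the same logic needs a mapper class, a reducer class, and a driver with job configuration.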
  • 31-37. What is Spark?
    ● In-memory cluster computing
    ● 10-100x faster than MapReduce
    ● Collection API over large datasets
    ● Scala, Python, Java
    ● Stream processing
    ● Supports any existing Hadoop input / output format
  • 38-42. What is Spark? (continued)
    ● Native graph processing via GraphX
    ● Native machine learning via MLlib
    ● SQL queries via Spark SQL
    ● Works out of the box on EMR
    ● Easily join datasets from disparate sources
  • 43-48. Spark on Cassandra
    ● Direct integration via DataStax driver - cassandra-driver-spark on GitHub
    ● No job config cruft
    ● Supports server-side filters (where clauses)
    ● Data locality aware
    ● Uses HDFS, CassandraFS, or another distributed FS for checkpointing
  • 49. Cassandra with Hadoop [diagram: three nodes, each running Cassandra + DataNode + TaskTracker, plus a NameNode, a Secondary NameNode, and a JobTracker]
  • 50. Cassandra with Spark (using HDFS) [diagram: three nodes, each running Cassandra + Spark Worker + DataNode, plus a Spark Master, a NameNode, and a Secondary NameNode]
  • 51-55. A Typical Spark Application
    ● SparkContext + SparkConf
    ● Data source to RDD[T]
    ● Transformations/Actions
    ● Saving/Displaying
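The four steps above can be sketched as a minimal driver program. This is an illustration of the shape only, not a runnable example: it assumes spark-core and the DataStax cassandra-driver-spark connector on the classpath, a reachable cluster, and a hypothetical `ks.users` table.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds sc.cassandraTable

// 1. SparkContext + SparkConf
val conf = new SparkConf()
  .setAppName("typical-app")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node
val sc = new SparkContext(conf)

// 2. Data source to RDD[T] (keyspace and table are hypothetical)
val rows = sc.cassandraTable("ks", "users")

// 3. Transformations/Actions
val adults = rows.filter(_.getInt("age") > 17).count()

// 4. Saving/Displaying
println(s"adults: $adults")
```

Note there is no InputFormat, job, or serialization configuration here; the connector handles the plumbing, which is the "no job config cruft" point from the previous section.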
  • 56-59. Resilient Distributed Dataset (RDD)
    ● A distributed collection of items
    ● Transformations
      ○ Similar to those found in Scala collections
      ○ Lazily processed
    ● Can recalculate from any point of failure
  • 60. RDD Transformations vs Actions
    Transformations: produce new RDDs
    Actions: require the materialization of the records to produce a value
  • 61. RDD Transformations/Actions
    Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...
    Actions: collect:Array[T], count, fold, reduce ...
  • 62. Resilient Distributed Dataset (RDD)
    val numOfAdults = persons.filter(_.age > 17).count()
    (filter is the transformation; count is the action)
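Because the RDD API mirrors the Scala collection API, slide 62's one-liner runs almost unchanged against a plain local collection. `Person` and the sample data below are invented for illustration; the only difference is that on a collection the action is `size` rather than `count()`:

```scala
case class Person(name: String, age: Int)

// Made-up sample data; on Spark this would be an RDD[Person]
val persons = List(Person("Ada", 36), Person("Linus", 12), Person("Grace", 45))

// filter is the (lazy, on an RDD) transformation; counting is the action
val numOfAdults = persons.filter(_.age > 17).size

println(numOfAdults) // 2
```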
  • 63. Example
  • 64. Demo
  • 65. Spark SQL Example
  • 66-69. Spark Streaming
    ● Creates RDDs from stream source on a defined interval
    ● Same ops as “normal” RDDs
    ● Supports a variety of sources
      ○ Queues
      ○ Twitter
      ○ Flume
      ○ etc.
  • 70. Spark Streaming [slide content (code/diagram) not captured in the transcript]
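The micro-batch model described above (RDDs created from a stream source on a defined interval) can be mimicked with plain Scala: timestamped events are sliced into fixed-interval batches, and each batch is then processed like a "normal" collection/RDD. The events, timestamps, and interval here are invented for illustration:

```scala
// (timestampMillis, payload) events, as a stream source might deliver them
val events = Seq((0L, "a"), (400L, "b"), (1100L, "c"), (1900L, "d"), (2300L, "e"))

val batchIntervalMs = 1000L

// Slice events into 1-second micro-batches, the way Spark Streaming
// turns a DStream into one RDD per interval.
val batches = events
  .groupBy { case (ts, _) => ts / batchIntervalMs } // batch index per event
  .toSeq.sortBy(_._1)                               // order batches by time
  .map { case (_, evs) => evs.map(_._2) }           // keep just the payloads

println(batches)
```

Each element of `batches` then gets the same transformations/actions as any other RDD, which is the point of slide 68.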
  • 71. The Output
  • 72. Real World Task Distribution
  • 73. Real World Task Distribution Transformations and actions look similar to Scala collections, but your choice of transformations must be made with task distribution and memory in mind
  • 74. Real World Task Distribution How do we do that? ● Partition the data appropriately for number of cores (at least 2 x number of cores) ● Filter early and often (Queries, Filters, Distinct...) ● Use pipelines
  • 75. Real World Task Distribution How do we do that? ● Partial aggregation ● Keep the algorithm as simple and efficient as it can be while still answering the question
  • 76. Real World Task Distribution
  • 77. Real World Task Distribution Some Common Costly Transformations: ● sorting ● groupByKey ● reduceByKey ● ...
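The partial-aggregation advice above can be seen with plain Scala standing in for a cluster: each "partition" is reduced locally first, and only the small per-partition results are combined. This is the difference between `reduceByKey` (map-side combine, cheap) and `groupByKey` (ships every record, costly, hence its place in the list above). The partitions and values are invented for illustration:

```scala
// Three "partitions" of (word, 1) pairs, as they might sit on three workers
val partitions = Seq(
  Seq("a" -> 1, "b" -> 1, "a" -> 1),
  Seq("b" -> 1, "b" -> 1),
  Seq("a" -> 1, "c" -> 1)
)

// Partial aggregation: combine within each partition first, so only one
// record per key per partition would cross the network
val partials = partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _))

// Then merge the small per-partition maps into the final totals
val totals = partials.reduce { (a, b) =>
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap
}

println(totals) // counts per key: a -> 3, b -> 3, c -> 1
```

A `groupByKey`-style version would instead move all seven records before counting; on real data volumes that difference dominates job cost.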
  • 78. Thank you! Robbie Strickland rostrickland@gmail.com https://github.com/rstrickland/sparkdemo