1. Spark Streaming and Friends
Chris Fregly
Global Big Data Conference
Sept 2014

2. Who am I?
• Former Netflix’er: netflix.github.io
• Spark Contributor: github.com/apache/spark
• Founder: fluxcapacitor.com
• Author: effectivespark.com, sparkinaction.com

3. Spark Streaming-Kinesis Jira

4. Quick Poll
• Hadoop, Hive, Pig?
• Spark, Spark Streaming?
• EMR, Redshift?
• Flume, Kafka, Kinesis, Storm?
• Lambda Architecture?
• Bloom Filters, HyperLogLog?

5. “Streaming”
• Kinesis Streaming
• Video Streaming
• Piping
• Big Data Streaming

6. Agenda
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations

7. Spark Overview (1/2)
• Berkeley AMPLab, ~2009
• Part of the Berkeley Data Analytics Stack (BDAS, aka “badass”)

8. Spark Overview (2/2)
• Based on the Microsoft Dryad paper, ~2007
• Written in Scala
• Supports Java, Python, SQL, and R
• In-memory when possible, but not required
• Improved efficiency over MapReduce
  – 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
  – File formats, SerDes, and UDFs

9. Spark Use Cases
• Ad hoc, exploratory, interactive analytics
• Real-time + batch analytics
  – Lambda Architecture
• Real-time machine learning
• Real-time graph processing
• Approximate, time-bound queries

10. Explosion of Specialized Systems

11. Unified Spark Libraries
• Spark SQL (data processing)
• Spark Streaming (streaming)
• MLlib (machine learning)
• GraphX (graph processing)
• BlinkDB (approximate queries)
• Statistics (correlations, sampling, etc.)
• Others
  – Shark (Hive on Spark)
  – Spork (Pig on Spark)

12. Unified Benefits
• Advancements in the higher-level libraries are pushed down into the core, and vice versa
• Examples
  – Spark Streaming: GC and memory-management improvements
  – Spark GraphX: IndexedRDD for random, hashed access within a partition instead of scanning the entire partition

13. Spark API

14. Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Represents partitions across the cluster nodes
• Enables parallel processing on data sets
• Partitions can be in-memory or on-disk
• Immutable, recomputable, fault tolerant
• Carries the transformation lineage of the data set

15. RDD Lineage
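The lineage idea can be sketched without Spark: an RDD-like object records its transformations lazily and replays them from the source data only when an action runs, which is exactly how a lost partition gets recomputed. Everything below (including the `MiniRDD` name) is illustrative, not Spark's actual implementation.

```python
# Minimal sketch of lineage-based, lazy evaluation (plain Python, no Spark).
class MiniRDD:
    def __init__(self, source, lineage=()):
        self.source = source      # base data the lineage replays against
        self.lineage = lineage    # recorded transformations, applied lazily

    def map(self, f):
        # Transformations do no work; they only extend the lineage.
        return MiniRDD(self.source, self.lineage + (("map", f),))

    def filter(self, p):
        return MiniRDD(self.source, self.lineage + (("filter", p),))

    def collect(self):
        # Action: replay the lineage against the source. On a lost
        # partition, Spark re-runs exactly this kind of replay.
        data = list(self.source)
        for op, fn in self.lineage:
            data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
        return data

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18]
```

Because the object is immutable and the lineage is a pure description, recomputing a partition after a failure yields the same result as the first run.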
16. Spark API Overview
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python, SQL, and R (mostly)
• Unified API across all libraries
• Operations = Transformations + Actions

17. Transformations

18. Actions

19. Spark Execution Model

20. Spark Execution Model Overview
• Parallel, distributed
• DAG-based
• Lazy evaluation allows optimizations
  – Reduce disk I/O
  – Reduce shuffle I/O
  – Parallel execution
  – Task pipelining
• Data locality and rack awareness
• Worker-node fault tolerance using per-partition RDD lineage graphs

21. Execution Optimizations

22. Spark Cluster Deployment

23. Spark Cluster Deployment

24. Master High Availability
• Multiple Master nodes
• ZooKeeper maintains the current Master
• Existing applications and workers are notified of a new Master election
• New applications and workers must explicitly specify the current Master
• Alternatives (not recommended)
  – Local filesystem
  – NFS mount

25. Spark Streaming

26. Spark Streaming Overview
• Low latency, high throughput, fault tolerance (mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis, sockets, files, etc.
• Graceful shutdown with in-flight message draining
• Uses Spark Core, its DAG execution model, and its fault tolerance

27. Spark Streaming Use Cases
• ETL on streaming data during ingestion
• Anomaly, malware, and fraud detection
• Operational dashboards
• Lambda Architecture
  – Unified batch and streaming
  – e.g. different machine-learning models for different time frames
• Predictive maintenance
  – Sensors
• NLP analysis
  – Twitter firehose

28. Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs
• Fault tolerance using DStream/RDD lineage
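Discretization can be sketched in plain Python (no Spark): events are grouped into fixed-interval micro-batches, and each batch is then processed with ordinary batch-style operations. The function and variable names here are hypothetical.

```python
# Group a stream of (timestamp, value) events into fixed-interval micro-batches,
# mimicking how a DStream discretizes a continuous stream into RDDs.
def micro_batches(events, batch_interval):
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // batch_interval, []).append(value)
    return [batch for _, batch in sorted(batches.items())]

events = [(0.5, "a"), (1.2, "b"), (1.9, "b"), (2.4, "a"), (3.7, "c")]
for batch in micro_batches(events, 2):   # 2-second batch interval
    counts = {}
    for word in batch:                   # per-batch "word count"
        counts[word] = counts.get(word, 0) + 1
    print(counts)
# {'a': 1, 'b': 2}
# {'a': 1, 'c': 1}
```

The per-batch loop body is ordinary batch code, which is the point: the streaming API can reuse the core RDD operations because each micro-batch is just an RDD.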
29. Spark Streaming API

30. Spark Streaming API Overview
• Rich, expressive API similar to the core API
• Operations
  – Transformations
  – Actions
• Window and State Operations
• Requires checkpointing to snip long-running DStream lineage
• A DStream can be registered as a Spark SQL table for querying!

31. DStream Transformations

32. DStream Actions

33. Window and State DStream Operations

34. DStream Example

35. Spark Streaming Cluster Deployment

36. Spark Streaming Cluster Deployment

37. Scaling Receivers

38. Scaling Processors

39. Spark Streaming + Kinesis
40. Spark Streaming + Kinesis Architecture
[Figure: Kinesis Producers write to a Kinesis Stream of Shards 1-3; the Spark Streaming + Kinesis Client Library (KCL) application reads it via Kinesis Receiver DStream 1 (KCL Record Processor Threads 1 and 2) and Kinesis Receiver DStream 2 (KCL Record Processor Thread 1)]

41. Throughput and Pricing
[Figure: Kinesis Producer → Kinesis Stream (Shard 1) → Spark Kinesis Receiver → Spark Streaming application, with < 10 second end-to-end delay]
• Ingest: 1 MB/sec per shard, 1000 PUTs/sec, 50 KB per PUT
• Egress: 2 MB/sec per shard
• Shard cost: $0.36 per day per shard
• PUT cost: $2.50 per day per shard
• Network transfer cost: free within Region!
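A back-of-the-envelope helper using the per-shard figures quoted on this slide. It sizes by ingest rate only (a real sizing would also consider the egress and PUT-rate limits), and the PUT cost assumes the shard runs at its full PUT rate; the 4.5 MB/sec stream is a made-up example.

```python
import math

# Per-shard figures quoted on the slide above.
SHARD_INGEST_MB_S = 1.0    # 1 MB/sec ingest capacity per shard
SHARD_COST_DAY = 0.36      # $ per shard per day
PUT_COST_DAY = 2.50        # $ per shard per day at the full PUT rate

def shards_and_daily_cost(ingest_mb_per_sec):
    """Shards needed for a given ingest rate, and worst-case daily cost."""
    shards = math.ceil(ingest_mb_per_sec / SHARD_INGEST_MB_S)
    return shards, shards * (SHARD_COST_DAY + PUT_COST_DAY)

shards, cost = shards_and_daily_cost(4.5)   # hypothetical 4.5 MB/sec stream
print(shards, round(cost, 2))               # 5 shards at about $14.30/day
```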
42. Demo!
Kinesis Streaming
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…
• Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
• Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java

43. Spark Streaming Fault Tolerance

44. Fault Tolerance
• Points of failure
  – Receiver
  – Driver
  – Worker/Processor
• Solutions
  – Data replication
  – Secondary/backup nodes
  – Checkpoints

45. Streaming Receiver Failure
• Use a backup receiver
• Use multiple receivers pulling from multiple shards
  – Use a checkpoint-enabled, sharded streaming source (e.g. Kafka and Kinesis)
• Data is replicated to 2 nodes immediately upon ingestion
• Possible data loss; possible at-least-once delivery
• Use buffered sources (e.g. Kafka and Kinesis)

46. Streaming Driver Failure
• Use a backup Driver
  – Use DStream metadata checkpoint info to recover
• Single point of failure that interrupts stream processing
• The Streaming Driver is a long-running Spark application
  – Schedules the long-running stream receivers
• State and window RDD checkpoints help avoid data loss (mostly)

47. Stream Worker/Processor Failure
• No problem!
• DStream RDD partitions are recalculated from lineage

48. Types of Checkpoints
Spark:
1. Spark checkpointing of StreamingContext DStreams and metadata
2. Lineage of state and window DStream operations
Kinesis:
3. The Kinesis Client Library (KCL) checkpoints the current position within each shard
  – Checkpoint info is stored in DynamoDB per Kinesis application, keyed by shard
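Point 3 above can be sketched in plain Python: persist the last-processed sequence number per (application, shard), so a restarted worker resumes where its predecessor left off. The dict stands in for the DynamoDB table, and all names are hypothetical, not the KCL API.

```python
# KCL-style checkpointing pattern: remember the last-processed sequence
# number per (application, shard) in a durable store.
checkpoint_table = {}   # (app_name, shard_id) -> last processed sequence number

def checkpoint(app, shard, seq):
    checkpoint_table[(app, shard)] = seq

def resume_position(app, shard):
    # A brand-new worker starts from the beginning of the shard (here: 0).
    return checkpoint_table.get((app, shard), 0)

records = list(range(100))                 # sequence numbers 0..99
for seq in records[:60]:                   # process 60 records, then "crash"
    checkpoint("wordcount-app", "shard-1", seq)

print(resume_position("wordcount-app", "shard-1"))  # 59: resume after this one
```

Because the checkpoint is written after processing, a crash between processing and checkpointing replays the last record, which is the at-least-once behavior mentioned on the receiver-failure slide.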
49. Spark Streaming Monitoring and Tuning

50. Monitoring
• Monitor driver, receiver, worker nodes, and streams
• Alert upon failure or unusually high latency
• Spark Web UI
  – Streaming tab
• Ganglia, CloudWatch
• StreamingListener callback

51. Spark Web UI

52. Tuning
• Batch interval
  – High: reduces the overhead of submitting new tasks for each batch
  – Low: keeps latencies low
  – Sweet spot: DStream job time (scheduling + processing) is steady and less than the batch interval
• Checkpoint interval
  – High: reduces checkpointing overhead
  – Low: reduces the amount of data lost on failure
  – Recommendation: 5-10x the sliding-window interval
• Use DStream.repartition() to increase the parallelism of DStream processing jobs across the cluster
• Set spark.streaming.unpersist=true to let the streaming framework decide when to unpersist
• Use the CMS GC for consistent processing times
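The batch-interval "sweet spot" above is just an inequality: each micro-batch's scheduling-plus-processing time must stay below the batch interval, or queued work grows without bound. A toy illustration (all numbers hypothetical, and it assumes a constant job time):

```python
def is_stable(batch_interval_s, scheduling_s, processing_s):
    """A streaming job keeps up only if each batch finishes within the interval."""
    return scheduling_s + processing_s < batch_interval_s

def backlog_after(n_batches, batch_interval_s, job_time_s):
    """Seconds of queued work after n batches when each job takes job_time_s."""
    return max(0.0, n_batches * (job_time_s - batch_interval_s))

print(is_stable(2.0, 0.1, 1.5))                 # True: jobs keep up
print(is_stable(2.0, 0.1, 2.2))                 # False: backlog builds
print(round(backlog_after(100, 2.0, 2.3), 1))   # 30.0 seconds behind
```

This is why the slide recommends watching for a steady job time in the Streaming tab: a job time that creeps above the interval shows up as monotonically growing scheduling delay.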
53. Lambda Architecture

54. Lambda Architecture Overview
• Batch Layer
  – Immutable; batch read; append-only write
  – Source of truth
  – e.g. HDFS
• Speed Layer
  – Mutable; random read/write
  – Most complex
  – Recent data only
  – e.g. Cassandra
• Serving Layer
  – Immutable; random read; batch write
  – e.g. ElephantDB
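A query in this architecture merges the batch layer's precomputed view with the speed layer's view of recent, not-yet-batched data. A minimal plain-Python sketch (the counts and names are made up):

```python
# Lambda Architecture query: batch view (recomputed from the source of
# truth) plus speed view (recent events only, awaiting the next batch run).
batch_view = {"spark": 1000, "kinesis": 400}
speed_view = {"spark": 7, "streaming": 3}

def query(term):
    return batch_view.get(term, 0) + speed_view.get(term, 0)

print(query("spark"))      # 1007
print(query("streaming"))  # 3
```

When the next batch recomputation finishes, the speed layer's entries for the covered time range are discarded, which is what keeps the speed layer small and the batch layer authoritative.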
55. Spark + AWS + Lambda

56. Spark + AWS + Lambda + ML

57. Approximations

58. Approximation Overview
• Required for scaling
• Speeds up analysis of large datasets
• Reduces the size of the working dataset
• Data is messy; data collection is messy
• Exact isn't always necessary
• “Approximate is the new Exact”

59. Some Approximation Methods
• Approximate time-bound queries
  – BlinkDB
• Bernoulli and Poisson sampling
  – RDD.sample(), RDD.takeSample()
• HyperLogLog
  – PairRDD.countApproxDistinctByKey()
• Count-min Sketch
  – Spark Streaming and Twitter Algebird
• Bloom Filters
  – Everywhere!
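The Bernoulli sampling idea behind RDD.sample() can be sketched in plain Python: keep each element independently with probability p, then scale aggregates back up by 1/p to estimate the exact answer. The helper name is hypothetical.

```python
import random

def bernoulli_sample(data, p, seed=42):
    """Keep each element independently with probability p."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < p]

data = list(range(100_000))
p = 0.01
sample = bernoulli_sample(data, p)     # roughly 1,000 of 100,000 elements
estimated_total = sum(sample) / p      # unbiased estimate of sum(data)
print(len(sample), f"{estimated_total:.4g}", sum(data))
```

The estimate is unbiased and its relative error shrinks as the sample grows; trading a bounded error for a 100x smaller working set is exactly the "Approximate is the new Exact" bargain from the previous slide.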
60. Approximations In Action
Figure: Memory Savings with Approximation Techniques
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

61. Spark Statistics Library
• Correlations
  – Dependence between 2 random variables
  – Pearson, Spearman
• Hypothesis Testing
  – Measure of statistical significance
  – Chi-squared test
• Stratified Sampling
  – Sample separately from different sub-populations
  – Bernoulli and Poisson sampling
  – With and without replacement
• Random data generator
  – Uniform, standard normal, and Poisson distributions
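Pearson correlation, the first statistic listed, is the covariance of the two variables divided by the product of their standard deviations. A from-scratch sketch of that formula (Spark's statistics library computes the same quantity over RDDs):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (stddev(x) * stddev(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0 (perfectly linear)
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0 (perfectly inverse)
```

Values near 0 indicate no linear dependence; Spearman, the other method listed, applies the same formula to the ranks of the values instead of the values themselves.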
62. Summary
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations

Oct 2014 MEAP Early Access: http://sparkinaction.com
