Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17

Slides from Tathagata Das's talk at the Spark Meetup, entitled "Deep Dive with Spark Streaming", given on June 17, 2013 in Sunnyvale, California, at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.

  1. Deep dive into Spark Streaming
     Tathagata Das (TD)
     Matei Zaharia, Haoyuan Li, Timothy Hunter, Patrick Wendell and many others
     UC BERKELEY
  2. What is Spark Streaming?
     - Extends Spark for doing large scale stream processing
     - Scales to 100s of nodes and achieves second scale latencies
     - Efficient and fault-tolerant stateful stream processing
     - Integrates with Spark’s batch and interactive processing
     - Provides a simple batch-like API for implementing complex algorithms
  3. Discretized Stream Processing
     Run a streaming computation as a series of very small, deterministic batch jobs
     - Chop up the live stream into batches of X seconds
     - Spark treats each batch of data as RDDs and processes them using RDD operations
     - Finally, the processed results of the RDD operations are returned in batches
     [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
  4. Discretized Stream Processing
     Run a streaming computation as a series of very small, deterministic batch jobs
     - Batch sizes as low as ½ second, latency ~ 1 second
     - Potential for combining batch processing and streaming processing in the same system
     [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
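
The idea of chopping a live stream into fixed-size batches can be sketched without Spark at all. The following is a minimal illustration in plain Scala, not Spark Streaming's implementation; the `Event` type, the `discretize` function and the millisecond timestamps are assumptions made for the example.

```scala
object DiscretizeSketch {
  // An event is a (timestamp in milliseconds, payload) pair; both are illustrative.
  type Event = (Long, String)

  // Group events into consecutive batches of batchMs milliseconds.
  // Each batch is keyed by the start time of its interval; empty intervals are omitted.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy { case (ts, _) => (ts / batchMs) * batchMs }
      .toSeq
      .sortBy(_._1)
      .map { case (start, evs) => (start, evs.sortBy(_._1).map(_._2)) }
}
```

Each resulting batch plays the role that one RDD plays in the slides: a small, immutable dataset that a deterministic batch job can process on its own.
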
  5. Example 1 – Get hashtags from Twitter
         val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data
     [Diagram: Twitter Streaming API → tweets DStream (batch @ t, batch @ t+1, batch @ t+2); each batch is stored in memory as an RDD (immutable, distributed dataset)]
  6. Example 1 – Get hashtags from Twitter
         val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
         val hashTags = tweets.flatMap (status => getTags(status))
     transformation: modify data in one DStream to create another DStream
     [Diagram: tweets DStream → flatMap (applied to batch @ t, batch @ t+1, batch @ t+2) → hashTags DStream, e.g. [#cat, #dog, …]; hashTags is a new DStream, with new RDDs created for every batch]
  7. Example 1 – Get hashtags from Twitter
         val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
         val hashTags = tweets.flatMap (status => getTags(status))
         hashTags.saveAsHadoopFiles("hdfs://...")
     output operation: to push data to external storage
     [Diagram: tweets DStream → flatMap → hashTags DStream → save (for batch @ t, batch @ t+1, batch @ t+2); every batch saved to HDFS]
  8. Example 2 – Count the hashtags over last 1 min
         val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
         val hashTags = tweets.flatMap (status => getTags(status))
         val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
     window(Minutes(1), Seconds(1)) is the sliding window operation; its arguments are the window length and the sliding interval
     [Diagram: DStream of data with a sliding window, annotated with window length and sliding interval]
  9. Example 2 – Count the hashtags over last 1 min
         val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
     [Diagram: hashTags (t-1, t, t+1, t+2, t+3) → sliding window → countByValue → tagCounts; count over all the data in the window]
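
What the windowed count computes can be modelled on ordinary Scala collections. This is only a sketch of the semantics, assuming the window length and sliding interval are given as a number of batches rather than as durations; `windowedCounts` is a hypothetical helper, not part of the Spark Streaming API.

```scala
object WindowCountSketch {
  // batches(i) holds the hashtags that arrived in batch interval i.
  // windowLen and slide are measured in batches (slide must be at least 1).
  // One result is produced every `slide` batches, counting each distinct
  // value over the last `windowLen` batches.
  def windowedCounts(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slide: Int): Seq[Map[String, Int]] =
    (slide - 1 until batches.length by slide).map { end =>
      val start = math.max(0, end - windowLen + 1)
      batches.slice(start, end + 1).flatten
        .groupBy(identity)
        .map { case (tag, occurrences) => (tag, occurrences.size) }
    }
}
```

With a window of two batches sliding by one batch, every batch contributes to two consecutive results, which mirrors the overlap between windows in the diagram.
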
  10. Key concepts
     - DStream – sequence of RDDs representing a stream of data
       - Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
     - Transformations – modify data from one DStream to another
       - Standard RDD operations – map, countByValue, reduceByKey, join, …
       - Stateful operations – window, countByValueAndWindow, …
     - Output Operations – send data to external entity
       - saveAsHadoopFiles – saves to HDFS
       - foreach – do anything with each batch of results
  11. Arbitrary Stateful Computations
     - Maintain arbitrary state, track sessions
       - Maintain per-user mood as state, and update it with his/her tweets
         moods = tweets.updateStateByKey(tweet => updateMood(tweet))
         updateMood(newTweets, lastMood) => newMood
     [Diagram: tweets (t-1, t, t+1, t+2, t+3) feeding moods, with each moods batch also depending on the previous one]
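
The per-key state update can be illustrated with a small model in plain Scala. It is a sketch under simplifying assumptions: state lives in an in-memory `Map`, the update function returns the new state directly, and the "mood" is just a running count of tweets containing ":)". The names `updateState` and this particular `updateMood` are hypothetical; only the shape `updateMood(newTweets, lastMood) => newMood` comes from the slide.

```scala
object StateSketch {
  // One step of a keyed state update: for every key seen in this batch,
  // call update(newValues, previousState) and store the result.
  // Keys with no new values keep their previous state.
  def updateState[K, V, S](
      state: Map[K, S],
      batch: Seq[(K, V)],
      update: (Seq[V], Option[S]) => S): Map[K, S] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    grouped.foldLeft(state) { case (acc, (k, values)) =>
      acc.updated(k, update(values, acc.get(k)))
    }
  }

  // Hypothetical mood function: running count of tweets that contain ":)".
  def updateMood(newTweets: Seq[String], lastMood: Option[Int]): Int =
    lastMood.getOrElse(0) + newTweets.count(_.contains(":)"))
}
```

Applying `updateState` once per batch, feeding each result into the next call, gives the chain of `moods` batches in the diagram, where every state depends on the state before it.
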
  12. Combine Batch and Stream Processing
     - Do arbitrary Spark RDD computation within DStream
       - Join incoming tweets with a spam file to filter out bad tweets
         tweets.transform(tweetsRDD => {
           tweetsRDD.join(spamHDFSFile).filter(...)
         })
  13. Fault-tolerance
     - RDDs remember the operations that created them
     - Batches of input data are replicated in memory for fault-tolerance
     - Data lost due to worker failure can be recomputed from replicated input data
     - Therefore, all transformed data is fault-tolerant
     [Diagram: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD (lost partitions recomputed on other workers)]
  14. Agenda
     - Overview
     - DStream Abstraction
     - System Model
     - Persistence / Caching
     - RDD Checkpointing
     - Performance Tuning
  15. Discretized Stream (DStream)
     A sequence of RDDs representing a stream of data
     What does it take to define a DStream?
  16. DStream Interface
     The DStream interface primarily defines how to generate an RDD in each batch interval
     - List of dependent (parent) DStreams
     - Slide Interval, the interval at which it will compute RDDs
     - Function to compute RDD at a time t
  17. Example: Mapped DStream
     - Dependencies: Single parent DStream
     - Slide Interval: Same as the parent DStream
     - Compute function for time t: Create new RDD by applying map function on parent DStream’s RDD of time t
  18. Example: Windowed DStream
     Window operation gathers together data over a sliding window
     - Dependencies: Single parent DStream
     - Slide Interval: Window sliding interval
     - Compute function for time t: Apply union over all the RDDs of parent DStream between times t and (t – window length)
     [Diagram: parent DStream with a window over it, annotated with window length and sliding interval]
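
The three parts of the interface (parents, slide interval, compute function) can be mirrored in a toy model. The sketch below is an illustration in plain Scala, with a `Seq` standing in for an RDD; `ToyDStream`, `FixedInput`, `Mapped` and `Windowed` are hypothetical names and do not reproduce Spark Streaming's actual classes or signatures.

```scala
object DStreamSketch {
  // A toy stand-in for an RDD: just an in-memory sequence.
  type ToyRDD[T] = Seq[T]

  // The three pieces named on the slides: parents, slide interval, compute function.
  trait ToyDStream[T] {
    def dependencies: Seq[ToyDStream[_]]
    def slideInterval: Long               // in milliseconds
    def compute(time: Long): ToyRDD[T]    // the "RDD" for the batch at `time`
  }

  // Input stream backed by a fixed table of batches, keyed by batch time.
  class FixedInput[T](batches: Map[Long, Seq[T]], batchMs: Long) extends ToyDStream[T] {
    def dependencies: Seq[ToyDStream[_]] = Seq.empty
    def slideInterval: Long = batchMs
    def compute(time: Long): ToyRDD[T] = batches.getOrElse(time, Seq.empty)
  }

  // Same slide interval as the parent; applies f to the parent's batch at time t.
  class Mapped[T, U](parent: ToyDStream[T], f: T => U) extends ToyDStream[U] {
    def dependencies: Seq[ToyDStream[_]] = Seq(parent)
    def slideInterval: Long = parent.slideInterval
    def compute(time: Long): ToyRDD[U] = parent.compute(time).map(f)
  }

  // Union of the parent's batches in the range (time - windowMs, time].
  class Windowed[T](parent: ToyDStream[T], windowMs: Long, slideMs: Long) extends ToyDStream[T] {
    def dependencies: Seq[ToyDStream[_]] = Seq(parent)
    def slideInterval: Long = slideMs
    def compute(time: Long): ToyRDD[T] =
      (time - windowMs + parent.slideInterval to time by parent.slideInterval)
        .flatMap(t => parent.compute(t))
  }
}
```

Because every stream only knows its parents and how to compute one batch, asking an output stream for time t pulls the whole chain of computations for that batch, which is the property the later DStream graph slides rely on.
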
  19. Example: Network Input DStream
     Base class of all input DStreams that receive data from the network
     - Dependencies: None
     - Slide Interval: Batch duration in streaming context
     - Compute function for time t: Create a BlockRDD with all the blocks of data received in the last batch interval
     - Associated with a Network Receiver object
  20. Network Receiver
     Responsible for receiving data and pushing it into Spark’s data management layer (Block Manager)
     Base class for all receivers - Kafka, Flume, etc.
     Simple Interface:
     - What to do on starting the receiver
       - Helper object blockGenerator to push data into Spark
     - What to do on stopping the receiver
  21. Example: Socket Receiver
     - On start:
       Connect to remote TCP server
       While socket is connected,
         Receive bytes and deserialize them into Java objects
         Add the objects to blockGenerator
     - On stop:
       Disconnect socket
  22. DStream Graph
     Spark Streaming program 1:
         t = ssc.twitterStream("…").map(…)
         t.foreach(…)
     Spark Streaming program 2:
         t1 = ssc.twitterStream("…")
         t2 = ssc.twitterStream("…")
         t = t1.union(t2).map(…)
         t.saveAsHadoopFiles(…)
         t.map(…).foreach(…)
         t.filter(…).foreach(…)
     [Diagram: the DStream graph for each program, with nodes labeled T (Twitter Input DStream), M (Mapped DStream), FE (Foreach DStream, a dummy DStream signifying an output operation), U and F]
  23. DStream Graph → RDD Graphs → Spark jobs
     - Every interval, RDD graph is computed from DStream graph
     - For each output operation, a Spark action is created
     - For each action, a Spark job is created to compute it
     [Diagram: DStream graph (nodes T, U, M, F, FE) → RDD graph (nodes B, U, M, F, A), where B = Block RDDs with data received in last batch interval and A = Spark actions; 3 Spark jobs]
  24. Agenda
     - Overview
     - DStream Abstraction
     - System Model
     - Persistence / Caching
     - RDD Checkpointing
     - Performance Tuning
  25. Components
     - Network Input Tracker – Keeps track of the data received by each network receiver and maps them to the corresponding input DStreams
     - Job Scheduler – Periodically queries the DStream graph to generate Spark jobs from received data, and hands them to Job Manager for execution
     - Job Manager – Maintains a job queue and executes the jobs in Spark
     Your program:
         ssc = new StreamingContext
         t = ssc.twitterStream("…")
         t.filter(…).foreach(…)
     [Diagram: Spark Client (your program, DStream graph, Spark Context, Network Input Tracker, Job Manager, Job Scheduler, RDD graph, Scheduler, Block manager, Shuffle tracker); Spark Worker (Block manager, Task threads); Cluster Manager]
  26. Execution Model – Receiving Data
     [Diagram: on the Spark Streaming + Spark Driver side, StreamingContext.start(), Network Input Tracker and BlockManagerMaster; on the Spark Workers, a Receiver and BlockManagers. Data recvd by the Receiver → blocks pushed to the BlockManager → blocks replicated to another BlockManager]
  27. Execution Model – Job Scheduling
     [Diagram: on the Spark Streaming + Spark Driver side, Network Input Tracker, DStream Graph, Job Scheduler, Job Manager with Job Queue, and Spark’s Schedulers; on the Spark Workers, Receiver and BlockManagers. Edges labeled Block IDs, RDDs and Jobs; jobs executed on worker nodes]
  28. Job Scheduling
     - Each output operation used generates a job
       - More jobs → more time taken to process batches → higher batch duration
     - Job Manager decides how many concurrent Spark jobs to run
       - Default is 1, can be set using Java property spark.streaming.concurrentJobs
       - If you have multiple output operations, you can try increasing this property to reduce batch processing times and so reduce batch duration
  29. Agenda
     - Overview
     - DStream Abstraction
     - System Model
     - Persistence / Caching
     - RDD Checkpointing
     - Performance Tuning
  30. DStream Persistence
     - If a DStream is set to persist at a storage level, then all RDDs generated by it are set to the same storage level
     - When to persist?
       - If there are multiple transformations / actions on a DStream
       - If RDDs in a DStream are going to be used multiple times
     - Window-based DStreams are automatically persisted in memory
  31. DStream Persistence
     - Default storage level of DStreams is StorageLevel.MEMORY_ONLY_SER (i.e. in memory as serialized bytes)
       - Except for input DStreams which have StorageLevel.MEMORY_AND_DISK_SER_2
       - Note the difference from RDD’s default level (no serialization)
       - Serialization reduces random pauses due to GC, providing more consistent job processing times
  32. Agenda
     - Overview
     - DStream Abstraction
     - System Model
     - Persistence / Caching
     - RDD Checkpointing
     - Performance Tuning
  33. What is RDD checkpointing?
     Saving RDD to HDFS to prevent RDD graph from growing too large
     - Done internally in Spark, transparent to the user program
     - Done lazily, saved to HDFS the first time it is computed
         red_rdd.checkpoint()
     [Diagram: contents of red_rdd saved to an HDFS file, transparent to all child RDDs]
  34. Why is RDD checkpointing necessary?
     Stateful DStream operators can have infinite lineages
     Large lineages lead to …
     - Large closure of the RDD object → large task sizes → high task launch times
     - High recovery times under failure
     [Diagram: data (t-1, t, t+1, t+2, t+3) feeding states, with each state depending on the previous one]
  35. Why is RDD checkpointing necessary?
     Stateful DStream operators can have infinite lineages
     Periodic RDD checkpointing solves this
     Useful for iterative Spark programs as well
     [Diagram: data (t-1, t, t+1, t+2, t+3) feeding states, with some states periodically saved to HDFS]
  36. RDD Checkpointing
     - Periodicity of checkpoint determines a tradeoff
       - Checkpoint too frequent: HDFS writing will slow things down
       - Checkpoint too infrequent: Task launch times may increase
       - Default setting checkpoints at most once in 10 seconds
       - Try to checkpoint once in about 10 batches
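
The two numbers on the checkpointing slide can be turned into a small calculation. This is a sketch only: reading "at most once in 10 seconds" as "the smallest multiple of the batch duration that is at least 10 seconds" is an assumption, and both function names are hypothetical.

```scala
object CheckpointIntervalSketch {
  // Smallest multiple of the batch duration that is at least 10 seconds:
  // one possible reading of "at most once in 10 seconds".
  def defaultLikeIntervalMs(batchMs: Long): Long =
    batchMs * math.max(1L, math.ceil(10000.0 / batchMs).toLong)

  // The rule of thumb from the slide: roughly once every 10 batches.
  def ruleOfThumbIntervalMs(batchMs: Long): Long = 10L * batchMs
}
```

For a 1 second batch both functions give 10 seconds; for longer batches the rule of thumb checkpoints less often than the 10 second floor would, which reduces HDFS writes at the cost of longer lineages between checkpoints.
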
  37. Agenda
     - Overview
     - DStream Abstraction
     - System Model
     - Persistence / Caching
     - RDD Checkpointing
     - Performance Tuning
  38. Performance Tuning
     Step 1: Achieve a stable configuration that can sustain the streaming workload
     Step 2: Optimize for lower latency
  39. Step 1: Achieving Stable Configuration
     How to identify whether a configuration is stable?
     - Look for the following messages in the log
         Total delay: 0.01500 s for job 12 of time 1371512674000 …
     - If the total delay is continuously increasing, then unstable, as the system is unable to process data as fast as it is receiving it!
     - If the total delay stays roughly constant and around 2x the configured batch duration, then stable
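
Checking for a growing total delay can be automated with a few lines of log parsing. The sketch below assumes log lines in exactly the format quoted on the slide; the "strictly increasing over the whole sample" test is a crude heuristic chosen for illustration, not something Spark Streaming provides.

```scala
object StabilitySketch {
  // Matches the log line format quoted on the slide.
  private val TotalDelay = """Total delay: ([0-9.]+) s for job (\d+) of time (\d+)""".r

  // Extract total delays (in seconds) from log lines, in the order they appear.
  // Lines that do not match the pattern are ignored.
  def totalDelays(logLines: Seq[String]): Seq[Double] =
    logLines.flatMap { line =>
      TotalDelay.findFirstMatchIn(line).map(_.group(1).toDouble)
    }

  // Crude heuristic (an assumption): flag the run as unstable when the delay
  // strictly increases across the whole sample of at least three jobs.
  def looksUnstable(delays: Seq[Double]): Boolean =
    delays.length >= 3 && delays.zip(delays.tail).forall { case (a, b) => b > a }
}
```

A real check would probably look at a longer history and tolerate noise, for example by comparing averages over successive groups of batches.
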
  40. Step 1: Achieving Stable Configuration
     How to figure out a good stable configuration?
     - Start with a low data rate, small number of nodes, reasonably large batch duration (5 – 10 seconds)
     - Increase the data rate, number of nodes, etc.
     - Find the bottleneck in the job processing
       - Jobs are divided into stages
       - Find which stage is taking the most amount of time
  41. Step 1: Achieving Stable Configuration
     How to figure out a good stable configuration?
     - If the first map stage on raw data is taking most time, then try …
       - Enabling delayed scheduling by setting property spark.locality.wait
       - Splitting your data source into multiple sub streams
       - Repartitioning the raw data into many partitions as first step
     - If any of the subsequent stages are taking a lot of time, try …
       - Try increasing the level of parallelism (i.e., increase number of reducers)
       - Add more processors to the system
  42. Step 2: Optimize for Lower Latency
     - Reduce batch size and find a stable configuration again
       - Increase levels of parallelism, etc.
     - Optimize serialization overheads
       - Consider using Kryo serialization instead of the default Java serialization for both data and tasks
       - For data, set property spark.serializer=spark.KryoSerializer
       - For tasks, set spark.closure.serializer=spark.KryoSerializer
     - Use Spark stand-alone mode rather than Mesos
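
The two serializer settings can be applied programmatically. This sketch assumes they are read as Java system properties, in the same way the job scheduling slide describes `spark.streaming.concurrentJobs`; the property names and values are copied from the slide and belong to the 2013-era release, so later Spark versions may use different class names and configuration mechanisms. `LatencyTuningSketch` and `applyAll` are hypothetical names.

```scala
object LatencyTuningSketch {
  // Property names and values as given on the slide (2013-era Spark).
  val kryoProperties: Map[String, String] = Map(
    "spark.serializer" -> "spark.KryoSerializer",
    "spark.closure.serializer" -> "spark.KryoSerializer"
  )

  // Apply the settings as Java system properties (an assumption about how
  // they are read); presumably this has to happen before the context is created.
  def applyAll(): Unit =
    kryoProperties.foreach { case (key, value) => System.setProperty(key, value) }
}
```

Keeping the settings in one map makes it easy to log exactly which tuning options a run used when comparing batch processing times.
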
  43. Step 2: Optimize for Lower Latency
     - Using concurrent mark sweep GC (-XX:+UseConcMarkSweepGC) is recommended
       - Reduces throughput a little, but also reduces large GC pauses and may allow lower batch sizes by making processing time more consistent
     - Try disabling serialization in DStream/RDD persistence levels
       - Increases memory consumption and randomness of GC related pauses, but may reduce latency by further reducing serialization overheads
     - For a full list of guidelines for performance tuning
       - Spark Tuning Guide
       - Spark Streaming Tuning Guide
  44. Small code base
     - 5000 LOC for Scala API (+ 1500 LOC for Java API)
       - Most DStream code mirrors the RDD code
     - Easy to take a look and contribute
  45. Future (Possible) Directions
     - Better master fault-tolerance
     - Better performance for complex queries
       - Better performance for stateful processing is a low hanging fruit
     - Dashboard for Spark Streaming
       - Continuous graphs of processing times, end-to-end latencies
       - Drill down for analyzing processing times of stages for finding bottlenecks
     - Python API for Spark Streaming
  46. CONTRIBUTIONS ARE WELCOME