Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17

Deep dive into Spark Streaming
Tathagata Das (TD)
Matei Zaharia, Haoyuan Li, Timothy Hunter,
Patrick Wendell and many others
UC BERKELEY
What is Spark Streaming?
 Extends Spark for doing large scale stream processing
 Scales to 100s of nodes and achieves second scale latencies
 Efficient and fault-tolerant stateful stream processing
 Integrates with Spark’s batch and interactive processing
 Provides a simple batch-like API for implementing complex
algorithms
Discretized Stream Processing
Run a streaming computation as a series of very small,
deterministic batch jobs
3
Spark
Spark
Streaming
batches of X
seconds
live data stream
processed
results
 Chop up the live stream into batches of X
seconds
 Spark treats each batch of data as RDDs
and processes them using RDD operations
 Finally, the processed results of the RDD
operations are returned in batches
Discretized Stream Processing
Run a streaming computation as a series of very small,
deterministic batch jobs
4
 Batch sizes as low as ½ second, latency ~ 1
second
 Potential for combining batch processing
and streaming processing in the same
system
Spark
Spark
Streaming
batches of X
seconds
live data stream
processed
results
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter
password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
batch @ t+1batch @ t batch @ t+2
tweets DStream
stored in memory as an RDD
(immutable, distributed dataset)
Twitter Streaming API
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter
password>)
val hashTags = tweets.flatMap (status => getTags(status))
flatMap flatMap flatMap
…
transformation: modify data in one DStream to create
another DStream
new DStream
new RDDs created
for every batch
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags Dstream
*#cat, #dog, … +
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter
password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags DStream
every batch
saved to HDFS
DStream of data
Example 2 – Count the hashtags over last 1 min
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
sliding window
operation
window length sliding interval
window length
sliding interval
tagCounts
Example 2 – Count the hashtags over last 1 min
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
hashTags
t-1 t t+1 t+2 t+3
sliding window
countByValue
count over all
the data in the
window
Key concepts
 DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
 Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduceByKey, join, …
- Stateful operations – window, countByValueAndWindow, …
 Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
Arbitrary Stateful Computations
 Maintain arbitrary state, track sessions
- Maintain per-user mood as state, and update it with his/her tweets
moods = tweets.updateStateByKey(tweet => updateMood(tweet))
updateMood(newTweets, lastMood) => newMood
tweets
t-1 t t+1 t+2 t+3
moods
Combine Batch and Stream Processing
 Do arbitrary Spark RDD computation within DStream
- Join incoming tweets with a spam file to filter out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
Fault-tolerance
 RDDs remember the operations
that created them
 Batches of input data are
replicated in memory for fault-
tolerance
 Data lost due to worker failure,
can be recomputed from
replicated input data
 Therefore, all transformed data is
fault-tolerant
input data
replicated
in memory
flatMap
lost partitions
recomputed on
other workers
tweets
RDD
hashTags
RDD
Agenda
 Overview
 DStream Abstraction
 System Model
 Persistence / Caching
 RDD Checkpointing
 Performance Tuning
Discretized Stream (DStream)
A sequence of RDDs representing
a stream of data
What does it take to define a DStream?
DStream Interface
The DStream interface primarily defines how to generate an
RDD in each batch interval
 List of dependent (parent) DStreams
 Slide Interval, the interval at which it will compute RDDs
 Function to compute RDD at a time t
Example: Mapped DStream
 Dependencies: Single parent DStream
 Slide Interval: Same as the parent DStream
 Compute function for time t: Create new RDD by applying map
function on parent DStream’s RDD of time t
Window operation gather together data over a sliding window
 Dependencies: Single parent DStream
 Slide Interval: Window sliding interval
 Compute function for time t: Apply union over all the RDDs of
parent DStream between times t and (t – window length)
Example: Windowed DStream
Parent DStream
window length
sliding interval
Example: Network Input DStream
Base class of all input DStreams that receive data from the network
 Dependencies: None
 Slide Interval: Batch duration in streaming context
 Compute function for time t: Create a BlockRDD with all the
blocks of data received in the last batch interval
 Associated with a Network Receiver object
Network Receiver
Responsible for receiving data and pushing it into Spark’s data
management layer (Block Manager)
Base class for all receivers - Kafka, Flume, etc.
Simple Interface:
 What to do on starting the receiver
- Helper object blockGenerator to push data into Spark
 What to do on stopping the receiver
Example: Socket Receiver
 On start:
Connect to remote TCP server
While socket is connected,
Receiving bytes and deserialize
Deserialize them into Java objects
Add the objects to blockGenerator
 On stop:
Disconnect socket
DStream Graph
t = ssc.twitterStream(“…”)
.map(…)
t.foreach(…)
t1 = ssc.twitterStream(“…”)
t2 = ssc.twitterStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)
T
M
F
E
Twitter Input DStream
Mapped DStream
Foreach DStream
T
U
M
T
M FF
E
F
E
F
E
DStream GraphSpark Streaming program
Dummy DStream signifying
an output operation
DStream Graph  RDD Graphs  Spark jobs
 Every interval, RDD graph is computed from DStream graph
 For each output operation, a Spark action is created
 For each action, a Spark job is created to compute it
T
U
M
T
M FF
E
F
E
F
E
DStream Graph
B
U
M
B
M FA
A A
RDD Graph
Spark actions
Block RDDs with
data received in
last batch interval
3 Spark jobs
Agenda
 Overview
 DStream Abstraction
 System Model
 Persistence / Caching
 RDD Checkpointing
 Performance Tuning
Components
 Network Input Tracker – Keeps track of the data received by each network
receiver and maps them to the corresponding input DStreams
 Job Scheduler – Periodically queries the DStream graph to generate Spark
jobs from received data, and hands them to Job Manager for execution
 Job Manager – Maintains a job queue and executes the jobs in Spark
ssc = new StreamingContext
t =
ssc.twitterStream(“…”)
t.filter(…).foreach(…)
Your program
DStream graph
Spark Context
Network Input Tracker
Job Manager
Job Scheduler
Spark Client
RDD graph Scheduler
Block manager
Shuffle
tracker
Spark Worker
Block
manager
Task
threads
Cluster
Manager
Execution Model – Receiving Data
Spark Streaming + Spark Driver Spark Workers
StreamingContext.start()
Network
Input
Tracker
Receiver
Data recvd
Block
Manager
Blocks replicated
Block
Manager
Master
Block
Manager
Blocks pushed
Spark Workers
Execution Model – Job Scheduling
Network
Input
Tracker
Job
Scheduler Spark’s
Schedulers
Receiver
Block
Manager
Block
Manager
Jobs executed on
worker nodes
DStream
Graph
Job
Manager
JobQueue
Spark Streaming + Spark Driver
Jobs
Block IDsRDDs
Job Scheduling
 Each output operation used generates a job
- More jobs  more time taken to process batches  higher batch
duration
 Job Manager decides how many concurrent Spark jobs to run
- Default is 1, can be set using Java property
spark.streaming.concurrentJobs
- If you have multiple output operations, you can try increasing this
property to reduce batch processing times and so reduce batch duration
Agenda
 Overview
 DStream Abstraction
 System Model
 Persistence / Caching
 RDD Checkpointing
 Performance Tuning
DStream Persistence
 If a DStream is set to persist at a storage level, then all RDDs
generated by it set to the same storage level
 When to persist?
- If there are multiple transformations / actions on a DStream
- If RDDs in a DStream is going to be used multiple times
 Window-based DStreams are automatically persisted in memory
DStream Persistence
 Default storage level of DStreams is StorageLevel.MEMORY_ONLY_SER
(i.e. in memory as serialized bytes)
- Except for input DStreams which have StorageLevel.MEMORY_AND_DISK_SER_2
- Note the difference from RDD’s default level (no serialization)
- Serialization reduces random pauses due to GC providing more consistent
job processing times
Agenda
 Overview
 DStream Abstraction
 System Model
 Persistence / Caching
 RDD Checkpointing
 Performance Tuning
What is RDD checkpointing?
Saving RDD to HDFS to prevent RDD graph from growing too large
 Done internally in Spark transparent to the user program
 Done lazily, saved to HDFS the first time it is computed
red_rdd.checkpoint()
HDFS file
Contents of red_rdd saved
to a HDFS file transparent to
all child RDDs
Why is RDD checkpointing necessary?
Stateful DStream operators can have infinite lineages
Large lineages lead to …
 Large closure of the RDD object  large task sizes  high task launch times
 High recovery times under failure
data
t-1 t t+1 t+2 t+3
states
Why is RDD checkpointing necessary?
Stateful DStream operators can have infinite lineages
Periodic RDD checkpointing solves this
Useful for iterative Spark programs as well
data
t-1 t t+1 t+2 t+3
states
HDF
S
HDF
S
RDD Checkpointing
 Periodicity of checkpoint determines a tradeoff
- Checkpoint too frequent: HDFS writing will slow things down
- Checkpoint too infrequent: Task launch times may increase
- Default setting checkpoints at most once in 10 seconds
- Try to checkpoint once in about 10 batches
Agenda
 Overview
 DStream Abstraction
 System Model
 Persistence / Caching
 RDD Checkpointing
 Performance Tuning
Performance Tuning
Step 1
Achieve a stable configuration that can sustain the
streaming workload
Step 2
Optimize for lower latency
Step 1: Achieving Stable Configuration
How to identify whether a configuration is stable?
 Look for the following messages in the log
Total delay: 0.01500 s for job 12 of time 1371512674000 …
 If the total delay is continuously increasing, then unstable as the
system is unable to process data as fast as its receiving!
 If the total delay stays roughly constant and around 2x the
configured batch duration, then stable
Step 1: Achieving Stable Configuration
How to figure out a good stable configuration?
 Start with a low data rate, small number of nodes, reasonably
large batch duration (5 – 10 seconds)
 Increase the data rate, number of nodes, etc.
 Find the bottleneck in the job processing
- Jobs are divided into stages
- Find which stage is taking the most amount of time
Step 1: Achieving Stable Configuration
How to figure out a good stable configuration?
 If the first map stage on raw data is taking most time, then try …
- Enabling delayed scheduling by setting property spark.locality.wait
- Splitting your data source into multiple sub streams
- Repartitioning the raw data into many partitions as first step
 If any of the subsequent stages are taking a lot of time, try…
- Try increasing the level of parallelism (i.e., increase number of reducers)
- Add more processors to the system
Step 2: Optimize for Lower Latency
 Reduce batch size and find a stable configuration again
- Increase levels of parallelism, etc.
 Optimize serialization overheads
- Consider using Kryo serialization instead of the default Java serialization for
both data and tasks
- For data, set property spark.serializer=spark.KryoSerializer
- For tasks, set spark.closure.serializer=spark.KryoSerializer
 Use Spark stand-alone mode rather than Mesos
Step 2: Optimize for Lower Latency
 Using concurrent mark sweep GC -XX:+UseConcMarkSweepGC is
recommended
- Reduces throughput a little, but also reduces large GC pauses and may
allow lower batch sizes by making processing time more consistent
 Try disabling serialization in DStream/RDD persistence levels
- Increases memory consumption and randomness of GC related pauses, but
may reduce latency by further reducing serialization overheads
 For a full list of guidelines for performance tuning
- Spark Tuning Guide
- Spark Streaming Tuning Guide
Small code base
 5000 LOC for Scala API (+ 1500 LC for Java API)
- Most DStream code mirrors the RDD code
 Easy to take a look and contribute
Future (Possible) Directions
 Better master fault-tolerance
 Better performance for complex queries
- Better performance for stateful processing is a low hanging fruit
 Dashboard for Spark Streaming
- Continuous graphs of processing times, end-to-end latencies
- Drill down for analyzing processing times of stages for finding bottlenecks
 Python API for Spark Streaming
CONTRIBUTIONS ARE WELCOME
1 of 46

Recommended

Introduction to Spark Streaming by
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
6.4K views31 slides
Understanding Query Plans and Spark UIs by
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
4.7K views50 slides
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... by
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
10.8K views45 slides
Spark streaming , Spark SQL by
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
1.4K views25 slides
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... by
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
5K views50 slides
Parquet performance tuning: the missing guide by
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
40.5K views44 slides

More Related Content

What's hot

SparkSQL: A Compiler from Queries to RDDs by
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
6.1K views44 slides
Optimizing Apache Spark SQL Joins by
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
44.9K views24 slides
Introduction to Redis by
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
11K views31 slides
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive by
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
3.3K views63 slides
Apache Hudi: The Path Forward by
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
490 views32 slides
Apache Spark Architecture by
Apache Spark ArchitectureApache Spark Architecture
Apache Spark ArchitectureAlexey Grishchenko
75.9K views114 slides

What's hot(20)

SparkSQL: A Compiler from Queries to RDDs by Databricks
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks6.1K views
Optimizing Apache Spark SQL Joins by Databricks
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks44.9K views
Introduction to Redis by Arnab Mitra
Introduction to RedisIntroduction to Redis
Introduction to Redis
Arnab Mitra11K views
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive by Sachin Aggarwal
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal3.3K views
Apache Hudi: The Path Forward by Alluxio, Inc.
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.490 views
Apache Spark Core—Deep Dive—Proper Optimization by Databricks
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks6.1K views
Unified Big Data Processing with Apache Spark (QCON 2014) by Databricks
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks3.7K views
Apache Spark in Depth: Core Concepts, Architecture & Internals by Anton Kirillov
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov9.6K views
PySpark dataframe by Jaemun Jung
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung285 views
Fine Tuning and Enhancing Performance of Apache Spark Jobs by Databricks
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks2.5K views
The Apache Spark File Format Ecosystem by Databricks
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks2.1K views
Introducing DataFrames in Spark for Large Scale Data Science by Databricks
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks41K views
The Parquet Format and Performance Optimization Opportunities by Databricks
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks8.1K views
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... by HostedbyConfluent
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent4.3K views
Simplifying Big Data Analytics with Apache Spark by Databricks
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks12K views
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab by CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab1.9K views
Databricks Delta Lake and Its Benefits by Databricks
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks5.1K views
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar... by Simplilearn
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn1.5K views

Similar to Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17

Deep dive into spark streaming by
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streamingTao Li
1.3K views32 slides
strata_spark_streaming.ppt by
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptrveiga100
9 views40 slides
strata_spark_streaming.ppt by
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptAbhijitManna19
7 views40 slides
strata_spark_streaming.ppt by
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.pptsnowflakebatch
3 views40 slides
Spark streaming by
Spark streamingSpark streaming
Spark streamingVenkateswaran Kandasamy
156 views40 slides
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ... by
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Tathagata Das
1.3K views41 slides

Similar to Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17(20)

Deep dive into spark streaming by Tao Li
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
Tao Li1.3K views
strata_spark_streaming.ppt by rveiga100
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
rveiga1009 views
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ... by Tathagata Das
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das1.3K views
Spark streaming by Noam Shaish
Spark streamingSpark streaming
Spark streaming
Noam Shaish1.3K views
Meet Up - Spark Stream Processing + Kafka by Knoldus Inc.
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
Knoldus Inc.7.6K views
Spark 计算模型 by wang xing
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing162 views
Unified Big Data Processing with Apache Spark by C4Media
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media2K views
No more struggles with Apache Spark workloads in production by Chetan Khatri
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri168 views
Productionizing your Streaming Jobs by Databricks
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks1.6K views
Introduction to Spark Streaming by Knoldus Inc.
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
Knoldus Inc.3.7K views
Spark & Spark Streaming Internals - Nov 15 (1) by Akhil Das
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
Akhil Das846 views
Tuning and Debugging in Apache Spark by Databricks
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks12.3K views
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... by Guido Schmutz
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz753 views
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... by Flink Forward
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward573 views
Flink 0.10 @ Bay Area Meetup (October 2015) by Stephan Ewen
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen1.7K views

Recently uploaded

Tunable Laser (1).pptx by
Tunable Laser (1).pptxTunable Laser (1).pptx
Tunable Laser (1).pptxHajira Mahmood
23 views37 slides
From chaos to control: Managing migrations and Microsoft 365 with ShareGate! by
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!sammart93
9 views39 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
17 views1 slide
Special_edition_innovator_2023.pdf by
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdfWillDavies22
16 views6 slides
Roadmap to Become Experts.pptx by
Roadmap to Become Experts.pptxRoadmap to Become Experts.pptx
Roadmap to Become Experts.pptxdscwidyatamanew
11 views45 slides
DALI Basics Course 2023 by
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023Ivory Egg
14 views12 slides

Recently uploaded(20)

From chaos to control: Managing migrations and Microsoft 365 with ShareGate! by sammart93
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
sammart939 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2216 views
DALI Basics Course 2023 by Ivory Egg
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023
Ivory Egg14 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta15 views
Transcript: The Details of Description Techniques tips and tangents on altern... by BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada130 views
Web Dev - 1 PPT.pdf by gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet55 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb12 views
Lilypad @ Labweek, Istanbul, 2023.pdf by Ally339821
Lilypad @ Labweek, Istanbul, 2023.pdfLilypad @ Labweek, Istanbul, 2023.pdf
Lilypad @ Labweek, Istanbul, 2023.pdf
Ally3398219 views
Data-centric AI and the convergence of data and model engineering: opportunit... by Paolo Missier
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier34 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst470 views
1st parposal presentation.pptx by i238212
1st parposal presentation.pptx1st parposal presentation.pptx
1st parposal presentation.pptx
i2382129 views

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17

  • 1. Deep dive into Spark Streaming Tathagata Das (TD) Matei Zaharia, Haoyuan Li, Timothy Hunter, Patrick Wendell and many others UC BERKELEY
  • 2. What is Spark Streaming?  Extends Spark for doing large scale stream processing  Scales to 100s of nodes and achieves second scale latencies  Efficient and fault-tolerant stateful stream processing  Integrates with Spark’s batch and interactive processing  Provides a simple batch-like API for implementing complex algorithms
  • 3. Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs 3 Spark Spark Streaming batches of X seconds live data stream processed results  Chop up the live stream into batches of X seconds  Spark treats each batch of data as RDDs and processes them using RDD operations  Finally, the processed results of the RDD operations are returned in batches
  • 4. Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs 4  Batch sizes as low as ½ second, latency ~ 1 second  Potential for combining batch processing and streaming processing in the same system Spark Spark Streaming batches of X seconds live data stream processed results
  • 5. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data batch @ t+1batch @ t batch @ t+2 tweets DStream stored in memory as an RDD (immutable, distributed dataset) Twitter Streaming API
  • 6. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) flatMap flatMap flatMap … transformation: modify data in one DStream to create another DStream new DStream new RDDs created for every batch batch @ t+1batch @ t batch @ t+2 tweets DStream hashTags Dstream *#cat, #dog, … +
  • 7. Example 1 – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: to push data to external storage flatMap flatMap flatMap save save save batch @ t+1batch @ t batch @ t+2 tweets DStream hashTags DStream every batch saved to HDFS
  • 8. DStream of data Example 2 – Count the hashtags over last 1 min val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() sliding window operation window length sliding interval window length sliding interval
  • 9. tagCounts Example 2 – Count the hashtags over last 1 min val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue() hashTags t-1 t t+1 t+2 t+3 sliding window countByValue count over all the data in the window
  • 10. Key concepts  DStream – sequence of RDDs representing a stream of data - Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets  Transformations – modify data from one DStream to another - Standard RDD operations – map, countByValue, reduceByKey, join, … - Stateful operations – window, countByValueAndWindow, …  Output Operations – send data to external entity - saveAsHadoopFiles – saves to HDFS - foreach – do anything with each batch of results
  • 11. Arbitrary Stateful Computations  Maintain arbitrary state, track sessions - Maintain per-user mood as state, and update it with his/her tweets moods = tweets.updateStateByKey(tweet => updateMood(tweet)) updateMood(newTweets, lastMood) => newMood tweets t-1 t t+1 t+2 t+3 moods
  • 12. Combine Batch and Stream Processing  Do arbitrary Spark RDD computation within DStream - Join incoming tweets with a spam file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })
  • 13. Fault-tolerance  RDDs remember the operations that created them  Batches of input data are replicated in memory for fault- tolerance  Data lost due to worker failure, can be recomputed from replicated input data  Therefore, all transformed data is fault-tolerant input data replicated in memory flatMap lost partitions recomputed on other workers tweets RDD hashTags RDD
  • 14. Agenda  Overview  DStream Abstraction  System Model  Persistence / Caching  RDD Checkpointing  Performance Tuning
  • 15. Discretized Stream (DStream) A sequence of RDDs representing a stream of data What does it take to define a DStream?
  • 16. DStream Interface The DStream interface primarily defines how to generate an RDD in each batch interval  List of dependent (parent) DStreams  Slide Interval, the interval at which it will compute RDDs  Function to compute RDD at a time t
  • 17. Example: Mapped DStream  Dependencies: Single parent DStream  Slide Interval: Same as the parent DStream  Compute function for time t: Create new RDD by applying map function on parent DStream’s RDD of time t
  • 18. Window operation gather together data over a sliding window  Dependencies: Single parent DStream  Slide Interval: Window sliding interval  Compute function for time t: Apply union over all the RDDs of parent DStream between times t and (t – window length) Example: Windowed DStream Parent DStream window length sliding interval
  • 19. Example: Network Input DStream Base class of all input DStreams that receive data from the network  Dependencies: None  Slide Interval: Batch duration in streaming context  Compute function for time t: Create a BlockRDD with all the blocks of data received in the last batch interval  Associated with a Network Receiver object
  • 20. Network Receiver Responsible for receiving data and pushing it into Spark’s data management layer (Block Manager) Base class for all receivers - Kafka, Flume, etc. Simple Interface:  What to do on starting the receiver - Helper object blockGenerator to push data into Spark  What to do on stopping the receiver
  • 21. Example: Socket Receiver  On start: Connect to remote TCP server While socket is connected, Receiving bytes and deserialize Deserialize them into Java objects Add the objects to blockGenerator  On stop: Disconnect socket
  • 22. DStream Graph t = ssc.twitterStream(“…”) .map(…) t.foreach(…) t1 = ssc.twitterStream(“…”) t2 = ssc.twitterStream(“…”) t = t1.union(t2).map(…) t.saveAsHadoopFiles(…) t.map(…).foreach(…) t.filter(…).foreach(…) T M F E Twitter Input DStream Mapped DStream Foreach DStream T U M T M FF E F E F E DStream GraphSpark Streaming program Dummy DStream signifying an output operation
  • 23. DStream Graph  RDD Graphs  Spark jobs  Every interval, RDD graph is computed from DStream graph  For each output operation, a Spark action is created  For each action, a Spark job is created to compute it T U M T M FF E F E F E DStream Graph B U M B M FA A A RDD Graph Spark actions Block RDDs with data received in last batch interval 3 Spark jobs
  • 24. Agenda  Overview  DStream Abstraction  System Model  Persistence / Caching  RDD Checkpointing  Performance Tuning
  • 25. Components  Network Input Tracker – Keeps track of the data received by each network receiver and maps them to the corresponding input DStreams  Job Scheduler – Periodically queries the DStream graph to generate Spark jobs from received data, and hands them to Job Manager for execution  Job Manager – Maintains a job queue and executes the jobs in Spark ssc = new StreamingContext t = ssc.twitterStream(“…”) t.filter(…).foreach(…) Your program DStream graph Spark Context Network Input Tracker Job Manager Job Scheduler Spark Client RDD graph Scheduler Block manager Shuffle tracker Spark Worker Block manager Task threads Cluster Manager
  • 26. Execution Model – Receiving Data Spark Streaming + Spark Driver Spark Workers StreamingContext.start() Network Input Tracker Receiver Data recvd Block Manager Blocks replicated Block Manager Master Block Manager Blocks pushed
  • 27. Spark Workers Execution Model – Job Scheduling Network Input Tracker Job Scheduler Spark’s Schedulers Receiver Block Manager Block Manager Jobs executed on worker nodes DStream Graph Job Manager JobQueue Spark Streaming + Spark Driver Jobs Block IDsRDDs
  • 28. Job Scheduling  Each output operation used generates a job - More jobs  more time taken to process batches  higher batch duration  Job Manager decides how many concurrent Spark jobs to run - Default is 1, can be set using Java property spark.streaming.concurrentJobs - If you have multiple output operations, you can try increasing this property to reduce batch processing times and so reduce batch duration
  • 29. Agenda  Overview  DStream Abstraction  System Model  Persistence / Caching  RDD Checkpointing  Performance Tuning
  • 30. DStream Persistence  If a DStream is set to persist at a storage level, then all RDDs generated by it set to the same storage level  When to persist? - If there are multiple transformations / actions on a DStream - If RDDs in a DStream is going to be used multiple times  Window-based DStreams are automatically persisted in memory
  • 31. DStream Persistence  Default storage level of DStreams is StorageLevel.MEMORY_ONLY_SER (i.e. in memory as serialized bytes) - Except for input DStreams which have StorageLevel.MEMORY_AND_DISK_SER_2 - Note the difference from RDD’s default level (no serialization) - Serialization reduces random pauses due to GC providing more consistent job processing times
  • 32. Agenda  Overview  DStream Abstraction  System Model  Persistence / Caching  RDD Checkpointing  Performance Tuning
  • 33. What is RDD checkpointing? Saving RDD to HDFS to prevent RDD graph from growing too large  Done internally in Spark transparent to the user program  Done lazily, saved to HDFS the first time it is computed red_rdd.checkpoint() HDFS file Contents of red_rdd saved to a HDFS file transparent to all child RDDs
  • 34. Why is RDD checkpointing necessary? Stateful DStream operators can have infinite lineages Large lineages lead to …  Large closure of the RDD object  large task sizes  high task launch times  High recovery times under failure data t-1 t t+1 t+2 t+3 states
  • 35. Why is RDD checkpointing necessary? Stateful DStream operators can have infinite lineages Periodic RDD checkpointing solves this Useful for iterative Spark programs as well data t-1 t t+1 t+2 t+3 states HDF S HDF S
  • 36. RDD Checkpointing  Periodicity of checkpoint determines a tradeoff - Checkpoint too frequent: HDFS writing will slow things down - Checkpoint too infrequent: Task launch times may increase - Default setting checkpoints at most once in 10 seconds - Try to checkpoint once in about 10 batches
  • 37. Agenda  Overview  DStream Abstraction  System Model  Persistence / Caching  RDD Checkpointing  Performance Tuning
  • 38. Performance Tuning Step 1 Achieve a stable configuration that can sustain the streaming workload Step 2 Optimize for lower latency
  • 39. Step 1: Achieving Stable Configuration How to identify whether a configuration is stable?  Look for the following messages in the log Total delay: 0.01500 s for job 12 of time 1371512674000 …  If the total delay is continuously increasing, then unstable as the system is unable to process data as fast as its receiving!  If the total delay stays roughly constant and around 2x the configured batch duration, then stable
  • 40. Step 1: Achieving Stable Configuration How to figure out a good stable configuration?  Start with a low data rate, small number of nodes, reasonably large batch duration (5 – 10 seconds)  Increase the data rate, number of nodes, etc.  Find the bottleneck in the job processing - Jobs are divided into stages - Find which stage is taking the most amount of time
  • 41. Step 1: Achieving Stable Configuration How to figure out a good stable configuration?  If the first map stage on raw data is taking most time, then try … - Enabling delayed scheduling by setting property spark.locality.wait - Splitting your data source into multiple sub streams - Repartitioning the raw data into many partitions as first step  If any of the subsequent stages are taking a lot of time, try… - Try increasing the level of parallelism (i.e., increase number of reducers) - Add more processors to the system
  • 42. Step 2: Optimize for Lower Latency  Reduce batch size and find a stable configuration again - Increase levels of parallelism, etc.  Optimize serialization overheads - Consider using Kryo serialization instead of the default Java serialization for both data and tasks - For data, set property spark.serializer=spark.KryoSerializer - For tasks, set spark.closure.serializer=spark.KryoSerializer  Use Spark stand-alone mode rather than Mesos
  • 43. Step 2: Optimize for Lower Latency  Using concurrent mark sweep GC -XX:+UseConcMarkSweepGC is recommended - Reduces throughput a little, but also reduces large GC pauses and may allow lower batch sizes by making processing time more consistent  Try disabling serialization in DStream/RDD persistence levels - Increases memory consumption and randomness of GC related pauses, but may reduce latency by further reducing serialization overheads  For a full list of guidelines for performance tuning - Spark Tuning Guide - Spark Streaming Tuning Guide
  • 44. Small code base  5000 LOC for Scala API (+ 1500 LC for Java API) - Most DStream code mirrors the RDD code  Easy to take a look and contribute
  • 45. Future (Possible) Directions  Better master fault-tolerance  Better performance for complex queries - Better performance for stateful processing is a low hanging fruit  Dashboard for Spark Streaming - Continuous graphs of processing times, end-to-end latencies - Drill down for analyzing processing times of stages for finding bottlenecks  Python API for Spark Streaming