Recipes for running Spark Streaming in Production
Tathagata “TD” Das
Hadoop Summit 2015
@tathadas
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Databricks?
Founded by creators of Spark and remains largest contributor
Offers a hosted service
- Spark on EC2
- Notebooks
- Plot visualizations
- Cluster management
- Scheduled jobs
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
Inputs: Kafka, Flume, Kinesis, Twitter, HDFS/S3
Outputs: file systems, databases, dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
What can you use it for?
6
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Cat videos in tweets as soon as they go viral
How does it work?
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
7
(Diagram: data streams → receivers → batches → results)
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKey(_ + _)
wordCounts.print()
context.start()
8
entry point of streaming functionality
create DStream from Kafka data
DStream: represents a data stream
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
9
split lines into words
Transformation: transform data to create new DStreams
Word Count with Kafka
val context = new StreamingContext(conf, Seconds(1))
val lines = KafkaUtils.createStream(context, ...)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKey(_ + _)
wordCounts.print()
context.start()
10
print some counts on screen
count the words
start receiving and transforming the data
Languages
Can natively use Scala, Java, and Python
Can use any other language by using pipe()
11
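For instance, a minimal sketch of the pipe() approach (the external script ./wordcount.py is a hypothetical placeholder that reads records on stdin and writes results to stdout):
val piped = lines.transform { rdd =>
  rdd.pipe("./wordcount.py")  // hypothetical script: each record goes to its stdin, each output line becomes a record
}
piped.print()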
12
Integrates with Spark Ecosystem
13
(Spark stack: Spark Streaming, Spark SQL, MLlib and GraphX on top of Spark Core)
Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter( ... )
}
14
Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
model.predict(event.feature)
}
15
Combine SQL with streaming
Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
batchRDD.toDF.registerTempTable("events")
}
// Interactively query table
sqlContext.sql("select * from events")
16
17
How do I write production quality Spark Streaming applications?
WordCount with Kafka
object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(new SparkConf(), Seconds(1))
    val lines = KafkaUtils.createStream(context, ...)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}
18
Got it working on a small Spark cluster on little data
What’s next??
What’s needed to get it production-ready?
Fault-tolerance and semantics
Performance and stability
Monitoring and upgrading
19
20
Deeper View of Spark Streaming
Any Spark Application
21
(Diagram: driver and executors on a YARN / Mesos / Spark Standalone cluster)
User code runs in the driver process
Driver launches executors in the cluster
Tasks sent to executors for processing data
Spark Streaming Application: Receive data
22
(Diagram: driver and two executors)
Driver runs the user code (the WordCount program shown earlier); receivers run on executors as long-running tasks
Receiver divides the data stream into blocks and keeps them in memory
Blocks are also replicated to another executor
Spark Streaming Application: Process data
23
(Diagram: driver and two executors, blocks in memory, results pushed to a data store)
Every batch interval, the driver launches tasks to process the blocks (running the same WordCount code as before)
24
Fault-tolerance and semantics
Failures? Why care?
Many streaming applications need zero data loss guarantees despite any kind of failures in the system
At least once guarantee – every record processed at least once
Exactly once guarantee – every record processed exactly once
Different kinds of failures – executor and driver failures
Some failures and guarantee requirements need additional configurations and setups
25
What if an executor fails?
26
If an executor fails, the receiver running on it is lost and all its in-memory blocks are lost
Tasks and receivers are restarted by Spark automatically, no config needed
Receiver restarted on another executor; tasks restarted using the block replicas
What if the driver fails?
27
When the driver fails, all the executors fail
All computation and all received blocks are lost
How do we recover?
Recovering Driver with Checkpointing
Checkpointing: Periodically save the state of computation to fault-tolerant storage
28
(Diagram: active driver periodically writes checkpoint information to fault-tolerant storage)
Recovering Driver with Checkpointing
29
Failed driver can be restarted from checkpoint information
New executors launched and receivers restarted
Recovering Driver with Checkpointing
1. Configure automatic driver restart
All cluster managers support this
2. Set a checkpoint directory in an HDFS-compatible file system
streamingContext.checkpoint(hdfsDirectory)
3. Slightly restructure the code to use checkpoints for recovery
30
Configuring Automatic Driver Restart
Spark Standalone – Submit job in “cluster” mode with “--supervise”
See http://spark.apache.org/docs/latest/spark-standalone.html
YARN – Submit job in “cluster” mode
See YARN config “yarn.resourcemanager.am.max-attempts”
Mesos – Marathon can restart Mesos applications
31
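For example, a Spark Standalone submission might look like this (a sketch; the master URL, class and jar names are placeholders):
spark-submit --master spark://master:7077 --deploy-mode cluster --supervise \
  --class com.example.WordCount wordcount-assembly.jar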
Restructuring code for Checkpointing
32
Create + setup:
val context = new StreamingContext(...)
val lines = KafkaUtils.createStream(...)
val words = lines.flatMap(...)
...
Start:
context.start()
Put all setup code into a function that returns a new StreamingContext
def creatingFunc(): StreamingContext = {
  val context = new StreamingContext(...)
  val lines = KafkaUtils.createStream(...)
  val words = lines.flatMap(...)
  ...
  context.checkpoint(hdfsDir)
  context   // return the fully set-up context
}
Get context setup from HDFS dir OR create a new one with the function
val context = StreamingContext.getOrCreate(hdfsDir, creatingFunc)
context.start()
Restructuring code for Checkpointing
StreamingContext.getOrCreate():
  if the HDFS directory has checkpoint info, recover the context from it
  else call creatingFunc() to create and set up a new context
Restarted process can figure out whether to recover using checkpoint info or not
33
Received blocks lost on Restart!
34
(Diagram: driver restarted, new executors launched, receiver restarted, but no blocks)
In-memory blocks of buffered data are lost on driver restart
Recovering data with Write Ahead Logs
Write Ahead Log: Synchronously save received data to fault-tolerant storage
35
(Diagram: as the receiver receives the data stream, blocks are saved to HDFS as well as kept in executor memory)
Received blocks recovered on restart
Write Ahead Log: Synchronously save received data to fault-tolerant storage
36
(Diagram: after the driver and executors are restarted, blocks are recovered from the Write Ahead Log)
Recovering data with Write Ahead Logs
1. Enable checkpointing, logs written in checkpoint directory
2. Enable WAL in SparkConf configuration
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
3. Receiver should also be reliable
Acknowledge source only after data saved to WAL
Unacked data will be replayed from source by restarted receiver
37
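Putting these together, a minimal setup sketch (app name, batch interval and the HDFS path are placeholders):
val sparkConf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // enable the WAL
val context = new StreamingContext(sparkConf, Seconds(1))
context.checkpoint("hdfs://namenode:8020/checkpoints/wordcount")  // WAL files go under this checkpoint directory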
Fault-tolerance Semantics
38
Zero data loss = every stage processes each event at least once despite any failure
(Pipeline stages: Sources → Receiving → Transforming → Outputting → Sinks)
Fault-tolerance Semantics
39
Sources
Transformin
g
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
At least once, with Write Ahead Log and Reliable
receivers
Receiving
Outputting
Exactly once, if outputs are idempotent or
transactional
End-to-end semantics:
At-least once
Fault-tolerance Semantics
40
Exactly once receiving with new Kafka Direct approach
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs
val directKafkaStream = KafkaUtils.createDirectStream(...)
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.htm
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
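For reference, a fuller sketch against the Kafka 0.8 direct API in Spark 1.x (broker addresses and the topic name are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
val topics = Set("events")                                                    // placeholder topic
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, topics)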
Fault-tolerance Semantics
41
Exactly once receiving with new Kafka Direct approach
Receiving: exactly once
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
End-to-end semantics: exactly once!
42
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
43
Performance and stability
Achieving High Throughput
44
High throughput achieved by sufficient parallelism at all stages of the pipeline
Scaling the Receivers
45
Sources must be configured with parallel data streams
#partitions in Kafka topics, #shards in Kinesis streams, …
Streaming app should have multiple receivers that receive the data streams in parallel
Multiple input DStreams, each running a receiver
Can be unioned together to create one DStream, as sketched below
val kafkaStream1 = KafkaUtils.createStream(...)
val kafkaStream2 = KafkaUtils.createStream(...)
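A minimal sketch of unioning the per-receiver streams into one DStream (using StreamingContext.union):
val unionedStream = context.union(Seq(kafkaStream1, kafkaStream2))
// downstream transformations then operate on the single unioned DStream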
Scaling the Receivers
46
Sufficient number of executors to run all the receivers
#executors > #receivers, so that there is no more than 1 receiver per executor, and the network is not shared between receivers
Kafka Direct approach does not use receivers
Automatically parallelizes data reading across executors
Parallelism = # Kafka partitions
Stability in Processing
47
For stability, must process data as fast as it is received
Must ensure avg batch processing times < batch interval
Previous batch is done by the time next batch is received
Otherwise, new batches keep queueing up waiting for previous batches to finish, and scheduling delay goes up
Reducing Batch Processing Times
48
More receivers!
Executors running receivers do a lot of the processing
Repartition the received data to explicitly distribute load
unionedStream.repartition(40)
Increase #partitions in shuffles
transformedStream.reduceByKey(reduceFunc, 40)
Get more executors and cores!
Reducing Batch Processing Times
49
Use Kryo serialization to reduce serialization costs
Register classes for best performance
See configurations spark.kryo.*
http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
Larger batch durations improve stability
More data aggregated together, amortized cost of shuffle
Limit ingestion rate to handle data surges
See configurations spark.streaming.*maxRate*
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
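A configuration sketch combining both (the event class and the rate value are illustrative, not from the slides):
case class Event(id: String, value: Double)                 // hypothetical record type to register
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))               // register classes for best performance
  .set("spark.streaming.receiver.maxRate", "10000")         // cap records per second per receiver (example value)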
Speeding up Output Operations
50
Write to data stores efficiently
dataRDD.foreach { event =>
// open connection
// insert single event
// close connection
}
foreach: inefficient
dataRDD.foreachPartition { partition =>
// open connection
// insert all events in partition
// close connection
}
foreachPartition: efficient
dataRDD.foreachPartition { partition =>
// get open connection from pool
// insert all events in partition
// return connection to pool
}
foreachPartition + connection pool: more efficient
51
Monitoring and Upgrading
Streaming in Spark Web UI
Stats over last 1000 batches
Timelines
Histograms
52
Scheduling delay should be approximately 0
Processing time should stay below the batch interval
Streaming in Spark Web UI
Details of individual batches
53
Details of Spark jobs run in a batch
[New in Spark 1.4]
Operational Monitoring
Streaming statistics published through Codahale metrics
Ganglia sink, Graphite sink, custom Codahale metrics sinks
Can see long term trends, across hours and days
Configure the metrics using $SPARK_HOME/conf/metrics.properties
Also need to compile Spark with the Ganglia LGPL profile (see http://spark.apache.org/docs/latest/monitoring.html#metrics)
54
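For example, a metrics.properties sketch for a Graphite sink (host, port and period are placeholders):
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds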
Upgrading Apps
1. Shutdown your current streaming app gracefully
Will process all data before shutting down cleanly
streamingContext.stop(stopGracefully = true)
2. Update app code and start it again
56
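One common pattern (a sketch, not from the slides; the marker path is a placeholder, and it assumes Spark 1.3+ where awaitTerminationOrTimeout returns whether the context has stopped) is to let an operator trigger the graceful stop by creating a marker file that the driver polls for:
import org.apache.hadoop.fs.{FileSystem, Path}
val markerPath = new Path("hdfs://namenode:8020/app/stop-marker")   // hypothetical marker file
var stopped = false
while (!stopped) {
  stopped = context.awaitTerminationOrTimeout(10000)                // wait up to 10 seconds
  val fs = FileSystem.get(context.sparkContext.hadoopConfiguration)
  if (!stopped && fs.exists(markerPath)) {
    context.stop(stopSparkContext = true, stopGracefully = true)    // graceful shutdown
    stopped = true
  }
}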
Things not covered
Memory tuning
GC tuning
Using SQLContext
Using Broadcasts
Using Accumulators
…
Refer to online guide
http://spark.apache.org/docs/latest/streaming-programming-guide.html
57
58
More Spark Streaming Talks