Dive into Spark Streaming
by Gerard Maas
Data Processing Team Lead
gerard.maas@virdata.com | @maasg
Virdata: A ‘born on the cloud’ IoT Managed Services & Analytics Platform
A ‘born on the cloud’ IoT Managed Services & Analytics Platform
(Timeline: 2012, 2013, 2014, with affiliate and certified partner logos.)
Virdata’s Stack
(Architecture diagram: devices connect over MQTT / WebSockets to a Kafka pub/sub layer; events flow into raw storage, through processing into storage, and are served by a query layer.)
Virdata’s Full Stack: Virdata’s Spark as a Service
How Spark is driving a new loosely-coupled stand-alone service.
(Same architecture: MQTT / WebSockets ingestion, pub/sub, raw storage, storage, and query, now with Spark powering a Notebook Server as a stand-alone service.)
(Scale: 100 TB of stored data vs. a stream of 5 MB/second.)
Agenda
What is Spark Streaming?
Programming Model
Demo 1
Execution Model
Demo 2
Resources
Q/A
Apache Spark
Spark Streaming
Scalable, fault-tolerant stream processing system.
(Diagram: input sources such as Kafka, Flume, Kinesis, Twitter, sockets, HDFS/S3, and custom streams flow through Spark Streaming to outputs such as HDFS, databases, and server applications.)
Spark: RDD Operations
(Diagram: input data in HDFS text/sequence files is loaded through the SparkContext into an RDD; transformations produce new RDDs; an action writes the output data to HDFS text/sequence files or Cassandra.)
Lazy evaluation: transformations only build a lineage; nothing executes until an action is invoked.
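A minimal sketch of this pattern in Scala (app name and HDFS paths are hypothetical): the transformations are declared lazily and only the final action triggers the job.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops"))

val input = sc.textFile("hdfs:///data/input")      // RDD[String], lazy
val lengths = input.map(line => line.length)       // transformation, lazy
val longOnes = lengths.filter(_ > 80)              // transformation, lazy
longOnes.saveAsTextFile("hdfs:///data/output")     // action: triggers execution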
Spark Streaming: DStreams
(Diagram: a Receiver turns the incoming stream into a series of RDDs; as in core Spark, transformations are lazy and an action triggers execution on each RDD.)
Spark Streaming
DStream[T]: a continuous sequence of RDD[T], one per batch interval (t0, t1, t2, t3, ..., ti, ti+1).
Transformations applied to a DStream[T] run on every underlying RDD[T], yielding a DStream[U] (one RDD[U] per interval).
Actions are likewise applied to every RDD in the stream.
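A minimal sketch, assuming a socket source on localhost:9999 (hypothetical) and a 2-second batch interval:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // one RDD per 2-second interval
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
val words = lines.flatMap(_.split(" "))               // transformation
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // per-batch word counts
counts.print()                                        // action, applied to every RDD
ssc.start()
ssc.awaitTermination()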
Transformations
map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
Transformations
transform: apply an arbitrary RDD-to-RDD function to every batch, e.g. to join the stream against a static dataset:

val iotDstream = MQTTUtils.createStream(...)            // stream of device events
val devicePriority = sparkContext.cassandraTable(...)   // static RDD keyed by device id
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)          // join each batch with priorities
}
Transformations
updateStateByKey: maintain arbitrary per-key state across batches; for every key, the state is updated with the values that arrived in the current interval.
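A minimal sketch of a running count per word, continuing the word-count sketch above (the checkpoint path is hypothetical; stateful operations require checkpointing):

ssc.checkpoint("hdfs:///checkpoints")                  // required for stateful ops

val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))             // fold new batch counts into state

val runningCounts = counts.updateStateByKey(updateCount) // DStream[(String, Int)]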
Transformations
trackStateByKey (Spark 1.6): a more efficient alternative to updateStateByKey that only touches keys with new data and supports timeouts. (It shipped in Spark 1.6 under the name mapWithState.)
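A minimal sketch using the released mapWithState API (Spark 1.6), again counting words:

import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function { (word: String, one: Option[Int], state: State[Int]) =>
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)                                    // write the new state
  (word, sum)                                          // emitted value
}
val statefulCounts = words.map(w => (w, 1)).mapWithState(spec)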
Actions
print: writes the first elements of every batch to the driver's stdout:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles: persist every batch as a set of files.
foreachRDD: the general-purpose action, covered next.
Actions - foreachRDD
dstream.foreachRDD { rdd =>
  // full access to the batch's RDD: DataFrames, Spark SQL,
  // MLlib, GraphX, writes to databases, ...
}
Actions - foreachRDD Usage
A common pitfall: the body of foreachRDD runs on the Driver; only the closures passed to RDD operations run on the Workers. Here the database connection is created on the driver but used inside a worker-side closure:

dstream.foreachRDD { rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet   // driver
  alternatives.foreach { alternative =>                         // driver
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    val conn = DB.connect(server)                               // driver: wrong place!
    asRecords.foreachPartition { partition =>
      partition.foreach(element => conn.insert(element))        // workers
    }
  }
  rdd.unpersist(true)
}

Executes on the Driver: everything outside the RDD closures.
Executes on the Workers: the bodies of filter, map, and foreachPartition.
Actions - foreachRDD Usage
The fix: create the connection inside foreachPartition, so each worker opens its own:

dstream.foreachRDD { rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet   // driver
  alternatives.foreach { alternative =>
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    asRecords.foreachPartition { partition =>
      val conn = DB.connect(server)                             // worker: one per partition
      partition.foreach(element => conn.insert(element))
    }
  }
  rdd.unpersist(true)
}
Windows - Sliding
dstream.window(windowLength = 6, slideInterval = 3)
(Diagram: over batches 1..14, successive windows hold 1,2,3,4,5,6 then 4,5,6,7,8,9 then 7,8,9,10,11,12: each window spans 6 batches and advances by 3, so consecutive windows overlap.)
Windows - Non-Overlapping
dstream.window(windowLength = 6, slideInterval = 6)
(Diagram: windows hold 1,2,3,4,5,6 then 7,8,9,10,11,12: with slideInterval equal to windowLength, windows do not overlap.)
Windows - Operations
window,
countByWindow,
reduceByWindow,
reduceByKeyAndWindow,
countByValueAndWindow
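A minimal sketch of a windowed count over the word-count stream from earlier; window and slide durations must be multiples of the batch interval:

import org.apache.spark.streaming.Seconds

val windowedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce within the window
  Seconds(30),                 // window length
  Seconds(10)                  // slide interval
)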
Windows - Inverse Function Optimization
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
(Diagram: instead of re-reducing all of 4,5,6,7,8,9, the new window is derived from the previous window 1,2,3,4,5,6 by adding the entering batches 7, 8, 9 with func and removing the leaving batches 1, 2, 3 with the inverse function invFunc.)
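A minimal sketch of the inverse-function form for counts (this variant requires checkpointing to be enabled):

val optimizedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  _ + _,        // func: fold in values entering the window
  _ - _,        // invFunc: fold out values leaving the window
  Seconds(30),
  Seconds(10)
)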
Top 3 Starter Issues
Serialization
Where is my closure executing?
Do I have enough cores?
Demo 1
Anatomy of a Spark Streaming Application
Tweet a few keywords about your interests and experience, using the hashtag #sparkbe.
Example: Scala Spark Streaming DistributedSystems #sparkbe
Ready to dive in?
Deployment Options
Local: spark.master=local[*]
Standalone Cluster: spark.master=spark://host:port
Using a Cluster Manager: spark.master=mesos://host:port
(Diagram: Master (M), Driver (D), and Worker (W) processes in each topology.)
Deployment Options
Local: spark.master=local[*]
Standalone Cluster: spark.master=spark://host:port
Using a Cluster Manager: spark.master=mesos://host:port
(Diagram: in a streaming job, some executor slots are taken by Receivers (Rec) while the remaining executors (Exec) run the processing tasks.)
Scheduling
(Diagram sequence: consumers keep receiving data while a new batch is cut at every interval t0, t1, t2, ...)
Healthy case: batch #0 finishes processing before batch #1 is ready: Process Time < Batch Interval.
Unhealthy case: when processing takes longer than the batch interval, later batches (#2, #3, ...) pile up behind the ones still running and the Scheduling Delay keeps growing.
From Streams to μbatches
(Diagram: a consumer fills a block every blockInterval; the blocks collected during one batchInterval make up batch #0, #1, ...)
#partitions = receivers x batchInterval / blockInterval
From Streams to μbatches
(Diagram sequence: the blocks of batch #0 become the partitions of its RDD, and each partition is scheduled as a task on the available Spark executors.)
From Streams to μbatches
Inverting the formula to tune the block interval for a target number of partitions per core (partitionFactor):
spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
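A worked example with illustrative numbers: with batchInterval = 2000 ms, 2 receivers, 8 Spark cores, and partitionFactor = 1, blockInterval = 2000 x 2 / (1 x 8) = 500 ms, which yields 2 x 2000 / 500 = 8 partitions, one per core.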
The Importance of Caching
dstream.foreachRDD { rdd =>
  rdd.cache()                 // cache the RDD before iterating!
  keys.foreach { key =>
    rdd.filter(elem => key(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist()             // release the cached data once done
}
The Receiver Model
Rate limiting: spark.streaming.receiver.maxRate caps the records per second each receiver accepts.
Fault tolerance? Enable the Write-Ahead Log (WAL) so received data survives failures.
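A minimal configuration sketch (the rate value is illustrative; the WAL also needs a checkpoint directory configured on the StreamingContext):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000")              // records/sec per receiver
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL for zero data loss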
The Receiver Model
src: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Kafka Stream
(Diagram: on each batch, the Driver computes the Kafka offset ranges, compute(offsets), and the executors read those ranges directly from the brokers, with no receiver involved.)
Kafka: The Receiver-less Model
Simplified parallelism: one RDD partition per Kafka partition
Efficiency: no replication or write-ahead log needed
Exactly-once semantics: offsets tracked by Spark instead of ZooKeeper
Trade-off: fewer degrees of freedom

val directKafkaStream =
  KafkaUtils.createDirectStream[
    [key class],
    [value class],
    [key decoder class],
    [value decoder class]](
  streamingContext,
  [map of Kafka parameters],
  [set of topics to consume])

Rate limiting: spark.streaming.kafka.maxRatePerPartition
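A minimal sketch instantiating that template for String keys and values (broker list and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("telemetry")
val directKafkaStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    streamingContext, kafkaParams, topics)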
Kafka: The Receiver-less Model
src: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Delivery Semantics
• Spark Streaming Receiver-based (< v1.2): roughly at-least-once
• Spark Streaming Receiver w/ WAL: at-least-once + zero data loss
• Spark Streaming Direct: at-least-once + zero data loss
• Spark Streaming Direct + offset management + idempotent writes | transactions: exactly-once
Spark Streaming (v1.5) Made Reactive
Backpressure support: the ingestion rate is adjusted dynamically by a proportional-integral-derivative (PID) controller, based on how fast recent batches were processed.
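Enabling it is a single configuration flag (Spark 1.5+):

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")  // PID-based rate control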
Demo 2
Spark Streaming Performance
Resources
Spark Streaming Official Programming Guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Backpressure in Spark Streaming:
http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata’s Spark Streaming tuning guide:
http://www.virdata.com/tuning-spark/
Spark Summit Presentations:
https://spark-summit.org/
Diving into Spark Streaming Execution Model:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Kafka direct approach:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
Questions?
Thanks!
Gerard Maas
@maasg
www.virdata.com
- we’re hiring -