Dive into Spark Streaming
by Gerard Maas
Data Processing Team Lead
gerard.maas@virdata.com | @maasg
Virdata: A ‘born on the cloud’ IoT Managed Services & Analytics Platform
A ‘born on the cloud’ IoT Managed Services & Analytics Platform
(Timeline: 2012, 2013, 2014, with affiliate and certified partner logos.)
Virdata’s Stack
(Architecture diagram: devices connect over MQTT / WebSockets to a Kafka pub/sub layer; events flow into raw storage, through processing into storage, and are served by a query layer.)
Virdata’s Full Stack: Virdata’s Spark as a Service
How Spark is driving a new loosely-coupled stand-alone service.
(Same architecture: MQTT / WebSockets ingestion, pub/sub, raw storage, storage, and query, now with Spark powering a Notebook Server as a stand-alone service.)
(Scale: 100 TB of stored data vs. a stream of 5 MB/second.)
Agenda
What is Spark Streaming?
Programming Model
Demo 1
Execution Model
Demo 2
Resources
Q/A
Apache Spark
Spark Streaming
Scalable, fault-tolerant stream processing system.
(Diagram: input sources such as Kafka, Flume, Kinesis, Twitter, sockets, HDFS/S3, and custom streams flow through Spark Streaming to outputs such as HDFS, databases, and server applications.)
Spark: RDD Operations
(Diagram: input data in HDFS text/sequence files is loaded through the SparkContext into an RDD; transformations produce new RDDs; an action writes the output data to HDFS text/sequence files or Cassandra.)
Lazy evaluation: transformations only build a lineage; nothing executes until an action is invoked.
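A minimal sketch of this pattern in Scala (app name and HDFS paths are hypothetical): the transformations are declared lazily and only the final action triggers the job.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops"))

val input = sc.textFile("hdfs:///data/input")      // RDD[String], lazy
val lengths = input.map(line => line.length)       // transformation, lazy
val longOnes = lengths.filter(_ > 80)              // transformation, lazy
longOnes.saveAsTextFile("hdfs:///data/output")     // action: triggers execution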
Spark Streaming: DStreams
(Diagram: a Receiver turns the incoming stream into a series of RDDs; as in core Spark, transformations are lazy and an action triggers execution on each RDD.)
Spark Streaming
DStream[T]: a continuous sequence of RDD[T], one per batch interval (t0, t1, t2, t3, ..., ti, ti+1).
Transformations applied to a DStream[T] run on every underlying RDD[T], yielding a DStream[U] (one RDD[U] per interval).
Actions are likewise applied to every RDD in the stream.
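A minimal sketch, assuming a socket source on localhost:9999 (hypothetical) and a 2-second batch interval:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // one RDD per 2-second interval
val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
val words = lines.flatMap(_.split(" "))               // transformation
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // per-batch word counts
counts.print()                                        // action, applied to every RDD
ssc.start()
ssc.awaitTermination()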
Transformations
map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
Transformations
transform: apply an arbitrary RDD-to-RDD function to every batch, e.g. to join the stream against a static dataset:

val iotDstream = MQTTUtils.createStream(...)            // stream of device events
val devicePriority = sparkContext.cassandraTable(...)   // static RDD keyed by device id
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)          // join each batch with priorities
}
Transformations
updateStateByKey: maintain arbitrary per-key state across batches; for every key, the state is updated with the values that arrived in the current interval.
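A minimal sketch of a running count per word, continuing the word-count sketch above (the checkpoint path is hypothetical; stateful operations require checkpointing):

ssc.checkpoint("hdfs:///checkpoints")                  // required for stateful ops

val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))             // fold new batch counts into state

val runningCounts = counts.updateStateByKey(updateCount) // DStream[(String, Int)]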
Transformations
trackStateByKey (Spark 1.6): a more efficient alternative to updateStateByKey that only touches keys with new data and supports timeouts. (It shipped in Spark 1.6 under the name mapWithState.)
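A minimal sketch using the released mapWithState API (Spark 1.6), again counting words:

import org.apache.spark.streaming.{State, StateSpec}

val spec = StateSpec.function { (word: String, one: Option[Int], state: State[Int]) =>
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)                                    // write the new state
  (word, sum)                                          // emitted value
}
val statefulCounts = words.map(w => (w, 1)).mapWithState(spec)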
Actions
print: writes the first elements of every batch to the driver's stdout:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles: persist every batch as a set of files.
foreachRDD: the general-purpose action, covered next.
Actions - foreachRDD
dstream.foreachRDD { rdd =>
  // full access to the batch's RDD: DataFrames, Spark SQL,
  // MLlib, GraphX, writes to databases, ...
}
Actions - foreachRDD Usage
A common pitfall: the body of foreachRDD runs on the Driver; only the closures passed to RDD operations run on the Workers. Here the database connection is created on the driver but used inside a worker-side closure:

dstream.foreachRDD { rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet   // driver
  alternatives.foreach { alternative =>                         // driver
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    val conn = DB.connect(server)                               // driver: wrong place!
    asRecords.foreachPartition { partition =>
      partition.foreach(element => conn.insert(element))        // workers
    }
  }
  rdd.unpersist(true)
}

Executes on the Driver: everything outside the RDD closures.
Executes on the Workers: the bodies of filter, map, and foreachPartition.
Actions - foreachRDD Usage
The fix: create the connection inside foreachPartition, so each worker opens its own:

dstream.foreachRDD { rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet   // driver
  alternatives.foreach { alternative =>
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    asRecords.foreachPartition { partition =>
      val conn = DB.connect(server)                             // worker: one per partition
      partition.foreach(element => conn.insert(element))
    }
  }
  rdd.unpersist(true)
}
Windows - Sliding
dstream.window(windowLength = 6, slideInterval = 3)
(Diagram: over batches 1..14, successive windows hold 1,2,3,4,5,6 then 4,5,6,7,8,9 then 7,8,9,10,11,12: each window spans 6 batches and advances by 3, so consecutive windows overlap.)
Windows - Non-Overlapping
dstream.window(windowLength = 6, slideInterval = 6)
(Diagram: windows hold 1,2,3,4,5,6 then 7,8,9,10,11,12: with slideInterval equal to windowLength, windows do not overlap.)
Windows - Operations
window,
countByWindow,
reduceByWindow,
reduceByKeyAndWindow,
countByValueAndWindow
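A minimal sketch of a windowed count over the word-count stream from earlier; window and slide durations must be multiples of the batch interval:

import org.apache.spark.streaming.Seconds

val windowedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce within the window
  Seconds(30),                 // window length
  Seconds(10)                  // slide interval
)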
Windows - Inverse Function Optimization
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
(Diagram: instead of re-reducing all of 4,5,6,7,8,9, the new window is derived from the previous window 1,2,3,4,5,6 by adding the entering batches 7, 8, 9 with func and removing the leaving batches 1, 2, 3 with the inverse function invFunc.)
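A minimal sketch of the inverse-function form for counts (this variant requires checkpointing to be enabled):

val optimizedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  _ + _,        // func: fold in values entering the window
  _ - _,        // invFunc: fold out values leaving the window
  Seconds(30),
  Seconds(10)
)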
Top 3 Starter Issues
Serialization
Where is my closure executing?
Do I have enough cores?
Demo 1
Anatomy of a Spark Streaming Application
Tweet a few keywords about your interests and experience, using the hashtag #sparkbe.
Example: Scala Spark Streaming DistributedSystems #sparkbe
Ready to dive in?
Deployment Options
Local: spark.master=local[*]
Standalone Cluster: spark.master=spark://host:port
Using a Cluster Manager: spark.master=mesos://host:port
(Diagram: Master (M), Driver (D), and Worker (W) processes in each topology.)
Deployment Options
Local: spark.master=local[*]
Standalone Cluster: spark.master=spark://host:port
Using a Cluster Manager: spark.master=mesos://host:port
(Diagram: in a streaming job, some executor slots are taken by Receivers (Rec) while the remaining executors (Exec) run the processing tasks.)
Scheduling
(Diagram sequence: consumers keep receiving data while a new batch is cut at every interval t0, t1, t2, ...)
Healthy case: batch #0 finishes processing before batch #1 is ready: Process Time < Batch Interval.
Unhealthy case: when processing takes longer than the batch interval, later batches (#2, #3, ...) pile up behind the ones still running and the Scheduling Delay keeps growing.
From Streams to μbatches
(Diagram: a consumer fills a block every blockInterval; the blocks collected during one batchInterval make up batch #0, #1, ...)
#partitions = receivers x batchInterval / blockInterval
From Streams to μbatches
(Diagram sequence: the blocks of batch #0 become the partitions of its RDD, and each partition is scheduled as a task on the available Spark executors.)
From Streams to μbatches
Inverting the formula to tune the block interval for a target number of partitions per core (partitionFactor):
spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
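A worked example with illustrative numbers: with batchInterval = 2000 ms, 2 receivers, 8 Spark cores, and partitionFactor = 1, blockInterval = 2000 x 2 / (1 x 8) = 500 ms, which yields 2 x 2000 / 500 = 8 partitions, one per core.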
The Importance of Caching
dstream.foreachRDD { rdd =>
  rdd.cache()                 // cache the RDD before iterating!
  keys.foreach { key =>
    rdd.filter(elem => key(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist()             // release the cached data once done
}
The Receiver Model
Rate limiting: spark.streaming.receiver.maxRate caps the records per second each receiver accepts.
Fault tolerance? Enable the Write-Ahead Log (WAL) so received data survives failures.
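A minimal configuration sketch (the rate value is illustrative; the WAL also needs a checkpoint directory configured on the StreamingContext):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000")              // records/sec per receiver
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL for zero data loss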
The Receiver Model
src: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Direct Kafka Stream
(Diagram: on each batch, the Driver computes the Kafka offset ranges, compute(offsets), and the executors read those ranges directly from the brokers, with no receiver involved.)
Kafka: The Receiver-less Model
Simplified parallelism: one RDD partition per Kafka partition
Efficiency: no replication or write-ahead log needed
Exactly-once semantics: offsets tracked by Spark instead of ZooKeeper
Trade-off: fewer degrees of freedom

val directKafkaStream =
  KafkaUtils.createDirectStream[
    [key class],
    [value class],
    [key decoder class],
    [value decoder class]](
  streamingContext,
  [map of Kafka parameters],
  [set of topics to consume])

Rate limiting: spark.streaming.kafka.maxRatePerPartition
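A minimal sketch instantiating that template for String keys and values (broker list and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("telemetry")
val directKafkaStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    streamingContext, kafkaParams, topics)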
Kafka: The Receiver-less Model
src: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Delivery Semantics
• Spark Streaming Receiver-based (< v1.2): roughly at-least-once
• Spark Streaming Receiver w/ WAL: at-least-once + zero data loss
• Spark Streaming Direct: at-least-once + zero data loss
• Spark Streaming Direct + offset management + idempotent writes | transactions: exactly-once
Spark Streaming (v1.5) Made Reactive
Backpressure support: the ingestion rate is adjusted dynamically by a proportional-integral-derivative (PID) controller, based on how fast recent batches were processed.
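Enabling it is a single configuration flag (Spark 1.5+):

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")  // PID-based rate control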
Demo 2
Spark Streaming Performance
Resources
Spark Streaming Official Programming Guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Backpressure in Spark Streaming:
http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata’s Spark Streaming tuning guide:
http://www.virdata.com/tuning-spark/
Spark Summit Presentations:
https://spark-summit.org/
Diving into Spark Streaming Execution Model:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Kafka direct approach:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
Questions?
Thanks!
Gerard Maas
@maasg
www.virdata.com
- we’re hiring -