SPARK STREAMING PROGRAMMING TECHNIQUES YOU SHOULD KNOW
Gerard Maas
#EUstr2
Gerard Maas
Sr SW Engineer
Computer Engineer
Scala Programmer
Early Spark Adopter (v0.9)
Spark Notebook Contributor
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor
(Spark, Spark Streaming, Scala)
Wannabe {
IoT Maker
Drone crasher/tinkerer
}
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
3
lightbend.com/fast-data-platform
@maasg #EUstr2
Agenda
4
- Spark Streaming Refresher
- Model
- Operations
- Techniques
- Self-contained stream generation
- Refreshing external data
- Structured Streaming compatibility
- Keeping arbitrary state
- Probabilistic accumulators
Spark Streaming Refresher
5
@maasg #EUstr2
6
Sources: Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom
  → Apache Spark (SparkSQL, SparkML, ...)
  → Sinks: Databases, HDFS, API Server, Streams
@maasg #EUstr2
API
7
Input: DStream
Process: Transformations
Output: Output Operations
@maasg #EUstr2
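A minimal sketch of this Input → Process → Output model, assuming a local socket source on port 9999 (all names are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-model").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

val lines   = ssc.socketTextStream("localhost", 9999)  // Input: DStream[String]
val lengths = lines.map(_.length)                      // Process: transformation
lengths.print()                                        // Output: output operation

ssc.start()
ssc.awaitTermination()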
Transformations
map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
8
@maasg #EUstr2
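A quick sketch of the element-wise and key-based transformations listed above, assuming a DStream[String] named lines:

val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)   // per-batch word counts
val total  = words.count()              // per-batch element count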
Transformations
mapWithState
…
…
9
@maasg #EUstr2
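A hedged sketch of mapWithState keeping a running count per key; it assumes a DStream[(String, Int)] named pairs, and trackCount is an illustrative name:

import org.apache.spark.streaming.{State, StateSpec}

def trackCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val runningCounts = pairs.mapWithState(StateSpec.function(trackCount _))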
Transformations
transform
val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform{rdd =>
rdd.map(d => (d.id, d)).join(devicePriority)
}
10
@maasg #EUstr2
Actions
print
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2
saveAsTextFiles,
saveAsObjectFiles,
saveAsHadoopFiles
foreachRDD
11
@maasg #EUstr2
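A brief sketch of the first two kinds of output operations, on an assumed DStream[String] named dstream (the path is illustrative); foreachRDD is covered in detail on the next slides:

dstream.print()                                       // prints the first 10 elements of every batch
dstream.saveAsTextFiles("hdfs:///data/clicks", "txt") // writes one directory per batch interval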
12
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println(s"Time: $time")
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
Actions
@maasg #EUstr2
13
Actions - foreachRDD
dstream.foreachRDD{ rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet
  alternatives.foreach{ alternative =>
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    asRecords.foreachPartition{ partition =>
      val conn = DB.connect(server)
      partition.foreach(element => conn.insert(element))
    }
  }
  rdd.unpersist(true)
}
Code in the foreachRDD closure itself (e.g. the REST call) executes locally on the Driver.
Code inside the RDD operations (filter, map, foreachPartition) executes distributed on the Workers.
@maasg #EUstr2
14
Actions - foreachRDD
[Diagram: Spark cluster with Master (M), Driver (D), and Worker (W) nodes]
dstream.foreachRDD{ rdd =>
  rdd.cache()
  val alternatives = restServer.get("/v1/alternatives").toSet
  alternatives.foreach{ alternative =>
    val byAlternative = rdd.filter(element => element.kind == alternative)
    val asRecords = byAlternative.map(element => asRecord(element))
    asRecords.foreachPartition{ partition =>
      val conn = DB.connect(server)
      partition.foreach(element => conn.insert(element))
    }
  }
  rdd.unpersist(true)
}
@maasg #EUstr2
15
Ready to Dive in?
@maasg #EUstr2
16
Self Contained Stream Generation
@maasg #EUstr2
ConstantInputDStream
17
/**
* An input stream that always returns the same RDD on each time step. Useful for testing.
*/
class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
// Usage
val constantDStream = new ConstantInputDStream(streamingContext, rdd)
@maasg #EUstr2
ConstantInputDStream: Generate Data
18
import scala.util.Random

val sensorId: () => Int = () => Random.nextInt(sensorCount)
val data: () => Double = () => Random.nextDouble
val timestamp: () => Long = () => System.currentTimeMillis

// Generates records with Random data
val recordFunction: () => String = { () =>
  if (Random.nextDouble < 0.9) {
    Seq(sensorId().toString, timestamp(), data()).mkString(",")
  } else {
    // simulate 10% crap data as well… real world streams are seldom clean
    "!!~corrupt~^&##$"
  }
}

val sensorDataGenerator = sparkContext.parallelize(1 to n).map(_ => recordFunction)
val sensorData = sensorDataGenerator.map(recordFun => recordFun())
val rawDStream = new ConstantInputDStream(streamingContext, sensorData)
RDD[() => Record]
@maasg #EUstr2
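A possible follow-up step (a sketch; SensorReading and the parsing logic are illustrative additions, not part of the original slide): parse the generated CSV records and silently drop the ~10% corrupt ones.

import scala.util.Try

case class SensorReading(sensorId: Int, timestamp: Long, value: Double)

val sensorReadings = rawDStream.flatMap { record =>
  record.split(",") match {
    case Array(id, ts, value) =>
      Try(SensorReading(id.toInt, ts.toLong, value.toDouble)).toOption.toSeq
    case _ => Seq.empty[SensorReading]   // corrupt record: discard
  }
}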
19
Stream Enrichment with External Data
@maasg #EUstr2
ConstantInputDStream + foreachRDD =
Reload External Data Periodically
20
var sensorRef = sparkSession.read.parquet(s"$referenceFile")
sensorRef.cache()
val refreshDStream = new ConstantInputDStream(streamingContext, sparkContext.emptyRDD[Int])
// Refresh data every 5 minutes
val refreshIntervalDStream = refreshDStream.window(Seconds(300), Seconds(300))
refreshIntervalDStream.foreachRDD{ _ =>
  sensorRef.unpersist(false)
  sensorRef = sparkSession.read.parquet(s"$referenceFile")
  sensorRef.cache()
}
@maasg #EUstr2
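One way the refreshed reference could be consumed (a hypothetical sketch; the Reading type and the sensorId column are assumptions, and incomingStream is the main data stream as on the next slide): transform runs its function on the driver at every batch interval, so it always picks up the latest sensorRef.

val enriched = incomingStream.transform { rdd =>
  val refById = sensorRef.rdd.map(row => (row.getAs[Int]("sensorId"), row))
  rdd.map(reading => (reading.sensorId, reading)).join(refById)
}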
DStream + foreachRDD =
Reload External Data with a Trigger
21
var sensorRef = sparkSession.read.parquet(s"$referenceFile")
sensorRef.cache()
val triggerRefreshDStream: DStream[String] = ??? // create a DStream from a source, e.g. Kafka
val referenceStream = triggerRefreshDStream.transform { rdd =>
  if (rdd.take(1).headOption.contains("refreshNow")) {
    sensorRef.unpersist(false)
    sensorRef = sparkSession.read.parquet(s"$referenceFile")
    sensorRef.cache()
  }
  sensorRef.rdd
}
incomingStream.join(referenceStream) ...
@maasg #EUstr2
22
Structured Streaming
@maasg #EUstr2
ForeachRDD + Datasets + Functional =
Structured Streaming Portability
23
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???
// Structured Streaming
val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
.format("kafka")
.option("kafka.bootstrap.servers",bootstrapServers)
.option("topic", writeTopic)
.option("checkpointLocation", checkpointLocation)
.start()
// Spark Streaming
val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD{ rdd =>
  val ds = sparkSession.createDataset(rdd.map(_.value))  // extract the record values as Dataset[String]
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
@maasg #EUstr2
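For completeness, one possible shape of the shared functions (purely illustrative; Record, Result and the parsing/business logic are assumptions, not defined in the talk):

case class Record(sensorId: Int, value: Double)
case class Result(sensorId: Int, score: Double)

import sparkSession.implicits._

val parse: Dataset[String] => Dataset[Record] = ds =>
  ds.flatMap { line =>
    line.split(",") match {
      case Array(id, v) => scala.util.Try(Record(id.toInt, v.toDouble)).toOption.toSeq
      case _            => Seq.empty[Record]
    }
  }

val process: Dataset[Record] => Dataset[Result] = ds =>
  ds.map(r => Result(r.sensorId, r.value * 2))   // placeholder business logic

val serialize: Dataset[Result] => Dataset[String] = ds =>
  ds.map(r => s"${r.sensorId},${r.score}")

Because the three functions only depend on the Dataset API, the same composed f runs unchanged in both engines.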
24
Keep Arbitrary State
@maasg #EUstr2
Keeping Arbitrary State
25
var baseline: Dataset[Features] = sparkSession.read.parquet(targetFile).as[Features]
…
stream.foreachRDD{ rdd =>
  val incomingData = sparkSession.createDataset(rdd)
  val incomingFeatures = rawToFeatures(incomingData)
  val analyzed = compare(incomingFeatures, baseline)
  // store analyzed data
  baseline = (baseline union incomingFeatures).filter(isExpired)
}
https://gist.github.com/maasg/9d51a2a42fc831e385cf744b84e80479
@maasg #EUstr2
[Slides 26–30: Keeping Arbitrary State — diagrams]
Keeping Arbitrary State: Roll your own checkpoints!
31
var baseline: Dataset[Features] = sparkSession.read.parquet(targetFile).as[Features]
var cycle = 1
var checkpointFile = 0
stream.foreachRDD{ rdd =>
  val incomingData = sparkSession.createDataset(rdd)
  val incomingFeatures = rawToFeatures(incomingData)
  val analyzed = compare(incomingFeatures, baseline)
  // store analyzed data
  baseline = (baseline union incomingFeatures).filter(isOldFeature)
  cycle = (cycle + 1) % checkpointInterval
  if (cycle == 0) {
    checkpointFile = (checkpointFile + 1) % 2
    // alternate between two checkpoint locations and re-read to cut the growing lineage
    baseline.write.mode("overwrite").parquet(s"${targetFile}_$checkpointFile")
    baseline = sparkSession.read.parquet(s"${targetFile}_$checkpointFile").as[Features]
  }
}
@maasg #EUstr2
32
Probabilistic Accumulators
@maasg #EUstr2
33
[Diagram: Exactness / Big Data / Real-time]
@maasg #EUstr2
HyperLogLog: Cardinality Estimation
34
accuracy = 1.054 / sqrt(2^p)
[Memory footprint: GB → MB → a few KB]
@maasg #EUstr2
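For example, with the precision used on the next slides (p = 12): accuracy ≈ 1.054 / sqrt(2^12) = 1.054 / 64 ≈ 0.016, i.e. roughly a 1.6% relative error on the cardinality estimate, for only a few KB of memory.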
HLL Accumulator
35
https://github.com/LearningSparkStreaming/HLLAccumulator
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.util.AccumulatorV2

// Excerpt: isZero, copy and reset are omitted here — see the full code at the link above
class HLLAccumulator[T](precisionValue: Int = 12) extends AccumulatorV2[T, Long] {
  private def instance(): HyperLogLogPlus = new HyperLogLogPlus(precisionValue, 0)
  private var hll: HyperLogLogPlus = instance()

  override def add(v: T): Unit = hll.offer(v)

  override def value: Long = hll.cardinality()

  override def merge(other: AccumulatorV2[T, Long]): Unit = other match {
    case otherHllAcc: HLLAccumulator[T] => hll.addAll(otherHllAcc.hll)
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
}
@maasg #EUstr2
Using Probabilistic Accumulators
36
import learning.spark.streaming.HLLAccumulator
val uniqueVisitorsAccumulator = new HLLAccumulator[String](precisionValue = 12)
sc.register(uniqueVisitorsAccumulator, "unique-visitors")
…
clickStream.foreachRDD{ rdd =>
  rdd.foreach{
    case BlogHit(ts, user, url) => uniqueVisitorsAccumulator.add(user)
  }
  ...
  val currentUniqueVisitors = uniqueVisitorsAccumulator.value
  ...
}
@maasg #EUstr2
37
Putting it all Together
@maasg #EUstr2
38
Questions?
@maasg #EUstr2
39
Thank You
@maasg
