Spark Streaming Programming Techniques You Should Know with Gerard Maas

At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten ‘ConstantInputDStream’ and will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.


  1. SPARK STREAMING PROGRAMMING TECHNIQUES YOU SHOULD KNOW. Gerard Maas, #EUstr2
  2. Gerard Maas. Sr. SW Engineer. Computer Engineer, Scala Programmer, Early Spark Adopter (v0.9), Spark Notebook Contributor, Cassandra MVP (2015, 2016), Stack Overflow Top Contributor (Spark, Spark Streaming, Scala). Wannabe { IoT Maker, Drone crasher/tinkerer }. @maasg, https://github.com/maasg, https://www.linkedin.com/in/gerardmaas/, https://stackoverflow.com/users/764040/maasg
  3. lightbend.com/fast-data-platform
  4. Agenda
     - Spark Streaming Refresher
       - Model
       - Operations
     - Techniques
       - Self-contained stream generation
       - Refreshing external data
       - Structured Streaming compatibility
       - Keeping arbitrary state
       - Probabilistic accumulators
  5. Spark Streaming Refresher
  6. [Architecture diagram] Inputs: Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom receivers. Processing: Apache Spark (SparkSQL, SparkML, ...). Outputs: Databases, HDFS, API Servers, Streams.
  7. API. Input / Process / Output, realized as DStream / Transformations / Output Operations.
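
     To ground that model, here is a minimal sketch of the Input / Process / Output shape (not from the slides; the socket source and word-count logic are illustrative assumptions):

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         val conf = new SparkConf().setAppName("api-shape").setMaster("local[2]")
         val ssc = new StreamingContext(conf, Seconds(10))

         // Input: a DStream from a socket source
         val lines = ssc.socketTextStream("localhost", 9999)

         // Process: transformations on the DStream
         val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

         // Output: an output operation triggers the actual execution each batch interval
         counts.print()

         ssc.start()
         ssc.awaitTermination()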
  8. Transformations: map, flatMap, filter; count, reduce, countByValue, reduceByKey; union, join, cogroup
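
     As an illustration of the key-based transformations (a sketch, not from the slides; the two pair DStreams are assumed):

         import org.apache.spark.streaming.dstream.DStream

         // Hypothetical pair DStreams keyed by device id
         val tempsByDevice: DStream[(String, Double)] = ???
         val locationByDevice: DStream[(String, String)] = ???

         // Aggregate per key within each batch, then join the two streams on the key
         val maxTempPerDevice = tempsByDevice.reduceByKey(math.max)
         val tempWithLocation = maxTempPerDevice.join(locationByDevice)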
  9. Transformations: mapWithState … (stateful transformation; slide diagram elided)
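
     The slide elides the details; as a hedged sketch of the mapWithState API, a running count per key could look like this (the input stream and state type are assumptions):

         import org.apache.spark.streaming.{State, StateSpec}
         import org.apache.spark.streaming.dstream.DStream

         // Hypothetical input: a pair DStream of (key, increment)
         val pairDStream: DStream[(String, Int)] = ???

         // Update function: reads the previous state, adds the new value, emits the new total
         def trackCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
           val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
           state.update(newCount)
           (key, newCount)
         }

         val runningCounts = pairDStream.mapWithState(StateSpec.function(trackCount _))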
  10. Transformations: transform

      val iotDstream = MQTTUtils.createStream(...)
      val devicePriority = sparkContext.cassandraTable(...)
      val prioritizedDStream = iotDstream.transform { rdd =>
        rdd.map(d => (d.id, d)).join(devicePriority)
      }
  11. Actions: print

      -------------------------------------------
      Time: 1459875469000 ms
      -------------------------------------------
      data1
      data2

      saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles

      foreachRDD *
  12. Actions

      def print(num: Int): Unit = ssc.withScope {
        def foreachFunc: (RDD[T], Time) => Unit = {
          (rdd: RDD[T], time: Time) => {
            val firstNum = rdd.take(num + 1)
            // scalastyle:off println
            println("-------------------------------------------")
            println(s"Time: $time")
            println("-------------------------------------------")
            firstNum.take(num).foreach(println)
            if (firstNum.length > num) println("...")
            println()
            // scalastyle:on println
          }
        }
        foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
      }
  13. Actions - foreachRDD

      dstream.foreachRDD { rdd =>
        rdd.cache()
        // Executes locally on the Driver: the REST call and the loop over alternatives
        val alternatives = restServer.get("/v1/alternatives").toSet
        alternatives.foreach { alternative =>
          val byAlternative = rdd.filter(element => element.kind == alternative)
          val asRecords = byAlternative.map(element => asRecord(element))
          // Executes distributed on the Workers: the per-partition inserts
          asRecords.foreachPartition { partition =>
            val conn = DB.connect(server)
            partition.foreach(element => conn.insert(element))
          }
        }
        rdd.unpersist(true)
      }
  14. Actions - foreachRDD. [Cluster diagram annotating the code from slide 13: the Master (M), the Driver (D) running the closure itself, and the Workers (W) executing the distributed RDD operations.]
  15. Ready to Dive in?
  16. [Section slide] Self-Contained Stream Generation
  17. ConstantInputDStream

      /**
       * An input stream that always returns the same RDD on each time step. Useful for testing.
       */
      class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])

      // Usage
      val constantDStream = new ConstantInputDStream(streamingContext, rdd)
  18. ConstantInputDStream: Generate Data

      import scala.util.Random

      val sensorId: () => Int = () => Random.nextInt(sensorCount)
      val data: () => Double = () => Random.nextDouble
      val timestamp: () => Long = () => System.currentTimeMillis

      // Generates records with random data
      val recordFunction: () => String = { () =>
        if (Random.nextDouble < 0.9) {
          Seq(sensorId().toString, timestamp(), data()).mkString(",")
        } else {
          // simulate 10% crap data as well… real world streams are seldom clean
          "!!~corrupt~^&##$"
        }
      }

      // RDD[() => Record]: each element is a function that produces a fresh record
      val sensorDataGenerator = sparkContext.parallelize(1 to n).map(_ => recordFunction)
      val sensorData = sensorDataGenerator.map(recordFun => recordFun())
      val rawDStream = new ConstantInputDStream(streamingContext, sensorData)
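
     A natural next step (a sketch, not on the slide) is to parse this raw stream defensively, since roughly 10% of the generated records are corrupt; the SensorData case class and the field layout are assumptions matching the generator above:

         import scala.util.Try

         // Hypothetical record type for the "id,timestamp,value" format generated above
         case class SensorData(sensorId: Int, timestamp: Long, value: Double)

         val parsedDStream = rawDStream.flatMap { record =>
           Try {
             val Array(id, ts, value) = record.split(",")
             SensorData(id.toInt, ts.toLong, value.toDouble)
           }.toOption // corrupt records fail to parse and are dropped
         }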
  19. [Section slide] Stream Enrichment with External Data
  20. ConstantInputDStream + foreachRDD = Reload External Data Periodically

      var sensorRef = sparkSession.read.parquet(s"$referenceFile")
      sensorRef.cache()

      val refreshDStream =
        new ConstantInputDStream(streamingContext, sparkContext.emptyRDD[Int])

      // Refresh data every 5 minutes
      val refreshIntervalDStream = refreshDStream.window(Seconds(300), Seconds(300))

      refreshIntervalDStream.foreachRDD { _ =>
        sensorRef.unpersist(false)
        sensorRef = sparkSession.read.parquet(s"$referenceFile")
        sensorRef.cache()
      }
  21. DStream + foreachRDD = Reload External Data with a Trigger

      var sensorRef = sparkSession.read.parquet(s"$referenceFile")
      sensorRef.cache()

      val triggerRefreshDStream: DStream[String] = ??? // create a DStream from a source, e.g. Kafka

      val referenceStream = triggerRefreshDStream.transform { rdd =>
        if (rdd.take(1).headOption.contains("refreshNow")) {
          sensorRef.unpersist(false)
          sensorRef = sparkSession.read.parquet(s"$referenceFile")
          sensorRef.cache()
        }
        sensorRef.rdd
      }

      incomingStream.join(referenceStream)
      ...
  22. Structured Streaming
  23. ForeachRDD + Datasets + Functional = Structured Streaming Portability

      val parse: Dataset[String] => Dataset[Record] = ???
      val process: Dataset[Record] => Dataset[Result] = ???
      val serialize: Dataset[Result] => Dataset[String] = ???

      // Structured Streaming
      val kafkaStream = spark.readStream…
      val f = parse andThen process andThen serialize
      val result = f(kafkaStream)
      result.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option("topic", writeTopic)
        .option("checkpointLocation", checkpointLocation)
        .start()

      // Spark Streaming
      val dstream = KafkaUtils.createDirectStream(...)
      dstream.foreachRDD { rdd =>
        val ds = sparkSession.createDataset(rdd)
        val f = parse andThen process andThen serialize
        val result = f(ds)
        result.write.format("kafka")
          .option("kafka.bootstrap.servers", bootstrapServers)
          .option("topic", writeTopic)
          .option("checkpointLocation", checkpointLocation)
          .save()
      }
  24. [Section slide] Keep Arbitrary State
  25. Keeping Arbitrary State

      var baseline: Dataset[Features] = sparkSession.read.parquet(targetFile).as[Features]
      …
      stream.foreachRDD { rdd =>
        val incomingData = sparkSession.createDataset(rdd)
        val incomingFeatures = rawToFeatures(incomingData)
        val analyzed = compare(incomingFeatures, baseline)
        // store analyzed data
        baseline = (baseline union incomingFeatures).filter(isExpired)
      }

      https://gist.github.com/maasg/9d51a2a42fc831e385cf744b84e80479
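
     The helpers are not shown on the slide; a sketch of plausible signatures (all types, names, and the retention rule are assumptions; see the gist above for the real code):

         // Hypothetical types and helpers implied by the snippet above
         case class Raw(/* raw record fields */)
         case class Features(sensorId: Int, timestamp: Long /*, ... */)

         def rawToFeatures(raw: Dataset[Raw]): Dataset[Features] = ???
         def compare(incoming: Dataset[Features], baseline: Dataset[Features]): Dataset[Features] = ???

         // Pruning predicate applied to the baseline; the exact expiry criterion is an assumption
         def isExpired(f: Features): Boolean =
           System.currentTimeMillis - f.timestamp > 24 * 60 * 60 * 1000L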
  26.-30. Keeping Arbitrary State (diagram-only slides; no text content)
  31. Keeping Arbitrary State: Roll your own checkpoints!

      var baseline: Dataset[Features] = sparkSession.read.parquet(targetFile).as[Features]
      var cycle = 1
      var checkpointFile = 0

      stream.foreachRDD { rdd =>
        val incomingData = sparkSession.createDataset(rdd)
        val incomingFeatures = rawToFeatures(incomingData)
        val analyzed = compare(incomingFeatures, baseline)
        // store analyzed data
        baseline = (baseline union incomingFeatures).filter(isOldFeature)
        cycle = (cycle + 1) % checkpointInterval
        if (cycle == 0) {
          // Alternate between two checkpoint files, write the current baseline,
          // and re-read it to cut the ever-growing lineage
          checkpointFile = (checkpointFile + 1) % 2
          baseline.write.mode("overwrite").parquet(s"${targetFile}_$checkpointFile")
          baseline = sparkSession.read.parquet(s"${targetFile}_$checkpointFile").as[Features]
        }
      }
  32. [Section slide] Probabilistic Accumulators (Σ)
  33. [Diagram] Exactness vs. Big Data vs. Real-time: a trade-off triangle.
  34. HyperLogLog: Cardinality Estimation

      accuracy = 1.054 / sqrt(2^p)

      [Diagram: Gb / Mb / Few Kb, the memory footprint shrinking to a few KB with the sketch.]
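
     To make the formula concrete (a worked example, not on the slide): with the precision p = 12 used later in the talk,

         accuracy = 1.054 / sqrt(2^12) = 1.054 / 64 ≈ 0.0165

     i.e. roughly 1.6% relative error on the cardinality estimate, while the sketch keeps only 2^12 = 4096 registers, a few KB, no matter how many distinct elements it has seen.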
  35. HLL Accumulator

      https://github.com/LearningSparkStreaming/HLLAccumulator

      import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
      import org.apache.spark.util.AccumulatorV2

      class HLLAccumulator[T](precisionValue: Int = 12) extends AccumulatorV2[T, Long] {

        private def instance(): HyperLogLogPlus = new HyperLogLogPlus(precisionValue, 0)
        private var hll: HyperLogLogPlus = instance()

        override def add(v: T): Unit = hll.offer(v)

        override def merge(other: AccumulatorV2[T, Long]): Unit = other match {
          case otherHllAcc: HLLAccumulator[T] => hll.addAll(otherHllAcc.hll)
          case _ => throw new UnsupportedOperationException(
            s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
        }

        // Remaining AccumulatorV2 members, elided on the slide (a plausible completion;
        // see the repository above for the actual implementation):
        override def isZero: Boolean = hll.cardinality() == 0
        override def copy(): AccumulatorV2[T, Long] = {
          val acc = new HLLAccumulator[T](precisionValue)
          acc.hll.addAll(hll)
          acc
        }
        override def reset(): Unit = hll = instance()
        override def value: Long = hll.cardinality()
      }
  36. Using Probabilistic Accumulators

      import learning.spark.streaming.HLLAccumulator

      val uniqueVisitorsAccumulator = new HLLAccumulator[String](precisionValue = 12)
      sc.register(uniqueVisitorsAccumulator, "unique-visitors")
      …
      clickStream.foreachRDD { rdd =>
        rdd.foreach { case BlogHit(ts, user, url) =>
          uniqueVisitorsAccumulator.add(user)
        }
        ...
        val currentUniqueVisitors = uniqueVisitorsAccumulator.value
        ...
      }
  37. [Section slide] Putting it all Together
  38. Questions?
  39. Thank You. @maasg
