Gerard Maas
Señor SW Engineer
Computer Engineer
Scala Developer
Early Spark Adopter (v0.9)
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor (Spark, Spark Streaming, Scala)
Wannabe {
IoT Maker
Drone crasher/tinkerer
}
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Streaming | Big Data
100 TB at rest vs. 5 MB/s in motion:
∑ Stream = Dataset, and a Dataset, replayed over time, is a Stream.
What is Spark and Why We Should Care
Streaming APIs in Spark
- Structured Streaming Overview
- Interactive Session 1
- Spark Streaming Overview
- Interactive Session 2
Spark Streaming [AND|OR|XOR] Structured Streaming
Once upon a time...
[Diagram: the Spark stack. Apache Spark Core at the base; Spark SQL, Spark MLlib, Spark Streaming, and Structured Streaming built on top; DataFrames, Datasets, and GraphFrames as structured APIs; pluggable Data Sources.]
1. Structured Streaming
Structured Streaming
[Diagram: sources (Kafka, Sockets, HDFS/S3, Custom) feed a Streaming DataFrame; a continuous Query writes to sinks (Kafka, Files, foreachSink, console, memory) according to the configured Output Mode.]
Demo Scenario
[Diagram: a Sensor Data Producer feeds Structured Streaming, running on the Fast Data Platform with a Spark Notebook and a Local Process.]
1. Structured Streaming: Highlights
Sources
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
Operations
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
Event Time
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
Sinks
val visualizationQuery = sensorData.writeStream
  .queryName("visualization") // this query name will be the SQL table name
  .outputMode("update")
  .format("memory")
  .start()
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter") // this query name will be the table name
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Join Streams with Fixed Datasets (see the sketch below)
● Apply Machine Learning Models
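A minimal sketch of the stream-static join case, assuming the streaming sensorData Dataset from the demo and a hypothetical static deviceInfo table (the path and column names are illustrative):

// Hypothetical static reference data, loaded once
val deviceInfo = sparkSession.read
  .format("parquet")
  .load("/data/devices")   // assumed location, keyed by "id"

// Stream-static join: every micro-batch of sensorData is joined
// against the fixed deviceInfo table
val enriched = sensorData.join(deviceInfo, Seq("id"))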
2. Spark Streaming
Spark Streaming
[Diagram: sources (Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom) feed Spark Streaming, which runs on Apache Spark alongside Spark SQL, Spark ML, ...; results flow to Databases, HDFS, API Servers, and downstream Streams.]
[Diagram: a DStream[T] is a sequence of RDD[T] micro-batches, one per interval t0, t1, t2, ..., ti, ti+1; a transformation T -> U turns each RDD[T] into an RDD[U], and actions consume the resulting stream.]
API: Transformations
● map, flatMap, filter
● count, reduce, countByValue, reduceByKey
● union, join, cogroup
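As an illustration of these element-wise and key-based operations, a hedged word-count sketch over a hypothetical DStream[String] named lines:

// lines: hypothetical DStream[String] of incoming text
val words = lines.flatMap(_.split("\\s+"))  // split lines into words
val pairs = words.map(word => (word, 1))    // key each word
val counts = pairs.reduceByKey(_ + _)       // per-batch word counts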
API: Transformations
● mapWithState …
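The slide elides the details. As a hedged sketch, mapWithState maintains per-key state across micro-batches; here a hypothetical DStream[(String, Double)] of (deviceId, temperature) pairs tracks a running maximum per device:

import org.apache.spark.streaming.{State, StateSpec}

// Mapping function: (key, new value, persistent state) => output record
def trackMax(id: String, temp: Option[Double], state: State[Double]): (String, Double) = {
  val max = math.max(temp.getOrElse(Double.MinValue),
                     state.getOption.getOrElse(Double.MinValue))
  state.update(max)  // persist the running maximum for this key
  (id, max)
}

// sensorPairs: hypothetical DStream[(String, Double)]
val maxTemps = sensorPairs.mapWithState(StateSpec.function(trackMax _))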
API: Transformations
● transform: arbitrary RDD code applied to each micro-batch
val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions
● print: shows a sample of each batch on the console:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2
● saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles: write each batch out to storage
● foreachRDD (*): the general-purpose action
(*) Within foreachRDD you can use the whole engine: Spark SQL, DataFrames, GraphFrames, any API.
Demo Scenario
[Diagram: the Sensor Data Producer setup again: producer (local process) feeding the streaming job on the Fast Data Platform, driven from a Spark Notebook.]
2. Spark Streaming: Highlights
Streaming Context
import org.apache.spark.streaming.StreamingContext
val streamingContext = new StreamingContext(sparkContext, interval)
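Not shown on the slide: the context starts consuming only once explicitly started, after all sources and outputs have been declared. A minimal lifecycle sketch:

// ... declare sources, transformations and actions first, then:
streamingContext.start()             // begin processing
streamingContext.awaitTermination()  // block until stopped or failed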
Source
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)
@transient val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
Transformations
import sparkSession.implicits._

val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }  // keep the JSON payload
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = sparkSession.read.json(ds)      // parse JSON per batch
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
Model
val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)
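M2Model is specific to the demo, but its trainOn/predictOnValues shape matches Spark's built-in streaming models. A hedged sketch with the built-in StreamingKMeans (stream names are hypothetical):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

val trainingData: DStream[Vector] = ???         // features to learn from
val keyedData: DStream[(String, Vector)] = ???  // (id, features) to score

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)  // feature dim = 2, initial weight = 0

model.trainOn(trainingData)                    // update centers each batch
val scored = model.predictOnValues(keyedData)  // DStream[(String, Int)]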
Output
suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}
Use Cases
● Stream-stream joins (see the sketch below)
● Complex state management (local + cluster state)
● Streaming Machine Learning
  ○ Learn
  ○ Score
● Join Streams with Updatable Datasets
● [-] Event-time oriented analytics (not built in)
● [-] Continuous processing (micro-batch only)
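A hedged sketch of the stream-stream join case; readings and commands are hypothetical pair DStreams keyed by device id:

import org.apache.spark.streaming.dstream.DStream

val readings: DStream[(String, Double)] = ???  // (deviceId, temperature)
val commands: DStream[(String, String)] = ???  // (deviceId, command)

// join matches keys within the RDDs of the same micro-batch
val joined: DStream[(String, (Double, String))] = readings.join(commands)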
Spark Streaming + Structured Streaming

Structured Streaming:
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)

result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()
Spark Streaming:
val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>  // foreachRDD (not map): we run a write action per batch
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
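The ??? bodies are left open on the slide. As one hypothetical way to realize parse, reusing the JSON approach from the demo (schema and SensorData are assumed to be in scope):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.from_json
import sparkSession.implicits._

// Hypothetical: Record = SensorData, with `schema` describing the JSON payload
val parse: Dataset[String] => Dataset[SensorData] = { raw =>
  raw.select(from_json($"value", schema) as "record")  // "value" is the default column name
     .select("record.*")
     .as[SensorData]
}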
Streaming Pipelines (example)
[Diagram: two Structured Streaming jobs in sequence: Keyword Extraction → Keyword Relevance → Similarity → DB Storage.]
New Project? 80% / 20%
lightbend.com/fast-data-platform
Features:
1. One-click component installations
2. Automatic dependency checks
3. One-click access to install logs
4. Real-time cluster visualization
5. Access to consolidated production logs
Benefits:
1. Easy to get started
2. Ready access to all components
3. Increased developer velocity
Fast Data Platform Manager, for Managing Running Clusters
lightbend.com/learn
If you’re serious about having end-to-end monitoring for your Fast Data and streaming applications, let’s chat!
SET UP A 20-MIN DEMO
A Tale of Two APIs: Using Spark Streaming In Production