Structured streaming for machine learning

Structured Streaming
For machine Learning! :)
kroszk@
Built with
experimental
APIs :)

Holden:
● Prefered pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos

Seth:
● Machine learning engineer at IBM’s Spark Technology Center
○ Working on high-performance, distributed machine learning for Spark MLlib
○ Also, structured streaming!
● Previously electrical engineering
● @shendrickson16
● Linkedin https://www.linkedin.com/in/sethah
● Github https://github.com/sethah
● SlideShare http://www.slideshare.net/SethHendrickson

Spark Technology
Center
5
IBM
Spark
Technology
Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark
community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business
applications — http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML is now an Apache Incubator project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Spark Technology
Center

What is going to be covered:
● Who we think y’all are
● Abridged Introduction to Datasets
● Abridged Introduction to Structured Streaming
● What Structured Streaming is and is not
● How to write simple structured streaming queries
● The exciting part: Building machine learning on top of structured streaming
● Possible future changes to make structured streaming & ML work together
nicely
Torsten Reuschling

Who we think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● Know some Apache Spark
● May or may not know the Dataset API
● Want to take advantage of Spark’s Structured Streaming
● May care about machine learning
● Possible distracted with the new Zelda game (although if you're still playing
Pokemon Go we can be friends)

ALPHA =~ Please don’t use this in production (yet)
Image by Mr Thinktank

What are Datasets?
● New in Spark 1.6
● Provide templated compile time strongly typed version of DataFrames
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFS? So do I!
● Still an experimental component (API will change in future versions)
● The basis of the Structured Streaming
Houser Wolf

Using Datasets to mix functional & relational:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
Sephiroty Magno Fiesta

So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the
return type). Without the as[]
will return a DataFrame
(Dataset[Row])
Traditional functional
reduction:
arbitrary scala code :)
Robert Couse-Baker

And functional style maps:
/**
* Functional map + Dataset, sums the positive attributes for the
pandas
*/
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
ds.map{rp => rp.attributes.filter(_ > 0).sum}
}

And now we can use it for streaming too!
● StructuredStreaming - new to Spark 2.0
○ Emphasis on new - be cautious when using
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still very early stages - but lots of really cool optimizations possible now
● We can build a machine learning pipeline with it together :)
○ Well we have to use some hacks - but ssssssh don’t tell TD
https://github.com/holdenk/spark-structured-streaming-ml

Architecture
Planner
(microbatch thread)
Streaming
Source
Streaming
Sink
Incremental
Execution
Incremental
Execution
Rich Bowen
...
time

Aggregates
abstract class UserDefinedAggregateFunction {
def initialize(buffer: MutableAggregationBuffer): Unit
def update(buffer: MutableAggregationBuffer, input: Row): Unit
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
def evaluate(buffer: Row): Any
}

Get a streaming dataset
// Read a streaming dataframe
val schema = new StructType()
.add("happiness", "double")
.add("coffees", "integer")
val streamingDS = spark
.readStream
.schema(schema)
.format(“parquet”)
.load(path)
Dataset
isStreaming = true
streaming
source

Build the recipe for each query
val happinessByCoffee = streamingDS
.groupBy($"coffees")
.agg(avg($"happiness"))
Dataset
isStreaming = true
streaming
source
Aggregate
groupBy = “coffees”
expr = avg(“happiness”)

Streaming Aggregation
data
partial_avg
avg
shuffle
Inject stateful
operators
Batch physical plan Streaming microbatch
physical plan
data
partial_avg
restore_state
shuffle
avg
avg
save_state
sink sink sink
t0
t1
t2
kitty.green66

Streaming Aggregation - Partial Average
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
0 -> (1, -150)
2 -> (1, 30)
8 -> (2, 350)
0 -> (1, -100)
2 -> (3, 90)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
0 -> (3, -350)
8 -> (1, 120)
2 -> (3, 100)
8 -> (1, 100)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
Executor 1 Executor 2
Data source
State store
Beatrice Murch

Streaming Aggregation - Shuffle
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
0 -> (1, -150)
0 -> (3, -350)
0 -> (1, -100)
2 -> (1, 30)
2 -> (3, 100)
2 -> (3, 90)
8 -> (1, 120)
8 -> (1, 100)
8 -> (2, 350)
Data source
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
State store
UnknownNet Photography

Streaming Aggregation - Merge Average
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
0 -> (5, -600) 2 -> (7, 220) 8 -> (4, 570)
Data source
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
State store
elminium

Streaming Aggregation - Restore State
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
0 -> (5, -600)
0 -> (5, -800)
2 -> (7, 220)
2 -> (3, 150)
8 -> (4, 570)
8 -> (4, 450)
Data source
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
State store
elminium

Streaming Aggregation - Merge Average
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020)
0 -> (5, -800) 2 -> (3, 150) 8 -> (4, 450)
State store
Data source
elminium

Streaming Aggregation - Save State
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
data
partial_avg
restore_state
shuffle
avg
avg
save_state
0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020)
0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020)
0 -> (10, -1400) 2 -> (10, 370) 8 -> (8, 1020)
State store
Data source
elminium

How to train a streaming ML model
1. Future: directly use structured streaming to create model streams via stateful
aggregators
○ https://spark-summit.org/eu-2016/events/online-learning-with-structured-streaming/
2. Today: use the sink to collect model updates and store them on the driver

Stateful Aggregator
data
partial_agg
final_agg
save_state
restore_initial
merge_agg
data1
dataP
...
zero
saved
state
data1
dataP
...
saved state
new aggregate
new model
Stateful Aggregator
Streaming Aggregator Stateful Aggregator
Physical Plan
Pedro Ribeiro Simões

Stateful Aggregator - Restore initial state
[w0
0
, …, wd
0
] [w0
0
, …, wd
0
] [w0
0
, …, wd
0
] [w0
0
, …, wd
0
]
(y0
, x0
)
...
(yk-1
, xk-1
)
(yk
, xk
)
...
(y2k-1
, x2k-1
)
(y2k
, x2k
)
...
(y3k-1
, x3k-1
)
(y3k
, x3k
)
...
(y4k-1
, x4k-1
)
Data source
data
partial_agg
save_state
restore_initial
merge_agg
State store
[w0
0
, …, wd
0
]

Stateful Aggregator - Partial Aggregation
(y0
, x0
)
...
(yk-1
, xk-1
)
(yk
, xk
)
...
(y2k-1
, x2k-1
)
(y2k
, x2k
)
...
(y3k-1
, x3k-1
)
(y3k
, x3k
)
...
(y4k-1
, x4k-1
)
Data source
data
partial_agg
save_state
restore_initial
merge_agg
State store
[w0
0
, …, wd
0
]
[w0
0
, …, wd
0
] [w0
0
, …, wd
0
] [w0
0
, …, wd
0
] [w0
0
, …, wd
0
]

Stateful Aggregator - Merge Aggregators
data
partial_agg
save_state
restore_initial
merge_agg
[w0
0
, …, wd
0
]
(y0
, x0
)
...
(yk-1
, xk-1
)
(yk
, xk
)
...
(y2k-1
, x2k-1
)
(y2k
, x2k
)
...
(y3k-1
, x3k-1
)
(y3k
, x3k
)
...
(y4k-1
, x4k-1
)
Model combining scheme,
e.g. weighted average
Data source
State store
[w0
0,1
, …, wd
0,1
]
...
[w0
0,P
, …, wd
0,P
]

Stateful Aggregator - Save State
(y0
, x0
)
...
(yk-1
, xk-1
)
(yk
, xk
)
...
(y2k-1
, x2k-1
)
(y2k
, x2k
)
...
(y3k-1
, x3k-1
)
(y3k
, x3k
)
...
(y4k-1
, x4k-1
)
Data source
State storedata
partial_agg
save_state
restore_initial
merge_agg
[w0
1
, …, wd
1
]
[w0
1
, …, wd
1
]

Batch ML pipelines
Tokenizer HashingTF String Indexer Naive Bayes
Tokenizer HashingTF
Streaming
String Indexer
Streaming
Naive Bayes
fit(df)
Estimator
Transformer
● In the batch setting, an estimator is trained on a dataset, and
produces a static, immutable transformer.
● There is no communication between the two.

Streaming ML pipelines
Tokenizer HashingTF String Indexer
Naive BayesTokenizer HashingTF
Streaming
String Indexer
Streaming
Naive Bayes
Model
Sink
Tokenizer HashingTF
Streaming
String Indexer
Model
Sink
Data
sink

Streaming ML Pipelines (Proof of Concept)
Tokenizer HashingTF
Streaming
String Indexer
Streaming
Naive Bayes
Tokenizer HashingTF
Streaming
String Indexer
Streaming
Naive Bayes
fit(df)
Estimator
Transformer
(mutable)
● In this implementation, the estimator produces an initial transformer, and
communicates updates to a specialized StreamingTransformer.
● Streaming transformers must provide a means of incorporating model
updates into predictions.
state state
Lauren Coolman

Streaming Estimator/Transformer (POC)
trait StreamingModel[S] extends Transformer {
def update(updates: S): Unit
}
trait StreamingEstimator[S] extends Estimator {
def model: StreamingModel[S]
def update(batch: Dataset[_]): Unit
}
Sufficient statistics for
model updates
Blinke
nArea

Getting a micro-batch view with distributed
collection*
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink
{
override def addBatch(batchId: Long, data: DataFrame): Unit = {
func(data)
}
}
https://github.com/holdenk/spark-structured-streaming-ml

And doing some ML with it:
def evilTrain(df: DataFrame): StreamingQuery = {
val sink = new ForeachDatasetSink({df: DataFrame => update(df)})
val sparkSession = df.sparkSession
val evilStreamingQueryManager =
EvilStreamingQueryManager(sparkSession.streams)
evilStreamingQueryManager.startQuery(
Some("snb-train"),
None,
df,
sink,
OutputMode.Append())
}

And doing some ML with it:
def update(batch: Dataset[_]): Unit = {
val newCountsByClass = add(batch)
model.update(newCountsByClass)
} Aggregate new batch
Merge with previous aggregates

And doing some ML with it*
(Algorithm specific)
def update(updates: Array[(Double, (Long, DenseVector))]): Unit = {
updates.foreach { case (label, (numDocs, termCounts)) =>
countsByClass.get(label) match {
case Some((n, c)) =>
axpy(1.0, termCounts, c)
countsByClass(label) = (n + numDocs, c)
case None =>
// new label encountered
countsByClass += (label -> (numDocs, termCounts))
}
}
}

Non-Evil alternatives to our Evil:
● ForeachWriter exists
● Since everything runs on the executors it's difficult to update the model
● You could:
○ Use accumulators
○ Write the updates to Kafka
○ Send the updates to a param server of some type with RPC
○ Or do the evil things we did instead :)
● Wait for the “future?”: https://github.com/apache/spark/pull/15178
_torne

Working with the results - foreach (1 of 2)
val foreachWriter: ForeachWriter[T] =
new ForeachWriter[T] {
def open(partitionId: Long, version: Long): Boolean = {
True // always open
}
def close(errorOrNull: Throwable): Unit = {
// No close logic - if we wanted to copy updates per-batch
}
def process(record: T): Unit = {
db.update(record)
}
}

Working with the results - foreach (2 of 2)
// Apply foreach
happinessByCoffee.writeStream.outputMode(OutputMode.Complete())
foreach(foreachWriter).start()

Structured Streaming in Review:
● Structured Streaming still uses Spark’s Microbatch approach
● JIRA discussion indicates an interest in swapping out the execution engine
(but no public design document has emerged yet)
● One of the areas that Matei is researching
○ Researching ==~ future , research !~ today
Windell Oskay

Ok but where can we not use it?
● A lot of random methods on DataFrames & Datasets won’t work
● They will fail at runtime rather than compile time - so have tests!
● Anything which roundtrips through an rdd() is going to be pretty sad (aka fail)
○ Lot’s of internals randomly do (like toJson) for historical reasons
● Need to run a query inside of a sink? That is not going to work
● Need a complex receiver type? Many receivers are not ported yet
● Also you will need distinct query names - even if you stop the previous query.
● Aggregations and Append output mode (and the only file sink requires
Append)
● DataFrame/Dataset transformations inside of a sink

Open questions for ML pipelines
● How to train and predict simultaneously, on the same data?
○ Transform thread should be executed first
○ Do we actually need to support this or is this just a common demo?
● How to ensure robustness to failures?
○ Treat the output of training as a stream of models, with the same robustness guarantees of
any structured streaming query
○ Work based on this approach has already been prototyped
● Model training must be idempotent - should not train on the same data twice
○ Leverage batch ID, similar to `FileStreamSink`
● How to extend MLWritable for streaming
○ Spark’s format isn’t really all that useful - maybe PMML or PFA
Photo by
bullet101

Structured Streaming ML vs DStreams ML
What could be different for ML on structured streaming vs ML on DStreams?
● Structured streaming is built on the Spark SQL engine
○ Catalyst optimizer
○ Project tungsten
● Pipeline integration
○ ML pipelines have been improved and iterated across 5 releases, we can leverage their
mature design for streaming pipelines
○ This will make adding and working with new algorithms much easier than in the past
● Event time handling
○ Streaming ML algorithms typically use a decay factor
○ Structured streaming provides native support for event time, which is more appropriate for
decay
Krzysztof Belczyński

Batch vs Streaming Pipelines (Draft POC API)
val df = spark
.read
.schema(schema)
.parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val nb = new NaiveBayes()
val pipeline = new Pipeline()
.setStages(Array(tokenizer, htf, nb))
val pipelineModel = pipeline.fit(df)
val df = spark
.readStream
.schema(schema)
.parquet(path)
val tokenizer = new RegexTokenizer()
val htf = new HashingTF()
val snb = new StreamingNaiveBayes()
val pipeline = new StreamingPipeline()
.setStages(Array(tokenizer, htf, snb))
.setCheckpointLocation(path)
val query = pipeline.fitStreaming(df)
query.awaitTermination()
https://github.com/sethah/spark/tree/structured-streaming-fun

Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau

Structured Streaming Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/structured-streaming-programming
-guide.html
● https://github.com/holdenk/spark-structured-streaming-ml
● TD
https://spark-summit.org/2016/events/a-deep-dive-into-structured-st
reaming/

Learning Spark
Fast Data
Processing with
Spark
(Out of Date)
Fast Data
Processing with
Spark
(2nd edition)
Advanced
Analytics with
Spark
Spark in Action
Coming soon:
High Performance Spark
Learning PySpark

The next book…..
Available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Extending ML is covered in Chapter 9
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
● Should be finished between May 22nd ~ June 18nth :D
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.

Surveys!!!!!!!! :D
● Interested in Structured Streaming?
○ http://bit.ly/structuredStreamingML - Let us know your thoughts
● Pssst: Care about Python DataFrame UDF
Performance?
○ http://bit.ly/pySparkUDF
● Care about Spark Testing?
○ http://bit.ly/holdenTestingSpark
Michael
Himbeault

And some upcoming talks:
● May
○ Tomorrow - Strata London Talk 2
○ 3rd Data Science Summit Europe in Israel
● June
○ Scala Days CPH
○ Spark Summit West (SF)
○ Berlin Buzzwords
○ Scala Swarm (Porto, Portugal)

k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)

Start a continuous query
val query = happinessByCoffee
.writeStream
.outputMode(“complete”)
.trigger(ProcessingTime(5.seconds))
.start()
StreamingQuery
source
relation
groupBy
avglogicalPlan =

Launch a new thread to listen for new data
Source
Available Offsets:
Sink
Committed Offsets:
MicroBatch Thread
StreamingQuery
source
relation
groupBy
avglogicalPlan =
Listening
Neil Falzon

Write new offsets to WAL
Source
Available Offsets:
0, 1, 2
Sink
Committed Offsets:
MicroBatch Thread
StreamingQuery
source
relation
groupBy
avglogicalPlan =
Commit to WAL
April Weeks

Check the source for new offsets
Source
Available Offsets:
0, 1, 2
Sink
Committed Offsets
MicroBatch Thread
StreamingQuery
source
relation
groupBy
avglogicalPlan =
batchId=
42
getBatch()
cat-observer

Get the “recipe” for this micro batch
Source
Available Offsets:
0, 1, 2
Sink
Committed Offsets
MicroBatch Thread
StreamingQuery
source
relation
groupBy
avglogicalPlan =
source
scan
groupBy
avg
batchId=
42
transform
Jackie

Send the micro batch Dataset to the sink
Source
Available Offsets:
0, 1, 2
Sink
Committed Offsets
MicroBatch Thread
StreamingQuery
source
scan
groupBy
avglogicalPlan =
groupBy
avg
MicroBatch Dataset
isStreaming = false
addBatch()
batchId=
42
Backed by an incremental
execution plan
Jason Rojas

Commit and listen again
Source
Available Offsets:
0, 1, 2
Sink
Committed Offsets:
0, 1, 2
MicroBatch Thread
StreamingQuery
source
scan
groupBy
avglogicalPlan =
Listening
S Orchard

Execution Summary
● Each query has its own thread - asynchronous
● Sources must be replayable
● Use write-ahead-logs for durability
● Sinks must be idempotent
● Each batch is executed with an incremental execution
plan
● Sinks get a micro batch view of the data
snaxor

Cool - lets build some ML with it!
Lauren Coolman

Get a dataframe
val schema = new StructType()
.add("happiness", "double")
.add("coffees", "integer")
val batchDS = spark
.read
.schema(schema)
.load(path)
Dataset
data
source
isStreaming = false

Build the recipe for each query
val happinessByCoffee = batchDS
.groupBy($"coffees")
.agg(avg($"happiness"))
Dataset
isStreaming = false
data
source
Aggregate
groupBy = “coffees”
expr = avg(“happiness”)

Batch Aggregation - Partial Average
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
0 -> (1, -150)
2 -> (1, 30)
8 -> (2, 350)
0 -> (1, -150)
2 -> (3, 90)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
0 -> (3, -350)
8 -> (1, 120)
2 -> (3, 100)
8 -> (1, 100)
data
partial_avg
avg
shuffle
Data source

Batch Aggregation - Shuffle
data
partial_avg
avg
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
0 -> (1, -150)
0 -> (3, -350)
0 -> (1, -150)
2 -> (1, 30)
2 -> (3, 100)
2 -> (3, 90)
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
8 -> (1, 120)
8 -> (1, 100)
8 -> (2, 350)shuffle
Data source

Batch Aggregation - Final Average
data
partial_avg
avg
(0, -150)
(2, 30)
(8, 200)
(8, 150)
(0, -100)
(2, 30)
(2, 10)
(2, 50)
0 -> (5, -650) 2 -> (7, 220)
Executor 1
(0, -150)
(0, 0)
(0, -200)
(8, 120)
(8, 100)
(2, 20)
(2, 20)
(2, 60)
8 -> (4, 570)
shuffle
Executor 2
Data source

Structured streaming for machine learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Structured streaming for machine learning

Similar to Structured streaming for machine learning (20)

Recently uploaded

Recently uploaded (20)

Structured streaming for machine learning