Talk at the London Data Science Meetup, March 23 2016
http://www.meetup.com/Data-Science-London/events/229755935/
At Schibsted Technology we have been building predictive pipelines to model our users' attributes and behavioural traits, such as age, gender, interests and (buying) intent. Predictive models of these properties are fundamental to core parts of the business, such as our ad-targeting platform. In this talk I present the challenges we had to overcome to put together our scalable predictive pipelines, and how we used Spark ML features such as User Defined Aggregate Functions (UDAFs) to achieve this.
24. 24
Peek inside a Spark pipeline
It’s a Pipeline
plain Spark API
From DataFrame to a Model
25. 25
Peek inside a Spark pipeline
Instantiating a Pipeline
Running it!
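The code on these slides is not captured in the transcript. A minimal sketch of the plain Spark ML API being described, assuming a standard text-classification example (the stages and column names here are illustrative, not Schibsted's actual pipeline):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A Pipeline chains Transformers and a final Estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// Instantiating the Pipeline...
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// ...and running it: from a DataFrame to a Model.
// val model       = pipeline.fit(trainingDF)   // trainingDF has "text" and "label" columns
// val predictions = model.transform(testDF)
```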
26. 26
Example Pipeline
EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet).
UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary); provides ground-truth data for training.
EventPreprocessor: aggregates events per user.
GenderPredictor: creates labels and features, trains a classifier and computes predictions.
27. 27
Example Pipeline
GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under the ROC curve.
28. 28
Scalable Pipelines: pain points
29. 29
Scalable Pipelines: pain points
Input: 1 day's / 7 days' worth of events data.
Larger lookbacks are needed for better accuracy.
30. 30
More data for better performance
[Figure: performance of three different pipelines vs. lookback length, in days (1, 7, 30, 45)]
31. 31
Scalable Pipelines: pain points
What will happen if we try to process
30 days' worth of data (e.g. 3B events)?
32. 32
Scalable Pipelines: pain points
Memory- and processing-heavy:
● In one use case, for a 7-day lookback (~7 × 100M events) we used to need 20 Spark executors with 22 GB of memory each.
Not easily scalable:
● as the lookback increases;
● as more and more sites are incorporated into our pipelines.
Redundant processing:
● For K days of lookback, we repeat the processing of K - 2 days' worth of data when we run the pipeline every day, in a rolling-window fashion.
“What will happen if we try to process
30 days' worth of data (e.g. 3B events)?”
33. 33
Saved by Algebra
● The operations (op) along with the corresponding data structures (S) that
we are interested in are monoids.
○ Associative:
■ for all A,B,C in S, (A op B) op C = A op (B op C)
○ Identity element:
■ there exists E in S such that for each A in S, E op A = A op E = A
● Examples:
○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
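The per-user event aggregates discussed later in the talk are a concrete instance of this. A minimal pure-Scala sketch (the event names here are made up for illustration): merging maps of [String -> Double] by summing values per key is associative, and the empty map is the identity element.

```scala
// Merging two maps by summing values of matching keys.
def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
  b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

val day1 = Map("click" -> 2.0, "view" -> 5.0)
val day2 = Map("click" -> 1.0, "buy"  -> 1.0)
val day3 = Map("view"  -> 3.0)

// Associative: the grouping of per-day chunks does not matter,
// so they can be combined incrementally or in parallel.
assert(merge(merge(day1, day2), day3) == merge(day1, merge(day2, day3)))
// Identity element: the empty map leaves any aggregate unchanged.
assert(merge(day1, Map.empty[String, Double]) == day1)
```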
34. 34
Scalable Pipelines: in monoids fashion
● Split the aggregations in smaller chunks
○ i.e. pre-process events per user and single day (not over the entire lookback)
35. 35
Scalable Pipelines: in monoids fashion
● Make one- (or multiple-) day aggregates and combine
○ i.e. aggregate over the pre-processed events per user and day
36. 36
Scalable Pipelines: in monoids fashion
● It’s like trying to eat an elephant: one piece at a time!
38. 38
Scalable Pipelines: building blocks
● Imagine we had a MapAggregator for aggregating maps of [String -> Double].
● The spec for such an aggregator, implemented in Scala on Spark, could look like this. :-)
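The spec itself is not captured in the transcript. A hedged sketch of what such a spec might look like, in ScalaTest style; `MapAggregator.merge` is a hypothetical helper (summing values per key), not the original code:

```scala
import org.scalatest.{FlatSpec, Matchers}

// Hypothetical spec for a MapAggregator over Map[String, Double].
class MapAggregatorSpec extends FlatSpec with Matchers {

  "MapAggregator" should "sum the values of matching keys" in {
    MapAggregator.merge(Map("a" -> 1.0), Map("a" -> 2.0)) shouldBe Map("a" -> 3.0)
  }

  it should "keep keys that appear on only one side" in {
    MapAggregator.merge(Map("a" -> 1.0), Map("b" -> 2.0)) shouldBe
      Map("a" -> 1.0, "b" -> 2.0)
  }

  it should "treat the empty map as the identity element" in {
    MapAggregator.merge(Map("a" -> 1.0), Map.empty[String, Double]) shouldBe
      Map("a" -> 1.0)
  }
}
```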
40. 40
Scalable Pipelines: building blocks
● In Spark we can define our own functions, known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output.
● It is executed for each row of the DataFrame.
● It can also be parameterized.
● e.g. val myUDF = udf((myArg: MyType) => ...)
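A small illustrative sketch of both points; the `ageBucket` and `atLeast` functions and the column names are made-up examples, not from the talk:

```scala
import org.apache.spark.sql.functions.{col, udf}

// A plain UDF: executed once per row, here bucketing an age into a decade.
val ageBucket = udf((age: Int) => age / 10 * 10)

// A parameterized UDF: the threshold is captured at definition time.
def atLeast(threshold: Int) = udf((x: Int) => x >= threshold)

// Usage on a DataFrame with an "age" column:
// df.withColumn("ageBucket", ageBucket(col("age")))
//   .filter(atLeast(18)(col("age")))
```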
41. 41
Scalable Pipelines: UDAF
● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs can be used to compute custom calculations over groups of input data (in contrast, a UDF computes a value from a single input row).
● Examples: calculating the geometric mean, or the product of the values, for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.
● It updates this buffer for every input row.
● Once it has processed all input rows, it generates a result value based on the values of the aggregation buffer.
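A sketch of the geometric-mean example mentioned above, using the Spark 1.5-era `UserDefinedAggregateFunction` API (this is a common illustrative UDAF, not Schibsted's actual aggregator; the log-sum buffer is a numerical-stability choice of this sketch):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  // Input: a single Double column.
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  // Aggregation buffer: running count and running sum of logs.
  def bufferSchema: StructType = StructType(
    StructField("count", LongType) :: StructField("logSum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0.0
  }
  // Called for every input row of a group.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + 1L
    buffer(1) = buffer.getDouble(1) + math.log(input.getDouble(0))
  }
  // Merges two partial buffers -- this is where associativity pays off.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  def evaluate(buffer: Row): Any =
    math.exp(buffer.getDouble(1) / buffer.getLong(0))
}

// Usage: df.groupBy("group").agg(new GeometricMean()(col("value")))
```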
43. 43
Scalable Pipelines: adding a new stage
44. 44
Scalable Pipelines: adding a new stage
EventPreprocessor: aggregates events per user and day.
EventAggregator: aggregates pre-processed events per user over multiple days (lookback).
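A hedged sketch of how such a stage might be wired up; the `lastNDaysPaths` helper, the `eventCounts` column and the `MapAggregator` UDAF are assumptions for illustration, not the original code:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical: combine pre-processed per-user, per-day aggregates
// over the lookback window with a map-merging UDAF.
val mergeEventCounts = new MapAggregator()  // assumed UDAF summing Map[String, Double] values per key
val dailyAggregates  = sqlContext.read.parquet(lastNDaysPaths(30): _*)  // assumed helper: 30 daily Parquet paths

val lookbackAggregates = dailyAggregates
  .groupBy("userId")
  .agg(mergeEventCounts(col("eventCounts")).as("eventCounts"))
```

Because the per-day aggregates form a monoid, each day is processed once and only the cheap merge runs daily, instead of reprocessing the whole lookback.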
49. 49
Scalable Pipelines: closing remarks
● With User Defined Aggregate Functions, we have reduced the workload of
our pipelines by a factor of 20!
50. 50
Scalable Pipelines: closing remarks
● Obvious gains: freeing up resources that can be used for running even
more pipelines, faster, over even more input data
51. 51
Scalable Pipelines: closing remarks
● Needless to say, more factors contribute to a scalable pipeline:
○ Performance tuning of the Spark cluster
○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
52. 52
Scalable Pipelines: closing remarks
● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)
54. 54
Shameless plug
We are hiring!
Across all our hubs
in London, Oslo, Stockholm, Barcelona
for Data Science, Engineering, UX and Product roles
https://jobs.lever.co/schibsted
spt-recruiters@schibsted.com