THE FUTURE IS NOW
Scalable Predictive Pipelines
with Spark and Scala
Dimitris Papadopoulos
About Schibsted
Event Tracking Data
Data Science Tasks
Data
Model
Results
Preprocessing
Outline
1. Using Spark ML Pipelines
2. Scalable Pipelines
Pipeline
Not a pipe
Pipeline Stage
● One or more inputs
● Strictly one output
● Closed under concatenation
● Standalone and runnable
● Spark™ ML inside
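The stage properties listed above can be illustrated with a small sketch in plain Scala (no Spark involved). The `Stage` trait and the two toy stages are hypothetical names introduced here for illustration, not classes from the talk. The only point is that chaining two stages yields another stage, which is what "closed under concatenation" means.

```scala
object StageSketch {
  // Hypothetical stage abstraction: consumes an input, produces exactly one output.
  // Several inputs can be modelled by making A a tuple.
  trait Stage[A, B] { self =>
    def run(input: A): B

    // Concatenating two stages gives another stage (closure under concatenation).
    def andThen[C](next: Stage[B, C]): Stage[A, C] = new Stage[A, C] {
      def run(input: A): C = next.run(self.run(input))
    }
  }

  // Two toy stages, each runnable on its own.
  val tokenize: Stage[String, Seq[String]] = new Stage[String, Seq[String]] {
    def run(input: String): Seq[String] = input.split("\\s+").toSeq
  }

  val countTokens: Stage[Seq[String], Int] = new Stage[Seq[String], Int] {
    def run(input: Seq[String]): Int = input.size
  }

  // The concatenation is itself a stage from String to Int.
  val pipeline: Stage[String, Int] = tokenize.andThen(countTokens)
}
```

Because every stage has a single output type, the compiler checks that the output of one stage matches the input of the next.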
Spark ML Pipelines
Using a Pipeline to train a model
Using a PipelineModel to get predictions
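The code shown on these slides did not survive in the transcript. As a minimal sketch of the two steps, assuming Spark 2.x or later (for `SparkSession`) with `spark-mllib` on the classpath: `Pipeline.fit` trains and returns a `PipelineModel`, and `PipelineModel.transform` produces predictions. The column names, toy data and choice of stages are assumptions made for illustration, not the pipeline from the talk.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object SparkMlPipelineSketch {
  // Three stages: tokenize text, hash tokens into a feature vector, fit a classifier.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    import spark.implicits._

    // Toy training data, made up for illustration: (text, label).
    val training = Seq(
      ("spark scala pipeline", 1.0),
      ("cooking pasta recipe", 0.0),
      ("scala udaf spark", 1.0),
      ("garden flowers", 0.0)
    ).toDF("text", "label")

    // Training: fitting a Pipeline returns a PipelineModel.
    val model: PipelineModel = buildPipeline().fit(training)

    // Prediction: the PipelineModel appends prediction columns to the input.
    val unseen = Seq("spark pipeline", "pasta").toDF("text")
    model.transform(unseen).select("text", "prediction").show()

    spark.stop()
  }
}
```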
Peek inside a Spark pipeline
It’s a Pipeline
plain Spark API
From DataFrame to a Model
Instantiating a Pipeline
Running it!
Example Pipeline
EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)
UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.
EventPreprocessor: aggregates events per user
GenderPredictor: creates labels and features, trains classifier & computes predictions
GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
Scalable Pipelines: pain points
Input: 1 day’s / 7 days’ worth of events data.
Larger lookbacks needed for better accuracy.
More data for better performance
[Figure: performance of three different pipelines vs. lookback length (1, 7, 30, 45 days)]
Scalable Pipelines: pain points
What will happen if we try to process 30 days’ worth of data (e.g. 3B events)?
Memory and processing heavy:
● In one use case, for a 7-day lookback (~7 x 100M events), we used to need 20 Spark executors with 22 GB of memory each.
Not easily scalable:
● As the lookback increases
● As more and more sites are incorporated into our pipelines
Redundant processing:
● For K days of lookback, we repeat the processing of K - 2 days’ worth of data when we run the pipeline every day in a rolling-window fashion.
Saved by Algebra
● The operations (op) along with the corresponding data structures (S) that
we are interested in are monoids.
○ Associative:
■ for all A,B,C in S, (A op B) op C = A op (B op C)
○ Identity element:
■ there exists E in S such that for each A in S, E op A = A op E = A
● Examples:
○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
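The two monoid laws and both examples above can be written down in a few lines of plain Scala. This is a minimal sketch; the `Monoid` trait and the instance names are introduced here for illustration (libraries such as Cats or Algebird ship their own versions).

```scala
object MonoidSketch {
  // A monoid: an associative binary operation together with an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Summation of integers, identity 0.
  val intSum: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Concatenation of string sequences, identity is the empty sequence.
  val stringConcat: Monoid[Vector[String]] = new Monoid[Vector[String]] {
    val empty: Vector[String] = Vector.empty
    def combine(x: Vector[String], y: Vector[String]): Vector[String] = x ++ y
  }

  // Folds any number of values; associativity means the grouping does not matter,
  // so partial results can be computed separately and combined later.
  def combineAll[A](values: Seq[A])(m: Monoid[A]): A =
    values.foldLeft(m.empty)(m.combine)
}
```

Associativity is what allows a large aggregation to be split into chunks, and the identity element gives a well-defined result for an empty chunk.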
Scalable Pipelines: in monoid fashion
● Split the aggregations into smaller chunks
○ i.e. pre-process events per user and single day (not over the entire lookback)
● Make one (or multiple) day aggregates and combine
○ i.e. aggregate over the pre-processed events per user and day
● It’s like trying to eat an elephant: one piece at a time!
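The two steps can be sketched in plain Scala without Spark. The event representation (a `(userId, feature)` pair) and all names below are assumptions made for illustration. The sketch shows that combining per-day aggregates gives the same result as aggregating the whole lookback in one go.

```scala
object DailyAggregatesSketch {
  type Counts = Map[String, Double]          // feature -> frequency
  type DailyAggregate = Map[String, Counts]  // userId -> counts

  private val noCounts: Counts = Map.empty[String, Double]

  // Key-wise sum of two count maps; the empty map is the identity element.
  def mergeCounts(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Step 1: pre-process one day's raw events, given as (userId, feature) pairs.
  def aggregateDay(events: Seq[(String, String)]): DailyAggregate =
    events.foldLeft(Map.empty[String, Counts]) { case (acc, (user, feature)) =>
      acc.updated(user, mergeCounts(acc.getOrElse(user, noCounts), Map(feature -> 1.0)))
    }

  // Step 2: combine any number of per-day aggregates into one lookback aggregate.
  def aggregateLookback(days: Seq[DailyAggregate]): DailyAggregate =
    days.foldLeft(Map.empty[String, Counts]) { (acc, day) =>
      day.foldLeft(acc) { case (a, (user, counts)) =>
        a.updated(user, mergeCounts(a.getOrElse(user, noCounts), counts))
      }
    }
}
```

In a rolling window, only the newest day has to go through `aggregateDay`; the stored aggregates of earlier days are reused by `aggregateLookback`.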
Scalable Pipelines: building blocks
● Imagine we had a MapAggregator, for aggregating maps of [String->Double].
● The spec for such an aggregator implemented in Scala on Spark could look like this. :-)
Scalable Pipelines: building blocks
● In Spark we can define our own functions, also known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output.
● It gets executed for each row of the DataFrame.
● It can also be parameterized.
● e.g. val myUDF = udf((myArg: myType) => ...)
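A minimal sketch of both a plain and a parameterized UDF, assuming Spark 2.x or later on the classpath. The column name `section` and the two functions are made up for illustration. Note that these UDFs do not guard against null input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // Plain UDF: applied to one column value per row.
  val normalize: UserDefinedFunction = udf((s: String) => s.trim.toLowerCase)

  // Parameterized UDF: the threshold is fixed when the UDF is created.
  def isLongerThan(minLength: Int): UserDefinedFunction =
    udf((s: String) => s.length > minLength)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(" Sports ", "NEWS").toDF("section")
    df.withColumn("clean", normalize(col("section")))
      .withColumn("isLong", isLongerThan(4)(col("clean")))
      .show()

    spark.stop()
  }
}
```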
Scalable Pipelines: UDAF
● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value looking at a single input row).
● Examples: calculating the geometric mean or the product of values for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.
● It updates this buffer for every input row.
● Once it has processed all input rows, it generates a result value based on the values of the aggregation buffer.
A User Defined Aggregate Function
Implementation of abstract methods
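The implementation shown on the slide is not recoverable from the transcript. The following is a sketch of what a map-summing UDAF could look like, assuming a Spark version that provides `UserDefinedAggregateFunction` (introduced in 1.5, deprecated in 3.0 in favour of `Aggregator`). The class name follows the MapAggregator mentioned earlier, but the body is an illustration, not the author's code.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Illustrative UDAF that sums Map[String, Double] columns key by key.
class MapAggregator extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType, valueContainsNull = false)

  // One input column holding a map of feature -> count.
  override def inputSchema: StructType = StructType(StructField("value", mapType) :: Nil)
  // The aggregation buffer holds the running, merged map.
  override def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)
  override def dataType: DataType = mapType
  override def deterministic: Boolean = true

  // Identity element of the monoid: the empty map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Fold one input row into the buffer; null maps are skipped.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = combine(buffer.getMap[String, Double](0), input.getMap[String, Double](0))

  // Merge two partial buffers, e.g. coming from different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = combine(buffer1.getMap[String, Double](0), buffer2.getMap[String, Double](0))

  // The final result is the merged map itself.
  override def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)

  // Key-wise sum; associative, which is what makes partial merging safe.
  private def combine(a: scala.collection.Map[String, Double],
                      b: scala.collection.Map[String, Double]): Map[String, Double] =
    b.foldLeft(a.toMap) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }
}
```

A possible usage, with hypothetical column names: `df.groupBy("userId").agg(new MapAggregator()(col("features")).as("features"))`.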
Scalable Pipelines: adding a new stage
EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)
UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.
EventPreprocessor: aggregates events per user and day
EventAggregator: aggregates pre-processed events per user over multiple days (lookback)
GenderPredictor: creates labels and features, trains classifier & computes predictions
GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
Scalable Pipelines: Aggregating Events
It’s a Transformer
DataFrame in, DataFrame out
Aggregating maps of feature frequency counts
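The stage's code is not part of the transcript. As a rough sketch of the "DataFrame in, DataFrame out" shape, assuming Spark 2.x or later, a custom `Transformer` could look as follows. For brevity it sums a numeric column with the built-in `sum` rather than merging maps with a UDAF; the class name and the columns `userId` and `count` are hypothetical.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Illustrative Transformer: collapses per-user, per-day rows into one row per user.
class DailyCountAggregator(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("dailyCountAggregator"))

  // DataFrame in, DataFrame out: group by user and sum the daily counts.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.groupBy("userId").agg(sum("count").as("count"))

  // Output schema: the user id plus the summed count.
  override def transformSchema(schema: StructType): StructType =
    StructType(Seq(schema("userId"), StructField("count", DoubleType, nullable = true)))

  override def copy(extra: ParamMap): DailyCountAggregator = defaultCopy(extra)
}
```

Because it is a `Transformer`, such a stage can be placed in a Spark ML `Pipeline` next to feature extractors and estimators.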
Scalable Pipelines: closing remarks
● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!
● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data
● Needless to say, more factors contribute towards a scalable pipeline:
○ Performance tuning of the Spark cluster
○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)
Q/A
Thank you!
Shameless plug
We are hiring!
Across all our hubs
in London, Oslo, Stockholm, Barcelona
for Data Science, Engineering, UX and Product roles
https://jobs.lever.co/schibsted
spt-recruiters@schibsted.com

More Related Content

Recently uploaded

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 

Recently uploaded (20)

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 

Featured

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
Alireza Esmikhani
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
Project for Public Spaces & National Center for Biking and Walking
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
Erica Santiago
 

Featured (20)

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 

Scalable predictive pipelines with Spark and Scala

  • 2. Scalable Predictive Pipelines with Spark and Scala Dimitris Papadopoulos
  • 11. 1. Using Spark ML Pipelines 2. Scalable Pipelines 11 Outline
  • 16. 16 Pipeline Stage ● One or more inputs ● Strictly one output
  • 17. 17 Pipeline Stage ● One or more inputs ● Strictly one output ● Closed under concatenation
  • 18. 18 Pipeline Stage ● One or more inputs ● Strictly one output ● Closed under concatenation ● Standalone and runnable ● Spark™ ML inside
  • 20. 20 Spark ML Pipelines Using a Pipeline to train a model
  • 21. 21 Spark ML Pipelines Using a PipelineModel to get predictions
  • 22. 22 Peek inside a Spark pipeline It’s a Pipeline
  • 23. 23 Peek inside a Spark pipeline It’s a Pipeline plain Spark API
  • 24. 24 Peek inside a Spark pipeline It’s a Pipeline plain Spark API From DataFrame to a Model
  • 25. 25 Peek inside a Spark pipeline Instantiating a Pipeline Running it!
  • 26. 26 Example Pipeline EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions
  • 27. 27 Example Pipeline EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
  • 28. 28 Scalable Pipelines: pain points EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
  • 29. 29 Scalable Pipelines: pain points EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC Input: 1 day’s / 7 days’ worth of events data. Larger lookbacks needed for better accuracy.
  • 30. 30 More data for better performance Performance of three different pipelines, vs lookback length (1, 7, 30, 45)
  • 31. 31 Scalable Pipelines: pain points EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC What will happen if we try to process 30 days worth of data (e.g. 3B events) ???
  • 32. 32 Scalable Pipelines: pain points Memory and processing heavy: ● In one use-case, for 7 days lookback (~7 x 100M events) we used to need 20 Spark executors with 22G of memory each. Not easily scalable ● As the lookback increases ● As more and more sites are incorporated into our pipelines Redundant processing ● For K days of lookback, we are repeating processing of K - 2 days worth of data, when we run the pipeline every day, in a rolling window fashion. “What will happen if we try to process 30 days worth of data (e.g. 3B events) ???”
  • 33. 33 Saved by Algebra ● The operations (op) along with the corresponding data structures (S) that we are interested in are monoids. ○ Associative: ■ for all A,B,C in S, (A op B) op C = A op (B op C) ○ Identity element: ■ there exists E in S such that for each A in S, E op A = A op E = A ● Examples: ○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4) ○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
  • 36. 36 Scalable Pipelines: in monoids fashion ● Split the aggregations into smaller chunks ○ i.e. pre-process events per user and single day (not over the entire lookback) ● Make one-day (or multi-day) aggregates and combine them ○ i.e. aggregate over the pre-processed events per user and day ● It’s like trying to... eat an elephant: one piece at a time!
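The "one piece at a time" idea can be sketched without Spark. The example below assumes a toy `Event` type and counts events per user; all names are hypothetical. It shows that combining per-day aggregates gives the same result as aggregating the whole lookback in one go.

```scala
object DailyAggregation {
  // Toy event: which user did something on which day (hypothetical schema).
  final case class Event(userId: String, day: Int)

  // Step 1: pre-process a single day's events into one count per user.
  def aggregateDay(events: Seq[Event]): Map[String, Long] =
    events.groupBy(_.userId).map { case (user, es) => user -> es.size.toLong }

  // Step 2: combine two per-user aggregates.
  // This operation is associative and the empty map is its identity element.
  def combine(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (user, n)) =>
      acc.updated(user, acc.getOrElse(user, 0L) + n)
    }

  // Aggregate over a lookback window by combining already-computed daily aggregates.
  def aggregateLookback(daily: Seq[Map[String, Long]]): Map[String, Long] =
    daily.foldLeft(Map.empty[String, Long])(combine)
}
```

With daily aggregates stored once, a rolling window only needs to pre-process the newest day and then combine small per-user summaries, rather than re-reading every raw event in the lookback.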
  • 39. 39 Scalable Pipelines: building blocks ● Imagine we had a MapAggregator, for aggregating maps of [String -> Double]. ● The spec for such an aggregator implemented in Scala on Spark could look like this. :-)
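The spec shown on the slide is not part of this transcript. As a stand-in, here is a plain-Scala sketch of the behaviour one would expect from such an aggregator: values of shared keys are summed, and the empty map changes nothing. The `MapMerge` object and the sample keys are assumptions made for illustration.

```scala
object MapMerge {
  // Merge two feature-count maps by summing the values of shared keys.
  def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    (a.keySet ++ b.keySet).iterator.map { k =>
      k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))
    }.toMap

  // Spec-style checks on the expected behaviour.
  def check(): Unit = {
    val day1 = Map("sports" -> 2.0, "news" -> 1.0)
    val day2 = Map("sports" -> 1.0, "cars" -> 4.0)

    // Shared keys are summed, the others are carried over.
    assert(merge(day1, day2) == Map("sports" -> 3.0, "news" -> 1.0, "cars" -> 4.0))
    // The empty map is the identity element.
    assert(merge(day1, Map.empty) == day1)
    // The order in which days are merged does not matter here.
    assert(merge(day1, day2) == merge(day2, day1))
  }
}
```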
  • 40. 40 Scalable Pipelines: building blocks ● In Spark we can define our own functions, also known as User Defined Functions (UDF) ● A UDF takes as arguments one or more columns, and returns some output. ● It gets executed for each row of the DataFrame. ● It can also be parameterized. ● e.g. val myUDF = udf((myArg: myType) => ...)
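To make the `udf(...)` one-liner above concrete, here is a small sketch of a parameterised UDF. The input columns (`userId`, `url`), the sample URLs and the `segmentAt` helper are assumptions for illustration only. It assumes Spark 2.x or later on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfExample {
  // Plain Scala logic: return the path segment at `position`, or "" if absent.
  // "https://host/a/b".split("/") yields ["https:", "", "host", "a", "b"],
  // so the path starts after the first three elements.
  def segmentAt(url: String, position: Int): String =
    url.split("/").drop(3).toList.lift(position).getOrElse("")

  // Parameterised UDF: the closure captures `position`.
  def pathSegment(position: Int): UserDefinedFunction =
    udf((url: String) => segmentAt(url, position))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-example").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per event with the visited URL.
    val events = Seq(
      ("u1", "https://example.com/sports/football"),
      ("u2", "https://example.com/news/politics")
    ).toDF("userId", "url")

    // The UDF is evaluated once per row of the DataFrame.
    events.withColumn("section", pathSegment(0)(col("url"))).show(false)

    spark.stop()
  }
}
```

Keeping the row-level logic in an ordinary function such as `segmentAt` makes it testable without a Spark session; the UDF is then only a thin wrapper.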
  • 41. 41 Scalable Pipelines: UDAF ● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAF). ● UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value looking at a single input row) ● Examples: calculating geometric mean or calculating the product of values for every group. ● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data. ● It updates this buffer for every input row. ● Once it has processed all input rows, it generates a result value based on values of the aggregation buffer.
  • 42. 42 Scalable Pipelines: UDAF A User Defined Aggregate Function Implementation of abstract methods
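The code on this slide is not included in the transcript. The sketch below is one possible UDAF that sums `Map[String, Double]` columns key by key, following the buffer lifecycle described above (initialize, update per row, merge partial buffers, evaluate). It is an assumption about what such an aggregator could look like, not the author's implementation. It uses `UserDefinedAggregateFunction`, the UDAF API available since Spark 1.5; that class is deprecated as of Spark 3.0 in favour of `Aggregator`.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sums Map[String, Double] columns key-wise within each group.
class MapAggregator extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType, valueContainsNull = false)

  // One input column and one buffer slot, both of the map type above.
  override def inputSchema: StructType = StructType(StructField("input", mapType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)
  override def dataType: DataType = mapType
  override def deterministic: Boolean = true

  // Start every group from the identity element: the empty map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Fold one input row into the buffer, skipping null maps.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      buffer(0) = MapAggregator.merge(
        buffer.getMap[String, Double](0), input.getMap[String, Double](0))
    }

  // Combine two partial buffers computed on different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = MapAggregator.merge(
      buffer1.getMap[String, Double](0), buffer2.getMap[String, Double](0))

  // The final result is the accumulated map itself.
  override def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)
}

object MapAggregator {
  // Key-wise sum of two maps; a missing key counts as 0.0.
  def merge(a: collection.Map[String, Double],
            b: collection.Map[String, Double]): Map[String, Double] =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0)))
      .toMap
}
```

Assuming a DataFrame `df` with hypothetical columns `userId` and `featureCounts`, it could be applied as `df.groupBy("userId").agg(new MapAggregator()(col("featureCounts")))`.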
  • 43. 43 Scalable Pipelines: adding a new stage EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC What will happen if we try to process 30 days’ worth of data (e.g. 3B events)?
  • 44. 44 Scalable Pipelines: adding a new stage EventCoalescer: collects raw pulse events (Json) into substantially fewer files (Parquet) UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training. EventPreprocessor: aggregates events per user and day GenderPredictor: creates labels and features, trains classifier & computes predictions GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC EventAggregator: aggregates pre-processed events per user over multiple days (lookback)
  • 48. 48 Scalable Pipelines: Aggregating Events It’s a Transformer DataFrame in, DataFrame out Aggregating maps of feature frequency counts
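The slide's code is not in the transcript, so the following is only a sketch of a Spark ML `Transformer` with the same shape: DataFrame in, DataFrame out, merging per-day maps of feature counts per user. To stay self-contained it swaps in built-in functions (`explode`, `sum`, `map_from_entries`, which needs Spark 2.4+) instead of a custom UDAF. The column names `userId` and `featureCounts` are assumptions.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, collect_list, explode, map_from_entries, struct, sum}
import org.apache.spark.sql.types.{DoubleType, MapType, StringType, StructField, StructType}

// Merges per-user, per-day feature-count maps into one map per user.
class EventAggregator(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("eventAggregator"))

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema) // fail early if expected columns are missing
    dataset
      // One row per (user, feature, count) entry of each daily map.
      .select(col("userId"), explode(col("featureCounts")).as(Seq("feature", "count")))
      // Sum the counts of each feature per user across all days.
      .groupBy("userId", "feature")
      .agg(sum("count").as("total"))
      // Rebuild a single map per user from the summed entries.
      .groupBy("userId")
      .agg(map_from_entries(collect_list(struct(col("feature"), col("total")))).as("featureCounts"))
  }

  override def transformSchema(schema: StructType): StructType = {
    require(
      schema.fieldNames.contains("userId") && schema.fieldNames.contains("featureCounts"),
      "Input must contain the columns userId and featureCounts")
    StructType(Seq(
      StructField("userId", schema("userId").dataType),
      StructField("featureCounts", MapType(StringType, DoubleType))))
  }

  override def copy(extra: ParamMap): EventAggregator = defaultCopy(extra)
}
```

One limitation of this variant: users whose maps are all empty drop out of the result, because `explode` emits no rows for an empty map. A UDAF-based aggregation would keep them with an empty map.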
  • 50. 50 Scalable Pipelines: closing remarks ● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20! ● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data
  • 52. 52 Scalable Pipelines: closing remarks ● Needless to say, more factors contribute towards a scalable pipeline: ○ Performance tuning of the Spark cluster ○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration ● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)
  • 54. 54 Shameless plug We are hiring! Across all our hubs in London, Oslo, Stockholm, Barcelona for Data Science, Engineering, UX and Product roles https://jobs.lever.co/schibsted spt-recruiters@schibsted.com