Scaling machine learning workflows with Apache Beam

Tatiana Al-Chueyr
Senior Data Engineer
Online, 24 October 2020 @tati_alchueyr

Good afternoon! (Добрый день!)
@tati_alchueyr
Multi
NIX Conf
Good day! (Доброго дня!)

tati.__doc__
● Brazilian living in London since 2014
● Senior Data Engineer at the BBC Datalab team
● Graduated in Computer Engineering at Unicamp, Brazil
● Passionate software developer for 16 years
● Experience in the private and public sectors
● Developed software for Medicine, Media and Education

I ❤ Ukraine
In 2019, Amanda and I spent 3 days in Kharkiv, where:
● We were keynote speakers at OctopusCon
● We lectured at the Kharkiv National University of Radio Electronics
● I was really impressed with the Ukrainian tech community
● We had a dolphin therapy session at the Nemo Dolphinarium
Credit: @obestwalter
Credit: OctopusCon

BBC: British Broadcasting Corporation
● Founded in 1922
● In the UK…
○ The BBC has no advertisements
○ If a resident wants to watch the BBC, they pay for a TV Licence
● Values
○ Independent, impartial and honest
○ Audiences are at the heart of everything we do
● Purpose
○ Inform + Educate + Entertain

bbc.stats()
● BBC TV reaches 91% of the UK adult population
● BBC News reaches a global audience of 426 million every week
Reference 1: BBC
Reference 2: BBC
Image credit: BBC

bbc.stats()
~2,000 pieces of BBC content are produced every day...
and there is only a limited number of slots to fill!

Vision
For the BBC to be a leader in Machine Learning that
delights audiences and prioritises the needs of
individuals and society over corporations and states.
Mission
To develop and deploy Machine Learning at BBC scale
so that teams can tailor services to individuals whilst
upholding our editorial values.

Pre-lockdown (subset of) Datalab team members (15 August 2019)

Locked-down (subset of) Datalab team members (19 March 2020)
COVID-19 pandemic

recommendation engine
the challenge

The BBC outsourced a recommendation engine

The audience liked personalised recommendations

Could we replace it with our own recommender?

principles

Content-based approach

Collaborative-filtering approach

Hybrid approach, e.g. the Factorisation Machines algorithm

Machine learning workflow

the prototype

The prototype
1-2 months of work:
● Collected data (quick-and-dirty™ scripts)
● Compared existing Python Factorisation Machines libraries (winner: LightFM)
● Trained and predicted recommendations (quick-and-dirty™ scripts)
● Implemented a qualitative experiment tool
● Recruited volunteers to join the qualitative experiment
● Ran a qualitative experiment, comparing:
○ External provider recommendations
○ Our own Factorisation Machines-powered recommendations

Qualitative experiment: how
Who
● ~30 test users recruited
○ Internal BBC employees
○ Under 35
How
● Two sets with 9 recommendations each:
○ External provider
○ Internal factorisation machines
● Users, without knowing the origin of the recs, had to:
○ choose “the best”, “both”, or “neither”
○ explain why

Qualitative experiments
[Chart: votes for "external provider", "factorisation machines", "both" and "neither"]

productionising

Productionising machine learning
[Diagram components:]
● Configuration
● Data Collection and Transformation
● Feature Extraction
● Data Verification
● Machine Resource Management
● Serving Infrastructure
● Monitoring
● Process Management Tools
● Analysis Tools
● ML Code
Image copied from presentation by Googler @mpyeager

Input: User activity data, Content metadata
Processing: Machine Learning model training, Predict recommendations
Output: Recommendations

Business Rules, part I - Non-personalised
- Recency
- Availability
- Excluded Masterbrands
- Excluded genres
Business Rules, part II - Personalised
- Already seen items
- Local radio (if not consumed previously)
- Specific language (if not consumed previously)
- Episode picking from a series
- Diversification (1 episode per brand/series)
[Diagram: Input, Processing (training), Output (Recommendations)]
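A few of the business rules listed on this slide could be sketched as two filtering steps over a ranked list of candidates. This is a minimal illustration and not the BBC implementation: the `Item` fields, the 30-day recency window and both function names are assumptions made up for the example, and only a subset of the rules is covered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Item:
    # Hypothetical item schema, for illustration only.
    item_id: str
    brand: str
    genre: str
    published: datetime
    available: bool


def non_personalised(items, now, excluded_genres, max_age=timedelta(days=30)):
    """Part I: recency, availability and excluded genres (same for every user)."""
    return [
        item for item in items
        if item.available
        and now - item.published <= max_age  # assumed recency window
        and item.genre not in excluded_genres
    ]


def personalised(ranked_items, seen_ids):
    """Part II: drop already seen items and keep one item per brand."""
    kept, brands = [], set()
    for item in ranked_items:  # assumed to be sorted by model score, best first
        if item.item_id in seen_ids or item.brand in brands:
            continue
        brands.add(item.brand)
        kept.append(item)
    return kept
```

Because part I does not depend on the user, it can run once per batch, while part II has to run once per user.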

Steps to be done in the workflows, before the API
- Recency
- Availability
- Excluded genres
[Diagram: Input, Processing (training), Output (Recommendations)]

Recommendation API strategies
A. On the fly: the API uses the model, user activity and content metadata; it predicts & applies rules
B. Precompute: the API uses cached recs; it retrieves pre-computed recommendations
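The difference between the two strategies can be sketched in a few lines. This is a simplified illustration under stated assumptions: `score`, `apply_rules` and `cache` are hypothetical stand-ins for the model, the business rules and the store of cached recommendations, and the default of 9 items simply mirrors the size of the sets used in the qualitative experiment.

```python
def recommend_on_the_fly(user_id, score, candidates, apply_rules, limit=9):
    """Strategy A: score every candidate and apply the rules on each request."""
    ranked = sorted(candidates, key=lambda item: score(user_id, item), reverse=True)
    return apply_rules(user_id, ranked)[:limit]


def recommend_precomputed(user_id, cache, fallback=()):
    """Strategy B: a key lookup; the heavy work already happened in a batch workflow."""
    return list(cache.get(user_id, fallback))
```

In strategy A the cost of a request grows with the number of candidates and the cost of the model. In strategy B a request costs one lookup, at the price of running a batch job that fills the cache.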

Recommendation API strategies
A. On the fly (model, user activity, content metadata)
B. Precompute (cached recs)
Goal: 1500 requests/s with P95 responses < 60 ms

Recommendation API: load performance
Strategy                                 On the fly   Precomputed   Precomputed
Concurrent load test (requests/s)        50           50            1500
Success percentage                       63.88%       100%          100%
p50 latency (success)                    323.78 ms    1.68 ms       4.75 ms
Maximum successful requests per second   23           50            1500
Goal: 1500 requests/s with P95 responses < 60 ms
Machine type: c2-standard-8, Python 3.7, Sanic workers: 7, Prediction threads: 1, vCPU cores: 7, Memory: 15 Gi, Deployment Replicas: 1
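The goal is stated on the 95th percentile, while the table reports the median (p50). A minimal sketch of how a load-test run could be checked against the goal is shown below. The nearest-rank percentile definition and both function names are assumptions for illustration; load-testing tools may compute percentiles slightly differently.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_goal(latencies_ms, achieved_rps, target_rps=1500, p95_limit_ms=60):
    """Check the goal from the slide: 1500 requests/s with P95 responses < 60 ms."""
    return achieved_rps >= target_rps and percentile(latencies_ms, 95) < p95_limit_ms
```

A run only passes when both conditions hold, so a fast median does not help if more than 5% of the responses are slow.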

Strategies to serve recommendations
A. On the fly (model, user activity, content metadata)
B. Precompute (cached recs)

Steps to be done in the workflows, before the API
- Recency
- Availability
- Excluded genres
[Diagram: Input, Processing (training), Output (Precomputed recommendations)]

workflows orchestration

Workflows orchestration: requirements
● Scheduling recurrent jobs
● Retrying a task if it fails
● Task dependency management
● Monitoring and logs
● Capability of programmatically defining workflows (directed acyclic graphs)
● Built-in support for writing automated tests
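Two of these requirements, dependency management and retries, can be illustrated with the Python standard library alone. The sketch below is not how Airflow is implemented; `run_workflow` and its arguments are hypothetical names, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, tasks, max_retries=2):
    """Run tasks in dependency order, retrying a failed task up to max_retries times.

    dependencies maps a task name to the set of tasks that must finish first.
    tasks maps a task name to a callable with no arguments.
    """
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                executed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up: downstream tasks never start
    return executed
```

For example, `run_workflow({"train": {"collect"}, "predict": {"train"}}, tasks)` runs `collect`, then `train`, then `predict`. An orchestrator adds scheduling, monitoring and logs on top of this core idea.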

Workflows orchestration: Apache Airflow

Google-managed Apache Airflow: Cloud Composer

Cloud Composer: monitoring

Limitation of Apache Airflow
● Good for orchestrating tasks
● Not good for processing a data-intensive task within an Airflow worker

Issue:
Depending on the volume of data, a single PythonOperator task which usually takes 10 min could take almost 3 h!
Consequences:
● Overall delay
● Blocked worker

Time estimations (in seconds) to predict recommendations using a c2-standard-30 instance (30 vCPU and 120 GB RAM)
2 h to predict recommendations for 10k users
What about 5 million users - or more?
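A back-of-the-envelope extrapolation shows why a single machine is not enough. The sketch below assumes that prediction time grows linearly with the number of users and that work can be split perfectly across workers; both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def hours_for(users, measured_users=10_000, measured_hours=2.0):
    """Linear extrapolation from the measurement on the slide (2 h for 10k users)."""
    return users / measured_users * measured_hours


def workers_needed(users, deadline_hours):
    """Equal-sized workers needed to meet a deadline, assuming perfect parallelism."""
    return math.ceil(hours_for(users) / deadline_hours)
```

Under these assumptions, `hours_for(5_000_000)` gives 1000 hours, roughly 42 days on one instance, and `workers_needed(5_000_000, deadline_hours=2)` gives 500 instances of the same size to finish in 2 hours.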

Limitation of Apache Airflow: solutions
Delegating processing to other services
● Tasks which scale vertically (better hardware)
○ Airflow Compute Engine (Virtual Machine) Operator (GceInstanceStartOperator)
○ Airflow Kubernetes Pod Operator (GKEPodOperator)
● Tasks which scale horizontally (can be split and distributed across multiple nodes)
○ Airflow Dataflow Operator (Google Dataflow, Apache Beam)
○ Airflow Dataproc Operator (Google Dataproc, Apache Spark & Hadoop)

efficient data processing

Apache Beam
“Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines”

Apache Beam
https://towardsdatascience.com/running-an-apache-beam-data-pipeline-on-azure-databricks-c09e521d8fc3

Apache Beam: overview of Dataflow job
Image from the book “Google Cloud Platform In Action” by JJ Geewax, Chapter 20

Parallel processing “effortlessly”
Image from the book “Google Cloud Platform In Action” by JJ Geewax, Chapter 20

Simple Beam example
https://beam.apache.org/documentation/transforms/python/aggregation/cogroupbykey/
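The code shown on this slide did not survive extraction, so the following is a stand-in sketch of the CoGroupByKey transform that the link points to, not the original example. The episode and availability records are made-up data. `cogroup` is a plain-Python model of what the transform computes, and `build_join` expresses the same join with the Beam Python SDK, which has to be installed separately as `apache-beam`.

```python
from collections import defaultdict


def cogroup(**keyed_inputs):
    """Pure-Python model of CoGroupByKey: group the values of every input by key."""
    grouped = defaultdict(lambda: {name: [] for name in keyed_inputs})
    for name, pairs in keyed_inputs.items():
        for key, value in pairs:
            grouped[key][name].append(value)
    return dict(grouped)


def build_join(pipeline, episodes, availability):
    """The same join as a Beam pipeline fragment (requires the apache-beam package)."""
    import apache_beam as beam  # imported lazily so cogroup() runs without Beam

    episodes_pc = pipeline | "Create episodes" >> beam.Create(episodes)
    availability_pc = pipeline | "Create availability" >> beam.Create(availability)
    return (
        {"episodes": episodes_pc, "availability": availability_pc}
        | "Join by episode id" >> beam.CoGroupByKey()
    )


# Usage with Beam installed (hypothetical data):
#
#   import apache_beam as beam
#   with beam.Pipeline() as pipeline:
#       joined = build_join(
#           pipeline,
#           episodes=[("ep1", "Episode one"), ("ep2", "Episode two")],
#           availability=[("ep1", "available")],
#       )
#       joined | "Print" >> beam.Map(print)
```

Each output element of the Beam version is a pair of a key and a dictionary with one collection of values per input, which is what `cogroup` returns as a plain dictionary. The runner decides how the grouping is distributed across workers.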

Adoption of Apache Beam & Dataflow
“Serverless” parallel processing of 41,258,135 items (27.32 GB) with Python in 1 min 24 s using 10 default workers

s/PythonOperator/DataflowOperator
Before: pure Airflow, PythonOperator in Cloud Composer
After: DataflowOperator running a Beam pipeline within Dataflow
Computation time reduced by almost one order of magnitude
Document type          PythonOperator   DataflowOperator   Performance gain
episode                60 min           6 min              90%
availability episode   12 min           5 min              58%

Precomputing recs for millions of users

beam/dataflow gotchas

Quiz time
https://forms.gle/CxhnDU4wd55hmgQX7

To Beam or not to Beam?
● 8.4 GiB distributed in 130 parquet files
● Task: read only one of the columns and export it to new files
● Three implementations:
○ Single-threaded PyArrow on my computer (Quad-Core, 16 GB RAM)
○ Dataflow autoscaling, up to 10 default workers
○ Dataflow with a fixed number of 10 workers
● Which is the most efficient in terms of vCPU, memory and time?

To Beam or not to Beam?
               PyArrow        Dataflow (autoscaling)   Dataflow (fixed workers)
Time           3m56.355s      12m27.314s               7m44.518s
Total vCPU     0.05 vCPU hr   0.997 vCPU hr            0.979 vCPU hr
Total memory   0.016 GB hr    3.739 GB hr              3.673 GB hr
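The gap between the three runs is easier to see as ratios against the PyArrow baseline. The numbers below are copied from the table; the dictionary layout and the `ratios` helper are just an illustrative way to do the arithmetic.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# Measurements copied from the table above.
RUNS = {
    "PyArrow": {"seconds": to_seconds(3, 56.355), "vcpu_hr": 0.05, "gb_hr": 0.016},
    "Dataflow (autoscaling)": {"seconds": to_seconds(12, 27.314), "vcpu_hr": 0.997, "gb_hr": 3.739},
    "Dataflow (fixed workers)": {"seconds": to_seconds(7, 44.518), "vcpu_hr": 0.979, "gb_hr": 3.673},
}


def ratios(run, baseline="PyArrow"):
    """How many times more of each resource a run used than the baseline."""
    base = RUNS[baseline]
    return {metric: RUNS[run][metric] / base[metric] for metric in base}
```

For this task, the autoscaling Dataflow run took about 3.2 times longer than PyArrow and used roughly 20 times the vCPU hours and more than 230 times the memory hours. The fixed-worker run took about twice as long as PyArrow with similar resource usage to the autoscaling run.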

Does a better machine mean faster?
n1-standard-1
● 1 vCPU
● 3.75 GB RAM
n1-standard-4
● 4 vCPU
● 15 GB RAM

Error message from worker: ConnectionReset

A single “executor” within each worker (VM) needed 10 GB...
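A quick sanity check makes the memory problem concrete. The machine sizes are the ones from the earlier slide and the 10 GB figure is the one above. How many executors run inside one worker depends on the runner's process and thread layout, which the slides do not detail, so it is an explicit input here. The names are hypothetical.

```python
# (vCPUs, RAM in GB) for the machine types mentioned in the talk.
MACHINES = {"n1-standard-1": (1, 3.75), "n1-standard-4": (4, 15.0)}


def fits(machine, executors_per_worker, gb_per_executor=10.0):
    """Return True if all executors of one worker fit in the machine's RAM."""
    _vcpus, ram_gb = MACHINES[machine]
    return executors_per_worker * gb_per_executor <= ram_gb
```

With these numbers, `fits("n1-standard-1", 1)` is False, `fits("n1-standard-4", 1)` is True, and `fits("n1-standard-4", 4)` is False. A bigger standard machine only helps if the number of executors per worker does not grow along with the number of vCPUs.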

https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline

Solutions for memory-intensive Beam transformations
● Use a custom machine type with extended memory
● Use the shared memory feature available from Beam 2.24

Cost analysis
https://cloud.google.com/dataflow/pricing
Resource metrics per job

Cost reduction
$300 per run

Cost reduction
Memory-intensive transformation
Solutions
● Use shared memory
● Split pipelines so that only the memory-intensive transformation uses expensive machine types

Thank you very much! (Дуже тобі дякую! Большое спасибо!)
