SlideShare a Scribd company logo
Precomputing recommendations
with Apache Beam
Tatiana Al-Chueyr
Principal Data Engineer @ BBC Datalab
London, 21 September 2021
@tati_alchueyr
● Brazilian living in London UK since 2014
● Principal Data Engineer at the BBC (Datalab team)
● Graduated in Computer Engineering at Unicamp
● Software developer for 18 years
● Passionate about open-source
Apache Beam user since early 2019
BBC.datalab.hummingbirds
The knowledge in this presentation is the result of lots of teamwork
within one squad of a larger team and even broader organisation
current squad team members
previous squad team members
Darren
Mundy
David
Hollands
Richard
Bownes
Marc
Oppenheimer
Bettina
Hermant
Tatiana
Al-Chueyr
Jana
Eggink
some
business context
business context goal
to personalise the experience of millions of users of BBC Sounds
to build a replacement for an external third-party recommendation engine
business context numbers
BBC Sounds has approximately
● 200,000 podcast and music episodes
● 6.5 millions of users
The personalised rails (eg. Recommended for You) display:
● 9 episodes (smartphones) or
● 12 episodes (web)
business context problem visualisation
it is similar to finding the best match among 20,000 items per user x 65 million times
business context product rules
The recommendations must also comply to the BBC
product and editorial rules, such as:
● Diversification: no more than one item per brand
● Recency: no news episodes older than 24 hours
● Narrative arc: next drama series episode
● Language: Gaelic items to Gaelic listeners
● Availability: only available content
● Exclusion: shipping forecast and soap-opera
technology & architecture
overview
technology overview
● Python
● Google Cloud Platform
● Apache Airflow
● Apache Beam (Dataflow Runner)
● LightFM (Factorisation Machine model)
architecture overview
User
activity
Content
metadata
Train Model
Artefacts
Predict
Extract &
Transform
Extract &
Transform
User
activity
features
Content
metadata
features
Filtered
Predictions
Apply
rules
Predictions
historical data future
architecture overview
User
activity
Content
metadata
Train Model
Artefacts
Predict
Extract &
Transform
Extract &
Transform
User
activity
features
Content
metadata
features
Filtered
Predictions
Apply
rules
Predictions
where we used Apache Beam
historical data future
data
schema & sample
business context input data schema
Episode
● ID
● Masterbrand
● Brand
● Genres
● Formats
● Series
● Schedule start
● Schedule end
● Release date
User
● ID
User Interactions
● User ID
● Consumed Episodes
business context input data instance
Episode
● p09f7b5r
● bbc_radio_scotland
● p07jt0jn
● ["100001", "200003"]
● [“PT017”]
● p05l30yb
● 2017-11-03T10:07:00Z
● 2021-06-05T21:00:00Z
● 2017-11-03T00:00:00Z
User
● AdL9IFh-xQU=
User Interactions
● AdL9IFh-xQU=
● ["m000tw74",
"p095546q",
"p0958bg1",
"p0940vx8"]
business context output data schema & sample
User Predictions Schema
● User ID
● Recommendations
○ Episode ID
○ Score
User Predictions Sample
● AdL9IFh-xQU=
● [ (“p09f7b5r”, 1.0),
(“b13jlkf01”, 0.5),
(“m124uk”, 0.1)]
first steps
feasibility
risk analysis precompute recommendations
cost estimate: precompute daily for a month with pure Python in a VM: ~ US$ 250
Estimate of time (seconds) to precompute recommendations
analysis using c2-standard-30 (30 vCPU and 120 RAM) and LightFM
risk analysis sorting recommendations
sort 100k predictions per user with pure Python did not seem efficient
risk analysis sorting recommendations
sort 100k predictions per user with pure Python did not seem efficient
concern on the feasibility of applying complex rules
April 2020
scope
simplification
simplification initial numbers
Initial numbers for precomputing recommendations:
● Total users: 6.5 mi
● Total content: 200k items
● User activity: 1 year
● Precompute frequency: 3 times a day
● Apply rules after precomputing
April 2020
simplification initial x agreed numbers
Initial numbers for precomputing recommendations:
● Total users: 6.5 mi -> 3.5 mi
● Total content: 200k items -> 100k items
● User activity: 1 year -> 4 months
● Precompute frequency: 3 times a day -> 2 times a week
● Apply rules after precomputing -> before & after precomputing
April 2020
simplification non-personalised x personalised rules
User activity data Content metadata
Business Rules, part I - Non-personalised
- Recency
- Availability
- Excluded Masterbrands
- Excluded genres
Business Rules, part II - Personalised
- Already seen items
- Local radio (if not consumed previously)
- Specific language (if not consumed previously)
- Episode picking from a series
- Diversification (1 episode per brand/series)
Precomputed
recommendations
Machine Learning model
training
Predict recommendations
once
Reduce content
from
200k to 100k
items
3.5 million times
chose the 50
more relevant &
editorial
compatible items
out of 100k items
ranked by user
May 2020
simplification sorting
● Initial problem
○ for each user
■ sort 100k recommendations
● Simplified problem
○ for each user
■ calculate recommendations scores’ 90 percentile
■ discard items that have a score lower than P90
■ sort ~ 10k items per user
June 2020
precompute recommendations
pipeline evolution
pipeline 1.0 goal
● A single Beam Pipeline to
○ precompute recommendations
○ apply rules
June 2020
pipeline 1.0 design & arguments
August 2020
apache-beam[gcp]==2.15.0
--runner=DataflowRunner
--machine-type = n1-standard-1 (1 vCPU & 3.75 GB RAM)
--num_workers=10
--autoscaling_algorithm=NONE
pipeline 1.0 design
August 2020
pipeline 1.0 design
August 2020
pipeline 1.0 error when running in dev & prod
August 2020
Workflow failed. Causes: S05:Read non-cold start
users/Read+Retrieve user ids+Predict+Keep best scores+Sort
scores+Process predictions+Group activity history and
recommendations/pair_with_recommendations+Group activity
history and recommendations/GroupByKey/Reify+Group activity
history and recommendations/GroupByKey/Write failed., The job
failed because a work item has failed 4 times. Look in previous log
entries for the cause of each one of the 4 failures. For more
information, see
https://cloud.google.com/dataflow/docs/guides/common-errors.
The work item was attempted on these workers:
beamapp-al-cht01-08141052-08140353-1tqj-harness-0k4v
Root cause: The worker lost contact with the service.,
beamapp-al-cht01-08141052-08140353-1tqj-harness-0k4v
Root cause: The worker lost contact with the service.,
beamapp-al-cht01-08141052-08140353-1tqj-harness-ffqv
Root cause: The worker lost contact with the service.,
beamapp-al-cht01-08141052-08140353-1tqj-harness-cjht
Root cause: The worker lost contact with the service.
pipeline 1.0 data analysis
August 2020
1. Change machine type to a larger one
○ --machine_type=custom-1-6656 (1 vCPU, 6.5 GB RM) - 6.5GB RAM /core
○ --machine_type=m1-ultramem-40 (40 vCPU, 961 GB RAM) - 24GB RAM/core (overkill)
2. Refactor the pipeline
3. Reshuffle => the cost to apply it did not justify the performance gain in the steps after
○ Shuffle service
○ Reshuffle function
4. Increase the amount of workers
○ --num_workers=40
pipeline 1.0 attempts to fix (i)
September 2020
5. Control the parallelism in Dataflow so the VM wouldn’t starve out of memory
pipeline 1.0 attempts to fix (ii)
Worker node (VM)
SDK Worker
Harness Threads
SDK Worker
Harness Threads
Worker node (VM)
SDK Worker
Harness Threads
Worker node (VM)
SDK Worker
Harness Threads
Harness Threads
--number_of_worker_harness_threads=1
--experiments=use_runner_v2
(or)
--sdk_worker_parallelism
--experiments=no_use_multiple_sdk_containers
--experiments=beam_fn_api
September 2020
pipeline 1.0 attempts to fix (iii)
https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
pipeline 1.0 attempts to fix (iii)
https://twitter.com/tati_alchueyr/status/1301152715498758146
https://cloud.google.com/blog/products/data-analytics/ml-inference-in-dataflow-pipelines
pipeline 1.0 attempts to fix (iii)
https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
pipeline 1.0 attempts to fix (iii)
https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
pipeline 2.0 goal
● A single Beam Pipeline to
○ precompute recommendations
○ apply rules
● Which could process our production data
June 2020
pipeline 2.0 design & arguments
apache-beam== 2.24
--runner=DataflowRunner
--machine-type = custom-30-460800-ext
--num_workers= 40
--autoscaling_algorithm=NONE
September 2020
pipeline 2.0 business outcomes
● +59% increase in interactions in Recommended for You rail
● +103% increase in interactions for under 35s
internal external
September 2020
pipeline 2.0 issues
● but costs were high...
£ 279.31 per run
September 2020
pipeline 2.0 issues
OSError: [Errno 28] No space left on device During handling
March 2021
pipeline 2.0 issues
If a batch job uses Dataflow Shuffle, then the default is 25 GB;
otherwise, the default is 250 GB. March 2021
pipeline 2.0 issues
apache-beam== 2.24
--runner=DataflowRunner
--machine-type = custom-30-460800-ext
--num_workers= 40
--autoscaling_algorithm=NONE
--experiments=shuffle_mode=appliance
March 2021
cost savings plan
1. Administer pain relief 2. Hook up to bypass 3. Heart surgery
➔ Attempt shared
memory
➔ Attempt FlexRS
➔ Mid week delta (only
compute mid week for
users with activity
since Sunday’s run)
➔ Split pipeline
➔ Major refactor
➔ SCANN vs
LightFM.score()
➔ etc.
Timebox: 1 week Timebox: 2 weeks Timebox: 1 month
April 2021
pipeline 3.0 design
apache-beam== 2.24
--runner=DataflowRunner
--machine-type = custom-30-460800-ext
--num_workers= 40
--autoscaling_algorithm=NONE
--experiments=shuffle_mode=appliance
April 2021
pipeline 3.0 shared memory & FlexRS strategy
● Used production-representative data (model, auxiliary data structures)
● Ran the pipeline for 0.5% users, so the iterations would be cheap
○ 100% users: £ 266.74
○ 0.5% users: £ 80.54
● Attempts
○ Shared model using custom-30-460800-ext (15 GB/vCPU)
○ Shared model using custom-30-299520-ext (9.75 GB/vCPU)
○ Shared model using custom-6-50688-ext (8.25 GB/vCPU)
■ 0.5% users: £ 18.46 => -77.5% cost reduction!
May 2021
pipeline 3.0 shared memory & FlexRS results
● However, when we tried to run the same pipeline for 100%, it would take
hours and not complete.
● It was very inefficient and costed more than the initial implementation.
May 2021
pipeline 3.0 shared memory & FlexRS issue hypothesis
Between steps, we were
transferring lots of verbose data
which could be shared across
users. The version of
Dataflow/Beam we were using
didn’t seem to handle this
gracefully
May 2021
pipeline 4.0 heart surgery
● Split compute predictions from applying rules
● Keep the interfaces to a minimal
○ between these two pipelines
○ between steps within the same pipeline
June 2021
pipeline 4.0 heart surgery
June 2021
Top spender tasks
Recommended for you 1.1
(Dataflow estimated task duration
in days)
Music Mixes 1.1
(Dataflow estimated task duration
in days)
Featured Comedy 1.1
(Dataflow estimated task duration
in days)
Enrich user activity 20 27 28
Predict 47 17 17
Apply rules 47 0.3 0.1
Predict 41% 38% 38%
Rules-related costs 59% 62% 62%
The rules could be applied in a much cheaper VM, allowing 60% of the processing to be reduced to approx. 1/10
pipeline 4.1 precompute recommendations
apache-beam== 2.29
--runner=DataflowRunner
--machine-type = n1-highmem-16
--flexrs-goal = COST_OPTIMIZED
--max-num-workers= 64
--number-of-worker-harness-threads=7
--experiments=use_runner_v2
+ Batching
+ Shared memory
+ Flex RS
https://cloud.google.com/blog/products/data-analytics/ml-inference-in-dataflow-pipelines
July 2021
pipeline 4.1 precompute recommendations
Cost to run for 3.5 million users:
● 100k episodes: ~ £ 100.00 => £ 48.92 / run
● 300 episodes: ~ £ 60.00 => £ 3.40
● 18 episodes: ~ £ 50.00 => £0.74
July 2021
pipeline 4.2 apply business rules
apache-beam== 2.29
--runner=DataflowRunner
--machine-type = n1-standard-1
--experiments=use_runner_v2
+ Implemented rules natively
+ Created minimal interfaces and
views of the data
July 2021
pipeline 4.2 apply business rules
Cost to run for 3.5 million users:
● £ 0.15 - £ 2.00 per run, savings:
○ £ 150.00 => £ 2.00
○ £ 100.00 => £ 0.15 July 2021
pipeline 4.0 heart surgery
● We were able to reduce the cost of the most expensive run of the pipeline
from £ 279.31 per run to less than £ 50
● Reduced the costs to -82%
July 2021
takeaways
1. plan based on your data
2. an expensive machine learning pipeline is better than none
3. reducing the scope is a good starting point to saving money
○ Apply non-personalised rules before iterating per user
○ Sort top 1k recommendations by user opposed to 100k
4. using custom machine types might limit other cost savings
○ Such as FlexRS (schedulable preemptible instances in Dataflow only work)
5. to use shared memory may not lead to cost savings
6. minimal interfaces lead to more predictable behaviours in Dataflow
7. splitting the pipeline can be a solution to costs
takeaways
Thank you!
@tati_alchueyr

More Related Content

What's hot

Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018
David Tan
 
Yannis Zarkadas. Enterprise data science workflows on kubeflow
Yannis Zarkadas. Enterprise data science workflows on kubeflowYannis Zarkadas. Enterprise data science workflows on kubeflow
Yannis Zarkadas. Enterprise data science workflows on kubeflow
MarynaHoldaieva
 
AI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using KubeflowAI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using Kubeflow
Steve Guhr
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
Databricks
 
9 steps to awesome with kubernetes
9 steps to awesome with kubernetes9 steps to awesome with kubernetes
9 steps to awesome with kubernetes
BaraniBuuny
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
SpringOne Platform recap 정윤진
SpringOne Platform recap 정윤진SpringOne Platform recap 정윤진
SpringOne Platform recap 정윤진
VMware Tanzu Korea
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
Databricks
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
Jen Aman
 
H2O 3 REST API Overview
H2O 3 REST API OverviewH2O 3 REST API Overview
H2O 3 REST API Overview
Raymond Peck
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Malo Denielou
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Data Con LA
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Databricks
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016
Chris Fregly
 
Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)
Ricardo Amaro
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Dan Halperin
 
OpenRheinRuhr 2018 - Ops hates containers! Why?
OpenRheinRuhr 2018 - Ops hates containers! Why?OpenRheinRuhr 2018 - Ops hates containers! Why?
OpenRheinRuhr 2018 - Ops hates containers! Why?
Martin Alfke
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
Sri Ambati
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
Shu-Jeng Hsieh
 

What's hot (20)

Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018Deploying ML models to production (frequently and safely) - PYCON 2018
Deploying ML models to production (frequently and safely) - PYCON 2018
 
Yannis Zarkadas. Enterprise data science workflows on kubeflow
Yannis Zarkadas. Enterprise data science workflows on kubeflowYannis Zarkadas. Enterprise data science workflows on kubeflow
Yannis Zarkadas. Enterprise data science workflows on kubeflow
 
AI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using KubeflowAI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using Kubeflow
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
 
9 steps to awesome with kubernetes
9 steps to awesome with kubernetes9 steps to awesome with kubernetes
9 steps to awesome with kubernetes
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
SpringOne Platform recap 정윤진
SpringOne Platform recap 정윤진SpringOne Platform recap 정윤진
SpringOne Platform recap 정윤진
 
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
H2O 3 REST API Overview
H2O 3 REST API OverviewH2O 3 REST API Overview
H2O 3 REST API Overview
 
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...Portable batch and streaming pipelines with Apache Beam (Big Data Application...
Portable batch and streaming pipelines with Apache Beam (Big Data Application...
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016
 
Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)Capacity Planning Infrastructure for Web Applications (Drupal)
Capacity Planning Infrastructure for Web Applications (Drupal)
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
OpenRheinRuhr 2018 - Ops hates containers! Why?
OpenRheinRuhr 2018 - Ops hates containers! Why?OpenRheinRuhr 2018 - Ops hates containers! Why?
OpenRheinRuhr 2018 - Ops hates containers! Why?
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 

Similar to Precomputing recommendations with Apache Beam

Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...
IRJET Journal
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
SAP Technology
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuery
Márton Kodok
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
DataBench
 
AI-accelerated CFD (Computational Fluid Dynamics)
AI-accelerated CFD (Computational Fluid Dynamics)AI-accelerated CFD (Computational Fluid Dynamics)
AI-accelerated CFD (Computational Fluid Dynamics)
byteLAKE
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache Beam
Tatiana Al-Chueyr
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
MariaDB plc
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
Lucas Arruda
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
tdc-globalcode
 
Sprint 44 review
Sprint 44 reviewSprint 44 review
Sprint 44 review
ManageIQ
 
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
Databricks
 
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
Nesma
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Qtp interview questions_1
Qtp interview questions_1Qtp interview questions_1
Qtp interview questions_1
Ramu Palanki
 
DataLive conference in Geneva 2018 - Bringing AI to the Data
DataLive conference in Geneva 2018 - Bringing AI to the DataDataLive conference in Geneva 2018 - Bringing AI to the Data
DataLive conference in Geneva 2018 - Bringing AI to the Data
Sasha Lazarevic
 
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin Inc
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
Telecom product cost models development approach
Telecom product cost models development approachTelecom product cost models development approach
Telecom product cost models development approach
Parcus Group
 
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
Big Data Week
 

Similar to Precomputing recommendations with Apache Beam (20)

Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuery
 
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
Benchmarking for Big Data Applications with the DataBench Framework, Arne Ber...
 
AI-accelerated CFD (Computational Fluid Dynamics)
AI-accelerated CFD (Computational Fluid Dynamics)AI-accelerated CFD (Computational Fluid Dynamics)
AI-accelerated CFD (Computational Fluid Dynamics)
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache Beam
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
 
Sprint 44 review
Sprint 44 reviewSprint 44 review
Sprint 44 review
 
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
 
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
Nesma autumn conference 2015 - Is FPA a valuable addition to predictable agil...
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Qtp interview questions_1
Qtp interview questions_1Qtp interview questions_1
Qtp interview questions_1
 
DataLive conference in Geneva 2018 - Bringing AI to the Data
DataLive conference in Geneva 2018 - Bringing AI to the DataDataLive conference in Geneva 2018 - Bringing AI to the Data
DataLive conference in Geneva 2018 - Bringing AI to the Data
 
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Telecom product cost models development approach
Telecom product cost models development approachTelecom product cost models development approach
Telecom product cost models development approach
 
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
 

More from Tatiana Al-Chueyr

Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
Tatiana Al-Chueyr
 
Contributing to Apache Airflow
Contributing to Apache AirflowContributing to Apache Airflow
Contributing to Apache Airflow
Tatiana Al-Chueyr
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBC
Tatiana Al-Chueyr
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and Python
Tatiana Al-Chueyr
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBC
Tatiana Al-Chueyr
 
Sprint cPython at Globo.com
Sprint cPython at Globo.comSprint cPython at Globo.com
Sprint cPython at Globo.com
Tatiana Al-Chueyr
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for you
Tatiana Al-Chueyr
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English Right
Tatiana Al-Chueyr
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging software
Tatiana Al-Chueyr
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correction
Tatiana Al-Chueyr
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolution
Tatiana Al-Chueyr
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.com
Tatiana Al-Chueyr
 
PythonBrasil[8] closing
PythonBrasil[8] closingPythonBrasil[8] closing
PythonBrasil[8] closing
Tatiana Al-Chueyr
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
Tatiana Al-Chueyr
 
Desarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebasDesarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebas
Tatiana Al-Chueyr
 
Desarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y AndroidDesarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y Android
Tatiana Al-Chueyr
 
Transifex: Ensinando o seu Software Público a falar novos idiomas
Transifex: Ensinando o seu Software Público a falar novos idiomasTransifex: Ensinando o seu Software Público a falar novos idiomas
Transifex: Ensinando o seu Software Público a falar novos idiomas
Tatiana Al-Chueyr
 

More from Tatiana Al-Chueyr (17)

Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Contributing to Apache Airflow
Contributing to Apache AirflowContributing to Apache Airflow
Contributing to Apache Airflow
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBC
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and Python
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBC
 
Sprint cPython at Globo.com
Sprint cPython at Globo.comSprint cPython at Globo.com
Sprint cPython at Globo.com
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for you
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English Right
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging software
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correction
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolution
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.com
 
PythonBrasil[8] closing
PythonBrasil[8] closingPythonBrasil[8] closing
PythonBrasil[8] closing
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
 
Desarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebasDesarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebas
 
Desarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y AndroidDesarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y Android
 
Transifex: Ensinando o seu Software Público a falar novos idiomas
Transifex: Ensinando o seu Software Público a falar novos idiomasTransifex: Ensinando o seu Software Público a falar novos idiomas
Transifex: Ensinando o seu Software Público a falar novos idiomas
 

Recently uploaded

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Precomputing recommendations with Apache Beam

  • 1. Precomputing recommendations with Apache Beam Tatiana Al-Chueyr Principal Data Engineer @ BBC Datalab London, 21 September 2021
  • 2. @tati_alchueyr ● Brazilian living in London UK since 2014 ● Principal Data Engineer at the BBC (Datalab team) ● Graduated in Computer Engineering at Unicamp ● Software developer for 18 years ● Passionate about open-source Apache Beam user since early 2019
  • 3. BBC.datalab.hummingbirds The knowledge in this presentation is the result of lots of teamwork within one squad of a larger team and even broader organisation current squad team members previous squad team members Darren Mundy David Hollands Richard Bownes Marc Oppenheimer Bettina Hermant Tatiana Al-Chueyr Jana Eggink
  • 5. business context goal to personalise the experience of millions of users of BBC Sounds to build a replacement for an external third-party recommendation engine
  • 6. business context numbers BBC Sounds has approximately ● 200,000 podcast and music episodes ● 6.5 millions of users The personalised rails (eg. Recommended for You) display: ● 9 episodes (smartphones) or ● 12 episodes (web)
  • 7. business context problem visualisation it is similar to finding the best match among 20,000 items per user x 65 million times
  • 8. business context product rules The recommendations must also comply to the BBC product and editorial rules, such as: ● Diversification: no more than one item per brand ● Recency: no news episodes older than 24 hours ● Narrative arc: next drama series episode ● Language: Gaelic items to Gaelic listeners ● Availability: only available content ● Exclusion: shipping forecast and soap-opera
  • 10. technology overview ● Python ● Google Cloud Platform ● Apache Airflow ● Apache Beam (Dataflow Runner) ● LightFM (Factorisation Machine model)
  • 11. architecture overview User activity Content metadata Train Model Artefacts Predict Extract & Transform Extract & Transform User activity features Content metadata features Filtered Predictions Apply rules Predictions historical data future
  • 12. architecture overview User activity Content metadata Train Model Artefacts Predict Extract & Transform Extract & Transform User activity features Content metadata features Filtered Predictions Apply rules Predictions where we used Apache Beam historical data future
  • 14. business context input data schema Episode ● ID ● Masterbrand ● Brand ● Genres ● Formats ● Series ● Schedule start ● Schedule end ● Release date User ● ID User Interactions ● User ID ● Consumed Episodes
  • 15. business context input data instance Episode ● p09f7b5r ● bbc_radio_scotland ● p07jt0jn ● ["100001", "200003"] ● [“PT017”] ● p05l30yb ● 2017-11-03T10:07:00Z ● 2021-06-05T21:00:00Z ● 2017-11-03T00:00:00Z User ● AdL9IFh-xQU= User Interactions ● AdL9IFh-xQU= ● ["m000tw74", "p095546q", "p0958bg1", "p0940vx8"]
  • 16. business context output data schema & sample User Predictions Schema ● User ID ● Recommendations ○ Episode ID ○ Score User Predictions Sample ● AdL9IFh-xQU= ● [ (“p09f7b5r”, 1.0), (“b13jlkf01”, 0.5), (“m124uk”, 0.1)]
  • 18. risk analysis precompute recommendations cost estimate: precompute daily for a month with pure Python in a VM: ~ US$ 250 Estimate of time (seconds) to precompute recommendations analysis using c2-standard-30 (30 vCPU and 120 RAM) and LightFM
  • 19. risk analysis sorting recommendations sort 100k predictions per user with pure Python did not seem efficient
  • 20. risk analysis sorting recommendations sort 100k predictions per user with pure Python did not seem efficient concern on the feasibility of applying complex rules April 2020
  • 22. simplification initial numbers Initial numbers for precomputing recommendations: ● Total users: 6.5 mi ● Total content: 200k items ● User activity: 1 year ● Precompute frequency: 3 times a day ● Apply rules after precomputing April 2020
  • 23. simplification initial x agreed numbers Initial numbers for precomputing recommendations: ● Total users: 6.5 mi -> 3.5 mi ● Total content: 200k items -> 100k items ● User activity: 1 year -> 4 months ● Precompute frequency: 3 times a day -> 2 times a week ● Apply rules after precomputing -> before & after precomputing April 2020
  • 24. simplification non-personalised x personalised rules User activity data Content metadata Business Rules, part I - Non-personalised - Recency - Availability - Excluded Masterbrands - Excluded genres Business Rules, part II - Personalised - Already seen items - Local radio (if not consumed previously) - Specific language (if not consumed previously) - Episode picking from a series - Diversification (1 episode per brand/series) Precomputed recommendations Machine Learning model training Predict recommendations once Reduce content from 200k to 100k items 3.5 million times chose the 50 more relevant & editorial compatible items out of 100k items ranked by user May 2020
  • 25. simplification sorting ● Initial problem ○ for each user ■ sort 100k recommendations ● Simplified problem ○ for each user ■ calculate recommendations scores’ 90 percentile ■ discard items that have a score lower than P90 ■ sort ~ 10k items per user June 2020
  • 27. pipeline 1.0 goal ● A single Beam Pipeline to ○ precompute recommendations ○ apply rules June 2020
  • 28. pipeline 1.0 design & arguments August 2020 apache-beam[gcp]==2.15.0 --runner=DataflowRunner --machine-type = n1-standard-1 (1 vCPU & 3.75 GB RAM) --num_workers=10 --autoscaling_algorithm=NONE
  • 31. pipeline 1.0 error when running in dev & prod August 2020 Workflow failed. Causes: S05:Read non-cold start users/Read+Retrieve user ids+Predict+Keep best scores+Sort scores+Process predictions+Group activity history and recommendations/pair_with_recommendations+Group activity history and recommendations/GroupByKey/Reify+Group activity history and recommendations/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: beamapp-al-cht01-08141052-08140353-1tqj-harness-0k4v Root cause: The worker lost contact with the service., beamapp-al-cht01-08141052-08140353-1tqj-harness-0k4v Root cause: The worker lost contact with the service., beamapp-al-cht01-08141052-08140353-1tqj-harness-ffqv Root cause: The worker lost contact with the service., beamapp-al-cht01-08141052-08140353-1tqj-harness-cjht Root cause: The worker lost contact with the service.
  • 32. pipeline 1.0 data analysis August 2020
  • 33. 1. Change machine type to a larger one ○ --machine_type=custom-1-6656 (1 vCPU, 6.5 GB RM) - 6.5GB RAM /core ○ --machine_type=m1-ultramem-40 (40 vCPU, 961 GB RAM) - 24GB RAM/core (overkill) 2. Refactor the pipeline 3. Reshuffle => the cost to apply it did not justify the performance gain in the steps after ○ Shuffle service ○ Reshuffle function 4. Increase the amount of workers ○ --num_workers=40 pipeline 1.0 attempts to fix (i) September 2020
  • 34. 5. Control the parallelism in Dataflow so the VM wouldn’t starve out of memory pipeline 1.0 attempts to fix (ii) Worker node (VM) SDK Worker Harness Threads SDK Worker Harness Threads Worker node (VM) SDK Worker Harness Threads Worker node (VM) SDK Worker Harness Threads Harness Threads --number_of_worker_harness_threads=1 --experiments=use_runner_v2 (or) --sdk_worker_parallelism --experiments=no_use_multiple_sdk_containers --experiments=beam_fn_api September 2020
  • 35. pipeline 1.0 attempts to fix (iii) https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
  • 36. pipeline 1.0 attempts to fix (iii) https://twitter.com/tati_alchueyr/status/1301152715498758146 https://cloud.google.com/blog/products/data-analytics/ml-inference-in-dataflow-pipelines
  • 37. pipeline 1.0 attempts to fix (iii) https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
  • 38. pipeline 1.0 attempts to fix (iii) https://stackoverflow.com/questions/63705660/optimising-gcp-costs-for-a-memory-intensive-dataflow-pipeline
  • 39. pipeline 2.0 goal ● A single Beam Pipeline to ○ precompute recommendations ○ apply rules ● Which could process our production data June 2020
  • 40. pipeline 2.0 design & arguments apache-beam== 2.24 --runner=DataflowRunner --machine-type = custom-30-460800-ext --num_workers= 40 --autoscaling_algorithm=NONE September 2020
  • 41. pipeline 2.0 business outcomes ● +59% increase in interactions in Recommended for You rail ● +103% increase in interactions for under 35s internal external September 2020
  • 42. pipeline 2.0 issues ● but costs were high... £ 279.31 per run September 2020
  • 43. pipeline 2.0 issues OSError: [Errno 28] No space left on device During handling March 2021
  • 44. pipeline 2.0 issues If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB. March 2021
  • 45. pipeline 2.0 issues apache-beam== 2.24 --runner=DataflowRunner --machine-type = custom-30-460800-ext --num_workers= 40 --autoscaling_algorithm=NONE --experiments=shuffle_mode=appliance March 2021
  • 46. cost savings plan 1. Administer pain relief 2. Hook up to bypass 3. Heart surgery ➔ Attempt shared memory ➔ Attempt FlexRS ➔ Mid week delta (only compute mid week for users with activity since Sunday’s run) ➔ Split pipeline ➔ Major refactor ➔ SCANN vs LightFM.score() ➔ etc. Timebox: 1 week Timebox: 2 weeks Timebox: 1 month April 2021
  • 47. pipeline 3.0 design apache-beam== 2.24 --runner=DataflowRunner --machine-type = custom-30-460800-ext --num_workers= 40 --autoscaling_algorithm=NONE --experiments=shuffle_mode=appliance April 2021
  • 48. pipeline 3.0 shared memory & FlexRS strategy ● Used production-representative data (model, auxiliary data structures) ● Ran the pipeline for 0.5% users, so the iterations would be cheap ○ 100% users: £ 266.74 ○ 0.5% users: £ 80.54 ● Attempts ○ Shared model using custom-30-460800-ext (15 GB/vCPU) ○ Shared model using custom-30-299520-ext (9.75 GB/vCPU) ○ Shared model using custom-6-50688-ext (8.25 GB/vCPU) ■ 0.5% users: £ 18.46 => -77.5% cost reduction! May 2021
  • 49. pipeline 3.0 shared memory & FlexRS results ● However, when we tried to run the same pipeline for 100%, it would take hours and not complete. ● It was very inefficient and costed more than the initial implementation. May 2021
  • 50. pipeline 3.0 shared memory & FlexRS issue hypothesis Between steps, we were transferring lots of verbose data which could be shared across users. The version of Dataflow/Beam we were using didn’t seem to handle this gracefully May 2021
  • 51. pipeline 4.0 heart surgery ● Split compute predictions from applying rules ● Keep the interfaces to a minimal ○ between these two pipelines ○ between steps within the same pipeline June 2021
  • 52. pipeline 4.0 heart surgery June 2021 Top spender tasks Recommended for you 1.1 (Dataflow estimated task duration in days) Music Mixes 1.1 (Dataflow estimated task duration in days) Featured Comedy 1.1 (Dataflow estimated task duration in days) Enrich user activity 20 27 28 Predict 47 17 17 Apply rules 47 0.3 0.1 Predict 41% 38% 38% Rules-related costs 59% 62% 62% The rules could be applied in a much cheaper VM, allowing 60% of the processing to be reduced to approx. 1/10
  • 53. pipeline 4.1 precompute recommendations apache-beam== 2.29 --runner=DataflowRunner --machine-type = n1-highmem-16 --flexrs-goal = COST_OPTIMIZED --max-num-workers= 64 --number-of-worker-harness-threads=7 --experiments=use_runner_v2 + Batching + Shared memory + Flex RS https://cloud.google.com/blog/products/data-analytics/ml-inference-in-dataflow-pipelines July 2021
  • 54. pipeline 4.1 precompute recommendations Cost to run for 3.5 million users: ● 100k episodes: ~ £ 100.00 => £ 48.92 / run ● 300 episodes: ~ £ 60.00 => £ 3.40 ● 18 episodes: ~ £ 50.00 => £0.74 July 2021
  • 55. pipeline 4.2 apply business rules apache-beam== 2.29 --runner=DataflowRunner --machine-type = n1-standard-1 --experiments=use_runner_v2 + Implemented rules natively + Created minimal interfaces and views of the data July 2021
  • 56. pipeline 4.2 apply business rules Cost to run for 3.5 million users: ● £ 0.15 - £ 2.00 per run, savings: ○ £ 150.00 => £ 2.00 ○ £ 100.00 => £ 0.15 July 2021
  • 57. pipeline 4.0 heart surgery ● We were able to reduce the cost of the most expensive run of the pipeline from £ 279.31 per run to less than £ 50 ● Reduced the costs to -82% July 2021
  • 59. 1. plan based on your data 2. an expensive machine learning pipeline is better than none 3. reducing the scope is a good starting point to saving money ○ Apply non-personalised rules before iterating per user ○ Sort top 1k recommendations by user opposed to 100k 4. using custom machine types might limit other cost savings ○ Such as FlexRS (schedulable preemptible instances in Dataflow only work) 5. to use shared memory may not lead to cost savings 6. minimal interfaces lead to more predictable behaviours in Dataflow 7. splitting the pipeline can be a solution to costs takeaways