Accelerating ML using
Production Feature Engineering Platform
Dr. Venkata Pingali, Scribble Data
Acme Retail Inc Challenge: Put Models into Production
● Why is it so complex and expensive?
● How should Acme think about it?
● Where does production feature engineering fit in?
● What does the system look like? Why?
● Should Acme build one?
● Where should Acme start?
© Scribble Data 2019
Outline
● Production ML & Feature Engineering Overview
● Challenges & Required Capabilities
● Design considerations
● Implementation options
Production ML & Feature Engineering Overview
Production ML - Complex & Expensive
Source: https://eng.uber.com/michelangelo/ + Additional Credits: Vikas
Uber’s Michelangelo (5000 Models)
Other Examples: AirBnB BigHead, Stripe RailYard
Why the Complexity - Distribution of Challenges
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the
small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015
Drivers: Speed, Correctness, Evolution, Scale
Production ML - Emerging Generic Architecture
Source: https://databricks.com/session/scaling-ride-hailing-with-machine-learning-on-mlflow
Feature Engineering
Model Dev & Management
Model Deployment
GoJEK @ Spark AI Summit, April 2019
Feature Engineering - Nature
● Features are variables generated from data
○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (100s to 1000s) & evolving
● Frequently (re)computed
● Possibly auto-computed
Retail Customer (X GB):
Customer | SKU | Name
17826162 | 0293192 | Thai Dragon Fruit
Features (~X/1000):
Customer | Premium | Imported
17826162 | 15% of txns | 5% of spend
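As a rough illustration of how raw transactions shrink into a few per-customer features, the sketch below derives shares like the Premium and Imported columns above. The extra transactions, the `is_premium` and `is_imported` flags, and all field names are hypothetical. The deck recommends Pandas for this kind of work; plain Python is used here only to keep the sketch dependency-free.

```python
from collections import defaultdict

# Hypothetical raw transactions. Only the first SKU/name comes from the slide;
# the other rows, the amounts, and the two flags are assumptions.
transactions = [
    {"customer": "17826162", "sku": "0293192", "name": "Thai Dragon Fruit",
     "amount": 10.0, "is_premium": True, "is_imported": True},
    {"customer": "17826162", "sku": "1000001", "name": "Rice",
     "amount": 20.0, "is_premium": False, "is_imported": False},
    {"customer": "17826162", "sku": "1000002", "name": "Milk",
     "amount": 30.0, "is_premium": False, "is_imported": False},
    {"customer": "17826162", "sku": "1000003", "name": "Bread",
     "amount": 40.0, "is_premium": False, "is_imported": False},
]


def compute_features(rows):
    """Aggregate raw rows into per-customer features.

    premium_txn_share: fraction of transactions on premium items
    imported_spend_share: fraction of spend on imported items
    """
    totals = defaultdict(
        lambda: {"txns": 0, "premium_txns": 0, "spend": 0.0, "imported_spend": 0.0}
    )
    for row in rows:
        t = totals[row["customer"]]
        t["txns"] += 1
        t["premium_txns"] += int(row["is_premium"])
        t["spend"] += row["amount"]
        if row["is_imported"]:
            t["imported_spend"] += row["amount"]
    return {
        customer: {
            "premium_txn_share": t["premium_txns"] / t["txns"],
            "imported_spend_share": (
                t["imported_spend"] / t["spend"] if t["spend"] else 0.0
            ),
        }
        for customer, t in totals.items()
    }


features = compute_features(transactions)
```

Four raw rows collapse into one feature row per customer, which is the kind of size reduction the slide hints at.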
Feature Engineering - Classification
Model Integration \ Data Scale | Normal (TBs) | Hyper (PBs)
Loose | Most ML | ML at Scale (e.g., Uber’s Michelangelo)
Tight | Most DL | DL at Scale (e.g., Google TFX)
Dataset shape: Tall, Fat
Classes will expand in future
Feature Engineering Platform - Challenges & Capabilities
Feature Generation Process
Input (Data lake) → Feature Engineering Tooling → Output (Feature Store)
Concerns: Input Correctness, Feature Richness & Performance, Trust & Discovery
Organizational Context
● Expensive activity overall (people, compute)
● Limited engineering resources
● Churn in staff
● More models every day
● Variable quality of sources and data
Datalake → Feature Engineering Tooling → Feature Store
Capabilities
● Feature Richness & Performance
○ How do I compute features reliably?
○ How do I make it easier to specify features?
○ How do I make my features richer?
● Input correctness
● Trust and Discovery
Feature Richness - How do I compute features reliably?
Day 1 (v1):
Customer | Premium | Imported
17826162 | 15% of txns | 5% of spend
Day 2 (v1):
Customer | Premium | Imported
17826162 | 15% of txns | 5% of spend
Day 3 (v2):
Customer | Premium | impscore
17826162 | 15% of txns | 1
....
Exploding combinations: #pipelines x #runs x #versions
Feature Richness - How do I compute features reliably?
● Composable pipelines with state and audit management
○ Self-documenting modules
○ Versioned namespaces
○ Metadata collection
● Idempotent and reproducible
● Incremental and/or checkpointed computation
● Resource management
Recommendation: Pandas + DAG + State Management + Testing
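One way to read "idempotent", "checkpointed", and "versioned namespaces" together is sketched below. It is a minimal illustration, not the deck's implementation: `run_stage`, the directory layout, the file names, and the `add_impscore` transform are all hypothetical, and plain Python stands in for Pandas to avoid dependencies.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_stage(root, namespace, version, run_date, rows, transform):
    """Run one pipeline stage idempotently.

    Output lands in a versioned namespace: <root>/<namespace>/<version>/<run_date>/.
    If a completed checkpoint already exists for that key, the stage is skipped.
    """
    out_dir = Path(root) / namespace / version / run_date
    data_path = out_dir / "features.json"
    meta_path = out_dir / "metadata.json"
    if meta_path.exists():
        # Checkpoint hit: reuse the earlier output instead of recomputing.
        return json.loads(data_path.read_text()), "skipped"

    result = [transform(row) for row in rows]
    payload = json.dumps(result, sort_keys=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path.write_text(payload)
    # Metadata is written last, so its presence marks a completed run.
    meta_path.write_text(json.dumps({
        "namespace": namespace,
        "version": version,
        "run_date": run_date,
        "row_count": len(result),
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }))
    return result, "computed"


def add_impscore(row):
    # Hypothetical v2 transform: flag customers with any imported spend.
    return {**row, "impscore": 1 if row["imported_spend_share"] > 0 else 0}


workdir = tempfile.mkdtemp()
rows = [{"customer": "17826162", "imported_spend_share": 0.05}]
first, status_first = run_stage(workdir, "retail", "v2", "2019-03-04", rows, add_impscore)
second, status_second = run_stage(workdir, "retail", "v2", "2019-03-04", rows, add_impscore)
```

Keying the output path on namespace, version, and run date keeps the #pipelines x #runs x #versions combinations apart on disk, and the stored checksum gives an audit hook.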
Feature Richness - How do I easily specify features?
● Features developed in parallel, large in number
○ Tedious, buggy process
● Metadata required to manage lifecycle of features
● Standardize the schema, language and implementation
● Used on both batch and realtime paths
Recommendation: Feature DSL
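A feature DSL can be as small as a declarative spec plus one evaluator shared by the batch and realtime paths. The sketch below assumes a made-up spec format: the field names (`op`, `numerator`, `cutoff`, and so on), the two operations, and the owner names are hypothetical and not taken from any particular product.

```python
# Hypothetical declarative specs: each feature carries lifecycle metadata
# (owner, version, status) plus a simple computation rule.
FEATURE_SPECS = [
    {
        "name": "premium_txn_share",
        "owner": "retail-ds",
        "version": "v1",
        "status": "active",
        "op": "ratio",
        "numerator": "premium_txns",
        "denominator": "txns",
    },
    {
        "name": "impscore",
        "owner": "retail-ds",
        "version": "v2",
        "status": "active",
        "op": "threshold",
        "source": "imported_spend_share",
        "cutoff": 0.0,
    },
    {
        "name": "legacy_score",
        "owner": "retail-ds",
        "version": "v1",
        "status": "deprecated",
        "op": "threshold",
        "source": "imported_spend_share",
        "cutoff": 0.5,
    },
]


def _ratio(spec, record):
    denom = record[spec["denominator"]]
    return record[spec["numerator"]] / denom if denom else 0.0


def _threshold(spec, record):
    return 1 if record[spec["source"]] > spec["cutoff"] else 0


# Standardized implementations: a spec can only name an operation listed here.
OPS = {"ratio": _ratio, "threshold": _threshold}


def evaluate(specs, record):
    """Realtime path: evaluate every active spec against one record."""
    return {
        s["name"]: OPS[s["op"]](s, record)
        for s in specs
        if s["status"] == "active"
    }


def evaluate_batch(specs, records):
    """Batch path: reuse the same evaluator so both paths agree."""
    return [evaluate(specs, r) for r in records]


record = {"premium_txns": 3, "txns": 20, "imported_spend_share": 0.05}
single = evaluate(FEATURE_SPECS, record)
batch = evaluate_batch(FEATURE_SPECS, [record])
```

Because features are data rather than code, the same specs can also feed documentation, discovery, and lifecycle tooling.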
Feature Richness - How do I make my features richer?
Customer | SKU | Name
17826162 | 0293192 | Thai Dragon Fruit
Labels: Acquired taste, Affluence, International exposure, Willingness to Experiment
Feature Richness - How do I make my features richer?
● Common iterative activity across use cases and dimensions
● Complexity varies a lot
○ Single/multiple labels, maker-checker (untrusted), simple text vs boxes, sensitive vs safe data
● Multiple options depending on scale
○ Excel/Light-weight in-house/Third-party (LabelBox)
Recommendation: Plan for a labeling service or integration with a third party
Capabilities
● Feature Richness & Performance
● Trust and Discovery
○ How do I audit/validate the computation?
○ How do I discover and reuse features?
● Input correctness
Trust - How do I audit/validate the computation?
● Feature engineering code should be assumed to be buggy
● Datasets multiply (10 pipelines = 3.6K files/year)
● Validate in multiple places & ways
○ Pipeline stages, visualization, audit interface
● Include statistical evaluation
○ Distributions, drift over time & space
Recommendation: Pre/Post computation checks, Easy checking interfaces
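The pre/post checks and the statistical evaluation can start very small. The sketch below is one possible shape, with hypothetical function names, sample values, and thresholds; the drift measure (shift in mean, in units of baseline standard deviation) is a deliberately crude stand-in for a proper distribution test.

```python
import statistics


def precheck(rows, required):
    """Pre-computation check: indexes of rows missing a required field."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(field) is None for field in required)
    ]


def postcheck(values, low, high):
    """Post-computation check: feature values outside the expected range."""
    return [v for v in values if not (low <= v <= high)]


def drift(baseline, current, tolerance):
    """Crude drift signal: mean shift measured in baseline standard deviations."""
    spread = statistics.pstdev(baseline) or 1.0
    shift = abs(statistics.mean(current) - statistics.mean(baseline)) / spread
    return shift, shift > tolerance


# Hypothetical inputs and feature values.
rows = [
    {"customer": "17826162", "amount": 12.5},
    {"customer": None, "amount": 3.0},
]
bad_rows = precheck(rows, ["customer", "amount"])
out_of_range = postcheck([0.15, 0.05, 1.4], 0.0, 1.0)
shift, drifted = drift(
    baseline=[0.10, 0.12, 0.11, 0.13],
    current=[0.30, 0.32, 0.31, 0.29],
    tolerance=3.0,
)
```

Running the same checks at every pipeline stage, and surfacing the results in an audit interface, is what makes buggy feature code visible early.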
Discovery - How do I reuse features?
Per-pipeline feature lists (Pipeline1, Pipeline2, ..., Pipeline3), each of the form:
Premium | P4 | 2019-02-02
Impscore | P3 | 2019-03-04
... | ... | ...
1000s expected: #pipelines x #features
Discovery - How do I reuse features?
● Feature computation is expensive
○ Encourage discovery & reuse
● Multiple ways - marketplace, filter & export, new request
● Necessary capabilities:
○ Standardization of data - representation, location
○ Feature priority, ownership, and status
○ Namespaces
Recommendation: Marketplace, DB, Search, Wiki page
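A searchable registry is the smallest version of the "DB + Search" recommendation. The sketch below is illustrative only: the entry fields, the owners, and the statuses are hypothetical, and reading the slide's "P4"/"P3" values as feature priority is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntry:
    namespace: str
    name: str
    pipeline: str
    priority: str  # assumption: "P4"/"P3" on the slide denote priority
    owner: str
    status: str
    updated: str

    @property
    def qualified_name(self):
        # Namespaces keep same-named features from different teams apart.
        return f"{self.namespace}.{self.name}"


# Hypothetical registry content.
REGISTRY = [
    FeatureEntry("retail", "premium", "Pipeline1", "P4", "team-a", "active", "2019-02-02"),
    FeatureEntry("retail", "impscore", "Pipeline1", "P3", "team-a", "active", "2019-03-04"),
    FeatureEntry("loyalty", "premium", "Pipeline2", "P4", "team-b", "deprecated", "2019-02-02"),
]


def search(registry, text="", status=None, namespace=None):
    """Filter entries by name substring and, optionally, status and namespace."""
    needle = text.lower()
    return [
        entry for entry in registry
        if needle in entry.name.lower()
        and (status is None or entry.status == status)
        and (namespace is None or entry.namespace == namespace)
    ]


all_premium = search(REGISTRY, "prem")
active_premium = [e.qualified_name for e in search(REGISTRY, "prem", status="active")]
```

The same records could back a wiki page or a marketplace view; the point is that ownership, priority, and status are stored next to the feature name.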
Capabilities
● Feature Richness & Performance
● Trust and Discovery
● Input correctness
○ Is my data feed complete and correct?
○ Am I interpreting my feed correctly?
○ How do I manage third-party datasets?
Input Correctness - Is my data feed complete and correct?
● Data quality has a large impact on models
○ Gaps, duplicates, exceptions, invalid values
● Poor data quality wastes resources and invalidates models
● Continuous process to improve ML operations over time
● Extensibility required due to context dependency
bdate | flag | ean
2019 | 0000 | ac0291212
2021 | xxx |
Recommendation: Continuous health check/early warning system
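A health check for the sample feed above might look like the sketch below. The rules are assumptions made for illustration (a four-digit `flag`, a non-empty `ean`, a `bdate` no later than a reference year); real rules are context dependent, which is why the deck asks for extensibility.

```python
import re


def check_feed(rows, max_year):
    """Return (row_index, problem) pairs covering gaps, duplicates, invalid values."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        key = (row.get("bdate"), row.get("flag"), row.get("ean"))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
        if not row.get("ean"):
            problems.append((i, "missing ean"))
        if not re.fullmatch(r"\d{4}", row.get("flag") or ""):
            problems.append((i, "invalid flag"))
        bdate = row.get("bdate") or ""
        if not bdate.isdigit() or int(bdate) > max_year:
            problems.append((i, "invalid bdate"))
    return problems


# First two rows mirror the slide; the third is an added duplicate.
feed = [
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"},
    {"bdate": "2021", "flag": "xxx", "ean": None},
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"},
]
report = check_feed(feed, max_year=2019)
```

Run on every delivery, a report like this works as an early warning before bad rows reach feature computation.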
Input Correctness - Am I interpreting my feed correctly?
● Important if legacy systems are involved
● Risk of semantic mismatch
● Ability to keep adding information
● Avoid documenting in pipeline code - it can't be extracted easily
● Pipeline uses API to pull documentation and checks
bdate | flag | ean
2019 | 0000 | ac0291212
Recommendation: Light-weight catalog (excel, wiki, Atlas, programmatic)
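The idea of keeping column documentation and checks outside the pipeline, and pulling them through an API, can be sketched as below. Everything here is hypothetical: the catalog format, the column descriptions (the deck does not say what `bdate`, `flag`, or `ean` mean), and the patterns. An inline JSON string stands in for a spreadsheet, wiki, or catalog service.

```python
import json
import re

# Hypothetical catalog content. The descriptions are guesses for illustration;
# pinning down the real meaning is exactly what the catalog is for.
CATALOG_JSON = """
{
  "bdate": {"description": "Billing date (year)", "pattern": "^[0-9]{4}$"},
  "flag":  {"description": "Legacy status flag",  "pattern": "^[0-9]{4}$"},
  "ean":   {"description": "Product barcode",     "pattern": "^[a-z0-9]+$"}
}
"""


def load_catalog(text):
    """Stand-in for an API call that fetches the catalog."""
    return json.loads(text)


def describe(catalog, column):
    """Documentation lookup, usable by pipelines and by people."""
    return catalog[column]["description"]


def validate_row(catalog, row):
    """Return the columns whose values fail the catalog's documented pattern."""
    return [
        col for col, value in row.items()
        if col in catalog and not re.fullmatch(catalog[col]["pattern"], str(value))
    ]


catalog = load_catalog(CATALOG_JSON)
failures = validate_row(catalog, {"bdate": "2019", "flag": "xxx", "ean": "ac0291212"})
```

Because the catalog is data, new information can be added without touching pipeline code.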
Input Correctness - How do I integrate third-party datasets?
● Discovery, Lineage & ownership tracking
○ Datasets are often bought, e.g., surveys
● Support for pre-processing to enable fast & safe ingestion
○ Partitioning, anonymization
● Support for open dataset repository/online data services
● Programmatic, mediated access
○ Security, usage logging
Recommendation: Dataset management service/capability
Feature Engineering Platform - How to Decide
Economics of Feature Engineering
● Features have a price
○ Price paid every day
○ Amortization happens over time & across models
● Operations costs grow non-linearly with feature count
● Addition/deletion cost is high
○ Triggers recomputation of intermediate state
● Implicit & explicit dependencies get added over time
○ Dependencies are often untracked
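Tracking dependencies explicitly makes the cost of a change visible before it is made. The sketch below assumes a hypothetical dependency map (all node names are invented) and lists everything that would need recomputation when one input or feature changes.

```python
from collections import deque

# Hypothetical dependency map: each key is computed from the listed inputs.
DEPENDS_ON = {
    "txn_clean": ["raw_txns"],
    "premium_txn_share": ["txn_clean"],
    "imported_spend_share": ["txn_clean"],
    "impscore": ["imported_spend_share"],
    "churn_model_inputs": ["premium_txn_share", "impscore"],
}


def downstream(depends_on, changed):
    """Return everything that must be recomputed when `changed` changes."""
    # Invert the map: for each input, who consumes it?
    consumers = {}
    for node, inputs in depends_on.items():
        for src in inputs:
            consumers.setdefault(src, []).append(node)

    # Breadth-first walk over consumers collects the transitive impact.
    affected, queue = set(), deque([changed])
    while queue:
        for nxt in consumers.get(queue.popleft(), []):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return sorted(affected)


impact_of_feature = downstream(DEPENDS_ON, "imported_spend_share")
impact_of_source = downstream(DEPENDS_ON, "raw_txns")
```

A change near the source touches every intermediate state, which is one reason addition and deletion costs are high.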
Questions to Ask
● How many models will I have over time? (Speed, Correctness)
● How many features will they need? (Performance)
● How complex are the features? (Performance)
● Is there any feature reuse across models? (Performance)
● How defensible should they be? (Correctness)
● How available should they be? (Robustness, Correctness)
● How big is my dataset? (Performance)
Feature Engineering Platform - Implementation
Approaches
● FEAST (Go-JEK)
○ First open-source feature store
○ Two major assumptions: “Tall” datasets, GCP
● Build In-house
○ Atlas (Catalog) + LabelBox (labeling) + Airflow (pipeline) + …
○ Pro: Fine-grained control
○ Cons: Frequency mismatch - too high/too light
● Third-party (Scribble Enrich)
○ Coherent design
○ Pro: Out-of-the-box capabilities, fills gaps
○ Cons: Commercial product limitations, few options
Approaches
Comparison criteria: Build Time, Extensibility, OSS, Cloud Lock-in, Tech Stack, Suitability
FEAST: Tech Stack: Java; Suitability: Large narrow datasets, GCP
In-house: Depends on context
Enrich: Tech Stack: Python; Suitability: Limited engineering bandwidth
More coming: Realtime, Edge, Media, Privacy-sensitive
Inspiration
● Uber - Michelangelo
● AirBnB - BigHead
● Go-JEK - ML Platform
● LinkedIn - Pro-ML
Takeaways
● Production data science is complex and expensive
● Data-science-informed architectural thinking is required
● Production feature engineering is in its early stages
○ Will grow rapidly in the coming years
● Organizations will be building FE platforms
○ Same reasons - Speed, Correctness, Evolution, Scale
Questions?
DENVER: Littleton
BANGALORE: Indiranagar | HSR
hello@scribbledata.io
Scribble Data
Accelerated Production ML Engineering

Accelerating ML using Production Feature Engineering

  • 1. 1 Accelerating ML using Production Feature Engineering Platform Dr. Venkata Pingali, Scribble Data
  • 2. ● Why is it so complex and expensive? ● How should Acme think about it? ● Where does production feature engineering fit in? ● How does the system look? Why? ● Should Acme build one? ● Where should Acme start? Acme Retail Inc Challenge: Put Models into Production © Scribble Data 2019 2 Acme Retail Inc
  • 3. Outline © Scribble Data 2019 ● Production ML & Feature Engineering Overview ● Challenges & Required Capabilities ● Design considerations ● Implementation options 3
  • 4. Production ML & Feature Engineering Overview 4 © Scribble Data 2019
  • 5. Production ML - Complex & Expensive © Scribble Data 2019 https://eng.uber.com/michelangelo/ + Additional Credits: Vikas 5 Uber’s Michelangelo 5000 Models Other Examples: AirBnB BigHead Stripe RailYard
  • 6. Why the Complexity - Distribution of Challenges © Scribble Data 2019https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf 6 “Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.” Paper from Google - NeurIPS 2015 Drivers: Speed, Correctness, Evolution, Scale
  • 7. Production ML - Emerging Generic Architecture © Scribble Data 2019 7https://databricks.com/session/scaling-ride-hailing-with-machine-learning-on-mlflow Feature Engineering Model Dev & Management Model Deployment GoJEK @ Spark AI Summit, April 2019
  • 8. Feature Engineering - Nature © Scribble Data 2019 ● Features are variables generated from data ○ Continuous process (Batch + Near Realtime + Realtime) ● Large in number (‘00s to ‘000s) & evolving ● Frequently (re)computed ● Possibly auto-computed Customer SKU Name 17826162 0293192 Thai Dragon Fruit Customer Premium Imported 17826162 15% of txns 5% of spend Retail Customer (X GB) Features (~X/1000) 8
  • 9. Feature Engineering - Classification © Scribble Data 2019 9 Normal (TBs) Hyper (PBs) Loose Most ML ML at Scale (e.g., Uber’s Michaelangelo) Tight Most DL DL at Scale (e.g., Google TFX) Data Scale Model Integration
  • 10. Feature Engineering - Classification © Scribble Data 2019 10 Normal (TBs) Hyper (PBs) Loose Most ML ML at Scale (e.g., Uber’s Michaelangelo) Tight Most DL DL at Scale (e.g., Google TFX) Data Scale Model Integration Tall Fat Classes will expand in future
  • 11. Feature Engineering Platform - Challenges & Capabilities 11 © Scribble Data 2019
  • 12. Feature Generation Process © Scribble Data 2019 12 Input (Data lake) Output (Feature Store) Feature Engineering Tooling Input Correctness Feature Richness & Performance Trust & Discovery
  • 13. Organizational Context © Scribble Data 2019 13 ● Expensive activity overall (people, compute) ● Limited engineering resources ● Churn in staff ● More models everyday ● Variable quality of sources, data quality Datalake Feature Store Feature Engineering Tooling
  • 14. Capabilities © Scribble Data 2019 14 ● Feature Richness & Performance ○ How do I compute features reliably? ○ How do I make it easier to specify features? ○ How do I make my features richer? ● Input correctness ● Trust and Discovery
  • 15. Feature Richness - How do I compute features reliably? © Scribble Data 2019 15 Customer Premium Imported 17826162 15% of txns 5% of spend Customer Premium Imported 17826162 15% of txns 5% of spend Customer Premium impscore 17826162 15% of txns 1 Day 1 v 1 Day 2 v 1 Day 3 v 2 .... Exploding combinations: #pipelines x #runs x #versions
  • 16. Feature Richness - How do I compute features reliably? © Scribble Data 2019 16 ● Composable pipelines with state and audit management ○ Self-documenting modules ○ Versioned namespaces ○ Metadata collection ● Idempotent and reproducible ● Incremental computation and/or check-pointed ● Resource management Recommendation: Pandas + DAG + State Management + Testing
  • 17. Feature Richness - How do I easily specify features? © Scribble Data 2019 17
  • 18. Feature Richness - How do I easily specify features? © Scribble Data 2019 18 ● Features developed in parallel, large in number ○ Tedious, buggy process ● Metadata required to manage lifecycle of features ● Standardize the schema, language and implementation ● Used on both batch and realtime paths Recommendation: Feature DSL Spec
  • 19. Feature Richness - How do I make my features richer? Example record: Customer 17826162, SKU 0293192, Name Thai Dragon Fruit. Labels: Acquired taste, Affluence, International exposure, Willingness to Experiment. Recommendation: Plan for labeling service/integration with thirdparty
  • 20. Feature Richness - How do I make my features richer? ● Common iterative activity for various use cases, dimensions ● Complexity varies a lot ○ Single/multiple labels, Maker-checker (untrusted), simple text vs boxes, sensitive vs safe data ● Multiple options depending on scale ○ Excel/Light-weight inhouse/Thirdparty (LabelBox) Recommendation: Plan for labeling service/integration with thirdparty
  • 21. Capabilities ● Feature Richness & Performance ● Trust and Discovery ○ How do I audit/validate the computation? ○ How do I discover and reuse features? ● Input correctness
  • 22. Trust - How do I audit/validate the computation?
  • 23. Trust - How do I audit/validate the computation? ● Feature engineering code should be assumed to be buggy ● Datasets multiply (10 pipelines = 3.6K files/year) ● Multiple places & ways ○ Pipeline stages, visualization, audit interface ● Include statistical evaluation ○ Distributions, drift over time & space Recommendation: Pre/Post computation checks, Easy checking interfaces
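The post-computation checks and drift evaluation on slide 23 can be sketched as two small functions. This is an illustration under assumptions: `check_feature` and `mean_shift` are hypothetical helpers, and measuring drift as the shift of the mean in baseline standard deviations is one simple choice among many, not the method prescribed by the talk.

```python
from statistics import mean, pstdev


def check_feature(values, lo, hi, max_null_frac=0.0):
    """Post-computation check: return a list of problems found in one feature column."""
    problems = []
    nulls = sum(v is None for v in values)
    if values and nulls / len(values) > max_null_frac:
        problems.append(
            f"null fraction {nulls / len(values):.2f} exceeds {max_null_frac:.2f}"
        )
    out_of_range = [v for v in values if v is not None and not lo <= v <= hi]
    if out_of_range:
        problems.append(f"{len(out_of_range)} value(s) outside [{lo}, {hi}]")
    return problems


def mean_shift(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline standard deviations."""
    spread = pstdev(baseline)
    if spread == 0:
        return 0.0 if mean(current) == mean(baseline) else float("inf")
    return abs(mean(current) - mean(baseline)) / spread


# A share-of-spend feature should lie in [0, 1] and contain no nulls.
problems = check_feature([0.1, 0.2, None, 1.5], lo=0, hi=1)

# Compare yesterday's distribution with today's.
shift = mean_shift([0.10, 0.20, 0.30], [0.40, 0.50, 0.60])
```

A pipeline stage could refuse to publish a feature when `problems` is non-empty or when `shift` exceeds a threshold agreed with the model owners.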
  • 24. Discovery - How do I reuse features? Premium P4 2019-02-02 Impscore P3 2019-03-04 ... ... ... Premium P4 2019-02-02 Impscore P3 2019-03-04 ... ... ... Premium P4 2019-02-02 Impscore P3 2019-03-04 ... ... ... 1000s expected #pipelines x # features Pipeline1 Pipeline2 .... Pipeline3
  • 25. Discovery - How do I reuse features?
  • 26. Discovery - How do I reuse features? ● Feature computation is expensive ○ Encourage discovery & reuse ● Multiple ways - marketplace, filter & export, new request ● Necessary capabilities: ○ Standardization of data - representation, location ○ Feature priority, ownership, and status ○ Namespaces Recommendation: Marketplace, DB, Search, Wiki page
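The capabilities listed on slide 26 (namespaces, priority, ownership, status, search) can be illustrated with a tiny in-memory registry. Everything here is a hypothetical sketch: the class names, the field layout, and the reading of "P4"/"P3" from slide 24 as priority levels are assumptions, and a production system would back this with a database or search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntry:
    namespace: str
    name: str
    owner: str
    priority: str
    status: str
    updated: str  # ISO date string


class FeatureRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add a feature; a duplicate (namespace, name) signals a reuse opportunity."""
        key = (entry.namespace, entry.name)
        if key in self._entries:
            raise ValueError(f"{entry.namespace}/{entry.name} already registered")
        self._entries[key] = entry

    def search(self, text="", namespace=None, status=None):
        """Case-insensitive substring search with optional namespace/status filters."""
        hits = [
            e for e in self._entries.values()
            if text.lower() in e.name.lower()
            and (namespace is None or e.namespace == namespace)
            and (status is None or e.status == status)
        ]
        return sorted(hits, key=lambda e: (e.namespace, e.name))


registry = FeatureRegistry()
registry.register(
    FeatureEntry("pipeline1", "premium", "team-a", "P4", "active", "2019-02-02"))
registry.register(
    FeatureEntry("pipeline1", "impscore", "team-a", "P3", "active", "2019-03-04"))
registry.register(
    FeatureEntry("pipeline2", "premium", "team-b", "P4", "deprecated", "2019-02-02"))

active_matches = registry.search("prem", status="active")
```

Searching before building a new feature is what turns the registry into a reuse mechanism; the duplicate check in `register` is a blunt reminder of that.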
  • 27. Capabilities ● Feature Richness & Performance ● Trust and Discovery ● Input correctness ○ Is my data feed complete and correct? ○ Am I interpreting my feed correctly? ○ How do I manage thirdparty datasets?
  • 28. Input Correctness - Is my data feed complete and correct? Sample feed with columns bdate, flag, ean; values shown: 2019, 0000, ac0291212; 2021, xxx
  • 29. Input Correctness - Is my data feed complete and correct? ● Data quality has a large impact on models ○ Gaps, duplicates, exceptions, invalid values ● Poor data quality wastes resources and invalidates models ● Continuous process to improve ML operations over time ● Extensibility required due to context dependency Sample feed with columns bdate, flag, ean; values shown: 2019, 0000, ac0291212; 2021, xxx Recommendation: Continuous health check/early warning system
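A continuous health check of the kind recommended on slide 29 might look for the three problems the slide names: duplicates, gaps, and invalid values. The sketch below assumes a daily feed of row dictionaries; `feed_health`, the field names, and the digit-only rule for `ean` are hypothetical choices for illustration. Validators are passed in as functions because, as the slide notes, the checks depend on context.

```python
from collections import Counter
from datetime import date, timedelta


def feed_health(rows, key, date_field, validators):
    """Report duplicate keys, missing calendar days, and invalid field values."""
    report = {"duplicates": [], "missing_dates": [], "invalid": []}

    counts = Counter(r[key] for r in rows)
    report["duplicates"] = sorted(k for k, n in counts.items() if n > 1)

    days = sorted({r[date_field] for r in rows})
    if days:
        present = set(days)
        span = (days[-1] - days[0]).days + 1
        report["missing_dates"] = [
            days[0] + timedelta(days=offset)
            for offset in range(span)
            if days[0] + timedelta(days=offset) not in present
        ]

    for index, row in enumerate(rows):
        for field, is_valid in validators.items():
            if not is_valid(row.get(field)):
                report["invalid"].append((index, field))
    return report


rows = [
    {"txn_id": "t1", "day": date(2019, 3, 1), "ean": "0293192"},
    {"txn_id": "t2", "day": date(2019, 3, 2), "ean": "ac0291212"},
    {"txn_id": "t2", "day": date(2019, 3, 4), "ean": None},
]
validators = {"ean": lambda v: isinstance(v, str) and v.isdigit()}
report = feed_health(rows, key="txn_id", date_field="day", validators=validators)
```

Run on every delivery, a report like this can feed an early-warning alert before any feature computation starts.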
  • 30. Input Correctness - Am I interpreting my feed correctly?
  • 31. Input Correctness - Am I interpreting my feed correctly? ● Important if legacy systems are involved ● Risk of semantic mismatch ● Ability to keep adding information ● Avoid documenting in pipeline code - can't be extracted easily ● Pipeline uses API to pull documentation and checks Sample feed columns: bdate, flag, ean; values shown: 2019, 0000, ac0291212 Recommendation: Light-weight catalog (excel, wiki, Atlas, programmatic)
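Slide 31 suggests keeping column documentation and checks in a catalog that the pipeline reads, rather than burying them in pipeline code. A minimal sketch, assuming the catalog is served as JSON: the descriptions and patterns below are invented for illustration (the talk does not say what `bdate`, `flag`, or `ean` mean), and `validate_row` is a hypothetical helper.

```python
import json
import re

# Stand-in for a catalog fetched over an API; the contents are assumptions.
CATALOG = json.loads("""
{
  "bdate": {"description": "four-digit year", "pattern": "[0-9]{4}"},
  "flag":  {"description": "four binary digits", "pattern": "[01]{4}"},
  "ean":   {"description": "numeric article code", "pattern": "[0-9]{7,13}"}
}
""")


def validate_row(row, catalog):
    """Check each column against the catalog; undocumented columns are errors too."""
    errors = []
    for column, value in row.items():
        entry = catalog.get(column)
        if entry is None:
            errors.append(f"{column}: not documented in catalog")
        elif not re.fullmatch(entry["pattern"], str(value)):
            errors.append(f"{column}: {value!r} is not a {entry['description']}")
    return errors


first_errors = validate_row(
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"}, CATALOG)
second_errors = validate_row({"bdate": "2021", "flag": "xxx"}, CATALOG)
```

Because the rules live in data, an analyst can add or correct an interpretation in the catalog without touching the pipeline.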
  • 32. Input Correctness - How do I integrate thirdparty datasets? ● Discovery, Lineage & ownership tracking ○ Datasets are often bought, e.g., surveys ● Support for pre-processing to enable fast & safe ingestion ○ Partitioning, anonymization ● Support for open dataset repository/online data services ● Programmatic, mediated access ○ Security, usage logging Recommendation: Dataset management service/capability
  • 33. Feature Engineering Platform - How to Decide
  • 34. Economics of Feature Engineering ● Features have a price ○ Price paid every day ○ Amortization happens over time & across models ● Operations costs grow non-linearly with feature count ● Addition/deletion cost is high ○ Triggers recomputation of intermediate state ● Implicit & explicit dependencies get added over time ○ Dependencies are often untracked
  • 35. Questions to Ask ● How many models will I have over time? (Speed, Correctness) ● How many features will they need? (Performance) ● How complex are the features? (Performance) ● Is there any feature reuse across models? (Performance) ● How defensible should they be? (Correctness) ● How available should they be? (Robustness, Correctness) ● How big is my dataset? (Performance)
  • 36. Feature Engineering Platform - Implementation
  • 37. Approaches ● FEAST (Go-JEK) ○ First open-source feature store ○ Two major assumptions: “Tall” datasets, GCP ● Build Inhouse ○ Atlas (Catalog) + LabelBox (labeling) + Airflow (pipeline) + … ○ Pro: Fine-grained control ○ Cons: Frequency mismatch - too high/too light ● Thirdparty (Scribble Enrich) ○ Coherent design ○ Pro: Out-of-the-box capabilities, fills gaps ○ Cons: Commercial product limitations, few options
  • 38. Approaches. Comparison columns: Build Time, Extensibility, OSS, Cloud Lockin, Tech Stack, Suitability. FEAST: Java; Large narrow datasets, GCP. Inhouse: Depends on context. Enrich: Python; Limited engg bandwidth. More coming: Realtime, Edge, Media, Privacy-sensitive
  • 39. Inspiration ● Uber - Michelangelo ● AirBnB - BigHead ● Go-JEK - ML Platform ● LinkedIn - Pro-ML
  • 40. Takeaways ● Production data science is complex and expensive ● Data science-informed architectural thinking required ● Production feature engineering in early stages ○ Will grow rapidly in the coming years ● Organizations will be building a FE platform ○ Same reasons - Speed, Correctness, Evolution, Scale
  • 41. Questions? Scribble Data: Accelerated Production ML Engineering. Denver (Littleton), Bangalore (Indiranagar | HSR). hello@scribbledata.io