Accelerating ML using
Production Feature Engineering Platform
Dr. Venkata Pingali, Scribble Data
Acme Retail Inc Challenge: Put Models into Production
● Why is it so complex and expensive?
● How should Acme think about it?
● Where does production feature engineering fit in?
● How does the system look? Why?
● Should Acme build one?
● Where should Acme start?
© Scribble Data 2019
Outline
● Production ML & Feature Engineering Overview
● Challenges & Required Capabilities
● Design considerations
● Implementation options
Production ML & Feature Engineering Overview
Production ML - Complex & Expensive
Uber’s Michelangelo: 5000 Models
Other Examples: AirBnB BigHead, Stripe RailYard
Source: https://eng.uber.com/michelangelo/ + Additional Credits: Vikas
Why the Complexity - Distribution of Challenges
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015 (https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
Drivers: Speed, Correctness, Evolution, Scale
Production ML - Emerging Generic Architecture
Feature Engineering | Model Dev & Management | Model Deployment
GoJEK @ Spark AI Summit, April 2019 (https://databricks.com/session/scaling-ride-hailing-with-machine-learning-on-mlflow)
Feature Engineering - Nature
● Features are variables generated from data
○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (100s to 1000s) & evolving
● Frequently (re)computed
● Possibly auto-computed
Retail Customer (X GB):
Customer    SKU       Name
17826162    0293192   Thai Dragon Fruit

Features (~X/1000):
Customer    Premium       Imported
17826162    15% of txns   5% of spend
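A minimal sketch of how features such as the "Premium" share of transactions and the "Imported" share of spend might be derived from raw customer rows. The transaction fields (`amount`, `premium`, `imported`) and all values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw transactions; field names and values are assumptions.
TRANSACTIONS = [
    {"customer": "17826162", "amount": 5.0, "premium": True, "imported": True},
    {"customer": "17826162", "amount": 45.0, "premium": False, "imported": False},
    {"customer": "17826162", "amount": 30.0, "premium": False, "imported": False},
    {"customer": "17826162", "amount": 20.0, "premium": False, "imported": False},
]


def customer_features(rows):
    """Return per-customer shares: premium by transaction count, imported by spend."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer"]].append(row)

    features = {}
    for customer, txns in grouped.items():
        total_spend = sum(t["amount"] for t in txns)
        premium_txns = sum(1 for t in txns if t["premium"])
        imported_spend = sum(t["amount"] for t in txns if t["imported"])
        features[customer] = {
            "premium_txn_share": premium_txns / len(txns),
            "imported_spend_share": imported_spend / total_spend if total_spend else 0.0,
        }
    return features


FEATURES = customer_features(TRANSACTIONS)
```

Many raw rows collapse into one small feature row per customer, which is the reduction the slide describes.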
Feature Engineering - Classification
                     Data Scale
Model Integration    Normal (TBs)    Hyper (PBs)
Loose                Most ML         ML at Scale (e.g., Uber’s Michelangelo)
Tight                Most DL         DL at Scale (e.g., Google TFX)
Dataset shapes: Tall, Fat
Classes will expand in future
Feature Engineering Platform - Challenges & Capabilities
Feature Generation Process
Input (Data lake) -> Feature Engineering Tooling -> Output (Feature Store)
Capability areas: Input Correctness, Feature Richness & Performance, Trust & Discovery
Organizational Context
● Expensive activity overall (people, compute)
● Limited engineering resources
● Churn in staff
● More models every day
● Variable quality of sources, data quality
Datalake -> Feature Engineering Tooling -> Feature Store
Capabilities
● Feature Richness & Performance
○ How do I compute features reliably?
○ How do I make it easier to specify features?
○ How do I make my features richer?
● Input correctness
● Trust and Discovery
Feature Richness - How do I compute features reliably?
Day 1, v1:
Customer    Premium       Imported
17826162    15% of txns   5% of spend

Day 2, v1:
Customer    Premium       Imported
17826162    15% of txns   5% of spend

Day 3, v2:
Customer    Premium       impscore
17826162    15% of txns   1
....
Exploding combinations: #pipelines x #runs x #versions
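The growth can be made concrete with a little arithmetic. The sketch below assumes one run per pipeline per day, which is an illustrative assumption rather than a figure from the talk.

```python
def artifact_count(pipelines, runs_per_year, versions):
    """Number of distinct feature outputs to track in one year."""
    return pipelines * runs_per_year * versions


# Assumed workload: 10 pipelines, each run once a day.
ONE_VERSION = artifact_count(pipelines=10, runs_per_year=365, versions=1)
THREE_VERSIONS = artifact_count(pipelines=10, runs_per_year=365, versions=3)
```

Under these assumptions a single version already yields 3,650 outputs a year, and keeping three versions alive triples that.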
Feature Richness - How do I compute features reliably?
● Composable pipelines with state and audit management
○ Self-documenting modules
○ Versioned namespaces
○ Metadata collection
● Idempotent and reproducible
● Incremental and/or checkpointed computation
● Resource management
Recommendation: Pandas + DAG + State Management + Testing
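A minimal sketch of the DAG + state management idea, written in plain Python (rather than Pandas) to stay self-contained. The `Pipeline` class, step names, and data are hypothetical. Each step is fingerprinted by its inputs, so re-running with unchanged inputs reuses the checkpoint and appends to an audit trail instead of recomputing.

```python
import hashlib
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Pipeline:
    """Hypothetical composable pipeline with checkpointed state and an audit log."""

    def __init__(self):
        self.steps = {}   # step name -> (function, dependency names)
        self.state = {}   # step name -> (input fingerprint, cached output)
        self.audit = []   # (step name, "computed" or "cached")

    def step(self, name, deps=()):
        def register(func):
            self.steps[name] = (func, tuple(deps))
            return func
        return register

    def run(self, source):
        graph = {name: set(deps) for name, (_, deps) in self.steps.items()}
        results = {}
        for name in TopologicalSorter(graph).static_order():
            func, deps = self.steps[name]
            inputs = [results[d] for d in deps] if deps else [source]
            payload = json.dumps(inputs, sort_keys=True).encode()
            fingerprint = hashlib.sha256(payload).hexdigest()
            cached = self.state.get(name)
            if cached and cached[0] == fingerprint:
                results[name] = cached[1]          # same inputs: reuse checkpoint
                self.audit.append((name, "cached"))
            else:
                results[name] = func(*inputs)      # new inputs: recompute and store
                self.state[name] = (fingerprint, results[name])
                self.audit.append((name, "computed"))
        return results


pipeline = Pipeline()


@pipeline.step("spend_by_customer")
def spend_by_customer(rows):
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


@pipeline.step("high_spenders", deps=["spend_by_customer"])
def high_spenders(totals):
    return sorted(c for c, total in totals.items() if total >= 100)


ROWS = [["17826162", 60], ["17826162", 50], ["99", 10]]
first = pipeline.run(ROWS)
second = pipeline.run(ROWS)  # identical input, so every step is served from state
```

Running twice on the same input returns identical results, and the audit log records which steps were actually recomputed.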
Feature Richness - How do I easily specify features?
● Features developed in parallel, large in number
○ Tedious, buggy process
● Metadata required to manage lifecycle of features
● Standardize the schema, language and implementation
● Used on both batch and realtime paths
Recommendation: Feature DSL
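One possible shape for such a feature DSL, sketched as declarative Python dictionaries. The spec keys (`name`, `owner`, `version`, `filter`, `measure`) and the evaluator are hypothetical, not the format of any particular product. The same evaluator can be applied to a full batch or to a short recent window of rows.

```python
# Hypothetical declarative feature specs: metadata plus a tiny computation grammar.
FEATURE_SPECS = [
    {"name": "premium_txn_share", "owner": "retail-ds", "version": 1,
     "filter": "premium", "measure": "count"},
    {"name": "imported_spend_share", "owner": "retail-ds", "version": 1,
     "filter": "imported", "measure": "amount"},
]


def evaluate(spec, txns):
    """Share of `measure` contributed by rows where the `filter` flag is set."""
    def weight(txn):
        return 1 if spec["measure"] == "count" else txn[spec["measure"]]

    total = sum(weight(t) for t in txns)
    matched = sum(weight(t) for t in txns if t[spec["filter"]])
    return matched / total if total else 0.0


def compute_all(specs, txns):
    """Evaluate every spec against the same rows (batch or a small realtime window)."""
    return {spec["name"]: evaluate(spec, txns) for spec in specs}


BATCH = [
    {"amount": 5.0, "premium": True, "imported": True},
    {"amount": 45.0, "premium": False, "imported": False},
    {"amount": 30.0, "premium": False, "imported": False},
    {"amount": 20.0, "premium": False, "imported": False},
]
BATCH_VALUES = compute_all(FEATURE_SPECS, BATCH)
WINDOW_VALUES = compute_all(FEATURE_SPECS, BATCH[:1])
```

Because features are data rather than hand-written code, the metadata needed for lifecycle management travels with the definition.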
Feature Richness - How do I make my features richer?
Customer    SKU       Name
17826162    0293192   Thai Dragon Fruit
Labels: Acquired taste, Affluence, International exposure, Willingness to Experiment
Recommendation: Plan for a labeling service or integration with a third party
Feature Richness - How do I make my features richer?
● Common iterative activity for various use cases, dimensions
● Complexity varies a lot
○ Single/multiple labels, Maker-checker (untrusted), simple text vs boxes, sensitive vs safe data
● Multiple options depending on scale
○ Excel/Light-weight in-house/Third-party (LabelBox)
Recommendation: Plan for a labeling service or integration with a third party
Capabilities
● Feature Richness & Performance
● Trust and Discovery
○ How do I audit/validate the computation?
○ How do I discover and reuse features?
● Input correctness
Trust - How do I audit/validate the computation?
● Feature engineering code should be assumed to be buggy
● Datasets multiply (e.g., 10 pipelines x 1 output file/day ≈ 3.6K files/year)
● Multiple places & ways
○ Pipeline stages, visualization, audit interface
● Include statistical evaluation
○ Distributions, drift over time & space
Recommendation: Pre/Post computation checks, Easy checking interfaces
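A minimal sketch of pre/post computation checks plus a simple drift measure. The function names, required fields, value range, and drift formula are illustrative assumptions; a production system would likely use richer statistical tests.

```python
import statistics


def pre_check(rows, required=("customer", "amount")):
    """Indices of input rows that are missing a required field."""
    return [i for i, row in enumerate(rows)
            if any(row.get(field) is None for field in required)]


def post_check(values, low=0.0, high=1.0):
    """Computed values that fall outside the range expected for a share-type feature."""
    return [v for v in values if not (low <= v <= high)]


def drift(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline standard deviations."""
    spread = statistics.pstdev(baseline)
    if spread == 0:
        return 0.0
    return abs(statistics.mean(current) - statistics.mean(baseline)) / spread


BAD_INPUT_ROWS = pre_check([{"customer": "1", "amount": None},
                            {"customer": "2", "amount": 3.0}])
OUT_OF_RANGE = post_check([0.2, 1.4])
DRIFT_SCORE = drift(baseline=[0.1, 0.2, 0.3], current=[0.5, 0.6, 0.7])
```

Checks like these can run at every pipeline stage, with the results surfaced through whatever audit interface the team uses.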
Discovery - How do I reuse features?
Pipeline1, Pipeline2, .... Pipeline3, each with a feature list:
Premium     P4     2019-02-02
Impscore    P3     2019-03-04
...         ...    ...
1000s expected (#pipelines x #features)
Discovery - How do I reuse features?
● Feature computation is expensive
○ Encourage discovery & reuse
● Multiple ways - marketplace, filter & export, new request
● Necessary capabilities:
○ Standardization of data - representation, location
○ Feature priority, ownership, and status
○ Namespaces
Recommendation: Marketplace, DB, Search, Wiki page
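A minimal sketch of a searchable feature registry covering the capabilities listed above. The `FeatureEntry` fields, the namespace and owner values, and the reading of "P4"/"P3" as priorities are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntry:
    namespace: str
    name: str
    pipeline: str
    priority: str   # assumed meaning of the "P4"/"P3" column
    owner: str
    status: str
    updated: str


# Hypothetical registry contents.
REGISTRY = [
    FeatureEntry("retail.customer", "premium", "pipeline1", "P4", "team-a", "active", "2019-02-02"),
    FeatureEntry("retail.customer", "impscore", "pipeline2", "P3", "team-b", "active", "2019-03-04"),
    FeatureEntry("retail.customer", "premium_old", "pipeline1", "P4", "team-a", "deprecated", "2018-11-20"),
]


def search(registry, text="", namespace=None, status="active"):
    """Filter entries by name substring, namespace, and lifecycle status."""
    text = text.lower()
    return [
        entry for entry in registry
        if text in entry.name.lower()
        and (namespace is None or entry.namespace == namespace)
        and (status is None or entry.status == status)
    ]


ACTIVE_PREMIUM = search(REGISTRY, "prem")
ALL_PREMIUM = search(REGISTRY, "prem", status=None)
```

Defaulting to active features keeps deprecated ones out of casual searches while leaving them discoverable for audits.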
Capabilities
● Feature Richness & Performance
● Trust and Discovery
● Input correctness
○ Is my data feed complete and correct?
○ Am I interpreting my feed correctly?
○ How do I manage third-party datasets?
Input Correctness - Is my data feed complete and correct?
● Data quality has a large impact on models
○ Gaps, duplicates, exceptions, invalid values
● Poor data quality wastes resources and invalidates models
● Continuous process to improve ML operations over time
● Extensibility required due to context dependency
bdate   flag   ean
2019    0000   ac0291212
2021    xxx
Recommendation: Continuous health check/early warning system
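A minimal sketch of a feed health check that reports gaps, duplicates, and invalid values. The rules are assumptions for illustration: `flag` must be a 4-digit string, `ean` must be present, and `bdate` is expected for every year in a given range.

```python
def health_check(rows, expected_years):
    """Report gaps, duplicates, and invalid values in a feed (illustrative rules)."""
    report = {"gaps": [], "duplicates": [], "invalid": []}
    seen = set()
    for index, row in enumerate(rows):
        key = (row.get("bdate"), row.get("ean"))
        if key in seen:
            report["duplicates"].append(index)
        seen.add(key)

        flag = row.get("flag")
        if not (isinstance(flag, str) and len(flag) == 4 and flag.isdigit()):
            report["invalid"].append((index, "flag"))
        if not row.get("ean"):
            report["invalid"].append((index, "ean"))

    present = {row.get("bdate") for row in rows}
    report["gaps"] = sorted(set(expected_years) - present)
    return report


# Hypothetical feed modelled on the sample rows shown on the slide.
FEED = [
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"},
    {"bdate": "2021", "flag": "xxx", "ean": None},
]
REPORT = health_check(FEED, expected_years=["2019", "2020", "2021"])
```

Run on every delivery, a report like this can act as the early warning before bad rows reach feature computation.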
Input Correctness - Am I interpreting my feed correctly?
● Important if legacy systems are involved
● Risk of semantic mismatch
● Ability to keep adding information
● Avoid documenting in pipeline code - it can't be extracted easily
● Pipeline uses API to pull documentation and checks
bdate   flag   ean
2019    0000   ac0291212
Recommendation: Light-weight catalog (excel, wiki, Atlas, programmatic)
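A minimal sketch of a programmatic catalog that keeps column documentation and checks outside the pipeline code. The column meanings below are invented placeholders (the talk does not define `bdate`, `flag`, or `ean`), and `describe`/`validate` are hypothetical helper names.

```python
# Hypothetical catalog: documentation and checks live here, not in pipeline code.
CATALOG = {
    "bdate": {
        "doc": "Year as a 4-digit string (assumed meaning)",
        "check": lambda value: len(value) == 4 and value.isdigit(),
    },
    "flag": {
        "doc": "Legacy 4-digit status code (assumed meaning)",
        "check": lambda value: len(value) == 4 and value.isdigit(),
    },
    "ean": {
        "doc": "Product identifier (assumed meaning)",
        "check": lambda value: bool(value),
    },
}


def describe(column):
    """Documentation string a pipeline or UI can pull for a column."""
    return CATALOG[column]["doc"]


def validate(record):
    """Columns that are undocumented or whose value fails the catalog check."""
    problems = []
    for column, value in record.items():
        entry = CATALOG.get(column)
        if entry is None or not entry["check"](value):
            problems.append(column)
    return problems


GOOD = validate({"bdate": "2019", "flag": "0000", "ean": "ac0291212"})
BAD = validate({"bdate": "2021", "flag": "xxx", "ean": "x", "region": "1"})
```

New knowledge about a legacy field is added to the catalog once and is picked up by every pipeline that consults it.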
Input Correctness - How do I integrate third-party datasets?
● Discovery, Lineage & ownership tracking
○ Datasets are often bought, e.g., surveys
● Support for pre-processing to enable fast & safe ingestion
○ Partitioning, anonymization
● Support for open dataset repository/online data services
● Programmatic, mediated access
○ Security, usage logging
Recommendation: Dataset management service/capability
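A minimal sketch of mediated, programmatic access to a purchased dataset, with ownership tracking, anonymization of sensitive columns, and usage logging. The `DatasetService` class and its methods are hypothetical; salted hashing stands in for whatever anonymization scheme a real deployment would require.

```python
import hashlib
from datetime import datetime, timezone


class DatasetService:
    """Hypothetical dataset management service with masking and usage logging."""

    def __init__(self, salt):
        self._salt = salt
        self._datasets = {}
        self.usage_log = []

    def register(self, name, rows, owner, source, sensitive=()):
        # Ownership and source are recorded for lineage tracking.
        self._datasets[name] = {
            "rows": rows, "owner": owner, "source": source,
            "sensitive": set(sensitive),
        }

    def _mask(self, value):
        digest = hashlib.sha256((self._salt + str(value)).encode()).hexdigest()
        return digest[:12]

    def read(self, name, user):
        entry = self._datasets[name]
        self.usage_log.append({
            "dataset": name, "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return [
            {key: (self._mask(val) if key in entry["sensitive"] else val)
             for key, val in row.items()}
            for row in entry["rows"]
        ]


service = DatasetService(salt="demo-salt")
service.register(
    "survey_2019",
    rows=[{"respondent": "alice", "score": 4}, {"respondent": "alice", "score": 2}],
    owner="marketing", source="purchased survey", sensitive=["respondent"],
)
SURVEY_ROWS = service.read("survey_2019", user="analyst1")
```

The same respondent maps to the same masked token, so joins still work while the raw identifier never leaves the service.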
Feature Engineering Platform - How to Decide
Economics of Feature Engineering
● Features have a price
○ Price paid every day
○ Amortization happens over time & across models
● Operations costs grow non-linearly with feature count
● Addition/deletion cost is high
○ Triggers recomputation of intermediate state
● Implicit & explicit dependencies get added over time
○ Dependencies are often untracked
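A minimal sketch of why addition/deletion is costly and why dependency tracking matters: given an explicit dependency map, the set of features that must be recomputed after a change can be derived mechanically. The feature names and the `DEPENDS_ON` map are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: node -> the inputs it is computed from.
DEPENDS_ON = {
    "spend_by_customer": ["transactions"],
    "premium_txn_share": ["transactions"],
    "impscore": ["spend_by_customer", "premium_txn_share"],
    "churn_model_input": ["impscore"],
}


def affected_by(changed, depends_on):
    """All nodes that must be recomputed when `changed` is modified or removed."""
    downstream = {}
    for node, parents in depends_on.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(node)

    seen, queue = set(), deque([changed])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


FROM_SPEND = affected_by("spend_by_customer", DEPENDS_ON)
FROM_SOURCE = affected_by("transactions", DEPENDS_ON)
```

When dependencies are untracked, this set has to be rediscovered by hand each time, which is where much of the hidden cost comes from.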
Questions to Ask
● How many models will I have over time? (Speed, Correctness)
● How many features will they need? (Performance)
● How complex are the features? (Performance)
● Is there any feature reuse across models? (Performance)
● How defensible should they be? (Correctness)
● How available should they be? (Robustness, Correctness)
● How big is my dataset? (Performance)
Feature Engineering Platform - Implementation
Approaches
● FEAST (Go-JEK)
○ First open-source feature store
○ Two major assumptions: “Tall” datasets, GCP
● Build In-house
○ Atlas (Catalog) + LabelBox (labeling) + Airflow (pipeline) + …
○ Pro: Fine-grained control
○ Cons: Frequency mismatch - too high/too light
● Third-party (Scribble Enrich)
○ Coherent design
○ Pro: Out-of-the-box capabilities, fills gaps
○ Cons: Commercial product limitations, few options
Approaches
Comparison dimensions: Build Time, Extensibility, OSS, Cloud Lock-in, Tech Stack, Suitability
FEAST: Java; large narrow datasets, GCP
Inhouse: depends on context
Enrich: Python; limited engg bandwidth
More coming: Realtime, Edge, Media, Privacy-sensitive
Inspiration
● Uber - Michelangelo
● AirBnB - BigHead
● Go-JEK - ML Platform
● LinkedIn - Pro-ML
Takeaways
● Production data science is complex and expensive
● Data science-informed architectural thinking required
● Production feature engineering in early stages
○ Will grow rapidly in the coming years
● Organizations will be building a FE platform
○ Same reasons - Speed, Correctness, Evolution, Scale
Questions?
Denver (Littleton) | Bangalore (Indiranagar, HSR)
hello@scribbledata.io
Scribble Data
Accelerated Production ML Engineering
