Accelerating ML using Production Feature Engineering

Accelerating ML using
Production Feature Engineering Platform
Dr. Venkata Pingali, Scribble Data

Acme Retail Inc Challenge: Put Models into Production
● Why is it so complex and expensive?
● How should Acme think about it?
● Where does production feature engineering fit in?
● How does the system look? Why?
● Should Acme build one?
● Where should Acme start?
© Scribble Data 2019

Outline
● Production ML & Feature Engineering Overview
● Challenges & Required Capabilities
● Design considerations
● Implementation options

Production ML & Feature Engineering Overview

Production ML - Complex & Expensive
● Uber’s Michelangelo: 5000 Models
● Other Examples: AirBnB BigHead, Stripe RailYard
Source: https://eng.uber.com/michelangelo/ + Additional Credits: Vikas

Why the Complexity - Distribution of Challenges
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the
small black box in the middle. The required surrounding infrastructure is vast and complex.”
Paper from Google - NeurIPS 2015
Drivers: Speed, Correctness, Evolution, Scale

Production ML - Emerging Generic Architecture
● Feature Engineering
● Model Dev & Management
● Model Deployment
Source: https://databricks.com/session/scaling-ride-hailing-with-machine-learning-on-mlflow
GoJEK @ Spark AI Summit, April 2019

Feature Engineering - Nature
● Features are variables generated from data
○ Continuous process (Batch + Near Realtime + Realtime)
● Large in number (100s to 1000s) & evolving
● Frequently (re)computed
● Possibly auto-computed
Retail Customer (X GB):
Customer   SKU      Name
17826162   0293192  Thai Dragon Fruit
Features (~X/1000):
Customer   Premium      Imported
17826162   15% of txns  5% of spend
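The reduction from raw retail rows to a few per-customer features can be sketched as below. This is only an illustration, not the speaker's implementation: the transaction rows, the `premium` and `imported` flags, the amounts, and the function name `customer_features` are all assumptions, and the resulting shares are not meant to reproduce the numbers on the slide.

```python
from collections import defaultdict

# Hypothetical transactions for the customer shown on the slide.
# The extra SKUs, the amounts, and the premium/imported flags are assumptions.
TRANSACTIONS = [
    {"customer": "17826162", "sku": "0293192", "name": "Thai Dragon Fruit",
     "amount": 5.0, "premium": True, "imported": True},
    {"customer": "17826162", "sku": "0000001", "name": "Rice",
     "amount": 45.0, "premium": False, "imported": False},
    {"customer": "17826162", "sku": "0000002", "name": "Milk",
     "amount": 30.0, "premium": False, "imported": False},
    {"customer": "17826162", "sku": "0000003", "name": "Soap",
     "amount": 20.0, "premium": False, "imported": False},
]


def customer_features(transactions):
    """Aggregate raw rows into two per-customer features."""
    txn_count = defaultdict(int)
    premium_count = defaultdict(int)
    spend = defaultdict(float)
    imported_spend = defaultdict(float)
    for t in transactions:
        c = t["customer"]
        txn_count[c] += 1
        premium_count[c] += int(t["premium"])
        spend[c] += t["amount"]
        if t["imported"]:
            imported_spend[c] += t["amount"]
    return {
        c: {
            # Share of transactions that involve a premium product.
            "premium_txn_share": premium_count[c] / txn_count[c],
            # Share of total spend that goes to imported products.
            "imported_spend_share": imported_spend[c] / spend[c] if spend[c] else 0.0,
        }
        for c in txn_count
    }


features = customer_features(TRANSACTIONS)
```

Many input rows collapse into one short feature row per customer, which is the shrinkage the slide points at with "(~X/1000)".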

Feature Engineering - Classification
                           Data Scale: Normal (TBs)   Data Scale: Hyper (PBs)
Model Integration: Loose   Most ML                    ML at Scale (e.g., Uber’s Michelangelo)
Model Integration: Tight   Most DL                    DL at Scale (e.g., Google TFX)
Dataset shapes: Tall, Fat
Classes will expand in future

Feature Engineering Platform - Challenges & Capabilities

Feature Generation Process
Input (Data lake) → Feature Engineering Tooling → Output (Feature Store)
● Input Correctness
● Feature Richness & Performance
● Trust & Discovery

Organizational Context
● Expensive activity overall (people, compute)
● Limited engineering resources
● Churn in staff
● More models every day
● Variable quality of sources and data
Datalake → Feature Engineering Tooling → Feature Store

Capabilities
● Feature Richness & Performance
○ How do I compute features reliably?
○ How do I make it easier to specify features?
○ How do I make my features richer?
● Input correctness
● Trust and Discovery

Feature Richness - How do I compute features reliably?
Customer   Premium      impscore
17826162   15% of txns  1
Day 1: v1
Day 2: v1
Day 3: v2
...
Exploding combinations: #pipelines x #runs x #versions

Feature Richness - How do I compute features reliably?
● Composable pipelines with state and audit management
○ Self-documenting modules
○ Versioned namespaces
○ Metadata collection
● Idempotent and reproducible
● Incremental and/or checkpointed computation
● Resource management
Recommendation: Pandas + DAG + State Management + Testing
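A minimal sketch of the "DAG + State Management" part of this recommendation, under stated assumptions: the step functions are plain-Python stand-ins for pandas transforms, and `run_pipeline`, `fingerprint`, the state layout, and the sample data are hypothetical names chosen for illustration. Each step is keyed by a fingerprint of its version, raw input, and upstream outputs, so rerunning with unchanged inputs skips the work (idempotent, checkpointed) and the audit list records what happened.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def fingerprint(obj):
    """Stable hash of any JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def run_pipeline(steps, deps, raw, state, version):
    """Run steps in dependency order, reusing checkpointed outputs.

    steps: step name -> function(raw, upstream_outputs)
    deps:  step name -> set of upstream step names
    state: mutable dict acting as the checkpoint store (an assumption;
           a real system would persist this somewhere durable)
    """
    outputs, audit = {}, []
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: outputs[d] for d in sorted(deps.get(name, ()))}
        key = fingerprint([version, name, raw, upstream])
        cached = state.get(name)
        if cached is not None and cached["key"] == key:
            outputs[name] = cached["output"]      # nothing changed: reuse
            audit.append((name, "skipped"))
        else:
            outputs[name] = steps[name](raw, upstream)
            state[name] = {"key": key, "output": outputs[name]}
            audit.append((name, "ran"))
    return outputs, audit


def clean(raw, upstream):
    # Drop rows with a non-positive amount.
    return [r for r in raw if r["amount"] > 0]


def premium_share(raw, upstream):
    rows = upstream["clean"]
    return sum(r["premium"] for r in rows) / len(rows)


STEPS = {"clean": clean, "premium_share": premium_share}
DEPS = {"clean": set(), "premium_share": {"clean"}}
RAW = [
    {"amount": 5.0, "premium": True},
    {"amount": 45.0, "premium": False},
    {"amount": 30.0, "premium": False},
    {"amount": 20.0, "premium": False},
    {"amount": -1.0, "premium": False},
]
```

Calling `run_pipeline(STEPS, DEPS, RAW, state, "v1")` twice with the same `state` dict runs both steps the first time and skips both the second time; changing the version string forces recomputation.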

Feature Richness - How do I easily specify features?
● Features developed in parallel, large in number
○ Tedious, buggy process
● Metadata required to manage lifecycle of features
● Standardize the schema, language and implementation
● Used on both batch and realtime paths
Recommendation: Feature DSL
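One way to picture a feature DSL is a small declarative spec plus a single evaluator shared by the batch and realtime paths. The sketch below is an assumption-heavy illustration, not the design of any particular product: `FeatureSpec`, `ShareAccumulator`, `compute_batch`, the field names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Declarative description of a 'share' feature plus lifecycle metadata."""
    name: str               # feature name
    flag: str               # boolean column selecting rows for the numerator
    weight: Optional[str]   # numeric column to sum; None means count rows
    owner: str              # metadata used to manage the feature lifecycle
    version: str


class ShareAccumulator:
    """Running numerator/denominator; the same code serves batch and realtime."""

    def __init__(self, spec):
        self.spec = spec
        self.num = 0.0
        self.den = 0.0

    def update(self, row):
        w = 1.0 if self.spec.weight is None else float(row[self.spec.weight])
        self.den += w
        if row[self.spec.flag]:
            self.num += w

    @property
    def value(self):
        return self.num / self.den if self.den else None


def compute_batch(spec, rows):
    """Batch path: feed every row through the same accumulator."""
    acc = ShareAccumulator(spec)
    for row in rows:
        acc.update(row)
    return acc.value


SPECS = [
    FeatureSpec("premium_txn_share", flag="premium", weight=None,
                owner="hypothetical-team", version="v1"),
    FeatureSpec("imported_spend_share", flag="imported", weight="amount",
                owner="hypothetical-team", version="v1"),
]

ROWS = [
    {"amount": 5.0, "premium": True, "imported": True},
    {"amount": 45.0, "premium": False, "imported": False},
    {"amount": 30.0, "premium": False, "imported": False},
    {"amount": 20.0, "premium": False, "imported": False},
]
```

On the realtime path a service would keep one `ShareAccumulator` per customer and call `update` for each incoming event. Because both paths share the spec and the update logic, the two cannot silently diverge.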

Feature Richness - How do I make my features richer?
Customer   SKU      Name
17826162   0293192  Thai Dragon Fruit
Labels: Acquired taste, Affluence, International exposure, Willingness to Experiment

Feature Richness - How do I make my features richer?
● Common iterative activity across use cases and dimensions
● Complexity varies a lot
○ Single/multiple labels, maker-checker (untrusted), simple text vs boxes, sensitive vs safe data
● Multiple options depending on scale
○ Excel / lightweight in-house / third party (LabelBox)
Recommendation: Plan for a labeling service or integration with a third party

Capabilities
● Trust and Discovery
○ How do I audit/validate the computation?
○ How do I discover and reuse features?

Trust - How do I audit/validate the computation?
● Feature engineering code should be assumed to be buggy
● Datasets multiply (10 pipelines ≈ 3.6K files/year, e.g., one output file per pipeline per day)
● Multiple places & ways
○ Pipeline stages, visualization, audit interface
● Include statistical evaluation
○ Distributions, drift over time & space
Recommendation: Pre/Post computation checks, Easy checking interfaces
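The pre/post computation checks and the statistical evaluation can be sketched as three small functions. The function names, the `[0, 1]` range, and the drift rule (current mean more than three baseline standard deviations from the baseline mean) are assumptions chosen for illustration; a real deployment would pick tests suited to its data.

```python
from statistics import mean, pstdev


def pre_check(rows, required):
    """Before computation: report rows that lack a required column value."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in required if row.get(c) is None]
        if missing:
            problems.append(f"row {i}: missing {missing}")
    return problems


def post_check(values, lo=0.0, hi=1.0):
    """After computation: report feature values outside the expected range."""
    return [f"value {v} outside [{lo}, {hi}]" for v in values if not lo <= v <= hi]


def drifted(baseline, current, threshold=3.0):
    """Crude drift test on the mean, scaled by the baseline spread."""
    sd = pstdev(baseline)
    gap = abs(mean(current) - mean(baseline))
    if sd == 0:
        return gap > 0
    return gap / sd > threshold


# Hypothetical feature values from earlier runs and from two new runs.
BASELINE = [0.10, 0.12, 0.11, 0.09, 0.13]
STABLE_RUN = [0.11, 0.10, 0.12]
SHIFTED_RUN = [0.50, 0.60]
```

Here `drifted(BASELINE, STABLE_RUN)` is false and `drifted(BASELINE, SHIFTED_RUN)` is true. Running `pre_check` and `post_check` as explicit pipeline stages keeps validation visible in the audit trail rather than buried in feature code.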

Discovery - How do I reuse features?
Pipeline1:
  Premium   P4  2019-02-02
  Impscore  P3  2019-03-04
  ...
Pipeline2:
  Premium   P4  2019-02-02
  ...
Pipeline3:
  Premium   P4  2019-02-02
  ...
1000s expected: #pipelines x #features

● Feature computation is expensive
○ Encourage discovery & reuse
● Multiple ways - marketplace, filter & export, new request
● Necessary capabilities:
○ Standardization of data - representation, location
○ Feature priority, ownership, and status
○ Namespaces
Recommendation: Marketplace, DB, Search, Wiki page
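The "DB + Search" part of this recommendation could start as small as a registry of feature entries with namespace, priority, ownership, and status, plus one search function. The entry fields follow the capabilities listed above; the class name, owners, statuses, and the `search` signature are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEntry:
    namespace: str   # e.g. the pipeline that produces the feature
    name: str
    priority: str
    owner: str       # hypothetical owners below
    status: str      # hypothetical status values: "active", "deprecated"
    updated: str     # ISO date


REGISTRY = [
    FeatureEntry("pipeline1", "Premium", "P4", "owner-a", "active", "2019-02-02"),
    FeatureEntry("pipeline1", "Impscore", "P3", "owner-b", "active", "2019-03-04"),
    FeatureEntry("pipeline2", "Premium", "P4", "owner-c", "deprecated", "2019-02-02"),
]


def search(registry, text="", **filters):
    """Case-insensitive name match combined with exact-match field filters."""
    text = text.lower()
    hits = []
    for entry in registry:
        if text and text not in entry.name.lower():
            continue
        if all(getattr(entry, field) == value for field, value in filters.items()):
            hits.append(entry)
    return hits
```

For example, `search(REGISTRY, "prem", status="active")` returns only the `pipeline1` entry, which makes it easy to see that the `Premium` feature already exists before another pipeline recomputes it.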

Capabilities
● Input correctness
○ Is my data feed complete and correct?
○ Am I interpreting my feed correctly?
○ How do I manage third-party datasets?

Input Correctness - Is my data feed complete and correct?
● Data quality has a large impact on models
○ Gaps, duplicates, exceptions, invalid values
● Poor data quality wastes resources and invalidates models
● Continuous process to improve ML operations over time
● Extensibility required due to context dependency
bdate   flag   ean
2019    0000   ac0291212
2021    xxx
Recommendation: Continuous health check/early warning system
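A continuous health check can begin as a function that counts gaps, duplicates, and invalid values per feed delivery. The rules below (a four-digit `flag`, a `bdate` year no later than a reference year, a non-empty `ean`) are assumptions made up for this sketch, since the slide does not define the columns; the duplicated third row is also invented to exercise the duplicate check.

```python
import re
from collections import Counter

# Rows modeled on the slide's example; the third row is an invented duplicate.
FEED = [
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"},
    {"bdate": "2021", "flag": "xxx", "ean": None},
    {"bdate": "2019", "flag": "0000", "ean": "ac0291212"},
]


def check_feed(rows, current_year):
    """Count data-quality problems in one feed delivery."""
    report = Counter()
    seen = set()
    for row in rows:
        key = (row.get("bdate"), row.get("flag"), row.get("ean"))
        if key in seen:
            report["duplicate"] += 1
        seen.add(key)

        if not row.get("ean"):
            report["missing_ean"] += 1            # gap in the feed

        bdate = row.get("bdate")
        if bdate is None or not bdate.isdigit() or int(bdate) > current_year:
            report["invalid_bdate"] += 1          # e.g. a year in the future

        flag = row.get("flag")
        if flag is None or not re.fullmatch(r"\d{4}", flag):
            report["invalid_flag"] += 1           # not a four-digit code
    return dict(report)


report = check_feed(FEED, current_year=2019)
```

Tracking these counts per delivery over time gives the early-warning signal; since the rules depend on context, keeping them as small separate checks makes it easier to add new ones.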

Input Correctness - Am I interpreting my feed correctly?
● Important if legacy systems are involved
● Risk of semantic mismatch
● Ability to keep adding information
● Avoid documenting in pipeline code - it can't be extracted easily
● Pipeline uses API to pull documentation and checks
bdate   flag   ean
2019    0000   ac0291212
Recommendation: Lightweight catalog (Excel, wiki, Atlas, programmatic)
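The idea of a pipeline pulling documentation and checks from a catalog, rather than hard-coding them, can be sketched with an in-memory dictionary standing in for the catalog API. Everything here is hypothetical: the column meanings are invented placeholders (the slide does not define `bdate`, `flag`, or `ean`), and `get_documentation` and `validate_row` are illustrative names, not calls from Atlas or any other catalog product.

```python
# In-memory stand-in for a catalog service. Column meanings are invented.
CATALOG = {
    "bdate": {
        "doc": "Hypothetical meaning: four-digit year of the business date.",
        "check": lambda v: v.isdigit() and len(v) == 4,
    },
    "flag": {
        "doc": "Hypothetical meaning: four-digit status code from a legacy system.",
        "check": lambda v: v.isdigit() and len(v) == 4,
    },
    "ean": {
        "doc": "Hypothetical meaning: product identifier; must not be empty.",
        "check": lambda v: len(v) > 0,
    },
}


def get_documentation(column):
    """What the pipeline (or a person) reads to interpret a column."""
    return CATALOG[column]["doc"]


def validate_row(row):
    """Return the columns whose values fail the catalog's checks."""
    failures = []
    for column, entry in CATALOG.items():
        value = row.get(column)
        if value is None or not entry["check"](value):
            failures.append(column)
    return failures
```

Because interpretation lives in the catalog, new knowledge about a legacy column can be added in one place and every pipeline that calls `validate_row` picks it up.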

Input Correctness - How do I integrate third-party datasets?
● Discovery, Lineage & ownership tracking
○ Datasets are often bought, e.g., surveys
● Support for pre-processing to enable fast & safe ingestion
○ Partitioning, anonymization
● Support for open dataset repository/online data services
● Programmatic, mediated access
○ Security, usage logging
Recommendation: Dataset management service/capability

Feature Engineering Platform - How to Decide

Economics of Feature Engineering
● Features have a price
○ Price paid every day
○ Amortization happens over time & across models
● Operations costs grow non-linearly with feature count
● Addition/deletion cost is high
○ Triggers recomputation of intermediate state
● Implicit & explicit dependencies get added over time
○ Dependencies are often untracked

Questions to Ask
● How many models will I have over time? (Speed, Correctness)
● How many features will they need? (Performance)
● How complex are the features? (Performance)
● Is there any feature reuse across models? (Performance)
● How defensible should they be? (Correctness)
● How available should they be? (Robustness, Correctness)
● How big is my dataset? (Performance)

Feature Engineering Platform - Implementation

Approaches
● FEAST (Go-JEK)
○ First open-source feature store
○ Two major assumptions: “Tall” datasets, GCP
● Build in-house
○ Atlas (catalog) + LabelBox (labeling) + Airflow (pipeline) + …
○ Pro: Fine-grained control
○ Cons: Frequency mismatch - too high/too light
● Third party (Scribble Enrich)
○ Coherent design
○ Pro: Out-of-the-box capabilities, fills gaps
○ Cons: Commercial product limitations, few options

Approaches
Comparison dimensions: Build Time, Extensibility, OSS, Cloud Lock-in, Tech Stack, Suitability
Approach   Tech Stack   Suitability
FEAST      Java         Large narrow datasets, GCP
In-house                Depends on context
Enrich     Python       Limited engineering bandwidth
More coming: Realtime, Edge, Media, Privacy-sensitive

Inspiration
● Uber - Michelangelo
● AirBnB - BigHead
● Go-JEK - ML Platform
● LinkedIn - Pro-ML

Takeaways
● Production data science is complex and expensive
● Data-science-informed architectural thinking is required
● Production feature engineering is in its early stages
○ Will grow rapidly in the coming years
● Organizations will be building an FE platform
○ Same reasons - Speed, Correctness, Evolution, Scale

Questions?
Scribble Data
Accelerated Production ML Engineering
Denver (Littleton) | Bangalore (Indiranagar, HSR)
hello@scribbledata.io
