Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline is Airbnb's data management platform designed specifically for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this work from months to days by making the process declarative: data scientists define features in a simple configuration language, and the framework then provides access to point-in-time correct features for both offline model training and online inference. This talk describes the architecture of the system and the algorithm that makes efficient point-in-time correct feature generation tractable.

Attendees will learn:

● The importance of point-in-time correct features for achieving better ML model performance
● The importance of using change data capture for generating feature views
● An algorithm that efficiently generates features over change data: interval trees compress time-series features, and feature aggregates are generated over this compressed representation
● A lambda architecture that enables using the same algorithm for online feature generation
● A framework, based on category theory, for understanding how feature aggregations can be distributed and independently composed

While the talk is fairly technical, all concepts are introduced from first principles with examples. A basic understanding of data-parallel distributed computation and machine learning helps, but is not required.

Slide 1: Zipline: Declarative Feature Engineering
Evgeny Shapiro, Varant Zanoyan / Oct 2019 / Airbnb
Slide 2: Agenda
1. The machine learning workflow
2. The feature engineering problem
3. Zipline as a solution
4. Implementation
5. Results
6. Q&A
Slide 3: The Machine Learning Workflow in Production
Slides 4-5: Machine Learning
● Goal: make a prediction about the world given incomplete data
● Labels: the prediction target
● Features: known information to learn from
● Training output: model weights/parameters
● Serving: online features
● Assumption: the training and serving distributions are the same (consistency)
Slide 6: ML applications
Applications range from unstructured (image classification, chat apps, NLP, object detection) to structured (fraud, customer LTV, credit scores, ads, personalized search), ordered by the number of data sources.
● Unstructured:
○ Most of the data is available at once (e.g., the full image)
○ Features are automatically extracted from few (often one) data streams: words from a text, pixels from an image
● Structured:
○ Data arrives steadily as the user interacts with the platform
○ Features are extracted from many event streams: logins, clicks, bookings, page views, etc.
○ Iterative manual feature engineering
Slide 7: Feature Engineering
On the same unstructured-to-structured spectrum, example features:
● Unstructured: n-grams from a text
● Structured: sum of past purchases in the last 7 days
Slide 8: Offline Batch vs Online Real-time
● Offline batch (e.g., email marketing):
○ Does not require serving features in production
○ Online/offline consistency is not a problem
● Online real-time (e.g., personalized search):
○ Does require serving features in production
○ Online/offline consistency is a problem
Slide 9: Feature engineering for the structured online use case
Slide 10: "We recognize that a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code" (Sculley et al., NIPS 2015)
Slide 11: ML Models
[Diagram: feature values F1 and F2 change over time; at prediction time P1 their current values (7 and 3) are captured, and once the label L matures, the pair is joined into the training data set. The stack underneath: problem → product → user behavior & business processes.]
Slide 12: Log-based training
[Diagram: online, the application calls a scoring service backed by a DB and a KV store; keys, features, and scores are published to the event bus. Offline (Hive), the daily scoring log is joined with labels to produce the training set.]
Slide 13: Log-based training is great †
● Easy to implement
● Any production-available data point can be used for training and scoring
● The log can be used for audit and debug purposes
● Consistency is guaranteed
† May capture accidental data distribution shifts, requires upfront implementation of new features in production, may slow down the feature iteration cycle, prevents feature sharing between models, increases the product experimentation cycle, severely limits your ability to react to incidents; fixing production issues might degrade model performance, and it may decrease sleep time during on-call rotations. Consult with your architect before taking a log-based training approach.
Slide 14: The Fine Print up close
● Sharing features is hard
● Testing new features requires a production implementation
● May capture accidental data shifts (bugs, downed services)
● Slows down the iteration cycle
● Limits agility in reacting to production incidents
Slide 15: Slowdown of experimentation
[Diagram: the same timeline as slide 11, now with a new feature F3 whose historical values are unknown ("?"); training data for F3 only accumulates after it is implemented in production and a new prediction P2 is logged and labeled.]
Slide 16: Why is that a problem?
● Some models are time-dependent (seasonality)
● For some problems, label maturity is on the order of months
● Production incidents lead to dirty data in training
● Labels are scarce and expensive to acquire
→ Months-long iteration cycles
→ Hard to maintain models in production
→ Cannot address shifts in data quickly
Slide 17: What do we want?
● Backfill features
○ Quick!
● A single feature definition for production and training
● Automatic pipelines for training and scoring
Slide 18: ZIPLINE
Slide 19: Zipline: feature management system
[Diagram: a single feature definition feeds both a serving pipeline and a training pipeline. The training pipeline produces the training set for the model; the serving pipeline produces the scoring vector for online scoring. Consistency between the two paths; fast backfills in the data warehouse; low-latency serving in the online environment.]
Slide 20: Feature definition
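The feature definition code shown on this slide is not captured in the transcript. As a rough illustration of the declarative style, here is a minimal Python sketch; the FeatureSet/Aggregation names and fields are hypothetical stand-ins, not Zipline's actual configuration language:

    from dataclasses import dataclass, field
    from typing import List

    # Minimal stand-in types so the sketch is self-contained; Zipline's real
    # configuration language is richer, and these names are hypothetical.
    @dataclass
    class Aggregation:
        column: str
        operation: str                      # e.g. "SUM", "COUNT", "LAST"
        windows: List[str] = field(default_factory=list)

    @dataclass
    class FeatureSet:
        source_table: str
        time_column: str
        keys: List[str]
        aggregations: List[Aggregation]

    # A user's payment activity, declared rather than hand-built as a pipeline:
    payment_features = FeatureSet(
        source_table="core_data.payments",
        time_column="ts",
        keys=["user_id"],
        aggregations=[
            Aggregation(column="amount", operation="SUM", windows=["7d", "30d"]),
            Aggregation(column="amount", operation="COUNT", windows=["7d"]),
            Aggregation(column="amount", operation="LAST"),
        ],
    )

From a definition like this, the framework can derive both the backfill job and the serving pipeline, which is what keeps online and offline features consistent (slide 19).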
Slide 21: Training Set API
The driver timestamp is the time at which we made the prediction, which is also the time at which we would log the feature.
Slide 22: Training Set
Slide 23: If you missed it...
Training set = f(features, keys, timestamps)
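A minimal sketch of this contract in pandas (illustrative only, not Zipline's API): for each (key, timestamp) pair, the join must pick up the latest feature value at or before that timestamp, never a later one, which is the point-in-time correctness requirement:

    import pandas as pd

    def training_set(drivers: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
        # Point-in-time correct join: for each (key, ts) in drivers, attach the
        # most recent feature row with feature_ts <= ts.
        return pd.merge_asof(
            drivers.sort_values("ts"),
            features.sort_values("feature_ts"),
            left_on="ts",
            right_on="feature_ts",
            by="key",
            direction="backward",
        )

    drivers = pd.DataFrame({
        "key": [1, 1],
        "ts": pd.to_datetime(["2019-10-05", "2019-10-15"]),
    })
    features = pd.DataFrame({
        "key": [1, 1],
        "feature_ts": pd.to_datetime(["2019-09-20", "2019-10-10"]),
        "purchases_7d": [3, 5],
    })
    print(training_set(drivers, features))
    # The 2019-10-05 row sees purchases_7d = 3; the value logged on 2019-10-10
    # only appears for driver timestamps at or after it.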
Slide 24: Implementation
Slide 25: Feature philosophy
● Complex features:
○ Only worth it if the gain is huge
○ Require complex computations
○ Harder to interpret
○ Harder to maintain
● Simple features:
○ Easier to maintain
○ Faster to compute
○ Cumulatively provide a huge gain for the model
Slide 26: Supported operations
● Sum, Count
● Min, Max
● First, Last
● Last N
● Statistical moments
● Approximate unique count
● Approximate percentile
● Bloom filters
Plus time windows for all operations!
Slide 27: Operation requirements
● Commutative: a ⊕ b = b ⊕ a
● Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
● Additional optimizations:
○ Reversible: a ⊕ ? = c
● Must be O(1) in compute ⇒ must be O(1) in space
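In other words, each operation behaves as a commutative monoid over its intermediate state, optionally with an inverse. A toy Python sketch of that contract (illustrative, not Zipline's API):

    class SumAggregator:
        # Sum satisfies every requirement on this slide.
        def init(self):               # identity element
            return 0
        def update(self, acc, x):     # O(1) compute, O(1) state
            return acc + x
        def merge(self, a, b):        # commutative and associative, so partial
            return a + b              # aggregates can be combined in any order
        def delete(self, acc, x):     # reversible: solves a ⊕ ? = c
            return acc - x

    class MaxAggregator:
        # Max is commutative and associative but NOT reversible: knowing
        # max(a, x) = c and x does not recover a, so no delete() exists.
        def init(self):
            return float("-inf")
        def update(self, acc, x):
            return max(acc, x)
        def merge(self, a, b):
            return max(a, b)

Commutativity and associativity are what allow map-side combining and the lambda-architecture merge in the next slides; reversibility additionally allows sliding a window forward by subtracting expired events.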
Slide 28: Serving pipeline: lambda
[Diagram: the feature definition drives a streaming job and a batch job, each writing partial aggregates to a KV store; the Zipline client merges the batch and streaming aggregates at read time.]
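Because every aggregate has an associative merge, the read path can combine a batch snapshot with a streaming tail. A minimal sketch of that read path (names and KV layout are illustrative assumptions):

    # Lambda read path: the batch job writes partial aggregates up to the last
    # batch landing time; the streaming job aggregates events after that cutoff.
    # The client merges the two with the same operation used offline.
    def serve_feature(key, batch_kv, stream_kv, merge, identity=0):
        batch_agg = batch_kv.get(key, identity)
        stream_agg = stream_kv.get(key, identity)
        return merge(batch_agg, stream_agg)

    batch_kv = {"user_1": 40}   # e.g. purchase count as of last night's batch
    stream_kv = {"user_1": 2}   # purchases since the batch cutoff
    print(serve_feature("user_1", batch_kv, stream_kv, lambda a, b: a + b))  # 42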
Slide 29: Data skew: large number of events
A single hot key can dominate the data; here, one user accounts for 50% of page views:

user  ts
1     2019-10-01 00:00:01
1     2019-10-01 00:00:02
...   ...
1     2019-10-01 23:59:59
2     2019-10-02 15:20:30
3     2019-10-12 16:11:44

Use aggregateByKey to ensure data is locally combined in the first stage before being sent to the final merge (see the sketch after the next slide).
Slide 30: Aggregate by Key
[Diagram: each executor locally combines its raw (a, 1) and (b, 1) records into per-key partial sums such as (a, 2) and (a, 3); the shuffle then moves only these partials, which are merged into the final (a, 6) and (b, 6).]
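A small PySpark sketch of this pattern (the job setup is illustrative): aggregateByKey applies the sequence function map-side, so a hot key's millions of events collapse into one partial aggregate per partition before the shuffle, and only the partial aggregates cross the network:

    from pyspark import SparkContext

    sc = SparkContext(appName="aggregate-by-key-sketch")

    events = sc.parallelize(
        [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)],
        numSlices=2,
    )
    counts = events.aggregateByKey(
        0,                          # zero value (identity)
        lambda acc, v: acc + v,     # seqFunc: combine within a partition
        lambda a, b: a + b,         # combFunc: merge partial aggregates
    )
    print(sorted(counts.collect()))  # [('a', 4), ('b', 2)]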
Slide 31: Training pipeline
[Diagram: the model definition and feature definition(s) drive a batch job that writes the training data to Hive.]
Slide 32: Data skew: large number of examples
The same skew appears on the training side; here, one IP address accounts for 50% of both training examples and page views:

Training examples:
ip         ts
127.0.0.1  2019-10-15 05:03:20
127.0.0.1  2019-10-15 12:32:11
127.0.0.1  2019-10-15 09:55:29
...        ...
1.2.3.4    2019-10-15 03:22:21
1.2.3.5    2019-10-15 19:10:59

Page views:
ip         ts
127.0.0.1  2019-10-01 00:00:01
127.0.0.1  2019-10-01 00:00:02
...        ...
1.2.3.4    2019-10-01 23:59:59
1.2.3.5    2019-10-02 15:20:30
1.2.3.6    2019-10-12 16:11:44
Slide 33: Large number of timestamps: naive solution
● Keep one aggregate per (key, driver timestamp)
● For every event:
○ Find the corresponding key
○ For every driver timestamp of that key:
■ If the event occurred prior to the timestamp, produce ((key, driver timestamp), data)
● Use aggregateByKey
● Problem: O(Nts × Ne)
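A direct Python transcription of the naive algorithm for a single key, assuming a sum feature (illustrative sketch):

    def naive_sums(driver_ts, events):
        # One aggregate per (key, driver timestamp); every event is applied to
        # every later driver timestamp, hence O(Nts * Ne) work per key.
        aggs = {d: 0 for d in driver_ts}
        for ts, value in events:
            for d in driver_ts:
                if ts < d:            # event occurred prior to the timestamp
                    aggs[d] += value
        return [aggs[d] for d in sorted(driver_ts)]

    # Driver timestamps and events from the next two slides:
    print(naive_sums([1, 3, 7, 8, 10, 15, 18, 20], [(6, 1), (9, 1)]))
    # -> [0, 0, 1, 1, 2, 2, 2, 2]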
Slide 34: Non-windowed case
[Diagram: driver timestamps for one key at 1, 3, 7, 8, 10, 15, 18, 20. An event at t=6 contributes to every later timestamp, giving values 0 0 1 1 1 1 1 1; a second event at t=9 gives 0 0 1 1 2 2 2 2.]
Slide 35: Non-windowed case (optimized)
[Diagram: the same driver timestamps. Each event is applied only to the first affected aggregate (t=6 → timestamp 7, t=9 → timestamp 10); a final cumulative sum of the values yields the result 0 0 1 1 2 2 2 2. Cost: O(Ne + Nts).]
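A sketch of the optimization for a single key (sum feature, illustrative). Each event is applied only to the first driver timestamp after it, and a final cumulative sum propagates it to all later timestamps. With a merge over pre-sorted inputs this is the slide's O(Ne + Nts); the binary search below makes it O(Ne log Nts + Nts), which is close enough for illustration:

    import bisect

    def optimized_sums(driver_ts, events):
        driver_ts = sorted(driver_ts)
        partial = [0] * len(driver_ts)
        for ts, value in events:
            # First driver timestamp strictly after the event.
            i = bisect.bisect_right(driver_ts, ts)
            if i < len(driver_ts):
                partial[i] += value
        out, acc = [], 0
        for p in partial:          # cumulative sum propagates each event
            acc += p               # to every later driver timestamp
            out.append(acc)
        return out

    print(optimized_sums([1, 3, 7, 8, 10, 15, 18, 20], [(6, 1), (9, 1)]))
    # -> [0, 0, 1, 1, 2, 2, 2, 2]  (matches the naive result)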
Slide 36: Data skew: windowed case
[Diagram: the same driver timestamps with window size 5. Event timestamps 0-7 are organized into a binary tree of interval aggregates (2-3, 4-5, 6-7, 0-3, 4-7, 0-7), so each windowed query combines O(log Nts) precomputed intervals. Cost: O(Ne × log(Nts)).]
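For windowed features the cumulative-sum trick no longer works, because old events must fall out of the window. The slide's tree of interval aggregates answers each windowed query by combining O(log Nts) precomputed nodes; a minimal segment-tree sketch of the idea (illustrative, not Zipline's implementation, and valid for any associative operation, not only reversible ones):

    import bisect

    class IntervalAggregateTree:
        # Bottom-up tree of partial aggregates over events sorted by time.
        def __init__(self, values, op=lambda a, b: a + b, identity=0):
            self.n = len(values)
            self.op, self.identity = op, identity
            self.tree = [identity] * self.n + list(values)
            for i in range(self.n - 1, 0, -1):
                self.tree[i] = op(self.tree[2 * i], self.tree[2 * i + 1])

        def query(self, lo, hi):
            # Aggregate of values[lo:hi] by combining O(log n) tree nodes.
            res = self.identity
            lo += self.n
            hi += self.n
            while lo < hi:
                if lo & 1:
                    res = self.op(res, self.tree[lo]); lo += 1
                if hi & 1:
                    hi -= 1; res = self.op(res, self.tree[hi])
                lo //= 2
                hi //= 2
            return res

    event_ts = [0, 1, 2, 3, 4, 5, 6, 7]          # one event of value 1 at each
    tree = IntervalAggregateTree([1] * len(event_ts))
    for ts in [7, 8, 10]:                        # driver timestamps, window = 5
        lo = bisect.bisect_left(event_ts, ts - 5)
        hi = bisect.bisect_left(event_ts, ts)    # events strictly before ts
        print(ts, tree.query(lo, hi))            # -> (7, 5), (8, 5), (10, 3)

Building the tree is linear in the number of events and each driver timestamp costs one logarithmic query, in line with the slide's O(Ne × log(Nts)) bound.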
Slide 37: Feature Sources
● Hive tables produced upstream
● Jitney: Airbnb's event bus
● Databases, via data warehouse export and CDC
Slide 38: Results
Slide 39: Results: improved workflow
● Zipline cuts weeks of effort on:
○ Custom feature pipelines
○ Data leaks in custom aggregations
○ Data sketches
● Improved model iteration workflow
● Feature distribution observability
Slide 40: Results: runtime optimizations
● Optimized data pipelines:
○ 10x speedup for training set backfill for some models
○ Incremental pipelines by default
○ Huge cost savings
Slide 41: Q&A