Zipline - A Declarative Feature Engineering Framework

Zipline
Declarative Feature Engineering Framework
Nikhil Simha
nikhil.simha@airbnb.com

[Diagram: ML workflow. Stage labels: Exploration, Problem, Feature Creation, Model Training, Model Serving, Feature Serving, Application. Role labels: Data Engineer, Data Scientist, ML/Systems Engineer, ML/Systems Engineer]

“We recognize that a mature system might end up being (at most)
5% machine learning code and (at least) 95% glue code” – Sculley, NIPS 2015

Goal
• Question – “glue code”
• Imperative process -> Declarative specification
• Months to days
• With just the DS

Feature Engineering
• 60 – 70%
• Good data with okay/simple model

Context
• Part of Bighead
• Supervised learning
• Structured data vs. unstructured data
• Systems problem vs. math problem

What makes Feature Engineering Hard?

Everything changes
• Features + Algorithm
• Data
• Continuously arriving

[Diagram: data landscape. Component labels: Service Fleet, Production Database, DB Snapshot, Event log, Change Capture Stream, Event Stream, Change capture log, Message Bus, Data Lake, Live Derived Data, Media]

An example
● Predict likelihood of you liking a particular Indian restaurant
● Total visits to Indian places last month
● Average rating of the restaurant last year
● They are all aggregations

An example
● Predict likelihood of you liking a particular Indian restaurant
● Total visits to Indian places last month
  ● Operation: Count, Input: Visit, Window = 1 month
  ● Source: Check-in stream
● Average rating of the restaurant
  ● Operation: AVG, Input: rating, Window = 1 yr
  ● Source: Ratings table
● They are all aggregations
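The two features above can be written down as data rather than as pipeline code. A minimal sketch, assuming a hypothetical `Aggregation` record; the class name, field names, and source names are illustrative only and are not Zipline's actual API:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Aggregation:
    """Hypothetical declarative feature spec (not Zipline's real API)."""
    operation: str               # e.g. "COUNT" or "AVG"
    input_column: str            # column the operation reads
    source: str                  # stream or table the rows come from
    window: Optional[timedelta]  # None would mean un-windowed


visits_last_month = Aggregation(
    operation="COUNT",
    input_column="visit",
    source="check_in_stream",
    window=timedelta(days=30),   # "1 month" approximated as 30 days
)

avg_rating_last_year = Aggregation(
    operation="AVG",
    input_column="rating",
    source="ratings_table",
    window=timedelta(days=365),  # "1 yr" approximated as 365 days
)

features = [visits_last_month, avg_rating_last_year]
```

Each feature is fully described by four values (operation, input, window, source), so a framework can in principle derive both the training backfill and the serving pipeline from the same declaration.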

Aggregations + Temporal Join
[Figure: feature timelines F1, F2, F3 and a Label timeline along a Time axis, with prediction points P1 and P2 and the resulting training data set]

Feature Serving for inference
What is the value of these feature aggregates now?

Real-time features
• Event log + Event stream = Real-time features
• DB snapshots + Change data = Real-time features

Feature Serving
• Latency
• Optimized for point queries
• Freshness vs latency
• Service Events and DB Mutations
• Batch correction

Feature Computation for training
What are the exact feature values at the points-of-interest in history?

Example
Query Log                  Aggregated Features
user   Time                Visits (Cnt / month)   Rating (Avg / year)
123    2019-09-13 17:31    5                      4
234    2019-09-14 17:40    20                     4
345    2019-09-15 17:02    6                      2

Architecture
[Diagram: component labels: Feature Declaration, Streaming Updates, Batch partial aggregates, Feature Store, Feature Backfills, Model Training, Model, Model Server, Feature Client, Labeling, Application Server]

Aggregations – SUM
• Commutative: a + b = b + a
• Associative: (a + b) + c = a + (b + c)
• Reversible: (a + b) – a = b
• Abelian Group

Aggregations – AVG
• One not-so-clever trick
• Operate on “Intermediate Representation” / IR
• Factors into (sum, count)
• Finalized by a division: (sum/count)
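The (sum, count) factoring can be sketched as follows. This is a minimal illustration, assuming the function names `update`, `merge`, `reverse`, and `finalize`, which are chosen here for clarity and are not taken from Zipline:

```python
from typing import Optional, Tuple

# Intermediate representation (IR) of an average: (sum, count).
IR = Tuple[float, int]

EMPTY: IR = (0.0, 0)


def update(ir: IR, value: float) -> IR:
    """Fold one new value into the IR."""
    return (ir[0] + value, ir[1] + 1)


def merge(a: IR, b: IR) -> IR:
    """Combine two partial IRs; commutative and associative."""
    return (a[0] + b[0], a[1] + b[1])


def reverse(ir: IR, value: float) -> IR:
    """Undo a previous update with the same value."""
    return (ir[0] - value, ir[1] - 1)


def finalize(ir: IR) -> Optional[float]:
    """Turn the IR into the average; None when nothing was aggregated."""
    return ir[0] / ir[1] if ir[1] else None


# Two partial aggregates computed independently, then merged.
left = update(update(EMPTY, 4.0), 5.0)   # ratings 4 and 5
right = update(EMPTY, 3.0)               # rating 3
merged = merge(left, right)              # (12.0, 3)
```

Averages themselves cannot be merged or reversed directly, but the (sum, count) pair can, so AVG inherits the group properties of SUM as long as the division is deferred to `finalize`.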

Aggregations
• Constant memory / Bounded IR
• Two classes of aggregations
  • Sum, Avg, Count, etc.
    • Reversible / Abelian Groups
  • Min, Max, Approx Unique, most sketches, etc.
    • Non-Reversible / Commutative Monoids / Non-Groups

Incremental Windowing – with reversibility
[Figure: Visits – check-in stream of a user, aggregated over the last year; as the window slides, the departing value is reversed and the arriving value is added (-1, +0)]
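A reversible aggregation lets the window slide in constant work per event: add what enters, subtract what leaves. A minimal sketch, assuming integer timestamps and a hypothetical `windowed_sums` helper (not part of Zipline):

```python
from collections import deque
from typing import Iterable, List, Tuple


def windowed_sums(events: Iterable[Tuple[int, int]], window: int) -> List[Tuple[int, int]]:
    """For each event, return (ts, sum of values with timestamps in (ts - window, ts]).

    `events` must be sorted by timestamp. Each value is added once and
    subtracted once, so the total work is linear in the number of events.
    """
    live = deque()   # events currently inside the window
    total = 0
    out = []
    for ts, value in events:
        live.append((ts, value))
        total += value                       # value enters the window
        while live and live[0][0] <= ts - window:
            _, old = live.popleft()
            total -= old                     # reversal: value leaves the window
        out.append((ts, total))
    return out


check_ins = [(1, 1), (2, 0), (3, 1), (5, 1), (6, 1)]
sums = windowed_sums(check_ins, window=3)
```

The subtraction step is exactly what a non-reversible aggregation such as MAX cannot do, which is why those need a different structure.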

Incremental Windowing – with reversibility
[Figure: Max rating – Ratings table – grouped by user; tree of partial max values]

Windowing – w/o reversibility
• Time: O(N^2) vs O(N log N)
• Space: N vs 2N memory
              Groups        Non-Groups
Un-Windowed   No-Reversal   No-Reversal
Windowed      Reversal      Tree
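One common way to realize the "Tree" cell is a bottom-up segment tree, which stores partial aggregates in an array of size 2N and answers any range query in O(log N). The sketch below uses that standard structure for MAX; it is an illustration of the idea, and the class name `MaxTree` is an assumption, not the deck's actual implementation:

```python
from typing import List, Sequence


class MaxTree:
    """Bottom-up segment tree over a fixed list of values (2N memory)."""

    def __init__(self, values: Sequence[float]) -> None:
        self.n = len(values)
        # Leaves live at indices n..2n-1; internal nodes at 1..n-1.
        self.tree: List[float] = [float("-inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, left: int, right: int) -> float:
        """Max of values[left:right] using O(log N) tree nodes."""
        result = float("-inf")
        lo, hi = left + self.n, right + self.n
        while lo < hi:
            if lo & 1:                 # lo is a right child: take it, move right
                result = max(result, self.tree[lo])
                lo += 1
            if hi & 1:                 # hi is exclusive: step left, take that node
                hi -= 1
                result = max(result, self.tree[hi])
            lo >>= 1
            hi >>= 1
        return result


ratings = [3, 1, 4, 1, 5, 9, 2, 6]
tree = MaxTree(ratings)
# Max over every window of 3 consecutive ratings, without any reversal.
window_maxes = [tree.query(i, i + 3) for i in range(len(ratings) - 2)]
```

Recomputing each window from scratch would cost O(N) per window; the tree only needs merges, so it works for any commutative monoid.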

Windowing – w/o reversibility
• Tiling problem
• Tile([left, right]) => Tile([left, split_point]) + Tile([split_point, right])
• Split_point => right & (MAX_INT << msb(left ^ right)), where & is bitwise AND and msb is the index of the most significant set bit
• Tiles are the binary representation of (right – split_point) and (split_point - left)
• Less hand-waving in the paper
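The tiling rule can be sketched as below. This assumes half-open integer intervals [left, right) with 0 <= left < right, and uses Python's `-1 << msb` in place of `MAX_INT << msb` because Python integers are unbounded; the function names are illustrative:

```python
from typing import List, Tuple


def split_point(left: int, right: int) -> int:
    """Largest multiple of 2**msb that is <= right, where msb is the
    highest bit in which left and right differ."""
    msb = (left ^ right).bit_length() - 1
    return right & (-1 << msb)


def tiles(left: int, right: int) -> List[Tuple[int, int]]:
    """Decompose [left, right) into aligned power-of-two tiles."""
    if left >= right:
        return []
    split = split_point(left, right)
    out = []

    # Left of the split: tile sizes are the set bits of (split - left),
    # smallest first, so every tile starts on a multiple of its size.
    pos, remaining, size = left, split - left, 1
    while remaining:
        if remaining & size:
            out.append((pos, pos + size))
            pos += size
            remaining -= size
        size <<= 1

    # Right of the split: tile sizes are the set bits of (right - split),
    # largest first.
    pos, remaining = split, right - split
    size = 1 << (remaining.bit_length() - 1) if remaining else 0
    while remaining:
        if remaining & size:
            out.append((pos, pos + size))
            pos += size
            remaining -= size
        size >>= 1
    return out


example = tiles(5, 13)
```

For `tiles(5, 13)` the split point is 8; the left side (8 - 5 = 3 = 0b11) yields tiles of size 1 and 2, and the right side (13 - 8 = 5 = 0b101) yields tiles of size 4 and 1. Any window therefore needs at most about 2 log N precomputed tiles.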

Reversibility - Unpacking Change data
• Deletion is a reversal
• Update is a delete followed by an insert
• Example:
• Sudden heat wave forecast at 7 pm.
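Applied to a reversible aggregate such as SUM, the unpacking rules look roughly like this. The mutation record layout (`kind`, `before`, `after`) is a hypothetical format chosen for the sketch, not a specific change-capture schema:

```python
from typing import Dict, Optional

Mutation = Dict[str, Optional[float]]


def apply_mutation(total: float, mutation: Mutation) -> float:
    """Fold one change-data record into a running SUM."""
    kind = mutation["kind"]
    if kind == "insert":
        return total + mutation["after"]
    if kind == "delete":
        # Deletion is a reversal of the old value.
        return total - mutation["before"]
    if kind == "update":
        # Update is a delete of the old value followed by an insert of the new one.
        return total - mutation["before"] + mutation["after"]
    raise ValueError(f"unknown mutation kind: {kind!r}")


change_log = [
    {"kind": "insert", "before": None, "after": 4.0},
    {"kind": "insert", "before": None, "after": 5.0},
    {"kind": "update", "before": 4.0, "after": 2.0},
    {"kind": "delete", "before": 5.0, "after": None},
]

running = 0.0
for m in change_log:
    running = apply_mutation(running, m)
```

This only works when the aggregation has an inverse; a MAX over the same change log could not be corrected in place after the delete.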

Example
Query Log                  Aggregated Features
user   Time                Visits (Sum / month)   Rating (Max / year)
123    2019-09-13 17:31    5                      4
234    2019-09-14 17:40    20                     4
345    2019-09-15 17:02    6                      2

Feature Backfill
• Time-series join with aggregations
• Left :: Query Log :: [(Entity Key, timestamp)]
• Right :: Raw Data :: [(Entity Key, timestamp, unaggregated)]
• Output :: Feature Data :: [(Entity Key, timestamp, aggregated)]
• Aggregation and join are fused
• Raw data is much larger than the query log (raw data >> query log)
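The semantics of this join can be pinned down with a deliberately naive reference version. It is a sketch of what the output means, assuming a COUNT aggregation and a hypothetical `backfill_count` helper; it scans every event per query and is not the fused, scalable approach the deck describes:

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Query = Tuple[int, datetime]          # (entity key, query timestamp)
Event = Tuple[int, datetime, str]     # (entity key, event timestamp, payload)


def backfill_count(query_log: List[Query], raw_data: List[Event],
                   window: timedelta) -> List[Tuple[int, datetime, int]]:
    """For each query, count the key's events in [query_ts - window, query_ts)."""
    events_by_key: Dict[int, List[datetime]] = {}
    for key, ts, _payload in raw_data:
        events_by_key.setdefault(key, []).append(ts)

    out = []
    for key, query_ts in query_log:
        count = sum(
            1
            for ts in events_by_key.get(key, [])
            # Only events strictly before the query time: no leakage from the future.
            if query_ts - window <= ts < query_ts
        )
        out.append((key, query_ts, count))
    return out


query_log = [(123, datetime(2019, 9, 13, 17, 31))]
raw_data = [
    (123, datetime(2019, 9, 1, 12, 0), "visit"),    # inside the window
    (123, datetime(2019, 9, 13, 18, 0), "visit"),   # after the query time
    (123, datetime(2019, 7, 1, 9, 0), "visit"),     # older than the window
    (234, datetime(2019, 9, 10, 9, 0), "visit"),    # different key
]
result = backfill_count(query_log, raw_data, window=timedelta(days=30))
```

The cost here is proportional to queries times events per key, which is what motivates fusing the aggregation into the join when raw data dwarfs the query log.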

Tree Merge
[Figure: binary tree of event spans over query timestamps 0 to 15. Leaf spans: 0-1, 2-3, 4-5, 6-7, 8-9, 10-11, 12-13, 14-15; then 0-3, 4-7, 8-11, 12-15; then 0-7, 8-15; root 0-15. Labels: Query timestamps, Incoming Event (ts, payload), Event span]

Feature Backfill – Topology
[Diagram: backfill dataflow.
Datasets: Query Log (key, query time); Raw Data (key, event time, payload); Pivoted queries (key, [query time]); Partial aggregate (key, [query time], aggregate); Partial Aggregate ((key, query time), aggregate); Results (key, query time, aggregate).
Operations: GroupBy; Broadcast; Tree merge; Flat map & Re-key; Shuffle & Merge]

Feature Backfill – Nuances
• Time Skew
• Event time vs ingestion time
• Many sources of raw data at once
• Un-skewed can be faster
• More in paper

Feature Serving – lambda
• Head = Streaming, Tail = Batch
• Availability for batch correction
• Reduced tail resolution
[Figure: two spans, each labeled “30 Day window”]
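A read-time merge of the batch tail and the streaming head might look like the sketch below. It assumes a COUNT feature, daily partial aggregates for the tail, and a hypothetical `windowed_count` helper; the storage layout is an assumption for illustration only:

```python
from datetime import date, datetime, timedelta
from typing import Dict, List


def windowed_count(daily_counts: Dict[date, int], head_events: List[datetime],
                   now: datetime, batch_end: date, window_days: int = 30) -> int:
    """Count events in roughly the last `window_days` days as of `now`.

    daily_counts: batch-computed counts per day, for days before `batch_end`.
    head_events:  event times seen on the stream on or after `batch_end`.
    """
    # Tail: day-level resolution, so the window start is rounded to a day.
    window_start_day = (now - timedelta(days=window_days)).date()
    tail = sum(
        count for day, count in daily_counts.items()
        if window_start_day <= day < batch_end
    )
    # Head: exact event times from the stream, up to the query time.
    head = sum(1 for ts in head_events if ts.date() >= batch_end and ts <= now)
    return tail + head


now = datetime(2019, 9, 15, 17, 0)
daily_counts = {
    date(2019, 8, 10): 3,   # older than the 30-day window
    date(2019, 8, 16): 2,
    date(2019, 9, 14): 4,
}
head_events = [
    datetime(2019, 9, 15, 9, 0),
    datetime(2019, 9, 15, 18, 0),   # after `now`, not yet visible
]
value = windowed_count(daily_counts, head_events, now, batch_end=date(2019, 9, 15))
```

Keeping the tail at day granularity trades a little precision at the old edge of the window for far fewer stored partial aggregates, and the next batch run can overwrite the tail to correct any streaming errors.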

Links
• 95%+ glue code:
  • https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
• 50%+ feature engineering:
  • https://developers.google.com/machine-learning/data-prep/process
