Nikhil Simha, Airbnb
Andrew Hoh, Airbnb
Zipline – Airbnb’s ML Data
Management Framework
#ML3SAIS
Feature Engineering
2#ML3SAIS
Idea
Discover
Data
Training
Set
Training &
Evaluation
Airbnb Zipline
Discovering data
3#ML3SAIS
• Where is it?
• Is it good?
• Will it continue to be good?
• Has it already been done before?
Training set
4#ML3SAIS
• What processing engine to use?
• Backfilling
• Point-in-time correct windows
Production
5#ML3SAIS
Training Set
Generation
Model
Training
Push model
to prediction
service
Online
Features
Model
monitoring
Data quality
monitoring
What is Zipline?
• Part of Bighead – Airbnb’s e2e ml platform
• Training Sets
• Feature Store
• Data quality
6#ML3SAIS
Concepts
• Data Source
• Feature-set
• Training-set
• Client
7#ML3SAIS
Feature-Set
• Source queries
– Hive tables
– Event streams
– Mutation streams
• Primary Keys
• Time Stamp
8#ML3SAIS
Feature-Set – Sources
• Multiple sources
– Backfills
– Migrations
9#ML3SAIS
Example/Operations
10#ML3SAIS
11#ML3SAIS
The picture can't be displayed.
Data Quality
Training set
• Driver query
– primary keys
– timestamps
– Point-in-time correct
12#ML3SAIS
Training set – example
13#ML3SAIS
Training set – example
14#ML3SAIS
Backfills
15#ML3SAIS
Old training Set
Time
Features
Label offsets
16#ML3SAIS
Online scoring
• Backfills
• Low latency
• DB Mutations
– Batch correction
17#ML3SAIS
Lambda
• Batch – Spark
• Streaming – Flink
• GDPR
18#ML3SAIS
User
Conf
Batch
Streaming
Zipline Client
User
App
*Daily
*Continuous
KV Store
Why Flink?
19#ML3SAIS
Why Flink?
20#ML3SAIS
• Across batch and streaming
• Fixed length sliding windows
• Raw Events at the tail and head
Aggregations
21#ML3SAIS
7 day sliding window
Availability
22#ML3SAIS
Processing time
Eventtime
Client
23#ML3SAIS
• Aggregation
• Logic to handle availability
• Java
– The API is (List of Feature Names) => Map of feature
values – as objects.
Mutations
• Organizationally easy
• Consistent
• Not flexible
• More complex
– Inverse
24#ML3SAIS
Mutations - Aggregation
25#ML3SAIS
• Monoids
– Associative: (a + b) + c = a + (b + c)
– Distribute aggregation
– Sum, Avg, Count etc..
• Group
– Invertible
– Min, Max, Median, Ntile
– Not possible without memory
– Compromise
Mutations - Aggregation
26#ML3SAIS
• Deletes
= Inverse(before)
• Updates
= Inverse(before) + after
Overview
FeatureSet
Conf
Feature
Stream
Batch
Table
Stats
Table
Data Quality
UI
Client
Offline
Training Set
When?
• Plan to open source in Q3 2018.
• Will be part of Bighead
– Talk tomorrow.
28#ML3SAIS
Questions?
29#ML3SAIS

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh