Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh

Nikhil Simha, Airbnb
Andrew Hoh, Airbnb
Zipline – Airbnb’s ML Data
Management Framework
#ML3SAIS

Feature Engineering
2#ML3SAIS
Idea
Discover
Data
Training
Set
Training &
Evaluation
Airbnb Zipline

Discovering data
3#ML3SAIS
• Where is it?
• Is it good?
• Will it continue to be good?
• Has it already been done before?

Training set
4#ML3SAIS
• What processing engine to use?
• Backfilling
• Point-in-time correct windows

Production
5#ML3SAIS
Training Set
Generation
Model
Training
Push model
to prediction
service
Online
Features
Model
monitoring
Data quality
monitoring

What is Zipline?
• Part of Bighead – Airbnb’s e2e ml platform
• Training Sets
• Feature Store
• Data quality
6#ML3SAIS

Concepts
• Data Source
• Feature-set
• Training-set
• Client
7#ML3SAIS

Feature-Set
• Source queries
– Hive tables
– Event streams
– Mutation streams
• Primary Keys
• Time Stamp
8#ML3SAIS

Feature-Set – Sources
• Multiple sources
– Backfills
– Migrations
9#ML3SAIS

11#ML3SAIS
The picture can't be displayed.
Data Quality

Training set
• Driver query
– primary keys
– timestamps
– Point-in-time correct
12#ML3SAIS

Training set – example
13#ML3SAIS

Training set – example
14#ML3SAIS

Backfills
15#ML3SAIS
Old training Set
Time
Features

Online scoring
• Backfills
• Low latency
• DB Mutations
– Batch correction
17#ML3SAIS

Lambda
• Batch – Spark
• Streaming – Flink
• GDPR
18#ML3SAIS
User
Conf
Batch
Streaming
Zipline Client
User
App
*Daily
*Continuous
KV Store

• Across batch and streaming
• Fixed length sliding windows
• Raw Events at the tail and head
Aggregations
21#ML3SAIS
7 day sliding window

Availability
22#ML3SAIS
Processing time
Eventtime

Client
23#ML3SAIS
• Aggregation
• Logic to handle availability
• Java
– The API is (List of Feature Names) => Map of feature
values – as objects.

Mutations
• Organizationally easy
• Consistent
• Not flexible
• More complex
– Inverse
24#ML3SAIS

Mutations - Aggregation
25#ML3SAIS
• Monoids
– Associative: (a + b) + c = a + (b + c)
– Distribute aggregation
– Sum, Avg, Count etc..
• Group
– Invertible
– Min, Max, Median, Ntile
– Not possible without memory
– Compromise

Mutations - Aggregation
26#ML3SAIS
• Deletes
= Inverse(before)
• Updates
= Inverse(before) + after

Overview
FeatureSet
Conf
Feature
Stream
Batch
Table
Stats
Table
Data Quality
UI
Client
Offline
Training Set

When?
• Plan to open source in Q3 2018.
• Will be part of Bighead
– Talk tomorrow.
28#ML3SAIS

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh

More Related Content

What's hot

Similar to Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh

More from Databricks

Recently uploaded

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh