Zipline—Airbnb’s Declarative Feature Engineering Framework

Evgeny Shapiro, Varant Zanoyan / Oct 2019 / Airbnb
Zipline: Declarative Feature Engineering
Agenda
1. The machine learning workflow
2. The feature engineering problem
3. Zipline as a solution
4. Implementation
5. Results
6. Q&A
THE MACHINE LEARNING WORKFLOW IN PRODUCTION
Machine Learning
● Goal: Make a prediction about the world given incomplete data
● Labels: Prediction target
● Features: Known information to learn from
● Training output: Model weights/parameters
● Serving: Online feature
● Assumption: Training and serving distributions are the same (consistency)
ML applications
Axis: # of data sources (unstructured → structured)
Unstructured: image classification, chat apps, NLP, object detection
Structured: fraud, customer LTV, credit scores, ads, personalized search
Unstructured:
● Most of the data is available at once: full image
● Features are automatically extracted from few (often one) data stream:
○ words from a text
○ pixels from an image
Structured:
● Data arrives steadily as the user interacts with the platform
● Features extracted from many event streams:
○ logins
○ clicks
○ bookings
○ page views, etc.
● Iterative manual feature engineering
Feature Engineering
Axis: # of data sources (unstructured → structured)
Unstructured: image classification, chat apps, NLP, object detection
○ Example feature: N-grams from a text
Structured: fraud, customer LTV, credit scores, ads, personalized search
○ Example feature: sum of past purchases in last 7 days
Offline Batch vs Online Real-time
● Offline Batch (email marketing)
○ Does not require serving feature in production
○ Online/Offline consistency is not a problem
● Online Real-time (personalized search)
○ Does require serving feature in production
○ Online/Offline consistency is a problem
Feature engineering for the structured online use case
“We recognize that a mature system might end up being (at most)
5% machine learning code and (at least) 95% glue code” – Sculley, NIPS 2015
ML Models
[Figure: feature values F1 and F2 plotted over time, a prediction P1 and its label L, and the resulting training data set. Side labels: Product, Problem, User behavior & business processes.]
Log-based training
[Diagram: Online side: Application, Service, DB, KV, Scoring Service, Event Bus. Offline side (Hive): a daily scoring log of keys, features and score, together with labels, forms the training set.]
Log-based training is great †
● Easy to implement
● Any production-available data point can be used
for training and scoring
● Log can be used for audit and debug purposes
● Consistency is guaranteed
† May capture accidental data distribution shifts, requires upfront implementation of new features in production, may slow
down feature iteration cycle, prevents feature sharing between models, increases product experimentation cycle, severely
limits your ability to react to incidents, fixing production issues might degrade model performance, may decrease sleep
time during on-call rotations. Consult with your architect before taking log-based training approach.
The Fine Print up close
● Sharing features is hard
● Testing new features requires production implementation
● May capture accidental data shifts (bugs, downed services)
● Slows down the iteration cycle
● Limits agility in reacting to production incidents
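Slowdown of experimentation
[Figure: feature values F1, F2 and a newly added F3 plotted over time, with earlier F3 values unknown (?); predictions P1 and P2, their labels L, and the resulting training data set. Side labels: Product, Problem, User behavior & business processes.]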
Why is that a problem?
● Some models are time-dependent (seasonality)
● For some problems label maturity is on the order of months
● Production incidents lead to dirty data in training
● Labels are scarce and expensive to acquire
→ Months-long iteration cycles
→ Hard to maintain models in production
→ Cannot address shifts in data quickly
What do we want?
● Backfill features
○ Quick!
● Single feature definition for production and training
● Automatic pipelines for training and scoring
ZIPLINE
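Zipline: feature management system
[Diagram: a single Feature Definition feeds both the Training Pipeline (fast backfills in the data warehouse, producing the Training Set for the Model) and the Serving Pipeline (low-latency serving in the online environment, producing the Online Scoring Vector), with consistency between the two.]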
Feature definition
Training Set API
Timestamp annotation: the time at which we made the prediction, also the time at which we would log the feature
Training Set
If you missed it... Training set = f(features, keys, timestamps)
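The relation "training set = f(features, keys, timestamps)" can be illustrated with a small sketch. This is not Zipline's API: `sum_before`, `training_set`, and the sample `purchases` data are hypothetical names invented for illustration, and the rule that only events strictly before the timestamp count is an assumption.

```python
def sum_before(events):
    """Build a feature function: sum of values for `key` strictly before `ts`.

    `events` is a list of (key, event_time, value) tuples (hypothetical layout).
    """
    def feature(key, ts):
        return sum(v for k, t, v in events if k == key and t < ts)
    return feature


def training_set(features, keys_and_timestamps):
    """Evaluate every feature as of each (key, timestamp) pair."""
    rows = []
    for key, ts in keys_and_timestamps:
        row = {"key": key, "ts": ts}
        for name, fn in features.items():
            # Each value only sees data available at prediction time.
            row[name] = fn(key, ts)
        rows.append(row)
    return rows


# Illustrative data: (user, event_time, purchase_amount)
purchases = [("u1", 1, 10), ("u1", 5, 20), ("u2", 3, 7)]
rows = training_set(
    {"purchase_sum": sum_before(purchases)},
    [("u1", 4), ("u1", 6), ("u2", 2)],
)
```

Because each row is computed as of its own timestamp, the same key can appear several times with different feature values, which is what a scoring log would have recorded.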
Implementation
Feature philosophy
● Complex features:
○ Only worth it if the gain is huge
○ Require complex computations
○ Harder to interpret
○ Harder to maintain
● Simple features:
○ Easier to maintain
○ Faster to compute
○ Cumulatively provide huge gain for the model
Supported operations
● Sum, Count
● Min, Max
● First, Last
● Last N
● Statistical moments
● Approx unique count
● Approx percentile
● Bloom filters
+ time windows for all operations!
Operation requirements
● Commutative: a ⊕ b = b ⊕ a
● Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
● Additional optimizations:
○ Reversible: a ⊕ ? = c
● Must be O(1) in compute ⇒ must be O(1) in space
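As an illustration of these requirements, here is a minimal sketch of an average kept as a fixed-size (total, count) state. The class `Avg` and its method names are hypothetical, not part of Zipline; `merge` plays the role of ⊕ and `unmerge` shows what reversibility buys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Avg:
    """Fixed-size aggregation state: O(1) space regardless of event count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other):
        # Commutative and associative: order and grouping do not matter.
        return Avg(self.total + other.total, self.count + other.count)

    def unmerge(self, other):
        # Reversible: removes a previously merged partial aggregate,
        # e.g. events that slid out of a time window.
        return Avg(self.total - other.total, self.count - other.count)

    def value(self):
        return self.total / self.count if self.count else None


a, b, c = Avg(10.0, 2), Avg(3.0, 1), Avg(5.0, 4)
combined = a.merge(b).merge(c)
```

Operations such as min and max are commutative and associative but not reversible in this sense, so they cannot use the subtraction shortcut.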
Serving pipeline: lambda
[Diagram: the Feature Definition drives a Batch job and a Streaming job, each writing to a KV store; the Zipline Client reads from the KV stores.]
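A lambda-style read can be sketched as follows. This is an assumption-laden illustration rather than the Zipline client: the function `read_feature`, the use of plain dictionaries as KV stores, and the premise that the streaming store holds only events newer than the last batch run are all hypothetical.

```python
def read_feature(batch_kv, streaming_kv, key, merge=lambda a, b: a + b, empty=0):
    """Combine the batch aggregate with the streaming aggregate for one key.

    Assumes the streaming store covers only events after the batch cutoff,
    so merging the two parts does not double count.
    """
    batch_part = batch_kv.get(key, empty)
    stream_part = streaming_kv.get(key, empty)
    return merge(batch_part, stream_part)


# Illustrative stores: purchase counts per user.
batch_kv = {"u1": 40, "u2": 3}   # computed by the daily batch job
streaming_kv = {"u1": 2}         # events seen since the batch cutoff
u1_count = read_feature(batch_kv, streaming_kv, "u1")
u3_count = read_feature(batch_kv, streaming_kv, "u3")
```

The merge step works for any operation that is commutative and associative, which is one reason those properties are required of supported aggregations.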
Data skew: large number of events
Page views (figure annotation: 50%)
user  ts
1     2019-10-01 00:00:01
1     2019-10-01 00:00:02
...   ...
1     2019-10-01 23:59:59
2     2019-10-02 15:20:30
3     2019-10-12 16:11:44
Use aggregateByKey to ensure data is combined locally in the first stage before it is sent to the final merge
Aggregate by Key
[Diagram: three executors locally combine their (key, 1) pairs into partial counts: (a, 1), (b, 1); (a, 2), (b, 2); (a, 3), (b, 3). After the shuffle, the partials (a, 1), (a, 2), (a, 3) merge into (a, 6) and (b, 1), (b, 2), (b, 3) merge into (b, 6).]
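The two-stage behaviour of aggregateByKey can be simulated in plain Python without a Spark cluster. The helper names `combine_locally` and `aggregate_by_key` are hypothetical; their `zero`, `seq_op`, and `comb_op` arguments mirror the zero value, per-partition function, and merge function that Spark's `RDD.aggregateByKey` takes.

```python
def combine_locally(partition, zero, seq_op):
    """Stage 1: fold one partition's records into per-key partial aggregates."""
    acc = {}
    for key, value in partition:
        acc[key] = seq_op(acc.get(key, zero), value)
    return acc


def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Stage 2: merge the partial aggregates from all partitions."""
    partials = [combine_locally(p, zero, seq_op) for p in partitions]
    merged = {}
    for partial in partials:
        for key, acc in partial.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return partials, merged


# Three simulated executors holding 1, 2 and 3 records per key.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1)] * 3 + [("b", 1)] * 3,
]
partials, counts = aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y
)
```

Only one partial aggregate per key and partition crosses the shuffle, so a key with a very large number of events no longer sends every record to a single reducer.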
Training pipeline
[Diagram: Model definition and Feature Definition(s) feed a Batch job over Hive.]
Data skew: large number of examples
Training examples (figure annotation: 50%)
ip         ts
127.0.0.1  2019-10-15 05:03:20
127.0.0.1  2019-10-15 12:32:11
127.0.0.1  2019-10-15 09:55:29
...        ...
1.2.3.4    2019-10-15 03:22:21
1.2.3.5    2019-10-15 19:10:59
Page views
ip         ts
127.0.0.1  2019-10-01 00:00:01
127.0.0.1  2019-10-01 00:00:02
...        ...
1.2.3.4    2019-10-01 23:59:59
1.2.3.5    2019-10-02 15:20:30
1.2.3.6    2019-10-12 16:11:44
Large number of timestamps: naive solution
● Keep one aggregate per (key, driver timestamp)
● For every event:
○ Find corresponding key
○ For every driver timestamp of that key:
■ If the event occurred prior to the timestamp, produce ((key, driver timestamp), data)
● Use aggregateByKey
● Problem: O(N_ts × N_e)
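The naive approach can be sketched in a few lines of plain Python. The function `naive_aggregate` and its input layout are hypothetical, and a running sum stands in for whatever aggregation is configured; the nested loop makes the O(N_ts × N_e) cost visible.

```python
from collections import defaultdict


def naive_aggregate(driver_timestamps, events):
    """One aggregate per (key, driver timestamp).

    driver_timestamps: dict mapping key -> list of driver timestamps
    events: list of (key, event_time, value)
    """
    totals = defaultdict(int)
    for key, event_time, value in events:
        # Every event is compared against every driver timestamp of its key.
        for ts in driver_timestamps.get(key, []):
            if event_time < ts:
                totals[(key, ts)] += value
    return dict(totals)


driver_timestamps = {"u": [1, 3, 7, 8, 10, 15, 18, 20]}
events = [("u", 6, 1), ("u", 9, 1)]
naive = naive_aggregate(driver_timestamps, events)
```

For a skewed key with many events and many driver timestamps, the inner loop dominates the runtime, which motivates the optimizations that follow.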
Non-windowed case
Timestamps for one key: 1 3 7 8 10 15 18 20
Event at 6, corresponding values: 0 0 1 1 1 1 1 1
Event at 9, corresponding values: 0 0 1 1 2 2 2 2
Non-windowed case (optimized)
O(N_e + N_ts)
Apply each event to the first affected aggregate only. At the end, compute a cumulative sum of the values.
Timestamps for one key: 1 3 7 8 10 15 18 20
Event at 6, corresponding values: 0 0 1 0 0 0 0 0
Event at 9, corresponding values: 0 0 1 0 1 0 0 0
Result: 0 0 1 1 2 2 2 2
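A minimal sketch of the optimized non-windowed case, using a sum as the aggregation. The function name `running_sums` is hypothetical. For brevity this version locates the first affected timestamp with a binary search, which adds a logarithmic factor; walking time-sorted events and timestamps together would avoid it.

```python
from bisect import bisect_right
from itertools import accumulate


def running_sums(timestamps, events):
    """timestamps: sorted driver timestamps for one key.
    events: iterable of (event_time, value).
    """
    deltas = [0] * len(timestamps)
    for event_time, value in events:
        # Index of the first driver timestamp strictly after the event.
        i = bisect_right(timestamps, event_time)
        if i < len(deltas):
            deltas[i] += value
    # A single cumulative pass propagates each event to all later timestamps.
    return list(accumulate(deltas))


timestamps = [1, 3, 7, 8, 10, 15, 18, 20]
result = running_sums(timestamps, [(6, 1), (9, 1)])
```

Each event now touches one slot instead of every later timestamp, and the cumulative sum is computed once per key.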
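Data skew: windowed case
O(N_e × log(N_ts))
Window size = 5
Timestamps for one key: 1 3 7 8 10 15 18 20 (timestamp index 0 to 7)
[Figure: binary tree over the timestamp index, with root 0-7, then 0-3 and 4-7, then 0-1, 2-3, 4-5 and 6-7, then leaves 0 to 7. An event at 6 falls inside the window of timestamps 7, 8 and 10, which are covered by tree nodes 2-3 and 4.]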
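One way to obtain the logarithmic factor is a range-update tree over the timestamp index, sketched here for sums. This is an illustration rather than Zipline's implementation: the function `windowed_sums` is hypothetical, and the boundary rule (an event counts for timestamps strictly after it and at most `window` later) is an assumption.

```python
from bisect import bisect_right


def windowed_sums(timestamps, events, window):
    """timestamps: sorted driver timestamps for one key.
    events: iterable of (event_time, value).
    """
    n = len(timestamps)
    tree = [0] * (2 * n)  # leaves live at positions n .. 2n-1
    for event_time, value in events:
        lo = bisect_right(timestamps, event_time)           # first affected index
        hi = bisect_right(timestamps, event_time + window)  # one past the last
        left, right = lo + n, hi + n
        # Add the value to O(log n) nodes that exactly cover [lo, hi).
        while left < right:
            if left & 1:
                tree[left] += value
                left += 1
            if right & 1:
                right -= 1
                tree[right] += value
            left >>= 1
            right >>= 1
    out = []
    for i in range(n):
        node, total = i + n, 0
        while node >= 1:  # sum every node on the path from leaf to root
            total += tree[node]
            node >>= 1
        out.append(total)
    return out


timestamps = [1, 3, 7, 8, 10, 15, 18, 20]
windowed = windowed_sums(timestamps, [(6, 1), (9, 1)], window=5)
```

With these inputs the event at 6 updates only the node covering indices 2-3 and the leaf for index 4, so each event costs a logarithmic number of updates no matter how many timestamps its window spans.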
Feature sources
● Hive table produced upstream
● Jitney: Airbnb event bus
● Databases via data warehouse export and CDC
Results
● Zipline cuts weeks of effort:
○ Custom feature pipelines
○ Data leaks in custom aggregations
○ Data sketches
● Improved model iteration workflow
● Feature distribution observability
Results: improved workflow
Results: runtime optimizations
● Optimized data pipelines:
○ 10x faster training set backfill for some models
○ Incremental pipelines by default
○ Huge cost savings
Q&A
Zipline—Airbnb’s Declarative Feature Engineering Framework

  • 1. Evgeny Shapiro, Varant Zanoyan / Oct 2019 / Airbnb Zipline: Declarative Feature Engineering
  • 2. 1. The machine learning workflow 2. The feature engineering problem 3. Zipline as a solution 4. Implementation 5. Results 6. Q&A Agenda
  • 4. Machine Learning
      ● Goal: Make a prediction about the world given incomplete data
      ● Labels: Prediction target
      ● Features: Known information to learn from
      ● Training output: Model weights/parameters
      ● Serving: Online feature
      ● Assumption: Training and serving distributions are the same (consistency)
  • 6. ML applications (ordered by # of data sources, from unstructured to structured)
      ● Unstructured: image classification, chat apps, NLP, object detection
          ○ Most of the data is available at once (e.g. the full image)
          ○ Features are automatically extracted from a few (often one) data streams: words from a text, pixels from an image
      ● Structured: fraud, customer LTV, credit scores, ads, personalized search
          ○ Data arrives steadily as the user interacts with the platform
          ○ Features are extracted from many event streams: logins, clicks, bookings, page views, etc.
          ○ Iterative manual feature engineering
  • 7. Feature Engineering (same spectrum of applications, ordered by # of data sources)
      ● Unstructured example: N-grams from a text
      ● Structured example: sum of past purchases in the last 7 days
  • 8. Offline Batch vs Online Real-time
      ● Offline batch (email marketing)
          ○ Does not require serving features in production
          ○ Online/offline consistency is not a problem
      ● Online real-time (personalized search)
          ○ Requires serving features in production
          ○ Online/offline consistency is a problem
  • 9. Feature engineering for the structured online use case
  • 10. “We recognize that a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code” – Sculley, NIPS 2015
  • 11. ML Models (diagram): feature values F1 and F2 over time, a label L, and a prediction P1, with the resulting training data set. Layer labels: problem, product, user behavior & business processes.
  • 12. Log-based training (diagram). Online components: DB, KV, Service, Application, Scoring Service, Event Bus. Logged fields: keys, features, score. Offline (Hive): the daily scoring log and the labels form the training set.
  • 13. Log-based training is great †
      ● Easy to implement
      ● Any production-available data point can be used for training and scoring
      ● The log can be used for audit and debug purposes
      ● Consistency is guaranteed
      † May capture accidental data distribution shifts, requires upfront implementation of new features in production, may slow down the feature iteration cycle, prevents feature sharing between models, increases the product experimentation cycle, severely limits your ability to react to incidents, fixing production issues might degrade model performance, may decrease sleep time during on-call rotations. Consult with your architect before taking the log-based training approach.
  • 14. The Fine Print up close
      ● Sharing features is hard
      ● Testing new features requires production implementation
      ● May capture accidental data shifts (bugs, downed services)
      ● Slows down the iteration cycle
      ● Limits agility in reacting to production incidents
  • 15. Slowdown of experimentation (diagram): feature values F1, F2, and a new feature F3 over time, where the past values of F3 are unknown (?); labels L; predictions P1 and P2; training data set. Layer labels: problem, product, user behavior & business processes.
  • 16. Why is that a problem?
      ● Some models are time-dependent (seasonality)
      ● For some problems, label maturity is on the order of months
      ● Production incidents lead to dirty data in training
      ● Labels are scarce and expensive to acquire
      → Months-long iteration cycles
      → Hard to maintain models in production
      → Cannot address shifts in data quickly
  • 17. What do we want?
      ● Backfill features
          ○ Quick!
      ● Single feature definition for production and training
      ● Automatic pipelines for training and scoring
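  • 19. Zipline: feature management system (diagram). A single feature definition feeds two pipelines: a training pipeline (fast backfills, data warehouse) that produces the training set, and a serving pipeline (low-latency serving, online environment) that produces the online scoring vector for the model. Consistency is maintained between the two.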
  • 21. Training Set API. Timestamp annotation: the time at which we made the prediction, which is also the time at which we would log the feature.
  • 23. If you missed it... Training set = f(features, keys, timestamps)
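The formula above can be read as: for every (key, timestamp) pair, compute each feature using only the data available before that timestamp. The following is a minimal Python sketch of that idea, assuming a single sum feature over in-memory tuples. The function name `build_training_set` and the data layout are hypothetical and are not Zipline's API.

```python
from collections import defaultdict

def build_training_set(events, queries):
    """Hypothetical helper, not Zipline's API.

    events:  iterable of (key, event_ts, value)
    queries: iterable of (key, query_ts), the times at which predictions are made
    Returns one row per query: (key, query_ts, sum of values seen strictly before query_ts).
    """
    by_key = defaultdict(list)
    for key, event_ts, value in events:
        by_key[key].append((event_ts, value))

    rows = []
    for key, query_ts in queries:
        # Only events that happened before the prediction time may contribute,
        # otherwise the feature would leak information from the future.
        total = sum(value for event_ts, value in by_key[key] if event_ts < query_ts)
        rows.append((key, query_ts, total))
    return rows

purchases = [("u1", 1, 10.0), ("u1", 5, 20.0), ("u2", 3, 7.0)]
queries = [("u1", 4), ("u1", 6), ("u2", 2)]
training_rows = build_training_set(purchases, queries)
```

The same key queried at two different timestamps yields two different feature values, which is why the timestamp is part of the training set definition.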
  • 25. Feature philosophy
      ● Complex features:
          ○ Only worth it if the gain is huge
          ○ Require complex computations
          ○ Harder to interpret
          ○ Harder to maintain
      ● Simple features:
          ○ Easier to maintain
          ○ Faster to compute
          ○ Cumulatively provide huge gain for the model
  • 26. Supported operations
      ● Sum, Count
      ● Min, Max
      ● First, Last
      ● Last N
      ● Statistical moments
      ● Approx unique count
      ● Approx percentile
      ● Bloom filters
      + time windows for all operations!
  • 27. Operation requirements
      ● Commutative: a ⊕ b = b ⊕ a
      ● Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
      ● Additional optimizations:
          ○ Reversible: a ⊕ ? = c (given a and c, the missing operand can be recovered)
      ● Must be O(1) in compute ⇒ must be O(1) in space
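As an illustration of these requirements, here is a small Python sketch of an aggregate that is commutative, associative, and reversible. The class `SumCount` and its method names are assumptions made for this example and are not taken from Zipline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SumCount:
    """Hypothetical constant-size aggregate: a running sum and a count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other):
        # The ⊕ operation: order and grouping of merges do not matter.
        return SumCount(self.total + other.total, self.count + other.count)

    def unmerge(self, other):
        # Reversibility: removes a previously merged operand,
        # i.e. solves other ⊕ ? = self for the missing operand.
        return SumCount(self.total - other.total, self.count - other.count)

    @property
    def mean(self):
        return self.total / self.count if self.count else None

a, b, c = SumCount(3.0, 1), SumCount(5.0, 1), SumCount(10.0, 2)
commutative = a.merge(b) == b.merge(a)
associative = a.merge(b).merge(c) == a.merge(b.merge(c))
recovered = a.merge(b).unmerge(a)
```

Sum and count are reversible in this sense, while operations such as min and max are not, because the removed operand cannot be undone from the merged value alone.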
  • 29. Data skew: large number of events
      Page views (50% of the rows belong to one user):
          user  ts
          1     2019-10-01 00:00:01
          1     2019-10-01 00:00:02
          ...   ...
          1     2019-10-01 23:59:59
          2     2019-10-02 15:20:30
          3     2019-10-12 16:11:44
      Use aggregateByKey to ensure data is combined locally in the first stage before being sent to the final merge.
  • 30. Aggregate by Key (diagram)
      Executors combine their records locally:
          (a, 1) (b, 1)                              → (a, 1) (b, 1)
          (a, 1) (a, 1) (b, 1) (b, 1)                → (a, 2) (b, 2)
          (a, 1) (a, 1) (a, 1) (b, 1) (b, 1) (b, 1)  → (a, 3) (b, 3)
      Shuffle, then final merge:
          (a, 1) (a, 2) (a, 3) → (a, 6)
          (b, 1) (b, 2) (b, 3) → (b, 6)
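The local-combine-then-merge pattern can be simulated without a cluster. The sketch below is plain Python, not Spark: the function `aggregate_by_key` is a hypothetical stand-in that only mirrors the zero value, per-partition sequence function, and cross-partition combine function of Spark's aggregateByKey, using the counts from the diagram.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Plain-Python simulation (an assumption for illustration, not the Spark API)."""
    # Stage 1: each "executor" folds its own records into one partial per key.
    partials = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = seq_op(local.get(key, zero), value)
        partials.append(local)

    # Stage 2: after the "shuffle", only one partial per key per partition is merged.
    merged = {}
    for local in partials:
        for key, acc in local.items():
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return partials, merged

partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]
partials, totals = aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y
)
```

Because each partition ships at most one partial aggregate per key, a heavily skewed key costs little more in the shuffle than a rare one.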
  • 32. Data skew: large number of examples (50% of the rows belong to one key)
      Training examples:
          ip         ts
          127.0.0.1  2019-10-15 05:03:20
          127.0.0.1  2019-10-15 12:32:11
          127.0.0.1  2019-10-15 09:55:29
          ...        ...
          1.2.3.4    2019-10-15 03:22:21
          1.2.3.5    2019-10-15 19:10:59
      Page views:
          ip         ts
          127.0.0.1  2019-10-01 00:00:01
          127.0.0.1  2019-10-01 00:00:02
          ...        ...
          1.2.3.4    2019-10-01 23:59:59
          1.2.3.5    2019-10-02 15:20:30
          1.2.3.6    2019-10-12 16:11:44
  • 33. Large number of timestamps: naive solution
      ● Keep one aggregate per (key, driver timestamp)
      ● For every event:
          ○ Find the corresponding key
          ○ For every driver timestamp of that key:
              ■ If the event occurred prior to the timestamp, produce ((key, driver timestamp), data)
      ● Use aggregateByKey
      ● Problem: O(N_ts × N_e), for N_ts driver timestamps and N_e events
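The naive approach can be sketched in a few lines of Python. This is an illustrative in-memory version with a count aggregate. The function name `naive_counts` and the data layout are assumptions, and the example reuses the timestamps 1, 3, 7, 8, 10, 15, 18, 20 with events at 6 and 9.

```python
def naive_counts(events, driver_timestamps):
    """Hypothetical sketch of the naive fan-out.

    events:            iterable of (key, event_ts)
    driver_timestamps: dict mapping key -> list of driver timestamps
    Returns {(key, driver_ts): number of events strictly before driver_ts}.
    """
    counts = {
        (key, driver_ts): 0
        for key, ts_list in driver_timestamps.items()
        for driver_ts in ts_list
    }
    for key, event_ts in events:
        # Every event is compared against every driver timestamp of its key,
        # which is where the O(N_ts × N_e) cost comes from.
        for driver_ts in driver_timestamps.get(key, ()):
            if event_ts < driver_ts:
                counts[(key, driver_ts)] += 1
    return counts

drivers = {"k": [1, 3, 7, 8, 10, 15, 18, 20]}
naive = naive_counts([("k", 6), ("k", 9)], drivers)
naive_values = [naive[("k", ts)] for ts in drivers["k"]]
```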
  • 34. Non-windowed case
      Timestamps for one key:         1  3  7  8  10  15  18  20
      Values after an event at t=6:   0  0  1  1  1   1   1   1
      Values after an event at t=9:   0  0  1  1  2   2   2   2
  • 35. Non-windowed case (optimized), O(N_e + N_ts)
      Apply each event only to the first affected aggregate. At the end, compute a cumulative sum of the values.
      Timestamps for one key:         1  3  7  8  10  15  18  20
      Values after an event at t=6:   0  0  1  0  0   0   0   0
      Values after an event at t=9:   0  0  1  0  1   0   0   0
      Result (cumulative sum):        0  0  1  1  2   2   2   2
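A minimal Python sketch of this optimization for a count aggregate follows. The function name `optimized_counts` is hypothetical. For brevity the sketch locates the first affected timestamp with a binary search, which costs O(N_e log N_ts + N_ts); walking sorted events and sorted timestamps together would give the O(N_e + N_ts) bound stated on the slide.

```python
from bisect import bisect_right
from itertools import accumulate

def optimized_counts(event_timestamps, driver_timestamps):
    """Hypothetical sketch: returns (per-slot deltas, cumulative result)."""
    driver_timestamps = sorted(driver_timestamps)
    deltas = [0] * len(driver_timestamps)
    for event_ts in event_timestamps:
        # Index of the first driver timestamp strictly after the event.
        idx = bisect_right(driver_timestamps, event_ts)
        if idx < len(deltas):
            deltas[idx] += 1  # touch only the first affected aggregate
    # A single cumulative pass propagates each event to all later timestamps.
    return deltas, list(accumulate(deltas))

deltas, result = optimized_counts([6, 9], [1, 3, 7, 8, 10, 15, 18, 20])
```

The cumulative pass relies on the aggregate being associative, so the same idea carries over to other operations that meet the requirements listed earlier.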
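  • 36. Data skew: windowed case, O(N_e × log(N_ts))
      Window size = 5. Timestamps for one key (timestamp index 0 to 7): 1 3 7 8 10 15 18 20
      The timestamp index is organized as a tree of index ranges: 0-7 at the root, 0-3 and 4-7 below it, then smaller ranges down to single indexes.
      An event at t=6 affects timestamps 7, 8, and 10, which are covered by the nodes 2-3 and 4.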
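One way to realize the tree of index ranges is a segment tree with range updates and point queries. The Python sketch below is an assumption about how such a structure could look for a count aggregate and is not Zipline's implementation. In particular, the window boundary convention (an event at e counts for driver timestamps t with e < t ≤ e + window) is chosen to match the example with the event at 6 and window size 5.

```python
from bisect import bisect_right

def windowed_counts(event_timestamps, driver_timestamps, window):
    """Hypothetical sketch: count of events in (t - window, t) ending before each driver timestamp t."""
    ts = sorted(driver_timestamps)
    n = len(ts)
    tree = [0] * (2 * n)  # leaves live at positions n .. 2n-1, position 0 is unused

    for event_ts in event_timestamps:
        # Affected driver timestamps form a contiguous index range [lo, hi).
        lo = bisect_right(ts, event_ts) + n
        hi = bisect_right(ts, event_ts + window) + n
        # Range update: mark O(log n) tree nodes that exactly cover the range.
        while lo < hi:
            if lo & 1:
                tree[lo] += 1
                lo += 1
            if hi & 1:
                hi -= 1
                tree[hi] += 1
            lo >>= 1
            hi >>= 1

    # Point query: sum the marks on the path from each leaf up to the root.
    result = []
    for i in range(n):
        node, total = i + n, 0
        while node >= 1:
            total += tree[node]
            node >>= 1
        result.append(total)
    return result

windowed = windowed_counts([6, 9], [1, 3, 7, 8, 10, 15, 18, 20], window=5)
```

For the event at 6, the update marks exactly two nodes, the one covering indexes 2-3 and the leaf for index 4, which matches the nodes highlighted in the example.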
  • 37. Feature Sources
      ● Hive table produced upstream
      ● Jitney: Airbnb event bus
      ● Databases via data warehouse export and CDC
  • 39. Results: improved workflow
      ● Zipline cuts weeks of effort otherwise spent on:
          ○ Custom feature pipelines
          ○ Data leaks in custom aggregations
          ○ Data sketches
      ● Improved model iteration workflow
      ● Feature distribution observability
  • 40. Results: runtime optimizations
      ● Optimized data pipelines:
          ○ 10x for training set backfill for some models
          ○ Incremental pipelines by default
          ○ Huge cost savings
  • 41. Q&A