Sawtooth Windows for Feature Aggregations


In this talk about Zipline, we introduce a new type of windowing construct called a sawtooth window. We describe the properties of sawtooth windows that we use to achieve online-offline consistency while maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We also describe a simple deployment strategy for correcting feature drift caused by operations that are not “abelian groups” and that operate over change data.


Sawtooth Windows
Zipline - Feature Engineering Framework
Nikhil Simha
nikhil.simha@airbnb.com

Features in context
• Machine Learning
• Supervised
• Structured data – database records, event streams
• Not unstructured data – images, video, audio, text
• Not labels

[Diagram with boxes: Exploration, Problem, Feature Creation, Model Training, Model Serving, Feature Serving, Application, Labeling]

Rules of thumb
• Complex models > Simple models
• Can learn complicated relationships within data

Rules of thumb
• Good data >> Bad data
• Labels: True, Balanced
• Features:
• Consistent
• Real-time
• Stable

Rules of thumb
• Simple models + good data >> Complex models + bad data
• Effort to better data >> Effort to better model
• Realtime features are hard
• Windowed aggregations are unsupported/inefficient
• Training/Serving consistency

Hardness of Realtime features
• Inadequate data sources
• Event sources: Don’t go back in history
• Database sources: Range scans are very expensive
• Skill gap
• ML vs system engineering
• Missing Backfills - Slow iteration

Goal
• Features should be real-time
• Features are aggregations
• Most aggregations should be windowed
• Sawtooth windows

Example
● Restaurant recommendation
● Ratings of restaurant last year
● Check-ins of user by cuisine in the last month
● Latest cuisine check-in by user

[Figure: Checkins and Ratings events on a Time axis, with Prediction points P1 and P2, Label points L, and the resulting Training data set]

Contract
● Serving
● User, Restaurant -> avg_restaurant_rating_1yr, cuisine_visits_30d
● Training
● Labeled Data: (User, Restaurant, timestamp, label)
● Enrich with features
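
The training side of this contract can be illustrated with a small sketch: each labeled row is enriched with the feature value as it would have been at the row's timestamp. This is a naive, exact-window illustration and not Zipline's implementation; the tuple layouts, the function names `avg_rating_as_of` and `enrich`, the day-based time unit, and the choice of a half-open window are all assumptions made for the example.

```python
# Hypothetical layouts (assumptions for this sketch):
#   ratings:      (restaurant, ts_days, rating)
#   labeled rows: (user, restaurant, ts_days, label)
WINDOW_DAYS = 365  # stands in for the "1yr" in avg_restaurant_rating_1yr


def avg_rating_as_of(ratings, restaurant, ts, window=WINDOW_DAYS):
    """Average rating of `restaurant` over [ts - window, ts), or None if empty."""
    values = [
        rating
        for (rid, rating_ts, rating) in ratings
        if rid == restaurant and ts - window <= rating_ts < ts
    ]
    return sum(values) / len(values) if values else None


def enrich(labeled_rows, ratings):
    """Append the point-in-time feature value to every labeled row."""
    return [
        (user, restaurant, ts, label, avg_rating_as_of(ratings, restaurant, ts))
        for (user, restaurant, ts, label) in labeled_rows
    ]


ratings = [("r1", 100, 3.0), ("r1", 200, 4.0), ("r1", 300, 2.0), ("r2", 150, 5.0)]
labeled_rows = [("u1", "r1", 250, 1), ("u2", "r1", 350, 0), ("u3", "r1", 50, 1)]
training_set = enrich(labeled_rows, ratings)
```

Only ratings strictly before the label timestamp are used, so the row at time 250 does not see the rating written at time 300.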

Data sources
● Events
● Timestamped – user_txn stream
● Entities
● As served by microservices etc
● Based on DB
● User_balance table
● Or non-real-time : dim/fct tables

[Diagram: Service Fleet, Production Database, DB Snapshot, Event log, Change Capture Stream, Event Stream, Change capture log, Message Bus, Data Lake, Live Derived Data, Media]

API – Philosophy
• SQL is two languages
• Keep the expression language
• CAST(get_json_object(response, '$.age') AS BIGINT)
• Control the structural language
• GROUP BY, JOIN, HAVING, SELECT, WHERE, FROM

API – Philosophy
Windows are first class
Source equivalence: topic ~ table ~ mutations
Data Models are first class
Entity (dim)
Events (fact, timestamped)

API – Internals
• Python -> Thrift-Json -> Spark + Scala
• Versioned
• Driven by Airflow

Aggregations – SUM
• Commutative: a + b = b + a
• Order independent
• Associative: (a + b) + c = a + (b + c)
• Parallelizable
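
As a rough illustration of why these two properties matter, the sketch below sums chunks independently and then merges the partial results in two different orders. The function names and the chunking scheme are illustrative assumptions, not part of Zipline.

```python
from functools import reduce


def partial_sums(values, chunk_size):
    """Sum each chunk on its own; each chunk could run on a different worker."""
    return [
        sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]


def merge_sums(partials):
    """Combine partial sums; any grouping or ordering gives the same total."""
    return reduce(lambda a, b: a + b, partials, 0)


values = [3, 1, 4, 1, 5, 9, 2, 6]
partials = partial_sums(values, chunk_size=3)
total = merge_sums(partials)
reordered_total = merge_sums(list(reversed(partials)))
```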

Aggregations – AVG
• One not-so-clever trick
• Operate on “Intermediate Representation” / IR
• Factors into (sum, count)
• Finalized by a division: (sum/count)
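
A minimal sketch of the (sum, count) intermediate representation follows. The class and function names are assumptions chosen for the example; the point is that IRs merge like sums do, and the division happens only once at the end.

```python
from typing import NamedTuple, Optional


class AvgIR(NamedTuple):
    total: float
    count: int


def ir_from(values) -> AvgIR:
    """Build the intermediate representation for one chunk of values."""
    return AvgIR(sum(values), len(values))


def merge_avg(a: AvgIR, b: AvgIR) -> AvgIR:
    """Merging IRs is component-wise addition, so it stays associative."""
    return AvgIR(a.total + b.total, a.count + b.count)


def finalize(ir: AvgIR) -> Optional[float]:
    """Turn the IR into the user-facing average."""
    return ir.total / ir.count if ir.count else None


left = ir_from([3, 3, 4])
right = ir_from([2])
merged = merge_avg(left, right)
average = finalize(merged)
# Averaging the two chunk averages instead would weight the chunks equally
# and give a different (wrong) answer for unequal chunk sizes.
average_of_averages = (finalize(left) + finalize(right)) / 2
```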

Aggregations
• Constant memory / Bounded IR
• Two classes of aggregations
• Sum, Avg, Count
• Min/Max, Approx Unique, percentiles, topK
• Mutations – updates, deletes etc.
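
One plausible reading of the two classes, stated here as an assumption, is that the first class (sum, avg, count) has an inverse operation, so an update or delete can be applied directly to the IR, while the second class (min/max and similar) cannot always be corrected from the IR alone. The sketch below illustrates that difference; the function names and the (before, after) mutation shape are hypothetical.

```python
def apply_mutation(ir, before, after):
    """Apply an insert, update, or delete to a (sum, count) IR.

    insert: before is None; delete: after is None; update: both are set.
    """
    total, count = ir
    if before is not None:  # undo the old value by subtracting it
        total -= before
        count -= 1
    if after is not None:   # apply the new value
        total += after
        count += 1
    return (total, count)


def max_after_delete(current_max, deleted):
    """Return the new max if the IR alone determines it, else None.

    Deleting the current max leaves no way to recover the runner-up
    without rescanning the underlying data.
    """
    return current_max if deleted < current_max else None


ir = (0, 0)
ir = apply_mutation(ir, None, 5)   # insert 5
ir = apply_mutation(ir, None, 3)   # insert 3
ir = apply_mutation(ir, 3, 4)      # update 3 -> 4
ir = apply_mutation(ir, 5, None)   # delete 5
```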

Windows – Hopping
• Staleness
• As stale as the hop size
• Memory Efficient
• One partial per hop
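
The staleness and memory trade-off can be sketched as follows, assuming a sum aggregation and integer timestamps in minutes. The function names are illustrative. A query only sees hops that have fully completed, so a write in the current hop is invisible until the next hop boundary.

```python
from collections import defaultdict


def build_hop_partials(events, hop):
    """Keep one partial sum per hop, keyed by the hop's start time."""
    partials = defaultdict(int)
    for ts, value in events:
        partials[ts // hop * hop] += value
    return dict(partials)


def hopping_sum(partials, query_ts, window, hop):
    """Sum over the last `window` worth of completed hops."""
    end = query_ts // hop * hop   # most recent hop boundary
    start = end - window
    return sum(v for hop_start, v in partials.items() if start <= hop_start < end)


events = [(10, 1), (70, 1), (130, 1), (200, 1), (250, 1)]
partials = build_hop_partials(events, hop=60)
# The event at ts=250 falls in the still-open hop starting at 240, so it is not counted.
result = hopping_sum(partials, query_ts=250, window=180, hop=60)
```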

Windows – Sliding
• Freshness
• Memory intensive
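
For contrast, a straightforward sliding window has to retain every raw event that is still inside the window, which is where the memory cost comes from. The sketch below assumes a sum aggregation, in-order arrival, and a window of (now - window, now]; the class name is hypothetical.

```python
from collections import deque


class SlidingSum:
    """Exact sliding-window sum that stores each in-window event."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # grows with the number of events in the window
        self.total = 0

    def add(self, ts, value):
        self.events.append((ts, value))
        self.total += value

    def query(self, now):
        # Evict events that have fallen out of (now - window, now].
        while self.events and self.events[0][0] <= now - self.window:
            _, value = self.events.popleft()
            self.total -= value
        return self.total


sliding = SlidingSum(window=180)
for ts, value in [(10, 1), (70, 1), (130, 1), (200, 1), (250, 1)]:
    sliding.add(ts, value)
fresh_result = sliding.query(now=260)
retained = len(sliding.events)
```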

Windows – Sawtooth
• Freshness
• Writes are taken into account immediately
• Memory
• Partial aggregates per hop
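
A minimal sketch of the idea, under the assumption that the head of the window follows the query time while the tail snaps back to a hop boundary. Writes update the current hop's partial immediately, and memory stays at one partial per hop. The class name, the sum aggregation, and the exact boundary rule are assumptions for illustration, and eviction of expired partials is omitted.

```python
from collections import defaultdict


class SawtoothSum:
    def __init__(self, window, hop):
        self.window = window
        self.hop = hop
        self.partials = defaultdict(int)   # hop start -> partial sum

    def write(self, ts, value):
        """A write lands in its hop's partial and is visible right away."""
        self.partials[ts // self.hop * self.hop] += value

    def tail(self, now):
        """Window start, rounded down to a hop boundary."""
        return (now - self.window) // self.hop * self.hop

    def query(self, now):
        tail = self.tail(now)
        return sum(v for start, v in self.partials.items() if tail <= start <= now)

    def effective_length(self, now):
        """Actual span covered; varies between window and window + hop."""
        return now - self.tail(now)


sawtooth = SawtoothSum(window=180, hop=60)
for ts, value in [(10, 1), (70, 1), (130, 1), (200, 1), (250, 1)]:
    sawtooth.write(ts, value)
fresh = sawtooth.query(now=250)
lengths = [sawtooth.effective_length(t) for t in (299, 300)]
```

Under these assumptions the covered span grows from 180 up to just under 240 and then drops back to 180 at each hop boundary, which gives the sawtooth shape.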

Windows – Sawtooth
• Catch
• sum/count vs others
• Consistency

Serving Architecture
[Diagram: Feature Declaration, Streaming aggregates, Batch aggregates, Feature Store, Feature Client, Model, Model Server, Application Server]

Choosing hops
• Automatically chosen
• Hop size < x% of window size
• Daily, hourly, 5-minute
• x ~ 8.34%
• Caching – variety of window sizes can re-use the hop
• 90d, 30d
• Across windows & across queries
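
The selection rule can be sketched as picking the largest available hop that stays under x% of the window. Preferring the largest qualifying hop is an assumption of this sketch, as are the names `HOPS_MINUTES` and `choose_hop`. One arithmetic observation: 8.34% is just above 1/12, so a 5-minute hop qualifies for a 1-hour window and an hourly hop qualifies for a 12-hour window.

```python
HOPS_MINUTES = {"daily": 24 * 60, "hourly": 60, "5minute": 5}
MAX_FRACTION = 0.0834   # x ~ 8.34% from the slide


def choose_hop(window_minutes):
    """Return the largest hop strictly smaller than x% of the window, or None."""
    for name, size in sorted(HOPS_MINUTES.items(), key=lambda kv: -kv[1]):
        if size < MAX_FRACTION * window_minutes:
            return name
    return None


DAY = 24 * 60
chosen = {
    "90d": choose_hop(90 * DAY),
    "30d": choose_hop(30 * DAY),
    "12h": choose_hop(12 * 60),
    "1h": choose_hop(60),
}
```

Because the 90d and 30d windows both resolve to the daily hop, the same daily partials can serve both windows, which matches the caching point above.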


[Title-only slides: Feature Set Example (14-16), Aggregation Math (20), Windows – Lambda (31-32, with the bullet "Points of change"), Questions (34)]
