Sawtooth Windows for Feature Aggregations

Description

In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe several properties of sawtooth windows that we use to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not abelian groups.

Transcript

  1. Sawtooth Windows. Zipline, a Feature Engineering Framework. Nikhil Simha, nikhil.simha@airbnb.com
  2. Features in context • Machine learning, supervised • Structured data: database records, event streams • Not unstructured data: images, video, audio, text • Not labels
  3. (ML workflow diagram) Exploration, Problem, Feature Creation, Model Training, Model Serving, Feature Serving, Application, Labeling
  4. Rules of thumb • Complex models > simple models • Can learn complicated relationships within the data
  5. Rules of thumb • Good data >> bad data • Labels: true, balanced • Features: consistent, real-time, stable
  6. Rules of thumb • Simple models + good data >> complex models + bad data • Effort on better data >> effort on a better model • Real-time features are hard • Windowed aggregations are unsupported/inefficient • Training/serving consistency
  7. Hardness of real-time features • Inadequate data sources • Event sources: don't go back in history • Database sources: range scans are very expensive • Skill gap: ML vs. systems engineering • Missing backfills: slow iteration
  8. Goal • Features should be real-time • Features are aggregations • Most aggregations should be windowed • Sawtooth windows
  9. Example ● Restaurant recommendation ● Ratings of the restaurant in the last year ● Check-ins of the user by cuisine in the last month ● Latest cuisine check-in by the user
  10. Training data set (timeline diagram: check-in and rating events over time, with prediction points P1 and P2 and labels L)
  11. Contract ● Serving: (User, Restaurant) -> avg_restaurant_rating_1yr, cuisine_visits_30d ● Training: labeled data (User, Restaurant, timestamp, label), enriched with features
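A minimal sketch of the training side of this contract: each labeled row is enriched with the feature values that were in effect at its timestamp (a point-in-time lookup). The data structures and function below are illustrative only, not the Zipline backfill implementation.

```python
from bisect import bisect_right

def enrich(labeled_rows, feature_log):
    """Attach to each (user, restaurant, ts, label) row the feature values
    that were in effect at ts.

    feature_log: {(user, restaurant): [(ts, feature_dict), ...]} sorted by ts.
    Illustrative only -- Zipline computes this as a backfill over the data lake.
    """
    enriched = []
    for user, restaurant, ts, label in labeled_rows:
        history = feature_log.get((user, restaurant), [])
        # Latest feature snapshot at or before the label timestamp.
        idx = bisect_right([t for t, _ in history], ts) - 1
        features = history[idx][1] if idx >= 0 else {}
        enriched.append({"user": user, "restaurant": restaurant,
                         "ts": ts, "label": label, **features})
    return enriched
```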
  12. Data sources ● Events: timestamped, e.g. a user_txn stream ● Entities: as served by microservices etc., based on a DB, e.g. a user_balance table ● Or non-real-time: dim/fct tables
  13. (Data architecture diagram) Service Fleet, Production Database, DB Snapshot, Event Log, Change Capture Stream, Event Stream, Change Capture Log, Message Bus, Data Lake, Live Derived Data, Media
  14. Feature Set Example
  15. Feature Set Example
  16. Feature Set Example
  17. API – Philosophy • SQL is two languages • Keep the expression language: CAST(get_json_object(response, '$.age') AS BIGINT) • Control the structural language: GROUPBY, JOIN, HAVING, SELECT, WHERE, FROM
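A hypothetical sketch of what that split can look like in practice: column-level transforms stay as SQL expression strings, while the structural parts (keys, aggregations, windows) become explicit Python configuration. The class and field names below are invented for illustration; they are not the actual Zipline API.

```python
# Hypothetical configuration objects -- not the real Zipline classes.
from dataclasses import dataclass, field

@dataclass
class Aggregation:
    operation: str            # e.g. "AVG", "COUNT", "LAST"
    column: str               # a SQL expression is allowed here
    windows: list = field(default_factory=list)

@dataclass
class FeatureSet:
    source: str               # topic ~ table ~ mutations
    keys: list                # group-by keys are explicit structure
    aggregations: list

restaurant_features = FeatureSet(
    source="ratings",         # Hive table and/or Kafka topic
    keys=["restaurant_id"],
    aggregations=[
        # The expression language stays as SQL; the structure is Python config.
        Aggregation("AVG", "CAST(get_json_object(response, '$.rating') AS BIGINT)",
                    windows=["365d"]),
        Aggregation("COUNT", "1", windows=["30d"]),
    ],
)
```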
  18. API – Philosophy • Windows are first class • Source equivalence: topic ~ table ~ mutations • Data models are first class: Entities (dim), Events (fact, timestamped)
  19. API – Internals • Python -> Thrift JSON -> Spark + Scala • Versioned • Driven by Airflow
  20. Aggregation Math
  21. Aggregations – SUM • Commutative: a + b = b + a (order independent) • Associative: (a + b) + c = a + (b + c) (parallelizable)
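A minimal illustration of why commutativity and associativity make SUM parallelizable: chunk-level partial sums, merged in any order, equal the sequential sum.

```python
import random

values = [random.randint(-10, 10) for _ in range(1000)]

# Split into chunks and aggregate each chunk independently (associativity) ...
chunks = [values[i:i + 100] for i in range(0, len(values), 100)]
partials = [sum(chunk) for chunk in chunks]

# ... then merge the partials in a shuffled order (commutativity).
random.shuffle(partials)
assert sum(partials) == sum(values)
```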
  22. Aggregations – AVG • One not-so-clever trick: operate on an "intermediate representation" (IR) • The IR factors into (sum, count) • Finalized by a division: sum / count
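A minimal sketch of the intermediate-representation trick for AVG: the IR is a (sum, count) pair whose merge is itself commutative and associative, and the division happens only when the value is finalized.

```python
def avg_ir(values):
    """IR for AVG over a batch of values: (sum, count)."""
    return (sum(values), len(values))

def merge(ir_a, ir_b):
    """Merging IRs is element-wise addition -- order independent."""
    return (ir_a[0] + ir_b[0], ir_a[1] + ir_b[1])

def finalize(ir):
    """Only the final step divides."""
    s, c = ir
    return s / c if c else None

left, right = avg_ir([1, 2, 3]), avg_ir([4, 5])
assert finalize(merge(left, right)) == 3.0   # average of 1..5
```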
  23. Aggregations • Constant memory / bounded IR • Two classes of aggregations: Sum, Avg, Count vs. Min/Max, approximate unique, percentiles, top-K • Mutations: updates, deletes, etc.
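A sketch of why mutations split aggregations into two classes: a SUM-style IR can absorb a delete by subtracting, while a MIN-style IR does not retain enough information to undo one and has to be rebuilt (or kept as a richer IR).

```python
# SUM is invertible under deletes: change data can be applied directly to the IR.
sum_ir = 10
sum_ir += 4           # insert of 4
sum_ir -= 4           # delete of 4: the IR is exactly as if 4 never arrived
assert sum_ir == 10

# MIN is not: once a value is folded in, a delete can't be undone from the IR alone.
min_ir = min(7, 3)    # events 7 then 3 -> IR is 3
# Delete of 3: the correct answer is 7, but the IR no longer remembers it,
# so the aggregate must be rebuilt from the surviving events.
remaining_events = [7]
min_ir = min(remaining_events)
assert min_ir == 7
```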
  24. Windows – Hopping
  25. Windows – Hopping • Staleness: as stale as the hop size • Memory efficient: one partial aggregate per hop
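A sketch of a hopping-window aggregation over per-hop partials, assuming a simple sum: memory is one partial per hop, but the reported window only advances at hop boundaries, so it can be stale by up to one hop.

```python
def hopping_sum(events, window, hop, now):
    """Hopping-window sum. events: iterable of (timestamp, value) with
    integer timestamps. Keeps one partial sum per hop; the reported window
    is [window_end - window, window_end), where window_end is `now` rounded
    down to a hop boundary -- so the result only advances once per hop."""
    partials = {}
    for ts, value in events:
        hop_start = (ts // hop) * hop
        partials[hop_start] = partials.get(hop_start, 0) + value

    window_end = (now // hop) * hop
    window_start = window_end - window
    return sum(v for start, v in partials.items()
               if window_start <= start < window_end)
```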
  26. Windows – Sliding • Freshness • Memory intensive
  27. Windows – Sawtooth • Freshness: writes are taken into account immediately • Memory: partial aggregates per hop
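A sketch of the sawtooth idea, again for a simple sum: the tail of the window is assembled from per-hop partials, while events after the most recent hop boundary are folded in directly, so new writes show up immediately. The effective window length drifts between `window` and `window + hop`, which gives the sawtooth shape.

```python
def sawtooth_sum(events, window, hop, now):
    """Sawtooth-window sum. events: iterable of (timestamp, value) with
    integer timestamps. Older events are summarized as one partial per hop
    (the batch side); events after the last hop boundary are summed raw
    (the streaming side), so new writes count immediately."""
    head_start = (now // hop) * hop          # most recent hop boundary
    window_start = head_start - window       # window start snaps to hop boundaries

    partials = {}
    for ts, value in events:
        if window_start <= ts < head_start:
            hop_start = (ts // hop) * hop
            partials[hop_start] = partials.get(hop_start, 0) + value

    head = sum(value for ts, value in events if head_start <= ts <= now)
    return sum(partials.values()) + head
```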
  28. Windows – Sawtooth
  29. Windows – Sawtooth • The catch: sum/count vs. the others • Consistency
  30. Serving architecture (diagram): Feature Declaration, Streaming Aggregates, Batch Aggregates, Feature Store, Feature Client, Model Server, Model, Application Server
  31. Windows – Lambda • Points of change
  32. Windows – Lambda
  33. Choosing hops • Hop size is chosen automatically: hop size < x% of the window size, with x ~ 8.34% • Granularities: daily, hourly, 5-minute • Caching: several window sizes can reuse the same hop (e.g. 90d and 30d), across windows and across queries
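A sketch of the hop-selection rule described on the slide, assuming the threshold is roughly one-twelfth of the window (the slide's x ~ 8.34%): the largest standard granularity (daily, hourly, 5-minute) under that threshold is chosen, so 90-day and 30-day windows both land on daily hops and can share cached partials.

```python
DAY, HOUR, FIVE_MIN = 86_400, 3_600, 300
MAX_HOP_FRACTION = 1 / 12            # roughly the slide's x ~ 8.34%

def choose_hop(window_seconds):
    """Largest standard hop that is still a small fraction of the window."""
    for hop in (DAY, HOUR, FIVE_MIN):
        if hop <= window_seconds * MAX_HOP_FRACTION:
            return hop
    return FIVE_MIN                  # floor at the finest granularity

assert choose_hop(90 * DAY) == DAY       # 90d and 30d share daily hop partials
assert choose_hop(30 * DAY) == DAY
assert choose_hop(12 * HOUR) == HOUR
assert choose_hop(2 * HOUR) == FIVE_MIN
```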
  34. Questions
