Sawtooth Windows for Feature Aggregations

In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe the properties of sawtooth windows that we use to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also cover a simple deployment strategy for correcting feature drift caused by operations over change data that do not form abelian groups.


  1. Sawtooth Windows. Zipline - Feature Engineering Framework. Nikhil Simha, nikhil.simha@airbnb.com
  2. Features in context • Machine learning: supervised, on structured data (database records, event streams) • Not unstructured data (images, video, audio, text) • Not labels
  3. [Diagram: ML workflow - Exploration, Problem, Feature Creation, Model Training, Model Serving, Feature Serving, Application, Labeling]
  4. Rules of thumb • Complex models > simple models • They can learn complicated relationships within data
  5. Rules of thumb • Good data >> bad data • Labels: true, balanced • Features: consistent, real-time, stable
  6. Rules of thumb • Simple models + good data >> complex models + bad data • Effort toward better data >> effort toward a better model • Real-time features are hard • Windowed aggregations are unsupported or inefficient • Training/serving consistency is hard to maintain
  7. Why real-time features are hard • Inadequate data sources • Event sources: don't go back in history • Database sources: range scans are very expensive • Skill gap: ML vs. systems engineering • Missing backfills - slow iteration
  8. Goal • Features should be real-time • Features are aggregations • Most aggregations should be windowed • Sawtooth windows
  9. Example • Restaurant recommendation • Ratings of the restaurant in the last year • Check-ins of the user by cuisine in the last month • Latest cuisine check-in by the user
  10. [Diagram: timeline of check-ins and ratings for a user/restaurant pair, showing predictions P1 and P2 joined with labels L at their timestamps to form the training data set]
  11. Contract • Serving: (User, Restaurant) -> avg_restaurant_rating_1yr, cuisine_visits_30d • Training: labeled data (User, Restaurant, timestamp, label), enriched with features
  12. Data sources • Events: timestamped, e.g. a user_txn stream • Entities: as served by microservices etc., based on a DB (e.g. a user_balance table), or non-real-time dim/fct tables
  13. [Diagram: service fleet and production database feeding an event stream and a change-capture stream through a message bus; DB snapshots, event logs, and change-capture logs land in the data lake alongside live derived data]
  14. Feature Set Example [code screenshot]
  15. Feature Set Example [code screenshot]
  16. Feature Set Example [code screenshot]
  17. API - Philosophy • SQL is two languages • Keep the expression language: CAST(get_json_object(response, "$.age") AS BIGINT) • Control the structural language: GROUP BY, JOIN, HAVING, SELECT, WHERE, FROM
  18. API - Philosophy • Windows are first class • Source equivalence: topic ~ table ~ mutations • Data models are first class: Entity (dim), Events (fact, timestamped)
  19. API - Internals • Python -> Thrift-JSON -> Spark + Scala • Versioned • Driven by Airflow
  20. Aggregation Math
  21. Aggregations - SUM • Commutative: a + b = b + a (order independent) • Associative: (a + b) + c = a + (b + c) (parallelizable)
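The two properties on this slide are what make SUM safe to split across workers. A minimal sketch (the numbers and two-way chunking are illustrative):

```python
# SUM is commutative (order independent) and associative
# (parallelizable), so partial sums computed over arbitrary
# chunks can be merged in any order with the same result.
from functools import reduce

events = [3, 1, 4, 1, 5, 9, 2, 6]

# Each "worker" aggregates its own chunk independently...
partials = [sum(events[:4]), sum(events[4:])]

# ...and the partials are merged afterwards.
total = reduce(lambda a, b: a + b, partials)
assert total == sum(events)  # same as one sequential pass
```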
  22. Aggregations - AVG • One not-so-clever trick: operate on an "intermediate representation" (IR) • AVG factors into (sum, count) • Finalized by a division: sum / count
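The (sum, count) trick can be sketched as a prepare/merge/finalize triple (function names are illustrative, not Zipline's API):

```python
# AVG is not directly mergeable, but its intermediate
# representation (IR) is: a (sum, count) pair. Partials merge by
# component-wise addition and are finalized by a single division.
def avg_prepare(x):
    return (x, 1)                      # one input -> IR

def avg_merge(a, b):
    return (a[0] + b[0], a[1] + b[1])  # combine two IRs

def avg_finalize(ir):
    s, c = ir
    return s / c                       # final division

values = [2.0, 4.0, 6.0, 8.0]
merged = (0.0, 0)
for v in values:
    merged = avg_merge(merged, avg_prepare(v))
assert merged == (20.0, 4)
assert avg_finalize(merged) == 5.0
```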
  23. Aggregations • Constant memory / bounded IR • Two classes of aggregations: sum, avg, count vs. min/max, approx unique, percentiles, topK • Mutations - updates, deletes, etc.
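The mutation point connects back to the abelian-group caveat in the abstract: aggregations whose updates have inverses (like SUM) can absorb deletes from change data directly, while the second class (min/max, percentiles, etc.) cannot. A sketch with illustrative values:

```python
# SUM's updates form an abelian group: every value has an inverse,
# so a delete arriving on a change-capture stream can be applied by
# subtracting. MIN/MAX have no inverse: deleting the current minimum
# forces a rescan, which is why such operations need a correction
# strategy for drift over change data.
def sum_apply(ir, value, is_delete=False):
    return ir - value if is_delete else ir + value

ir = 0
for v in [10, 20, 30]:
    ir = sum_apply(ir, v)                   # inserts
ir = sum_apply(ir, 20, is_delete=True)      # delete = apply the inverse
assert ir == 40                             # as if 20 had never arrived
```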
  24. Windows - Hopping [diagram]
  25. Windows - Hopping • Staleness: as stale as the hop size • Memory efficient: one partial aggregate per hop
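The staleness/memory trade-off can be sketched with one SUM partial per hop (the 1-hour hop and the event values are illustrative):

```python
# One partial SUM aggregate per hop: a query adds up the complete
# hops inside the window, so the answer can lag by up to one hop
# (staleness ~ hop size) while memory stays bounded at roughly
# window_size / hop_size partials.
from collections import defaultdict

HOP = 3600  # 1-hour hops, in seconds

def hop_start(ts):
    return ts - ts % HOP

def build_partials(events):
    partials = defaultdict(float)  # hop start -> partial sum
    for ts, v in events:
        partials[hop_start(ts)] += v
    return dict(partials)

def hopping_query(partials, now, window):
    start = hop_start(now) - window
    return sum(v for h, v in partials.items()
               if start <= h < hop_start(now))

p = build_partials([(100, 1.0), (3700, 2.0), (7300, 3.0)])
# The event at t=7300 sits in the still-open hop, so it is not
# visible yet: the result is stale by up to one hop.
assert hopping_query(p, now=7500, window=7200) == 3.0
```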
  26. Windows - Sliding • Freshness • Memory intensive
  27. Windows - Sawtooth • Freshness: writes are taken into account immediately • Memory: partial aggregates per hop
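One way to picture how a sawtooth window combines the two properties above (a sketch under assumed 1-hour hops and a SUM aggregate, not Zipline's actual implementation):

```python
# Sawtooth = hopping partials for the head of the window, plus raw
# events replayed since the last hop boundary, so fresh writes are
# visible immediately. The window start still snaps to hop
# boundaries, so the effective window length oscillates between
# `window` and `window + hop` - the sawtooth shape.
HOP = 3600

def hop_start(ts):
    return ts - ts % HOP

def sawtooth_query(partials, tail_events, now, window):
    start = hop_start(now) - window              # snapped window start
    head = sum(v for h, v in partials.items()
               if start <= h < hop_start(now))   # batch hop partials
    tail = sum(v for ts, v in tail_events
               if hop_start(now) <= ts <= now)   # fresh raw events
    return head + tail

partials = {0: 1.0, 3600: 2.0}   # precomputed hop aggregates
tail_events = [(7250, 5.0)]      # arrived after the last hop boundary
assert sawtooth_query(partials, tail_events, now=7300, window=7200) == 8.0
```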
  28. Windows - Sawtooth [diagram]
  29. Windows - Sawtooth • Catch: sum/count vs. the others • Consistency
  30. Serving Architecture [diagram: a feature declaration drives streaming and batch aggregates into the feature store; a feature client serves the application server and model server]
  31. Windows - Lambda • Points of change
  32. Windows - Lambda [diagram]
  33. Choosing hops • Chosen automatically • Hop size < x% of window size • Tiers: daily, hourly, 5-minute • x ~ 8.34% • Caching: a variety of window sizes can reuse the same hops (e.g. 90d and 30d), across windows and across queries
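The selection rule above can be sketched as picking the largest tier whose hop stays under roughly 8.34% (about 1/12) of the window; the tiers and threshold come from the slide, and the function name is illustrative:

```python
# Hop selection: prefer the coarsest tier (daily, hourly, 5-minute)
# whose hop is at most ~1/12 of the window, so snapping error stays
# small. Coarse shared hops also let many windows and queries reuse
# the same cached partials (e.g. 90d and 30d share daily hops).
DAY, HOUR, FIVE_MIN = 86400, 3600, 300
MAX_FRACTION = 1 / 12  # ~8.34%

def choose_hop(window_seconds):
    for hop in (DAY, HOUR, FIVE_MIN):
        if hop <= window_seconds * MAX_FRACTION:
            return hop
    return FIVE_MIN  # fall back to the smallest supported hop

assert choose_hop(90 * DAY) == DAY       # 90d and 30d windows share
assert choose_hop(30 * DAY) == DAY       # the same daily hops
assert choose_hop(12 * HOUR) == HOUR
assert choose_hop(30 * 60) == FIVE_MIN
```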
  34. Questions
