In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe the properties of sawtooth windows that we use to achieve online-offline consistency while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift caused by operations over change data that are not “abelian groups”.
2. • Machine Learning
• Supervised
• Structured data – database records, event streams
• Not unstructured data – images, video, audio, text
• Not labels
Features in context
4. • Complex models > Simple models
• Can learn complicated relationships within data
Rules of thumb
5. • Good data >> Bad data
• Labels: True, Balanced
• Features:
• Consistent
• Real-time
• Stable
Rules of thumb
6. • Simple models + good data >> Complex models + Bad data
• Effort to better data >> Effort to better model
• Real-time features are hard
• Windowed Aggregations are unsupported/inefficient
• Training/Serving consistency
Rules of thumb
7. • Inadequate data sources
• Event sources: Don’t go back in history
• Database sources: Range scans are very expensive
• Skill gap
• ML vs system engineering
• Missing Backfills - Slow iteration
Hardness of real-time features
8. • Features should be real-time
• Features are aggregations
• Most aggregations should be windowed
• Sawtooth windows
Goal
9. Example
● Restaurant recommendation
● Ratings of restaurant last year
● Check-ins of user by cuisine in the last month
● Latest cuisine check-in by user
11. Contract
● Serving
● User, Restaurant -> avg_restaurant_rating_1yr, cuisine_visits_30d
● Training
● Labeled Data: (User, Restaurant, timestamp, label)
● Enrich with features
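The training side of the contract can be sketched as a point-in-time enrichment: each labeled row only sees events that happened before its own timestamp. This is a minimal illustration, not Zipline's implementation. The tuple layouts, the `cuisine_visits_30d` and `enrich` helpers, and the exact (non-sawtooth) 30-day window boundary are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

def cuisine_visits_30d(checkins, user, as_of):
    """Count a user's check-ins per cuisine in the 30 days before `as_of`.

    `checkins` is a hypothetical list of (user, cuisine, timestamp) tuples.
    """
    start = as_of - timedelta(days=30)
    return Counter(
        cuisine
        for u, cuisine, ts in checkins
        if u == user and start <= ts < as_of  # never look past the label time
    )

def enrich(labeled_rows, checkins):
    """Attach the feature to each (user, restaurant, timestamp, label) row."""
    return [
        (user, restaurant, ts, label, dict(cuisine_visits_30d(checkins, user, ts)))
        for user, restaurant, ts, label in labeled_rows
    ]

checkins = [
    ("u1", "thai", datetime(2020, 1, 10)),
    ("u1", "thai", datetime(2020, 1, 20)),
    ("u1", "sushi", datetime(2019, 11, 1)),
    ("u2", "thai", datetime(2020, 1, 15)),
]
labeled = [
    ("u1", "r9", datetime(2020, 1, 15), 1),
    ("u1", "r9", datetime(2020, 2, 1), 0),
]
enriched = enrich(labeled, checkins)
```

The same user gets different feature values for the two rows because the feature is evaluated as of each row's timestamp, which is the value serving would have returned at that moment.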
12. Data sources
● Events
● Timestamped – user_txn stream
● Entities
● As served by microservices etc
● Based on DB
● User_balance table
● Or non-real-time: dim/fct tables
17. API – Philosophy
• SQL is two languages
• Keep Expression Language
• CAST(get_json_object(response, "$.age") AS BIGINT)
• Control Structural language
• GROUP BY, JOIN, HAVING, SELECT, WHERE, FROM
18. API – Philosophy
Windows are first class
Source equivalence: topic ~ table ~ mutations
Data Models are first class
Entity (dim)
Events (fact, timestamped)
19. API – Internals
• Python -> Thrift-Json -> Spark + Scala
• Versioned
• Driven by Airflow
21. Aggregations – SUM
• Commutative: a + b = b + a
• Order independent
• Associative: (a + b) + c = a + (b + c)
• Parallelizable
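A small sketch of why these two properties matter: the values can be dealt out to partitions in any order, summed independently, and the partial sums combined afterwards. The `partial_sums` helper and the partition count are hypothetical and only for illustration.

```python
from functools import reduce

def partial_sums(values, num_partitions):
    """Sum each partition independently, as separate workers could.

    Strided slicing deliberately reorders the input; the result is still
    correct because addition is commutative and associative.
    """
    chunks = [values[i::num_partitions] for i in range(num_partitions)]
    return [sum(chunk) for chunk in chunks]

values = [3, 1, 4, 1, 5, 9, 2, 6]
partials = partial_sums(values, 3)

# Combine the partial results in a final step.
total = reduce(lambda a, b: a + b, partials, 0)
```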
22. Aggregations – AVG
• One not-so-clever trick
• Operate on “Intermediate Representation” / IR
• Factors into (sum, count)
• Finalized by a division: (sum/count)
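The trick can be sketched as follows: the average itself cannot be merged across partitions, but its (sum, count) intermediate representation can. The names `AvgIR`, `update`, `merge` and `finalize` are chosen for this example and are not taken from the Zipline API.

```python
from typing import NamedTuple, Optional

class AvgIR(NamedTuple):
    """Intermediate representation of an average: a (sum, count) pair."""
    total: float
    count: int

EMPTY = AvgIR(0.0, 0)

def update(ir: AvgIR, value: float) -> AvgIR:
    """Fold one new value into the IR."""
    return AvgIR(ir.total + value, ir.count + 1)

def merge(a: AvgIR, b: AvgIR) -> AvgIR:
    """Combine two IRs; commutative and associative, like SUM."""
    return AvgIR(a.total + b.total, a.count + b.count)

def finalize(ir: AvgIR) -> Optional[float]:
    """Turn the IR into the user-facing average (None when empty)."""
    return ir.total / ir.count if ir.count else None

# Ratings arriving on two different partitions.
left = update(update(EMPTY, 4.0), 5.0)
right = update(EMPTY, 3.0)
avg_rating = finalize(merge(left, right))
```

Averaging the two partition averages directly would give 3.75 here, which is why the division is deferred to the final step.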
23. Aggregations
• Constant memory / Bounded IR
• Two classes of aggregations
• Sum, Avg, Count
• Min/Max, Approx Unique, percentiles, topK
• Mutations – updates, deletes etc.
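The slide does not name what separates the two classes. One reading, suggested by the abstract's mention of “abelian groups”, is whether the operation has an inverse: a delete can be subtracted out of a sum, but a maximum cannot be "un-applied" without looking at the remaining data. The sketch below illustrates that reading only; both helper functions are hypothetical.

```python
def sum_delete(current_sum, deleted_value):
    """SUM has an inverse: a delete is handled by subtracting."""
    return current_sum - deleted_value

def max_delete(current_max, deleted_value, remaining_values):
    """MAX has no inverse: deleting the current maximum forces a rescan.

    `remaining_values` stands in for re-reading the source data, which is
    exactly the expensive step a reversible aggregation avoids.
    """
    if deleted_value < current_max:
        return current_max  # the maximum is unaffected
    return max(remaining_values) if remaining_values else None

new_sum = sum_delete(10, 4)
new_max = max_delete(9, 9, [3, 5])
```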
33. Choosing hops
• Automatically chosen
• Hop size < x% of window size
• Daily, hourly, 5minute
• x ~ 8.34%
• Caching – variety of window sizes can re-use the hop
• 90d, 30d
• Across windows & across queries
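The hop-selection rule above can be sketched as: pick the largest of the available hop sizes that is strictly smaller than x% of the window. The `choose_hop` function is an assumption about how the rule might be applied, not the actual Zipline code; the candidate hops and the 8.34% threshold come from the slide.

```python
from datetime import timedelta

# Candidate hop sizes from the slide, largest first.
HOPS = [timedelta(days=1), timedelta(hours=1), timedelta(minutes=5)]
MAX_FRACTION = 0.0834  # x ~ 8.34%

def choose_hop(window: timedelta) -> timedelta:
    """Return the largest hop strictly smaller than MAX_FRACTION of the window."""
    limit = window * MAX_FRACTION
    for hop in HOPS:
        if hop < limit:
            return hop
    raise ValueError(f"window {window} is too small for any available hop")

hop_90d = choose_hop(timedelta(days=90))
hop_30d = choose_hop(timedelta(days=30))
hop_12h = choose_hop(timedelta(hours=12))
hop_1h = choose_hop(timedelta(hours=1))
```

Under this reading, the 90d and 30d windows both resolve to the daily hop, so the same daily partial aggregates can be cached and reused for both.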