Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh

4,023 views

Published on

Zipline is Airbnb’s data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks. Zipline reduces this task from months to days. It allows users to define features in an easy-to-use configuration language, then provides access to the following features: resource efficient and point-in-time correct training set backfills and scheduled updates, feature visualizations and automatic data quality monitoring, feature availability in online scoring environment: batch and streaming with batch correction (lambda architecture), collaboration and sharing of features, and data ownership and management.

Spark powers many of Zipline’s features, especially offline tasks for efficient training set backfills and feature computation. This talk covers Ziplines architecture and the main problems that Zipline solves. Despite being widespread, there is no open source software to address these problems. As a result, we intend to open source our work.

Published in: Data & Analytics

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh

  1. 1. Nikhil Simha, Airbnb Andrew Hoh, Airbnb Zipline – Airbnb’s ML Data Management Framework #ML3SAIS
  2. 2. Feature Engineering 2#ML3SAIS Idea Discover Data Training Set Training & Evaluation Airbnb Zipline
  3. 3. Discovering data 3#ML3SAIS • Where is it? • Is it good? • Will it continue to be good? • Has it already been done before?
  4. 4. Training set 4#ML3SAIS • What processing engine to use? • Backfilling • Point-in-time correct windows
  5. 5. Production 5#ML3SAIS Training Set Generation Model Training Push model to prediction service Online Features Model monitoring Data quality monitoring
  6. 6. What is Zipline? • Part of Bighead – Airbnb’s e2e ml platform • Training Sets • Feature Store • Data quality 6#ML3SAIS
  7. 7. Concepts • Data Source • Feature-set • Training-set • Client 7#ML3SAIS
  8. 8. Feature-Set • Source queries – Hive tables – Event streams – Mutation streams • Primary Keys • Time Stamp 8#ML3SAIS
  9. 9. Feature-Set – Sources • Multiple sources – Backfills – Migrations 9#ML3SAIS
  10. 10. Example/Operations 10#ML3SAIS
  11. 11. 11#ML3SAIS The picture can't be displayed. Data Quality
  12. 12. Training set • Driver query – primary keys – timestamps – Point-in-time correct 12#ML3SAIS
  13. 13. Training set – example 13#ML3SAIS
  14. 14. Training set – example 14#ML3SAIS
  15. 15. Backfills 15#ML3SAIS Old training Set Time Features
  16. 16. Label offsets 16#ML3SAIS
  17. 17. Online scoring • Backfills • Low latency • DB Mutations – Batch correction 17#ML3SAIS
  18. 18. Lambda • Batch – Spark • Streaming – Flink • GDPR 18#ML3SAIS User Conf Batch Streaming Zipline Client User App *Daily *Continuous KV Store
  19. 19. Why Flink? 19#ML3SAIS
  20. 20. Why Flink? 20#ML3SAIS
  21. 21. • Across batch and streaming • Fixed length sliding windows • Raw Events at the tail and head Aggregations 21#ML3SAIS 7 day sliding window
  22. 22. Availability 22#ML3SAIS Processing time Eventtime
  23. 23. Client 23#ML3SAIS • Aggregation • Logic to handle availability • Java – The API is (List of Feature Names) => Map of feature values – as objects.
  24. 24. Mutations • Organizationally easy • Consistent • Not flexible • More complex – Inverse 24#ML3SAIS
  25. 25. Mutations - Aggregation 25#ML3SAIS • Monoids – Associative: (a + b) + c = a + (b + c) – Distribute aggregation – Sum, Avg, Count etc.. • Group – Invertible – Min, Max, Median, Ntile – Not possible without memory – Compromise
  26. 26. Mutations - Aggregation 26#ML3SAIS • Deletes = Inverse(before) • Updates = Inverse(before) + after
  27. 27. Overview FeatureSet Conf Feature Stream Batch Table Stats Table Data Quality UI Client Offline Training Set
  28. 28. When? • Plan to open source in Q3 2018. • Will be part of Bighead – Talk tomorrow. 28#ML3SAIS
  29. 29. Questions? 29#ML3SAIS

×