Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
What to Upload to SlideShare
Loading in …3
1 of 41

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta



Download to read offline

Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.

Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducability for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.

We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.

End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta

  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Kim Hammar, Logical Clocks AB KimHammar1 Jim Dowling, Logical Clocks AB jim_dowling End-to-End ML Pipelines with Databricks Delta and Hopsworks Feature Store #UnifiedDataAnalytics #SparkAISummit
  3. 3. Machine Learning in the Abstract 3
  4. 4. Where does the Data come from? 4
  5. 5. Where does the Data come from? 5 “Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.” [Uber on Michelangelo]
  6. 6. Data comes from the Feature Store 6
  7. 7. How do we feed the Feature Store? 7
  8. 8. Outline 8 1. Hopsworks 2. Databricks Delta 3. Hopsworks Feature Store 4. Demo 5. Summary
  9. 9. 9 Datasources Applications API Dashboards Hopsworks Apache Beam Apache Spark Pip Conda Tensorflow scikit-learn Keras J upyter Notebooks Tensorboard Apache Beam Apache Spark Apache Flink Kubernetes Batch Distributed ML &DL Model Serving Hopsworks Feature Store Kafka + Spark Streaming Model Monitoring Orchestration in Airflow Data Preparation &Ingestion Experimentation &Model Training Deploy &Productionalize Streaming Filesystem and Metadata storage HopsFS
  10. 10. 10
  11. 11. 11
  12. 12. 12
  13. 13. 13
  14. 14. 14
  15. 15. 15
  16. 16. Next-Gen Data Lakes Data Lakes are starting to resemble databases: – Apache Hudi, Delta, and Apache Iceberg add: • ACID transactional layers on top of the data lake • Indexes to speed up queries (data skipping) • Incremental Ingestion (late data, delete existing records) • Time-travel queries 16
  17. 17. Problems: No Incremental Updates, No rollback on failure, No Time-Travel, No Isolation. 17
  18. 18. Solution: Incremental ETL with ACID Transactions 18
  19. 19. Upsert & Time Travel Example 19
  20. 20. Upsert & Time Travel Example 20
  21. 21. Upsert ==Insert or Update 21
  22. 22. Version Data By Commits 22
  23. 23. Delta Lake by Databricks • Delta Lake is a Transactional Layer that sits on top of your Data Lake: – ACID Transactions with Optimistic Concurrency Control – Log-Structured Storage – Open Format (Parquet-based storage) – Time-travel 23
  24. 24. Delta Datasets 24
  25. 25. Optimistic Concurrency Control 25
  26. 26. Optimistic Concurrency Control 26
  27. 27. Mutual Exclusion for Writers 27
  28. 28. Optimistic Retry 28
  29. 29. Scalable Metadata Management 29
  30. 30. Other Frameworks: Apache Hudi, Apache Iceberg • Hudi was developed by Uber for their Hadoop Data Lake (HDFS first, then S3 support) • Iceberg was developed by Netflix with S3 as target storage layer • All three frameworks (Delta, Hudi, Iceberg) have common goals of adding ACID updates, incremental ingestion, efficient queries. 30
  31. 31. Next-Gen Data Lakes Compared 31 Delta Hudi Iceberg Incremental Ingestion Spark Spark Spark ACID updates HDFS, S3* HDFS S3, HDFS File Formats Parquet Avro, Parquet Parquet, ORC Data Skipping (File-Level Indexes) Min-Max Stats+Z-Order Clustering* File-Level Max-Min stats + Bloom Filter File-Level Max-Min Filtering Concurrency Control Optimistic Optimistic Optimistic Data Validation Expectations (coming soon) In Hopsworks N/A Merge-on-Read No Yes (coming soon) No Schema Evolution Yes Yes Yes File I/O Cache Yes* No No Cleanup Manual Automatic, Manual No Compaction Manual Automatic No *Databricks version only (not open-source)
  32. 32. 32 How can a Feature Store leverage Log-Structured Storage (e.g., Delta or Hudi or Iceberg)?
  33. 33. Hopsworks Feature Store 33 Feature Mgmt Storage Access Statistics Online Features Discovery Offline Features Data Scientist Online Apps Data Engineer MySQL Cluster (Metadata, Online Features) Apache Hive Columnar DB (Offline Features) Feature Data Ingestion Hopsworks Feature Store Training Data (S3, HDFS) Batch Apps Discover features, create training data, save models, read online/offline/on- demand features, historical feature values. Models HopsFS JDBC (SAS, R, etc) Feature CRUD Add/remove features, access control, feature data validation. Access Control Time Travel Data Validation Pandas or PySpark DataFrame External DB Feature Defn Țselect ..Ț AWS Sagemaker and Databricks Integration • Computation engine (Spark) • Incremental ACID Ingestion • Time-Travel • Data Validation • On-Demand or Cached Features • Online or Offline Features
  34. 34. Incremental Feature Engineering with Hudi 34
  35. 35. Point-in-Time Correct Feature Data 35
  36. 36. Feature Time Travel with Hudi and Hopsworks Feature Store 36
  37. 37. Demo: Hopsworks Featurestore + Databricks Platform 37
  38. 38. Summary • Delta, Hudi, Iceberg bring Reliability, Upserts & Time-Travel to Data Lakes – Functionalities that are well suited for Feature Stores • Hopsworks Feature Store builds on Hudi/Hive and is the world’s first open-source Feature Store (released 2018) • The Hopsworks Platform also supports End-to-End ML pipelines using the Feature Store and Spark/Beam/Flink, Tensorflow/PyTorch, and Airflow 38
  39. 39. Thank you! 470 Ramona St, Palo Alto Kista, Stockholm Register for a free account at Twitter @logicalclocks @hopsworks GitHub rks
  40. 40. References • Feature Store: the missing data layer in ML pipelines? • Python-First ML Pipelines with Hopsworks • Hopsworks white paper. • HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases. • Open Source: • Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, Alex Ormenisan, Rasmus Toivonen, Steffen Grohsschmiedt, and Moritz Meister 40