
End-to-End ML Pipelines with Beam, Flink, TensorFlow, and Hopsworks (Beam Summit Europe 2019)


Apache Beam is a key technology for building scalable end-to-end ML pipelines: it is the data preparation and model analysis engine for TensorFlow Extended (TFX), a framework for horizontally scalable Machine Learning (ML) pipelines based on TensorFlow. In this talk, we present TFX on Hopsworks, a fully open-source platform for running TFX pipelines on any cloud or on-premise. Hopsworks is a project-based multi-tenant platform for both data-parallel programming and horizontally scalable machine learning pipelines. Hopsworks supports Apache Flink as a runner for Beam jobs, and TFX pipelines are orchestrated with the Airflow service built into Hopsworks. We will demonstrate how to build an ML pipeline with TFX, Beam’s Python API, and the Flink runner from Jupyter notebooks, explain how security is transparently enabled with short-lived TLS certificates, and go through all the pipeline steps, from data validation and transformation to model training with TensorFlow, model analysis, and model serving and monitoring with Kubernetes.
To the best of our knowledge, Hopsworks is the first fully open-source on-premise platform that supports both TFX pipelines and Apache Beam.
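To make the pipeline shape concrete, here is a minimal sketch (not the exact code from the talk) of a TFX pipeline whose Beam-based components are executed on Flink through the portable Job Service. Module paths and component argument names shifted between TFX releases around 0.14, and the HDFS paths and job_endpoint below are placeholders, so treat this purely as an illustration.

```python
# Illustrative sketch only: component argument names vary across TFX versions,
# and the HDFS paths and Job Service endpoint below are placeholders.
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator
from tfx.orchestration import pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
from tfx.utils.dsl_utils import external_input

examples = CsvExampleGen(input=external_input("hdfs:///Projects/demo/data"))
stats = StatisticsGen(examples=examples.outputs["examples"])
schema = SchemaGen(statistics=stats.outputs["statistics"])
validate = ExampleValidator(statistics=stats.outputs["statistics"],
                            schema=schema.outputs["schema"])
# ... Transform, Trainer, Evaluator, and a Pusher for serving are wired up the same way ...

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="beam_summit_demo",
    pipeline_root="hdfs:///Projects/demo/tfx_root",
    components=[examples, stats, schema, validate],
    # Beam options for the data-parallel components; how these are passed
    # differs slightly between TFX versions.
    beam_pipeline_args=[
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",   # placeholder Flink Job Service endpoint
    ],
)

BeamDagRunner().run(tfx_pipeline)
```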



  1. (title slide)
  2. Agenda: 1. End-to-End ML Pipelines 2. What is Hopsworks 3. Beam Portable Runner with Flink in Hopsworks 4. ML Pipelines with Beam and TensorFlow Extended 5. Demo
  3. Diagram: end-to-end ML pipeline stages (Data Ingest, Data Prep, Train, Serve, Online Monitor) over distributed storage (Raw Data, Event Data, Data Lake) and a resource manager.
  4. (image slide)
  5. (image slide)
  6. (image slide)
  7. Hopsworks milestones, 2017–2019: winner of the IEEE Scale Challenge 2017 with HopsFS (1.2M ops/sec); world’s fastest Hadoop, published at USENIX FAST with Oracle and Spotify; world’s first Hadoop platform to support GPUs-as-a-Resource; world’s first distributed filesystem to store small files in metadata on NVMe disks; world’s most scalable filesystem with multi-data-center availability; world’s first Open Source Feature Store for Machine Learning; world’s first open-source platform to support TensorFlow Extended (TFX) on Beam. Quoted reactions: “this one paper could repay your investment”; “HopsFS is a huge win.”
  8. Diagram: Hopsworks projects and their resources (Project-42, Proj-X, Project-AllCompanyDB; Kafka topics, Resources, /Projs/My/Data, Models).
  9. Manage Hopsworks resources via the REST API: Projects, Datasets, Jobs, Users, FeatureStore, Kafka, and more. Documented with Swagger and hosted on SwaggerHub.
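As a rough illustration of driving these resources programmatically, the snippet below issues a plain HTTP call; the host, project id, endpoint path, and authentication scheme are placeholders, and the real paths are the ones documented in the Swagger spec on SwaggerHub.

```python
import requests

# Placeholders throughout: host, API key scheme, project id, and endpoint path
# are illustrative; see the Swagger spec on SwaggerHub for the documented API.
HOPSWORKS_API = "https://hopsworks.example.com/hopsworks-api/api"
headers = {"Authorization": "ApiKey <your-api-key>"}

# e.g. list the jobs of a (made-up) project
resp = requests.get(HOPSWORKS_API + "/project/119/jobs", headers=headers, verify=False)
resp.raise_for_status()
for job in resp.json().get("items", []):
    print(job.get("name"))
```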
  10. (section divider)
  11. Beam portability overview (https://s.apache.org/apache-beam-project-overview). Diagram: pipelines are constructed with SDKs (Beam Java, Beam Python, other languages) against the Beam model and executed by Fn runners such as Apache Flink, Apache Spark, and Cloud Dataflow. Three audiences: 1. end users, who want to write pipelines in a language that’s familiar; 2. SDK writers, who want to make Beam concepts available in new languages; 3. runner writers, who have a distributed processing environment and want to support Beam pipelines.
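For reference, a minimal Beam Python pipeline submitted to Flink through the portable Job Service looks roughly like the sketch below; the job_endpoint value is a placeholder (on Hopsworks it is returned by start_beam_job_service(), shown on later slides).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The PortableRunner hands the pipeline to a Job Service, which runs it on Flink.
# The endpoint below is a placeholder; SDK-worker environment flags are
# cluster-specific and omitted here.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "beam", "on", "flink"])
     | "Upper" >> beam.Map(str.upper)
     | "Print" >> beam.Map(print))
```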
  12. Beam on Hopsworks: develop Beam pipelines in Python from Jupyter notebooks; tooling to simplify deployment and execution; manage the lifecycle of the Job Service; SDK Workers (harness) with a conda env; scalable execution on Flink clusters.
  13. Running Beam jobs on a Flink cluster from a Hopsworks project.
  14. hops-util-py (Python) and HopsUtil (Java) simplify development by setting the security config, discovering cluster services, providing helper methods for the Hopsworks REST API, and managing ML experiments. They also manage the Beam Job Service: def start_beam_job_service(flink_session_name, artifacts_dir="Resources", job_server_path="hdfs:///user/flink/", job_server_jar="beam-runners-flink-1.8-job-server-2.13.0.jar", sdk_worker_parallelism=1). https://github.com/logicalclocks/hops-util-py/ https://github.com/logicalclocks/hops-util
  15. SDK Worker (harness): an SDK-provided program responsible for executing user code. How to manage the user’s dependencies and libraries? Docker: build an image with all your dependencies, but updating or modifying means building new containers, and it adds infrastructure components. Process: install dependencies on all servers; libraries are easy to update and modify, but the challenge is multi-tenancy and keeping servers in sync.
  16. Diagram: a conda repo feeding the Hopsworks cluster; no need to write Dockerfiles.
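Under Beam's portability framework the harness environment is chosen through pipeline options. The sketch below contrasts the Docker-based and process-based options; the worker-script path is hypothetical (Hopsworks generates a script that activates the project's conda environment for you).

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Docker-based harness: user code runs inside an SDK container image.
docker_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",   # placeholder
    "--environment_type=DOCKER",
])

# Process-based harness: user code runs in a process started by a worker script.
# The script path is hypothetical; on Hopsworks it activates the project's conda
# environment and sets the Hopsworks environment variables.
process_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",   # placeholder
    "--environment_type=PROCESS",
    '--environment_config={"command": "/srv/hops/beam/sdk_worker.sh"}',
])
```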
  17. Manage notebook settings from the dashboard.
  18. Execute a Beam Python pipeline with the Python kernel either in a Docker container managed by Kubernetes or as a local Python process, or in a PySpark executor in the cluster.
  19. (image slide)
  20. Architecture diagrams on the following slides are based on https://www.slideshare.net/ThomasWeise/python-streaming-pipelines-on-flink-beam-meetup-at-lyft-2019
  21. Diagram: Hopsworks with HopsFS and a Flink session cluster on YARN; SDK workers run locally, on YARN, or on K8s.
  22. Diagram: hops-util-py, compiled with HopsFS dependencies, sits between the notebook and the cluster.
  23. Diagram: hops-util-py localizes the Job Service jar file from HopsFS, provides its arguments (ports, artifacts_dir, etc.), starts the Job Service, and returns host,port (host, port = start_beam_job_service(); pipelines are then submitted to host + ":" + port). The Job Service automatically shuts down when the Python pipeline shuts down.
  24. Diagram: the Python conda env and Hopsworks environment variables are set in the SDK Worker script, using the project’s conda repo.
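Put together, the flow above can be driven from a notebook roughly as follows; the import path of start_beam_job_service() and the Flink session name are assumptions (the documented signature is on slide 14).

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# start_beam_job_service() is provided by hops-util-py; the module path is assumed here.
from hops.beam import start_beam_job_service

# Localizes the Job Service jar from HopsFS, starts the Job Service against the
# Flink session cluster, and returns the endpoint to submit pipelines to; the
# Job Service shuts down when the Python pipeline does.
host, port = start_beam_job_service("my_flink_session")   # hypothetical session name

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=" + host + ":" + str(port),
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Square" >> beam.Map(lambda x: x * x)
     | "Print" >> beam.Map(print))
```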
  25. Logs: Flink JobManager and TaskManager; Beam Job Service (local mode: logs in the project’s Jupyter staging dir; cluster mode: logs in the PySpark container where the process is running); SDK Worker (logs in the Flink TaskManager container). Logs are collected and visualized with the ELK stack and are accessible only by project members.
  26. (image slide)
  27. (image slide)
  28. (section divider)
  29. Hidden Technical Debt in Machine Learning Systems: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  30. TensorFlow Extended (TFX): https://www.tensorflow.org/tfx
  31. (image slide)
  32. (image slide)
  33. Diagram: a Driver with Executors 1..N, backed by HopsFS (HDFS), with TensorBoard and Model Serving.
  34. Repeatable experiments, managed experiment metadata, and integration with TensorBoard.
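A hedged sketch of what a tracked experiment can look like with hops-util-py; experiment.launch() and tensorboard.logdir() are part of hops-util-py, but the exact signatures and the training code here are illustrative.

```python
from hops import experiment, tensorboard

def train():
    import tensorflow as tf
    # ... build and train a TensorFlow 1.14 model here (omitted) ...
    # Write summaries under the experiment's TensorBoard log dir so the run
    # shows up in the integrated TensorBoard.
    writer = tf.summary.FileWriter(tensorboard.logdir())
    writer.close()
    return 0.9   # e.g. a metric recorded with the experiment's metadata

# Launches the wrapped function as a tracked, repeatable experiment
# (the name argument is illustrative).
experiment.launch(train, name="beam_summit_demo")
```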
  35. (image slide)
  36. (section divider)
  37. Airflow is available as a multi-tenant service in Hopsworks; develop pipelines with Hopsworks operators and sensors.
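A sketch of an Airflow DAG chaining Hopsworks jobs; the operator and sensor class names, import paths, and constructor arguments below are assumptions for illustration rather than the exact plugin API.

```python
from datetime import datetime
from airflow import DAG

# Assumed class names and import paths, for illustration only; use the
# operators and sensors shipped with your Hopsworks installation.
from hopsworks_plugin.operators.hopsworks_operator import HopsworksLaunchOperator
from hopsworks_plugin.sensors.hopsworks_sensor import HopsworksJobSuccessSensor

with DAG("tfx_demo", start_date=datetime(2019, 6, 1), schedule_interval="@daily") as dag:
    validate = HopsworksLaunchOperator(task_id="data_validation",
                                       project_name="demo", job_name="validate")
    await_validation = HopsworksJobSuccessSensor(task_id="await_validation",
                                                 project_name="demo", job_name="validate")
    train = HopsworksLaunchOperator(task_id="train_model",
                                    project_name="demo", job_name="train")

    validate >> await_validation >> train
```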
  38. (image slide)
  39. (section divider)
  40. Diagram: end-to-end TFX pipeline with Raw Data, Event Data, Ingest, Data Prep, Feature Store / TFX Transform, Experiment / Train, Deploy, Serving, Monitor, Model Analysis, logs, an external Metadata Store, and the FeatureStore.
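Where the training data comes from the Feature Store, hops-util-py exposes a featurestore module; the sketch below is illustrative, with made-up feature names and no guarantee about the exact signature.

```python
# Illustrative sketch: reading training features from the Hopsworks Feature Store
# with hops-util-py. Feature names are made up, and the exact signature and
# return type (a dataframe, Spark by default) may differ between versions.
from hops import featurestore

train_df = featurestore.get_features([
    "trip_distance",
    "passenger_count",
    "total_fare",
])
train_df.show(5)
```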
  41. Versions: Beam 2.13.0, Flink 1.8.0, TensorFlow 1.14.0, TFX 0.14.0dev, TensorFlow Model Analysis 0.13.2.
  42. (section divider)
  43. Summary: Hopsworks v1.0 is the first on-prem, open-source, horizontally scalable platform to support the Beam Portable Runner with the Flink runner; develop and manage the lifecycle of horizontally scalable end-to-end ML pipelines with Beam and TFX. Future work: add support for the Spark runner; export metrics to InfluxDB and monitor with Grafana.
  44. https://github.com/logicalclocks
