
Hopsworks at Google AI Huddle, Sunnyvale

Hopsworks is a platform for designing and operating End to End Machine Learning using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open-source.


  1. Hopsworks on GCP. Jim Dowling, CEO, Logical Clocks. 8 August 2019, Google Sunnyvale
  2. Hopsworks Technical Milestones (2017–2019). “If you’re working with big data and Hadoop, this one paper could repay your investment in the Morning Paper many times over.... HopsFS is a huge win.” (Adrian Colyer, The Morning Paper). World’s first Hadoop platform to support GPUs-as-a-Resource; world’s fastest HDFS, published at USENIX FAST with Oracle and Spotify; winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec); world’s first distributed filesystem to store small files in metadata on NVMe disks; world’s most scalable POSIX-like hierarchical filesystem, with multi-data-center availability and 1.6m ops/sec on GCP; world’s first open-source Feature Store for machine learning; first non-Google ML platform with TensorFlow Extended (TFX) support through Beam/Flink; world’s first unified hyperparameter and ablation study parallel programming framework.
  3. Evolution of Distributed Filesystems. Past: POSIX filesystems (NFS, HDFS), single DC, strongly consistent metadata. Present: object stores (S3, GCS), multi-DC, eventually consistent metadata; and POSIX-like filesystems (HopsFS), multi-DC, strongly consistent metadata.
  4. Why HopsFS? ● Distributed FS ○ Needed for Parallel ML Experiments / Distributed Training / Feature Store ● Provenance/free-text search ○ Change Data Capture API ● Performance ○ > 1.6m ops/sec over 3 AZs on GCP using Spotify’s Hadoop workload ○ NVMe for small files stored in metadata ● HDFS API ○ TensorFlow/Keras/PySpark/Beam/Flink/PyTorch (Petastorm)
  5. Hopsworks – a platform for Data-Intensive AI built on HopsFS
  6. Hopsworks hides the complexity of Deep Learning: data collection, data validation, feature engineering, hyperparameter tuning, distributed training, model serving, A/B testing, monitoring, pipeline management, and hardware management sit behind the Hopsworks REST API and the Hopsworks Feature Store. [Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems”]
  7. What is Hopsworks? Development & Operations: Notebooks for Development (first-class Python support); Version Everything (code, infrastructure, data); Model Serving on Kubernetes (TF Serving, MLeap, SkLearn); End-to-End ML Pipelines (orchestrated by Airflow). Governance & Compliance: Secure Multi-Tenancy (project-based restricted access); Encryption At-Rest, In-Motion (TLS/SSL everywhere); AI-Asset Governance (models, experiments, data, GPUs); Data/Model/Feature Lineage (discover/track dependencies). Elasticity & Performance: Feature Store (data warehouse for ML); Distributed Deep Learning (faster with more GPUs); HopsFS (NVMe speed with Big Data); Horizontally Scalable (ingestion, dataprep, training, serving).
  8. Which services require Distributed Metadata (HopsFS)? (Same service map as the previous slide: Development & Operations, Governance & Compliance, Elasticity & Performance.)
  9. End-to-End ML Pipelines
  10. End-to-End ML Pipelines in Hopsworks
  11. ML Pipelines with a Feature Store
  12. ML Pipelines with a Feature Store
  13. Feature Store. These should be based on the same feature engineering code. [Diagram: Features + Labels → Training → Model; Features → Inference (Model) → Labels]
  14. Feature Store. It’s not always trivial to ensure features are engineered consistently between training and inference: the ML developer owns the training path, the application developer owns the inference path. [Diagram: Features + Labels → Training → Model; Features → Inference (Model) → Labels]
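The consistency problem on this slide can be sketched in a few lines of plain Python (illustrative names, not the Hopsworks API): both the training path and the inference path call the same feature-engineering function, so the two code paths cannot drift apart.

```python
import math

def engineer_features(passenger):
    """Shared feature logic used by BOTH training and inference (hypothetical features)."""
    return {
        "pclass": passenger["pclass"],
        "is_female": 1 if passenger["sex"] == "female" else 0,
        "log_balance": math.log(passenger["balance"]) if passenger["balance"] > 0 else 0.0,
    }

def build_training_rows(raw_rows, labels):
    # Training path: engineered features plus the label.
    return [{**engineer_features(r), "survive": y} for r, y in zip(raw_rows, labels)]

def build_inference_row(raw_row):
    # Inference path: identical feature code, no label.
    return engineer_features(raw_row)
```

A Feature Store institutionalizes this idea: the engineered features are computed once, stored, and served to both consumers, instead of relying on two teams keeping duplicate code in sync.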
  15. Feature Store. [Diagram: features are Put into the Feature Store; training and inference both Get features from it.]
  16. Feature Store. Online or Offline Features? On-Demand or Cached Features? [Diagram: batch apps and online apps Get features from the Feature Store for training and inference.]
  17. Hopsworks Feature Store. Feature management: feature CRUD (add/remove features), access control, feature data validation, statistics, discovery, time travel. Storage: MySQL Cluster for metadata and online features; Apache Hive columnar DB for offline features; training data on S3/HDFS; models on HopsFS. Access: Data Engineers ingest feature data (Pandas or PySpark DataFrames, or external DBs via a feature definition, e.g. select ..); Data Scientists and batch apps discover features, create training data, save models, and read online/offline/on-demand features and historical feature values; online apps read online features; JDBC access for SAS, R, etc.
  18. Feature Store Abstractions. Feature Groups (e.g. Titanic Passenger List: Name, Sex, PClass, Survive; Bank Account: Name, Balance), Features (e.g. name, Sex, PClass, Balance, survive), and Train/Test Datasets (.tfrecords, .csv, .numpy, .hdf5, .petastorm, etc). Features, FeatureGroups, and Train/Test Datasets are versioned.
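A toy sketch of the versioning abstraction on this slide (assumed names, not the Hopsworks implementation): feature groups are keyed by (name, version), so a new version can add features without breaking consumers pinned to an older one.

```python
class FeatureRegistry:
    """Minimal stand-in for a feature store's metadata layer."""

    def __init__(self):
        self._groups = {}  # (name, version) -> list of feature names

    def create_featuregroup(self, name, features, version=1):
        self._groups[(name, version)] = list(features)

    def get_features(self, name, version=1):
        return self._groups[(name, version)]

reg = FeatureRegistry()
reg.create_featuregroup("titanic_passengers", ["name", "sex", "pclass"])
# Version 2 adds a feature; version 1 consumers are unaffected.
reg.create_featuregroup("titanic_passengers", ["name", "sex", "pclass", "survive"], version=2)
```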
  19. Register a Feature Group with the Feature Store: titanic_df = # Spark or Pandas Dataframe # Do feature engineering on titanic_df # Register Dataframe as FeatureGroup featurestore.create_featuregroup(titanic_df, "titanic_data_passengers")
  20. Create Training Datasets using the Feature Store: sample_data = featurestore.get_features(["name", "Pclass", "Sex", "balance"]) featurestore.create_training_dataset(sample_data, "titanic_training_dataset", data_format="tfrecords", training_dataset_version=1) # Use the training dataset dataset_dir = featurestore.get_training_dataset_path("titanic_training_dataset") s = featurestore.get_training_dataset_tf_record_schema("titanic_training_dataset")
  21. End-to-End ML Pipelines in Hopsworks
  22. TensorFlow Extended (TFX) Components in Hopsworks. https://www.tensorflow.org/tfx
  23. TFX on a Flink Cluster with the Portable Runner. TensorFlow Extended Components in Hopsworks.
  24. Apache Airflow to Orchestrate ML Pipelines
  25. Apache Airflow to Orchestrate ML Pipelines. Airflow calls the Hopsworks Jobs REST API; Hopsworks Jobs: PySpark, Spark, Flink, Beam/Flink.
  26. What about TFX metadata?
  27. Explicit vs Implicit Provenance ● Explicit: TFX, MLFlow ○ Wrap existing code in components that execute a stage in the pipeline ○ Interceptors in components inject metadata into a metadata store as data flows through the pipeline ○ Store metadata about artefacts and executions ● Implicit: Hopsworks with ePipe* ○ DataPrep and Train stages read/write /Training_Datasets, /Experiments, and /Models on HopsFS; ePipe (a Change Data Capture API) replicates metadata to Elasticsearch. *ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019.
  28. Provenance in Hopsworks ● Implicit Provenance ○ Applications read/write files to HopsFS – infer artefacts from pathname conventions and xattrs ● Explicit Provenance in the Hops API ○ Executions: hops.experiment.grid_search(…), hops.experiment.collective_allreduce(…) ○ FeatureStore: hops.featurestore.create_training_dataset(….) ○ Saving and deploying models: hops.serving.export(export_path, "model_name", 1) hops.serving.deploy("…/model.pb", destination)
  29. /Experiments ● Executions add entries in /Experiments: experiment.launch(…) experiment.grid_search(…) experiment.collective_allreduce(…) experiment.lagom(…) ● /Experiments contains: ○ logs (application, tensorboard) ○ the executed notebook file ○ the conda environment used ○ checkpoints. Layout: /Projects/MyProj └ Experiments └ <app_id> └ <type> ├─ checkpoints ├─ tensorboard_logs ├─ logfile └─ versioned_resources ├─ notebook.ipynb └─ conda_env.yml
  30. Experiments
  31. Individual Experiment Overview
  32. /Models ● Named/versioned model management for TensorFlow/Keras and Scikit-Learn ● A Models dataset can be securely shared with other projects or the whole cluster ● The provenance API returns the conda.yml and the execution used to train a given model. Layout: /Projects/MyProj └ Models └ <name> └ <version> ├─ saved_model.pb └─ variables/ ...
  33. Model Serving ● Model Serving on Google Cloud AI Platform ○ Export models to Cloud Storage buckets ● On-Premise ○ Serve models on Kubernetes ○ TensorFlow Serving, Scikit-Learn
  34. Jupyter Notebooks as Jobs in Airflow Pipelines
  35. Principles of Development with Notebooks ● No throwaway code ● Code can be run either in Notebooks or as Jobs (in Pipelines) ● Notebooks/jobs should be parameterizable ● No external configuration for program logic ○ HPARAMs are part of the program ● The core training loop should be the same for all stages of development ○ data exploration, hyperparameter tuning, distributed training
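A minimal sketch of the "HPARAMs are part of the program" principle (illustrative names and a stand-in training loop, not Hopsworks code): hyperparameters are ordinary function arguments with defaults, so the same function runs unchanged in a notebook, as a job, or inside a pipeline, with no external YML configuration.

```python
def train(lr=0.01, dropout=0.4, epochs=3):
    """Hyperparameters live in the signature; callers override as needed."""
    # Stand-in for a real training loop: loss shrinks geometrically with lr.
    loss = 1.0
    for _ in range(epochs):
        loss *= (1.0 - lr)
    return {"lr": lr, "dropout": dropout, "loss": loss}

# Data exploration in a notebook: just call it with the defaults.
default_run = train()

# The same function parameterized by a pipeline or tuning job, no config file.
tuned_run = train(lr=0.1, epochs=10)
```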
  36. Problem in the ML Pipeline Development Process? Explore Data/Train Model in a Notebook, then port/rewrite the code as Python + YML in Git for Hyperparameter Optimization and Distributed Training; iterating back is hard/impossible/a bad idea.
  37. Notebooks offer more than Prototyping ● Familiar web-based development environment ● Interactive development/debugging ● Reporting platform for ML applications (Papermill, by Netflix) ● Parameterizable as part of Pipelines (Papermill, by Netflix) Disclaimer: Notebooks are not for everyone
  38. ML Pipelines of Jupyter Notebooks with Airflow. Dataprep Pipeline: Feature Engineering; Select Features, File Format (Feature Store). Training and Deployment Pipeline: Experiment, Train Model; Validate & Deploy Model.
  39. PySpark Notebooks as Jobs in ML Pipelines
  40. Running TensorFlow/Keras/PyTorch Apps in PySpark. Warning: micro-exposure to PySpark may cure you of distributed programming phobia
  41. End-to-End ML Pipelines in Hopsworks
  42. GPU(s) in PySpark Executors, Driver coordinates. PySpark makes it easier to write TensorFlow/Keras/PyTorch code that can either run on a single GPU or scale to run on many GPUs for Parallel Experiments or Distributed Training.
  43. Need a Distributed Filesystem for Coordination. The Driver and Executors share HopsFS for: • Training/Test Datasets • Model checkpoints, Trained Models • Experiment run data • Provenance data • Application logs. HopsFS also feeds Model Serving and TensorBoard.
  44. PySpark Hello World
  45. PySpark Hello World: the Driver calls experiment.launch(..) and 1 Executor runs print("Hello from GPU")
  46. Leave the code unchanged, but configure 4 Executors: each of the 4 Executors runs print("Hello from GPU")
  47. Driver with 4 Executors
  48. Same/Replica Conda Environment on all Executors
  49. A Conda Environment per Project in Hopsworks
  50. Use Pip or Conda to install Python libraries
  51. TensorFlow Distributed Training with PySpark: def train(): # Separate shard of dataset per worker # create Estimator w/ DistributionStrategy # as CollectiveAllReduce # train model, evaluate return loss # Driver code below here # builds TF_CONFIG and shares it with workers from hops import experiment experiment.collective_allreduce(train) More details: https://github.com/logicalclocks/hops-examples
  52. Undirected Hyperparameter Search with PySpark: def train(dropout): # Same dataset for all workers # create model and optimizer # add this worker’s value of dropout # train model and evaluate return loss # Driver code below here from hops import experiment args={"dropout": [0.1, 0.4, 0.8]} experiment.grid_search(train, args) More details: https://github.com/logicalclocks/hops-examples
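Conceptually, what experiment.grid_search does is evaluate the training function once per hyperparameter combination and keep the best trial. Here is a minimal single-process sketch of that idea (not the hops API; in Hopsworks each trial would run on its own PySpark executor rather than serially):

```python
import itertools

def grid_search(train_fn, args):
    """Run train_fn for every combination in args; return (best_loss, best_params)."""
    names = list(args)
    best = None
    for combo in itertools.product(*(args[n] for n in names)):
        params = dict(zip(names, combo))
        loss = train_fn(**params)          # one "trial"
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

# Toy training function whose loss is minimized at dropout=0.4.
best_loss, best_params = grid_search(lambda dropout: (dropout - 0.4) ** 2,
                                     {"dropout": [0.1, 0.4, 0.8]})
```

With synchronous executors, every trial in a generation must finish before the next starts, which is the "wasted compute" problem motivating Maggy's asynchronous trials two slides below.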
  53. Directed Hyperparameter Search with PySpark: def train(dropout): # Same dataset for all workers # create model and optimizer optimizer.apply(dropout) # train model and evaluate return loss from hops import experiment args={"dropout": "0.1-0.8"} experiment.diff_ev(train, args) More details: https://github.com/logicalclocks/hops-examples
  54. Wasted Compute!
  55. Parallel ML Experiments with PySpark and Maggy
  56. Iterative Model Development: Problem Definition → Data Preparation → Model Selection → Hyperparameters → Model Training (Dataset → Machine Learning Model → Optimizer → Evaluate); repeat if needed.
  57. Maggy: Unified Hyperparameter Optimization & Ablation Programming. A Hyperparameter Optimizer supplies new hyperparameter values, and an Ablation Study Controller supplies new dataset/model-architecture variants, to the machine learning system for evaluation. Trials are synchronous or asynchronous; search is directed or undirected; search/optimizers are user-definable.
  58. Maggy – Async, Parallel ML experiments using PySpark ● Experiment-driven, interactive development of ML applications ● Parallel Experimentation ○ Hyperparameter Optimization ○ Ablation Studies ○ Leave-one-Feature-out ● Interactive Debugging ○ Experiment/executor logs shown both in Jupyter notebooks and in log files
  59. Feature Ablation ● Uses the Feature Store to access the dataset metadata ● Generates Python callables that once called will create a dataset ○ Removes one-feature-at-a-time
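The leave-one-feature-out idea can be sketched in a few lines (illustrative names; Maggy itself generates dataset-creating callables from Feature Store metadata): produce one dataset variant per removed feature, then train on each variant and compare against the full-feature baseline.

```python
def loo_feature_datasets(features):
    """Yield (ablated_feature, remaining_features), one pair per feature."""
    for f in features:
        yield f, [x for x in features if x != f]

# One variant per feature of a toy Titanic-style feature list.
variants = dict(loo_feature_datasets(["name", "sex", "pclass"]))
```

Each variant's drop in model quality relative to the baseline estimates how much that feature contributed.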
  60. Layer Ablation ● Uses a base model function ● Generates Python callables that once called will create a modified model ○ Uses the model configuration to find and remove layer(s) ○ Removes one-layer-at-a-time (or one-layer-group-at-a-time)
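A companion sketch for layer ablation (assumed names, not Maggy's API): from a base model configuration, here reduced to an ordered list of layer names, generate one variant per removed layer, plus one variant per named layer group.

```python
def layer_ablation_configs(base_layers, groups=None):
    """Yield (removed, remaining_layers): one variant per layer,
    plus one variant per named layer group if given."""
    for layer in base_layers:
        yield layer, [l for l in base_layers if l != layer]
    for name, members in (groups or {}).items():
        yield name, [l for l in base_layers if l not in members]

base = ["conv1", "conv2", "dense1", "dense2"]
configs = dict(layer_ablation_configs(base, groups={"conv_block": ["conv1", "conv2"]}))
```

In a real study, each remaining-layers list would be passed to the base model function to build and train the ablated model.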
  61. User API: Initialize the Study and Add Features
  62. User API: Base Model and Ablation Experiment Setup (driver code to set up the ablation experiment)
  63. User API: Wrap the Training Function
  64. Interactive Debugging: Print Executor Logs in Jupyter Notebooks
  65. Printing Worker Logs in Jupyter Notebooks
  66. Hopsworks on GCP ● Hopsworks is available as an image for GCP ○ Blog post coming soon with details ● Next Steps ○ Cloud-native features ■ Elastic add/remove resources (compute/GPU) ■ API integration with Google model serving
  67. That was Hopsworks. Development & Operations: Development Environment (first-class Python support); Version Everything (code, infrastructure, data); Model Serving on Kubernetes (TF Serving, SkLearn); End-to-End ML Pipelines (orchestrated by Airflow). Security & Governance: Secure Multi-Tenancy (project-based restricted access); Encryption At-Rest, In-Motion (TLS/SSL everywhere); AI-Asset Governance (models, experiments, data, GPUs); Data/Model/Feature Lineage (discover/track dependencies). Efficiency & Performance: Feature Store (data warehouse for ML); Distributed Deep Learning (faster with more GPUs); HopsFS (NVMe speed with Big Data); Horizontally Scalable (ingestion, dataprep, training, serving).
  68. Hopsworks 1.0 coming soon ● Beam 2.14.0 ● Flink 1.8.1 ● Spark 2.4.3 ● TensorFlow 1.14.0 ● TFX 0.13 ● TensorFlow Model Analysis 0.13.2 ● PyTorch 1.1 ● ROCm 2.6, CUDA 10.X
  69. Acknowledgements and References. Slides and diagrams from colleagues: ● Maggy: Moritz Meister and Sina Sheikholeslami ● Feature Store: Kim Hammar ● Beam/Flink on Hopsworks: Theofilos Kakantousis. References: ● HopsFS: Scaling hierarchical file system metadata …, USENIX FAST 2017. ● Size matters: Improving the performance of small files …, ACM Middleware 2018. ● ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid 2019. ● Hopsworks Demo, SysML 2019.
  70. Thank you! 470 Ramona St, Palo Alto. https://www.logicalclocks.com Register for a free account at www.hops.site Twitter: @logicalclocks @hopsworks GitHub: https://github.com/logicalclocks/hopsworks https://github.com/hopshadoop/hops
