The Feature Store in Hopsworks

Bay Area AI Meetup at Mesosphere on the Hopsworks Feature Store

Published in: Technology

  1. Introducing the Feature Store in Hopsworks. Bay Area AI Meetup @ Mesosphere, March 5th, 2019. jim_dowling, CEO @ Logical Clocks, Assoc Prof @ KTH
  2. Today’s Agenda: (1) What is a Feature Store and why do you need one? (2) The Hopsworks Feature Store (3) Demo
  3. Become a Data Scientist! “Eureka! This will give a 12% increase in the efficiency of this wind farm!”
  4. Data Scientists are not Data Engineers. [Diagram: data sources HDFS, GCS, Storage, CosmosDB] “How do I find features in this sea of data sources?” “This tastes like dairy in my Latte!”
  5. What is a Feature? A measurable property of a phenomenon under observation:
     • A raw word, a pixel, a sensor value
     • A column in a datastore
     • An aggregate (mean, max, sum, min)
     • A derived representation (embedding or cluster)
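As a rough illustration of the feature kinds listed on this slide, the sketch below derives a raw-value feature, aggregate features, and a simple derived representation from a handful of sensor readings. The readings, the variable names, and the bucketing thresholds are made-up assumptions for illustration; they are not from the talk.

```python
from statistics import mean

# Hypothetical raw sensor values (e.g. wind speed in m/s); made up for illustration.
readings = [4.0, 6.0, 5.0, 9.0]

# A raw value is itself a feature.
latest_reading = readings[-1]

# Aggregates (mean, max, sum, min) are features computed over many raw values.
aggregates = {
    "mean": mean(readings),
    "max": max(readings),
    "sum": sum(readings),
    "min": min(readings),
}


def bucket(value, low=5.0, high=8.0):
    """Derived representation: map a raw value to a coarse category.

    The thresholds are arbitrary assumptions chosen for this example.
    """
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"


derived = [bucket(r) for r in readings]
```

A real derived representation would more likely be an embedding or a learned cluster id; the fixed buckets here only stand in for that idea.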
  6. A More Complex Feature Pipeline. ©2018 Logical Clocks AB. All Rights Reserved
  7. Data Science with the Feature Store. [Diagram: HDFS, GCS, Storage, CosmosDB; Feature Pipelines (Select, Transform, Aggregate, ..); Feature Warehouse Store] “Now, I can change the world, one click-through at a time.”
  8. Features need to be first-class entities
     • Features should be discoverable and reused.
     • Features should be access controlled, versioned, and governed.
       - Enable reproducibility.
     • Ability to pre-compute and automatically backfill features.
       - Aggregates, embeddings: avoid expensive re-computation.
       - On-demand computation of features should also be possible.
     • The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [uber]
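To make "pre-compute and automatically backfill" concrete, here is a minimal sketch, assuming a feature defined as a daily mean over raw events. The event data, the function names, and the plain dict used as the store are hypothetical; Hopsworks' actual backfill mechanism is not described on the slide.

```python
from statistics import mean

# Hypothetical raw events as (day, value) pairs; made-up data for illustration.
events = [
    ("2019-03-01", 10.0),
    ("2019-03-01", 20.0),
    ("2019-03-02", 30.0),
    ("2019-03-03", 50.0),
]


def compute_daily_mean(day, events):
    """Feature definition: mean of all event values recorded on one day."""
    return mean(v for d, v in events if d == day)


def backfill(materialized, events):
    """Compute the feature for every day that has raw data but no stored value."""
    for day in sorted({d for d, _ in events}):
        if day not in materialized:
            materialized[day] = compute_daily_mean(day, events)
    return materialized


# Only the latest day has been pre-computed so far; backfill the history.
store = {"2019-03-03": 50.0}
backfill(store, events)
```

Days that already have a stored value are skipped, which is what avoids the expensive re-computation the slide mentions.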
  9. Hopsworks’ Feature Store
     - Reusability of features between models and teams
     - Automatic backfilling of features
     - Automatic feature documentation and analysis
     - Feature versioning
     - Standardized access of features between training and serving
     - Feature discovery
     - Access control for Feature Stores
  10. There are other advantages to the Feature Store …
  11. Prevent Duplicated Feature Engineering. [Diagram: Marketing, Research, and Analytics each compute their own Bert Features: DUPLICATED]
  12. Prevent Inconsistent Features: Training/Serving. Feature implementations may not be consistent between training and serving, which leads to correctness problems!
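One common way to avoid this inconsistency is to keep a single feature definition and call it from both the training pipeline and the serving path. The sketch below assumes that approach; the function and field names are hypothetical, and only the feature name `average_player_age` is taken from the deck's later example.

```python
def average_player_age(ages):
    """Single shared definition of the feature, used by every consumer."""
    return sum(ages) / len(ages)


def build_training_row(team):
    # Training pipeline: computes the feature from historical data.
    return {"average_player_age": average_player_age(team["ages"])}


def build_serving_row(request):
    # Serving path: reuses the same function instead of re-implementing it.
    return {"average_player_age": average_player_age(request["ages"])}


team = {"ages": [20, 25, 30]}
training_row = build_training_row(team)
serving_row = build_serving_row(team)
```

Because both paths call the same function, the same input cannot produce different feature values at training and serving time.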
  13. Known Feature Stores in Production
     • Logical Clocks – Hopsworks (open source)
     • Uber Michelangelo
     • Airbnb – Bighead/Zipline
     • Comcast
     • Twitter
     • GO-JEK Feast (open source on GCE)
  14. The API Between Data Science and Data Engineering. [Diagram: Data Engineer, Data Scientist]
  15. A Feature Store for Hopsworks
  16. Short History of the Hops Project (timeline: 2017, 2018, 2019)
     • Publish world’s fastest HDFS (HopsFS) at USENIX FAST with Spotify
     • Winner of IEEE Scale Challenge 2017 for HopsFS – 1.2m ops/sec
     • World’s First Distributed Filesystem to store small files in metadata on NVMe disks
     • World’s first open-source Feature Store for Machine Learning
     • World’s first Hadoop platform to support GPUs-as-a-Resource
     • “If you’re working with big data and Hadoop, this one paper could repay your investment in the Morning Paper many times over.... HopsFS is a huge win.” Adrian Colyer, The Morning Paper
  17. Hopsworks – Batch, Streaming, Deep Learning. [Diagram: Data Sources, HopsFS, Kafka, Airflow, Spark / Flink, Spark, Feature Store, Hive, Deep Learning, BI Tools & Reporting, Notebooks, Serving w/ Kubernetes, Elastic. Legend: External Service, Hopsworks Service. Hopsworks: On-Premise, AWS, Azure, GCE]
  18. Hopsworks – Batch, Streaming, Deep Learning. [Same diagram, with the areas labelled BATCH ANALYTICS, STREAMING, and ML & DEEP LEARNING]
  19. Hopsworks: Multi-Tenancy with Projects. [Diagram: Proj-42, Proj-X, Proj-AllCompanyDB; Shared Topic, FeatureStore, /Projs/My/Data, Models] TLS Certificates: • User certs • Application certs • Service certs. Reference: Ismail et al, Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata, ICDCS 2017
  20. Distributed Deep Learning in Hopsworks. [Diagram: Driver, Executor 1, Executor N, each with a conda_env; HopsFS (HDFS); TensorBoard, Models, Experiments, Training Data, Logs]
  21. Replicated Conda Environments
     • Every project can create its own conda environment, replicated at all hosts in the cluster
       - Base environments for Python2 and Python3 mostly adequate
     • Hopsworks ensures consistent conda command log replication to all hosts in the cluster using a local agent
     [Diagram: Hopsworks, conda commands, Kagent and envs on Host A, Kagent and envs on Host B]
  22. ML Infrastructure in Hopsworks. [Diagram adapted from “technical debt of machine learning”: MODEL TRAINING, Feature Store, HopsML API & Airflow]
  23. Feature Store Concepts
  24. Building Blocks of a Feature Store
     • The Feature Store API: For writing/reading to/from the feature store
     • The Feature Registry: A user interface to share and discover features
     • The Metadata Layer: For storing feature metadata (versioning, feature analysis, documentation, jobs)
     • The Feature Engineering Jobs: For computing features
     • The Storage Layer: For storing feature data in the feature store
     [Diagram: Feature Storage, Feature Metadata, Jobs, Feature Registry, API]
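The way these building blocks fit together can be sketched with a toy in-memory class. This is an illustration only: `ToyFeatureStore`, its methods, and `squares_job` are hypothetical names invented here, and nothing below reflects how Hopsworks implements its Feature Store.

```python
class ToyFeatureStore:
    """In-memory stand-in for the building blocks named on the slide."""

    def __init__(self):
        self.storage = {}   # Storage layer: (name, version) -> feature rows
        self.metadata = {}  # Metadata layer: (name, version) -> description, job

    def insert(self, name, rows, description, job, version=1):
        """API, write path: store feature data together with its metadata."""
        key = (name, version)
        self.storage[key] = list(rows)
        self.metadata[key] = {"description": description, "job": job}

    def latest_version(self, name):
        """Return the highest stored version of a feature."""
        versions = [v for (n, v) in self.storage if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions)

    def get(self, name, version=None):
        """API, read path: fetch a specific version, or the latest one."""
        if version is None:
            version = self.latest_version(name)
        return self.storage[(name, version)]

    def search(self, keyword):
        """Registry: discover feature names by keyword in name or description."""
        return sorted({
            n for (n, _), meta in self.metadata.items()
            if keyword in n or keyword in meta["description"]
        })


def squares_job(raw):
    """Feature engineering job: compute a feature from raw values."""
    return [x ** 2 for x in raw]


fs = ToyFeatureStore()
fs.insert("squares", squares_job([1, 2, 3]), "squared raw values", "squares_job")
fs.insert("squares", squares_job([1, 2, 3, 4]), "squared raw values",
          "squares_job", version=2)
```

Keying both storage and metadata on `(name, version)` keeps every version readable, which is what makes a training run reproducible after the feature has been recomputed.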
  25. Writing to the Feature Store (Data Engineer):
        from hops import featurestore
        raw_data = spark.read.parquet(filename)
        # Square a column (** is exponentiation in Python; ^ is bitwise XOR).
        # The column name "x" is an assumption; the slide does not name one.
        polynomial_features = raw_data.select((raw_data["x"] ** 2).alias("x_squared"))
        featurestore.insert_into_featuregroup(polynomial_features, "polynomial_featuregroup")
      Reading from the Feature Store (Data Scientist):
        from hops import featurestore
        # Fetch features by name, then materialize them as a training dataset
        df = featurestore.get_features(["average_attendance", "average_player_age"])
        featurestore.create_training_dataset(df, "players_td")
      Scala API also available. Formats: tfrecords, numpy, petastorm, hdf5, csv
  26. Feature Storage. [Diagram: Parquet, HDFS, GCS, Storage]
  27. Feature Metadata. [Diagram: Parquet, HDFS, GCS, Storage]
  28. HopsML Feature Store Pipelines
  29. [Diagram labels: Raw Data, Event Data, Monitor, HopsFS, Feature Store, Serving, Store, Pre-Process, Ingest, Deploy, Experiment/Train, Airflow, logs]
  30. Model Serving and Monitoring. [Diagram: Inference Request and Response via Hopsworks; 1. Access Control; 2. Log Prediction/Result; Model Serving Images, Model Server, Kubernetes, Data Lake, Monitor] Link Predictions with Outcomes to measure Model Performance
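Linking predictions with outcomes amounts to joining the logged predictions with the outcomes that arrive later, then computing a metric over the matched pairs. The sketch below assumes both are keyed by a request id and uses accuracy as the metric; the ids, the labels, and the function name are hypothetical.

```python
# Hypothetical prediction log written at inference time: request id -> predicted label.
predictions = {"req-1": 1, "req-2": 0, "req-3": 1, "req-4": 0}

# Hypothetical outcomes that arrive later; req-4 has no outcome yet.
outcomes = {"req-1": 1, "req-2": 1, "req-3": 1}


def accuracy(predictions, outcomes):
    """Join predictions with outcomes on request id and return the hit rate."""
    joined = [
        (predictions[rid], outcomes[rid])
        for rid in predictions
        if rid in outcomes
    ]
    if not joined:
        return None  # nothing to evaluate yet
    correct = sum(1 for predicted, actual in joined if predicted == actual)
    return correct / len(joined)


model_accuracy = accuracy(predictions, outcomes)
```

Requests without an outcome are left out of the metric rather than counted as errors, so the score only reflects predictions that can already be judged.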
  31. Feature Store Demo
  32. Summary and Roadmap
     • Hopsworks is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
       - Hopsworks has an open-source Feature Store
     • Ongoing Work
       - Online Feature Store
       - Feature Transformation Library/DSL
       - Automated Data Provenance
       - Feature Store Incremental Updates with Hudi on Hive
  33. Upcoming Hopsworks Events in the Bay Area:
       - April 1st at Stanford, SysML
       - April 23rd – Hopsworks Hands-on in Palo Alto
       - April 25th in Moscone Center, Databricks Spark/AI Summit
      Read More: http://www.logicalclocks.com/feature-store by Kim Hammar
  34. Try it Out! @logicalclocks www.logicalclocks.com
       1. Register for an account at: www.hops.site
       2. Enter your Firstname/lastname here: https://bit.ly/2UEixTr
