The Feature Store:
Data Engineering meets Data Science
PyData 56th Meetup, London
May 7th, 2019
jim_dowling
CEO @ Logical Clocks
Assoc Prof @ KTH
©2018 Logical Clocks AB. All Rights Reserved
www.logicalclocks.com
Stockholm Office
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
Silicon Valley Office
470 Ramona St
Palo Alto
California,
USA
UK Office
IDEALondon,
69 Wilson St,
London, EC2A2BB,
UK
Dr. Jim Dowling
CEO
Steffen Grohsschmiedt
Head of Cloud
Theofilos Kakantousis
COO
Fabio Buso
Head of Engineering
Venture Capital Backed (Inventure, Frontline.vc, AI Seed)
Prof Seif Haridi
Chief Scientist
He Her
Han Hon
Hen
Become a Data Scientist!
Eureka! This will give a 12% increase in the efficiency of this wind farm!
Data Scientists are not Data Engineers
HDFS GCS Storage CosmosDB
How do I find features in this sea of data sources?
This tastes like dairy in my Latte!
Data Science with the Feature Store
HDFS GCS Storage CosmosDB
Feature Store
Feature Pipelines (Select, Transform, Aggregate, ..)
Now, I can change the world - one click-through at a time.
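Writing to the Feature Store (Data Engineer)
from hops import featurestore
from pyspark.sql.functions import col
# `spark` is the active SparkSession; `filename` points at the raw Parquet data
raw_data = spark.read.parquet(filename)
# square every column (assumes all columns are numeric)
polynomial_features = raw_data.select(
    [(col(c) ** 2).alias(c) for c in raw_data.columns])
featurestore.insert_into_featuregroup(polynomial_features,
    "polynomial_featuregroup")
Reading from the Feature Store (Data Scientist)
from hops import featurestore
df = featurestore.get_features([
    "average_attendance", "average_player_age"])
featurestore.create_training_dataset(df, "players_td")
Training dataset formats: tfrecords, numpy, petastorm, hdf5, csv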
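The two roles can be tried out without a Hopsworks cluster. The sketch below mimics the write path and the read path with a toy in-memory store. ToyFeatureStore, its methods, the team_id key, and the sample values are all hypothetical stand-ins chosen for illustration; this is not the hops API, and it assumes every feature group shares one entity key.

```python
from collections import defaultdict


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store (not the hops API)."""

    def __init__(self):
        # feature group name -> {entity key -> {feature name -> value}}
        self._groups = defaultdict(dict)

    def insert_into_featuregroup(self, rows, featuregroup, key="team_id"):
        """Data Engineer path: write rows (dicts) into a named feature group."""
        for row in rows:
            features = {k: v for k, v in row.items() if k != key}
            self._groups[featuregroup].setdefault(row[key], {}).update(features)

    def get_features(self, names):
        """Data Scientist path: look features up by name, joined on the entity key."""
        joined = defaultdict(dict)
        for group in self._groups.values():
            for entity, features in group.items():
                for name in names:
                    if name in features:
                        joined[entity][name] = features[name]
        # keep only entities that have every requested feature (inner join)
        return {e: f for e, f in joined.items() if len(f) == len(names)}


store = ToyFeatureStore()
# made-up sample values, for illustration only
store.insert_into_featuregroup(
    [{"team_id": 1, "average_attendance": 41000.0},
     {"team_id": 2, "average_attendance": 18500.0}],
    "attendance_featuregroup")
store.insert_into_featuregroup(
    [{"team_id": 1, "average_player_age": 26.4}],
    "players_featuregroup")

training_rows = store.get_features(["average_attendance", "average_player_age"])
print(training_rows)
```

The Data Scientist asks for features by name only and never needs to know which feature group, job, or raw source produced them.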
What is a Feature?
A feature may be a column in a Data Warehouse, but more generally it is a measurable property of a phenomenon under observation and (part of) an input to an ML model.
Features are often computed from raw or structured data sources:
• A raw word, a pixel, a sound wave, a sensor value
• An aggregate (mean, max, sum, min)
• A window (last_hour, last_day, etc.)
• A derived representation (embedding or cluster)
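As a small illustration of the aggregate and window feature types listed above, the sketch below derives both from raw sensor values using only the Python standard library. The readings, timestamps, and function names are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical raw sensor readings: (timestamp, value)
now = datetime(2019, 5, 7, 12, 0, 0)
readings = [
    (now - timedelta(minutes=150), 4.0),
    (now - timedelta(minutes=90), 6.0),
    (now - timedelta(minutes=45), 8.0),
    (now - timedelta(minutes=5), 10.0),
]


def aggregate_features(values):
    """Aggregate features over a list of raw values."""
    return {"mean": mean(values), "max": max(values),
            "sum": sum(values), "min": min(values)}


def window(readings, end, length):
    """Keep the values whose timestamp falls in (end - length, end]."""
    return [v for ts, v in readings if end - length < ts <= end]


all_time = aggregate_features([v for _, v in readings])
last_hour = aggregate_features(window(readings, now, timedelta(hours=1)))
print(all_time)
print(last_hour)
```

A window feature is simply an aggregate restricted to a time range, which is why the same aggregate_features function serves both cases.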
Duplicated Feature Engineering
[Figure: the Marketing, Research, and Analytics teams each build their own copy of the same "Bert" features. DUPLICATED]
Prevent Inconsistent Features: Training/Serving
Feature implementations may not be consistent between training and serving, which leads to correctness problems!
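One common way to avoid this kind of skew is to define each feature once and call that single definition from both the training path and the serving path. The sketch below illustrates the idea; the feature (a log-scaled amount), the function names, and the sample rows are hypothetical and not taken from the talk.

```python
import math


def log_scaled_amount(amount):
    """Single shared definition of the feature, used by both paths."""
    return math.log1p(max(amount, 0.0))


def build_training_rows(raw_rows):
    """Batch path: compute the feature for historical rows."""
    return [{"log_amount": log_scaled_amount(r["amount"]), "label": r["label"]}
            for r in raw_rows]


def build_serving_vector(request):
    """Online path: compute the same feature for one incoming request."""
    return [log_scaled_amount(request["amount"])]


history = [{"amount": 0.0, "label": 0}, {"amount": 99.0, "label": 1}]
train_rows = build_training_rows(history)
serving_vector = build_serving_vector({"amount": 99.0})

# Same input, same feature value, regardless of the path that computed it
assert train_rows[1]["log_amount"] == serving_vector[0]
```

If the two paths were instead implemented separately (say, one in Spark SQL and one in the serving application), nothing would enforce that equality.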
Features as first-class entities
• Features should be discoverable and reused.
• Features should be access controlled, versioned, and governed.
  - Enable reproducibility.
• Ability to pre-compute and automatically backfill features.
  - Aggregates, embeddings: avoid expensive re-computation.
  - On-demand computation of features should also be possible.
• The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [uber]
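The requirements above (discoverable, versioned, pre-computed with backfill) can be sketched as a tiny registry. FeatureRegistry and everything in it is a hypothetical illustration, not how Hopsworks implements these properties; access control and governance are left out to keep it short.

```python
class FeatureRegistry:
    """Hypothetical registry: features are named, versioned, and searchable."""

    def __init__(self):
        self._features = {}  # (name, version) -> metadata dict

    def register(self, name, compute, description=""):
        """Register a new version of a feature; versions start at 1."""
        version = self.latest_version(name) + 1
        self._features[(name, version)] = {
            "compute": compute, "description": description, "values": {}}
        return version

    def latest_version(self, name):
        """Highest registered version of a feature, or 0 if it is unknown."""
        versions = [v for (n, v) in self._features if n == name]
        return max(versions, default=0)

    def search(self, keyword):
        """Discoverability: find features whose name or description matches."""
        return sorted({n for (n, _), meta in self._features.items()
                       if keyword in n or keyword in meta["description"]})

    def backfill(self, name, version, raw_by_key):
        """Pre-compute the feature for historical raw data and cache the result."""
        meta = self._features[(name, version)]
        meta["values"] = {k: meta["compute"](raw) for k, raw in raw_by_key.items()}
        return meta["values"]


registry = FeatureRegistry()
v1 = registry.register("avg_spend", lambda xs: sum(xs) / len(xs),
                       "mean customer spend")
v2 = registry.register("avg_spend", lambda xs: round(sum(xs) / len(xs), 1),
                       "mean customer spend, rounded")
values = registry.backfill("avg_spend", v2, {"alice": [10.0, 20.0, 25.0]})
print(v1, v2, registry.search("spend"), values)
```

Keeping old versions registered is what makes a training run reproducible: a model trained on version 1 can still be retrained on version 1 after version 2 ships.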
Data Engineering meets Data Science
[Figure: the Data Engineer adds and removes features in the Feature Store; the Data Scientist browses and selects features to create train/test data.]
An ML Pipeline with the Feature Store
[Figure: pipeline from Structured & Raw Data through the Feature Store to a served model. Steps shown: Register Feature and its Job/Data; Select Features and generate Train/Test Data; Train Model; Validate Models, Deploy; Serve Model (with Online Features).]
Offline (Batch/Streaming) Feature Store
[Figure: components are the Data Lake, the Offline Feature Store, the Training Job, and Batch or Streaming Inference.]
1. Register Feature Engineering Job, copy Feature Data
2. Create Training Data and Train
3. Save Model
a. Get Feature Engineering Job, Model, Conda Environment
b. Run Job
Online Feature Store
[Figure: components are the Data Lake, the Online Feature Store, Train, Real-Time Serving, and Online Apps.]
1. Engineer Features
2. Create Training Data
3. Train Model
4. Deploy Model
a. Request Prediction
b. Get Online Features
c. Response
Known Feature Stores in Production
• Logical Clocks – Hopsworks (world’s first open source)
• Uber Michelangelo
• Airbnb – Bighead/Zipline
• Comcast
• Twitter
• GO-JEK Feast (GCE)
• Branch
A Feature Store for Hopsworks
Hopsworks – Batch, Streaming, Deep Learning
[Figure: Hopsworks platform architecture (On-Premise, AWS, Azure, GCE), grouped into BATCH ANALYTICS, STREAMING, and ML & DEEP LEARNING. Components: Data Sources, HopsFS, Kafka, Airflow, Spark / Flink, Spark, Feature Store, Hive, Deep Learning, BI Tools & Reporting, Notebooks, Serving w/ Kubernetes, Elastic. Legend: External Service, Hopsworks Service.]
ML Infrastructure in Hopsworks
[Figure: ML infrastructure diagram with the labels MODEL TRAINING, Feature Store, and HopsML API & Airflow. Diagram adapted from “technical debt of machine learning”.]
Distributed Deep Learning in Hopsworks
[Figure: a Driver and Executors 1 to N, each with its own conda_env, on top of HopsFS (HDFS), which holds TensorBoard, Models, Experiments, Training Data, and Logs.]
Hyperparameter Optimization
from hops import experiment

# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn(): # return dataset
    optimizer = …
    model = …
    model.add(Conv2D(…))
    model.compile(…)
    model.fit(…)
    model.evaluate(…)

# RUNS ON THE DRIVER
# one list of candidate values per hyperparameter of train()
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
https://github.com/logicalclocks/hops-examples
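To show what a grid search over that dictionary amounts to, the sketch below expands it into one trial per combination and picks the best one. It runs locally in a single process; expand_grid and toy_objective are hypothetical helpers, and the objective is a made-up stand-in for a real training metric. How hops schedules the trials on executors is not modelled here.

```python
from itertools import product


def expand_grid(hparams):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(hparams)
    return [dict(zip(names, combo))
            for combo in product(*(hparams[n] for n in names))]


def toy_objective(lr, dropout):
    """Hypothetical stand-in for train(); a real run would return a metric."""
    return -(abs(lr - 0.001) + abs(dropout - 0.5))


hparams = {"lr": [0.001, 0.0001], "dropout": [0.25, 0.5, 0.75]}
trials = expand_grid(hparams)
best = max(trials, key=lambda params: toy_objective(**params))
print(len(trials), best)
```

Two learning rates times three dropout values give six independent trials, which is why grid search parallelizes well across executors.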
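Distributed Training
import tensorflow as tf  # TensorFlow 1.x API
from hops import experiment

# RUNS ON THE EXECUTORS
def train():
    def input_fn(): # return dataset
    model = …
    optimizer = …
    model.compile(…)
    # the strategy object is passed via train_distribute
    rc = tf.estimator.RunConfig(
        train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy())
    keras_estimator = tf.keras.estimator.model_to_estimator(….)
    # train_and_evaluate takes TrainSpec/EvalSpec objects that wrap input_fn
    tf.estimator.train_and_evaluate(
        keras_estimator,
        tf.estimator.TrainSpec(input_fn=input_fn),
        tf.estimator.EvalSpec(input_fn=input_fn))

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)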
https://github.com/logicalclocks/hops-examples
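The idea behind collective all-reduce training can be sketched without TensorFlow: every worker computes a gradient on its own data shard, the gradients are averaged, and every worker receives the same averaged result and applies the same update. The simulation below runs in one process with made-up gradients; it illustrates the data flow only and is not how TensorFlow or hops implement the collective operation.

```python
def all_reduce_mean(worker_gradients):
    """Average gradients element-wise; every worker receives the same result."""
    n_workers = len(worker_gradients)
    summed = [sum(values) for values in zip(*worker_gradients)]
    averaged = [s / n_workers for s in summed]
    return [list(averaged) for _ in range(n_workers)]


def sgd_step(weights, gradient, lr=0.1):
    """Plain SGD update applied locally on one worker."""
    return [w - lr * g for w, g in zip(weights, gradient)]


# Hypothetical gradients computed by three workers on different data shards
gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_mean(gradients)

# All replicas start from the same weights and apply the same averaged gradient
replicas = [sgd_step([0.0, 0.0], g) for g in reduced]
print(reduced[0], replicas)
```

Because every replica applies an identical update, the model copies stay in sync without a central parameter server.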
Hopsworks’ Feature Store Concepts
Online Model Serving and Monitoring
Link Predictions with Outcomes to measure Model Performance
[Figure: an Inference Request enters Hopsworks and a Response is returned; components are the Feature Store, the Model Server on Kubernetes, the Data Lake, and Monitor.]
1. Access Control
2. Build Feature Vector
3. Make Prediction
4. Log Prediction
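The four serving steps can be sketched as a single request handler. Everything here is a hypothetical stand-in: the allow-list, the in-memory online feature table, the fixed linear model, and the list that plays the role of the data lake. It shows the order of the steps, not the Hopsworks serving API.

```python
ALLOWED_CLIENTS = {"web-app"}  # hypothetical access-control list
ONLINE_FEATURES = {42: {"avg_basket": 31.5, "visits_7d": 4}}  # hypothetical online store
PREDICTION_LOG = []  # stands in for the data lake


def model_predict(vector):
    """Hypothetical model: a fixed linear score, thresholded at 0."""
    score = 0.02 * vector[0] + 0.1 * vector[1] - 1.0
    return int(score > 0)


def handle_request(client, customer_id, feature_names):
    # 1. Access control
    if client not in ALLOWED_CLIENTS:
        raise PermissionError(client)
    # 2. Build the feature vector from the online feature store
    row = ONLINE_FEATURES[customer_id]
    vector = [row[name] for name in feature_names]
    # 3. Make the prediction
    prediction = model_predict(vector)
    # 4. Log the prediction so it can later be linked with the outcome
    PREDICTION_LOG.append({"customer_id": customer_id, "features": vector,
                           "prediction": prediction})
    return prediction


result = handle_request("web-app", 42, ["avg_basket", "visits_7d"])
print(result, PREDICTION_LOG)
```

Logging the feature vector together with the prediction is what makes the later join with observed outcomes possible.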
HopsML Feature Store Pipelines
[Figure: pipeline stages Ingest, Feature Engineering, Feature Store, Experiment/Train, and Deploy, run with Airflow. Other labels: Raw Data, Event Data, HopsFS, Feature Store, Serving, Monitor, logs.]
ML Pipelines of Jupyter Notebooks
[Figure: an End-to-End ML Pipeline run with Airflow, split into a Feature Backfill Pipeline and a Training and Deployment Pipeline around the Feature Store. Notebook steps: Feature Engineering; Select Features, File Format; Experiment, Train Model; Validate & Deploy Model.]
Hopsworks Feature Store as a Service
[Figure: Feature Engineering (AWS EMR, Azure HD Insight, Cloudera) works against the Data Lake and the Hopsworks Feature Store (Hops) using df.save( topic ) and df.read( s3:// ). Sagemaker, Azure ML, and Google AI select features and generate training data (s3://..tfrecords) for Train, and read online features for Real-Time Serving; Batch Serving is also shown.]
Feature Store Demo
Summary and Roadmap
• Hopsworks is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
  - Hopsworks has an open-source Feature Store
• Ongoing Work
  - Data Provenance
  - Feature Store Incremental Updates with Hudi on Hive
@logicalclocks
www.logicalclocks.com
Try it Out!
1. Register for an account at: www.hops.site
ML Pipelines of Jupyter Notebooks
[Figure: two ways to run a notebook stored as .ipynb (HDFS contents). Interactive: PySpark Kernel and Livy Server on HopsYARN and HopsFS, after materializing certs and ENV variables. Jobs Service: convert .ipynb to .py, run .py or .jar, schedule using REST API or UI, after materializing certs and ENV variables. Outputs: logs and results, Experiments & Tensorboard; View Old Notebooks, Experiments and Visualizations.]

More Related Content

What's hot

Managed Feature Store for Machine Learning
Managed Feature Store for Machine LearningManaged Feature Store for Machine Learning
Managed Feature Store for Machine LearningLogical Clocks
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 
StreamSQL Feature Store (Apache Pulsar Summit)
StreamSQL Feature Store (Apache Pulsar Summit)StreamSQL Feature Store (Apache Pulsar Summit)
StreamSQL Feature Store (Apache Pulsar Summit)Simba Khadder
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Jim Dowling
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigmJim Dowling
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Jim Dowling
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKJan Wiegelmann
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineJan Wiegelmann
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJim Dowling
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingJan Wiegelmann
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowJan Wiegelmann
 

What's hot (20)

Managed Feature Store for Machine Learning
Managed Feature Store for Machine LearningManaged Feature Store for Machine Learning
Managed Feature Store for Machine Learning
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
StreamSQL Feature Store (Apache Pulsar Summit)
StreamSQL Feature Store (Apache Pulsar Summit)StreamSQL Feature Store (Apache Pulsar Summit)
StreamSQL Feature Store (Apache Pulsar Summit)
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigm
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACK
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / Pipeline
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlow
 

Similar to PyData Meetup - Feature Store for Hopsworks and ML Pipelines

2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and modelsGe Org
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsJim Dowling
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Jim Dowling
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxLex Avstreikh
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)Jasjeet Thind
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Luciano Resende
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesStitch Fix Algorithms
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Joachim Schlosser
 
Manoj Shanmugasundaram - Agile Machine Learning Development
Manoj Shanmugasundaram - Agile Machine Learning DevelopmentManoj Shanmugasundaram - Agile Machine Learning Development
Manoj Shanmugasundaram - Agile Machine Learning DevelopmentAgile Impact Conference
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreMoritz Meister
 

Similar to PyData Meetup - Feature Store for Hopsworks and ML Pipelines (20)

2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on Hops
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
03_aiops-1.pptx
03_aiops-1.pptx03_aiops-1.pptx
03_aiops-1.pptx
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
Elyra - a set of AI-centric extensions to JupyterLab Notebooks.
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
 
Manoj Shanmugasundaram - Agile Machine Learning Development
Manoj Shanmugasundaram - Agile Machine Learning DevelopmentManoj Shanmugasundaram - Agile Machine Learning Development
Manoj Shanmugasundaram - Agile Machine Learning Development
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature Store
 
Data herding
Data herdingData herding
Data herding
 

More from Jim Dowling

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfJim Dowling
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money LaunderingJim Dowling
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityInvited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityJim Dowling
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIJim Dowling
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceJim Dowling
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraJim Dowling
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsJim Dowling
 

More from Jim Dowling (14)

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdf
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money Laundering
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityInvited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in Finance
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on Hops
 

PyData Meetup - Feature Store for Hopsworks and ML Pipelines

  • 1. The Feature Store: Data Engineering meets Data Science. PyData 56th Meetup, London, May 7th, 2019. jim_dowling, CEO @ Logical Clocks, Assoc Prof @ KTH.
  • 2. ©2018 Logical Clocks AB. All Rights Reserved. www.logicalclocks.com. Stockholm Office: Box 1263, Isafjordsgatan 22, Kista, Sweden. Silicon Valley Office: 470 Ramona St, Palo Alto, California, USA. UK Office: IDEALondon, 69 Wilson St, London, EC2A 2BB, UK. Dr. Jim Dowling (CEO), Steffen Grohsschmiedt (Head of Cloud), Theofilos Kakantousis (COO), Fabio Buso (Head of Engineering), Prof Seif Haridi (Chief Scientist). Venture Capital Backed (Inventure, Frontline.vc, AI Seed).
  • 3. He Her Han Hon Hen
  • 4. Become a Data Scientist! "Eureka! This will give a 12% increase in the efficiency of this wind farm!"
  • 5. Data Scientists are not Data Engineers. Data sources: HDFS, GCS, Storage, CosmosDB. "How do I find features in this sea of data sources?" "This tastes like dairy in my Latte!"
  • 6. Data Science with the Feature Store. Data sources: HDFS, GCS, Storage, CosmosDB. Feature Pipelines (Select, Transform, Aggregate, ..). Feature Store. "Now, I can change the world - one click-through at a time."
  • 7. Writing to the Feature Store (Data Engineer):
      from hops import featurestore
      raw_data = spark.read.parquet(filename)
      polynomial_features = raw_data.map(lambda x: x**2)
      featurestore.insert_into_featuregroup(polynomial_features, "polynomial_featuregroup")
    Reading from the Feature Store (Data Scientist):
      from hops import featurestore
      df = featurestore.get_features(["average_attendance", "average_player_age"])
      featurestore.create_training_dataset(df, "players_td")
    Training dataset formats: tfrecords, numpy, petastorm, hdf5, csv
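The write and read paths on slide 7 can be tried without a cluster. What follows is a minimal in-memory sketch, not the hops library: the class `InMemoryFeatureStore`, its internals, the `key` parameter, the feature group names, and the sample rows are all assumptions made for illustration. Only the three method names echo the calls on the slide.

```python
import csv
import io


class InMemoryFeatureStore:
    """Toy stand-in for a feature store (hypothetical, not the hops library)."""

    def __init__(self):
        # feature group name -> {entity key -> {feature name -> value}}
        self._groups = {}

    def insert_into_featuregroup(self, rows, featuregroup, key="id"):
        """Data Engineer path: write rows (dicts) into a named feature group."""
        group = self._groups.setdefault(featuregroup, {})
        for row in rows:
            features = {k: v for k, v in row.items() if k != key}
            group.setdefault(row[key], {}).update(features)

    def get_features(self, names):
        """Data Scientist path: select features by name, joined on the entity key."""
        joined = {}
        for group in self._groups.values():
            for entity, features in group.items():
                for name in names:
                    if name in features:
                        joined.setdefault(entity, {})[name] = features[name]
        # Inner join: keep only entities that have every requested feature.
        return {e: f for e, f in joined.items() if len(f) == len(names)}

    def create_training_dataset(self, features, names):
        """Materialize the selected features as CSV text."""
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(names)
        for entity in sorted(features):
            writer.writerow([features[entity][n] for n in names])
        return out.getvalue()


store = InMemoryFeatureStore()

# Data Engineer: two pipelines write two feature groups (made-up values).
store.insert_into_featuregroup(
    [{"id": 1, "average_attendance": 41000.0}, {"id": 2, "average_attendance": 23500.0}],
    "attendance_featuregroup",
)
store.insert_into_featuregroup(
    [{"id": 1, "average_player_age": 26.4}, {"id": 2, "average_player_age": 27.9}],
    "players_featuregroup",
)

# Data Scientist: ask for features by name, without knowing which group holds them.
wanted = ["average_attendance", "average_player_age"]
df = store.get_features(wanted)
players_td = store.create_training_dataset(df, wanted)
print(players_td)
```

The point of the sketch is the division of labour: the writer decides how features are grouped and keyed, while the reader names only the features it wants and receives them already joined.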
  • 8. What is a Feature? A feature may be a column in a Data Warehouse, but more generally it is a measurable property of a phenomenon under observation and (part of) an input to an ML model. Features are often computed from raw or structured data sources:
      • A raw word, a pixel, a sound wave, a sensor value;
      • An aggregate (mean, max, sum, min);
      • A window (last_hour, last_day, etc.);
      • A derived representation (embedding or cluster).
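The aggregate and window features listed on slide 8 can be illustrated with a few lines of standard-library Python. This is only a sketch: the sensor readings, the function names, and the choice of a half-open window (start excluded, end included) are assumptions, not something the slides specify.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical raw sensor readings: (timestamp, value).
readings = [
    (datetime(2019, 5, 7, 9, 10), 4.0),
    (datetime(2019, 5, 7, 9, 40), 6.0),
    (datetime(2019, 5, 7, 10, 5), 8.0),
    (datetime(2019, 5, 7, 10, 30), 10.0),
]


def window(rows, now, length):
    """Values whose timestamp t satisfies now - length < t <= now."""
    return [value for t, value in rows if now - length < t <= now]


def aggregate_features(values):
    """The four aggregates named on the slide."""
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
    }


now = datetime(2019, 5, 7, 10, 30)
last_hour = aggregate_features(window(readings, now, timedelta(hours=1)))
last_day = aggregate_features(window(readings, now, timedelta(days=1)))
print(last_hour)
print(last_day)
```

Each raw value is itself a feature candidate; the window picks which raw values count, and the aggregate turns them into a single number a model can consume.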
  • 9. Duplicated Feature Engineering. [Diagram: Marketing, Research, and Analytics each with their own "Bert Features"; labelled DUPLICATED.]
  • 10. Prevent Inconsistent Features: Training/Serving. Feature implementations may not be consistent: correctness problems!
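A small sketch can make the training/serving inconsistency on slide 10 concrete. Everything here is hypothetical (the `amount` field, the log-scaling feature, and the drifted re-implementation are invented for illustration); the idea shown is simply that both paths should call one shared feature definition.

```python
import math


def amount_feature(raw_amount):
    """Single shared definition: log-scale a non-negative amount."""
    return math.log1p(max(raw_amount, 0.0))


def build_training_rows(raw_rows):
    """Training path: uses the shared definition."""
    return [[amount_feature(row["amount"])] for row in raw_rows]


def build_serving_vector(request):
    """Serving path: uses the same shared definition."""
    return [amount_feature(request["amount"])]


def inconsistent_serving_vector(request):
    """A separate re-implementation that has drifted (log10 instead of log1p)."""
    return [math.log10(request["amount"])]


raw = {"amount": 99.0}
train_value = build_training_rows([raw])[0]
serve_value = build_serving_vector(raw)
skewed_value = inconsistent_serving_vector(raw)

print(train_value == serve_value)   # same code path, same value
print(train_value == skewed_value)  # the model would see a different input
```

With a feature store, the shared definition lives with the registered feature rather than being copied into each team's training and serving code.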
  • 11. Features as first-class entities
      • Features should be discoverable and reused.
      • Features should be access controlled, versioned, and governed.
        - Enable reproducibility.
      • Ability to pre-compute and automatically backfill features.
        - Aggregates, embeddings: avoid expensive re-computation.
        - On-demand computation of features should also be possible.
      • The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [uber]
  • 12. Data Engineering meets Data Science. Data Engineer: Add/Remove Features. Feature Store. Data Scientist: Browse & Select Features to create Train/Test Data.
  • 13. An ML Pipeline with the Feature Store. Structured & Raw Data; Register Feature and its Job/Data; Feature Store; Select Features and generate Train/Test Data; Train Model; Validate Models, Deploy; Serve Model; Online Features.
  • 14. Offline (Batch/Streaming) Feature Store. Data Lake; Offline Feature Store; Training Job; Batch or Streaming Inference. 1. Register Feature Engineering Job, copy Feature Data. 2. Create Training Data and Train. 3. Save Model. a. Get Feature Engineering Job, Model, Conda Environment. b. Run Job.
  • 15. Online Feature Store. Data Lake; Online Feature Store; Train; Real-Time Serving; Online Apps. 1. Engineer Features. 2. Create Training Data. 3. Train Model. 4. Deploy Model. a. Request Prediction. b. Get Online Features. c. Response.
  • 16. Known Feature Stores in Production
      • Logical Clocks – Hopsworks (world’s first open source)
      • Uber Michelangelo
      • Airbnb – Bighead/Zipline
      • Comcast
      • Twitter
      • GO-JEK Feast (GCE)
      • Branch
  • 17. A Feature Store for Hopsworks
  • 18. Hopsworks – Batch, Streaming, Deep Learning. [Diagram labels: Data Sources, HopsFS, Kafka, Airflow, Spark / Flink, Spark, Feature Store, Hive, Deep Learning, BI Tools & Reporting, Notebooks, Serving w/ Kubernetes, Hopsworks (On-Premise, AWS, Azure, GCE), Elastic. Legend: External Service, Hopsworks Service.]
  • 19. Hopsworks – Batch, Streaming, Deep Learning. [Same diagram as slide 18, with the areas BATCH ANALYTICS, STREAMING, and ML & DEEP LEARNING marked.]
  • 20. ML Infrastructure in Hopsworks. MODEL TRAINING; Feature Store; HopsML API & Airflow. [Diagram adapted from “technical debt of machine learning”]
  • 21. Distributed Deep Learning in Hopsworks. [Diagram labels: Driver, Executor 1, Executor N (conda_env on each); HopsFS (HDFS): TensorBoard, Models, Experiments, Training Data, Logs.]
  • 22. Hyperparameter Optimization
      # RUNS ON THE EXECUTORS
      def train(lr, dropout):
          def input_fn(): # return dataset
          optimizer = …
          model = …
          model.add(Conv2D(…))
          model.compile(…)
          model.fit(…)
          model.evaluate(…)
      # RUNS ON THE DRIVER
      hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
      experiment.grid_search(train, hparams)
    https://github.com/logicalclocks/hops-examples
    [Diagram labels: Driver, Executor 1, Executor N (conda_env on each); HopsFS (HDFS): TensorBoard, Models, Experiments, Training Data, Logs.]
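To make the grid search on slide 22 concrete, the sketch below expands the same hyperparameter dictionary into its six trials and picks the best one. It is a sequential, standard-library illustration and not the hops `experiment` module: `expand_grid`, `grid_search`, and the placeholder `train` objective are assumptions, and on the slide the trials run on Spark executors rather than in a local loop.

```python
from itertools import product


def expand_grid(hparams):
    """Cartesian product of all hyperparameter values, as a list of dicts."""
    names = sorted(hparams)
    combos = product(*(hparams[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def grid_search(train_fn, hparams):
    """Run train_fn once per combination; return (best_metric, best_params)."""
    results = [(train_fn(**params), params) for params in expand_grid(hparams)]
    return max(results, key=lambda result: result[0])


def train(lr, dropout):
    # Placeholder objective standing in for model.fit / model.evaluate.
    # It is constructed to peak at lr=0.001, dropout=0.5.
    return 1.0 - abs(lr - 0.001) * 100 - abs(dropout - 0.5)


hparams = {"lr": [0.001, 0.0001], "dropout": [0.25, 0.5, 0.75]}
trials = expand_grid(hparams)
best_metric, best_params = grid_search(train, hparams)

print(len(trials))
print(best_params)
```

Because each trial depends only on its own parameters, the loop in `grid_search` is trivially parallel, which is what makes it a natural fit for one-trial-per-executor scheduling.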
  • 23. Distributed Training
      # RUNS ON THE EXECUTORS
      def train():
          def input_fn(): # return dataset
          model = …
          optimizer = …
          model.compile(…)
          rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
          keras_estimator = tf.keras.estimator.model_to_estimator(….)
          tf.estimator.train_and_evaluate(keras_estimator, input_fn)
      # RUNS ON THE DRIVER
      experiment.collective_all_reduce(train)
    https://github.com/logicalclocks/hops-examples
    [Diagram labels: Driver, Executor 1, Executor N (conda_env on each); HopsFS (HDFS): TensorBoard, Models, Experiments, Training Data, Logs.]
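The effect of a collective all-reduce in data-parallel training can be shown on plain lists. This sketch only illustrates the semantics (every worker ends up with the same averaged gradient and therefore the same weights); it is not TensorFlow's implementation, it does not model the communication pattern, and the gradients, weights, and learning rate are made-up numbers.

```python
def all_reduce_mean(worker_gradients):
    """Give every worker the element-wise mean of all workers' gradients."""
    n = len(worker_gradients)
    summed = [sum(parts) for parts in zip(*worker_gradients)]
    averaged = [s / n for s in summed]
    return [list(averaged) for _ in range(n)]


def sgd_step(weights, gradient, lr):
    """One plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, gradient)]


# Two workers computed different gradients on their own data shards.
gradients = [[1.0, 2.0], [3.0, 6.0]]
reduced = all_reduce_mean(gradients)

# Every worker starts from the same weights and applies the same averaged gradient.
weights = [[0.5, 0.5] for _ in gradients]
weights = [sgd_step(w, g, lr=0.5) for w, g in zip(weights, reduced)]

print(reduced)
print(weights)
```

Since all replicas apply an identical update, no parameter server is needed to keep them in sync, which is the property the strategy named on the slide relies on.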
  • 25. Online Model Serving and Monitoring. Inference Request to Hopsworks: 1. Access Control. 2. Build Feature Vector (Feature Store). 3. Make Prediction (Model Server, Kubernetes). 4. Log Prediction (Data Lake). Response. Monitor: Link Predictions with Outcomes to measure Model Performance.
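The four serving steps on slide 25, plus the later linking of predictions with outcomes, can be sketched as a single request handler. All names and values below are hypothetical stand-ins (an in-memory dict for the online feature store, a threshold rule for the model, a list for the prediction log); none of this is Hopsworks code.

```python
# Hypothetical stand-ins for the components on the slide.
ALLOWED_CLIENTS = {"web-app"}
ONLINE_FEATURES = {
    "match_1": {"average_attendance": 41000.0, "average_player_age": 26.4},
    "match_2": {"average_attendance": 23500.0, "average_player_age": 27.9},
}
FEATURE_ORDER = ["average_attendance", "average_player_age"]
prediction_log = []


def model_predict(vector):
    """Placeholder model: predicts 1 when attendance exceeds a made-up threshold."""
    return 1 if vector[0] > 30000.0 else 0


def handle_request(client_id, request_id, entity_id):
    # 1. Access control
    if client_id not in ALLOWED_CLIENTS:
        raise PermissionError(f"client {client_id!r} may not request predictions")
    # 2. Build the feature vector from the online feature store
    features = ONLINE_FEATURES[entity_id]
    vector = [features[name] for name in FEATURE_ORDER]
    # 3. Make the prediction
    prediction = model_predict(vector)
    # 4. Log the prediction so it can later be joined with the outcome
    prediction_log.append(
        {"request_id": request_id, "vector": vector, "prediction": prediction}
    )
    return prediction


def accuracy(log, outcomes):
    """Link logged predictions with observed outcomes by request id."""
    scored = [
        (entry["prediction"], outcomes[entry["request_id"]])
        for entry in log
        if entry["request_id"] in outcomes
    ]
    return sum(p == o for p, o in scored) / len(scored)


first = handle_request("web-app", "r1", "match_1")
second = handle_request("web-app", "r2", "match_2")
observed = {"r1": 1, "r2": 1}  # outcomes that arrive after the predictions
model_accuracy = accuracy(prediction_log, observed)
print(first, second, model_accuracy)
```

Logging the feature vector together with the prediction is what allows the monitoring step to measure model performance once the real outcomes are known.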
  • 26. HopsML Feature Store Pipelines
  • 27. [Diagram labels: Raw Data, Event Data, Monitor, HopsFS, Feature Store, Serving, Feature Store, Feature Engineering, Ingest, Deploy, Experiment/Train, Airflow, logs, logs.]
  • 28. ML Pipelines of Jupyter Notebooks. Select Features, File Format; Feature Engineering; Validate & Deploy Model; Experiment, Train Model. Airflow End-to-End ML Pipeline: Feature Backfill Pipeline; Training and Deployment Pipeline. Feature Store.
  • 29. Hopsworks Feature Store as a Service. [Diagram labels: Hops; df.save( topic ); df.read( s3:// ); Real-Time Serving; Train; Hopsworks Feature Store; Sagemaker, Azure ML, Google AI; Feature Engineering (AWS EMR, Azure HD Insight, Cloudera); Data Lake; Data Lake; select features and generate training data; s3://..tfrecords; read online features; Batch Serving.]
  • 31. Summary and Roadmap
      • Hopsworks is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
        - Hopsworks has an open-source Feature Store
      • Ongoing Work
        - Data Provenance
        - Feature Store Incremental Updates with Hudi on Hive
  • 32. @logicalclocks www.logicalclocks.com Try it Out! 1. Register for an account at: www.hops.site
  • 33. ML Pipelines of Jupyter Notebooks. [Diagram labels: Convert .ipynb to .py; Jobs Service; Run .py or .jar; Schedule using REST API or UI; materialize certs, ENV variables; View Old Notebooks, Experiments and Visualizations; Experiments & Tensorboard; PySpark Kernel; .ipynb (HDFS contents); [logs, results]; Livy Server; HopsYARN; HopsFS; Interactive; materialize certs, ENV variables.]