The Feature Store:
Data Engineering meets Data Science
PyData 56th Meetup, London
May 7th, 2019
jim_dowling
CEO @ Logical Clocks
Assoc Prof @ KTH
www.logicalclocks.com
Stockholm Office
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
Silicon Valley Office
470 Ramona St
Palo Alto
California,
USA
UK Office
IDEALondon,
69 Wilson St,
London, EC2A2BB,
UK
Dr. Jim Dowling
CEO
Steffen Grohsschmiedt
Head of Cloud
Theofilos Kakantousis
COO
Fabio Buso
Head of Engineering
Venture Capital Backed (Inventure, Frontline.vc, AI Seed)
Prof Seif Haridi
Chief Scientist
He, Her / Han, Hon / Hen
Become a Data Scientist!
Eureka! This will give a 12% increase in the efficiency of this wind farm!
Data Scientists are not Data Engineers
HDFS, GCS Storage, CosmosDB
How do I find features in this sea of data sources?
This tastes like dairy in my Latte!
Data Science with the Feature Store
HDFS, GCS Storage, CosmosDB
Feature Store
Feature Pipelines (Select, Transform, Aggregate, ...)
Now, I can change the world - one click-through at a time.
Writing to the Feature Store (Data Engineer)

from hops import featurestore
raw_data = spark.read.parquet(filename)
# square the raw values to create a polynomial feature (** is exponentiation in Python)
polynomial_features = raw_data.map(lambda x: x ** 2)
featurestore.insert_into_featuregroup(polynomial_features,
                                      "polynomial_featuregroup")

Reading from the Feature Store (Data Scientist)

from hops import featurestore
df = featurestore.get_features([
    "average_attendance", "average_player_age"])
featurestore.create_training_dataset(df, "players_td")

Training dataset formats: tfrecords, numpy, petastorm, hdf5, csv
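The formats above are export formats for training datasets. A minimal sketch, assuming the hops featurestore API from this slide (the data_format keyword and the feature names are my assumptions):

from hops import featurestore

# Select two features and materialize them as a training dataset stored as
# TFRecords; other formats would be "numpy", "petastorm", "hdf5" or "csv".
df = featurestore.get_features(["average_attendance", "average_player_age"])
featurestore.create_training_dataset(df, "players_td", data_format="tfrecords")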
What is a Feature?
A feature may be a column in a Data Warehouse, but more generally it is a
measurable property of a phenomenon under observation and (part of) an
input to an ML model.
Features are often computed from raw or structured data sources:
•A raw word, a pixel, a sound wave, a sensor value;
•An aggregate (mean, max, sum, min);
•A window (last_hour, last_day, etc.);
•A derived representation (embedding or cluster) - see the PySpark sketch below.
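As an illustration of the aggregate and window cases, a minimal PySpark sketch (the path, table and column names are hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("hdfs:///Projects/demo/raw/sensor_readings")  # hypothetical path

# Aggregate features: mean / max / min of a sensor value per sensor
agg_features = raw.groupBy("sensor_id").agg(
    F.mean("value").alias("value_mean"),
    F.max("value").alias("value_max"),
    F.min("value").alias("value_min"))

# Window feature: mean value over the last hour per sensor
last_hour = raw.where(F.expr("ts > current_timestamp() - INTERVAL 1 HOUR"))
windowed = last_hour.groupBy("sensor_id").agg(
    F.mean("value").alias("value_mean_last_hour"))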
Duplicated Feature Engineering
[Diagram: Marketing, Research, and Analytics teams each compute the same "Bert Features" independently - DUPLICATED feature engineering]
Prevent Inconsistent Features – Training/Serving
Feature implementations may not be consistent – correctness problems!
Features as first-class entities
•Features should be discoverable and reusable.
•Features should be access controlled, versioned, and governed.
 - Enables reproducibility.
•Ability to pre-compute and automatically backfill features (see the sketch below).
 - Aggregates, embeddings - avoid expensive re-computation.
 - On-demand computation of features should also be possible.
•The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [Uber]
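A sketch of registering such a first-class feature group with the hops API from the earlier slide; the data, the description and the version keyword arguments are my assumptions (and a SparkSession named spark is assumed, as on the earlier slides):

from hops import featurestore

# team_features: a hypothetical DataFrame of pre-computed aggregates
team_features = spark.read.parquet("hdfs:///Projects/demo/fe/team_features")

# Register it as a documented, versioned feature group so it can be
# discovered, access controlled and backfilled later.
featurestore.create_featuregroup(
    team_features,
    "team_features",
    description="Aggregated attendance and player-age features per team",
    featuregroup_version=1)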
Data Engineering meets Data Science
[Diagram: the Feature Store sits between the two roles - the Data Engineer adds/removes Features, the Data Scientist browses & selects Features to create Train/Test Data]
An ML Pipeline with the Feature Store
[Diagram: Structured & Raw Data → Feature Store (register a Feature and its Job/Data) → select Features and generate Train/Test Data → Train Model → Validate Models, Deploy → Serve Model (with Online Features)]
Offline (Batch/Streaming) Feature Store
[Diagram: Data Lake → Offline Feature Store → Training Job, plus Batch or Streaming Inference]
1. Register the Feature Engineering Job, copy Feature Data
2. Create Training Data and Train
3. Save Model
a. Get the Feature Engineering Job, Model, Conda Environment
b. Run the Job
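A sketch of steps 2 and 3 on the training side, assuming the hops featurestore API and the hypothetical "players_td" training dataset from earlier (the label column and the pickle export are also assumptions):

from hops import featurestore
from sklearn.linear_model import LogisticRegression
import joblib

# 2. Create Training Data and Train: materialize the training dataset built
#    from the offline feature store and fit a simple model on it.
train_df = featurestore.get_training_dataset("players_td").toPandas()
X = train_df[["average_attendance", "average_player_age"]]
y = train_df["label"]  # assumes the training dataset contains a label column
model = LogisticRegression().fit(X, y)

# 3. Save Model: persisted here as a plain pickle; the slide does not show the
#    exact model-export API.
joblib.dump(model, "players_model.pkl")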
Online Feature Store
[Diagram: Data Lake → Online Feature Store → Real-Time Serving for Online Apps]
1. Engineer Features
2. Create Training Data
3. Train Model
4. Deploy Model
a. Request Prediction
b. Get Online Features
c. Response
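At serving time (step b), the online store is a low-latency lookup of feature values by primary key. A purely illustrative sketch; the connection details, table and column names are placeholders, not the actual Hopsworks online feature store schema:

import mysql.connector

# b. Get Online Features: fetch the latest feature values for one entity key
#    so that a feature vector can be built for the prediction request.
conn = mysql.connector.connect(host="online-featurestore.example.com",
                               user="serving", password="...",
                               database="demo_featurestore")
cur = conn.cursor()
team_id = 42  # example primary key
cur.execute("SELECT average_attendance, average_player_age "
            "FROM team_features_1 WHERE team_id = %s", (team_id,))
feature_vector = cur.fetchone()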
Known Feature Stores in Production
•Logical Clocks – Hopsworks (world’s first open source)
•Uber Michelangelo
•Airbnb – Bighead/Zipline
•Comcast
•Twitter
•GO-JEK Feast (GCE)
•Branch
A Feature Store for Hopsworks
Hopsworks – Batch, Streaming, Deep Learning
[Architecture diagram: Data Sources, HopsFS, Kafka, Airflow, Spark / Flink, Spark, Feature Store, Hive, Deep Learning, BI Tools & Reporting, Notebooks, Serving w/ Kubernetes, Elastic; runs On-Premise, AWS, Azure, GCE; components are a mix of Hopsworks services and external services]
Hopsworks – Batch, Streaming, Deep Learning
[Same architecture diagram, with the components grouped into BATCH ANALYTICS, STREAMING, and ML & DEEP LEARNING workloads]
ML Infrastructure in Hopsworks
[Diagram adapted from “technical debt of machine learning”, showing MODEL TRAINING alongside the supporting infrastructure: the Feature Store and the HopsML API & Airflow]
Distributed Deep Learning in Hopsworks
[Diagram: a Driver and Executors 1..N, each with its own conda_env, all reading and writing to HopsFS (HDFS): TensorBoard logs, Models, Experiments, Training Data, Logs]
Hyperparameter Optimization
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        …
    optimizer = …
    model = …
    model.add(Conv2D(…))
    model.compile(…)
    model.fit(…)
    model.evaluate(…)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
https://github.com/logicalclocks/hops-examples
Distributed Training
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        …
    model = …
    optimizer = …
    model.compile(…)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(…)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
https://github.com/logicalclocks/hops-examples
Hopsworks’ Feature Store Concepts
Online Model Serving and Monitoring
Link Predictions with Outcomes to measure Model Performance.
[Diagram: an Inference Request arrives at Hopsworks (1. Access Control), the Feature Store is used to 2. Build the Feature Vector, the Model Server on Kubernetes 3. Makes the Prediction, and 4. the Prediction is Logged to the Data Lake for Monitoring before the Response is returned]
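A client-side sketch of the request/response path above; the endpoint URL, API key header and payload shape are illustrative placeholders, not the actual Hopsworks serving REST API:

import requests

# a. Request Prediction with just an entity key; access control, feature
#    vector construction, the model call on Kubernetes, and prediction
#    logging all happen server-side, as in the diagram.
resp = requests.post(
    "https://hopsworks.example.com/serving/players_model:predict",  # placeholder
    headers={"Authorization": "ApiKey <key>"},
    json={"instances": [{"team_id": 42}]})
print(resp.json())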
HopsML Feature Store Pipelines
[Diagram: Ingest → Feature Engineering → Feature Store → Experiment/Train → Deploy → Serving, orchestrated by Airflow; Raw Data and Event Data are ingested into HopsFS, and logs from Serving are fed back for Monitoring]
ML Pipelines of Jupyter Notebooks
[Diagram: an End-to-End ML Pipeline orchestrated by Airflow, split at the Feature Store into a Feature Backfill Pipeline (Feature Engineering → Feature Store) and a Training and Deployment Pipeline (Select Features & File Format → Experiment / Train Model → Validate & Deploy Model); a generic Airflow sketch follows below]
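A generic Airflow sketch of the pipeline above. It is not the Hopsworks Airflow integration; it simply runs the four notebooks in order with papermill to show the DAG structure:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("end_to_end_ml_pipeline",
         start_date=datetime(2019, 5, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    feature_engineering = BashOperator(
        task_id="feature_engineering",
        bash_command="papermill feature_engineering.ipynb out/fe.ipynb")
    select_features = BashOperator(
        task_id="select_features_and_file_format",
        bash_command="papermill select_features.ipynb out/select.ipynb")
    train_model = BashOperator(
        task_id="experiment_train_model",
        bash_command="papermill train.ipynb out/train.ipynb")
    validate_deploy = BashOperator(
        task_id="validate_and_deploy_model",
        bash_command="papermill validate_deploy.ipynb out/deploy.ipynb")

    # Feature Backfill Pipeline feeds the Training and Deployment Pipeline
    feature_engineering >> select_features >> train_model >> validate_deploy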
Hopsworks Feature Store as a Service
[Diagram: Feature Engineering runs externally (AWS EMR, Azure HDInsight, Cloudera) over a Data Lake and feeds the Hopsworks Feature Store; external training platforms (SageMaker, Azure ML, Google AI) select features and generate training data written as s3://..tfrecords, read it back for Train and Batch Serving, and read online features for Real-Time Serving; annotated with df.save( topic ) and df.read( s3:// )]
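On the external training platform, the exported training dataset is read straight back from S3. A sketch with TensorFlow (bucket, path and feature schema are hypothetical; reading s3:// paths requires TensorFlow's S3 filesystem support):

import tensorflow as tf

# Read the TFRecords that the Feature Store wrote to S3 and parse them with
# the (assumed) feature schema of the players_td training dataset.
dataset = tf.data.TFRecordDataset("s3://my-bucket/players_td/part-r-00000.tfrecords")
feature_spec = {
    "average_attendance": tf.io.FixedLenFeature([], tf.float32),
    "average_player_age": tf.io.FixedLenFeature([], tf.float32),
}
dataset = dataset.map(lambda rec: tf.io.parse_single_example(rec, feature_spec))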
Feature Store Demo
Summary and Roadmap
•Hopsworks is a new Data Platform with first-class support
for Python / Deep Learning / ML / Data Governance / GPUs
-Hopsworks has an open-source Feature Store
•Ongoing Work
-Data Provenance
-Feature Store Incremental Updates with Hudi on Hive
@logicalclocks
www.logicalclocks.com
Try it Out!
1. Register for an account at: www.hops.site
ML Pipelines of Jupyter Notebooks
[Diagram: interactive development runs in a PySpark Kernel via the Livy Server, with .ipynb contents, logs and results stored in HopsFS; the Jobs Service converts .ipynb to .py, runs the .py or .jar on HopsYARN, materializes certs and ENV variables, and can be scheduled using the REST API or UI; old Notebooks, Experiments and Visualizations are viewable through the Experiments & Tensorboard services]
