Feature Store: the missing data layer in ML pipelines?1
Spotify ML Guild Fika
Kim Hammar
kim@logicalclocks.com
February 26, 2019
1
Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29
Model
ϕ(x)
Data Predictions
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 2 / 29
Model
ϕ(x)
Data Predictions
Distributed Training
Data Validation
Feature Engineering
Data Collection
Hardware
Management
HyperParameter
Tuning
Model
Serving
Pipeline Management
A/B
Testing
Monitoring
2
2
Image inspired from Sculley et al. (Google) Hidden Technical Debt in Machine Learning Systems
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 3 / 29
Outline
1 Hopsworks: Quick background of the platform
2 What is a Feature Store
3 Why You Need a Feature Store, Things to Consider:
How to encourage feature reusage?
How to store large-scale datasets for deep learning?
How to serve features for inference?
4 How to Build a Feature Store (Hopsworks Feature Store Case Study)
5 Demo
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 4 / 29
REST API
Kafka
TF Serving
Data Ingestion Data Prep Feature Store Training Serving
Orchestration
CPUs GPUs
HopsML
HopsYARN (fork of YARN)
HopsFS (fork of HDFS)
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29
REST API
Kafka
TF Serving
Data Ingestion Data Prep Feature Store Training Serving
Orchestration
CPUs GPUs
HopsML
HopsYARN (fork of YARN)
HopsFS (fork of HDFS)
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29
ϕ(x)




y1
...
yn








x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n




ˆy
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
ϕ(x)




y1
...
yn








x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n




ˆy
?
_
_("))_/
_
3
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
ϕ(x)




y1
...
yn








x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n




ˆy
?
_
_("))_/
_
“Data is the hardest part of ML and the most important piece to
get right.
Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.”
- Uber3
3
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
ϕ(x)




y1
...
yn








x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n




ˆyFeature Store
“Data is the hardest part of ML and the most important piece to
get right.
Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.”
- Uber4
4
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
Disentangle ML Pipelines with a Feature Store
Raw/Structured Data
Feature Store
Feature Engineering Training
Models
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
A feature store is a central vault for storing documented, curated, and
access-controlled features.
The feature store is the interface between data engineering and data
model development
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 7 / 29
Dataset 1 Dataset 2 . . . Dataset n
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Feature
Engineering
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 8 / 29
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
Backfilling
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
Backfilling
Analysis
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
Backfilling
Analysis
Versioning
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
Backfilling
Analysis
Versioning
Documentation
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
What is a Feature?
A feature is a measurable property of some data-sample
A feature could be..
An aggregate value (min, max, mean, sum)
A raw value (a pixel, a word from a piece of text)
A value from a database table (the age of a customer)
A derived representation: e.g an embedding or a cluster
Features are the fuel for AI systems:




x1
...
xn




Features
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Model θ
ˆy
Prediction
L(y, ˆy)
Loss
Gradient
θL(y, ˆy)
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 10 / 29
Raw text
lower-case
& remove noise
tokenization lemmatization
words.txt group by post words_post.csv
word2vec TF-IDF LDAontology-matching
annotation with weak
supervision
Model
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 11 / 29
Raw text
Feature Store
TF-IDF word2vec LDA
weak
annotation
normalization
Model
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 12 / 29
How to Encourage Feature
Reusage?
Feature Marketplace
Feature Marketplace
Download
Features
Search
FeaturesFeatures
Publish
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 14 / 29
Feature Store API Service
from hops import featurestore
features_df =
featurestore.get_features([
"average_attendance",
"average_player_age"
])
Feature Relationships
Feature Groups
Shared Storage
Feature Store API Service
Feature
Metadata
in ML pipelines
Include features
Figure: Feature Store API Service
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 15 / 29
How to Store Datasets for Deep
Learning?
How to Store Datasets for Deep Learning?
Should be framework
agnostic
Need to be able to store
tensor datasets
Should support sharding for
distributed training
Advanced features:
row-predicate filtering, SQL
interface, columnar selection.
?
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 17 / 29
How to Store Datasets for Deep Learning?
HDF5 TFRecords
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 18 / 29
How to Store Datasets for Deep Learning?
Petastorm is a dataset
format designed for deep
learning
Petastorm stores data as
parquet files with extra
metadata to handle
multi-dimensional tensors
Petastorm contains readers
for the popular machine
learning frameworks such as
SparkML, Tensorflow,
PyTorch
Petastorm
5
5
Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG’s Data Access Library for
Deep Learning. https://eng.uber.com/petastorm/. 2018.Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 19 / 29
How to Serve Features for
Inference?
Delivering Features for Training and Serving is Different
Serving can require real-time
features
Ideally we want consistency
between real-time features
and batch features used for
training
Complex engineering
problem
Feature Store
Real-Time Features
Prediction
Inference Request
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 21 / 29
How to Implement a (batch)
Feature Store?
The Components of a Feature Store
The Storage Layer: For storing feature data in the feature store
The Metadata Layer: For storing feature metadata (versioning,
feature analysis, documentation, jobs)
The Feature Engineering Jobs: For computing features
The Feature Registry: A user interface to share and discover
features
The Feature Store API: For writing/reading to/from the feature
store
Feature Storage
Feature Metadata Jobs
Feature Registry API
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 23 / 29
Feature Storage
Feature
Computation
Raw/Structured Data
Data Lake
Feature
Group 1
Feature
Group 2
Feature
Group 3
Feature
Group 4
project_featurestore.db
Hive
Metastore
Foreign keys
Feature Storage
Feature Metadata Jobs
Feature Registry API
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 24 / 29
Feature Metadata
Feature
Computation
Raw/Structured Data
Data Lake
Feature
Group 1
Feature
Group 2
Feature
Group 3
Feature
Group 4
project_featurestore.db
Hive
Metastore
Featurestore
Metadata
Foreign keys
Foreign keys
Feature Storage
Feature Metadata Jobs
Feature Registry API
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 25 / 29
Feature Registry and API
Feature
Computation
Raw/Structured Data
Data Lake
Feature
Group 1
Feature
Group 2
Feature
Group 3
Feature
Group 4
project_featurestore.db
Hive
Metastore
Featurestore
Metadata
Foreign keys
Foreign keys
Hopsworks
Feature registry (UI)
REST API
Program APIs
Feature Storage
Feature Metadata Jobs
Feature Registry API
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 26 / 29
Feature
Computation
Raw/Structured Data
Data Lake
Feature Store
Curated Features
Model
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Demo-Setting
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 27 / 29
Summary
Machine learning comes with a high technical cost
Machine learning pipelines needs proper data management
A feature store is a place to store curated and documented features
The feature store serves as an interface between feature engineering
and model development, it can help disentangle complex ML pipelines
Hopsworks6 provides the world’s first open-source feature store
@hopshadoop
www.hops.io
@logicalclocks
www.logicalclocks.com
We are open source:
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
7
6
Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
7
Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou,
Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 28 / 29
References
Hopsworks’ feature store8 (the only open-source one!)
Uber’s feature store9
Airbnb’s feature store10
Comcast’s feature store11
GO-JEK’s feature store12
HopsML13
Hopsworks14
8
Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/. 2018.
9
Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd International
Conference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of Machine
Learning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL:
http://proceedings.mlr.press/v67/li17a.html.
10
Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform.
https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.
11
Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions.
https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-
predictions. 2018.
12
Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18.
https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.
13
Logical Clocks AB. HopsML: Python-First ML Pipelines.
https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.
14
Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 29 / 29
Backup Slides
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 30 / 29
Modeling Data in the Feature Store
A feature group is a logical grouping of features
Typically from the same input dataset and computed with the same job
A training dataset is a set of features suitable for a prediction task
Features in a training dataset are often from several feature groups
E.g features on customers, features on user activities, etc.
Training Datasets d
Feature groups g
Features f f1 f2 f3 f4 f5
g1
f6 f7 f8 f9 f10
g2
f11 f12
g3
d1 d2
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 31 / 29
Hopsworks Feature Store API Service
SQL
hops-util
Query Planner
Feature Store Data
Hive on HopsFS
Feature Store
Metadata
Dataframe With Features
Client Interface Feature Store Service Output
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 32 / 29
Training Pipeline in HopsML
1 Create job/notebook to compute features and publish to the feature
store
2 Create job/notebook to read features/labels and save to a training
dataset
3 Read the training dataset into your model for training
HopsFS
Data Lake
Hive
Feature store
Raw data
Feature computation
Feature store features
hops-util
hops-util-py
Basis features Training dataset Model
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 33 / 29
Hopsworks Feature Store API
Reading from the Feature Store:
from hops import featurestore
features_df = featurestore.get_features([
"average_attendance",
"average_player_age"
])
Writing to the Feature Store:
from hops import featurestore
raw_data = spark.read.parquet(filename)
pol_features = raw_data.map(lambda x: x^2)
featurestore.insert_into_featuregroup(pol_features , "pol_featuregroup")
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 34 / 29

Kim Hammar - Spotify ML Guild Meetup - Feature Stores

  • 1.
    Feature Store: themissing data layer in ML pipelines?1 Spotify ML Guild Fika Kim Hammar kim@logicalclocks.com February 26, 2019 1 Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29
  • 2.
    Model ϕ(x) Data Predictions Kim Hammar(Logical Clocks) Hopsworks Feature Store February 26, 2019 2 / 29
  • 3.
    Model ϕ(x) Data Predictions Distributed Training DataValidation Feature Engineering Data Collection Hardware Management HyperParameter Tuning Model Serving Pipeline Management A/B Testing Monitoring 2 2 Image inspired from Sculley et al. (Google) Hidden Technical Debt in Machine Learning Systems Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 3 / 29
  • 4.
    Outline 1 Hopsworks: Quickbackground of the platform 2 What is a Feature Store 3 Why You Need a Feature Store, Things to Consider: How to encourage feature reusage? How to store large-scale datasets for deep learning? How to serve features for inference? 4 How to Build a Feature Store (Hopsworks Feature Store Case Study) 5 Demo Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 4 / 29
  • 5.
    REST API Kafka TF Serving DataIngestion Data Prep Feature Store Training Serving Orchestration CPUs GPUs HopsML HopsYARN (fork of YARN) HopsFS (fork of HDFS) Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29
  • 6.
    REST API Kafka TF Serving DataIngestion Data Prep Feature Store Training Serving Orchestration CPUs GPUs HopsML HopsYARN (fork of YARN) HopsFS (fork of HDFS) Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29
  • 7.
    ϕ(x)     y1 ... yn         x1,1 . .. x1,n ... . . . ... xn,1 . . . xn,n     ˆy Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
  • 8.
    ϕ(x)     y1 ... yn         x1,1 . .. x1,n ... . . . ... xn,1 . . . xn,n     ˆy ? _ _("))_/ _ 3 Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
  • 9.
    ϕ(x)     y1 ... yn         x1,1 . .. x1,n ... . . . ... xn,1 . . . xn,n     ˆy ? _ _("))_/ _ “Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.” - Uber3 3 Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
  • 10.
    ϕ(x)     y1 ... yn         x1,1 . .. x1,n ... . . . ... xn,1 . . . xn,n     ˆyFeature Store “Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.” - Uber4 4 Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29
  • 11.
    Disentangle ML Pipelineswith a Feature Store Raw/Structured Data Feature Store Feature Engineering Training Models b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy A feature store is a central vault for storing documented, curated, and access-controlled features. The feature store is the interface between data engineering and data model development Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 7 / 29
  • 12.
    Dataset 1 Dataset2 . . . Dataset n b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Feature Engineering Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 8 / 29
  • 13.
    Dataset 1 Dataset2 . . . Dataset n Feature Store b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
  • 14.
    Dataset 1 Dataset2 . . . Dataset n Feature Store Backfilling b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
  • 15.
    Dataset 1 Dataset2 . . . Dataset n Feature Store Backfilling Analysis b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
  • 16.
    Dataset 1 Dataset2 . . . Dataset n Feature Store Backfilling Analysis Versioning b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
  • 17.
    Dataset 1 Dataset2 . . . Dataset n Feature Store Backfilling Analysis Versioning Documentation b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy≥ 0.9 < 0.9 ≥ 0.2 < 0.2 ≥ 11.2 < 11.2 B B A (−1, −1) (−8, −8)(−10, 0) (0, −10) 40 60 80 100 160 180 200 X Y Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29
  • 18.
    What is aFeature? A feature is a measurable property of some data-sample A feature could be.. An aggregate value (min, max, mean, sum) A raw value (a pixel, a word from a piece of text) A value from a database table (the age of a customer) A derived representation: e.g an embedding or a cluster Features are the fuel for AI systems:     x1 ... xn     Features b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Model θ ˆy Prediction L(y, ˆy) Loss Gradient θL(y, ˆy) Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 10 / 29
  • 19.
    Raw text lower-case & removenoise tokenization lemmatization words.txt group by post words_post.csv word2vec TF-IDF LDAontology-matching annotation with weak supervision Model Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 11 / 29
  • 20.
    Raw text Feature Store TF-IDFword2vec LDA weak annotation normalization Model Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 12 / 29
  • 21.
    How to EncourageFeature Reusage?
  • 22.
    Feature Marketplace Feature Marketplace Download Features Search FeaturesFeatures Publish KimHammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 14 / 29
  • 23.
    Feature Store APIService from hops import featurestore features_df = featurestore.get_features([ "average_attendance", "average_player_age" ]) Feature Relationships Feature Groups Shared Storage Feature Store API Service Feature Metadata in ML pipelines Include features Figure: Feature Store API Service Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 15 / 29
  • 24.
    How to StoreDatasets for Deep Learning?
  • 25.
    How to StoreDatasets for Deep Learning? Should be framework agnostic Need to be able to store tensor datasets Should support sharding for distributed training Advanced features: row-predicate filtering, SQL interface, columnar selection. ? Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 17 / 29
  • 26.
    How to StoreDatasets for Deep Learning? HDF5 TFRecords Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 18 / 29
  • 27.
    How to StoreDatasets for Deep Learning? Petastorm is a dataset format designed for deep learning Petastorm stores data as parquet files with extra metadata to handle multi-dimensional tensors Petastorm contains readers for the popular machine learning frameworks such as SparkML, Tensorflow, PyTorch Petastorm 5 5 Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning. https://eng.uber.com/petastorm/. 2018.Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 19 / 29
  • 28.
    How to ServeFeatures for Inference?
  • 29.
    Delivering Features forTraining and Serving is Different Serving can require real-time features Ideally we want consistency between real-time features and batch features used for training Complex engineering problem Feature Store Real-Time Features Prediction Inference Request b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 21 / 29
  • 30.
    How to Implementa (batch) Feature Store?
  • 31.
    The Components ofa Feature Store The Storage Layer: For storing feature data in the feature store The Metadata Layer: For storing feature metadata (versioning, feature analysis, documentation, jobs) The Feature Engineering Jobs: For computing features The Feature Registry: A user interface to share and discover features The Feature Store API: For writing/reading to/from the feature store Feature Storage Feature Metadata Jobs Feature Registry API Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 23 / 29
  • 32.
    Feature Storage Feature Computation Raw/Structured Data DataLake Feature Group 1 Feature Group 2 Feature Group 3 Feature Group 4 project_featurestore.db Hive Metastore Foreign keys Feature Storage Feature Metadata Jobs Feature Registry API Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 24 / 29
  • 33.
    Feature Metadata Feature Computation Raw/Structured Data DataLake Feature Group 1 Feature Group 2 Feature Group 3 Feature Group 4 project_featurestore.db Hive Metastore Featurestore Metadata Foreign keys Foreign keys Feature Storage Feature Metadata Jobs Feature Registry API Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 25 / 29
  • 34.
    Feature Registry andAPI Feature Computation Raw/Structured Data Data Lake Feature Group 1 Feature Group 2 Feature Group 3 Feature Group 4 project_featurestore.db Hive Metastore Featurestore Metadata Foreign keys Foreign keys Hopsworks Feature registry (UI) REST API Program APIs Feature Storage Feature Metadata Jobs Feature Registry API Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 26 / 29
  • 35.
    Feature Computation Raw/Structured Data Data Lake FeatureStore Curated Features Model b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Demo-Setting Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 27 / 29
  • 36.
    Summary Machine learning comeswith a high technical cost Machine learning pipelines needs proper data management A feature store is a place to store curated and documented features The feature store serves as an interface between feature engineering and model development, it can help disentangle complex ML pipelines Hopsworks6 provides the world’s first open-source feature store @hopshadoop www.hops.io @logicalclocks www.logicalclocks.com We are open source: https://github.com/logicalclocks/hopsworks https://github.com/hopshadoop/hops 7 6 Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018. 7 Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 28 / 29
  • 37.
    References Hopsworks’ feature store8(the only open-source one!) Uber’s feature store9 Airbnb’s feature store10 Comcast’s feature store11 GO-JEK’s feature store12 HopsML13 Hopsworks14 8 Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018. 9 Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd International Conference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of Machine Learning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL: http://proceedings.mlr.press/v67/li17a.html. 10 Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform. https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018. 11 Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions. https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to- predictions. 2018. 12 Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18. https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018. 13 Logical Clocks AB. HopsML: Python-First ML Pipelines. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018. 14 Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 29 / 29
  • 38.
    Backup Slides Kim Hammar(Logical Clocks) Hopsworks Feature Store February 26, 2019 30 / 29
  • 39.
    Modeling Data inthe Feature Store A feature group is a logical grouping of features Typically from the same input dataset and computed with the same job A training dataset is a set of features suitable for a prediction task Features in a training dataset are often from several feature groups E.g features on customers, features on user activities, etc. Training Datasets d Feature groups g Features f f1 f2 f3 f4 f5 g1 f6 f7 f8 f9 f10 g2 f11 f12 g3 d1 d2 Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 31 / 29
  • 40.
    Hopsworks Feature StoreAPI Service SQL hops-util Query Planner Feature Store Data Hive on HopsFS Feature Store Metadata Dataframe With Features Client Interface Feature Store Service Output Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 32 / 29
  • 41.
    Training Pipeline inHopsML 1 Create job/notebook to compute features and publish to the feature store 2 Create job/notebook to read features/labels and save to a training dataset 3 Read the training dataset into your model for training HopsFS Data Lake Hive Feature store Raw data Feature computation Feature store features hops-util hops-util-py Basis features Training dataset Model b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 33 / 29
  • 42.
    Hopsworks Feature StoreAPI Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features([ "average_attendance", "average_player_age" ]) Writing to the Feature Store: from hops import featurestore raw_data = spark.read.parquet(filename) pol_features = raw_data.map(lambda x: x^2) featurestore.insert_into_featuregroup(pol_features , "pol_featuregroup") Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 34 / 29