1. Feature Store: the missing data layer in ML pipelines?1
Hopsworks Hands On - Palo Alto
Kim Hammar
kim@logicalclocks.com
April 23, 2019
1
Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 1 / 22
3. ϕ(x)
y1
...
yn
x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n
ˆy
?
_
_("))_/
_
2
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 2 / 22
4. ϕ(x)
y1
...
yn
x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n
ˆy
?
_
_("))_/
_
“Data is the hardest part of ML and the most important piece to
get right.
Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.”
- Uber2
2
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 2 / 22
5. ϕ(x)
y1
...
yn
x1,1 . . . x1,n
... . . .
...
xn,1 . . . xn,n
ˆyFeature Store
“Data is the hardest part of ML and the most important piece to
get right.
Modelers spend most of their time selecting and transforming
features at training time and then building the pipelines to deliver
those features to production models.”
- Uber3
3
Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.
https://eng.uber.com/scaling-michelangelo/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 2 / 22
6. Merging Our Data Intensive and Compute Intensive
Workloads
Compute IntensiveData Intensive
w(t)
Σ
Σ
Σ
Σ
σ
σ
σ
σ
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 3 / 22
7. Merging Our Data Intensive and Compute Intensive
Workloads
Compute IntensiveData Intensive
Feature Store
w(t)
Σ
Σ
Σ
Σ
σ
σ
σ
σ
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 3 / 22
8. Outline
1 What is a Feature Store
2 Why You Need a Feature Store
3 How to Build a Feature Store (Hopsworks Feature Store)
4 Demo
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 4 / 22
9. Solution: Disentangle ML Pipelines
with a Feature Store
Raw/Structured Data
Feature Store
Feature Engineering Training
Models
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
A feature store is a central vault for storing documented, curated, and
access-controlled features.
The feature store is the interface between data engineering and data
model development
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 5 / 22
10. Make ML-Features A First-Class Citizen in Your Data Lakes
Traditional Feature Engineering
N spark tasks Derived data Model Training
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Raw data lake
Feature Engineering w Feature Store
N spark tasks Derived data Model Training
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Raw data lake
Feature Store
Metadata
...
Make your features first-class citizens:
Document features
Version features
Invest in a data layer specifically for features (feature store)
Make features access-controlled and searchable
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 6 / 22
11. What is a Feature?
A feature is a measurable property of some data-sample
A feature could be..
An aggregate value (min, max, mean, sum)
A raw value (a pixel, a word from a piece of text)
A value from a database table (the age of a customer)
A derived representation: e.g an embedding or a cluster
Features are the fuel for AI systems:
x1
...
xn
Features
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Model θ
ˆy
Prediction
L(y, ˆy)
Loss
Gradient
θL(y, ˆy)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 7 / 22
12. Feature Engineering is Crucial for Model Performance
x1• • • • • •◦ ◦ ◦
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 8 / 22
13. Feature Engineering is Crucial for Model Performance
x1
x2
•
•
• •
•
•
◦
◦
◦
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 9 / 22
14. Feature Engineering is Crucial for Model Performance
x1
x2
•
•
• •
•
•
◦
◦
◦
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 10 / 22
15. Feature Engineering is Complex
Input Data
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
16. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
17. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
18. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
19. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Train Model
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
20. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Train Model
How do you make this scale?
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
21. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Train Model
How do you make this scale? How to manage the feature pipelines?
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
22. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Train Model
How do you make this scale? How to manage the feature pipelines?
How to make features reusable and robust?
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
23. Feature Engineering is Complex
Input Data
lorem
ipsum
dolor
sit
amet
ut adjective
verb
noun
stopword
noun
verb IP
NP
Det
lorem
N
N
ipsum
I
I
dolor
VP
V
V
sit
AP
Deg
amet
A
A
ut
lorem
ipsum
dolor
sit
amet
ut
[0.141, −0.715, 0.51, . . .]
[0.303, −0.56, 0.32, . . .]
[−0.071, 0.9, 0.77, . . .]
[0.877, −0.05, −0.81, . . .]
[0.642, 0.115, 0.19, . . .]
[0.141, −0.295, 0.51, . . .]
Embeddings
Linguistic Features
Feature Engineering
[0.141, 0.51, . . .]
Feature Matrix
Train Val Test
Train/Val/Test Split
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Train Model
How do you make this scale? How to manage the feature pipelines?
How to make features reusable and robust?
Feature Engineering is Complex Yet Crucial for Model Performance
Treat your features accordingly!
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 11 / 22
24. Feature Pipeline Jungles
Data Lake (Raw/Structured Data)
Feature Data (Derived Data)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 12 / 22
25. Disentangle Your ML Pipelines with a Feature Store
Data Sources Dataset 1 Dataset 2 . . . Dataset n
Feature Store
Feature Store
A data management platform for machine learning.
The interface between data engineering and data science.
Models
Models are trained using sets of features.
The features are fetched from the feature store
and can overlap between models.
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 13 / 22
26. Disentangle Your ML Pipelines with a Feature Store
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Backfilling
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 13 / 22
27. Disentangle Your ML Pipelines with a Feature Store
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Backfilling
Analysis
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 13 / 22
28. Disentangle Your ML Pipelines with a Feature Store
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Backfilling
Analysis
Versioning
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 13 / 22
29. Disentangle Your ML Pipelines with a Feature Store
Dataset 1 Dataset 2 . . . Dataset n
Feature Store
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy≥ 0.9 < 0.9
≥ 0.2
< 0.2
≥ 11.2
< 11.2
B B
A
(−1, −1) (−8, −8)(−10, 0) (0, −10)
40 60 80 100
160
180
200
X
Y
Backfilling
Analysis
Versioning
Documentation
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 13 / 22
30. High-Level APIs and Abstractions
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
featurestore.create_featuregroup(
f_df, "t_features",
description="...", version=2)
d_dir = featurestore.get_training_dataset_path(td_name)
tf_schema = featurestore.get_tf_record_schema(td_name)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 14 / 22
31. High-Level APIs and Abstractions
Read from the feature store
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
featurestore.create_featuregroup(
f_df, "t_features",
description="...", version=2)
d_dir = featurestore.get_training_dataset_path(td_name)
tf_schema = featurestore.get_tf_record_schema(td_name)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 14 / 22
32. High-Level APIs and Abstractions
Read from the feature store
Write to the feature store
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
featurestore.create_featuregroup(
f_df, "t_features",
description="...", version=2)
d_dir = featurestore.get_training_dataset_path(td_name)
tf_schema = featurestore.get_tf_record_schema(td_name)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 14 / 22
33. High-Level APIs and Abstractions
Read from the feature store
Write to the feature store
Metadata operations
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
featurestore.create_featuregroup(
f_df, "t_features",
description="...", version=2)
d_dir = featurestore.get_training_dataset_path(td_name)
tf_schema = featurestore.get_tf_record_schema(td_name)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 14 / 22
34. Existing Feature Stores
Uber’s feature store4
Airbnb’s feature store5
Comcast’s feature store6
Facebook’s feature store7
GO-JEK’s feature store8
Twitter’s feature store9
Branch International’s feature store10
Hopsworks’ feature store11 (the only open-source one!)
4
Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd International
Conference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of Machine
Learning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL:
http://proceedings.mlr.press/v67/li17a.html.
5
Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform.
https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.
6
Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions.
https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-
predictions. 2018.
7
Kim Hazelwood et al. “Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective”. In:
Feb. 2018, pp. 620–629. DOI: 10.1109/HPCA.2018.00059.
8
Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18.
https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.
9Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 15 / 22
35. The Components of a Feature Store
Feature Storage
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 16 / 22
36. The Components of a Feature Store
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 16 / 22
37. The Components of a Feature Store
Client Interface
API
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
Feature Registry
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 16 / 22
38. Hopsworks Feature Store
Feature Storage
HopsFS
Feature groups Training Datasets
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 17 / 22
39. Hopsworks Feature Store
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
HopsFS
Feature groups Training Datasets
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 17 / 22
40. Hopsworks Feature Store
REST API
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
HopsFS
Feature groups Training Datasets
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 17 / 22
41. Hopsworks Feature Store
Client Interface
API
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
Hopsworks
Feature Registry
REST API
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
HopsFS
Feature groups Training Datasets
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 17 / 22
42. Hopsworks Feature Store
Client Interface
API
from hops import featurestore
features_df = featurestore.get_features(
[
"average_attendance",
"average_player_age"
])
Hopsworks
Feature Registry
REST API
Feature Metadata
Schema Statistics Documentation Jobs
Feature Storage
HopsFS
Feature groups Training Datasets
e1
e2
e3
e4
Gradient
Gradient
Gradient
Gradient
Model Training/Serving
Feature Engineering
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 17 / 22
43. Hopsworks Feature Store
Data Sources
HopsFS
Feature Creation Feature Storage
FeatureStoreAPI
Models
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 18 / 22
44. Feature Creation Example
[in:5, out:10.5, . . .]
Transactions
Spark
sum_trx_amount
from hops import featurestore
trx_df = spark.read.parquet(..)
trx_sum_amount_df = trx_df.select("amount,customer")
.groupBy("customer")
.agg(sum("amount"))
featurestore.create_featuregroup(
trx_sum_amount_df ,
"trx_sum_amount",
description="sum of transactions"
)
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 19 / 22
46. Summary
Machine learning comes with a high technical cost
Machine learning pipelines needs proper data management
A feature store is a place to store curated and documented features
The feature store serves as an interface between feature engineering
and model development, it can help disentangle complex ML pipelines
Hopsworks12 provides the world’s first open-source feature store
@hopshadoop
www.hops.io
@logicalclocks
www.logicalclocks.com
We are open source:
https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
13
12
Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
13
Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou,
Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, Alex Ormenisan, and
Rasmus Toivonen
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 21 / 22
47. References
Hopsworks’ feature store14
HopsML15
Hopsworks16
14
Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/. 2018.
15
Logical Clocks AB. HopsML: Python-First ML Pipelines.
https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.
16
Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store April 23, 2019 22 / 22