Introducing the Feature Store in Hopsworks
Bay Area AI Meetup @ Mesosphere
March 5th, 2019
CEO @ Logical Clocks
Assoc Prof @ KTH
1. What is a Feature Store and why do you need one?
2. The Hopsworks Feature Store
Become a Data Scientist!
Eureka! This will give a 12% increase in the efficiency of this wind farm!
Data Scientists are not Data Engineers
HDFS GCS Storage CosmosDB
How do I find features in this sea of data sources?
This tastes like dairy in my Latte!
What is a Feature?
A feature is a measurable property of a phenomenon under observation
•A raw word, a pixel, a sensor value
•A column in a datastore
•An aggregate (mean, max, sum, min)
•A derived representation
(embedding or cluster)
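As a rough illustration of these kinds of feature (a raw value, aggregates such as mean/max/sum/min, and a derived representation), here is a minimal Python sketch. The sensor readings, the cluster centres, and the helper nearest_cluster are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical wind-speed readings (m/s) from one turbine sensor.
readings = [4.0, 6.0, 5.0, 9.0]

# 1. Raw feature: a single sensor value.
latest_reading = readings[-1]

# 2. Aggregate features computed over the window of readings.
aggregates = {
    "mean": mean(readings),
    "max": max(readings),
    "sum": sum(readings),
    "min": min(readings),
}

# 3. Derived representation: the index of the nearest cluster centre.
#    The centres are assumed to have been learned elsewhere.
centres = [3.0, 6.0, 10.0]


def nearest_cluster(value, cluster_centres):
    """Return the index of the centre closest to value."""
    return min(range(len(cluster_centres)),
               key=lambda i: abs(value - cluster_centres[i]))


cluster_id = nearest_cluster(aggregates["mean"], centres)
```

All three results (latest_reading, the entries of aggregates, and cluster_id) could be stored as columns and served to a model as features.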
Data Science with the Feature Store
HDFS GCS Storage CosmosDB
Feature Warehouse Store
Feature Pipelines (Select, Transform, Aggregate, ..)
Now, I can change the world - one click-through at a time.
Features need to be first-class entities
•Features should be discoverable and reusable.
•Features should be access controlled,
versioned, and governed.
- Enable reproducibility.
•Ability to pre-compute and
automatically backfill features.
- Aggregates, embeddings - avoid expensive re-computation.
- On-demand computation of features should also be possible.
•The Feature Store should help “solve the data problem, so that Data
Scientists don’t have to.” [uber]
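One way to picture features as first-class entities is a small registry that records versioned definitions, pre-computes (backfills) values, and falls back to on-demand computation. The sketch below is an in-memory illustration only: FeatureDefinition, FeatureRegistry, and the feature avg_wind_speed are hypothetical names, not the Hopsworks API, and access control is left out.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FeatureDefinition:
    name: str
    version: int
    description: str
    compute: Callable[[List[float]], float]


@dataclass
class FeatureRegistry:
    definitions: Dict[Tuple[str, int], FeatureDefinition] = field(default_factory=dict)
    values: Dict[Tuple[str, int, str], float] = field(default_factory=dict)
    compute_calls: int = 0  # counts how often a feature was actually computed

    def register(self, definition: FeatureDefinition) -> None:
        # Versions are immutable: re-registering the same version is an error.
        key = (definition.name, definition.version)
        if key in self.definitions:
            raise ValueError(f"{definition.name} v{definition.version} already exists")
        self.definitions[key] = definition

    def search(self, keyword: str) -> List[Tuple[str, int]]:
        # Discovery: match the keyword against names and descriptions.
        return sorted(key for key, d in self.definitions.items()
                      if keyword in d.name or keyword in d.description)

    def _compute(self, name: str, version: int, day: str, rows: List[float]) -> float:
        value = self.definitions[(name, version)].compute(rows)
        self.values[(name, version, day)] = value
        self.compute_calls += 1
        return value

    def backfill(self, name: str, version: int, history: Dict[str, List[float]]) -> None:
        # Pre-compute the feature for every historical partition.
        for day, rows in history.items():
            self._compute(name, version, day, rows)

    def get(self, name: str, version: int, day: str,
            rows: Optional[List[float]] = None) -> float:
        key = (name, version, day)
        if key in self.values:
            return self.values[key]  # reuse the pre-computed value
        if rows is None:
            raise KeyError(f"no stored value for {key} and no rows to compute from")
        return self._compute(name, version, day, rows)  # on-demand computation


registry = FeatureRegistry()
registry.register(FeatureDefinition("avg_wind_speed", 1, "daily mean wind speed", mean))
registry.register(FeatureDefinition(
    "avg_wind_speed", 2, "daily mean wind speed, highest reading dropped",
    lambda rows: mean(sorted(rows)[:-1])))

history = {"2019-03-01": [4.0, 6.0, 8.0], "2019-03-02": [5.0, 5.0, 11.0]}
registry.backfill("avg_wind_speed", 1, history)

v1_value = registry.get("avg_wind_speed", 1, "2019-03-01")                     # stored
v2_value = registry.get("avg_wind_speed", 2, "2019-03-02", [5.0, 5.0, 11.0])  # on demand
```

Keeping the version in every key means a model trained on version 1 can be reproduced later, even after version 2 changes how the feature is computed.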
Hopsworks’ Feature Store
- Reusability of features
between models and teams
- Automatic backfilling of features
- Automatic feature
documentation and analysis
- Feature versioning
- Standardized access of
features between training and serving
- Feature discovery
- Access control for features
There are other advantages to the Feature Store …
Marketing Research Analytics
Prevent Duplicated Feature Engineering
Prevent Inconsistent Features - Training/Serving
Features computed for training and for serving may not be consistent
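A common way to avoid training/serving skew is to keep a single definition of each feature and call it from both code paths. The sketch below assumes a hypothetical click-through-rate feature; the function names and the sample data are illustrative only.

```python
from typing import Dict, List


def click_through_rate(clicks: int, impressions: int) -> float:
    """Single shared definition of the feature."""
    return clicks / impressions if impressions > 0 else 0.0


def training_rows(log: List[Dict[str, int]]) -> List[float]:
    # Batch path: compute the feature over historical log records.
    return [click_through_rate(r["clicks"], r["impressions"]) for r in log]


def serving_vector(request: Dict[str, int]) -> List[float]:
    # Online path: compute the same feature for one live request.
    return [click_through_rate(request["clicks"], request["impressions"])]


historical_log = [{"clicks": 3, "impressions": 12}, {"clicks": 0, "impressions": 0}]
live_request = {"clicks": 3, "impressions": 12}

train = training_rows(historical_log)
serve = serving_vector(live_request)
```

Because both paths call click_through_rate, a change to the definition (for example, how zero impressions are handled) reaches training and serving together, and the same function is reused rather than re-implemented by each team.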
Known Feature Stores in Production
•Logical Clocks – Hopsworks (open source)
•Airbnb – Bighead/Zipline
•GO-JEK – Feast (open source on GCE)
The API Between Data Science and Data Engineering
Replicated Conda Environments
•Every project can create its own conda environment,
replicated at all hosts in the cluster
-Base environments for Python2 and Python3 are mostly adequate
•Hopsworks replicates each project's conda command log consistently
to all hosts in the cluster using a local agent
Host A Host B
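The idea of replicating a conda command log can be sketched as an append-only list of commands that an agent on each host replays in order, remembering how far it has got. This is a simulation under stated assumptions: HostAgent, the command tuples, and the package versions are hypothetical, and no real conda commands are executed.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, str]  # (operation, package, version)


class HostAgent:
    """Hypothetical per-host agent that replays a project's command log."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.applied = 0                       # number of log entries already replayed
        self.environment: Dict[str, str] = {}  # package -> installed version

    def sync(self, log: List[Command]) -> int:
        # Replay only the entries this host has not seen yet, in log order.
        pending = log[self.applied:]
        for operation, package, version in pending:
            if operation == "install":
                self.environment[package] = version
            elif operation == "uninstall":
                self.environment.pop(package, None)
            else:
                raise ValueError(f"unknown operation: {operation}")
        self.applied = len(log)
        return len(pending)


command_log: List[Command] = [
    ("install", "numpy", "1.16.2"),
    ("install", "pandas", "0.24.1"),
]
host_a, host_b = HostAgent("Host A"), HostAgent("Host B")

host_a.sync(command_log)  # Host A is up to date; Host B has not synced yet
command_log.append(("uninstall", "pandas", ""))
command_log.append(("install", "tensorflow", "1.12.0"))

replayed_on_a = host_a.sync(command_log)  # replays only the two new entries
replayed_on_b = host_b.sync(command_log)  # replays the whole log
```

Since every host applies the same commands in the same order, a host that joins late or misses a round still ends up with the same environment as the others.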
Building Blocks of a Feature Store
•The Feature Store API: For
writing/reading to/from the feature store
•The Feature Registry: A user interface
to share and discover features
•The Metadata Layer: For storing
feature metadata (versioning, feature
analysis, documentation, jobs)
•The Feature Engineering Jobs: For
computing features
•The Storage Layer: For storing feature
data in the feature store
Feature Metadata Jobs
Feature Registry API
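How these building blocks fit together can be sketched with a toy in-memory store: a write/read API on top of a storage layer and a metadata layer, plus a keyword lookup standing in for the registry. MiniFeatureStore and all of its method names are hypothetical and are not the Hopsworks API; the feature groups and job names are made up for illustration.

```python
from typing import Dict, List


class MiniFeatureStore:
    def __init__(self) -> None:
        # Storage layer: feature group -> entity id -> feature -> value
        self.storage: Dict[str, Dict[str, Dict[str, float]]] = {}
        # Metadata layer: feature group -> description, producing job, feature names
        self.metadata: Dict[str, dict] = {}

    def write_featuregroup(self, name: str, rows: Dict[str, Dict[str, float]],
                           description: str, job: str) -> None:
        # API (write path): called by a feature engineering job.
        self.storage[name] = rows
        features = sorted({f for row in rows.values() for f in row})
        self.metadata[name] = {"description": description, "job": job,
                               "features": features}

    def find(self, keyword: str) -> List[str]:
        # Registry: discover feature groups by keyword.
        return sorted(n for n, m in self.metadata.items()
                      if keyword in n or keyword in m["description"]
                      or any(keyword in f for f in m["features"]))

    def _group_of(self, feature: str) -> str:
        for name, meta in self.metadata.items():
            if feature in meta["features"]:
                return name
        raise KeyError(f"unknown feature: {feature}")

    def read_features(self, features: List[str]) -> Dict[str, Dict[str, float]]:
        # API (read path): locate each feature via metadata, then join on entity id.
        joined: Dict[str, Dict[str, float]] = {}
        for feature in features:
            for entity, row in self.storage[self._group_of(feature)].items():
                if feature in row:
                    joined.setdefault(entity, {})[feature] = row[feature]
        # Inner join: keep only entities that have every requested feature.
        wanted = len(set(features))
        return {e: r for e, r in joined.items() if len(r) == wanted}


store = MiniFeatureStore()
store.write_featuregroup(
    "turbine_stats",
    {"t1": {"avg_wind_speed": 6.0}, "t2": {"avg_wind_speed": 7.0}},
    description="daily wind aggregates", job="compute_turbine_stats")
store.write_featuregroup(
    "turbine_output",
    {"t1": {"power_kw": 120.0}},
    description="power output per turbine", job="compute_turbine_output")

training_data = store.read_features(["avg_wind_speed", "power_kw"])
matches = store.find("wind")
```

The caller asks for features by name and never needs to know which feature group, job, or storage location produced them; that lookup is what the metadata layer provides.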
Summary and Roadmap
•Hopsworks is a new Data Platform with first-class support
for Python / Deep Learning / ML / Data Governance / GPUs
-Hopsworks has an open-source Feature Store
-Online Feature Store
-Feature Transformation Library/DSL
-Automated Data Provenance
-Feature Store Incremental Updates with Hudi on Hive