4. Become a Data Scientist!
Eureka! This will give a 12% increase in the efficiency of this wind farm!
5. Data Scientists are not Data Engineers
HDFS, GCS, Storage, CosmosDB
How do I find features in this sea of data sources?
This tastes like dairy in my Latte!
6. Data Science with the Feature Store
HDFS, GCS, Storage, CosmosDB
Feature Store
Feature Pipelines (Select, Transform, Aggregate, ..)
Now, I can change the world - one click-through at a time.
8. What is a Feature?
A feature may be a column in a Data Warehouse, but more
generally it is a measurable property of a phenomenon under
observation and (part of) an input to an ML model.
Features are often computed from raw or structured data sources:
• A raw word, a pixel, a sound wave, a sensor value;
• An aggregate (mean, max, sum, min)
• A window (last_hour, last_day, etc)
• A derived representation (embedding or cluster)
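The aggregate and window features listed above can be sketched in a few lines of plain Python. This is only an illustration: the sensor readings, the feature names, and the `window` helper are hypothetical and are not taken from the talk or from any Feature Store API.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical raw sensor values: (timestamp, wind speed in m/s).
readings = [
    (datetime(2019, 1, 1, 10, 5), 4.0),
    (datetime(2019, 1, 1, 11, 15), 6.0),
    (datetime(2019, 1, 1, 11, 45), 8.0),
    (datetime(2019, 1, 1, 11, 55), 10.0),
]

def window(rows, end, length):
    """Return the values whose timestamp falls in (end - length, end]."""
    start = end - length
    return [value for ts, value in rows if start < ts <= end]

now = datetime(2019, 1, 1, 12, 0)
last_hour = window(readings, now, timedelta(hours=1))
last_day = window(readings, now, timedelta(days=1))

features = {
    "wind_speed_raw": readings[-1][1],                # a raw sensor value
    "wind_speed_mean_last_hour": mean(last_hour),     # aggregate over a window
    "wind_speed_max_last_day": max(last_day),
    "wind_speed_sum_last_day": sum(last_day),
}
```

Each entry in `features` is one measurable property that could become (part of) a model input.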
9. Duplicated Feature Engineering
Marketing: Bert Features
Research: Bert Features
Analytics: Bert Features
DUPLICATED
10. Prevent Inconsistent Features - Training/Serving
Feature implementations may not be consistent - correctness problems!
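One common way to avoid this kind of training/serving skew is to keep a single feature definition and call it from both paths. The sketch below assumes a hypothetical `log_amount` feature and hypothetical helper names; it shows the idea only and is not how any particular Feature Store implements it.

```python
import math

def log_amount(raw_amount: float) -> float:
    """The one shared definition of the feature (hypothetical example)."""
    return math.log1p(max(raw_amount, 0.0))

def build_training_rows(raw_rows):
    """Training path: compute the feature for a batch of historical rows."""
    return [[log_amount(row["amount"])] for row in raw_rows]

def build_serving_vector(request):
    """Serving path: compute the same feature for a single live request."""
    return [log_amount(request["amount"])]

training_rows = build_training_rows([{"amount": 10.0}, {"amount": -3.0}])
serving_vector = build_serving_vector({"amount": 10.0})
```

Because both paths go through `log_amount`, a change to the feature logic cannot apply to one path and not the other.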
11. Features as first-class entities
• Features should be discoverable and reused.
• Features should be access controlled, versioned, and governed.
- Enable reproducibility.
• Ability to pre-compute and automatically backfill features.
- Aggregates, embeddings - avoid expensive re-computation.
- On-demand computation of features should also be possible.
• The Feature Store should help “solve the data problem, so that Data Scientists don’t have to.” [uber]
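To make these requirements concrete, here is a minimal in-memory sketch of a registry in which features are versioned, searchable, backfilled ahead of time, and still computable on demand. The `FeatureDefinition` and `FeatureRegistry` classes are hypothetical names invented for this illustration; access control and governance are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

@dataclass
class FeatureDefinition:
    name: str
    version: int
    description: str
    compute: Callable[[dict], float]
    # Pre-computed (backfilled) values, keyed by entity id.
    values: Dict[str, float] = field(default_factory=dict)

class FeatureRegistry:
    def __init__(self):
        self._features: Dict[Tuple[str, int], FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        key = (feature.name, feature.version)
        if key in self._features:
            # Versions are immutable, which keeps training runs reproducible.
            raise ValueError(f"{feature.name} v{feature.version} already exists")
        self._features[key] = feature

    def search(self, keyword: str):
        """Discover features by keyword in their name or description."""
        kw = keyword.lower()
        return sorted(
            key for key, f in self._features.items()
            if kw in f.name.lower() or kw in f.description.lower()
        )

    def backfill(self, name: str, version: int, history: Dict[str, dict]) -> None:
        """Pre-compute the feature for historical rows to avoid re-computation."""
        feature = self._features[(name, version)]
        for entity_id, row in history.items():
            feature.values[entity_id] = feature.compute(row)

    def get(self, name: str, version: int, entity_id: str,
            row: Optional[dict] = None) -> float:
        feature = self._features[(name, version)]
        if entity_id in feature.values:
            return feature.values[entity_id]      # pre-computed value
        if row is None:
            raise KeyError(entity_id)
        return feature.compute(row)               # on-demand computation

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    "avg_basket", 1, "Mean basket value per customer",
    lambda row: sum(row["baskets"]) / len(row["baskets"]),
))
registry.backfill("avg_basket", 1, {"c1": {"baskets": [10.0, 20.0]}})
```

Here `registry.get("avg_basket", 1, "c1")` reads the backfilled value, while passing a `row` for an unseen customer falls back to on-demand computation.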
12. Data Engineering meets Data Science
Data Engineer: Add/Remove Features in the Feature Store
Data Scientist: Browse & Select Features to create Train/Test Data
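The Data Scientist's side of this hand-off (select features, then create train/test data) might look roughly like the sketch below. The `feature_store` dictionary, the feature names, and both helper functions are hypothetical stand-ins; a real Feature Store would expose its own client for this.

```python
import random

# Hypothetical features the Data Engineer has added, keyed by customer id.
feature_store = {
    "avg_basket": {"c1": 15.0, "c2": 4.0, "c3": 9.5, "c4": 22.0},
    "visits_last_day": {"c1": 3, "c2": 1, "c3": 0, "c4": 7},
    "churned": {"c1": 0, "c2": 1, "c3": 1, "c4": 0},
}

def select_features(store, feature_names, label_name):
    """Join the chosen features on entity id; keep ids present in every column."""
    columns = feature_names + [label_name]
    ids = set.intersection(*(set(store[c]) for c in columns))
    return [
        ([store[f][i] for f in feature_names], store[label_name][i])
        for i in sorted(ids)
    ]

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle deterministically, then split into train and test rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

dataset = select_features(feature_store, ["avg_basket", "visits_last_day"], "churned")
train, test = train_test_split(dataset)
```

The Data Scientist only names the features; where and how they were computed stays on the Data Engineer's side.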
13. An ML Pipeline with the Feature Store
Structured & Raw Data -> Register Feature and its Job/Data -> Feature Store
Feature Store -> Select Features and generate Train/Test Data -> Train Model -> Validate Models, Deploy -> Serve Model
Online Features -> Serve Model
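The last stage of the pipeline, serving a model with online features, can be sketched as a key-value lookup followed by a prediction. Everything here is an assumption made for illustration: the `online_features` dictionary, the `FEATURE_ORDER` list, and the fixed-weight `model` stand-in are not part of the talk.

```python
# Hypothetical online feature store: latest feature values per customer id.
online_features = {
    "c1": {"avg_basket": 15.0, "visits_last_day": 3},
}

# The order must match the order used when the training data was generated.
FEATURE_ORDER = ["avg_basket", "visits_last_day"]

def model(vector):
    """Stand-in for a trained, deployed model: a fixed linear score."""
    weights = [0.25, 0.5]  # illustrative values, not learned
    return sum(w * x for w, x in zip(weights, vector))

def serve(customer_id):
    """Enrich the request with online features, then call the model."""
    row = online_features[customer_id]
    vector = [row[name] for name in FEATURE_ORDER]
    return model(vector)

score = serve("c1")
```

The request carries only the customer id; the feature values come from the store rather than from the caller.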
14. Offline (Batch/Streaming) Feature Store
Data Lake -> Offline Feature Store -> Training Job; Batch or Streaming Inference
1. Register Feature Engineering Job, copy Feature Data
2. Create Training Data and Train
3. Save Model
a. Get Feature Engineering Job, Model, Conda Environment
b. Run Job
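The numbered steps can be read as: a saved model keeps a pointer to the feature engineering job that produced its inputs, so an inference job can fetch both and rerun the same job on new data. The sketch below assumes hypothetical `FeatureJob` and `SavedModel` records and a placeholder environment name; training itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FeatureJob:
    name: str
    environment: str  # placeholder for the name of a Conda environment
    run: Callable[[List[dict]], List[List[float]]]

@dataclass
class SavedModel:
    name: str
    feature_job: str  # name of the job that produced this model's features
    predict: Callable[[List[float]], float]

jobs: Dict[str, FeatureJob] = {}
models: Dict[str, SavedModel] = {}

# 1. Register the feature engineering job.
jobs["scale_power"] = FeatureJob(
    "scale_power", "py36-env",
    lambda rows: [[row["kw"] / 1000.0] for row in rows],
)

# 2-3. Train (omitted here) and save the model with a pointer to its job.
models["turbine"] = SavedModel("turbine", "scale_power", lambda v: 2.0 * v[0])

def batch_inference(model_name, raw_rows):
    model = models[model_name]        # a. get the model and its job
    job = jobs[model.feature_job]     #    (and the job's environment)
    features = job.run(raw_rows)      # b. run the same job on new data
    return job.environment, [model.predict(v) for v in features]

environment, predictions = batch_inference(
    "turbine", [{"kw": 500.0}, {"kw": 2000.0}]
)
```

Reusing the registered job for inference means the new data goes through exactly the transformation the model was trained on.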
31. Summary and Roadmap
• Hopsworks is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
- Hopsworks has an open-source Feature Store
• Ongoing Work
- Data Provenance
- Feature Store Incremental Updates with Hudi on Hive