What is a Feature?
A feature may be a column in a Data Warehouse, but more
generally it is a measurable property of a phenomenon under
observation and (part of) an input to an ML model.
Features are often computed from raw or structured data sources:
•A raw value (a word, a pixel, a sound wave, a sensor value)
•An aggregate (mean, max, sum, min)
•A window (last_hour, last_day, etc.)
•A derived representation (embedding or cluster)
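As an illustration of aggregate and window features, here is a minimal sketch in plain Python. The sensor readings, the `window_features` helper, and the feature names are hypothetical and chosen only for this example; a real pipeline would more likely use a DataFrame or SQL engine.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical raw sensor readings: (timestamp, value) pairs.
readings = [
    (datetime(2019, 1, 1, 10, 0), 2.0),
    (datetime(2019, 1, 1, 11, 30), 4.0),
    (datetime(2019, 1, 1, 11, 45), 6.0),
]

def window_features(rows, now, window, name):
    """Aggregate the values whose timestamp falls in (now - window, now]."""
    values = [v for ts, v in rows if now - window < ts <= now]
    if not values:
        return {}
    return {
        f"mean_{name}": mean(values),
        f"max_{name}": max(values),
        f"min_{name}": min(values),
        f"sum_{name}": sum(values),
    }

now = datetime(2019, 1, 1, 12, 0)
features = {}
features.update(window_features(readings, now, timedelta(hours=1), "last_hour"))
features.update(window_features(readings, now, timedelta(days=1), "last_day"))
```

Each key in `features` (for example `mean_last_hour`) is one feature: an aggregate computed over a time window of the raw values.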
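Duplicated Feature Engineering
•Marketing, Research, and Analytics teams each run their own feature engineering.
Prevent Inconsistent Features: Training/Serving
•Feature implementations used for training and serving may not be consistent.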
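One common way to avoid training/serving inconsistency is to define each feature once and call that single definition from both paths. The sketch below assumes a hypothetical `trip_features` function and made-up trip data; it is not taken from any particular feature store.

```python
def trip_features(distance_km, duration_min):
    """Single shared definition of the features (hypothetical example)."""
    # Average speed in km/h; guard against a zero duration.
    speed = distance_km * 60 / duration_min if duration_min > 0 else 0.0
    return {"distance_km": distance_km, "avg_speed_kmh": speed}

# Training path: apply the definition to a batch of historical rows.
history = [(10.0, 30.0), (5.0, 10.0)]
training_rows = [trip_features(d, t) for d, t in history]

# Serving path: apply the same definition to one live request.
online_row = trip_features(10.0, 30.0)
```

Because both paths call the same function, a given input yields identical feature values at training time and at serving time.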
Features as first-class entities
•Features should be discoverable and reusable.
•Features should be access controlled,
versioned, and governed.
- Enable reproducibility.
•Ability to pre-compute and
automatically backfill features.
- Aggregates, embeddings - avoid expensive re-computation.
- On-demand computation of features should also be possible.
•The Feature Store should help “solve the data problem, so that Data
Scientists don’t have to.” [uber]
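To make "features as first-class entities" concrete, here is a toy in-memory registry that supports registration, versioning, and keyword discovery. The `FeatureRegistry` class and its methods are hypothetical names for this sketch and do not represent the Hopsworks API; access control and backfilling are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureRegistry:
    """Toy in-memory registry (illustrative only)."""
    _features: dict = field(default_factory=dict)

    def register(self, name, description, compute):
        """Add a new version of a feature and return its 1-based version."""
        versions = self._features.setdefault(name, [])
        versions.append({"description": description, "compute": compute})
        return len(versions)

    def search(self, keyword):
        """Discover features by name or latest description."""
        kw = keyword.lower()
        return sorted(
            name for name, versions in self._features.items()
            if kw in name.lower() or kw in versions[-1]["description"].lower()
        )

    def get(self, name, version=None):
        """Return the compute function for a version (latest by default)."""
        versions = self._features[name]
        return versions[(version or len(versions)) - 1]["compute"]

registry = FeatureRegistry()
v1 = registry.register("total_spend", "Sum of customer purchases",
                       lambda xs: sum(xs))
v2 = registry.register("total_spend", "Sum of purchases, refunds excluded",
                       lambda xs: sum(x for x in xs if x > 0))
```

Pinning a version, as in `registry.get("total_spend", 1)`, is what makes a training run reproducible even after the feature definition changes.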
Data Engineering meets Data Science
•Data Engineer
•Data Scientist: Browse & Select Features to create Train/Test Data
An ML Pipeline with the Feature Store
Components: Raw Data, Offline (Batch/Streaming) Feature Store, Online Feature Store
Pipeline:
1. Engineer Features
2. Create Training Data
3. Train Model
4. Deploy Model
a. Request Prediction
b. Get Online Features
Feature engineering jobs:
1. Register Feature Engineering Job
a. Get Feature Engineering Job, Model
b. Run Job
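The pipeline steps (engineer features, create training data, train, deploy, then fetch online features for each prediction request) can be sketched end to end as follows. Everything here is an assumption made for illustration: the stores are plain Python containers, the customer data is invented, and the "model" is a trivial threshold rather than a real learner.

```python
# Offline store: full feature history, used to build training data.
offline_store = []
# Online store: latest feature values per key, used at prediction time.
online_store = {}

def engineer_features(raw_rows):
    """Step 1: compute features and write them to both stores."""
    for row in raw_rows:
        avg = sum(row["purchases"]) / len(row["purchases"])
        offline_store.append(
            {"customer_id": row["customer_id"], "avg_spend": avg,
             "label": row["churned"]}
        )
        online_store[row["customer_id"]] = {"avg_spend": avg}

def create_training_data():
    """Step 2: read features and labels from the offline store."""
    X = [r["avg_spend"] for r in offline_store]
    y = [r["label"] for r in offline_store]
    return X, y

def train_model(X, y):
    """Step 3: toy model, a threshold halfway between the class means."""
    pos = [x for x, label in zip(X, y) if label]
    neg = [x for x, label in zip(X, y) if not label]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x < threshold  # predict churn when spend is low

raw = [
    {"customer_id": 1, "purchases": [10, 20], "churned": True},
    {"customer_id": 2, "purchases": [100, 200], "churned": False},
]
engineer_features(raw)
model = train_model(*create_training_data())

def predict(customer_id):
    """Step 4: a. receive a request, b. get online features, then score."""
    feats = online_store[customer_id]
    return model(feats["avg_spend"])
```

The request carries only the key (`customer_id`); the feature values come from the online store, so the caller does not need to recompute them.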
Known Feature Stores in Production
•Logical Clocks – Hopsworks (world’s first open source)
•Airbnb – Bighead/Zipline
•GO-JEK – Feast (GCE)
Summary and Roadmap
•Hopsworks is a new Data Platform with first-class support
for Python / Deep Learning / ML / Data Governance / GPUs
- Hopsworks has an open-source Feature Store
- Feature Store incremental updates with Hudi on Hive