Managed Feature Store for Machine Learning

Managed Feature Store for ML Webinar
Presenter: Dr. Jim Dowling, CEO / Co-Founder, Logical Clocks

Leadership & Offices
Offices:
● Stockholm: Box 1263, Isafjordsgatan 22, Kista, Sweden
● London: IDEALondon, 69 Wilson St, London, UK
● Silicon Valley: 470 Ramona St, Palo Alto, California, USA
Leadership:
● Dr. Jim Dowling, CEO
● Theo Kakantousis, COO
● Prof. Seif Haridi, Chief Scientist
● Fabio Buso, VP Engineering
● Steffen Grohsschmiedt, Head of Cloud
● Shraddha Chouhan, Head of Marketing
www.logicalclocks.com

Hopsworks - Award Winning Platform

Today’s Journey to a Feature Store and Beyond
1. Ad-hoc Scripts and Jobs
2. Shared Feature Pipelines
3. Feature Store
4. MLOps with a Feature Store

Known Feature Stores in Production
● Logical Clocks – Hopsworks (world’s first open source)
● Uber Michelangelo
● Airbnb – Bighead/Zipline
● Comcast
● Twitter
● GO-JEK Feast (GCE, open-source layer over BigTable/BigQuery)
● Branch
● Conde Nast
● Facebook FB Learner
● Netflix
Reference: www.featurestore.org

What is a Feature?
A feature is a measurable property of a phenomenon under observation and
(part of) an input to an ML model.
Example features:
● A raw word, a pixel, a sound wave, a sensor value
● An aggregate (mean, max, sum, min)
● A window (last_hour, last_day, etc)
● A derived representation (embedding or cluster)
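As a minimal sketch of the aggregate and window features listed above, the following plain-Python code computes mean, max, min and sum over a one-hour window. The `readings` data and the `window_aggregates` helper are hypothetical and only serve as illustration.

```python
from statistics import mean

# Hypothetical raw data: (timestamp in seconds, sensor value) pairs.
readings = [(0, 2.0), (3700, 4.0), (5400, 6.0), (7200, 8.0)]

def window_aggregates(rows, now, window_secs):
    """Aggregate the values whose timestamp falls in (now - window_secs, now]."""
    values = [v for ts, v in rows if now - window_secs < ts <= now]
    if not values:
        return None  # no raw data in the window, so no feature value
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
    }

# "last_hour" features as seen at t = 7200 seconds.
last_hour = window_aggregates(readings, now=7200, window_secs=3600)
```

Each key of `last_hour` is one feature; the reading at t = 0 is outside the window and does not contribute.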

A Data Engineer’s perspective on Feature Engineering
[Figure: Databases and Schemas (varchar, charsets, integer, blob, varbinary) on one side; numbers and arrays of numbers on the other, with one-hot encoding as an example transformation between them.]
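One-hot encoding, named on this slide, turns a categorical database column into arrays of numbers. A minimal sketch in plain Python, assuming a hypothetical `sex_column` of string values (the helper name is also an assumption):

```python
def one_hot_encode(values):
    """Map each categorical value to a list containing a single 1."""
    categories = sorted(set(values))  # deterministic column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical varchar column as it might be read from a database table.
sex_column = ["male", "female", "female", "male"]
categories, encoded = one_hot_encode(sex_column)
```

The returned `categories` list records which position corresponds to which value, which is needed to apply the same encoding again at inference time.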

Feature Engineering is about Transforming Data
# Normalize: scale each feature vector to unit L1 norm
from pyspark.ml.feature import Normalizer
# `spark` is an existing SparkSession
scaledDF = spark.read.parquet("…")
l1_norm = Normalizer().setP(1.0).setInputCol("features").setOutputCol("l1_norm")
normalizedDF = l1_norm.transform(scaledDF)
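To make the transformation concrete without a Spark cluster, here is a plain-Python sketch of what normalizing a vector to unit L1 norm (p = 1) means: every entry is divided by the sum of the absolute values. The `l1_normalize` helper is hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so the absolute values of its entries sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        return list(vector)  # sketch choice: leave an all-zero vector as-is
    return [x / norm for x in vector]

row = [1.0, -2.0, 1.0]
normalized = l1_normalize(row)
```

Signs are preserved; only the scale of the vector changes.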

Consistent Features between Training and Inference
It’s not always trivial to ensure features are engineered
consistently between training and inference
[Figure: Training takes Features and Labels and produces a Model; Inference takes Features and a Model and produces Labels.]
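One common way to keep features consistent is to route both paths through a single transformation function. The sketch below assumes hypothetical record fields (`balance`, `sex`, `pclass`, `survived`) and hypothetical helper names; it illustrates the idea and is not the Hopsworks implementation.

```python
import math

def engineer_features(raw):
    """Single source of truth for turning a raw record into model features."""
    return [
        math.log1p(raw["balance"]),              # compress a skewed numeric value
        1.0 if raw["sex"] == "female" else 0.0,  # binary encoding
        float(raw["pclass"]),
    ]

def build_training_rows(records):
    # Training path: features plus the known label.
    return [(engineer_features(r), r["survived"]) for r in records]

def build_inference_row(record):
    # Inference path: the same function, with no label available yet.
    return engineer_features(record)

history = [{"balance": 0.0, "sex": "female", "pclass": 1, "survived": 1}]
train_rows = build_training_rows(history)
live_row = build_inference_row({"balance": 0.0, "sex": "female", "pclass": 1})
```

Because both paths call `engineer_features`, a change to the feature logic cannot reach training without also reaching inference.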

Feature Store – Reuse Cached Features
[Figure: One Feature Pipeline writes to the Feature Store. Training (Features + Labels produce a Model) and Inference (Features + Model produce Labels) both get their Features from the Feature Store.]

Feature Store Concepts
● Feature Groups: Titanic Passenger List (name, Pclass, Sex, Survive) and Passenger Bank Account (Name, Balance)
● Features: name, Pclass, Sex, Survive, Balance
● Train / Test Datasets: features from several Feature Groups combined on a join key, e.g. Survive, name, PClass, Sex, Balance
● File format: .tfrecords, .npy, .csv, .hdf5, .petastorm, etc.
● Storage: GCS, Amazon S3, HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned.
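The join-key idea can be sketched in plain Python: two feature groups keyed on the same column are inner-joined, and a subset of features is selected for the training dataset. The sample rows and the `join_feature_groups` helper are hypothetical.

```python
# Two hypothetical feature groups, each keyed on the passenger name.
passenger_list = {
    "Alice": {"Pclass": 1, "Sex": "female", "Survived": 1},
    "Bob": {"Pclass": 3, "Sex": "male", "Survived": 0},
}
bank_account = {
    "Alice": {"Balance": 1200.0},
    "Carol": {"Balance": 50.0},
}

def join_feature_groups(left, right, features):
    """Inner-join two feature groups on their shared key and select features."""
    rows = []
    for key in sorted(left.keys() & right.keys()):
        merged = {"name": key, **left[key], **right[key]}
        rows.append({f: merged[f] for f in features})
    return rows

training_rows = join_feature_groups(
    passenger_list,
    bank_account,
    ["name", "Pclass", "Sex", "Balance", "Survived"],
)
```

Only keys present in both groups survive an inner join, so "Bob" and "Carol" are dropped here.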

FeatureGroups are ingested at different Cadences
● Streaming App pushes click features every 5 secs
● Streaming App pushes CDC data every 30 secs
● Pandas App pushes user profile updates every hour
● Batch App pushes featurized weblogs data every day
[Figure: Event Data, Real-Time Data, SQL DW, S3, HDFS and SQL sources feed the Feature Store. Real-time feature transformations (<2 secs) and Low Latency Features go through the Online Feature Store to the Online App; High Latency Features go through the Offline Feature Store to Train and Batch Apps.]
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
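The online/offline split can be illustrated with a toy in-memory store: the offline side keeps the full history for training and batch use, while the online side keeps only the latest row per key for serving. `ToyFeatureStore` and its methods are hypothetical names used for this sketch only.

```python
class ToyFeatureStore:
    """In-memory stand-in: offline keeps full history, online keeps the latest row."""

    def __init__(self):
        self.offline = []  # append-only history, used for training and batch
        self.online = {}   # latest feature values per key, used for serving

    def ingest(self, key, features, event_time):
        self.offline.append({"key": key, "event_time": event_time, **features})
        current = self.online.get(key)
        # Only overwrite the online row with newer (or equally new) data.
        if current is None or event_time >= current["event_time"]:
            self.online[key] = {"event_time": event_time, **features}

    def get_online(self, key):
        return self.online.get(key)

store = ToyFeatureStore()
store.ingest("user_1", {"clicks_5s": 3}, event_time=10)
store.ingest("user_1", {"clicks_5s": 7}, event_time=15)
store.ingest("user_1", {"clicks_5s": 1}, event_time=5)  # late-arriving event
```

The late event is still recorded offline, but it does not replace the fresher value served online.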

FeatureGroup ingestion in Hopsworks
● User Clicks: ClickFeatureGroup
● DB Updates: TableFeatureGroup
● User Profile Updates: UserFeatureGroup
● Weblogs: LogsFeatureGroup
● Real-time features: RTFeatureGroup (Kafka Input, Flink, Kafka Output)
[Figure: data from Event Data, SQL DW, S3, HDFS and SQL sources is written through the DataFrame API into the Feature Store, which serves the Online App and the Train and Batch Apps.]
Simplify access to the online/offline Feature Stores by providing a general-purpose DataFrame API.

Register a Feature Group with the Feature Store
from hops import featurestore as fs
df = ...  # Spark or Pandas DataFrame
# Do feature engineering on 'df'
# Register the DataFrame as a FeatureGroup
fs.create_featuregroup(df, "titanic_df")

Feature Ingestion with Spark
fs.create_featuregroup(df, "titanic_df",
                       offline=True, online=True)
[Figure: an (External) Spark Cluster calls the Hopsworks REST API (1. Add Metadata, 2. Add Statistics, ….) and writes .parquet / .orc data over TLS to the Offline FS (Apache Hive on HopsFS) and to the Online FS (MySQL Cluster).]

Hopsworks Feature Store
● Ingest Data From: Event Data, SQL DW, S3, HDFS, SQL
● Online Feature Store (Serving), Used By: Online Apps
● Offline Feature Store (Training & Batch), Used By: Model Training, Batch Apps

Create Training Datasets using the Feature Store
from hops import featurestore as fs
# Select features and save them as a versioned training dataset
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
fs.create_training_dataset(sample_data, "titanic_training_dataset",
                           data_format="tfrecords", training_dataset_version=1)
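As a rough illustration of what materializing a versioned training dataset involves, the sketch below serializes rows to CSV text and tags the result with a name and version. The `write_training_dataset` helper and the `name_version` identifier scheme are assumptions made for this example, not the library's behavior.

```python
import csv
import io

def write_training_dataset(rows, name, version):
    """Serialize rows to CSV text and return (dataset_id, csv_text)."""
    dataset_id = f"{name}_{version}"  # assumed naming scheme for the sketch
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return dataset_id, buffer.getvalue()

# Hypothetical rows, e.g. the result of a feature query.
rows = [{"name": "Alice", "Pclass": 1, "Survived": 1}]
dataset_id, csv_text = write_training_dataset(rows, "titanic_training_dataset", 1)
```

Bumping `version` produces a new dataset identifier, so earlier training runs remain reproducible against the data they were trained on.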

Create Training Datasets with (External) Spark
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
[Figure: an (External) Spark Cluster joins features over TLS from the Hopsworks Offline FS (Apache Hive on HopsFS), shown alongside the Online FS (MySQL Cluster), and writes the training dataset as .npy, .tfrecords or .csv to Storage (GCS, Amazon S3, HopsFS).]

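Time-Travel Queries for Creating Training Datasets
# Illustrative pseudocode from the slide: `from` is a reserved word in Python,
# so a real call would need different parameter names.
df = fs.get_features(…., from="2017", to="2019")
[Figure: each atomic update to a FeatureGroup in the Feature Store is recorded as a commit (commit-0001, commit-0002, …., commit-0097); the resulting training dataset is written as .tfrecords, .csv or .npy to Storage (GCS, Amazon S3, HopsFS).]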
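The commit-based design can be sketched as an append-only log that answers "what did this feature group look like at time T?". Note that this is a point-in-time (as-of) lookup, a simpler variant than the from/to range on the slide, and `CommitLog` is a hypothetical class, not part of the Hopsworks API.

```python
from bisect import bisect_right

class CommitLog:
    """Append-only list of (commit_time, snapshot) pairs for one feature group."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def commit(self, commit_time, rows):
        # Commits must arrive in increasing time order in this sketch.
        if self._times and commit_time <= self._times[-1]:
            raise ValueError("commit times must be strictly increasing")
        self._times.append(commit_time)
        self._snapshots.append(list(rows))

    def as_of(self, query_time):
        """Return the snapshot that was current at query_time, or None."""
        pos = bisect_right(self._times, query_time)
        return self._snapshots[pos - 1] if pos else None

log = CommitLog()
log.commit(2017, [{"name": "Alice", "Balance": 100.0}])
log.commit(2019, [{"name": "Alice", "Balance": 250.0}])
```

Here `log.as_of(2018)` returns the 2017 snapshot, which is what a model trained in 2018 would have seen.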

Online Feature Store: High Availability & Low-Latency
1. Build a Feature Vector using the Online Feature Store (JDBC)
2. Send the Feature Vector to a Model for Prediction (Predict)
[Figure: an Online Application talks to MySQL NDB nodes (NDB1, NDB2, NDB3) and Model replicas spread across US-West-1a, US-West-1b and US-West-1c; latency labels in the figure read 2-20ms and ~5-50ms.]

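Online Feature Store: JDBC API
[Figure: the Online Application calls getQuery("model") on the Hopsworks REST API with an API key, and the REST API returns a JDBC query. The application then runs that query, SELECT .. FROM WHERE … in [keys], over TLS against the Online FS (MySQL Cluster), where [keys] are user_id, session_id, timestamp, etc., and sends the result to the Model for a Prediction. The Offline FS (Apache Hive on HopsFS) is shown alongside.]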
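The key-based lookup can be sketched with Python's built-in sqlite3 module standing in for the online database (the deck uses MySQL Cluster over JDBC). The `user_features` table, its columns and the `get_feature_vectors` helper are hypothetical.

```python
import sqlite3

# sqlite3 stands in for the online store; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_features (user_id INTEGER PRIMARY KEY, clicks REAL, balance REAL)"
)
conn.executemany(
    "INSERT INTO user_features VALUES (?, ?, ?)",
    [(1, 3.0, 100.0), (2, 7.0, 250.0), (3, 0.0, 10.0)],
)

def get_feature_vectors(connection, keys):
    """Fetch feature vectors for the given primary keys with one IN query."""
    if not keys:
        return {}
    placeholders = ", ".join("?" for _ in keys)  # one bound parameter per key
    query = (
        "SELECT user_id, clicks, balance FROM user_features "
        f"WHERE user_id IN ({placeholders}) ORDER BY user_id"
    )
    return {row[0]: list(row[1:]) for row in connection.execute(query, keys)}

vectors = get_feature_vectors(conn, [1, 3])
```

Keys are passed as bound parameters rather than formatted into the SQL string, and each resulting list is the feature vector an application would send to the model.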

HOPSWORKS
● Data Preparation & Ingestion: Batch (Apache Beam, Apache Spark); Streaming (Apache Beam, Apache Spark, Apache Flink); Apache Kafka
● Experimentation & Model Training: Hopsworks Feature Store; Distributed ML & DL (Pip, Conda, Tensorflow, scikit-learn, PyTorch, Jupyter Notebooks, Tensorboard)
● Deploy & Productionalize: Model Serving (Kubernetes); Model Monitoring (Kafka + Spark Streaming)
● Orchestration in Airflow
● Filesystem & Metadata Storage: HopsFS
● Connected to: Datasources, Applications, API, Dashboards

1. Feature Engineering: data from the Data Warehouse and Data Lake is engineered into features in the Offline Feature Store
2. Feature Selection: selected features become Train/Test Data (S3, HDFS, etc)
3. Training & Validation: Train Model, Scoring & Validation, Experiments, Model Repository
4. Serving: Deploy models from the Model Repository; the Online Feature Store supplies the Feature Vector
5. Prediction: Online Application and Batch Application; Monitor via Kafka

ML Lifecycle
● Stage 1. Data Engineer: Feature Engineering on data from Redshift, S3, Cassandra and Hadoop. Output: Features in the Feature Store.
● Stage 2. Data Scientist: Feature Selection, Training Data and Test Data, Model Design, Experiments over Model Hyperparameters and Model Architecture trials, Model Validation. Output: Models (Model Candidates in the Model Repository).
● Stage 3. ML Engineer: Online Model Serving on Kubernetes / Serverless with a Model Inference API; Log Predictions to Kafka; a Streaming or Serverless Monitoring App that logs predictions and joins outcomes, feeding KPI Dashboards, Alerts and Actions. Output: Model APIs.
● Stage 4. App Developer: Online Application (Get Online Features, Predict) and Batch Apps. Output: Intelligent App.
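The "Log Predictions and Join Outcomes" step in the monitoring stage can be sketched as a join between logged predictions and outcomes observed later. The record layout (`request_id`, `prediction`) and the `join_outcomes` helper are hypothetical; a production system would do this on a stream rather than in memory.

```python
def join_outcomes(predictions, outcomes):
    """Match logged predictions with later outcomes and compute accuracy."""
    matched = [
        (p["prediction"], outcomes[p["request_id"]])
        for p in predictions
        if p["request_id"] in outcomes
    ]
    if not matched:
        return None  # no outcomes have arrived yet
    correct = sum(1 for predicted, actual in matched if predicted == actual)
    return correct / len(matched)

prediction_log = [
    {"request_id": "a", "prediction": 1},
    {"request_id": "b", "prediction": 0},
    {"request_id": "c", "prediction": 1},  # outcome not observed yet
]
observed = {"a": 1, "b": 1}
accuracy = join_outcomes(prediction_log, observed)
```

A metric like this is the kind of value that could drive the KPI dashboards and alerts named on the slide.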

Feature Store
Offline Features (Hive)
● Apache Hive on HopsFS: Scalable Data warehouse
● Spark for Feature Computing: Fast backfilling of Training Data
● HopsFS: NVMe speed with Big Data
● HA and Horizontally Scalable: From 1 to 100s of nodes and PBs of data
Security & Governance
● Secure Multi-Tenancy: Role-based Access Control
● Encryption At-Rest, In-Motion: TLS/SSL everywhere
● AI-Asset Governance: Models, experiments, data, GPUs
● Data/Model/Feature Lineage: Discover/track dependencies
Online Features (NDB)
● Real-Time, HA Database: MySQL Cluster (NDB)
● JDBC API for Serving Clients: Online apps only need JDBC
● In-Memory or NVMe data: Single-digit ms query times
● HA and Horizontally Scalable: Add nodes with no downtime and scale to 10s of TBs

Agenda for demo
Feature Store Overview
Access control / governance / statistics
Creating Features
Online vs Offline Features
Search for Features
Create training dataset
Query planner and hints
Online Feature Store
JDBC API for the online Feature Store

Hopsworks Subscription Models
Hopsworks Community
● Full Featured
● AGPL-v3 License Model
Hopsworks Enterprise
● Kubernetes Support: Model Serving; other services for robustness (Jupyter, more coming)
● Authentication (LDAP, Kerberos, OAuth2)
● Github support

Try it out!
www.hopsworks.ai
