Hamburg Data Science Meetup - MLOps with a Feature Store

MLOps with a Feature Store
Filling the Gap in ML Infrastructure
Moritz Meister
Data Scientist
Software Engineer @ Logical Clocks
@morimeister
Hamburg Data Science Meetup
May 28th, 2020

Hopsworks,
cloud-native
& open-source

MLOps
CI/CD for ML models.
Feature Store
Definition, storage, and access of features.
Shared Feature Engineering Code
Well versioned feature engineering jobs.
Ad hoc Scripts and Jobs
Data and code silos.
Journey to a Feature Store and Beyond

Event Data
Raw Data
SQL Data
DATA LAKE
DATA PIPELINES
FEATURE PIPELINES
MODEL
SERVING
TRAIN & VALIDATE
MONITOR
Data Engineer Data Scientist ML Engineer
End to End ML Pipelines

Event Data
Raw Data
SQL Data
DATA LAKE
TRAIN & VALIDATE
Hopsworks
FEATURE
STORE
ONLINE MODEL SERVING
BATCH MODEL SCORING
BI Platforms
MONITOR
End to End ML Pipelines
DATA PIPELINES FEATURE PIPELINES

● Logical Clocks – Hopsworks (world’s first open source)
● Uber Michelangelo
● Airbnb – Bighead/Zipline
● Comcast
● Twitter
● GO-JEK Feast (GCE, open-source layer over BigTable/BigQuery)
● Branch
● Conde Nast
● Facebook FB Learner
● Netflix
Reference: www.featurestore.org
Known Feature Stores in Production

numbers
(in arrays)
numbers
arrays
(of numbers)
one-hot encoding
Databases
Schemas
varchar, charsets
integer, blob,
varbinary
A Data Engineer’s Perspective on Feature Engineering
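
The gap between database column types (varchar, integer) and the numeric arrays a model expects can be illustrated with one-hot encoding. The following is a minimal sketch in plain Python; the function name and the sample values are illustrative assumptions and do not come from the talk.

```python
def one_hot(values):
    """Map each categorical string to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the slot that belongs to this category
        encoded.append(row)
    return categories, encoded


# Hypothetical varchar column as it might come out of a database table.
categories, encoded = one_hot(["male", "female", "female"])
```

Here `categories` holds the column order and `encoded` holds one numeric array per input row, which is the shape a training framework typically consumes.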

Feature Engineering is about Transforming Data

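from pyspark.ml.feature import Normalizer
# Read the input data; it must contain a vector column named "features"
scaledDF = spark.read.parquet("…")
# Scale each feature vector to unit L1 norm (p=1)
l1_norm = Normalizer().setP(1).setInputCol("features").setOutputCol("l1_norm")
l1_norm.transform(scaledDF)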
Normalize
Feature Engineering is about Transforming Data
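
To make the transformation concrete: normalizing with p=1 divides every element of a feature vector by the sum of the absolute values of that vector. The sketch below shows this in plain Python, outside of Spark. The function name is hypothetical, and returning an all-zero vector unchanged is an assumption of this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so that the absolute values of its elements sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        # Assumption of this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]


normalized = l1_normalize([1.0, -1.0, 2.0])
```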

Model Features Labels
TRAINING
Labels Features Model
INFERENCE
Feature Store
Get
Get
Consistent Features Between Training and Inference
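
One way to read this slide: training and inference should obtain their features from a single shared definition rather than from two separate implementations that can drift apart. A minimal sketch of that idea follows, with a plain function standing in for the feature store; the function, field names, and sample records are hypothetical.

```python
def engineer_features(raw):
    """Single shared feature definition (hypothetical example)."""
    return [
        float(raw["pclass"]),
        1.0 if raw["sex"] == "female" else 0.0,
    ]


# Training: features come from historical records that also carry labels.
training_rows = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "male", "survived": 0},
]
X_train = [engineer_features(row) for row in training_rows]
y_train = [row["survived"] for row in training_rows]

# Inference: the same function is applied to a new, unlabeled record,
# so the model sees features computed exactly as during training.
x_new = engineer_features({"pclass": 3, "sex": "male"})
```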

Features name Pclass Sex Survive Name
Balance
Train / Test
Datasets
Survive name PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecord
.npy
.csv
.hdf5,
.petastorm, etc
Storage
GCS
Amazon
S3
HopsFS
Features, Feature Groups, and Train/Test Datasets are all versioned
Feature Store Concepts
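
The relationship between feature groups, the join key, and a train/test dataset can be sketched in plain Python. This is an illustration of the concept under stated assumptions, not the Hopsworks implementation: the data values are made up, the helper names are hypothetical, and an inner join on the passenger name is assumed.

```python
# Two feature groups keyed by passenger name (sample values are made up).
passenger_fg = {
    "Alice": {"Pclass": 1, "Sex": "female", "Survived": 1},
    "Bob": {"Pclass": 3, "Sex": "male", "Survived": 0},
}
bank_account_fg = {
    "Alice": {"Balance": 1200.0},
    "Bob": {"Balance": 300.0},
    "Carol": {"Balance": 50.0},  # no passenger record, dropped by the join
}


def join_feature_groups(left, right):
    """Inner join of two feature groups on their shared key."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


def select_features(joined, names):
    """Pick the requested features, one row per key, in key order."""
    return [[row[name] for name in names] for _, row in sorted(joined.items())]


joined = join_feature_groups(passenger_fg, bank_account_fg)
training_dataset = select_features(joined, ["Pclass", "Sex", "Balance", "Survived"])
```

The resulting rows would then be written out in one of the file formats listed on the slide (.tfrecord, .csv, and so on) and stored under a version number.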

Streaming App pushes click features every 5 secs
Streaming App pushes CDC data every 30 secs
Pandas App pushes user profile updates every hour
Batch App pushes featurized weblogs data every day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
Real-time feature transformations (<2 secs) Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
<10ms
TBs/PBs
Feature Groups are ingested at different Cadences
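
The split into an online and an offline store can be sketched as a dual write: the offline store keeps the full history for training and batch use, while the online store keeps only the latest value per key for low-latency lookups. The class below is a toy in-memory illustration of that idea; its name, methods, and data layout are assumptions and not the Hopsworks design.

```python
class FeatureStoreSketch:
    """Toy dual-write store: full history offline, latest row per key online."""

    def __init__(self):
        self.offline = []  # append-only history, used for training and batch apps
        self.online = {}   # key -> most recent feature row, used by online apps

    def ingest(self, key, features, event_time):
        # Every update is appended to the offline history.
        self.offline.append({"key": key, "event_time": event_time, **features})
        # The online store only keeps the most recent row for each key.
        current = self.online.get(key)
        if current is None or event_time >= current["event_time"]:
            self.online[key] = {"event_time": event_time, **features}

    def get_online(self, key):
        return self.online.get(key)


store = FeatureStoreSketch()
store.ingest("user_1", {"clicks": 3}, event_time=1)
store.ingest("user_1", {"clicks": 5}, event_time=2)
store.ingest("user_1", {"clicks": 4}, event_time=0)  # late-arriving update
```

After these three updates the offline history holds all three rows, while an online lookup for `user_1` returns the row with the latest event time.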

Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
Simplify Ingestion to the Online/Offline Feature Stores by providing a general-purpose DataFrame API.
Feature Groups are ingested at different Cadences

from hops import featurestore as fs
df = ...  # Spark or Pandas DataFrame
# Do feature engineering on 'df'
# Register the DataFrame as a Feature Group
fs.create_featuregroup(df, "titanic_df")
Register a Feature Group with the Feature Store

Hopsworks Feature Store
Feature Store
Event Data
Snowflake,
Redshift, SQL
Delta Lake
S3, HDFS,
Online
Feature Store
Offline
Feature Store
Ingest
Data
From
Used
By
Online Apps
Batch Apps
Create Train/Test Data

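from hops import featurestore as fs
# Select features by name; the feature store resolves the feature groups and joins
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
# Materialize the selected features as a versioned training dataset
fs.create_training_dataset(sample_data, "titanic_training_dataset",
                           data_format="tfrecords", training_dataset_version=1)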
Create Training Datasets using the Feature Store

US-West-1a
MySQL
NDB1
Model
Online Application
1. JDBC 2. Predict
1. Build a Feature Vector using the Online Feature Store
US-West-1c
MySQL
NDB3
Model
~5-50ms
US-West-1b
MySQL
NDB2
Model
2-20ms
2. Send the Feature Vector to a Model for Prediction
Online Feature Store: High Availability & Low-Latency
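
The two steps on this slide (look up a feature vector by key, then send it to the model) can be sketched as follows. To stay self-contained, the sketch uses Python's built-in sqlite3 as a stand-in for the MySQL NDB online store accessed over JDBC. The table name, columns, sample row, and the toy `predict` rule are all hypothetical.

```python
import sqlite3

# Stand-in for the online feature store (assumption: one row per primary key).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passenger_features ("
    "name TEXT PRIMARY KEY, pclass INTEGER, balance REAL)"
)
conn.execute(
    "INSERT INTO passenger_features VALUES (?, ?, ?)", ("Alice", 1, 1200.0)
)
conn.commit()


def build_feature_vector(connection, name):
    """Step 1: primary-key lookup that returns the feature vector, or None."""
    row = connection.execute(
        "SELECT pclass, balance FROM passenger_features WHERE name = ?",
        (name,),
    ).fetchone()
    return None if row is None else [float(value) for value in row]


def predict(feature_vector):
    """Step 2: placeholder for a call to a served model (toy rule only)."""
    pclass, balance = feature_vector
    return 1 if pclass == 1 and balance > 1000 else 0


vector = build_feature_vector(conn, "Alice")
prediction = predict(vector)
```

Keeping the lookup a single primary-key read is what makes the low latencies on the slide plausible; anything that requires scans or joins at request time would belong in the feature pipelines instead.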

Hopsworks
APPLICATIONS
API
DASHBOARDS
HOPSWORKS
DATASOURCES
In Airflow
Apache Beam
Apache Spark
Apache Beam
Apache Spark
Apache Flink
HOPSWORKS
FEATURE
STORE
Pip
Conda
TensorFlow
scikit-learn
PyTorch
Jupyter
Notebooks
Tensorboard
HopsFS
Kubernetes
Kafka
+
Spark
Streaming
Data Preparation
& Ingestion
Experimentation
& Model Training
Deploy
& Productionalize
Apache
Kafka

ML Infrastructure: The complete Picture
1
Feature
Engineering
2
Feature
Selection
3
Training &
Validation
4 Serving 5 Prediction
Train/Test Data
(S3, HDFS, etc)
Online
Application
Batch
Application
Data Warehouse
Data Lake
Feature
Engineering
Offline
Feature Store
Feature
Selection
Scoring &
Validation
Train
Model
Serving
Online
Feature Store
Model
Repository
Monitor
Experiments
Deploy
Feature Vector
Kafka

Multi-Worker Training for TensorFlow (using PySpark)
https://databricks.com/session/distributed-deep-learning-with-apache-spark-and-tensorflow
Maggy: Async HParam Tuning and Parallel Ablation Studies (using PySpark)
https://databricks.com/session_eu19/asynchronous-hyperparameter-optimization-with-apache-spark
Project-Based Multi-Tenancy
Implicit Provenance for ML Workflows
Instrument instead of rewrite (TFX, MLflow) – enabled by a CDC API
Secure Sensitive data on a shared cluster:
Datasets, Hive DBs, Feature Stores, Kafka Topics all private to Projects – but can be shared.
Conda environment per project (sane Python dependency management in a cluster).
More in Hopsworks

Full Featured
AGPL-v3 License Model
Hopsworks Community
Kubernetes Support
• Model Serving
• Other services for robustness (Jupyter, more coming)
Authentication (LDAP, Kerberos, OAuth2)
GitHub support
Hopsworks Enterprise
Managed SaaS platform (currently only on AWS)
Hopsworks.ai
Trying out Hopsworks

@hopsworks
http://github.com/logicalclocks/hopsworks
Show us some love!

Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London, EC2A2BB,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
WWW.LOGICALCLOCKS.COM