Dr. Jim Dowling
CEO / Co-Founder
Logical Clocks
Managed Feature Store
for ML Webinar
[ Presenter ]
Leadership & Offices
Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
Dr. Jim Dowling
CEO
Theo Kakantousis
COO
Prof. Seif Haridi
Chief Scientist
Fabio Buso
VP Engineering
Steffen Grohsschmiedt
Head Of Cloud
www.logicalclocks.com
Shraddha Chouhan
Head Of Marketing
Hopsworks - Award Winning Platform
Today’s Journey to a Feature Store and Beyond
Ad-hoc Scripts
and Jobs
Shared Feature
Pipelines
Feature Store
MLOps with a
Feature Store
Known Feature Stores in Production
● Logical Clocks – Hopsworks (world’s first open source)
● Uber Michelangelo
● Airbnb – Bighead/Zipline
● Comcast
● Twitter
● GO-JEK Feast (GCE, open-source layer over BigTable/BigQuery)
● Branch
● Conde Nast
● Facebook FB Learner
● Netflix
Reference: www.featurestore.org
What is a Feature?
A feature is a measurable property of a phenomenon under observation and
(part of) an input to an ML model.
Example features:
● A raw word, a pixel, a sound
wave, a sensor value;
● An aggregate
(mean, max, sum, min)
● A window
(last_hour, last_day, etc)
● A derived representation
(embedding or cluster)
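The aggregate and window features listed above can be illustrated with a small sketch. It assumes raw readings arrive as (timestamp, value) pairs; the function name `window_aggregates` and the sample readings are made up for illustration and are not part of Hopsworks.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_aggregates(readings, now, window):
    """Aggregate the values whose timestamp falls in (now - window, now]."""
    values = [value for ts, value in readings if now - window < ts <= now]
    if not values:
        return None  # no observations in the window
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
    }


# Hypothetical sensor readings, for illustration only.
now = datetime(2020, 1, 1, 12, 0)
readings = [
    (now - timedelta(minutes=90), 10.0),  # outside the last hour
    (now - timedelta(minutes=30), 20.0),
    (now - timedelta(minutes=5), 40.0),
]

last_hour = window_aggregates(readings, now, timedelta(hours=1))
last_day = window_aggregates(readings, now, timedelta(days=1))
```

Here `last_hour` and `last_day` are two different features derived from the same raw values; only the window changes.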
numbers
(in arrays)
A Data Engineer’s perspective on Feature Engineering
numbers
arrays
(of numbers)
one-hot
encoding
Databases
Schemas
varchar, charsets
integer, blob,
varbinary
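As a rough illustration of the gap between database types and model inputs, the sketch below turns a categorical (varchar-like) column into one-hot arrays of numbers. The helper `one_hot_encode` and the sample values are assumptions for illustration, not a Hopsworks API.

```python
def one_hot_encode(values):
    """Map each distinct category to an array with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical varchar column, for illustration only.
categories, encoded = one_hot_encode(["male", "female", "female"])
```

The category order is fixed by sorting so that the same input always produces the same column layout.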
Feature Engineering is about Transforming Data
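from pyspark.ml.feature import Normalizer
# Read the input data (path elided in the original slide)
scaledDF = spark.read.parquet("…")
# Normalize each "features" vector to unit L1 norm
l1_norm = Normalizer().setP(1).setInputCol("features").setOutputCol("l1_norm")
l1_norm.transform(scaledDF)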
Normalize
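To make the transformation concrete, here is a plain-Python sketch of what L1 normalization of a single feature vector computes: each component is divided by the sum of absolute values. The function name is hypothetical, and returning an all-zero vector unchanged is a choice made for this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so that the absolute values of its components sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        return list(vector)  # sketch assumption: leave zero vectors as they are
    return [x / norm for x in vector]


normalized = l1_normalize([1.0, -1.0, 2.0])
```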
Consistent Features between Training and Inference
It’s not always trivial to ensure features are engineered
consistently between training and inference
Features
Training
Labels Model
Features
Inference
Model Labels
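One common way to keep features consistent is to route both paths through a single function. The sketch below assumes Titanic-style records; `engineer_features` and the sample rows are hypothetical.

```python
def engineer_features(raw):
    """Single definition of the feature logic, shared by training and inference."""
    return [
        float(raw["pclass"]),
        1.0 if raw["sex"] == "female" else 0.0,
    ]


# Training path: features and labels come from historical records.
training_rows = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "male", "survived": 0},
]
X_train = [engineer_features(row) for row in training_rows]
y_train = [row["survived"] for row in training_rows]

# Inference path: the same function is applied to an incoming request.
request = {"pclass": 3, "sex": "male"}
x_infer = engineer_features(request)
```

If the two paths re-implemented the encoding separately, a small difference (for example, a different encoding of `sex`) would silently skew predictions.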
Feature Store – Reuse Cached Features
One
Feature
Pipeline
Get
Get
Features
Training
Labels Model
Features
Inference
Model Labels
Feature
Store
Features name Pclass Sex Survive Name Balance
Train / Test
Datasets
Survive name PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecords
.npy
.csv
.hdf5,
.petastorm, etc
Storage
GCS
Amazon S3
HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned
Feature Store Concepts
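The idea of building a train/test dataset by joining feature groups on a key can be sketched in plain Python. The two dictionaries stand in for the passenger list and bank account feature groups; all values are made up, and `join_feature_groups` is a hypothetical helper.

```python
# Feature group 1: Titanic passenger list, keyed by name (made-up rows).
passenger_list = {
    "Alice": {"Pclass": 1, "Sex": "female", "Survive": 1},
    "Bob": {"Pclass": 3, "Sex": "male", "Survive": 0},
}

# Feature group 2: passenger bank account, keyed by name (made-up rows).
bank_account = {
    "Alice": {"Balance": 1200.0},
    "Bob": {"Balance": 300.0},
    "Carol": {"Balance": 50.0},  # no passenger record, dropped by the inner join
}


def join_feature_groups(left, right):
    """Inner-join two feature groups on their shared key."""
    return [
        {"name": key, **left[key], **right[key]}
        for key in left
        if key in right
    ]


training_rows = join_feature_groups(passenger_list, bank_account)
```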
Streaming App pushes click features every 5 secs
Streaming App pushes CDC data every 30 secs
Pandas App pushes user profile updates every hour
Batch App pushes featurized weblogs data every day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
Real-time feature transformations (<2 secs) Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
FeatureGroups are ingested at different Cadences
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
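The split between the two stores can be sketched as a dual write: the offline side keeps the full history for training, while the online side keeps only the latest row per key for fast lookups. The class below is an in-memory toy under those assumptions, not how Hopsworks implements it.

```python
class DualFeatureStore:
    """Toy model of an offline (history) plus online (latest value) store."""

    def __init__(self):
        self.offline = []  # append-only history, used for training and batch
        self.online = {}   # key -> most recent row, used for serving

    def ingest(self, key, timestamp, features):
        row = {"key": key, "ts": timestamp, **features}
        self.offline.append(row)
        current = self.online.get(key)
        # Only overwrite the online row if this one is at least as recent.
        if current is None or timestamp >= current["ts"]:
            self.online[key] = row

    def get_online(self, key):
        return self.online.get(key)

    def get_offline(self, key):
        return [row for row in self.offline if row["key"] == key]


store = DualFeatureStore()
store.ingest("user-1", 1, {"clicks": 3})
store.ingest("user-1", 2, {"clicks": 5})
```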
Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
FeatureGroup ingestion in Hopsworks
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
Simplify access to the online/offline Feature Stores by providing a general-purpose DataFrame API.
Register a Feature Group with the Feature Store
from hops import featurestore as fs
df = ...  # placeholder: a Spark or Pandas DataFrame
# Do feature engineering on 'df'
# Register the DataFrame as a FeatureGroup
fs.create_featuregroup(df, "titanic_df")
HOPSWORKS
Rest API
1 Add Metadata
2 Add Statistics
….
Offline FS
Apache Hive
HopsFS
(External)
Spark Cluster
.parquet, .orc (TLS)
Online FS
MySQL Cluster
fs.create_featuregroup(df, "titanic_df",
                       offline=True, online=True)
Feature Ingestion with Spark
Online
Feature Store
(Serving)
Offline
Feature Store
(Training & Batch)
Online Apps
Model Training
Batch Apps
Event Data
SQL DW
S3, HDFS
SQL
Ingest
Data
From
Used
By
Hopsworks Feature Store
Create Training Datasets using the Feature Store
from hops import featurestore as fs
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
fs.create_training_dataset(sample_data, "titanic_training_dataset",
                           data_format="tfrecords", training_dataset_version=1)
HOPSWORKS
Offline FS
Apache Hive
HopsFS
Join Features <<TLS>>
Online FS
MySQL Cluster
(External)
Spark Cluster
sample_data = fs.get_features(["name",
                               "Pclass", "Sex", "Balance", "Survived"])
Create Training Datasets with (External) Spark
Storage
GCS Amazon S3 HopsFS
.npy, .tfrecords, .csv
commit-0097
….
commit-0002
commit-0001
FeatureGroup
atomic
update
Feature Store
Time-Travel Queries for Creating Training Datasets
df = fs.get_features(…., from="2017", to="2019")
Storage
GCS Amazon S3 HopsFS
.tfrecords
.csv
.npy
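The commit log above suggests how a time-travel read could work: each atomic update is recorded as a commit, and a query picks the state that was current at a given time. The sketch below shows only the simpler "as of" variant rather than the from/to range on the slide; the class and its methods are hypothetical.

```python
from bisect import bisect_right


class VersionedFeatureGroup:
    """Toy feature group that keeps one full snapshot per commit."""

    def __init__(self):
        self._commit_times = []  # sorted commit timestamps
        self._snapshots = []     # snapshot of rows at each commit

    def commit(self, commit_time, rows):
        if self._commit_times and commit_time <= self._commit_times[-1]:
            raise ValueError("commits must have increasing timestamps")
        self._commit_times.append(commit_time)
        self._snapshots.append(list(rows))

    def as_of(self, point_in_time):
        """Return the rows of the latest commit at or before point_in_time."""
        idx = bisect_right(self._commit_times, point_in_time)
        if idx == 0:
            return []  # nothing had been committed yet
        return self._snapshots[idx - 1]


# Made-up commits keyed by year, for illustration only.
fg = VersionedFeatureGroup()
fg.commit(2017, [{"name": "a", "Balance": 10}])
fg.commit(2019, [{"name": "a", "Balance": 25}])
snapshot_2018 = fg.as_of(2018)
```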
US-West-1a
MySQL
NDB1 Model
Online Application
1. JDBC 2. Predict
1. Build a Feature Vector using the Online Feature Store
US-West-1c
MySQL
NDB3 Model
~5-50ms
Online Feature Store: High Availability & Low-Latency
US-West-1b
MySQL
NDB2 Model
2-20ms
2. Send the Feature Vector to a Model for Prediction
HOPSWORKS
Rest API
Return JDBC Query
….
Offline FS
Apache Hive
HopsFS
Online FS
MySQL Cluster SELECT .. FROM WHERE … in [keys]
<<TLS>>
getQuery("model")
<<API-Key>> Online
Application
Online Feature Store: JDBC API
[keys]
user_id,
session_id,
timestamp, etc
Model
Prediction
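The lookup pattern in this slide (a parameterized SELECT over a set of keys, whose result becomes the feature vector) can be sketched with Python's built-in `sqlite3` module standing in for MySQL Cluster over JDBC. The table name, column names, and rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the online feature store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_features ("
    "user_id INTEGER PRIMARY KEY, clicks_last_hour INTEGER, avg_basket REAL)"
)
conn.executemany(
    "INSERT INTO user_features VALUES (?, ?, ?)",
    [(1, 12, 30.5), (2, 0, 12.0), (3, 7, 99.9)],  # made-up rows
)


def get_feature_vectors(connection, keys):
    """Fetch one feature row per primary key with a parameterized IN query."""
    keys = list(keys)
    if not keys:
        return []
    placeholders = ", ".join("?" for _ in keys)
    query = (
        "SELECT user_id, clicks_last_hour, avg_basket "
        f"FROM user_features WHERE user_id IN ({placeholders}) "
        "ORDER BY user_id"
    )
    return connection.execute(query, keys).fetchall()


vectors = get_feature_vectors(conn, [1, 3])
```

Only the placeholders are interpolated into the SQL string; the key values themselves are passed as bound parameters.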
HOPSWORKS
APPLICATIONS
API
DASHBOARDS
HOPSWORKS
DATASOURCES
ORCHESTRATION
In Airflow
BATCH
Apache Beam
Apache Spark
STREAMING
Apache Beam
Apache Spark
Apache Flink
HOPSWORKS
FEATURE
STORE
DISTRIBUTED
ML & DL
Pip
Conda
Tensorflow
scikit-learn
PyTorch
Jupyter
Notebooks
Tensorboard
FILESYSTEM & METADATA STORAGE
HopsFS
MODEL
SERVING
Kubernetes
MODEL
MONITORING
Kafka
+
Spark Streaming
Data Preparation
& Ingestion
Experimentation
& Model Training
Deploy
& Productionalize
Apache
Kafka
1
Feature
Engineering
2
Feature
Selection
3
Training &
Validation
4 Serving 5 Prediction
Train/Test Data
(S3, HDFS, etc)
Online
Application
Batch
Application
Data Warehouse
Data Lake
Feature
Engineering
Offline
Feature Store
Feature
Selection
Scoring &
Validation
Train
Model
Serving
Online
Feature Store
Model
Repository
Monitor
Experiments
Deploy
Feature Vector
Kafka
ML Lifecycle
Stage 1. Data Engineer
Models
Stage 2. Data Scientist
Model APIs
Stage 3. ML Engineer
Intelligent App
Stage 4. App Developer
Features
Model Hyperparameters
Model Candidates
Feature
Selection
Training Data
Test Data
Model
Design
Model
Architecture
Model
Architecture
Model
Architecture
Model
Architecture
Model Repository
Model
Architecture
Model
Architecture
Model
Architecture
Trial
Data Scientist
Experiments
Model Validation
Batch Apps
Online
Application
Predict
Get Online Features
App Developer
Redshift S3 Cassandra Hadoop
Feature
Engineering
Feature Store
Data Engineer
Kubernetes / Serverless
KPI Dashboards
Alerts
Actions
Model
Architecture
Model
Architecture
Model
Architecture
Model
Architecture
Model
Kafka
Model Inference API
Log Predictions
Predict
Streaming or
Serverless
Monitoring App
Log Predictions and
Join Outcomes
Online Model Serving
ML Engineer
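The "log predictions and join outcomes" step of model monitoring can be sketched as a join on a request identifier followed by a simple metric. The field names (`request_id`, `predicted`) and the sample data are hypothetical; in the architecture above the logs would flow through Kafka rather than in-memory lists.

```python
def join_outcomes(predictions, outcomes):
    """Pair each logged prediction with its observed outcome, when one exists."""
    return [
        (p["predicted"], outcomes[p["request_id"]])
        for p in predictions
        if p["request_id"] in outcomes
    ]


def accuracy(joined):
    """Fraction of joined pairs where the prediction matched the outcome."""
    if not joined:
        return None
    return sum(1 for predicted, actual in joined if predicted == actual) / len(joined)


# Made-up prediction log and outcomes, for illustration only.
prediction_log = [
    {"request_id": "r1", "predicted": 1},
    {"request_id": "r2", "predicted": 0},
    {"request_id": "r3", "predicted": 1},  # outcome not yet known
]
observed_outcomes = {"r1": 1, "r2": 1}

joined = join_outcomes(prediction_log, observed_outcomes)
live_accuracy = accuracy(joined)
```

A value such as `live_accuracy` is the kind of KPI that could feed the dashboards and alerts shown in the diagram.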
Feature Store
Offline Features (Hive)
Secure Multi-Tenancy
Role-based Access Control
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Real-Time, HA Database
MySQL Cluster (NDB)
JDBC API for Serving Clients
Online apps only need JDBC
In-Memory or NVMe data
Single-digit ms query times
Apache Hive on HopsFS
Scalable Data warehouse
Spark for Feature Computing
Fast backfilling of Training Data
HopsFS
NVMe speed with Big Data
HA and Horizontally Scalable
From 1 to 100s of nodes and
PBs of data
Hive
HA and Horizontally Scalable
Add nodes with no downtime
and scale to 10s of TBs
JDBC
NDB
NVMe
Security & Governance
Online Features (NDB)
Agenda for demo
Feature Store Overview
Access control / governance / statistics
Creating Features
Online vs Offline Features
Search for Features
Create training dataset
Query planner and hints
Online Feature Store
JDBC API for the online Feature Store
Hopsworks Subscription Models
Full Featured
AGPL-v3 License Model
Hopsworks Community
Kubernetes Support
• Model Serving
• Other services for robustness (Jupyter, more coming)
Authentication (LDAP, Kerberos, OAuth2)
Github support
Hopsworks Enterprise
Try it out!
www.hopsworks.ai
Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
www.logicalclocks.com

Managed Feature Store for Machine Learning
