Feast Feature Store
CONFIDENTIAL | © 2024 EPAM Systems, Inc.
An In-depth Overview
Experimentation and Application in Tabular Data
29th March 2024
Presenters:
• Loc Nguyen (MLE): loc_nguyen@epam.com
• Hong Ong (Lead Data SE): hong_ong@epam.com
Agenda
• Motivation: Why do we need feature stores?
• Feast: Why do we choose Feast as our feature store?
• How to: Store and serve features using Feast
• Demo scenarios:
o As a DS: Quickly experiment based on pre-calculated features.
o As a DE: Data pipeline integration.
o As an MLE: Serving online inference APIs.
• Discussion:
o Pros and cons
o Extending beyond tabular data: images and text
o VectorDB as a data source
Motivation: Why do we need feature stores?
What are the challenges with features?
Themes: lack of discovery, managing data consistency, production readiness.
• It is tricky to get another project's features for our own project.
• Data scientists want to spend time optimizing models, but end up spending too much time on data preparation.
• Serving features in production is hard: there is a gap between the features obtained in training and the features obtained in production.
• Duplication of work.
• Difficulties in managing data consistency across projects.
Ideas
Avoid duplicated effort and reuse code and outputs as much as possible.
• Centralize the code base.
• Centralize calculated features.
Share calculated features across projects and make them production-ready.
• Easy feature discovery.
• Easy feature retrieval.
• Easy synchronization between the training phase and the production phase.
Why do we choose Feast as our feature store?
GitHub reputation
Star rating: The Feast repository has over 5.2k stars, indicating substantial popularity in the open-source community.
Frequency of updates: The repository is frequently updated, which suggests that the tool is actively developed, with bugs fixed and new features added regularly.
Feast: Design goals
• Share features between teams and projects to reduce duplicated effort.
• Provide data consistency across projects.
• Decouple feature engineering from model development (DS would love this).
• Provide real-time access to features.
• Integrate with existing tools that ML practitioners are familiar with (e.g. Airflow, Dagster, MLflow, K8s).
Feast overview
Source: link
How to: Store and serve features using Feast
What is a feature?
User ID | Avg loan late payment rate | Avg income | Avg credit score | Datetime
3001 | 0 | 1800 | 500 | 22/10/01
3002 | 0.2 | 1500 | 380 | 22/12/01
3001 | 0 | 2500 | 500 | 23/12/01
3002 | 0.3 | 1800 | 500 | 23/11/01
Features related to a user
User is the entity type
How do we train a loan model?
User information
User ID | Age | Income | Credit score
3001 | 30 | 2000 | 400
3002 | 28 | 3500 | 570
Loan information
Loan ID | User ID | Datetime | Loan amount | Interest rate | Loan term | DTI rate | Default
1001 | 3001 | 23/01 | 10000 | 1.5 | 36 | 0.34 | 0
1002 | 3002 | 24/02 | 20000 | 1.4 | 40 | 0.48 | 1
User precomputed features
User ID | Avg loan late payment rate | Avg income | Avg credit score | Datetime
3001 | 0 | 1800 | 500 | 22/10/01
3002 | 0.2 | 1500 | 380 | 22/12/01
3001 | 0 | 2500 | 500 | 23/12/01
3002 | 0.3 | 1800 | 500 | 23/11/01
1. Join features to target
Loan user age | Loan user current income | Loan user current credit score | Avg loan late payment rate | Avg income | Avg credit score | Loan amount | Interest rate | Loan term | DTI rate | Default
30 | 2000 | 400 | 0 | 1800 | 500 | 10000 | 1.5 | 36 | 0.34 | 0
28 | 3500 | 570 | 0.3 | 1800 | 500 | 20000 | 1.4 | 40 | 0.48 | 1
2. Train model
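The joined rows above pick, for each loan, the most recent precomputed feature row dated at or before the loan date (a point-in-time join). Below is a minimal plain-Python sketch of that rule using the values from the tables. It assumes the feature dates are YY/MM/DD and that loan dates, given only as YY/MM, fall on the first of the month. The function name point_in_time_join is hypothetical and this is not Feast's implementation.

```python
from bisect import bisect_right
from datetime import date

# Precomputed user features from the slide:
# (user_id, as_of, avg_late_payment_rate, avg_income, avg_credit_score).
# Dates are assumed to be YY/MM/DD.
USER_FEATURES = [
    (3001, date(2022, 10, 1), 0.0, 1800, 500),
    (3002, date(2022, 12, 1), 0.2, 1500, 380),
    (3001, date(2023, 12, 1), 0.0, 2500, 500),
    (3002, date(2023, 11, 1), 0.3, 1800, 500),
]

# Loan events: (loan_id, user_id, event_date).
# The slide gives only YY/MM, so day 1 is an assumption.
LOANS = [
    (1001, 3001, date(2023, 1, 1)),
    (1002, 3002, date(2024, 2, 1)),
]


def point_in_time_join(loans, features):
    """Attach to each loan the latest feature row at or before the loan date."""
    history = {}
    for user_id, as_of, *values in sorted(features, key=lambda r: (r[0], r[1])):
        history.setdefault(user_id, []).append((as_of, tuple(values)))

    joined = []
    for loan_id, user_id, event_date in loans:
        rows = history.get(user_id, [])
        # Number of feature rows dated at or before the loan date.
        idx = bisect_right([as_of for as_of, _ in rows], event_date)
        values = rows[idx - 1][1] if idx else None  # None: no feature known yet
        joined.append((loan_id, user_id, values))
    return joined


training_rows = point_in_time_join(LOANS, USER_FEATURES)
for row in training_rows:
    print(row)
```

Loan 1001 receives the 22/10/01 row and loan 1002 the 23/11/01 row, which matches the joined table. The later 23/12/01 row for user 3001 is ignored because it did not exist yet when the loan was issued.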
How do we predict a loan transaction?
3. Incoming request
User ID | Loan user age | Loan user current income | Loan user current credit score | Avg loan late payment rate | Avg income | Avg credit score | Loan amount | Interest rate | Loan term | DTI rate
30003 | 30 | 2000 | 400 | 0 | 1800 | 500 | 10000 | 1.5 | 36 | 0.34
4. Predict loan default
Loan default rate: 28.9%
But where do these features come from?
What are the challenges with features?
"I want to spend my time optimizing models, but I end up spending too much time on data preparation."
• Spends most of the time creating and managing data pipelines
• Lack of discovery leads to duplicated work
• Doesn't want to think about how to get features into production
• Doesn't want to worry about data consistency between training and serving
What are the challenges with features?
"I need to manage many copies of data on fragmented data infrastructure, while handling ad hoc requests from data scientists."
• Needs to provision and manage fragmented data infrastructure
• Data processing does not scale to needs
• Serving features in production is hard
How does Feast come to the rescue?
Data Engineer
Data Scientist
MLE
Source: MLOps Tools Part 5: BigQuery + Memorystore vs. FEAST for Feature Store | Datatonic
Store: Feature registry
Entity
Data source
Feature view
Serve
• get_historical_features: retrieve point-in-time correct features for training
• get_online_features: retrieve the latest feature values for online inference
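A minimal sketch of the two retrieval paths. Both functions expect a `store` object that exposes Feast's `FeatureStore` methods, for example `FeatureStore(repo_path=".")`. The feature view name user_loan_stats, the feature names, and the user_id join key are assumptions for illustration.

```python
# Feature references use the "<feature view>:<feature>" format.
FEATURE_REFS = [
    "user_loan_stats:avg_late_payment_rate",
    "user_loan_stats:avg_income",
    "user_loan_stats:avg_credit_score",
]


def build_training_frame(store, entity_df):
    """Offline path: join historical feature values onto labelled events.

    entity_df is a DataFrame with the join key (user_id) and an
    event_timestamp column, one row per training example.
    """
    job = store.get_historical_features(entity_df=entity_df, features=FEATURE_REFS)
    return job.to_df()


def fetch_online_features(store, user_id):
    """Online path: read the latest feature values for one user."""
    response = store.get_online_features(
        features=FEATURE_REFS,
        entity_rows=[{"user_id": user_id}],
    )
    # to_dict() maps each name to a list with one value per entity row.
    return response.to_dict()
```

The same feature references are used in both paths, which is what keeps training and serving consistent.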
Serve: Feature service
Feature view 1
Feature view 2
Very long list of features to retrieve?
DEMO SCENARIO
Scenario
• As a Data Scientist (DS):
o I want to quickly experiment based on pre-calculated features
o So that I don't spend much time on duplicated work.
• As a Data Engineer:
o I want to select the final data version
o So that I can integrate it into the data pipeline.
• As a Machine Learning Engineer:
o I want to quickly select the latest version materialized from the data platform
o So that I can serve online inference APIs.
As a Data Scientist (1) - Create and register features (staging)
Source: What is a Feature Store? (feast.dev)
Step 1: Build the feature
Step 2: Save it to a storage location
Step 3: Register it in the feature store
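The three steps could look roughly like the sketch below. It uses a loan-to-income ratio as the example feature, assumed here to be loan amount divided by income. The column names, the function names, and the `feast_objects` argument are hypothetical. `store` is expected to be a Feast `FeatureStore`, and writing parquet requires a parquet engine such as pyarrow.

```python
import pandas as pd


def build_loan_to_income(loans: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    """Step 1: build the feature, one row per loan event."""
    df = loans.merge(users[["user_id", "income"]], on="user_id", how="left")
    # Assumed definition of the feature: loan amount relative to income.
    df["loan_to_income"] = df["loan_amount"] / df["income"]
    return df[["user_id", "datetime", "loan_to_income"]]


def save_and_register(features: pd.DataFrame, path: str, store, feast_objects) -> None:
    """Steps 2 and 3: persist the feature data, then register its definitions."""
    # Step 2: write to the location that the Feast data source points at.
    features.to_parquet(path, index=False)
    # Step 3: register entities and feature views in the Feast registry.
    store.apply(feast_objects)
```

With the slide's values, user 3001 (loan 10000, income 2000) gets a ratio of 5.0.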
As a Data Scientist (2) - Reuse features for other training
Source: What is a Feature Store? (feast.dev)
Feast UI
As a Data Engineer - Integrate into the data pipeline (production)
Source: link
Feast: CI/CD pipeline and Auto import features
As a Machine Learning Engineer
Source: A State of Feast
Get feature service
Get features: either from the request body or from a feature list
Source: https://github.com/ElliotNguyen68/feast_epam.git
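One possible reading of the serving flow above, as a minimal sketch: the handler uses the feature references sent in the request body when present and otherwise falls back to a registered feature service. The function name, the "features" request field, the user_id key, and the service name loan_default_v1 are assumptions. This is not the code from the linked repository.

```python
def get_features_for_request(store, request_body, service_name="loan_default_v1"):
    """Fetch online features for one inference request.

    store is expected to be a Feast FeatureStore; request_body is the
    parsed JSON body of the request and must contain "user_id".
    """
    # Use the feature list from the request body if one was sent,
    # otherwise look up the registered feature service.
    features = request_body.get("features") or store.get_feature_service(service_name)

    response = store.get_online_features(
        features=features,
        entity_rows=[{"user_id": request_body["user_id"]}],
    )
    # to_dict() returns {name: [one value per entity row]}; one row was requested.
    return {name: values[0] for name, values in response.to_dict().items()}
```

The returned dictionary can then be ordered into the model's input vector and passed to the loan default model.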
Future works
Datasource VectorDB
• Extend the DataSource class to vector databases such as Milvus, Pinecone, and PGVector.
• [Alpha] Data quality monitoring (link)
Discussion
Strengths and weaknesses
PROS
• Standardized and consistent features: ensures that the same feature definitions and transformations are applied across different models.
• Time-series support: provides point-in-time correctness, a key factor in financial or transactional data.
• Integration with other tools: Feast integrates well with popular MLOps tools such as Airflow, Dagster, MLflow, and K8s, along with data and model management tools.
• Online and offline support: helpful for real-time versus batch predictions; simplifies deploying and serving models by providing a unified way to manage all feature data for both training and prediction. Currently Feast supports Snowflake, BigQuery, PostgreSQL, MSSQL, Redshift, Athena, DuckDB, Trino, Ibis, Redis, DynamoDB, Cassandra, HBase, Rockset, Hazelcast, ...
• Data versioning: easy to manage and track different versions of features.
CONS
• Complexity: may be overkill for simple projects or teams just getting started with machine learning.
• Minimal UI: offers minimal user interface options for exploring and managing features.
• Limited database support: as of now, Feast supports only a limited number of databases, such as Google BigQuery, Redshift, Snowflake, Spark, and PostgreSQL.
• Documentation: while improving, the documentation for Feast is still a bit sparse, particularly for complex deployments or advanced use cases.
Extend beyond tabular data: Images and Text
Ideas
• Use Feast to manage the metadata of the extracted feature vectors (such as feature names, data types, and statistical properties), not the raw image and text data itself.
For image data
• Features could be raw pixel values or transformations of the images, such as edges, textures, and shapes. Pre-trained Convolutional Neural Networks (CNNs) can also be used to extract features.
For text data
• Features could be based on Natural Language Processing (NLP) techniques such as bag-of-words (BOW), TF-IDF, and word embeddings, or on sentence embeddings generated by models such as BERT or GPT.
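As a small illustration of turning text into a feature vector, here is a bag-of-words sketch in plain Python. The vocabulary, the sample sentence, and the doc_id key are made up for this example. In practice the vector would more likely come from TF-IDF or an embedding model, and would be stored as a list-valued feature keyed by an entity ID.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical fixed vocabulary shared by training and serving.
VOCAB = ["loan", "late", "payment", "income"]

# One feature row: an entity key plus the vector derived from the raw text.
doc_features = {
    "doc_id": 1,
    "bow_vector": bag_of_words("Late payment on loan and late fee", VOCAB),
}
print(doc_features)
```

Only the derived vector and its key are stored as a feature row; the raw text stays in its original storage.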
Source: link
Source: link
Key takeaways
• Understanding the needs and motivations for feature stores in data operations and collaboration.
• Data scientists can quickly experiment based on pre-calculated features without spending time on data preparation.
• Data engineers can integrate with data pipelines correctly by choosing the correct data version from the feature store.
• Machine Learning Engineers can serve online inference APIs with consistently calculated features.
• Understanding the process of storing and serving data using Feast.
• Feast is easy to integrate with existing MLOps systems.
• Feast can be extended to other data sources such as VectorDB.
• Feast can be extended beyond tabular data: images and text.
THANK YOU!
APPENDIX
Store: calculate feature (loan to income)
Registry's schema
