Transforming Feature Ideas
into Machine Learning Inputs
Xavier Conort
9th May 2023 - Singapore
Transforming Feature Ideas into actionable ML inputs
FeatureByte is a free and source-available platform designed to
empower data enthusiasts who love creating innovative features and
improving model accuracy through data.
With FeatureByte, you can:
● Quickly create features,
● Seamlessly experiment with features and
● Effortlessly deploy features
All without the headache of creating and managing new pipelines.
Free and source-available platform released on May 8!
You can access our repository at https://github.com/featurebyte/featurebyte and our documentation at
https://docs.featurebyte.com/0.2/. Please check it out and give us your feedback! We're eager to hear your thoughts and
improve FeatureByte to better suit your needs.
Agenda
● Quick introduction to FeatureByte Architecture
● Introduction to the Grocery dataset
● Transform 9 Feature Ideas
● Get training data and deploy our new features in production!
FeatureByte
Architecture
Your Databricks, Snowflake or Spark data warehouse is used as:
● data source
● compute engine
● feature store
The FeatureByte Service, which runs on Docker, acts
as the interface between the SDK and your data
warehouse.
It automatically manages tasks like validating
requests, scheduling and executing tasks, and storing
metadata.
A Python module transpiles the feature's logical plan
into platform-specific SQL.
Light-weight installation that leverages your Data Warehouse
You interact with the tool through the SDK, which
provides an intuitive framework for creating, sharing,
experimenting, deploying, and managing features.
Grocery dataset
Note: To access the Grocery dataset or try FeatureByte's tutorials, you can install a pre-built local Spark data warehouse with pre-populated data via
the command featurebyte.playground().
We will work with a Catalog where 4 tables were registered from a grocery dataset.
Dimension table with static descriptions of products
Item table with details on purchased products in customer invoices
Event table where each row represents an invoice
Slowly Changing Dimension table that contains customer data that changes over time
A catalog is an object that helps organize your
feature engineering assets by domain
Simple observation set to preview features
Observation sets are used to materialize historical
feature values. They combine entity values and
historical points-in-time that you want to learn from.
We will work here with a small Pandas DataFrame to
preview features at the moment customers make a purchase.
The observation set is retrieved from a sample of the
GROCERYINVOICE table’s view.
The column containing points-in-time must be named
‘POINT_IN_TIME’.
The column containing entity values must be named
with an accepted serving name as defined during the
entity registration.
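To make the shape of an observation set concrete, here is a minimal sketch in plain Python rather than the FeatureByte SDK. The serving name GROCERYCUSTOMERGUID, the sample values, the helper validate_observation_set, and the underscore spelling of the point-in-time column are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical serving name for the customer entity (an assumption for this sketch).
SERVING_NAME = "GROCERYCUSTOMERGUID"
# Assumed spelling of the point-in-time column name.
POINT_IN_TIME = "POINT_IN_TIME"

# Each row pairs an entity value with a historical point-in-time.
observation_set = [
    {POINT_IN_TIME: datetime(2023, 3, 1, 10, 30), SERVING_NAME: "cust-001"},
    {POINT_IN_TIME: datetime(2023, 4, 15, 18, 5), SERVING_NAME: "cust-002"},
]


def validate_observation_set(rows, serving_names):
    """Check that every row has a point-in-time and an accepted serving name."""
    for row in rows:
        if not isinstance(row.get(POINT_IN_TIME), datetime):
            return False
        if not any(name in row for name in serving_names):
            return False
    return True


is_valid = validate_observation_set(observation_set, {SERVING_NAME})
```

In practice the same two columns would sit in a Pandas DataFrame; a list of dictionaries is used here only to keep the sketch dependency-free.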
View objects are used to prepare data. They work like SQL
views but with a syntax similar to Pandas!
9 Feature ideas for the
Grocery dataset
9 feature ideas for the Customer entity
1. Frequency: the count of invoices by Customer in the past 12 weeks
2. Monetary: the amount spent by Customer in the past 12 weeks
3. Recency: the time since the Customer’s last invoice
4. Basket: the sum spent by Customer per Product Group in the past 12 weeks
5. Diversity: the entropy of the Customer’s basket per Product Group in the past 12 weeks
6. Stability: the cosine similarity of the Customer’s basket in the past 2 weeks compared to their
basket in the past 12 weeks
7. Similarity: the cosine similarity of the Customer’s basket compared to the baskets of Customers
living in the same state in the past 12 weeks
8. Regularity: the entropy of weekdays of the Customer’s invoices in the past 12 weeks
9. Change: whether the Customer’s Address changed in the past 12 weeks
Feature Idea 1: Frequency
Count of invoices by Customer in the past 12 weeks
The groupby() method groups a view by
columns representing entities.
The aggregate_over() method applies an aggregation
operation over a window prior to the observation point,
here 12 weeks.
Feature objects contain the logical plan to compute a feature
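The idea of aggregating over a window prior to the observation point can be sketched in plain Python. This is not the FeatureByte API: the function count_invoices, the sample invoices, and the choice to include the window start while excluding the observation point itself are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Made-up invoices: (customer id, invoice timestamp).
invoices = [
    ("cust-001", datetime(2023, 1, 2)),
    ("cust-001", datetime(2023, 3, 20)),
    ("cust-001", datetime(2023, 4, 28)),
    ("cust-002", datetime(2023, 4, 1)),
]


def count_invoices(invoices, customer, point_in_time, window=timedelta(weeks=12)):
    """Count a customer's invoices in [point_in_time - window, point_in_time)."""
    start = point_in_time - window
    return sum(
        1
        for cust, ts in invoices
        if cust == customer and start <= ts < point_in_time
    )


# Only the invoices of 20 March and 28 April fall inside the 12-week window.
frequency_12w = count_invoices(invoices, "cust-001", datetime(2023, 5, 1))
```

Excluding events at or after the observation point is what keeps the feature free of information from the future.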
Feature Idea 2: Monetary
Amount spent by Customer in the past 12 weeks
Aggregation operations include latest, count, sum,
average, minimum, maximum, and standard deviation.
Feature Idea 3: Recency
Time since the Customer’s last invoice
RequestColumn.point_in_time() enables the creation of
features that compare the value of another feature with
the observation point.
Another example of using this method would be to
derive a customer's age from their birth date and the
observation point.
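The comparison between a "latest event" value and the observation point can be sketched in plain Python. The function time_since_last_invoice and the sample timestamps are hypothetical and do not reproduce the SDK call.

```python
from datetime import datetime


def time_since_last_invoice(invoice_times, point_in_time):
    """Return the gap between the observation point and the latest earlier invoice."""
    past = [ts for ts in invoice_times if ts < point_in_time]
    if not past:
        return None  # no invoice before the observation point
    return point_in_time - max(past)


invoice_times = [
    datetime(2023, 4, 1, 9, 0),
    datetime(2023, 4, 28, 9, 0),
    datetime(2023, 5, 3, 9, 0),  # after the observation point, so ignored
]
recency = time_since_last_invoice(invoice_times, datetime(2023, 5, 1, 9, 0))
recency_days = recency.total_seconds() / 86400
```

The customer-age example works the same way: subtract the birth date from the observation point instead of the last invoice timestamp.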
Feature Idea 4: Basket
Sum spent by Customer per Product Group in the past 2 and 12 weeks
Aggregation can be done across a categorical column,
here ProductGroup.
When the feature is materialized, a dictionary is
returned.
In our example, the keys represent the ProductGroup,
and their corresponding values represent the total
amount spent on each Product Group.
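A dictionary feature of this kind can be pictured with a short plain-Python sketch. The item rows and the helper basket_by_group are made up for illustration, and the rows are assumed to be already restricted to the aggregation window.

```python
from collections import defaultdict

# Made-up item rows inside the window: (ProductGroup, amount spent).
items = [("Fruits", 4.0), ("Dairy", 2.5), ("Fruits", 1.0), ("Bakery", 3.0)]


def basket_by_group(items):
    """Sum the amount spent per product group and return it as a dictionary."""
    totals = defaultdict(float)
    for group, amount in items:
        totals[group] += amount
    return dict(totals)


basket = basket_by_group(items)
```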
Feature Idea 5: Diversity
Entropy of the Customer’s basket in the past 12 weeks
A dictionary feature can be transformed into another
feature. Here we apply entropy.
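For readers who want the arithmetic, the following plain-Python sketch computes the Shannon entropy of a basket dictionary. The use of the natural logarithm and the handling of empty baskets are assumptions; the platform's own conventions may differ.

```python
import math


def entropy(basket):
    """Shannon entropy (natural log) of the value shares in a dictionary."""
    total = sum(basket.values())
    if total <= 0:
        return 0.0  # assumed convention for an empty basket
    shares = [v / total for v in basket.values() if v > 0]
    return -sum(p * math.log(p) for p in shares)


# Spending split evenly across two groups gives the maximum for two keys.
even = entropy({"Fruits": 5.0, "Dairy": 5.0})
# Spending concentrated in one group gives zero diversity.
single = entropy({"Fruits": 10.0})
```

Higher values indicate spending spread across more product groups, which is why entropy serves as a diversity measure.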
Feature Idea 6: Stability
Cosine Similarity of the Customer’s basket in the past 2 weeks vs their
basket in the past 12 weeks
Two dictionary features with different
windows can be compared using cosine
similarity.
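Cosine similarity between two dictionaries can be sketched as follows in plain Python. Treating missing keys as zero and returning 0.0 when either dictionary is empty are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two dictionaries, treating missing keys as zero."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention when a basket is empty
    return dot / (norm_a * norm_b)


basket_2w = {"Fruits": 2.0, "Dairy": 1.0}
basket_12w = {"Fruits": 4.0, "Dairy": 2.0}
# Same proportions in both windows, so the baskets point in the same direction.
stability = cosine_similarity(basket_2w, basket_12w)
# No product group in common, so the similarity is zero.
disjoint = cosine_similarity({"Fruits": 1.0}, {"Bakery": 1.0})
```

Because cosine similarity depends only on proportions, a customer who spends less in 2 weeks than in 12 weeks still scores close to 1 when the mix of product groups is unchanged.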
Feature Idea 7: Similarity
Cosine Similarity of the Customer’s basket in the past 12 weeks
vs the baskets of Customers living in the same state
When two dictionary features are grouped by two different
entities:
● they can still be compared using cosine similarity.
● the entity of the resulting feature is determined by
the relationship between the two entities.
In our example, the primary entity of
“Customer_Similarity_with_State_12w” is
grocerycustomer because grocerycustomer
is a child of the frenchstate entity.
The feature can be served by providing only
the grocerycustomer entity values. Values
for the frenchstate entity are derived
automatically by FeatureByte.
Feature Idea 8: Regularity
Entropy of weekdays of the Customer’s invoices in the past 12 weeks
In this example, the keys of the dictionary will represent
the weekday, and their corresponding values will
represent the total amount spent on each weekday.
The weekday dictionary feature is then transformed
into another feature by applying entropy.
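The two steps described here, building a weekday dictionary and then taking its entropy, can be sketched in plain Python. The sample invoices, the helper names, and the natural-log entropy are assumptions for illustration only.

```python
import math
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up invoices inside the window: (invoice date, amount spent).
invoices = [
    (date(2023, 4, 3), 10.0),   # Monday
    (date(2023, 4, 10), 10.0),  # Monday
    (date(2023, 4, 15), 20.0),  # Saturday
]


def weekday_totals(invoices):
    """Sum the amount spent per weekday name."""
    totals = defaultdict(float)
    for day, amount in invoices:
        totals[WEEKDAYS[day.weekday()]] += amount
    return dict(totals)


def entropy(values_by_key):
    """Shannon entropy (natural log) of the value shares in a dictionary."""
    total = sum(values_by_key.values())
    if total <= 0:
        return 0.0
    return -sum((v / total) * math.log(v / total)
                for v in values_by_key.values() if v > 0)


totals = weekday_totals(invoices)
regularity = entropy(totals)
```

A customer who always shops on the same weekday gets an entropy of zero; spending spread evenly over many weekdays pushes the value up.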
Feature Idea 9: Change
Whether Customer’s Address Changed in the past 12 weeks
A Change View can be created from a Slowly Changing Dimension
(SCD) table to analyze changes that occur in an attribute, here the
StreetAddress of the grocerycustomer.
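As a rough picture of what detecting a change in an SCD attribute involves, the plain-Python sketch below scans one customer's address history. The history rows, the function address_changed, and the window boundaries are hypothetical and do not reproduce the Change View API.

```python
from datetime import datetime, timedelta

# Made-up SCD rows for one customer: (effective timestamp, StreetAddress).
address_history = [
    (datetime(2022, 6, 1), "1 rue de la Paix"),
    (datetime(2023, 3, 15), "8 avenue Foch"),
]


def address_changed(history, point_in_time, window=timedelta(weeks=12)):
    """True if a different address became effective within the window."""
    rows = sorted(history)
    start = point_in_time - window
    # Compare each record with the one before it; the first record is not a change.
    for (_, previous), (effective, current) in zip(rows, rows[1:]):
        if start <= effective < point_in_time and current != previous:
            return True
    return False


changed = address_changed(address_history, datetime(2023, 5, 1))
unchanged = address_changed(address_history, datetime(2023, 1, 1))
```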
Get training data and
deploy our 9 features in
production!
Audit one feature via its Definition file
The feature definition file is critical for maintaining the integrity
of a feature. It serves as the single source of truth for the feature,
providing an explicit outline of the intended operations of the
feature declaration, including inherited operations.
The file is automatically generated when a feature is declared or a
new version is derived. It is used to generate the final logical
execution graph, which is then transpiled into platform-specific
SQL for feature materialization.
Create feature list
Compute training data
You can choose to store your training data in the feature store for
reuse or audit.
If you connect FeatureByte to your data warehouse, the
computation of features is performed in the data warehouse,
taking advantage of its scalability, stability, and efficiency.
In this case, you can work with much larger observation tables with
millions of rows.
Deploy and get shell template for REST API
Thank You!
To get started, check out
Repo: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.2/
Catalog objects help organize
assets by domain
View objects are used to prepare data. They work like SQL
views but with a syntax similar to Pandas!
Table objects centralize essential
metadata for feature engineering.
4 types: event table, item table,
SCD table and dimension table.
Entity objects and relationships facilitate
feature organization and serving
Feature objects contain the logical plan to compute a feature
Python SDK to register, share and manage tables, features and
more…
