FeatureByte
Maximizing Your ML Success with Innovative Feature Engineering
30th May 2023
Xavier Conort @ Data Science Sydney
Agenda:
● Feature engineering is a creative process. Never stop looking for new sources of inspiration.
○ Practical examples of how I got inspired over the years.
● Name and organize your ideas to make them more repeatable and visible.
○ Practical examples of how I used signal types to organize my feature ideas.
● Convert feature ideation into actionable ML inputs.
○ Practical examples using a Grocery dataset and FeatureByte.
“Creativity is just connecting things” (Steve Jobs)
Learn from your peers
During the KDD Cup 2015, Owen Zhang taught me how to use entropy to measure the diversity of a categorical distribution.
● We applied Owen’s entropy trick to create features that measured the seasonality of students’ logs on a MOOC platform: entropy of the day of the week, entropy of the hour of the day, entropy of the hour of the week, ...
● The features were highly predictive of the likelihood of student dropout and dramatically simplified our winning solution!
● I consider Owen’s entropy idea one of the most powerful techniques in feature engineering. Entropy can capture a diversity signal, and also seasonality when applied to date parts.
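As a rough illustration of the entropy idea on this slide, here is a minimal sketch in plain Python. The function name, the page labels and the login dates are made up for the example and are not taken from the KDD Cup solution; the natural logarithm is an arbitrary choice.

```python
import math
from collections import Counter
from datetime import datetime


def entropy(labels):
    """Shannon entropy (natural log) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # sum of p * log(1 / p) over the observed categories
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Diversity: one kind of activity only vs. four different kinds (hypothetical labels).
focused = ["video", "video", "video", "video"]
varied = ["video", "forum", "quiz", "wiki"]

# Seasonality: the same function applied to a date part (day of the week).
# These four hypothetical logins are exactly one week apart.
logins = [datetime(2015, 6, day, 20, 0) for day in (1, 8, 15, 22)]
weekday_entropy = entropy(ts.weekday() for ts in logins)
```

Low entropy means the events concentrate on few categories (or few weekdays); the maximum for k equally likely categories is log(k).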
Don’t ignore old tricks
I may have been one of the first Kagglers to apply stacking to ensemble models. I then used stacking to convert high-dimensional categorical or text columns into numeric features. Now, stacking is one of the most popular techniques in Data Science.
I didn’t invent stacking, but I may have been one of the first to benefit from it. All thanks to an old paper by David Wolpert (1992) that Google suggested and that helped me finish in 2nd place in my first Kaggle competition!
Don’t look down on industry specific practices
A defining moment in my career was winning the GE Flight Quest, and I owe part of that success to a valuable insurance practice, two-stage modeling:
● First, you learn from features you trust and for which you can see causal relationships.
● Second, you learn the residual effects of features that may expose your solution to bias.
Thanks to it, I could deal with a short history of 3 months and still learn the effect of airports in a generalized way. Without it, I might have overfitted the airport effect and learnt less from the most important cause of flight delays: bad weather.
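The two stages can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the Flight Quest solution: the data is invented, stage one is a one-variable least-squares fit, and the shrinkage in stage two (the hypothetical `prior_weight` parameter) is an added assumption to show how a small group's effect can be kept conservative.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def shrunken_residuals(groups, residuals, prior_weight=5.0):
    """Average residual per group, shrunk toward zero for small groups."""
    total, count = defaultdict(float), defaultdict(int)
    for group, residual in zip(groups, residuals):
        total[group] += residual
        count[group] += 1
    return {g: total[g] / (count[g] + prior_weight) for g in total}


# Hypothetical toy data: weather severity score, airport code, delay in minutes.
weather = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
airport = ["SYD", "SYD", "SYD", "SYD", "MEL", "MEL", "MEL", "MEL"]
delay = [7.0, 17.0, 27.0, 37.0, 3.0, 13.0, 23.0, 33.0]

# Stage 1: learn from the trusted, causal feature only.
intercept, slope = fit_line(weather, delay)
stage1 = [intercept + slope * w for w in weather]

# Stage 2: learn what is left over for the riskier feature.
residuals = [d - p for d, p in zip(delay, stage1)]
airport_effect = shrunken_residuals(airport, residuals)

prediction = [p + airport_effect[a] for p, a in zip(stage1, airport)]
```

Because the airport only sees the residuals, it cannot absorb variation that weather already explains.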
Think outside the box
Like many data scientists, I learned to use cosine similarity whilst working with text and images.
I eventually realized that cosine similarity could be applied to other data vectors, such as customer baskets, distributions over days of the week, or any distribution across categories…
I now use cosine similarity a lot to measure stability over time or similarity between parent entities!
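To make this concrete, here is a small sketch of cosine similarity between two baskets stored as dictionaries (category to amount). The basket contents are invented, and returning 0.0 for an empty basket is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty baskets
    return dot / (norm_a * norm_b)


# Hypothetical baskets: amount spent per product group.
recent_basket = {"fruit": 3.0, "dairy": 4.0}
past_basket = {"fruit": 6.0, "dairy": 8.0}   # same mix, twice the spend
new_habits = {"snacks": 5.0}                 # no category in common

same_mix = cosine_similarity(recent_basket, past_basket)
no_overlap = cosine_similarity(recent_basket, new_habits)
```

Cosine similarity ignores the overall spend level and only compares the mix, which is why it suits stability and similarity signals.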
Name and Organize Feature Ideas
A practical example with Signal Types
Name and Organize your ideas
For many years, I relied on reflexology to kickstart my feature ideation process!
Once I realized I was forgetting prior ideas, I decided to be more organized and began classifying my feature ideas into distinct signal types.
This simple yet effective approach has changed the way I organize my thoughts and sparked fresh ideas!
It also helps me communicate better: it gives me a framework for discussing and evaluating ideas, and for convincing stakeholders unfamiliar with feature engineering that there is a whole world beyond RFM (Recency, Frequency, Monetary).
Recency Signals
Attributes of the latest event to occur.
Metrics
● Time since last event.
● Label of last event.
● Magnitude of last event.
Examples
● How much did a customer spend on their last purchase?
● How long since the last equipment failure?
● Was a patient’s last visit for an emergency?
Frequency Signals
How often do events occur?
Metrics
● Number of events occurring over a time window.
Examples
● How many hospital admissions has the patient had in the past 12 months?
● How many international phone calls has a customer made in the past month?
Monetary Signals
What are the monetary amounts of events over a time window?
Metrics
● Total cost of events occurring over a time window.
● Average cost of events occurring over a time window.
Examples
● How much did a customer spend over the past 12 months?
● What was the average discount applied to a customer’s purchases over the past month?
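The recency, frequency and monetary signals can be sketched together for a single customer. The purchase history, the observation point and the 12-week window below are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical purchase history for one customer: (timestamp, amount).
purchases = [
    (datetime(2023, 1, 10), 40.0),
    (datetime(2023, 4, 2), 25.0),
    (datetime(2023, 5, 20), 35.0),
]
observation_point = datetime(2023, 5, 30)
window = timedelta(weeks=12)

# Only events inside the window and strictly before the observation point count.
in_window = [
    (ts, amount)
    for ts, amount in purchases
    if observation_point - window <= ts < observation_point
]

# Recency: attributes of the latest event before the observation point.
last_ts, last_amount = max(p for p in purchases if p[0] < observation_point)
recency_days = (observation_point - last_ts).days

# Frequency and monetary: count, total and average over the window.
frequency = len(in_window)
monetary_total = sum(amount for _, amount in in_window)
monetary_avg = monetary_total / frequency if frequency else 0.0
```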
Seasonality Signals
Any seasonality in the timing of events.
Metrics
● Entropy of the day of the week (or any other date part) of events.
● Cross-aggregation of events across date parts.
Examples
● Does a customer always shop on the same days of the week?
● Do equipment failures tend to occur at different times of the day or times of year?
Clumpiness Signals
Do events occur randomly across time, or are there long gaps between intense bursts of activity?
Metrics
● Coefficient of variation of inter-event times.
● Log utility of inter-event times.
Examples
● Which customers are binge-watching TV shows?
● Do equipment failures occur in cascades?
● After a long break without purchasing, does a customer return to a store a few times over several days to purchase items?
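The first clumpiness metric can be sketched as below. The event dates are invented, and the use of the population standard deviation is a choice made for this example.

```python
import statistics
from datetime import datetime


def inter_event_cv(timestamps):
    """Coefficient of variation (std / mean) of the gaps between events."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # not enough gaps to measure variability
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return None  # all events share one timestamp
    return statistics.pstdev(gaps) / mean_gap


# Hypothetical event dates: one event per week vs. two bursts with a long gap.
regular = [datetime(2023, 1, day) for day in (1, 8, 15, 22, 29)]
bursty = [datetime(2023, 1, day) for day in (1, 2, 3, 28, 29)]

cv_regular = inter_event_cv(regular)
cv_bursty = inter_event_cv(bursty)
```

A value near 0 indicates evenly spaced events; values above 1 indicate bursts separated by long gaps.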
Change Signals
Have usually static attributes changed? By how much?
Metrics
● Did an attribute change?
● How many times did an attribute change?
● By how much did it change?
Examples
● Did a customer change address over the past two years? How far did they move?
● Has a password been reset in the past 48 hours?
● Was network equipment replaced with a different make or model?
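A change signal can be derived from a history of attribute values, as in this sketch. The address history and the 12-week window are hypothetical, and the record layout (effective date, value) is an assumption made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical slowly changing records for one customer: (effective_from, address).
address_history = [
    (datetime(2021, 3, 1), "12 George St"),
    (datetime(2023, 4, 15), "8 Pitt St"),
    (datetime(2023, 5, 1), "8 Pitt St"),  # re-recorded, not a real change
]


def count_changes(history, start, end):
    """Count records in [start, end) whose value differs from the previous record."""
    ordered = sorted(history)
    changes = 0
    for (_, previous), (ts, current) in zip(ordered, ordered[1:]):
        if start <= ts < end and current != previous:
            changes += 1
    return changes


observation_point = datetime(2023, 5, 30)
window_start = observation_point - timedelta(weeks=12)
n_changes = count_changes(address_history, window_start, observation_point)
address_changed = n_changes > 0
```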
Basket Signals
What are the counts or amounts of events or available resources, cross-categorized by an item label?
Metrics
● Counts of item types.
● Sum of values across item types.
● Max values of item types.
Examples
● What are the counts of each item type in a shopping basket?
● What is the total value of each item type in a customer’s shipping orders over 30 days?
● What is the maximum weight of each type of item in a shipping order?
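The three basket metrics amount to aggregating values per item label into a dictionary. A minimal sketch with invented invoice items:

```python
from collections import defaultdict

# Hypothetical invoice items: (product_group, quantity, amount).
items = [
    ("fruit", 2, 6.0),
    ("dairy", 1, 3.5),
    ("fruit", 1, 2.0),
    ("bakery", 3, 9.0),
]

counts = defaultdict(int)     # count of items per product group
totals = defaultdict(float)   # sum of amounts per product group
maxima = {}                   # largest single amount per product group
for group, quantity, amount in items:
    counts[group] += quantity
    totals[group] += amount
    maxima[group] = max(maxima.get(group, amount), amount)
```

Each result is keyed by the item label, so two such dictionaries can later be compared (for example with cosine similarity) or summarized (for example with entropy).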
Diversity Signals
How variable are the data values?
Metrics
● Coefficient of variation of amounts.
● Entropy of labels.
Examples
● How variable are the monthly invoice amounts over the past 6 months?
● How stable is a patient's blood pressure?
● Does a customer always purchase similar products?
Similarity Signals
How similar is one entity to a group it is related to?
Metrics
● Ratio of one entity's numeric value to the average or max of values over a group.
● Z-score.
● Cosine similarity of two parent entities’ cross-aggregations.
Examples
● How similar are one customer's purchases to those of customers of the same age and gender?
● What is the ratio of a network antenna's maximum temperature to that of all other antennas that are the same model, or located in the same geography?
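The ratio and z-score metrics can be sketched as follows. The temperatures are invented; for simplicity the group here includes the entity itself, whereas the antenna example on the slide compares against all other antennas.

```python
import statistics

# Hypothetical maximum temperatures (°C) for antennas of the same model.
max_temp_by_antenna = {"A1": 61.0, "A2": 59.0, "A3": 60.0, "A4": 80.0}


def group_similarity(entity, values):
    """Ratio to the group mean and z-score of one entity within its group."""
    group = list(values.values())
    mean = statistics.mean(group)
    std = statistics.pstdev(group)
    ratio = values[entity] / mean
    z_score = (values[entity] - mean) / std if std else 0.0
    return ratio, z_score


ratio, z_score = group_similarity("A4", max_temp_by_antenna)
```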
Stability Signals
Are an entity’s recent events similar to the past for the same entity?
Metrics
● Ratio of the latest numeric value to the average or max of values over a historical time window.
● Cosine similarity of two cross-aggregations from different time periods.
Examples
● Are the contents of the latest shopping basket similar to past purchases?
● Is the average of numeric event data over the most recent period similar to the past?
● Is a recent data value anomalous (Z-score)?
Location Signals
The static or dynamic location of an event or entity.
Metrics
● The latitude and longitude of a static entity.
● Maximum distance between an entity and other entities.
● Speed and acceleration of a moving entity.
Examples
● What is the distance between the shop and the customer address?
● Is the shipping address in a remote location? What is the population density in that state?
● How fast has a vehicle moved in the past hour?
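Distance and speed can be sketched with the haversine great-circle formula. The coordinates are illustrative (roughly Sydney and Melbourne), and a spherical Earth with a 6371 km radius is an approximation.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Distance between a shop and a customer address (hypothetical coordinates).
shop = (-33.8688, 151.2093)
customer = (-37.8136, 144.9631)
distance_km = haversine_km(*shop, *customer)

# Speed of a moving entity from two position fixes taken one hour apart.
hours_elapsed = 1.0
speed_kmh = haversine_km(-33.8688, 151.2093, -33.8688, 152.2093) / hours_elapsed
```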
Grocery dataset
We will work with a Catalog in which 4 tables from a grocery dataset are registered:
● A Dimension table with static descriptions of products.
● An Item table with details on the products purchased in customer invoices.
● An Event table where each row represents an invoice.
● A Slowly Changing Dimension table that contains customer data that changes over time.
A catalog is an object that helps organize your feature engineering assets by domain.
To access the Grocery dataset or play with FeatureByte's tutorials, you can install a pre-built local Spark data warehouse with pre-populated data via the command featurebyte.playground().
10 Feature ideas for the Grocery dataset
10 feature ideas for the Customer entity
1. Frequency: the count of invoices by Customer in the past 12 weeks
2. Monetary: the amount spent by Customer in the past 12 weeks
3. Recency: the time since the Customer’s last invoice
4. Basket: the sum spent by Customer per Product Group in the past 12 weeks
5. Diversity: the entropy of the Customer’s basket per Product Group in the past 12 weeks
6. Stability: the cosine similarity of the Customer’s basket in the past 2 weeks compared to their basket in the past 12 weeks
7. Similarity: the cosine similarity of the Customer’s basket compared to the baskets of Customers living in the same state in the past 12 weeks
8. Seasonality: the entropy of weekdays of the Customer’s invoices in the past 12 weeks
9. Change: whether the Customer’s Address changed in the past 12 weeks
10. Clumpiness: coefficient of variation of Customer inter-event time in the past 12 weeks
Practical examples using FeatureByte
FeatureByte is a free and source-available platform designed to empower data enthusiasts who love creating innovative features and improving model accuracy through data.
With FeatureByte, you can:
● Quickly create and share features,
● Seamlessly experiment with features, and
● Effortlessly deploy features in production.
All without the headache of creating and managing new pipelines.
No large-scale outbound data transfers. Computation and feature storage are done where the data sits, to leverage the scalability, stability, and efficiency of your data warehouse (DWH).
Feature Idea 1: Frequency
Count of invoices by Customer in the past 12 weeks
The groupby() method groups a view by the columns that represent entities.
The aggregate_over() method aggregates over a window that ends at the observation point, here 12 weeks.
Feature objects contain the logical plan to compute a feature.
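The idea of a feature as a plan that is only evaluated for a given entity and observation point can be imitated in plain Python. This sketch is not the FeatureByte API: `count_feature`, the invoice rows and the observation table are all hypothetical, and the window handling is simplified.

```python
from datetime import datetime, timedelta

# Hypothetical invoices: (customer_id, invoice_timestamp).
invoices = [
    ("c1", datetime(2023, 1, 5)),
    ("c1", datetime(2023, 4, 20)),
    ("c1", datetime(2023, 5, 25)),
    ("c2", datetime(2023, 5, 1)),
    ("c1", datetime(2023, 6, 2)),
]


def count_feature(events, window):
    """Return a plan: a function of (entity, point_in_time), evaluated only when called."""
    def compute(entity, point_in_time):
        start = point_in_time - window
        # Events at or after the observation point are never counted.
        return sum(
            1 for key, ts in events
            if key == entity and start <= ts < point_in_time
        )
    return compute


invoice_count_12w = count_feature(invoices, timedelta(weeks=12))

# Hypothetical observation table: the same customer can appear at several points in time.
observation_table = [
    ("c1", datetime(2023, 5, 30)),
    ("c1", datetime(2023, 2, 1)),
    ("c2", datetime(2023, 5, 30)),
]
training_rows = [(e, t, invoice_count_12w(e, t)) for e, t in observation_table]
```

Because the window always ends at the observation point, the same definition yields different values for the same customer at different points in time without using future data.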
Feature Idea 2: Monetary
Amount spent by Customer in the past 12 weeks
Aggregation operations include latest, count, sum,
average, minimum, maximum, and standard deviation.
Feature Idea 3: Recency
Time since the Customer’s last invoice
RequestColumn.point_in_time() enables the creation of features that compare the value of another feature with the observation point.
Another example of using this method would be to derive a customer's age from their birth date and the observation point.
Feature Idea 4: Basket
Sum spent by Customer across Product Group in the past 2 and 12 weeks
Aggregation can be done across a categorical column, here ProductGroup. We call this Cross-Aggregation!
When the feature is materialized, a dictionary is returned.
In our example, the keys represent the ProductGroup, and their corresponding values represent the total amount spent on each Product Group.
Feature Idea 5: Diversity
Entropy of the Customer’s basket in the past 12 weeks
A dictionary feature can be transformed into another feature. Here we apply entropy.
Feature Idea 6: Stability
Cosine Similarity of the Customer’s basket in the past 2 weeks vs their basket in the past 12 weeks
Two dictionary features with different windows can be compared using cosine similarity.
Feature Idea 7: Similarity
Cosine Similarity of the Customer’s basket in the past 12 weeks vs the baskets of Customers living in the same state
When two dictionary features are grouped by two different entities:
● they can still be compared using cosine similarity.
● the entity of the resulting feature is determined by the relationship between the two entities.
In our example, the primary entity of “Customer_Similarity_with_State_12w” is grocerycustomer because grocerycustomer is a child of the frenchstate entity.
The feature can be served by providing only the grocerycustomer entity values. Values for the frenchstate entity are derived automatically by FeatureByte.
Feature Idea 8: Seasonality
Entropy of weekdays of the Customer’s invoices in the past 12 weeks
In this example, the keys of the dictionary represent the weekday, and their corresponding values represent the total amount spent on each weekday.
Here the weekday dictionary feature is transformed into another feature by applying entropy.
Feature Idea 9: Change
Whether Customer’s Address Changed in the past 12 weeks
A Change View can be created from a Slowly Changing Dimension (SCD) table to analyze changes that occur in an attribute, here the StreetAddress of the grocerycustomer.
Feature Idea 10: Clumpiness
Coefficient of variation (CV) of Customer Inter-Event time in the past 12 weeks
Compute the inter-event time as the difference between each invoice timestamp and the same customer's previous invoice timestamp (lag).
Get training data and deploy our 10 features in production!
Audit one feature via its Definition file:
The feature definition file is critical for maintaining the integrity
of a feature. It serves as the single source of truth for the feature,
providing an explicit outline of the intended operations of the
feature declaration, including inherited operations.
The file is automatically generated when a feature is declared or a
new version is derived. It is used to generate the final logical
execution graph, which is then transpiled into platform-specific
SQL for feature materialization.
Create feature list
Compute training data
You can choose to store your training data in the feature store for
reuse or audit.
If you connect FeatureByte to your data warehouse, the
computation of features is performed in the data warehouse,
taking advantage of its scalability, stability, and efficiency.
In this case, you can work with much larger observation tables with
millions of rows.
Deploy and get shell template for REST API:
Conclusion
Conclusion: “Creativity is just connecting things” (Steve Jobs)
While you may have some original ideas, the majority will be inspired by others. Embrace these ideas and make them your own.
Organize and name your feature ideas. This will help you remember and communicate them, and the process can spark new ideas along the way!
Consider using tools that streamline the conversion of your feature ideas into production-ready machine learning inputs. This will:
● help you focus on what really matters: feature ideation
● maximize the success of your ML projects.
Thank You!
To get started, check out
Repo: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.2/

More Related Content

Similar to Maximizing Your ML Success with Innovative Feature Engineering

Webinar: Analytics with NoSQL: Why, for What, and When?
Webinar: Analytics with NoSQL: Why, for What, and When?Webinar: Analytics with NoSQL: Why, for What, and When?
Webinar: Analytics with NoSQL: Why, for What, and When?MongoDB
 
Leverage The Power of Small Data
Leverage The Power of Small DataLeverage The Power of Small Data
Leverage The Power of Small DataKaryn Zuidinga
 
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture Series
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture SeriesCalculating Your Customer Lifetime Value - Dawn of the Data Age Lecture Series
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture SeriesLuciano Pesci, PhD
 
INUSE Seminar May 8, 2012: Hyysalo
INUSE Seminar May 8, 2012: HyysaloINUSE Seminar May 8, 2012: Hyysalo
INUSE Seminar May 8, 2012: Hyysaloinuseproject
 
MS5103BusinessAnalyticsProject
MS5103BusinessAnalyticsProjectMS5103BusinessAnalyticsProject
MS5103BusinessAnalyticsProjectBrian Connolly
 
Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Looker
 
Digicorp - Supply Chain Analytics Apps
Digicorp - Supply Chain Analytics AppsDigicorp - Supply Chain Analytics Apps
Digicorp - Supply Chain Analytics AppsDigicorp
 
An Agile approach to Business Metrics
An Agile approach to Business MetricsAn Agile approach to Business Metrics
An Agile approach to Business MetricsPablo Valcárcel
 
LPCx Barcelona: How to use the design thinking methodology to revamp your API?
LPCx Barcelona: How to use the design thinking methodology to revamp your API?LPCx Barcelona: How to use the design thinking methodology to revamp your API?
LPCx Barcelona: How to use the design thinking methodology to revamp your API?Thiga
 
IT-106 Pseudo-Coding Wk 5
IT-106 Pseudo-Coding Wk 5IT-106 Pseudo-Coding Wk 5
IT-106 Pseudo-Coding Wk 5Mark Simon
 
Disciplined Entrepreneurship: What can you do for your customer?
Disciplined Entrepreneurship: What can you do for your customer?Disciplined Entrepreneurship: What can you do for your customer?
Disciplined Entrepreneurship: What can you do for your customer?Elaine Chen
 
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & Gevers
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & GeversFuture proof event on 13 sept 18 - Innovation & IP - by Bagaar & Gevers
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & GeversDavid Gillain
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAraf Karsh Hamid
 
Driving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningDriving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningCCG
 
MVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsMVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsBoost Labs
 
User stories — how to cook a cat?
User stories — how to cook a cat?User stories — how to cook a cat?
User stories — how to cook a cat?Vladimir Tarasov
 
Harnessing the content beast – Content marketing in the multiscreen world
Harnessing the content beast – Content marketing in the multiscreen worldHarnessing the content beast – Content marketing in the multiscreen world
Harnessing the content beast – Content marketing in the multiscreen worldThomas Robbins
 

Similar to Maximizing Your ML Success with Innovative Feature Engineering (20)

Webinar: Analytics with NoSQL: Why, for What, and When?
Webinar: Analytics with NoSQL: Why, for What, and When?Webinar: Analytics with NoSQL: Why, for What, and When?
Webinar: Analytics with NoSQL: Why, for What, and When?
 
Leverage The Power of Small Data
Leverage The Power of Small DataLeverage The Power of Small Data
Leverage The Power of Small Data
 
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture Series
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture SeriesCalculating Your Customer Lifetime Value - Dawn of the Data Age Lecture Series
Calculating Your Customer Lifetime Value - Dawn of the Data Age Lecture Series
 
INUSE Seminar May 8, 2012: Hyysalo
INUSE Seminar May 8, 2012: HyysaloINUSE Seminar May 8, 2012: Hyysalo
INUSE Seminar May 8, 2012: Hyysalo
 
MS5103BusinessAnalyticsProject
MS5103BusinessAnalyticsProjectMS5103BusinessAnalyticsProject
MS5103BusinessAnalyticsProject
 
Data Science and Future of Retail: Beacon analytics
Data Science and Future of Retail: Beacon analyticsData Science and Future of Retail: Beacon analytics
Data Science and Future of Retail: Beacon analytics
 
Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...
 
Digicorp - Supply Chain Analytics Apps
Digicorp - Supply Chain Analytics AppsDigicorp - Supply Chain Analytics Apps
Digicorp - Supply Chain Analytics Apps
 
An Agile approach to Business Metrics
An Agile approach to Business MetricsAn Agile approach to Business Metrics
An Agile approach to Business Metrics
 
LPCx Barcelona: How to use the design thinking methodology to revamp your API?
LPCx Barcelona: How to use the design thinking methodology to revamp your API?LPCx Barcelona: How to use the design thinking methodology to revamp your API?
LPCx Barcelona: How to use the design thinking methodology to revamp your API?
 
IT-106 Pseudo-Coding Wk 5
IT-106 Pseudo-Coding Wk 5IT-106 Pseudo-Coding Wk 5
IT-106 Pseudo-Coding Wk 5
 
2019 01-design thinking-for architects
2019 01-design thinking-for architects2019 01-design thinking-for architects
2019 01-design thinking-for architects
 
Disciplined Entrepreneurship: What can you do for your customer?
Disciplined Entrepreneurship: What can you do for your customer?Disciplined Entrepreneurship: What can you do for your customer?
Disciplined Entrepreneurship: What can you do for your customer?
 
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & Gevers
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & GeversFuture proof event on 13 sept 18 - Innovation & IP - by Bagaar & Gevers
Future proof event on 13 sept 18 - Innovation & IP - by Bagaar & Gevers
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven Design
 
Stories, Backlog & Mapping
Stories, Backlog & MappingStories, Backlog & Mapping
Stories, Backlog & Mapping
 
Driving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningDriving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine Learning
 
MVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsMVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost Labs
 
User stories — how to cook a cat?
User stories — how to cook a cat?User stories — how to cook a cat?
User stories — how to cook a cat?
 
Harnessing the content beast – Content marketing in the multiscreen world
Harnessing the content beast – Content marketing in the multiscreen worldHarnessing the content beast – Content marketing in the multiscreen world
Harnessing the content beast – Content marketing in the multiscreen world
 

Recently uploaded

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Ulm U学位证,乌尔姆大学毕业证书1:1制作
Ulm U学位证,乌尔姆大学毕业证书1:1制作Ulm U学位证,乌尔姆大学毕业证书1:1制作
Ulm U学位证,乌尔姆大学毕业证书1:1制作ys8omjxb
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
SWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxSWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxviniciusperissetr
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 

Recently uploaded (20)

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 

Maximizing Your ML Success with Innovative Feature Engineering

  • 1. FeatureByte Maximizing Your ML Success with Innovative Feature Engineering 30th May 2023 Xavier Conort @ Data Science Sydney
  • 2. Agenda: ● Feature engineering is a creative process. Never stop looking for new sources of inspiration. ○ Practical examples of how I got inspired over the years. ● Name and organize your ideas to make them more repeatable and visible. ○ Practical examples of how I used signal types to organize my feature ideas. ● Convert feature ideation into actionable ML inputs. ○ Practical examples using a Grocery dataset and FeatureByte.
  • 4. Learn from your peers: During the KDD Cup 2015, Owen Zhang taught me how to use entropy to measure diversity from a categorical distribution. ● We applied Owen’s entropy trick to create features that measured the seasonality of students’ logs on a MOOC platform: entropy of day of week, entropy of hour of day, entropy of hour of week, ... ● The features were highly predictive of the likelihood of student dropout and dramatically simplified our winning solution! ● I consider Owen’s entropy idea one of the most powerful techniques in feature engineering. Entropy can be used to capture a diversity signal, but also seasonality when applied to date parts.
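To make the entropy trick concrete, here is a minimal plain-Python sketch. It is not taken from the talk or from any library: the function names and the sample timestamps are assumptions chosen for illustration. It computes the Shannon entropy of the day-of-week distribution of a list of event timestamps. A low value means activity is concentrated on a few weekdays, a high value means it is spread out.

```python
import math
from collections import Counter
from datetime import datetime


def entropy(counts):
    """Shannon entropy (natural log) of a mapping category -> non-negative weight."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def weekday_entropy(timestamps):
    """Entropy of the day-of-week distribution of a list of datetimes."""
    return entropy(Counter(ts.weekday() for ts in timestamps))


# Hypothetical logs: one student is only active on Mondays,
# another is active on four different weekdays.
regular = [datetime(2023, 5, d) for d in (1, 8, 15, 22)]  # four Mondays
spread = [datetime(2023, 5, d) for d in (1, 2, 3, 4)]     # Monday to Thursday

regular_score = weekday_entropy(regular)  # fully concentrated activity
spread_score = weekday_entropy(spread)    # uniform over four weekdays
```

The same `entropy` helper works on any categorical distribution (hour of day, product group), which is why one function covers both the diversity and the seasonality use.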
  • 5. Don’t ignore old tricks: I may have been one of the first Kagglers to apply stacking to ensemble models. I then used stacking to convert high-dimensional categorical or text columns into numeric features. Now, stacking is one of the most popular techniques in data science. I didn’t invent stacking, but I may have been one of the first to benefit from it, all thanks to an old paper by David Wolpert (1992) that Google suggested and that helped me finish in 2nd place in my first Kaggle competition!
  • 6. Don’t look down on industry-specific practices: A defining moment in my career was winning the GE Flight Quest, and I owe part of that success to a valuable insurance practice, two-stage modeling: ● First, you learn from features you trust and for which you can see causal relationships. ● Second, you learn the residual effects of features that may expose your solution to bias. Thanks to it, I could deal with a short history of 3 months and still learn the effect of airports in a generalized way. Without it, I might have overfitted the airport effect and learnt less from the most important cause of flight delays: bad weather.
  • 7. Think outside the box: Like many data scientists, I learned to use cosine similarity whilst working with text and images. I eventually realized that cosine similarity could be applied to other data vectors, such as customer baskets, the distribution of days of the week, or any distribution across categories. Now I use cosine similarity a lot to measure stability over time or similarity between parent entities!
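As an illustration of applying cosine similarity to baskets rather than text, here is a minimal sketch in plain Python. The baskets, their keys, and the zero-vector convention are assumptions made for this example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts (key -> value)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty baskets
    return dot / (norm_a * norm_b)


# Hypothetical baskets: amount spent per product group.
recent_basket = {"dairy": 10.0, "bakery": 5.0}
past_basket = {"dairy": 20.0, "bakery": 10.0}  # same mix, twice the spend
other_basket = {"wine": 30.0}                  # no product group in common

same_mix = cosine_similarity(recent_basket, past_basket)
no_overlap = cosine_similarity(recent_basket, other_basket)
```

Because cosine similarity ignores magnitude, `same_mix` is close to 1 even though the totals differ. That property makes it suitable for comparing a short recent window with a longer historical one.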
  • 8. Name and Organize Feature Ideas: A practical example with Signal Types
  • 9. Name and organize your ideas: For many years, I relied on reflexology to kickstart my feature ideation process! Once I realized I was forgetting prior ideas, I decided to be more organized and began classifying my feature ideas into distinct signal types. This simple yet effective approach has changed the way I organize my thoughts and sparked fresh ideas! I also found that it helps me communicate better: it gives me a framework for discussing and evaluating ideas, and for convincing stakeholders unfamiliar with feature engineering that there is a whole world beyond RFM (recency, frequency, monetary).
  • 10. Recency Signals: Attributes of the latest event to occur. Metrics: ● Time since last event. ● Label of last event. ● Magnitude of last event. Examples: ● How much did a customer spend on their last purchase? ● How long since the last equipment failure? ● Was a patient’s last visit for an emergency?
  • 11. Frequency Signals: How often do events occur? Metrics: ● Number of events occurring over a time window. Examples: ● How many hospital admissions has the patient had in the past 12 months? ● How many international phone calls has a customer made in the past month?
  • 12. Monetary Signals: What are the monetary amounts of events over a time window? Metrics: ● Total cost of events occurring over a time window. ● Average cost of events occurring over a time window. Examples: ● How much did a customer spend over the past 12 months? ● What was the average discount applied to a customer’s purchases over the past month?
  • 13. Seasonality Signals: Any seasonality in the timing of events. Metrics: ● Entropy of day of the week (or any date parts) of events. ● Cross-aggregation of events across date parts. Examples: ● Does a customer always shop on the same days of the week? ● Do equipment failures tend to occur at different times of the day or times of year?
  • 14. Clumpiness Signals: Do events occur randomly across time, or are there long gaps between intense bursts of activity? Metrics: ● Coefficient of variation of inter-event times. ● Log utility of inter-event times. Examples: ● Which customers are binge-watching TV shows? ● Do equipment failures occur in cascades? ● After a long break without purchasing, does a customer return to a store a few times over several days to purchase items?
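The first clumpiness metric can be sketched in a few lines of plain Python. This is an illustrative assumption about how one might compute it, not the talk's code: the function name is hypothetical, the population standard deviation is one possible choice, and the event dates are made up.

```python
import statistics
from datetime import datetime


def inter_event_cv(timestamps):
    """Coefficient of variation (std / mean) of the gaps between consecutive events.

    Returns None when there are fewer than two gaps or the mean gap is zero.
    """
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        return None
    return statistics.pstdev(gaps) / mean_gap


# Hypothetical viewers: one watches every day, one binges after a long break.
steady = [datetime(2023, 5, d) for d in (1, 2, 3, 4)]
bursty = [datetime(2023, 5, d) for d in (1, 2, 3, 20, 21)]

steady_cv = inter_event_cv(steady)  # identical gaps, no clumpiness
bursty_cv = inter_event_cv(bursty)  # gaps of 1, 1, 17 and 1 days
```

Evenly spaced events give a value of 0, and the value grows as activity concentrates into bursts separated by long gaps.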
  • 15. Change Signals: Have usually static attributes changed? By how much? Metrics: ● Did an attribute change? ● How many times did an attribute change? ● By how much did it change? Examples: ● Did a customer change address over the past two years? How far? ● Has a password been reset in the past 48 hours? ● Was network equipment replaced with a different make or model?
  • 16. Basket Signals: What are the counts or amounts of events or available resources, cross-categorized by an item label? Metrics: ● Counts of item types. ● Sum of values across item types. ● Max values of item types. Examples: ● What are the counts of each item type in a shopping basket? ● What is the total value of each item type across a customer’s shipping orders over 30 days? ● What is the maximum weight of each type of item in a shipping order?
  • 17. Diversity Signals: How variable are the data values? Metrics: ● Coefficient of variation of amounts. ● Entropy of labels. Examples: ● How variable are the monthly invoice amounts over the past 6 months? ● How stable is a patient's blood pressure? ● Does a customer always purchase similar products?
  • 18. Similarity Signals: How similar is one entity to a group it is related to? Metrics: ● Ratio of one entity's numeric value to the average or max of values over a group. ● Z-score. ● Cosine similarity of two parent entities’ cross-aggregations. Examples: ● How similar are one customer's purchases to those of customers of the same age and gender? ● What is the ratio of a network antenna's maximum temperature to that of all other antennas that are the same model, or located in the same geography?
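The ratio and Z-score metrics can be illustrated with a short plain-Python sketch. The function name and the antenna temperatures below are hypothetical. The use of the population standard deviation is also an assumption, since the slide does not specify one.

```python
import statistics


def ratio_and_zscore(value, group_values):
    """Compare one entity's value with its group: (ratio to group mean, z-score).

    Either element is None when it is undefined (zero mean or zero spread).
    """
    mean = statistics.fmean(group_values)
    std = statistics.pstdev(group_values)
    ratio = value / mean if mean != 0 else None
    zscore = (value - mean) / std if std != 0 else None
    return ratio, zscore


# Hypothetical maximum temperatures of antennas of the same model.
group_temperatures = [40.0, 50.0, 60.0]
antenna_temperature = 60.0

ratio, zscore = ratio_and_zscore(antenna_temperature, group_temperatures)
```

The ratio is easy to explain to stakeholders, while the Z-score also accounts for how spread out the group is.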
  • 19. Stability Signals: Are an entity’s recent events similar to the past for the same entity? Metrics: ● Ratio of latest numeric value to the average or max of values over a historical time window. ● Cosine similarity of two cross-aggregations from different time periods. Examples: ● Are the contents of the latest shopping basket similar to past purchases? ● Is the average of numeric event data over the most recent period similar to the past? ● Is a recent data value anomalous (Z-score)?
  • 20. Location Signals: The static or dynamic location of an event or entity. Metrics: ● The latitude and longitude of a static entity. ● Maximum distance between an entity and other entities. ● Speed and acceleration of a moving entity. Examples: ● What is the distance between the shop and the customer address? ● Is the shipping address in a remote location? What is the population density in that state? ● How fast has a vehicle moved in the past hour?
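For the shop-to-customer distance example, one common choice is the great-circle (haversine) distance. The sketch below is an assumption about how such a feature could be computed, not the talk's method. The function name, the mean Earth radius constant, and the approximate coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = math.sin(delta_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical shop in Paris and customer address in Lyon (approximate coordinates).
shop = (48.8566, 2.3522)
customer = (45.7640, 4.8357)

shop_to_customer_km = haversine_km(*shop, *customer)
```

Dividing the distance between two successive positions by the elapsed time gives the speed metric listed on the slide.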
  • 22. We will work with a Catalog in which 4 tables from a grocery dataset were registered: ● a Dimension table with static descriptions of products, ● an Item table with details on purchased products in customer invoices, ● an Event table where each row represents an invoice, ● a Slowly Changing Dimension table that contains data on customers that change over time. A catalog is an object that helps organize your feature engineering assets by domain. To access the Grocery dataset or play with FeatureByte's tutorials, you can install a pre-built local Spark data warehouse with pre-populated data via the command featurebyte.playground().
  • 23. 10 Feature ideas for the Grocery dataset
  • 24. 10 feature ideas for the Customer entity: 1. Frequency: the count of invoices by Customer in the past 12 weeks 2. Monetary: the amount spent by Customer in the past 12 weeks 3. Recency: the time since the Customer’s last invoice 4. Basket: the sum spent by Customer per Product Group in the past 12 weeks 5. Diversity: the entropy of the Customer’s basket per Product Group in the past 12 weeks 6. Stability: the cosine similarity of the Customer’s basket in the past 2 weeks compared to their basket in the past 12 weeks 7. Similarity: the cosine similarity of the Customer’s basket compared to the baskets of Customers living in the same state in the past 12 weeks 8. Seasonality: the entropy of weekdays of the Customer’s invoices in the past 12 weeks 9. Change: whether the Customer’s Address changed in the past 12 weeks 10. Clumpiness: the coefficient of variation of the Customer’s inter-event time in the past 12 weeks
  • 26. FeatureByte is a free and source-available platform designed to empower data enthusiasts who love creating innovative features and improving model accuracy through data. With FeatureByte, you can: ● Quickly create and share features, ● Seamlessly experiment with features and ● Effortlessly deploy features in production. All without the headache of creating and managing new pipelines. No large-scale outbound data transfers. Computation and feature storage are done where data sits to leverage scalability, stability, and efficiency of your DWH.
  • 27. Feature Idea 1: Frequency. Count of invoices by Customer in the past 12 weeks. The groupby() method groups a view by columns representing entities. The aggregate_over() method performs an aggregation over a window prior to the observation point, here 12 weeks. Feature objects contain the logical plan to compute a feature.
  • 28. Feature Idea 2: Monetary. Amount spent by Customer in the past 12 weeks. Aggregation operations include latest, count, sum, average, minimum, maximum, and standard deviation.
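To show what a window aggregation prior to an observation point computes, here is a conceptual sketch in plain Python. It deliberately does not use the FeatureByte SDK. The helper `window_aggregate`, the half-open window convention (events at the observation point itself are excluded), and the sample invoices are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def window_aggregate(events, point_in_time, window, method="count"):
    """Aggregate (timestamp, amount) events in [point_in_time - window, point_in_time)."""
    start = point_in_time - window
    amounts = [amount for ts, amount in events if start <= ts < point_in_time]
    if method == "count":
        return len(amounts)
    if method == "sum":
        return sum(amounts)
    raise ValueError(f"unsupported method: {method}")


# Hypothetical invoices, already grouped by customer: (timestamp, amount).
invoices_by_customer = {
    "c1": [(datetime(2023, 1, 1), 20.0), (datetime(2023, 4, 1), 30.0), (datetime(2023, 5, 1), 15.0)],
    "c2": [(datetime(2023, 5, 29), 8.0)],
}
observation_point = datetime(2023, 5, 30)
twelve_weeks = timedelta(weeks=12)

# Frequency and Monetary features for each customer at the observation point.
invoice_count_12w = {
    customer: window_aggregate(events, observation_point, twelve_weeks, "count")
    for customer, events in invoices_by_customer.items()
}
amount_spent_12w = {
    customer: window_aggregate(events, observation_point, twelve_weeks, "sum")
    for customer, events in invoices_by_customer.items()
}
```

Only events strictly before the observation point are used, which is what keeps a training set built this way free of leakage from the future.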
  • 29. Feature Idea 3: Recency. Time since the Customer’s last invoice. RequestColumn.point_in_time() enables the creation of features that compare the value of another feature with the observation point. Another example of using this method would be to derive a customer's age from their birth date and the observation point.
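Conceptually, a recency feature is the difference between the observation point and the latest event before it. The plain-Python sketch below illustrates that idea only. It is not FeatureByte code, and the function name and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def time_since_last_event(timestamps, point_in_time):
    """Time elapsed between the latest event strictly before point_in_time and point_in_time.

    Returns None when the entity has no prior event.
    """
    past = [ts for ts in timestamps if ts < point_in_time]
    if not past:
        return None
    return point_in_time - max(past)


# Hypothetical invoice timestamps for one customer; the last one lies in the future
# relative to the observation point and must be ignored.
invoice_times = [datetime(2023, 5, 1), datetime(2023, 5, 20), datetime(2023, 6, 5)]
observation_point = datetime(2023, 5, 30)

recency = time_since_last_event(invoice_times, observation_point)
```

The same pattern yields an age feature when the birth date takes the place of the last event.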
  • 30. Feature Idea 4: Basket. Sum spent by Customer across Product Group in the past 2 and 12 weeks. Aggregation can be done across a categorical column, here ProductGroup. We call this cross-aggregation! When the feature is materialized, a dictionary is returned. In our example, the keys represent the ProductGroup, and their corresponding values represent the total amount spent on each Product Group.
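The dictionary returned by a cross-aggregation can be pictured with a plain-Python sketch. This is a conceptual illustration rather than the FeatureByte implementation. The helper name, the window convention, and the sample items are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cross_aggregate(items, point_in_time, window):
    """Sum amounts per category for (timestamp, category, amount) rows in the window
    [point_in_time - window, point_in_time)."""
    start = point_in_time - window
    totals = defaultdict(float)
    for ts, category, amount in items:
        if start <= ts < point_in_time:
            totals[category] += amount
    return dict(totals)


# Hypothetical purchased items for one customer: (timestamp, product group, amount).
items = [
    (datetime(2023, 5, 1), "dairy", 4.0),
    (datetime(2023, 5, 10), "bakery", 2.5),
    (datetime(2023, 5, 20), "dairy", 6.0),
    (datetime(2022, 12, 1), "wine", 30.0),  # outside the 12-week window
]
observation_point = datetime(2023, 5, 30)

basket_12w = cross_aggregate(items, observation_point, timedelta(weeks=12))
basket_2w = cross_aggregate(items, observation_point, timedelta(weeks=2))
```

Dictionaries like `basket_12w` and `basket_2w` are the inputs on which entropy (diversity) and cosine similarity (stability) are then computed.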
  • 31. Feature Idea 5: Diversity. Entropy of the Customer’s basket in the past 12 weeks. A dictionary feature can be transformed into another feature. Here we apply entropy.
  • 32. Feature Idea 6: Stability. Cosine similarity of the Customer’s basket in the past 2 weeks vs their basket in the past 12 weeks. Two dictionary features with different windows can be compared using cosine similarity.
  • 33. Feature Idea 7: Similarity. Cosine similarity of the Customer’s basket in the past 12 weeks vs the baskets of Customers living in the same state. When two dictionary features are grouped by two different entities, ● they can still be compared using cosine similarity. ● the entity of the resulting feature is determined by the relationship between the two entities. In our example, the primary entity of “Customer_Similarity_with_State_12w” is grocerycustomer because grocerycustomer is a child of the frenchstate entity. The feature can be served by providing only the grocerycustomer entity values. Values for the frenchstate entity are derived automatically by FeatureByte.
  • 34. Feature Idea 8: Seasonality. Entropy of weekdays of the Customer’s invoices in the past 12 weeks. In this example, the keys of the dictionary represent the weekday, and their corresponding values represent the total amount spent on each weekday. The weekday dictionary feature is then transformed into another feature by applying entropy.
  • 35. Feature Idea 9: Change. Whether the Customer’s Address changed in the past 12 weeks. A Change View can be created from a Slowly Changing Dimension (SCD) table to analyze changes that occur in an attribute, here the StreetAddress of the grocerycustomer.
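The idea behind a change signal on a slowly changing dimension can be sketched in plain Python. This does not reproduce FeatureByte's Change View. The row layout (effective timestamp, address), the function name, and the sample addresses are assumptions for illustration.

```python
from datetime import datetime, timedelta


def attribute_changed(scd_rows, point_in_time, window):
    """True if the attribute took a new value in [point_in_time - window, point_in_time).

    scd_rows is a list of (effective_timestamp, value) records for one entity.
    """
    rows = sorted(scd_rows, key=lambda row: row[0])
    start = point_in_time - window
    for (_, previous_value), (effective_ts, new_value) in zip(rows, rows[1:]):
        if start <= effective_ts < point_in_time and new_value != previous_value:
            return True
    return False


# Hypothetical address history for two customers.
moved = [(datetime(2022, 1, 1), "1 rue A"), (datetime(2023, 4, 15), "9 rue B")]
stayed = [(datetime(2022, 1, 1), "3 rue C")]
observation_point = datetime(2023, 5, 30)
twelve_weeks = timedelta(weeks=12)

moved_flag = attribute_changed(moved, observation_point, twelve_weeks)
stayed_flag = attribute_changed(stayed, observation_point, twelve_weeks)
```

Counting the qualifying rows instead of returning at the first one gives the "how many times did an attribute change" metric.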
  • 36. Feature Idea 10: Clumpiness. Coefficient of variation (CV) of the Customer’s inter-event time in the past 12 weeks. The inter-event time is obtained by retrieving the timestamp of the Customer’s previous event (lag).
  • 37. Get training data and deploy our 10 features in production!
  • 38. Audit one feature via its Definition file: The feature definition file is critical for maintaining the integrity of a feature. It serves as the single source of truth for the feature, providing an explicit outline of the intended operations of the feature declaration, including inherited operations. The file is automatically generated when a feature is declared or a new version is derived. It is used to generate the final logical execution graph, which is then transpiled into platform-specific SQL for feature materialization.
  • 40. Compute training data: You can choose to store your training data in the feature store for reuse or audit. If you connect FeatureByte to your data warehouse, the computation of features is performed in the data warehouse, taking advantage of its scalability, stability, and efficiency. In this case, you can work with much larger observation tables with millions of rows.
  • 41. Deploy and get a shell template for the REST API:
  • 43. Conclusion: “Creativity is just connecting things” (Steve Jobs). While you may have some original ideas, the majority will be inspired by others. Embrace these ideas and make them your own. Organize and name your feature ideas. This will help you remember them and communicate them. This process can also spark new ideas along the way! Consider using tools that streamline the conversion of your feature ideas into production-ready machine learning inputs. This will: ● help you focus on what really matters, feature ideation, ● and maximize the success of your ML projects.
  • 44. Thank You! To get started, check out: Repo: https://github.com/featurebyte/featurebyte Documentation: https://docs.featurebyte.com/0.2/