Effective feature engineering is essential for successful machine learning outcomes, yet important signals within data are often missed. In this talk, Xavier will present his fresh and innovative approach to feature engineering ideation, exploring how to create features that capture different signal types, including timing, regularity, stability, diversity, and similarity. Using FeatureByte, a recently launched free and source-available feature platform that leverages popular data platforms such as Snowflake, Spark, and Databricks, Xavier will demonstrate the practical application of these techniques. Attendees can expect to gain valuable insights and techniques for creatively engineering features and unlocking the full potential of their data for machine learning success.
2. FeatureByte
Maximizing Your ML Success with Innovative Feature Engineering
Agenda:
● Feature engineering is a creative process. Never
stop looking for new sources of inspiration.
○ Practical examples of how I got inspired over the
years.
● Name and organize your ideas to make them more
repeatable and visible.
○ Practical examples of how I used signal types to
organize my feature ideas.
● Convert feature ideation into actionable ML inputs.
○ Practical examples using a grocery dataset
and FeatureByte.
4. FeatureByte
Learn from your peers
During the KDD Cup 2015, Owen Zhang taught me how to use
entropy to measure diversity in a categorical distribution.
● We applied Owen’s entropy trick to create features that measured the seasonality of student logins to a
MOOC platform: entropy of day of week, entropy of hour of day, entropy of hour of week, ...
● The features were highly predictive of the likelihood of student dropout and dramatically simplified our
winning solution!
● I consider Owen’s entropy idea one of the most powerful techniques in feature engineering. Entropy
can be used to capture a diversity signal, but also seasonality when applied to date parts.
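A minimal sketch of the entropy trick (my own illustration, not code from the talk): Shannon entropy of a date-part distribution is near zero when events always fall on the same day, and maximal when they are spread evenly across days.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a categorical distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Day of week of each login event for two hypothetical students
regular = ["Mon", "Mon", "Mon", "Mon"]   # always logs in on the same day
erratic = ["Mon", "Tue", "Fri", "Sun"]   # logins spread across the week

print(entropy(regular))  # low entropy: strong weekly seasonality
print(entropy(erratic))  # high entropy (ln 4 ≈ 1.386): no weekly pattern
```

The same function applies unchanged to hour-of-day or hour-of-week labels.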
5. FeatureByte
Don’t ignore old tricks
I may have been one of the first Kagglers to apply stacking to ensemble models. I then used stacking to
convert high-dimensional categorical or text columns into numeric features. Now, stacking is one of the
most popular techniques in data science.
I didn’t invent stacking, but I may have been one of the first to benefit from it. All thanks to an old paper by
David Wolpert (1992) that Google suggested and that helped me finish in 2nd place in my first Kaggle
competition!
6. FeatureByte
Don’t look down on industry specific practices
A defining moment in my career was winning the GE Flight Quest, and I owe part of that success to a
valuable insurance practice: two-stage modeling.
● First, you learn from features you trust and for which you can see causal relationships.
● Second, you learn the residual effects of features that may expose your solution to bias.
Thanks to this, I could deal with a short history of 3 months and still learn the effect of airports in a
generalized way. Without it, I might have overfitted the airport effect and learned less from the most
important cause of flight delays: bad weather.
7. FeatureByte
Think outside the box
Like many data scientists, I learned to use cosine
similarity while working with text and images.
I eventually realized that cosine similarity could be applied
to other data vectors, such as customer baskets,
day-of-week distributions, or any distribution across
categories…
Now I use cosine similarity extensively to measure stability over
time or similarity between parent entities!
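For example (a sketch of my own, with made-up basket data), cosine similarity between two category-to-amount dictionaries:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse category -> value vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Spend per product group: recent basket vs the last 12 weeks
basket_2w = {"dairy": 10.0, "fruit": 5.0}
basket_12w = {"dairy": 60.0, "fruit": 30.0}
print(cosine_similarity(basket_2w, basket_12w))  # ~1.0: very stable product mix
```

A value near 1 means the recent mix matches the historical mix; a value near 0 means the customer has switched to entirely different categories.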
9. FeatureByte
Name and Organize your ideas
For many years, I relied on
reflexology to kickstart my
feature ideation process!
Once I realized I was forgetting prior ideas, I decided to
be more organized and began classifying my feature
ideas into distinct signal types.
This simple yet effective approach has changed the
way I organize my thoughts and sparked fresh ideas!
I also found it helps me communicate better by
providing a framework for discussing and
evaluating ideas, and for convincing stakeholders unfamiliar
with feature engineering that there is a whole world beyond
RFM.
10. FeatureByte
Recency Signals
Attributes of the latest event to occur.
Metrics
● Time since last event.
● Label of last event.
● Magnitude of last event.
Examples
● How much did a customer spend on their
last purchase?
● How long since the last equipment failure?
● Was a patient’s last visit for an emergency?
11. FeatureByte
Frequency Signals
How often do events occur?
Metrics
● Number of events occurring
over a time window.
Examples
● How many hospital admissions has the
patient had in the past 12 months?
● How many international phone calls has a
customer made in the past month?
12. FeatureByte
Monetary Signals
What are the monetary amounts of
events over a time window?
Metrics
● Total cost of events occurring
over a time window.
● Average cost of events
occurring over a time window
Examples
● How much did a customer spend over the
past 12 months?
● What was the average discount applied to a
customer’s purchases over the past month?
13. FeatureByte
Seasonality Signals
Any seasonality in the timing of events.
Metrics
● Entropy of day of the week (or
any date parts) of events.
● Cross-aggregation of events
across date parts
Examples
● Does a customer always shop on the same
days of the week?
● Do equipment failures tend to occur at
different times of the day or times of year?
14. FeatureByte
Clumpiness Signals
Do events occur randomly across time,
or are there long gaps between intense
bursts of activity?
Metrics
● Coefficient of variation of
inter-event times
● Log utility of inter-event times
Examples
● Which customers are binge-watching TV
shows?
● Do equipment failures occur in cascades?
● After a long break without purchasing, does
a customer return to a store a few times
over several days to purchase items?
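A possible implementation of the first metric (my own sketch, using day offsets as event times):

```python
from statistics import mean, pstdev

def inter_event_cv(event_times):
    """Coefficient of variation of inter-event times.

    Close to 0 for perfectly regular events; values above 1 suggest
    bursts of activity separated by long gaps (clumpy behaviour).
    """
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return pstdev(gaps) / mean(gaps)

regular = [0, 7, 14, 21, 28]   # one purchase every 7 days
clumpy = [0, 1, 2, 27, 28]     # a burst, a long break, another burst

print(inter_event_cv(regular))  # 0.0
print(inter_event_cv(clumpy))   # ~1.48: clumpy
```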
15. FeatureByte
Change Signals
Have usually static attributes changed?
By how much?
Metrics
● Did an attribute change?
● How many times did an
attribute change?
● By how much did it change?
Examples
● Did a customer change address over the past
two years? How far?
● Has a password been reset in the past 48
hours?
● Was network equipment replaced with a
different make or model?
16. FeatureByte
Basket Signals
What are the counts or amounts of
events or available resources,
cross-categorized by an item label?
Metrics
● Counts of item types
● Sum of values across item
types
● Max values of item types
Examples
● What are the counts of each item type in a
shopping basket?
● What is the total value of each item type in a
customer’s shipping orders over 30 days?
● What is the maximum weight of each type of
item in a shipping order?
17. FeatureByte
Diversity Signals
How variable are the data values?
Metrics
● Coefficient of variation of
amounts
● Entropy of labels
Examples
● How variable are the monthly invoice
amounts over the past 6 months?
● How stable is a patient's blood pressure?
● Does a customer always purchase similar
products?
18. FeatureByte
Similarity Signals
How similar is one entity to a group it is
related to?
Metrics
● Ratio of one entity's numeric
value to the average or max of
values over a group
● Z-score
● Cosine similarity of two parent
entities’ cross-aggregations
Examples
● How similar is one customer's purchases to
customers of the same age and gender?
● What is the ratio of a network antenna's
maximum temperature to that of all other
antennas that are the same model, or
located in the same geography?
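The z-score metric can be sketched as follows (my own illustration, with invented temperatures):

```python
from statistics import mean, pstdev

def z_score(value, peer_values):
    """How many standard deviations `value` lies from its peer group."""
    return (value - mean(peer_values)) / pstdev(peer_values)

# Max temperature of one antenna vs antennas of the same model
peer_max_temps = [60.0, 62.0, 58.0, 61.0, 59.0]
print(z_score(75.0, peer_max_temps))  # large positive: much hotter than peers
```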
19. FeatureByte
Stability Signals
Are an entity’s recent events similar to
the past for the same entity?
Metrics
● Ratio of latest numeric value to
the average or max of values
over a historical time window.
● Cosine similarity of two
cross-aggregations from
different time periods
Examples
● Are the contents of the latest shopping
basket similar to past purchases?
● Is the average of numeric event data over
the most recent period similar to the past?
● Is a recent data value anomalous (Z-score)?
20. FeatureByte
Location Signals
The static or dynamic location of an
event or entity.
Metrics
● The latitude and longitude of a
static entity
● Maximum distance between an
entity and other entities
● Speed and acceleration of a moving
entity
Examples
● What is the distance between the shop and
the customer address?
● Is the shipping address in a remote location?
What is the population density in that state?
● How fast has a vehicle moved in the past
hour?
22. FeatureByte
We will work with a Catalog where 4 tables were registered from a grocery dataset.
A catalog is an object that helps organize your
feature engineering assets by domain.
● Dimension table with static descriptions of products
● Item table with details on products purchased in customer invoices
● Event table where each row represents an invoice
● Slowly Changing Dimension table that contains data on customers that changes over time
To access the grocery dataset or play with FeatureByte's tutorials, you can install a pre-built local Spark
data warehouse with pre-populated data via the command featurebyte.playground().
24. FeatureByte
10 feature ideas for the Customer entity
1. Frequency: the count of invoices by Customer in the past 12 weeks
2. Monetary: the amount spent by Customer in the past 12 weeks
3. Recency: the time since the Customer’s last invoice
4. Basket: the sum spent by Customer per Product Group in the past 12 weeks
5. Diversity: the entropy of the Customer’s basket per Product Group in the past 12 weeks
6. Stability: the cosine similarity of the Customer’s basket in the past 2 weeks compared to their
basket in the past 12 weeks
7. Similarity: the cosine similarity of the Customer’s basket compared to the baskets of Customers
living in the same state in the past 12 weeks
8. Seasonality: the entropy of weekdays of the customer invoices in the past 12 weeks
9. Change: whether the Customer’s Address changed in the past 12 weeks
10. Clumpiness: coefficient of variation of Customer inter-event time in the past 12 weeks
26. FeatureByte
FeatureByte is a free and source-available platform designed to empower data enthusiasts who love
creating innovative features and improving model accuracy through data.
With FeatureByte, you can:
● Quickly create and share features,
● Seamlessly experiment with features and
● Effortlessly deploy features in production
All without the headache of creating and managing new pipelines.
No large-scale outbound data transfers:
computation and feature storage are
done where the data sits, leveraging the
scalability, stability, and efficiency of your
DWH.
27. FeatureByte
Feature Idea 1: Frequency
Count of invoices by Customer in the past 12 weeks
The groupby() method groups a view by
columns representing entities.
The aggregate_over() method performs an aggregation
operation over a window prior to the observation point,
here 12 weeks.
Feature objects contain the logical plan to compute a feature.
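The windowed aggregation that aggregate_over() declares can be sketched in plain Python (my own illustration of the logic, not the FeatureByte API):

```python
from datetime import datetime, timedelta

def count_over_window(event_times, observation_point, window_days=84):
    """Count events in the 84-day (12-week) window before the observation point."""
    start = observation_point - timedelta(days=window_days)
    return sum(1 for t in event_times if start <= t < observation_point)

# Invoice timestamps for one customer (made-up data)
invoices = [datetime(2023, 1, 5), datetime(2023, 2, 10), datetime(2023, 3, 1)]
print(count_over_window(invoices, datetime(2023, 3, 15)))  # 3
```

FeatureByte declares this logic once as a Feature object and later transpiles it to SQL that runs in the data warehouse.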
28. FeatureByte
Feature Idea 2: Monetary
Amount spent by Customer in the past 12 weeks
Aggregation operations include latest, count, sum,
average, minimum, maximum, and standard deviation.
29. FeatureByte
Feature Idea 3: Recency
Time since Customer last invoice
RequestColumn.point_in_time() enables the creation of
features that compare the value of another feature with
the observation point.
Another example of using this method would be to
derive a customer's age from their birth date and the
observation point.
30. FeatureByte
Feature Idea 4: Basket
Sum spent by Customer across Product Group in the past 2 and 12 weeks
Aggregation can be done across a categorical column,
here ProductGroup. We call this Cross-Aggregation!
When the feature is materialized, a dictionary is
returned.
In our example, the keys represent the ProductGroup,
and their corresponding values represent the total
amount spent on each Product Group.
31. FeatureByte
Feature Idea 5: Diversity
Entropy of the Customer’s basket in the past 12 weeks
A dictionary feature can be transformed into another
feature. Here we apply entropy.
32. FeatureByte
Feature Idea 6: Stability
Cosine Similarity of the Customer’s basket in the past 2 weeks vs their
basket in the past 12 weeks
Two dictionary features computed over different
windows can be compared using cosine
similarity.
33. FeatureByte
Feature Idea 7: Similarity
Cosine Similarity of the Customer’s basket in the past 12 weeks
vs the baskets of Customers living in the same state
When 2 dictionary features are grouped by 2 different
entities,
● they can still be compared using cosine similarity.
● the entity of the resulting feature is determined based
on the relationship between the 2 entities.
In our example, the primary entity of
“Customer_Similarity_with_State_12w” is
grocerycustomer because grocerycustomer
is a child of the frenchstate entity.
The feature can be served by providing only
the grocerycustomer entity values. Values
for frenchstate entity are derived
automatically by FeatureByte.
34. FeatureByte
Feature Idea 8: Seasonality
Entropy of weekdays of the Customer’s invoices in the past 12 weeks
In this example, the keys of the dictionary represent the
weekday, and their corresponding values represent the
total amount spent on each weekday.
The weekday dictionary feature is then transformed
into another feature by applying entropy.
35. FeatureByte
Feature Idea 9: Change
Whether Customer’s Address Changed in the past 12 weeks
A Change View can be created from a Slowly Changing Dimension
(SCD) table to analyze changes in an attribute, here the
StreetAddress of the grocerycustomer.
36. FeatureByte
Feature Idea 10: Clumpiness
CV of Customer Inter-Event time in the past 12 weeks
Compute the inter-event time by retrieving the timestamp of each
customer's previous event (lag).
38. FeatureByte
Audit one feature via its Definition file:
The feature definition file is critical for maintaining the integrity
of a feature. It serves as the single source of truth for the feature,
providing an explicit outline of the intended operations of the
feature declaration, including inherited operations.
The file is automatically generated when a feature is declared or a
new version is derived. It is used to generate the final logical
execution graph, which is then transpiled into platform-specific
SQL for feature materialization.
40. FeatureByte
Compute training data
You can choose to store your training data in the feature store for
reuse or audit.
If you connect FeatureByte to your data warehouse, the
computation of features is performed in the data warehouse,
taking advantage of its scalability, stability, and efficiency.
In this case, you can work with much larger observation tables with
millions of rows.
43. FeatureByte
Conclusion: “Creativity is just connecting things” (Steve Jobs)
Consider utilizing tools that streamline the conversion of your feature ideas
into production-ready machine learning inputs. This will:
● help you focus on what really matters: feature ideation,
● and maximize the success of your ML projects.
While you may have some original ideas, the
majority will be inspired by others. Embrace
these ideas and make them your own.
Organize and name your feature ideas. This
will help you remember them and
communicate them. This process can also spark
new ideas along the way!
44. FeatureByte
Thank You!
To get started, check out
Repo: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.2/