In this presentation, data science expert and Kaggle grandmaster Xavier Conort defines feature engineering and identifies the pain points that can be addressed by feature platforms. He dives into the features he tends to extract from transactional data and addresses how the magic of transformers and generative AI help produce those features and make them more transparent.
Xavier Conort is currently Co-Founder and CPO of FeatureByte, an AI-based self-service feature platform that helps data scientists turn raw data into fully governed AI pipelines in minutes, at 1/5th the cost.
2. FeatureByte
The talk will not cover the problems that Data Science can
solve. Instead, we will focus on how 3 innovations in
Data Science (Feature Platforms, Transformers and GenAI) can
transform the way data scientists work.
These innovations have the potential to make data scientists
more efficient, creative, and transparent.
3. FeatureByte
Agenda
● Define feature engineering
● Pain points solved by Feature Platforms
● The magic of Transformers
● The features I like to extract from transactional data
● How Generative AI helps produce those features and make them
more transparent
4. FeatureByte
A little about me
● Actuary: learnt to formulate insurance problems mathematically and solve them with data.
● Kaggle addict: learnt to have fun with data.
● Chief Data Scientist, DataRobot: learnt to codify best practices.
● Co-founder, FeatureByte: solving remaining pain points and planning to have fun with data again!
7. FeatureByte
2 types of features
1. Transformed features:
a. combine columns, binning…
b. model specific: scaling, one-hot encoding, embedding, reducing, …
Well covered by books and blogs, well supported by tools like sklearn and automated by Auto-ML.
2. Features from raw data:
a. From regular time series
Pretty well covered by books and blogs and automated by Auto-ML, TS packages or deep learning.
b. From transactional data
c. From slowly changing dimension tables
The most exciting but challenging feature engineering task.
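To make the first category concrete, here is a minimal sketch of three transformed features (binning, scaling, one-hot encoding) using only the Python standard library. The column names, bin edges and values are invented for illustration; in practice a tool like sklearn would usually handle these transforms.

```python
from statistics import mean, pstdev

# Toy rows; the column names and values are invented for illustration.
rows = [
    {"age": 23, "amount": 10.0, "channel": "web"},
    {"age": 41, "amount": 30.0, "channel": "store"},
    {"age": 67, "amount": 20.0, "channel": "web"},
]

def bin_age(age):
    """Binning: map a numeric value to a coarse category (edges are arbitrary)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

def standardize(values):
    """Scaling: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """One-hot encoding: one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

age_bins = [bin_age(r["age"]) for r in rows]
scaled_amounts = standardize([r["amount"] for r in rows])
channel_flags = one_hot([r["channel"] for r in rows])
```

These transforms only need the current row (plus global statistics), which is why they are the easy part compared with features built from raw transactional data.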
8. FeatureByte
Simple example of transaction data: A Grocery Dataset
● Event table where each row indicates an invoice in a Grocery shop
● Item table with details on purchased products in customer invoices
● Dimension table with static descriptions of products
● Slowly Changing Dimension table that contains data on customers that change over time
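The four table types of the Grocery example can be sketched as plain Python records. All field names here (`invoice_id`, `valid_from`, `city`, …) are assumptions made for illustration, not the actual schema of the dataset. The helper `customer_as_of` shows what distinguishes a Slowly Changing Dimension table: a lookup must say at which point in time it is made.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Invoice:            # Event table: one row per invoice
    invoice_id: str
    customer_id: str
    timestamp: datetime

@dataclass
class InvoiceItem:        # Item table: one row per product purchased in an invoice
    invoice_id: str
    product_id: str
    quantity: int

@dataclass
class Product:            # Dimension table: static product description
    product_id: str
    product_group: str

@dataclass
class CustomerVersion:    # SCD table: one row per version of a customer record
    customer_id: str
    valid_from: datetime
    city: str

def customer_as_of(versions, customer_id, ts):
    """Return the customer record that was valid at time ts, or None."""
    candidates = [v for v in versions
                  if v.customer_id == customer_id and v.valid_from <= ts]
    return max(candidates, key=lambda v: v.valid_from, default=None)

# Toy data: the customer moved from Paris to Lyon in June 2023.
versions = [
    CustomerVersion("c1", datetime(2023, 1, 1), "Paris"),
    CustomerVersion("c1", datetime(2023, 6, 1), "Lyon"),
]
city_in_march = customer_as_of(versions, "c1", datetime(2023, 3, 1)).city
city_in_july = customer_as_of(versions, "c1", datetime(2023, 7, 1)).city
```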
10. FeatureByte
3 skills are involved
● DOMAIN EXPERTISE
● DATA ENGINEERING
● DATA SCIENCE
[Diagram labels: DW, SAAS SOURCES, STREAMING, MODEL TRAINING, MODEL PREDICTION]
11. FeatureByte
… while 3 innovations are already radically simplifying the equation
● DOMAIN EXPERTISE: Generative AI is getting close to emulating a Domain Expert available 24/7
● DATA ENGINEERING: Feature Platforms are simplifying data engineering
● DATA SCIENCE: Off the Shelf Transformers are revolutionizing NLP
13. FeatureByte
Pain Point | How a Feature Platform is addressing it
Feature computation has too high latency | Pre-computes features and stores them in an online store for low latency
Feature values in production are different from training, affecting production accuracy | Ensures training-serving consistency
Code for efficient time travel in SQL is too complex to write | Declarative framework in Python to simplify the creation of features
I don’t want to wait to experiment with my new feature ideas | Automated backfilling to materialize features at any point in time in the past and populate training / test data
Painful search for existing features | Feature Catalog to share and reuse features
Don’t know when my new features will be finally deployed | Unified solution to ensure smooth transition from experimentation to deployment
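The "time travel" and backfilling rows can be illustrated with a small sketch. It is not how any particular Feature Platform implements them; the function names `invoice_count` and `backfill`, the 28-day window and the toy data are all assumptions. The key idea is that a feature is always computed as of a point in time, using only events strictly before it.

```python
from datetime import datetime, timedelta

# (customer_id, invoice timestamp); toy data invented for the example.
invoices = [
    ("c1", datetime(2023, 1, 5)),
    ("c1", datetime(2023, 1, 20)),
    ("c1", datetime(2023, 2, 10)),
    ("c2", datetime(2023, 1, 25)),
]

def invoice_count(customer_id, point_in_time, window=timedelta(days=28)):
    """Count invoices in [point_in_time - window, point_in_time).

    Events at or after point_in_time are excluded, so a training row
    never sees data that would not have been available when serving.
    """
    start = point_in_time - window
    return sum(1 for cid, ts in invoices
               if cid == customer_id and start <= ts < point_in_time)

def backfill(observations):
    """Materialize the feature for historical (customer, point-in-time) pairs."""
    return [(cid, pit, invoice_count(cid, pit)) for cid, pit in observations]

training_rows = backfill([
    ("c1", datetime(2023, 1, 21)),
    ("c1", datetime(2023, 3, 1)),
])
```

Using the same function for historical points in time (training) and for "now" (serving) is one simple way to think about training-serving consistency.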
16. FeatureByte
What do I think about Transformers?
1. As a scientist: they are beautiful
2. As a data scientist: they are awesome
3. As a data engineer: another transformation to operationalize
4. As a product manager: they are revolutionary!
17. FeatureByte
Immediate use of Transformers
1. Radically simplify a complex task (NLP)
2. In many cases, don’t need any training for great outcomes! Can be used off
the shelf to transform raw text columns of transactional data:
a. Text embedding,
b. Named Entity Recognition,
c. Sentiment analysis,
d. Keyword extraction,
e. Extract specific information in a table structure…
3. I just need to aggregate the outputs in a meaningful way
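"Aggregating the outputs in a meaningful way" can be sketched as averaging per-text vectors for each customer. The `toy_embed` function below is a deliberately trivial stand-in, not a real transformer: in practice it would be replaced by a call to an off-the-shelf embedding model. The review texts and customer ids are invented.

```python
from collections import defaultdict

def toy_embed(text):
    """Stand-in for a transformer embedding (NOT a real model).

    Returns [number of words, number of exclamation marks].
    """
    return [float(len(text.split())), float(text.count("!"))]

# (customer_id, free-text column); toy data invented for the example.
reviews = [
    ("c1", "great bread!"),
    ("c1", "milk was sour"),
    ("c2", "fine"),
]

def mean_embedding_per_customer(rows, embed):
    """Aggregate per-text vectors into one mean vector per customer."""
    sums, counts = {}, defaultdict(int)
    for customer_id, text in rows:
        vec = embed(text)
        acc = sums.setdefault(customer_id, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[customer_id] += 1
    return {cid: [x / counts[cid] for x in acc] for cid, acc in sums.items()}

features = mean_embedding_per_customer(reviews, toy_embed)
```

The mean is only one choice of aggregation; max, latest value, or a windowed mean are equally plausible depending on the use case.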
18. FeatureByte
Do I still need Feature Platforms for Transformers?
Feature Platforms can improve Transformers operationalization
1. Point-in-Time correctness
2. Low latency thanks to the pre-computation of features that involve
transformers
3. Caching (via partial aggregations or other mechanisms) to reduce expensive
transformer calls
4. Transformer Library
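Caching via partial aggregations can be sketched as follows. This is an illustrative assumption about the general technique, not a description of FeatureByte's internals: the expensive per-row value (here a hypothetical `score`, for example a sentiment score produced by a transformer) is computed once, stored as daily (sum, count) tiles, and window features are then derived from the tiles without calling the model again.

```python
from collections import defaultdict
from datetime import date, timedelta

# (customer_id, day, score); the score stands for an expensive transformer output.
events = [
    ("c1", date(2023, 1, 1), 10.0),
    ("c1", date(2023, 1, 1), 20.0),
    ("c1", date(2023, 1, 3), 30.0),
    ("c1", date(2023, 1, 10), 100.0),
]

def build_daily_tiles(rows):
    """Partial aggregation: keep a [sum, count] pair per customer per day."""
    tiles = defaultdict(lambda: [0.0, 0])
    for customer_id, day, score in rows:
        tile = tiles[(customer_id, day)]
        tile[0] += score
        tile[1] += 1
    return dict(tiles)

def window_mean(tiles, customer_id, end_day, days):
    """Mean score over the `days` days before end_day, read from cached tiles."""
    total, count = 0.0, 0
    for offset in range(1, days + 1):
        day = end_day - timedelta(days=offset)
        tile_sum, tile_count = tiles.get((customer_id, day), (0.0, 0))
        total += tile_sum
        count += tile_count
    return total / count if count else None

tiles = build_daily_tiles(events)
mean_7d_on_jan_4 = window_mean(tiles, "c1", date(2023, 1, 4), 7)
mean_7d_on_jan_11 = window_mean(tiles, "c1", date(2023, 1, 11), 7)
```

Because sums and counts are additive, the same tiles can serve several window lengths and several points in time.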
19. FeatureByte
Can Deep Learning also replace traditional feature engineering?
Maybe in the future, but less magic expected…
1. I have not seen anything convincing yet apart from:
a. regular time series
b. sequences of events (successful results for recommendation systems)
2. If solutions emerge, they won’t be pre-trained Off the Shelf Transformers as
in NLP, and you will need a lot of data to train them to get meaningful
results.
3. And for many of us, it may not pass the model validation process:
a. not explainable enough
b. and maybe not robust enough if your data is not XXXL
20. FeatureByte
So why do I think Transformers are revolutionary?
Because:
"Semantics is to artificial intelligence as physics is
to engineering."
John McCarthy, considered to be one of the most important figures in the
history of artificial intelligence
And LLMs are very effective at understanding semantics
22. FeatureByte
The MEANINGFUL ones!
across a variety of signal types and entities associated with the use case
1. Attribute: gets the attribute of the entity at a point-in-time
2. Frequency: counts the occurrence of events
3. Recency: measures the time since the latest event
4. Timing: relates to when the events happened
5. Latest event: attributes of the latest event
6. Stats: aggregates a numeric column's values
7. Diversity: measures the variability of data values
8. Stability: compares recent events to those of earlier periods
9. Similarity: compares an individual entity feature to a group
10. Most frequent: gets the most frequent value of a categorical column
11. Bucketing: aggregates a column's values across categories of a categorical column
12. Attribute stats: collects stats for an attribute of the entity
13. Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes
14. Location…
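A few of these signal types can be sketched for a single customer. The data, the 28-day and 14-day windows, and the specific formulas (entropy for Diversity, a ratio of counts for Stability) are illustrative choices, not the only way to define these signals.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

# (timestamp, product_group, amount) for one customer; toy data.
purchases = [
    (datetime(2023, 1, 2), "dairy", 4.0),
    (datetime(2023, 1, 9), "bakery", 3.0),
    (datetime(2023, 1, 20), "dairy", 6.0),
    (datetime(2023, 1, 27), "dairy", 5.0),
]
now = datetime(2023, 2, 1)  # the point in time of the feature request

def in_window(events, start, end):
    """Events with start <= timestamp < end."""
    return [e for e in events if start <= e[0] < end]

def entropy(counts):
    """Shannon entropy (natural log) of a Counter of category counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

recent = in_window(purchases, now - timedelta(days=28), now)
group_counts = Counter(group for _, group, _ in recent)

frequency = len(recent)                                      # Frequency
recency_days = (now - max(ts for ts, _, _ in recent)).days   # Recency
total_spend = sum(amount for _, _, amount in recent)         # Stats
most_frequent = group_counts.most_common(1)[0][0]            # Most frequent
diversity = entropy(group_counts)                            # Diversity

spend_by_group = {}                                          # Bucketing
for _, group, amount in recent:
    spend_by_group[group] = spend_by_group.get(group, 0.0) + amount

# Stability: activity in the last 14 days relative to the 14 days before.
last_14d = in_window(purchases, now - timedelta(days=14), now)
prev_14d = in_window(purchases, now - timedelta(days=28), now - timedelta(days=14))
stability = len(last_14d) / len(prev_14d) if prev_14d else None
```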
23. FeatureByte
I first understand the use case and the data, and identify relevant entities
Candidates could include:
● the Customer entity
● the Product entity
● the Customer x Product interaction
● the Invoice entity
● the Item entity
Then, for each entity x signal type, I think about which features would make sense
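The entity x signal type grid can be enumerated mechanically, as in the sketch below. The lists simply restate the entities and signal types named in the slides; deciding which cells are meaningful for the use case remains the actual work.

```python
from itertools import product

entities = ["Customer", "Product", "Customer x Product", "Invoice", "Item"]
signal_types = [
    "Attribute", "Frequency", "Recency", "Timing", "Latest event", "Stats",
    "Diversity", "Stability", "Similarity", "Most frequent", "Bucketing",
    "Attribute stats", "Attribute change", "Location",
]

# Every (entity, signal type) pair is a prompt for feature ideation;
# not every pair will yield a sensible feature.
grid = list(product(entities, signal_types))
customer_cells = [cell for cell in grid if cell[0] == "Customer"]
```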
24. FeatureByte
Invoice x Similarity
Customer x Diversity
Customer x Attribute change
Customer x Similarity
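As one possible reading of "Customer x Similarity", the sketch below compares each customer's spend per product group with the average basket of all customers, using cosine similarity. The choice of cosine similarity, the product groups and the amounts are assumptions made for illustration.

```python
import math

# Spend per product group for each customer; toy data.
spend = {
    "c1": {"dairy": 10.0, "bakery": 0.0},
    "c2": {"dairy": 0.0, "bakery": 10.0},
    "c3": {"dairy": 5.0, "bakery": 5.0},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(a.get(k, 0.0) ** 2 for k in keys))
    norm_b = math.sqrt(sum(b.get(k, 0.0) ** 2 for k in keys))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_average(baskets):
    """Average spend per product group across all customers."""
    keys = {k for basket in baskets.values() for k in basket}
    n = len(baskets)
    return {k: sum(b.get(k, 0.0) for b in baskets.values()) / n for k in keys}

average_basket = group_average(spend)
similarity_to_all = {cid: cosine(basket, average_basket)
                     for cid, basket in spend.items()}
```

A customer whose basket mirrors the overall mix scores close to 1, while a customer concentrated on a single group scores lower.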
28. FeatureByte
Areas where Generative AI can already help
3 observations we made when we built FeatureByte Copilot to help feature
ideation.
Generative AI:
1. is already very familiar with data modeling concepts
2. can recognize the semantics of the data columns well beyond numeric and
string if meaningful names or good descriptions are provided
3. already possesses deep domain knowledge that can help
a. take some important decisions in the feature engineering process
b. assess how relevant a feature is to a use case
31. FeatureByte
Knows what an entity is, which is one of the most important
concepts in feature engineering
44. FeatureByte
The 3 innovations are accelerating Data Science by
1. Reducing time-to-value thanks to:
a. quicker experimentation, and smooth transition to deployment
2. Increasing production accuracy thanks to:
a. training / serving consistency
b. transformers
c. context-aware feature generation
3. Increasing Transparency thanks to:
a. Plain English descriptions of features
b. Explanations of a feature's relevance to a use case
45. FeatureByte
Threat or opportunity?
1. If you believe in Model Explainability and Regulation:
a. Generative AI is creating opportunities to bridge the gap between
features and non-technical stakeholders
2. If you are worried about GenAI replacing you:
a. Opportunity to deploy solutions for more use cases and have fun!
b. Opportunity to boost your domain knowledge and your creativity
c. Opportunity to spend more time formulating your use case and
exploring advanced solutions, like causal modelling + A/B testing, that
address it better
46. FeatureByte
Thanks!
If you are curious to know more about FeatureByte:
White Paper on FeatureByte Copilot
Video of FeatureByte Enterprise
Github Repo of the free and source available engine: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.5/