SlideShare a Scribd company logo
1 of 46
Download to read offline
FeatureByte
Accelerating Data Science
through Feature Platform,
Transformers and GenAI
Sept 2023
Xavier Conort
with the help of Google Bard, ChatGPT and FeatureByte!
FeatureByte 2
The talk will not cover problems that Data Science can
solve, but instead we will focus on how 3 innovations in
Data Science Feature Platforms, Transforms and GenAI can
transform the way data scientists work.
These innovations have the potential to make data scientists
more efficient, creative, and transparent.
FeatureByte
Agenda
3
● Define feature engineering
● Pain points solved by Feature Platforms
● The magic of Transformers
● The features I like to extract from transactional data
● How Generative AI help produce those features and make them
more transparent
FeatureByte
A little about me
4
Actuary
Learnt to formulate
mathematically
insurance problems and
solved them with data.
Kaggle addict
Learnt to have fun with
data.
Chief Data Scientist,
DataRobot
Learnt to codify best practices.
Co-founder,
FeatureByte
Solving remaining pain points
and planning to have fun with
data again!
FeatureByte
Define feature
engineering
FeatureByte
Let’s ask Bard!
6
FeatureByte
2 types of features
7
1. Transformed features:
a. combine columns, binning…
b. model specific: scaling, one-hot encoding, embedding, reducing, …
2. Features from raw data:
a. From regular time series
b. From transactional data
c. From slowly changing dimension tables
Well covered by books and blogs, well supported by
tools like sklearn and automated by Auto-ML
Pretty well covered by books and blogs and
automated by Auto-ML, TS package or deep learning
The most exciting but challenging feature
engineering task.
FeatureByte 8
Dimension table with
static description on
products
Item table with details on
purchased products in
customer invoices
Event table
where each
row indicates
an invoice in a
Grocery shop
Slowly Changing Dimension table that contains
data on customers that change over time.
Simple example of transaction data: A Grocery Dataset
FeatureByte
Feature Engineering with transactional data is Complex and Challenging
9
No equivalent of
Xgboost in
feature
engineering
FeatureByte
3 skills are involved
10
DOMAIN
EXPERTISE
DATA
ENGINEERING
DATA
SCIENCE
DW
SAAS
SOURCES
MODEL
TRAINING
MODEL
PREDICTION
STREAMING
FeatureByte
… while 3 innovations are already radically simplifying the equation
11
DOMAIN
EXPERTISE
DATA
ENGINEERING
DATA
SCIENCE
Generative AI is
getting close to
emulate a
Domain Expert
available 24/7
Feature
Platforms are
simplifying data
engineering
Off the Shelf
Transformers are
revolutionizing NLP
FeatureByte
Pain points solved by
feature platforms
FeatureByte 13
Pain Point How a Feature Platform is addressing it
Features computation has too high latency Pre-computes features and stores them in
an online store for low latency
Feature values in production are different
from training affecting production accuracy
Ensures training-serving consistency
Code for efficient time travel in SQL is too
complex to write
Declarative framework in Python to simplify
the creation of features
I don’t want to wait to experiment with my
new feature ideas
Automated backfilling to materialize features
at any point in time in the past and populate
training / test data
Painful search for existing features Feature Catalog to share and reuse features
Don’t know when my new features will be
finally deployed
Unified solution to ensure smooth transition
from experimentation to deployment
FeatureByte
The magic of
Transformers
FeatureByte 15
FeatureByte
What do I think about Transformers?
16
1. As a scientist: they are beautiful
2. As a data scientist: they are awesome
3. As a data engineer: another transformation to operationalize
4. As a product manager: they are revolutionary!
FeatureByte
Immediate use of Transformers
17
1. Radically simplify a complex task (NLP)
2. In many cases, don’t need any training for great outcomes! Can be used off
the shelf to transform raw text columns of transactional data:
a. Text embedding,
b. Name Entity Recognition,
c. Sentiment analysis,
d. Keyword extraction,
e. Extract specific information in a table structure…
3. I just need to aggregate the outputs in a meaningful way
FeatureByte
Do I still need Feature Platforms for Transformers?
18
Feature Platforms can improve Transformers operationalization
1. Point-in-Time correctness
2. Low latency thanks to the pre-computation of features that involve
transformers
3. Caching (via partial aggregations or other mechanisms) to reduce expensive
transformer calls
4. Transformer Library
FeatureByte
Can Deep Learning also replace traditional feature engineering?
19
Maybe in the future, but less magic expected…
1. I have not seen anything convincing yet apart from:
a. regular time series
b. sequences of event (successful results for recommendation system)
2. If solutions emerge, it won’t be pre-trained Off the Shelf Transformers like
for NLP and you will need a lot of data to train them to get meaningful
results.
3. And for many of us, it may not pass the model validation process:
a. not explainable enough
b. and maybe not robust enough if your data is not XXXL
FeatureByte 20
Because:
"Semantics is to artificial intelligence as physics is
to engineering."
John McCarthy, considered to be one of the most important figures in the
history of artificial intelligence
And LLMs are very effective at understanding semantics
So why do I think Transformers are revolutionary?
FeatureByte
The features I like to
extract from
transactional data
FeatureByte 22
across a variety of signal types and entities associated with the use case
1. Attribute: gets the attribute of the entity at a point-in-time
2. Frequency: counts the occurrence of events
3. Recency: measures the time since the latest event
4. Timing: relates to when the events happened
5. Latest event: attributes of the latest event
6. Stats: aggregates a numeric column's values
7. Diversity: measures the variability of data values
8. Stability: compares recent events to those of earlier periods
9. Similarity: compares an individual entity feature to a group
10. Most frequent: gets the most frequent value of a categorical column
11. Bucketing: aggregates a column's values across categories of a categorical column.
12. Attribute stats: collects stats for an attribute of the entity
13. Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes
14. Location…
The MEANINGFUL ones!
FeatureByte 23
I first understand the use case, the data and identify relevant entities
Candidates could include:
● the Customer entity
● the Product entity
● the Customer x Product interaction
● the Invoice entity
● the Item entity
Then, for each entity x signal type, I think which features would make sense
FeatureByte 24
Invoice x Similarity
Customer x Diversity
Customer x Attribute change
Customer x Similarity
FeatureByte 25
Customer x Timing
Features inspired by Owen Zhang, Kaggle
Grandmaster
FeatureByte 26
More than 100 examples in the tutorials of FeatureByte’s free and
source available package
FeatureByte
How Generative AI can
help produce those
features
FeatureByte
Areas where Generative AI can already help
28
3 observations we made when we built FeatureByte Copilot to help feature
ideation.
Generative AI:
1. is already very familiar with data modeling concepts
2. can recognize the semantic of the data columns well beyond numeric and
string if meaningful names or good descriptions are provided
3. already possesses deep domain knowledge that can help
a. take some important decisions in the feature engineering process
b. assess how relevant a feature is to a use case
FeatureByte
Generative AI is already
familiar with data
modeling concepts
FeatureByte
Knows best practices
30
● critical to be point-in-time correct
● base to build interesting features
FeatureByte
Knows what is an entity which is a key concept in feature engineering
31
also, one of the most important
concepts in feature engineering
FeatureByte
Generative AI can
recognize data columns
semantics
FeatureByte
Make the difference between different types of numeric columns
33
FeatureByte
and explain why
34
FeatureByte
before agreeing with GenAI, you may sometimes need to…
35
FeatureByte
… understand better GenAI thinking
36
FeatureByte
Generative AI can help
design your feature
engineering strategy
FeatureByte
Example: can recommend strategy to filter data
38
FeatureByte
Generative AI can
assess the feature
relevance to a use case
FeatureByte
Plain English explanation
40
FeatureByte
… even for complex features
41
FeatureByte
… and is sharing its domain knowledge during the process
42
FeatureByte
Conclusion
FeatureByte
The 3 innovations are accelerating Data Science by
44
1. Reducing time-to-value thanks to:
a. quicker experimentation, and smooth transition to deployment
2. Increasing production accuracy thanks to:
a. training / serving consistency
b. transformers
c. context aware feature generation
3. Increasing Transparency thanks to:
a. Plain English descriptions of features
b. Relevance explanations to a use case
FeatureByte
Thread or opportunity?
45
1. If you believe in Model Explainability and Regulation:
a. Generative AI is creating opportunities to bridge the gap between
features and non-technical stakeholders
2. If you are worried about GenAI replacing you:
a. Opportunity to deploy solutions for more use cases and have fun!
b. Opportunity to boost your domain knowledge and your creativity
c. Opportunity to spend more time in formulating your use case and
exploring advanced solutions like causal modelling + A/B testing that
are better addressing it
FeatureByte
Thanks!
If you are curious to know more about Featurebyte
White Paper on FeatureByte Copilot
Video of FeatureByte Enterprise
Github Repo of the free and source available engine: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.5/

More Related Content

Similar to Accelerating Data Science through Feature Platform, Transformers, and GenAI

Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveJune Andrews
 
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...AgileNetwork
 
The Art Of Performance Tuning - with presenter notes!
The Art Of Performance Tuning - with presenter notes!The Art Of Performance Tuning - with presenter notes!
The Art Of Performance Tuning - with presenter notes!Jonathan Ross
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTKAshish Jaiman
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Digital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfDigital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfssuserd23711
 
Google machine learning engineer exam dumps 2022
Google machine learning engineer exam dumps 2022Google machine learning engineer exam dumps 2022
Google machine learning engineer exam dumps 2022SkillCertProExams
 
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023FeatureByte
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIVijayananda Mohire
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Balancing Infrastructure with Optimization and Problem Formulation
Balancing Infrastructure with Optimization and Problem FormulationBalancing Infrastructure with Optimization and Problem Formulation
Balancing Infrastructure with Optimization and Problem FormulationAlex D. Gaudio
 
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Hima Patel
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016MLconf
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 

Similar to Accelerating Data Science through Feature Platform, Transformers, and GenAI (20)

Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
 
The Art Of Performance Tuning - with presenter notes!
The Art Of Performance Tuning - with presenter notes!The Art Of Performance Tuning - with presenter notes!
The Art Of Performance Tuning - with presenter notes!
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTK
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
KFServing and Feast
KFServing and FeastKFServing and Feast
KFServing and Feast
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Digital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdfDigital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdf
 
Google machine learning engineer exam dumps 2022
Google machine learning engineer exam dumps 2022Google machine learning engineer exam dumps 2022
Google machine learning engineer exam dumps 2022
 
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
Do you have too many meaningless features? — Featurebyte @ ODSC East 2023
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDC
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Balancing Infrastructure with Optimization and Problem Formulation
Balancing Infrastructure with Optimization and Problem FormulationBalancing Infrastructure with Optimization and Problem Formulation
Balancing Infrastructure with Optimization and Problem Formulation
 
ODSC APAC 2022 - Explainable AI
ODSC APAC 2022 - Explainable AIODSC APAC 2022 - Explainable AI
ODSC APAC 2022 - Explainable AI
 
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen...
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 

Recently uploaded

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Accelerating Data Science through Feature Platform, Transformers, and GenAI

  • 1. FeatureByte Accelerating Data Science through Feature Platform, Transformers and GenAI Sept 2023 Xavier Conort with the help of Google Bard, ChatGPT and FeatureByte!
  • 2. FeatureByte 2 The talk will not cover problems that Data Science can solve, but instead we will focus on how 3 innovations in Data Science Feature Platforms, Transforms and GenAI can transform the way data scientists work. These innovations have the potential to make data scientists more efficient, creative, and transparent.
  • 3. FeatureByte Agenda 3 ● Define feature engineering ● Pain points solved by Feature Platforms ● The magic of Transformers ● The features I like to extract from transactional data ● How Generative AI help produce those features and make them more transparent
  • 4. FeatureByte A little about me 4 Actuary Learnt to formulate mathematically insurance problems and solved them with data. Kaggle addict Learnt to have fun with data. Chief Data Scientist, DataRobot Learnt to codify best practices. Co-founder, FeatureByte Solving remaining pain points and planning to have fun with data again!
  • 7. FeatureByte 2 types of features 7 1. Transformed features: a. combine columns, binning… b. model specific: scaling, one-hot encoding, embedding, reducing, … 2. Features from raw data: a. From regular time series b. From transactional data c. From slowly changing dimension tables Well covered by books and blogs, well supported by tools like sklearn and automated by Auto-ML Pretty well covered by books and blogs and automated by Auto-ML, TS package or deep learning The most exciting but challenging feature engineering task.
  • 8. FeatureByte 8 Dimension table with static description on products Item table with details on purchased products in customer invoices Event table where each row indicates an invoice in a Grocery shop Slowly Changing Dimension table that contains data on customers that change over time. Simple example of transaction data: A Grocery Dataset
  • 9. FeatureByte Feature Engineering with transactional data is Complex and Challenging 9 No equivalent of Xgboost in feature engineering
  • 10. FeatureByte 3 skills are involved 10 DOMAIN EXPERTISE DATA ENGINEERING DATA SCIENCE DW SAAS SOURCES MODEL TRAINING MODEL PREDICTION STREAMING
  • 11. FeatureByte … while 3 innovations are already radically simplifying the equation 11 DOMAIN EXPERTISE DATA ENGINEERING DATA SCIENCE Generative AI is getting close to emulate a Domain Expert available 24/7 Feature Platforms are simplifying data engineering Off the Shelf Transformers are revolutionizing NLP
  • 12. FeatureByte Pain points solved by feature platforms
  • 13. FeatureByte 13 Pain Point How a Feature Platform is addressing it Features computation has too high latency Pre-computes features and stores them in an online store for low latency Feature values in production are different from training affecting production accuracy Ensures training-serving consistency Code for efficient time travel in SQL is too complex to write Declarative framework in Python to simplify the creation of features I don’t want to wait to experiment with my new feature ideas Automated backfilling to materialize features at any point in time in the past and populate training / test data Painful search for existing features Feature Catalog to share and reuse features Don’t know when my new features will be finally deployed Unified solution to ensure smooth transition from experimentation to deployment
  • 16. FeatureByte What do I think about Transformers? 16 1. As a scientist: they are beautiful 2. As a data scientist: they are awesome 3. As a data engineer: another transformation to operationalize 4. As a product manager: they are revolutionary!
  • 17. FeatureByte Immediate use of Transformers 17 1. Radically simplify a complex task (NLP) 2. In many cases, don’t need any training for great outcomes! Can be used off the shelf to transform raw text columns of transactional data: a. Text embedding, b. Name Entity Recognition, c. Sentiment analysis, d. Keyword extraction, e. Extract specific information in a table structure… 3. I just need to aggregate the outputs in a meaningful way
  • 18. FeatureByte Do I still need Feature Platforms for Transformers? 18 Feature Platforms can improve Transformers operationalization 1. Point-in-Time correctness 2. Low latency thanks to the pre-computation of features that involve transformers 3. Caching (via partial aggregations or other mechanisms) to reduce expensive transformer calls 4. Transformer Library
  • 19. FeatureByte Can Deep Learning also replace traditional feature engineering? 19 Maybe in the future, but less magic expected… 1. I have not seen anything convincing yet apart from: a. regular time series b. sequences of event (successful results for recommendation system) 2. If solutions emerge, it won’t be pre-trained Off the Shelf Transformers like for NLP and you will need a lot of data to train them to get meaningful results. 3. And for many of us, it may not pass the model validation process: a. not explainable enough b. and maybe not robust enough if your data is not XXXL
  • 20. FeatureByte 20 Because: "Semantics is to artificial intelligence as physics is to engineering." John McCarthy, considered to be one of the most important figures in the history of artificial intelligence And LLMs are very effective at understanding semantics So why do I think Transformers are revolutionary?
  • 21. FeatureByte The features I like to extract from transactional data
  • 22. FeatureByte 22 across a variety of signal types and entities associated with the use case 1. Attribute: gets the attribute of the entity at a point-in-time 2. Frequency: counts the occurrence of events 3. Recency: measures the time since the latest event 4. Timing: relates to when the events happened 5. Latest event: attributes of the latest event 6. Stats: aggregates a numeric column's values 7. Diversity: measures the variability of data values 8. Stability: compares recent events to those of earlier periods 9. Similarity: compares an individual entity feature to a group 10. Most frequent: gets the most frequent value of a categorical column 11. Bucketing: aggregates a column's values across categories of a categorical column. 12. Attribute stats: collects stats for an attribute of the entity 13. Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes 14. Location… The MEANINGFUL ones!
  • 23. FeatureByte 23 I first understand the use case, the data and identify relevant entities Candidates could include: ● the Customer entity ● the Product entity ● the Customer x Product interaction ● the Invoice entity ● the Item entity Then, for each entity x signal type, I think which features would make sense
  • 24. FeatureByte 24 Invoice x Similarity Customer x Diversity Customer x Attribute change Customer x Similarity
  • 25. FeatureByte 25 Customer x Timing Features inspired by Owen Zhang, Kaggle Grandmaster
  • 26. FeatureByte 26 More than 100 examples in the tutorials of FeatureByte’s free and source available package
  • 27. FeatureByte How Generative AI can help produce those features
  • 28. FeatureByte Areas where Generative AI can already help 28 3 observations we made when we built FeatureByte Copilot to help feature ideation. Generative AI: 1. is already very familiar with data modeling concepts 2. can recognize the semantic of the data columns well beyond numeric and string if meaningful names or good descriptions are provided 3. already possesses deep domain knowledge that can help a. take some important decisions in the feature engineering process b. assess how relevant a feature is to a use case
  • 29. FeatureByte Generative AI is already familiar with data modeling concepts
  • 30. FeatureByte Knows best practices 30 ● critical to be point-in-time correct ● base to build interesting features
  • 31. FeatureByte Knows what is an entity which is a key concept in feature engineering 31 also, one of the most important concepts in feature engineering
  • 32. FeatureByte Generative AI can recognize data columns semantics
  • 33. FeatureByte Make the difference between different types of numeric columns 33
  • 35. FeatureByte before agreeing with GenAI, you may sometimes need to… 35
  • 37. FeatureByte Generative AI can help design your feature engineering strategy
  • 38. FeatureByte Example: can recommend strategy to filter data 38
  • 39. FeatureByte Generative AI can assess the feature relevance to a use case
  • 41. FeatureByte … even for complex features 41
  • 42. FeatureByte … and is sharing its domain knowledge during the process 42
  • 44. FeatureByte The 3 innovations are accelerating Data Science by 44 1. Reducing time-to-value thanks to: a. quicker experimentation, and smooth transition to deployment 2. Increasing production accuracy thanks to: a. training / serving consistency b. transformers c. context aware feature generation 3. Increasing Transparency thanks to: a. Plain English descriptions of features b. Relevance explanations to a use case
  • 45. FeatureByte Thread or opportunity? 45 1. If you believe in Model Explainability and Regulation: a. Generative AI is creating opportunities to bridge the gap between features and non-technical stakeholders 2. If you are worried about GenAI replacing you: a. Opportunity to deploy solutions for more use cases and have fun! b. Opportunity to boost your domain knowledge and your creativity c. Opportunity to spend more time in formulating your use case and exploring advanced solutions like causal modelling + A/B testing that are better addressing it
  • 46. FeatureByte Thanks! If you are curious to know more about Featurebyte White Paper on FeatureByte Copilot Video of FeatureByte Enterprise Github Repo of the free and source available engine: https://github.com/featurebyte/featurebyte Documentation: https://docs.featurebyte.com/0.5/