Label engineering and N-Hot Encoders
• In a movie review dataset, let’s consider an unsupervised pipeline with a pre-processing phase
• Some of our features might hold categorical data
Pipeline: preprocessing -> dimensionality reduction -> clustering -> validation
© Mor Krispil, 2018
Categorical features in the example: score (text), genre (text)
Label engineering – Categorical features
• In a movie review dataset, let’s consider an unsupervised pipeline with a pre-processing phase
• Some of our features might hold categorical data
name             score    genre
Super Kill       high     action
Music and more   medium   musical
UFO Invasion 2   low      sci-fi
Label engineering – Label Encoding
Each unique label is coded with a running integer. For example:
• Score: (“low”, “medium”, “high”) => (1, 2, 3)
• Genre: (“action”, “musical”, “sci-fi”) => (1, 2, 3)
Each label feature is replaced with a numeric feature, so the shape is unchanged:
X (m, n) => Xt (m, n)
name             score   genre
Super Kill       3       1
Music and more   2       2
UFO Invasion 2   1       3
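The mapping above can be sketched in plain Python. This is a minimal illustration rather than the author's code: the dictionaries `SCORE_CODES` and `GENRE_CODES` and the helper `label_encode` are hypothetical names, and the codes are written out explicitly so that they match the slide (scikit-learn's LabelEncoder would instead assign 0-based codes in sorted label order).

```python
# Hypothetical explicit code books, matching the running integers on the slide.
SCORE_CODES = {"low": 1, "medium": 2, "high": 3}
GENRE_CODES = {"action": 1, "musical": 2, "sci-fi": 3}

movies = [
    {"name": "Super Kill", "score": "high", "genre": "action"},
    {"name": "Music and more", "score": "medium", "genre": "musical"},
    {"name": "UFO Invasion 2", "score": "low", "genre": "sci-fi"},
]


def label_encode(rows):
    """Replace each categorical value with its integer code; the shape is unchanged."""
    return [
        {
            "name": row["name"],
            "score": SCORE_CODES[row["score"]],
            "genre": GENRE_CODES[row["genre"]],
        }
        for row in rows
    ]


encoded = label_encode(movies)
for row in encoded:
    print(row["name"], row["score"], row["genre"])
```

An unseen label raises a KeyError here, which makes the "new values must be coded by hand" limitation visible early.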
Label Encoding - Review
Pros:
• Simple implementation (e.g. scikit-learn's LabelEncoder)
• Dataset shape and density stay the same
Limitations:
• New label values must be carefully assigned the next running integer
• Should only be used with labels whose order is meaningful!
– “medium” (2) is indeed > “low” (1)
– But is “sci-fi” (3) > “musical” (2)?
Label engineering – One Hot Encoding
Each unique label is assigned a new binary feature, in which 1 represents its presence.
• For example, our Genre feature is transformed into 3 binary features:
name             genre_action   genre_musical   genre_sci-fi
Super Kill       1              0               0
Music and more   0              1               0
UFO Invasion 2   0              0               1
For m rows with n label features:
X (m, n) => Xt (m, 1st_unique_vals + ... + nth_unique_vals)
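The Genre transformation can be sketched in plain Python as follows. The helper `one_hot_encode` is a hypothetical name used only for illustration; in practice the built-in scikit-learn OneHotEncoder or pandas get_dummies would usually be preferred.

```python
def one_hot_encode(rows, feature):
    """Expand one categorical feature into one binary column per unique value."""
    values = sorted({row[feature] for row in rows})
    columns = [f"{feature}_{value}" for value in values]
    encoded = []
    for row in rows:
        # Keep every other column as-is and drop the original categorical one.
        new_row = {key: val for key, val in row.items() if key != feature}
        for value, column in zip(values, columns):
            new_row[column] = 1 if row[feature] == value else 0
        encoded.append(new_row)
    return columns, encoded


movies = [
    {"name": "Super Kill", "genre": "action"},
    {"name": "Music and more", "genre": "musical"},
    {"name": "UFO Invasion 2", "genre": "sci-fi"},
]

columns, encoded = one_hot_encode(movies, "genre")
print(columns)
for row in encoded:
    print(row["name"], [row[column] for column in columns])
```

Each row ends up with exactly one 1 among the new columns, and the column count grows with the number of unique values.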
One Hot Encoding - Review
Pros:
• New label values can be stacked later as additional features
• Built-in implementations in both scikit-learn (OneHotEncoder) and pandas (get_dummies)
Limitations:
• Works well only with few label values per feature (enum-like)
• Feature “explosion” and the Curse of Dimensionality with many values
Label Encoding – prefer One Hot Encoding
• Even with many label values, the resulting matrix is highly sparse
• Recent scikit-learn versions support sparse DataFrames out of the box (not just SciPy csr_matrix anymore)
Pipeline: preprocessing (One-Hot Encoding) -> dimensionality reduction -> clustering -> validation
Label Encoding – prefer One Hot Encoding
• We can then apply “sparse-friendly” dimensionality reduction, such as scikit-learn's TruncatedSVD, which does not center the data before decomposition (unlike PCA), so your matrix gets to stay sparse!
Pipeline: preprocessing (One-Hot Encoding) -> dimensionality reduction (TruncatedSVD) -> clustering -> validation
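A minimal sketch of this step, assuming NumPy, SciPy and scikit-learn are installed. The toy matrix, `n_components=2` and `random_state=0` are illustrative assumptions, not values from the slides.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy encoded matrix: 4 movies x 5 genre columns (illustrative values only).
dense = np.array(
    [
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 0, 0, 1, 0],
    ],
    dtype=np.float64,
)
X_sparse = csr_matrix(dense)

# TruncatedSVD accepts the sparse matrix directly and does not center it,
# so the input never has to be converted to a dense array.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_sparse.shape, "->", X_reduced.shape)
```

The reduced output is a small dense array that can be passed on to the clustering step.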
Label Encoding – Multi / Weighted Labels
But what can we do with multiple labels per feature?
• Scenario 1: a movie in our current dataset has more than one genre:
name                score    genre
Super Kill          high     action
Music and more      medium   musical
UFO Invasion 2      low      sci-fi
Killing Me Softly   high     action, comedy
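One possible way to handle Scenario 1, sketched in plain Python: split the genre string and switch on one column per label found. The helper `multi_hot_encode` and the comma separator are assumptions made for this illustration, not something the slides specify.

```python
def multi_hot_encode(rows, feature, sep=","):
    """Turn a delimited multi-label feature into one 0/1 column per label."""
    label_sets = [
        {label.strip() for label in row[feature].split(sep)} for row in rows
    ]
    vocabulary = sorted(set().union(*label_sets))
    matrix = [
        [1 if label in labels else 0 for label in vocabulary]
        for labels in label_sets
    ]
    return vocabulary, matrix


movies = [
    {"name": "Super Kill", "genre": "action"},
    {"name": "Music and more", "genre": "musical"},
    {"name": "UFO Invasion 2", "genre": "sci-fi"},
    {"name": "Killing Me Softly", "genre": "action, comedy"},
]

vocabulary, matrix = multi_hot_encode(movies, "genre")
print(vocabulary)
for movie, row in zip(movies, matrix):
    print(movie["name"], row)
```

Unlike One-Hot, a row may now contain more than one 1.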
Label Encoding – Multi / Weighted Labels
Also, in the preprocessing phase we’d sometimes like to aggregate rows per entity.
What can we do with multiple occurrences of the same value? A weighted representation?
• Scenario 2: In a user watch-list dataset, we’d like to aggregate the genres watched, per user
user   movie            genre
sam    Super Kill       action
sam    Super Kill 2     action
sam    Music and more   musical

user   genres
sam    ??
Label Encoding and N-Hot Encoders
N-Hot Encoding assigns each unique label a new numeric feature whose value, from 1 to N, represents both presence and weight (0 means the label is absent).
user           genre_action   genre_musical   genre_sci-fi
sam            2              1               0
musical_dude   0              10              0
scully         5              0               20
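The aggregation behind the "sam" row can be sketched with the standard library. The function `n_hot_encode` and the fixed `vocabulary` list are hypothetical names for this illustration; the watch-list records are the ones from the Scenario 2 slide.

```python
from collections import Counter, defaultdict

watch_list = [
    ("sam", "Super Kill", "action"),
    ("sam", "Super Kill 2", "action"),
    ("sam", "Music and more", "musical"),
]


def n_hot_encode(records, vocabulary):
    """Count label occurrences per user; each count becomes that feature's weight."""
    counts = defaultdict(Counter)
    for user, _movie, genre in records:
        counts[user][genre] += 1
    # A Counter returns 0 for labels the user never watched.
    return {
        user: [genre_counts[genre] for genre in vocabulary]
        for user, genre_counts in counts.items()
    }


vocabulary = ["action", "musical", "sci-fi"]
encoded = n_hot_encode(watch_list, vocabulary)
print(encoded)
```

Capping every count at 1 would give back a plain binary encoding, so One-Hot can be seen as the special case N = 1.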
N-Hot Encoders - Review
Pros:
• Weighted categorical features, model-ready
• Resulting data shape and sparsity are the same as with One-Hot
Limitations:
• Same as with One-Hot
• No built-in implementations
• Slightly more memory: features are numeric, not binary. Use the smallest integer type that fits
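The last point can be illustrated with the standard-library `array` module. The helper `smallest_unsigned_typecode` is a hypothetical name, and the counts are the ones from the N-Hot table; with NumPy the analogous choice would be a dtype such as uint8.

```python
from array import array


def smallest_unsigned_typecode(max_value):
    """Return the narrowest unsigned array typecode that can hold max_value."""
    if max_value < 0:
        raise ValueError("counts must be non-negative")
    for typecode in ("B", "H", "I", "L", "Q"):
        bits = 8 * array(typecode).itemsize
        if max_value < 2 ** bits:
            return typecode
    raise OverflowError("value does not fit in the widest unsigned typecode")


rows = {"sam": [2, 1, 0], "musical_dude": [0, 10, 0], "scully": [5, 0, 20]}
max_count = max(max(counts) for counts in rows.values())
typecode = smallest_unsigned_typecode(max_count)

# Store every row with the narrowest integer type that fits the largest count.
compact = {user: array(typecode, counts) for user, counts in rows.items()}
print(typecode, compact["scully"].itemsize)
```

Here the largest weight is 20, so one byte per feature is enough.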
Label Encoding and N-Hot Encoders
Thanks
Pipeline: preprocessing (N-Hot Encoding) -> dimensionality reduction (TruncatedSVD) -> clustering -> validation
