Label engineering and N-Hot Encoders
• In a movie-review dataset, let’s consider an unsupervised pipeline with a preprocessing phase
• Some of our features might hold categorical data
[Pipeline diagram: preprocessing, dimensionality reduction, clustering, validation. Example features: score (text), genre (text)]
© Mor Krispil, 2018
Label engineering – Categorical features
name             score    genre
Super Kill       high     action
Music and more   medium   musical
UFO Invasion 2   low      sci-fi
Label engineering – Label Encoding
Each unique label is coded with a running int. For ex:
• Score: (“low”, “medium”, “high”) => (1, 2, 3)
• Genre: (“action”, “musical”, “sci-fi”) => (1, 2, 3)
Each label feature is replaced with a numeric feature:
X (m, n) => Xt (m, n)
name             score   genre
Super Kill       3       1
Music and more   2       2
UFO Invasion 2   1       3
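The mapping above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the names SCORE_ORDER, GENRE_ORDER and fit_label_codes are made up for the example, and the label order is assumed to be supplied by hand. Note that scikit-learn's LabelEncoder numbers the sorted labels from 0, so it would not reproduce the low < medium < high order by itself.

```python
# Explicit label order per feature (assumed to be known up front).
# Position in the list + 1 gives the running int shown in the table.
SCORE_ORDER = ["low", "medium", "high"]
GENRE_ORDER = ["action", "musical", "sci-fi"]


def fit_label_codes(ordered_labels):
    """Map each label to a running int starting at 1."""
    return {label: code for code, label in enumerate(ordered_labels, start=1)}


score_codes = fit_label_codes(SCORE_ORDER)
genre_codes = fit_label_codes(GENRE_ORDER)

movies = [
    ("Super Kill", "high", "action"),
    ("Music and more", "medium", "musical"),
    ("UFO Invasion 2", "low", "sci-fi"),
]

# Each text feature is replaced by one numeric feature: shape (m, n) is unchanged.
encoded = [
    (name, score_codes[score], genre_codes[genre])
    for name, score, genre in movies
]
print(encoded)
```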
Label Encoding - Review
Pros:
• Simple implementation (e.g. scikit-learn LabelEncoder)
• Dataset shape and density stay the same
Limitations:
• New label values must be carefully assigned the next running int
• Can only be used with labels whose order is meaningful!
– “medium” (2) is indeed > “low” (1)
– But is “sci-fi” (3) > “musical” (2)??
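To see why the order matters, here is a small sketch of what a distance-based step (such as clustering) would read from the genre codes. The helper code_distance is hypothetical and only illustrates the point.

```python
# Genre codes as assigned by the running-int scheme.
genre_codes = {"action": 1, "musical": 2, "sci-fi": 3}


def code_distance(a, b):
    """Absolute difference between the integer codes of two labels."""
    return abs(genre_codes[a] - genre_codes[b])


# The encoding claims sci-fi is twice as far from action as musical is,
# although the genres have no natural order at all.
print(code_distance("action", "musical"))
print(code_distance("action", "sci-fi"))
```

Any model that compares values numerically inherits this artificial ordering.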
Label engineering – One Hot Encoding
Each unique label is assigned a new binary feature, in which 1 represents its presence.
• For example, our Genre feature is transformed into 3 binary features:
name             genre_action   genre_musical   genre_sci-fi
Super Kill       1              0               0
Music and more   0              1               0
UFO Invasion 2   0              0               1
For m rows with n label features:
X (m, n) => Xt (m, u_1 + ... + u_n), where u_i is the number of unique values of the i-th feature
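A minimal sketch of the transformation for a single feature, in plain Python. The function one_hot is an illustrative helper, not a library call; it sorts the unique labels to get a stable column order, which is an assumption of this example.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one binary column per unique value."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


columns, rows = one_hot(["action", "musical", "sci-fi"], prefix="genre")
print(columns)
for row in rows:
    print(row)
```

In practice scikit-learn's OneHotEncoder or pandas' get_dummies would do the same job.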
One Hot Encoding - Review
Pros:
• New label values can be stacked later as additional
features
• Built-in implementations in both scikit-learn (OneHotEncoder) and pandas (get_dummies)
Limitations:
• Works well only with few label values per feature (enum-like)
• Feature “explosion” and the Curse of Dimensionality
with many values
Label Encoding – prefer One Hot Encoding
• Even with many label values – the resulting matrix is
highly sparse
• Recent scikit-learn versions support sparse DataFrames out of the box (not just scipy csr_matrix anymore)
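The sparsity claim can be checked with a small sketch: each row holds exactly one 1 per encoded feature, so a feature with k labels yields a matrix of density 1/k. The helper one_hot_sparse and the synthetic labels below are assumptions made for illustration; only the coordinates of the 1s are stored, which is roughly what a sparse matrix format does.

```python
def one_hot_sparse(values):
    """One-hot encode a single feature, storing only the cells that hold a 1.

    Returns ((n_rows, n_columns), set of (row, column) coordinates).
    """
    column_of = {label: j for j, label in enumerate(sorted(set(values)))}
    cells = {(i, column_of[value]) for i, value in enumerate(values)}
    return (len(values), len(column_of)), cells


# Hypothetical feature: 5,000 rows drawn from 1,000 distinct labels.
values = [f"label_{i % 1000}" for i in range(5000)]
shape, cells = one_hot_sparse(values)

# Fraction of cells that are non-zero.
density = len(cells) / (shape[0] * shape[1])
print(shape, len(cells), density)
```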
[Pipeline diagram: preprocessing (One-Hot Encoding), dimensionality reduction, clustering, validation]
Label Encoding – prefer One Hot Encoding
• We can then apply “sparse-friendly” dimensionality reduction, like scikit-learn TruncatedSVD, which does not require centering the data first, so your matrix gets to stay sparse!
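For reference, the factorization behind a truncated SVD can be written as follows. The symbols are introduced here for illustration: X is the encoded m-by-d matrix and k is the number of components kept.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{d \times k}

% Reduced representation passed on to clustering:
Z = X V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}

% PCA, by contrast, factorizes the centered matrix, which is dense in general:
X - \mathbf{1}\mu^{\top}, \qquad \mu = \tfrac{1}{m} X^{\top} \mathbf{1}
```

Because X itself is factorized, the zeros of the one-hot matrix never have to be filled in.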
[Pipeline diagram: preprocessing (One-Hot Encoding), dimensionality reduction (Truncated SVD), clustering, validation]
Label Encoding – Multi / Weighted Labels
But what can we do with multiple labels per feature?
• Scenario 1: from our current dataset:
name                score    genre
Super Kill          high     action
Music and more      medium   musical
UFO Invasion 2      low      sci-fi
Killing Me Softly   high     action, comedy
Label Encoding – Multi / Weighted Labels
Also, in the preprocessing phase we’d sometimes like
to aggregate rows per some entity.
What can we do with multiple occurrences of the
same value? A weighted representation?
• Scenario 2: In a user watch-list dataset, we’d like to
aggregate the genres watched, per user
user   movie            genre
sam    Super Kill       action
sam    Super Kill 2     action
sam    Music and more   musical

user   Genres
sam    ??
Label Encoding and N-Hot Encoders
In N-Hot Encoding, each unique label is assigned a new numeric feature whose value, from 1 to N, represents both presence and weight (0 means the label is absent).
user           genre_action   genre_musical   genre_sci-fi
sam            2              1               0
musical_dude   0              10              0
scully         5              0               20
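Since there is no built-in encoder for this, here is a minimal sketch using collections.Counter. The function n_hot and the fixed GENRES column order are assumptions of the example; the watch list is the one from Scenario 2, and the last call shows Scenario 1 as the special case where every weight is 0 or 1.

```python
from collections import Counter

# Fixed column order, assumed to be known up front.
GENRES = ["action", "musical", "sci-fi"]


def n_hot(labels, vocabulary=GENRES):
    """Count how often each vocabulary label occurs; 0 means absent."""
    counts = Counter(labels)
    return [counts[label] for label in vocabulary]


watch_list = [
    ("sam", "Super Kill", "action"),
    ("sam", "Super Kill 2", "action"),
    ("sam", "Music and more", "musical"),
]

# Scenario 2: aggregate the genres watched per user, then encode.
genres_by_user = {}
for user, _movie, genre in watch_list:
    genres_by_user.setdefault(user, []).append(genre)

encoded = {user: n_hot(genres) for user, genres in genres_by_user.items()}
print(encoded)

# Scenario 1: several labels in one cell, each occurring once.
multi = n_hot("action, comedy".split(", "),
              vocabulary=["action", "comedy", "musical", "sci-fi"])
print(multi)
```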
N-Hot Encoders - Review
Pros:
• Weighted categorical features, model ready
• Resulting data shape and sparsity are the same as with One-Hot
Limitations:
• Same as with One-Hot
• No built-in implementations
• A little extra memory: features are numeric, not binary. Use the minimum int size required
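The advice on integer size can be sketched with the standard-library array module. The helper smallest_unsigned_typecode is hypothetical; with NumPy the analogous choice would be a dtype such as uint8.

```python
from array import array


def smallest_unsigned_typecode(max_value):
    """Pick the narrowest unsigned array typecode that can hold max_value."""
    for typecode in ("B", "H", "I", "L", "Q"):
        # itemsize is the width in bytes of one element on this platform.
        if max_value < 2 ** (8 * array(typecode).itemsize):
            return typecode
    raise OverflowError("value does not fit in an unsigned 64-bit integer")


# One N-hot row from the table above: the largest weight is 20.
row = [5, 0, 20]
compact = array(smallest_unsigned_typecode(max(row)), row)
print(compact.typecode, compact.itemsize)
```

Weights up to 255 fit in a single byte per cell, which keeps the extra cost over a binary one-hot matrix small.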
Label Encoding and N-Hot Encoders
Thanks
[Pipeline diagram: preprocessing (N-Hot Encoding), dimensionality reduction (Truncated SVD), clustering, validation]
