ML Label engineering and N-Hot Encoders

Label engineering and N-Hot Encoders
• In a movie reviews dataset, let’s consider an
unsupervised pipeline with a pre-processing phase
• Some of our features might hold categorical data
Pipeline diagram: preprocessing -> dimensionality reduction -> clustering validation
© Mor Krispil, 2018
Categorical features: score (text), genre (text)

Label engineering – Categorical features
name            score   genre
Super Kill      high    action
Music and more  medium  musical
UFO Invasion 2  low     sci-fi

Label engineering – Label Encoding
Each unique label is coded with a running int. For example:
• Score: (“low”, “medium”, “high”) => (1, 2, 3)
• Genre: (“action”, “musical”, “sci-fi”) => (1, 2, 3)
Each label feature is replaced with a numeric feature:
X (m, n) => Xt (m, n)
name            score  genre
Super Kill      3      1
Music and more  2      2
UFO Invasion 2  1      3
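
A minimal sketch of the mapping above in plain Python. The helper name `fit_label_encoding` and the `movies` list are hypothetical, introduced only for illustration; the label order is passed in explicitly so that the running ints match the slide.

```python
def fit_label_encoding(ordered_labels, start=1):
    """Map each label to a running int, in the order given (hypothetical helper)."""
    return {label: code for code, label in enumerate(ordered_labels, start=start)}


# Toy rows copied from the table on the slide.
movies = [
    {"name": "Super Kill", "score": "high", "genre": "action"},
    {"name": "Music and more", "score": "medium", "genre": "musical"},
    {"name": "UFO Invasion 2", "score": "low", "genre": "sci-fi"},
]

score_codes = fit_label_encoding(["low", "medium", "high"])
genre_codes = fit_label_encoding(["action", "musical", "sci-fi"])

# Each label feature is replaced by its int code; the shape (m, n) is unchanged.
# An unseen label raises KeyError, which makes the "new values" limitation visible.
encoded = [
    {
        "name": m["name"],
        "score": score_codes[m["score"]],
        "genre": genre_codes[m["genre"]],
    }
    for m in movies
]
```

Passing the order explicitly is a design choice of this sketch: it keeps "low" < "medium" < "high" meaningful instead of relying on alphabetical order.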

Label Encoding - Review
Pros:
• Simple implementation (e.g. scikit-learn LabelEncoder)
• Dataset shape and density stay the same
Limitations:
• New label values must be carefully assigned the next
running int
• Should only be used when the label order is
meaningful!
– “medium” (2) is indeed > “low” (1)
– But is “sci-fi” (3) > “musical” (2)?

Label engineering – One Hot Encoding
Each unique label is assigned a new binary
feature, in which 1 represents its presence.
• For example, our Genre feature is transformed into 3
binary features (genre_action, genre_musical, genre_sci-fi).
For m rows with n label features, where k_i is the number of
unique values of the i-th feature:
X (m, n) => Xt (m, k_1 + ... + k_n)
name            genre_action  genre_musical  genre_sci-fi
Super Kill      1             0              0
Music and more  0             1              0
UFO Invasion 2  0             0              1
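
The transformation above can be sketched in plain Python as follows. The function `one_hot_encode` and the `movies` rows are hypothetical names used only for this illustration; in practice the built-in encoders of scikit-learn or pandas would normally be used instead.

```python
def one_hot_encode(rows, feature):
    """Replace `feature` with one binary column per unique label (illustrative)."""
    # Sort the unique labels so the new columns come out in a stable order.
    labels = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != feature}
        for label in labels:
            new_row[f"{feature}_{label}"] = 1 if row[feature] == label else 0
        encoded.append(new_row)
    return encoded


movies = [
    {"name": "Super Kill", "genre": "action"},
    {"name": "Music and more", "genre": "musical"},
    {"name": "UFO Invasion 2", "genre": "sci-fi"},
]

one_hot = one_hot_encode(movies, "genre")
```

Each row ends up with exactly one 1 among the new genre columns, which is why the result becomes very sparse as the number of unique labels grows.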

One Hot Encoding - Review
Pros:
• New label values can be added later as additional
features
• Built-in implementations in both scikit-learn
(OneHotEncoder) and pandas (get_dummies)
Limitations:
• Works well only with few label values per feature (enum-like)
• With many values: feature “explosion” and the Curse of
Dimensionality

Label Encoding – prefer One Hot Encoding
• Even with many label values – the resulting matrix is
highly sparse
• Recent scikit-learn versions support sparse DataFrames out
of the box (not just scipy csr_matrix anymore)
Pipeline diagram: preprocessing (One-Hot Encoding) -> dimensionality reduction

Label Encoding – prefer One Hot Encoding
• We can then apply “sparse friendly” Dimensionality
Reduction, like scikit-learn TruncatedSVD, which (unlike PCA)
does not center the data, so your matrix
gets to stay sparse!
Pipeline diagram: preprocessing (One-Hot Encoding) -> dimensionality reduction (TruncatedSVD)
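
A minimal sketch of this two-step pipeline, assuming scikit-learn is installed. The toy matrix `X` reuses the score and genre values from the earlier table; `n_components=2` and `random_state=0` are arbitrary choices for illustration, not values from the slides.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

# Toy categorical features (score, genre), one row per movie.
X = [
    ["high", "action"],
    ["medium", "musical"],
    ["low", "sci-fi"],
]

# OneHotEncoder returns a scipy sparse matrix by default:
# 3 unique scores + 3 unique genres = 6 binary columns.
encoder = OneHotEncoder()
X_sparse = encoder.fit_transform(X)

# TruncatedSVD accepts the sparse matrix directly and does not center it,
# so no dense copy of the one-hot matrix is needed as input.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
```

The reduced output is a small dense array, which can then be passed on to the clustering step.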

Label Encoding – Multi / Weighted Labels
But what can we do with multiple labels per feature?
• Scenario 1: from our current dataset:
name              score   genre
Super Kill        high    action
Music and more    medium  musical
UFO Invasion 2    low     sci-fi
Killing Me Softly high    action, comedy

Label Encoding – Multi / Weighted Labels
Also, in the preprocessing phase we’d sometimes like
to aggregate rows per some entity.
What can we do with multiple occurrences of the
same value? A weighted representation?
• Scenario 2: In a user watch-list dataset, we’d like to
aggregate the genres watched, per user
user  movie           genre
sam   Super Kill      action
sam   Super Kill 2    action
sam   Music and more  musical

user  genres
sam   ??

Label Encoding and N-Hot Encoders
With N-Hot Encoding, each unique label is assigned a new
numeric feature: a value from 1 to N represents both presence
and weight, and 0 means the label is absent.
user          genre_action  genre_musical  genre_sci-fi
sam           2             1              0
musical_dude  0             10             0
scully        5             0              20
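
Since the slides note that there is no built-in implementation, here is one possible sketch in plain Python. The function `n_hot_encode`, its parameters, and the comma separator for multi-label cells are assumptions made for this illustration, not an established API.

```python
from collections import Counter


def n_hot_encode(rows, entity, feature, sep=","):
    """Aggregate rows per entity and count each label's occurrences (illustrative)."""
    counts = {}
    for row in rows:
        # A cell may hold several labels, e.g. "action, comedy" (Scenario 1).
        labels = [part.strip() for part in row[feature].split(sep) if part.strip()]
        counts.setdefault(row[entity], Counter()).update(labels)
    # One numeric column per label seen anywhere in the data.
    all_labels = sorted({label for counter in counts.values() for label in counter})
    return {
        key: {f"{feature}_{label}": counter[label] for label in all_labels}
        for key, counter in counts.items()
    }


# Scenario 2: the user watch-list, aggregated per user.
watch_list = [
    {"user": "sam", "movie": "Super Kill", "genre": "action"},
    {"user": "sam", "movie": "Super Kill 2", "genre": "action"},
    {"user": "sam", "movie": "Music and more", "genre": "musical"},
]
n_hot = n_hot_encode(watch_list, "user", "genre")

# Scenario 1: several labels in a single cell.
multi = n_hot_encode(
    [{"name": "Killing Me Softly", "genre": "action, comedy"}], "name", "genre"
)
```

Note that a column such as genre_sci-fi only appears here if at least one row in the input carries that label; a fixed label vocabulary would be needed to always emit it.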

N-Hot Encoders - Review
Pros:
• Weighted categorical features, model ready
• Resulting data shape and sparsity are the same as
with One-Hot
Limitations:
• Same as with One-Hot
• No built-in implementations
• Slightly larger memory footprint: features are numeric, not
binary. Use the smallest integer type required
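
One way to pick that integer size, sketched in plain Python. The helper `smallest_uint_type` is hypothetical; the names it returns follow the usual unsigned integer type names (for example NumPy's uint8), and the `counts` list is taken from the N-Hot table in the slides.

```python
def smallest_uint_type(max_value):
    """Return the name of the smallest unsigned int type that holds max_value."""
    if max_value < 0:
        raise ValueError("counts must be non-negative")
    for bits in (8, 16, 32, 64):
        if max_value < 2 ** bits:
            return f"uint{bits}"
    raise OverflowError("value does not fit in 64 bits")


# Weights from the N-Hot table (sam, musical_dude, scully).
counts = [2, 1, 0, 0, 10, 0, 5, 0, 20]
dtype_name = smallest_uint_type(max(counts))
```

With counts this small a single byte per value is enough, so the overhead compared with a binary one-hot matrix stays modest.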

Label Encoding and N-Hot Encoders
Thanks
Pipeline diagram: preprocessing (N-Hot Encoding) -> dimensionality reduction (TruncatedSVD)
