Word embeddings are common for NLP tasks, but embeddings can also be used to learn relations among categorical data. Deep learning can also be useful for structured data, and entity embeddings are one reason why it makes sense. These are slides from a seminar held at Sbanken.
Exceeded 1 million users in 2017
Collaborative and competitive data science
Gradient boosted trees win most contests with tabular/structured data
Deep learning wins when data is unstructured: images/text/sound
Standard modeling activities
Statistics or machine learning, most activities are common
[Diagram: modelling activities – select model, select inputs, train model, test model on unseen data, evaluate performance, success?]
Supervised learning needs labeled data
SELECT GROUND TRUTH TO TARGET THE TRAINING AGAINST
Requires experts with a deep understanding of the field
FEATURE ENGINEERING – FIND RELEVANT INPUTS
The risk of overfitting is high when the model has many parameters
COMMON PITFALL - OVERFITTING
Nodes are randomly dropped so that the rest must readjust
DEEP LEARNING AVOIDS OVERFITTING USING DROPOUT
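As a minimal sketch of how this looks in practice (not from the slides; the layer sizes, input dimension and the 0.5 dropout rate are illustrative assumptions), dropout is added as a layer between the dense layers in Keras:

```python
# Minimal sketch (not from the slides): dropout in a small Keras network.
# Layer sizes, input dimension and the 0.5 rate are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),   # randomly zero half of the activations during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```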
DATA DISCOVERY
Data representations, decomposing a vector
[Figure: vector V in the x–y plane, decomposed into components u and v]
We can decompose the vector V into a vector of length u directed along the x axis, and a vector of length v directed along the y axis: V = (u, v).
Data representations, vector length and direction
[Figure: the same vector V written either with components (u, v) or with length |V| and angle α]
Both these data representations define the same vector. How you want to feed this information to the learning algorithm depends on what you're aiming to predict.
If this vector represents wind in the horizontal plane, and we want to predict the power output from a wind turbine, which we happen to know is a function of the wind speed, feeding in |V| = √(u² + v²) to the learning algorithm makes a lot of sense. This way the learning algorithm doesn't need to figure out Pythagoras on its own. However, with enough training data, a neural network could figure this out.
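A small illustrative sketch of this feature engineering step (the function and variable names are made up, not from the slides):

```python
# Illustrative sketch: derive the wind speed |V| = sqrt(u^2 + v^2) from the
# components u and v before feeding it to the model.
import numpy as np

def wind_features(u, v):
    speed = np.sqrt(u**2 + v**2)   # Pythagoras, so the network doesn't have to learn it
    direction = np.arctan2(v, u)   # angle in radians, if the direction is also relevant
    return speed, direction

speed, direction = wind_features(np.array([3.0, -2.0]), np.array([4.0, 1.5]))
```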
Data representations, cyclic variables
[Figure: the same vector representations as above, (u, v) versus (|V|, α)]
Cyclic variables need special consideration. The angle between 0° and 359° is only 1°, but this is not obvious to a learning algorithm.
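One common way to handle this (an assumption on my part, not spelled out in the slides) is to encode a cyclic variable such as an angle as its sine and cosine, so that 359° and 0° end up close together:

```python
# Sketch: encode an angle in degrees as (sin, cos) so 359 deg and 0 deg are close.
import numpy as np

def encode_angle(deg):
    rad = np.deg2rad(deg)
    return np.sin(rad), np.cos(rad)

print(encode_angle(0))    # (0.0, 1.0)
print(encode_angle(359))  # approximately (-0.0175, 0.9998), close to the encoding of 0 deg
```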
Preprocessing of inputs to neural networks
• Normalize data
• If input is categorical, represent it as one-hot encodings
• Red, blue, green -> red=[1,0,0], blue=[0,1,0], green=[0,0,1]
• If input is text, represent words as word embeddings
• If the embedding length was 4, we could have «bank» = [0.23, 1.2, 0.34, 0.78]
• The embeddings can be learned as part of the learning task, or:
• Embeddings can be taken from a language model trained on a larger text corpus
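A quick sketch of the first two preprocessing steps above; the column names and values are made up for illustration:

```python
# Sketch of normalization and one-hot encoding of a small, made-up table.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"amount": [10.0, 250.0, 40.0],
                   "colour": ["red", "blue", "green"]})

# Normalize the numerical input
df[["amount"]] = StandardScaler().fit_transform(df[["amount"]])

# One-hot encode the categorical input (one 0/1 column per colour)
df = pd.get_dummies(df, columns=["colour"])
print(df)
```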
Some weaknesses of one-hot for categorical data
• A large number of categories leads to long one-hot vectors
• Different values of categorical variables are treated as completely independent of each other
Forbedringsforslag (improvement suggestions)
• >20,000 forbedringsforslag since 2010
• Each forbedringsforslag has one text with a maximum of 98 words
• Each forbedringsforslag is classified into a product category by a person
• Can we take those data and teach a learning algorithm to predict the product category?
Forbedringsforslag – Neural network architecture
«Hei, jeg opplever det som veldig forvirrende at jeg ser bokført saldo. Jeg trenger kun å se disponibel saldo. Ønsker å bare se disponibel eller velge det som den saldoen som er synlig.»
(English: "Hi, I find it very confusing that I see the booked balance. I only need to see the available balance. I would like to see only the available balance, or choose that as the balance that is visible.")
Conclusion Forbedringsforslag
• We finally arrive at an accuracy of 75% for both the validation set and the test set
• Without regularization we start overfitting after 10 to 15 epochs
• By applying a dropout fraction of 0.2 on both input-to-state and state-to-state connections in the LSTM, we avoid overfitting
• A thin graphical user interface can present the products sorted by descending predicted probability
• The labelling job can then be quicker, but it can't be done entirely by machine learning
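A hedged sketch of a classifier along the lines described above (word embedding → LSTM with 0.2 dropout → softmax over product categories); the vocabulary size, embedding size, LSTM width and number of categories are assumptions, not the values actually used:

```python
# Sketch: text classifier with LSTM dropout 0.2 on input-to-state and state-to-state.
from tensorflow.keras import layers, models

vocab_size, n_categories = 10_000, 20   # assumed sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 64),                     # word embeddings learned with the task
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),  # input-to-state and state-to-state dropout
    layers.Dense(n_categories, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```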
Sales prediction Kaggle contest 2015
• 3000 drug stores
• 7 countries
• Predict daily sales
• Depends on:
• Promotions
• Competition
• School
• State holiday
• Seasonality
• Locality
• Etc
• In principle a neural network can approximate any continuous function and piecewise continuous function
• A neural network is not suitable for approximating arbitrary non-continuous functions, as it assumes a certain level of continuity
• Decision trees do not assume any continuity of the feature variables and can divide the states of a variable as finely as necessary
• «The rise of neural networks in natural language processing is based on the word embeddings which puts words with similar meaning closer to each other in a word space thus increasing the continuity of the words compared to using one-hot encoding of words»
Keras implementation of entity embeddings by Guo
https://github.com/entron/entity-embedding-rossmann/
• Store
• Day of week
• Promo
• Year
• Month
• Day of month
• State
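A simplified sketch in the spirit of the linked implementation (the category counts and embedding dimensions are placeholders, not Guo's actual values): each categorical feature gets its own small embedding, the embeddings are concatenated and fed to dense layers that predict sales.

```python
# Sketch: one embedding per categorical feature, concatenated into a regression network.
from tensorflow.keras import layers, models

def categorical_input(name, n_values, emb_dim):
    inp = layers.Input(shape=(1,), name=name)
    emb = layers.Flatten()(layers.Embedding(n_values, emb_dim)(inp))
    return inp, emb

# (feature, number of categories, embedding size) -- placeholder values
specs = [("store", 1115, 10), ("day_of_week", 7, 3), ("promo", 2, 1),
         ("year", 3, 2), ("month", 12, 4), ("day_of_month", 31, 8), ("state", 12, 4)]

inputs, embedded = zip(*[categorical_input(n, c, d) for n, c, d in specs])
x = layers.Concatenate()(list(embedded))
x = layers.Dense(128, activation="relu")(x)
sales = layers.Dense(1)(x)                       # predicted daily sales
model = models.Model(inputs=list(inputs), outputs=sales)
model.compile(optimizer="adam", loss="mae")
```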
Conclusions
• Entity embeddings reduce memory usage and speed up neural networks compared to one-hot encoding.
• Intrinsic properties of the categorical features can be revealed by mapping similar values close to each other in embedding space.
• The learned embeddings boost the performance of other machine learning methods when used as input features instead.
• Guo and Berkhahn came third in the Rossmann Store Sales prediction
• The students at MILA, Montreal who won the Taxi Destination prediction on Kaggle also used entity embeddings
http://blog.kaggle.com/2015/07/27/taxi-trajectory-winners-interview-1st-place-team-%F0%9F%9A%95/