Word embeddings are common for NLP tasks, but embeddings can also be used to learn relations among categorical data. Deep learning can also be useful for structured data, and entity embeddings are one reason why it makes sense. These are slides from a seminar held at Sbanken.
Exceeded 1 million users in 2017
Collaborative and competitive data science
Gradient boosted trees win most contests with tabular/structured data
Deep learning wins when data is unstructured: images/text/sound
Standard modeling activities
Statistics or machine learning, most activities are common
[Diagram: modelling activities – select model, select inputs, train model, test model on unseen data, evaluate performance, success?]
Supervised learning needs labeled data
SELECT GROUND TRUTH TO TARGET THE TRAINING AGAINST
Requires experts with a deep understanding of the field
FEATURE ENGINEERING – FIND RELEVANT INPUTS
The risk of overfitting is high when the model has many parameters
COMMON PITFALL - OVERFITTING
Nodes are randomly dropped so that the rest must readjust
DEEP LEARNING AVOIDS OVERFITTING USING DROPOUT
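As a minimal sketch of how this looks in practice (not from the slides; the layer sizes, input dimension and the 0.5 dropout rate are illustrative assumptions), dropout is added as a layer between the dense layers in Keras:

```python
# Minimal sketch (not from the slides): dropout in a small Keras network.
# Layer sizes, input dimension and the 0.5 rate are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),   # randomly zero half of the activations during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```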
DATA DISCOVERY
Data representations, decomposing a vector
[Figure: vector V in the x–y plane, decomposed into components u and v]
We can decompose the vector V into a vector of length u directed along the x axis, and a vector of length v directed along the y axis: V = (u, v).
Data representations, vector length and direction
[Figure: the same vector V written either with components (u, v) or with length |V| and angle α]
Both these data representations define the same vector. How you want to feed this information to the learning algorithm depends on what you're aiming to predict.
If this vector represents wind in the horizontal plane, and we want to predict the power output from a wind turbine, which we happen to know is a function of the wind speed, feeding in |V| = √(u² + v²) to the learning algorithm makes a lot of sense. This way the learning algorithm doesn't need to figure out Pythagoras on its own. However, with enough training data, a neural network could figure this out.
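A small illustrative sketch of this feature engineering step (the function and variable names are made up, not from the slides):

```python
# Illustrative sketch: derive the wind speed |V| = sqrt(u^2 + v^2) from the
# components u and v before feeding it to the model.
import numpy as np

def wind_features(u, v):
    speed = np.sqrt(u**2 + v**2)   # Pythagoras, so the network doesn't have to learn it
    direction = np.arctan2(v, u)   # angle in radians, if the direction is also relevant
    return speed, direction

speed, direction = wind_features(np.array([3.0, -2.0]), np.array([4.0, 1.5]))
```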
Data representations, cyclic variables
[Figure: the same vector representations as above, (u, v) versus (|V|, α)]
Cyclic variables need special consideration. The angle between 0° and 359° is only 1°, but this is not obvious to a learning algorithm.
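One common way to handle this (an assumption on my part, not spelled out in the slides) is to encode a cyclic variable such as an angle as its sine and cosine, so that 359° and 0° end up close together:

```python
# Sketch: encode an angle in degrees as (sin, cos) so 359 deg and 0 deg are close.
import numpy as np

def encode_angle(deg):
    rad = np.deg2rad(deg)
    return np.sin(rad), np.cos(rad)

print(encode_angle(0))    # (0.0, 1.0)
print(encode_angle(359))  # approximately (-0.0175, 0.9998), close to the encoding of 0 deg
```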
Preprocessing of inputs to neural networks
• Normalize data
• If input is categorical, represent it as one-hot encodings
• Red, blue, green -> red=[1,0,0], blue=[0,1,0], green=[0,0,1]
• If input is text, represent words as word embeddings
• If the embedding length was 4, we could have «bank» = [0.23, 1.2, 0.34, 0.78]
• The embeddings can be learned as part of the learning task, or:
• Embeddings can be taken from a language model trained on a larger text corpus
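A quick sketch of the first two preprocessing steps above; the column names and values are made up for illustration:

```python
# Sketch of normalization and one-hot encoding of a small, made-up table.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"amount": [10.0, 250.0, 40.0],
                   "colour": ["red", "blue", "green"]})

# Normalize the numerical input
df[["amount"]] = StandardScaler().fit_transform(df[["amount"]])

# One-hot encode the categorical input (one 0/1 column per colour)
df = pd.get_dummies(df, columns=["colour"])
print(df)
```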
Some weaknesses of one-hot for categorical data
• A large number of categories leads to long one-hot vectors
• Different values of categorical variables are treated as completely independent of each other
Forbedringsforslag (improvement suggestions)
• >20,000 forbedringsforslag since 2010
• Each forbedringsforslag has one text with a maximum of 98 words
• Each forbedringsforslag is classified into a product category by a person
• Can we take those data and teach a learning algorithm to predict the product category?
Forbedringsforslag – Neural network architecture
«Hei, jeg opplever det som veldig forvirrende at jeg ser bokført saldo. Jeg trenger kun å se disponibel saldo. Ønsker å bare se disponibel eller velge det som den saldoen som er synlig.»
(English: "Hi, I find it very confusing that I see the booked balance. I only need to see the available balance. I would like to see only the available balance, or choose that as the balance that is visible.")
Conclusion Forbedringsforslag
• We finally arrive at an accuracy of 75% for both the validation set and the test set
• Without regularization we start overfitting after 10 to 15 epochs
• By applying a dropout fraction of 0.2 on both input-to-state and state-to-state connections in the LSTM, we avoid overfitting
• A thin graphical user interface can present the products sorted by descending predicted probability
• The labelling job can then be quicker, but it can't be done entirely by machine learning
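A hedged sketch of a classifier along the lines described above (word embedding → LSTM with 0.2 dropout → softmax over product categories); the vocabulary size, embedding size, LSTM width and number of categories are assumptions, not the values actually used:

```python
# Sketch: text classifier with LSTM dropout 0.2 on input-to-state and state-to-state.
from tensorflow.keras import layers, models

vocab_size, n_categories = 10_000, 20   # assumed sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 64),                     # word embeddings learned with the task
    layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),  # input-to-state and state-to-state dropout
    layers.Dense(n_categories, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```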
Sales prediction Kaggle contest 2015
• 3000 drug stores
• 7 countries
• Predict daily sales
• Depends on:
• Promotions
• Competition
• School
• State holiday
• Seasonality
• Locality
• Etc
• In principle a neural network can approximate any continuous function and piecewise continuous function
• A neural network is not suitable for approximating arbitrary non-continuous functions, as it assumes a certain level of continuity
• Decision trees do not assume any continuity of the feature variables and can divide the states of a variable as finely as necessary
• «The rise of neural networks in natural language processing is based on the word embeddings which puts words with similar meaning closer to each other in a word space thus increasing the continuity of the words compared to using one-hot encoding of words»
Keras implementation of entity embeddings by Guo
https://github.com/entron/entity-embedding-rossmann/
• Store
• Day of week
• Promo
• Year
• Month
• Day of month
• State
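A simplified sketch in the spirit of the linked implementation (the category counts and embedding dimensions are placeholders, not Guo's actual values): each categorical feature gets its own small embedding, the embeddings are concatenated and fed to dense layers that predict sales.

```python
# Sketch: one embedding per categorical feature, concatenated into a regression network.
from tensorflow.keras import layers, models

def categorical_input(name, n_values, emb_dim):
    inp = layers.Input(shape=(1,), name=name)
    emb = layers.Flatten()(layers.Embedding(n_values, emb_dim)(inp))
    return inp, emb

# (feature, number of categories, embedding size) -- placeholder values
specs = [("store", 1115, 10), ("day_of_week", 7, 3), ("promo", 2, 1),
         ("year", 3, 2), ("month", 12, 4), ("day_of_month", 31, 8), ("state", 12, 4)]

inputs, embedded = zip(*[categorical_input(n, c, d) for n, c, d in specs])
x = layers.Concatenate()(list(embedded))
x = layers.Dense(128, activation="relu")(x)
sales = layers.Dense(1)(x)                       # predicted daily sales
model = models.Model(inputs=list(inputs), outputs=sales)
model.compile(optimizer="adam", loss="mae")
```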
Conclusions
• Entity embeddings reduce memory usage and speed up neural networks compared to one-hot encoding.
• Intrinsic properties of the categorical features can be revealed by mapping similar values close to each other in embedding space.
• The learned embeddings boost the performance of other machine learning methods when used as input features instead.
• Guo and Berkhahn came third in the Rossmann Store Sales prediction
• The students at MILA, Montreal who won the Taxi Destination prediction on Kaggle also used entity embeddings
http://blog.kaggle.com/2015/07/27/taxi-trajectory-winners-interview-1st-place-team-%F0%9F%9A%95/