Athens BD Jun2018 | p1
Embeddings of Categorical Variables
Athens BD Jun2018 | p2
Definition
We usually encode categories as positive integers, so embeddings are mappings
Z → R^k
where k is called the 'embedding dimension'.
An embedding (or VS representation, or VS method) of a categorical variable x is any
mapping of its categories to R^k.
To learn the embedding of a categorical variable in an ML task means to find a map
categories → R^k
where
k << number of categories.
Consider VS embeddings as an evolution of the one-hot (OH) encoding we traditionally use to represent categories.
But why have we been using OH encoding anyway?
Why not just use successive integers to represent categories?
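To make the definition concrete, here is a tiny sketch (plain numpy; the category names, the dimension k and the random initialization are assumptions for illustration): an embedding is nothing more than a dense table with one k-dimensional row per category, and learning it means fitting those rows.

```python
import numpy as np

categories = ["blue", "orange", "green"]   # integer-coded as 0, 1, 2
k = 2                                      # embedding dimension; k << len(categories) in real use

# The embedding is just a (num_categories x k) matrix; learning it means fitting these rows.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(categories), k))

def embed(category_index: int) -> np.ndarray:
    """Map a category (given as its integer code) to its k-dimensional vector."""
    return embedding_table[category_index]

print(embed(1))   # the vector currently assigned to 'orange'
```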
Athens BD Jun2018 | p3
Motivation
With the exception of classification and regression trees (CART), learning algorithms
operate on subsets of R^n, where n is the input dimension.
A naive encoding of categories as (say, positive and consecutive) integers suffers
from several issues:
1. The model performance depends on the choice of the encoding
Suppose we're given {blue, orange, green} → {1, 2, 3}
so that x1 = 1, x2 = 2, x3 = 3
and y1 = 2, y2 = 6, y3 = -2.
A linear model cannot fit these points.
However, if we change the encoding to
{blue, orange, green} → {2, 3, 1}, the fit will be perfect (see the sketch below).
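A minimal sketch of this example, fitting a line by ordinary least squares with numpy (the numbers are the ones above):

```python
import numpy as np

y = np.array([2.0, 6.0, -2.0])                  # targets for blue, orange, green

def linear_fit_residual(codes):
    """Fit y ≈ a*code + b by least squares and return the sum of squared residuals."""
    X = np.column_stack([codes, np.ones_like(codes)])
    _, residual, *_ = np.linalg.lstsq(X, y, rcond=None)
    return residual.sum() if residual.size else 0.0

print(linear_fit_residual(np.array([1.0, 2.0, 3.0])))   # large residual: no line fits this encoding
print(linear_fit_residual(np.array([2.0, 3.0, 1.0])))   # ~0: this encoding fits perfectly (y = 4*code - 6)
```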
Athens BD Jun2018 | p4
Motivation
2. The use of integers to represent the values of categorical inputs distorts the
learning process by treating the gradient over different categories unequally:
Assume the model depends on a categorical x only through a multiplicative term w_x·x, i.e.
f(x,...) = f(w_x·x,...), and we're given a training example where x = j.
For any objective J, the chain rule gives, at x = j,
∂J/∂w_x |_{x=j} = j · ∂J/∂u |_{u = w_x·j},  where u = w_x·x.
The j-th category contributes to model training j times as much as the 1st
category! (A numeric sketch of this effect follows this list.)
3. What if one category contributes positively to the output and another negatively?
Using a single parameter to model the categorical will most probably drive that parameter to zero by
the end of the training process!
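A small numeric sketch of the gradient imbalance in item 2, using a squared-error objective with a single multiplicative weight (the numbers are made up; the residual is held fixed so that only the effect of the category code j shows):

```python
w = 0.5
r = 1.0                                    # keep the residual (w*x - y) fixed at r across examples

def grad_wrt_weight(x):
    """dJ/dw for J = (w*x - y)^2 with y chosen so that w*x - y = r: the chain rule gives 2*r*x."""
    y = w * x - r
    return 2.0 * (w * x - y) * x

for j in [1, 2, 3]:                        # category codes
    print(j, grad_wrt_weight(float(j)))    # 2.0, 4.0, 6.0: the weight update scales linearly with j
```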
Athens BD Jun2018 | p5
Why don't CARTs require encoding?
CARTs partition the input observable space using a sequence of coordinate splits that
greedily minimize an objective.
By "greedily" we mean that the objective is minimized at each split. A greedy optimum is not the optimum
over all possible partitions of the input space, though.
More formally, we are given a training set T = {X = [x1,...,xn], Y = (y1,...,yn)} with x_j ∈ R^k, j = 1,...,n.
A coordinate split at level 0 divides T into 2 subsets T1 = {X1, Y1} and T2 = {X2, Y2} such that the sum of the values
of the objective applied to each subset is minimized.
Level-0 loop:
  for each coordinate:
    for each coordinate value:
      evaluate the objective
      check minimum
  return the coordinate and coordinate value of the minimum
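A self-contained sketch of this level-0 loop for a regression objective (the per-subset sum of squared errors), with invented toy data. For numeric coordinates the candidate splits are thresholds as below; for a categorical coordinate the same loop would iterate over category values with an equality/membership test instead, which is exactly why no numeric encoding or ordering is needed.

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (the MSE objective, unnormalized)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_level0_split(X, y):
    """Return (coordinate, value) of the greedy level-0 split minimizing the summed objective."""
    best = (np.inf, None, None)
    for coord in range(X.shape[1]):                 # each coordinate
        for value in np.unique(X[:, coord]):        # each candidate split value
            left = X[:, coord] <= value
            loss = sse(y[left]) + sse(y[~left])     # objective on T1 plus objective on T2
            if loss < best[0]:
                best = (loss, coord, value)
    return best[1], best[2]

# Toy data: two input coordinates, a target that depends mostly on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0.3, 5.0, -1.0) + 0.1 * rng.normal(size=200)
print(best_level0_split(X, y))    # expected to pick coordinate 0, near the 0.3 threshold
```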
Athens BD Jun2018 | p6
Why don't CARTs require encoding?
In regression tasks the objective is the MSE of the y's in Y_j, j = 1, 2.
In a binary classification task, T1 is associated with class C1 and T2 with class C2,
and the objective is the number of correct guesses of C_j in T_j.
The crucial thing is that, for the splitting process to work:
1. the types of X and Y are not required to be numerical,
2. no ordering of the values of X and Y is implicitly assumed.
Athens BD Jun2018 | p7
Learning Embeddings in Tensorflow
We're using an example from the retail industry.
The data is sales counts of prepared meat and burger products for a group of stores of a large food retailer in
the US. Line items are sales counts per store, calendar day and stock-keeping unit (SKU).
The objective is to estimate sales given a SKU, location and day.
We'll employ an FFNN with just a single hidden layer, and an objective that is not the MSE,
because the MSE is not suited to count data.
A random variable Y ∈ Z+ is said to have the Poisson distribution with parameter μ if it takes non-negative integer
values y = 0, 1, 2, ... with probability
P(Y = y) = e^(-μ)·μ^y / y!
Athens BD Jun2018 | p8
Learning Embeddings in Tensorflow
The reason for using the above as a model for the distribution of SKU sales is its relation to the binomial
distribution (Bernoulli trials):
If X_j, j = 1, 2, ... are independent Bernoulli variables, i.e.
X_j ~ 𝓑(π_j), and
Σ_j π_j → μ < ∞ (with each π_j small), then
Σ_j X_j ~ Poisson(μ), approximately (the law of rare events).
Fix a product, say S, that sold n items yesterday at Whole Foods Midtown ATL.
Each X_j roughly represents a customer that buys S with probability π_j, and n = Σ_j X_j.
From this point the process of deriving a loss is pretty much standard:
we let the model output w^T x (w the weight vector, x the input) parameterize μ and minimize the negative log likelihood.
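A minimal sketch of that loss in TensorFlow, assuming the model predicts log μ (a common choice that keeps μ positive) and dropping the constant log y! term:

```python
import tensorflow as tf

def poisson_nll(y_true, log_mu):
    """Mean negative log likelihood of counts y_true under Poisson(mu = exp(log_mu)),
    up to the log(y!) constant, which does not depend on the model parameters."""
    # -log P(Y=y) = mu - y*log(mu) + log(y!)
    return tf.reduce_mean(tf.exp(log_mu) - y_true * log_mu)
```

TensorFlow also ships an equivalent helper, tf.nn.log_poisson_loss, if you prefer not to write it by hand.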
Athens BD Jun2018 | p9
Input Encodings
SKU IDs, calendar days and store locations are OH encoded.
This creates an input space of several hundred or several thousand binary variables, depending on the size of the
assortment and the number of stores.
This becomes a memory issue as soon as the number of training examples exceeds a few thousand
(certain precautions can be taken, though!)
OH Encoding (ohh…)
Vector space encoding
Instead of store IDs we use geospatial coordinates (lat | long). Calendar days
are mapped to R^2 using a VS representation that brings close together the days
around a year's end:
day number j → (cos(2πj/365), sin(2πj/365))
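A sketch of that mapping (day 0 taken to be January 1, an assumption for the example):

```python
import numpy as np

def encode_day_of_year(j):
    """Map day number j (0..364) to a point on the unit circle, so Dec 31 and Jan 1 end up adjacent."""
    angle = 2.0 * np.pi * j / 365.0
    return np.array([np.cos(angle), np.sin(angle)])

print(np.linalg.norm(encode_day_of_year(364) - encode_day_of_year(0)))   # small: year-end days are close
print(np.linalg.norm(encode_day_of_year(182) - encode_day_of_year(0)))   # ~2: mid-year is far from Jan 1
```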
Athens BD Jun2018 | p10
How does it work?
[Network diagram: the category index j selects its row embedding_j from a K-dim embedding table;
this K-dim vector, together with the other inputs, feeds a single hidden layer
(weights W(1), bias b(1), units a(1)…a(n) with activations h(1)…h(n)), whose output is yhat.]
Athens BD Jun2018 | p11
The Tensorflow code (go to Jupyter)
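The notebook itself isn't reproduced here; below is a minimal Keras sketch of the same idea (an SKU embedding, concatenated with the other inputs, one hidden layer, and the Poisson negative log likelihood from earlier). The sizes, names and the choice of tf.keras are assumptions for illustration, not the workshop's actual code.

```python
import tensorflow as tf

n_skus, emb_dim, n_other = 5000, 16, 4            # assumed sizes: assortment, embedding dim, other features

sku_id = tf.keras.Input(shape=(1,), dtype="int32", name="sku_id")
other  = tf.keras.Input(shape=(n_other,), name="other_inputs")     # e.g. lat/long and cos/sin of day

# The embedding layer is the lookup table whose rows we want to learn.
emb_layer = tf.keras.layers.Embedding(n_skus, emb_dim, name="sku_embedding")
emb = tf.keras.layers.Flatten()(emb_layer(sku_id))

hidden = tf.keras.layers.Dense(32, activation="relu")(
    tf.keras.layers.Concatenate()([emb, other]))
log_mu = tf.keras.layers.Dense(1)(hidden)         # predict log(mu), so mu = exp(log_mu) > 0

model = tf.keras.Model([sku_id, other], log_mu)
model.compile(optimizer="adam",
              loss=lambda y, log_mu: tf.reduce_mean(tf.exp(log_mu) - y * log_mu))   # Poisson NLL

# model.fit([sku_ids, other_features], sales_counts) would train it; the learned SKU vectors
# are then emb_layer.get_weights()[0], an array of shape (n_skus, emb_dim).
```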
Athens BD Jun2018 | p12
The gain of SKU embedding
Suppose the objective is to estimate a kind of 'market basket' when cashier transaction data is not
available, i.e. groups of SKUs with approximately the same sales across days and stores.
This is a core problem in assortment planning:
estimate the number | percentage of product items I'll need to stock for the next week | month | season.
Probably more involved is the use of assortments in demand forecasting: estimate a product's sales
for the next period from its sales history.
How is the above related to the learnt VS embeddings of SKUs?
The core insight is that neighboring values in the embedding space have similar sales across stores and days.
Well, not exactly: currently the best theoretical result we have is this:
m·‖e1 − e2‖ ≤ E_x‖yhat(e1, x) − yhat(e2, x)‖ ≤ M·‖e1 − e2‖, with m ≤ M.
Practice shows, though, that the insight holds.
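A sketch of how this gets used in practice, assuming the trained SKU vectors have been exported as in the sketch above (the file name, cluster count and neighbor count are assumptions; KMeans and NearestNeighbors are illustrative choices, not something prescribed by the deck):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# sku_vectors: (n_skus, emb_dim) array taken from the trained embedding layer.
sku_vectors = np.load("sku_vectors.npy")        # hypothetical file with the learned embeddings

# 'Market baskets': groups of SKUs with similar embeddings, hence (empirically) similar sales patterns.
groups = KMeans(n_clusters=50, random_state=0).fit_predict(sku_vectors)

# Or, for a single SKU, its nearest neighbors in the embedding space.
nn = NearestNeighbors(n_neighbors=6).fit(sku_vectors)
_, idx = nn.kneighbors(sku_vectors[:1])         # neighbors of SKU 0 (the first hit is SKU 0 itself)
print(groups[:10], idx)
```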
Athens BD Jun2018 | p13
Embedding projectors
An embedding projector tries to create a 2D or 3D scatterplot from a multidimensional set of
points.
The purpose is to retain as much of the variance in the original set as possible.
PCA is the most widely used method; however, it fails in high-dimensional spaces or with complex
geometries.
The method proposed there is t-SNE.
It learns the positions of the 2|3D points by minimizing the KL divergence between probability distributions it defines
for the original space and its t-SNE image (what a hack!).
The reference examples of MNIST and Word2Vec are on the tensorboard-projector page.
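For our SKU vectors the same projection can be sketched with scikit-learn's t-SNE (the perplexity is just a common default, not a value from the deck):

```python
import numpy as np
from sklearn.manifold import TSNE

sku_vectors = np.load("sku_vectors.npy")                      # hypothetical: the learned embeddings
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(sku_vectors)
print(xy.shape)                                               # (n_skus, 2): ready for a scatterplot
```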
Athens BD Jun2018 | p14
An example from telecoms
Telecom operators exploit the call graph of their subscribers using elementary or more advanced
methods.
Given a log of calls between subscribers (voice and texts) over a period of N days, they define the
strength of the relation between two subscribers by the number and duration of the calls they make to one
another.
Variations take into account the time of day, the day of the week, the uniformity of call frequency, etc.
A subscriber X's network | community is the set of subscribers with the strongest relation with X.
An approach in line with our discussion is to use the call graph to map the subscribers into an embedding
space. A subscriber's community is then their nearest neighbors in the embedding space (obviously).
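A sketch of the elementary relation-strength computation described above, in pandas (the column names, the toy log and the count-plus-minutes weighting are all assumptions):

```python
import pandas as pd

# Hypothetical call log: one row per call, with caller, callee and duration in seconds.
calls = pd.DataFrame({
    "caller":   ["A", "A", "B", "C", "A"],
    "callee":   ["B", "B", "A", "A", "C"],
    "duration": [60, 120, 30, 300, 45],
})

# Treat the relation as undirected: sort each pair so that (A, B) and (B, A) aggregate together.
pair = calls[["caller", "callee"]].apply(lambda r: tuple(sorted(r)), axis=1)
strength = (calls.assign(pair=pair)
                 .groupby("pair")
                 .agg(n_calls=("duration", "size"), total_sec=("duration", "sum")))

# One simple strength definition: call count plus total duration in minutes (an assumed weighting).
strength["score"] = strength["n_calls"] + strength["total_sec"] / 60.0
print(strength.sort_values("score", ascending=False))
```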
Athens BD Jun2018 | p15
An example from telecoms
There are several benefits to this approach:
▪ Embeddings have memory. As soon as a new call record becomes available, a few iterations of the
neural network will accommodate the new information in the existing embedding vectors. This permits
real-time community updates.
▪ Embeddings facilitate the visualization of various customer-level measures on their projected
manifolds: we can view, for example, the distribution of rate plans or rate-plan categories, or the
distribution of customer tenure, over the embedding vectors.
▪ The most useful property, though, is the way embeddings can be used to predict the community of a
new customer for whom there's no call log yet (but a few things are known initially, e.g. the rate plan,
service subscriptions and demographics).
Athens BD Jun2018 | p16
How far can we go?
Word2Vec was the first REALLY impressive use of a certain novel kind of word embedding.
It constructs a language model from a text corpus, i.e. given part of a sentence it will predict the rest of it.
A direct consequence is machine translation: throw in a sentence in Greek and it will translate it into Swahili.
Try this out in Google Translate.
More?
Sunspring, from 2 years ago, was the first movie script written completely by a machine.
Athens BD Jun2018 | p17
Thanx guys
For more pizzas you can track me here:
http://www.mltrain.cc
http://www.linkedin.con/in.cmalliopoulos