Athens BD Jun2018 | p1
Embeddings of Categorical Variables
Athens BD Jun2018 | p2
Definition
We usually encode categories as positive integers, so embeddings are mappings
Z → R^k
where k is called the 'embedding dimension'.
An embedding (or VS representation, or VS method) of a categorical variable x is any
mapping of its categories to R^k.
To learn the embedding of a categorical variable in an ML task means to find a map
categories → R^k
where
k << number of categories.
Consider VS embeddings as an evolution of the one-hot (OH) encoding we traditionally use to represent categories.
But why have we been using OH encoding anyway?
Why not just use successive integers to represent categories?
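To make the definition concrete, here is a tiny sketch (plain numpy; the category names, the dimension k and the random initialization are assumptions for illustration): an embedding is nothing more than a dense table with one k-dimensional row per category, and learning it means fitting those rows.

```python
import numpy as np

categories = ["blue", "orange", "green"]   # integer-coded as 0, 1, 2
k = 2                                      # embedding dimension; k << len(categories) in real use

# The embedding is just a (num_categories x k) matrix; learning it means fitting these rows.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(categories), k))

def embed(category_index: int) -> np.ndarray:
    """Map a category (given as its integer code) to its k-dimensional vector."""
    return embedding_table[category_index]

print(embed(1))   # the vector currently assigned to 'orange'
```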
Athens BD Jun2018 | p3
Motivation
With the exception of classification and regression trees (CART), learning algorithms
operate on subsets of R^n, where n is the input dimension.
A naive encoding of categories as (say, positive and consecutive) integers suffers
from several issues:
1. The model performance depends on the choice of the encoding
Suppose we're given {blue, orange, green} → {1, 2, 3}
so that x1 = 1, x2 = 2, x3 = 3
and y1 = 2, y2 = 6, y3 = -2.
A linear model cannot fit these points.
However, if we change the encoding to
{blue, orange, green} → {2, 3, 1}, the fit will be perfect (see the sketch below).
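A minimal sketch of this example, fitting a line by ordinary least squares with numpy (the numbers are the ones above):

```python
import numpy as np

y = np.array([2.0, 6.0, -2.0])                  # targets for blue, orange, green

def linear_fit_residual(codes):
    """Fit y ≈ a*code + b by least squares and return the sum of squared residuals."""
    X = np.column_stack([codes, np.ones_like(codes)])
    _, residual, *_ = np.linalg.lstsq(X, y, rcond=None)
    return residual.sum() if residual.size else 0.0

print(linear_fit_residual(np.array([1.0, 2.0, 3.0])))   # large residual: no line fits this encoding
print(linear_fit_residual(np.array([2.0, 3.0, 1.0])))   # ~0: this encoding fits perfectly (y = 4*code - 6)
```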
Athens BD Jun2018 | p4
Motivation
2. The use of integers to represent the values of categorical inputs distorts the
learning process by treating the gradient over different categories unequally:
Assume the model depends on a categorical x only through a multiplicative term w_x·x, i.e.
f(x,...) = f(w_x·x,...), and we're given a training example where x = j.
For any objective J, the chain rule gives, at x = j,
∂J/∂w_x |_{x=j} = j · ∂J/∂u |_{u = w_x·j},  where u = w_x·x.
The j-th category contributes to model training j times as much as the 1st
category! (A numeric sketch of this effect follows this list.)
3. What if one category contributes positively to the output and another negatively?
Using a single parameter to model the categorical will most probably drive that parameter to zero by
the end of the training process!
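A small numeric sketch of the gradient imbalance in item 2, using a squared-error objective with a single multiplicative weight (the numbers are made up; the residual is held fixed so that only the effect of the category code j shows):

```python
w = 0.5
r = 1.0                                    # keep the residual (w*x - y) fixed at r across examples

def grad_wrt_weight(x):
    """dJ/dw for J = (w*x - y)^2 with y chosen so that w*x - y = r: the chain rule gives 2*r*x."""
    y = w * x - r
    return 2.0 * (w * x - y) * x

for j in [1, 2, 3]:                        # category codes
    print(j, grad_wrt_weight(float(j)))    # 2.0, 4.0, 6.0: the weight update scales linearly with j
```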
Athens BD Jun2018 | p5
Why don't CARTs require encoding?
CARTs partition the input observable space using a sequence of coordinate splits that
greedily minimize an objective.
By "greedily" we mean that the objective is minimized at each split. A greedy optimum is not the optimum
over all possible partitions of the input space, though.
More formally, we are given a training set T = {X = [x1,...,xn], Y = (y1,...,yn)} with x_j ∈ R^k, j = 1,...,n.
A coordinate split at level 0 divides T into 2 subsets T1 = {X1, Y1} and T2 = {X2, Y2} such that the sum of the values
of the objective applied to each subset is minimized.
Level-0 loop:
  for each coordinate:
    for each coordinate value:
      evaluate the objective
      check minimum
  return the coordinate and coordinate value of the minimum
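A self-contained sketch of this level-0 loop for a regression objective (the per-subset sum of squared errors), with invented toy data. For numeric coordinates the candidate splits are thresholds as below; for a categorical coordinate the same loop would iterate over category values with an equality/membership test instead, which is exactly why no numeric encoding or ordering is needed.

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (the MSE objective, unnormalized)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_level0_split(X, y):
    """Return (coordinate, value) of the greedy level-0 split minimizing the summed objective."""
    best = (np.inf, None, None)
    for coord in range(X.shape[1]):                 # each coordinate
        for value in np.unique(X[:, coord]):        # each candidate split value
            left = X[:, coord] <= value
            loss = sse(y[left]) + sse(y[~left])     # objective on T1 plus objective on T2
            if loss < best[0]:
                best = (loss, coord, value)
    return best[1], best[2]

# Toy data: two input coordinates, a target that depends mostly on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0.3, 5.0, -1.0) + 0.1 * rng.normal(size=200)
print(best_level0_split(X, y))    # expected to pick coordinate 0, near the 0.3 threshold
```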
Athens BD Jun2018 | p6
Why don't CARTs require encoding?
In regression tasks the objective is the MSE of the y's in Y_j, j = 1, 2.
In a binary classification task, T1 is associated with class C1 and T2 with class C2,
and the objective is the number of correct guesses of C_j in T_j.
The crucial thing is that, for the splitting process to work:
1. the types of X and Y are not required to be numerical,
2. no ordering of the values of X and Y is implicitly assumed.
Athens BD Jun2018 | p7
Learning Embeddings in Tensorflow
We're using an example from the retail industry.
The data is sales counts of prepared meat and burger products for a group of stores of a large food retailer in
the US. Line items are sales counts per store, calendar day and stock-keeping unit (SKU).
The objective is to estimate sales given a SKU, location and day.
We'll employ an FFNN with just a single hidden layer, and an objective that is not the MSE,
because the MSE is not suited to count data.
A random variable Y ∈ Z+ is said to have the Poisson distribution with parameter μ if it takes non-negative integer
values y = 0, 1, 2, ... with probability
P(Y = y) = e^(-μ)·μ^y / y!
Athens BD Jun2018 | p8
Learning Embeddings in Tensorflow
The reason for using the above as a model for the distribution of SKU sales is its relation to the binomial
distribution (Bernoulli trials):
If X_j, j = 1, 2, ... are independent Bernoulli variables, i.e.
X_j ~ 𝓑(π_j), and
Σ_j π_j → μ < ∞ (with each π_j small), then
Σ_j X_j ~ Poisson(μ), approximately (the law of rare events).
Fix a product, say S, that sold n items yesterday at Whole Foods Midtown ATL.
Each X_j roughly represents a customer that buys S with probability π_j, and n = Σ_j X_j.
From this point the process of deriving a loss is pretty much standard:
we let the model output w^T x (w the weight vector, x the input) parameterize μ and minimize the negative log likelihood.
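A minimal sketch of that loss in TensorFlow, assuming the model predicts log μ (a common choice that keeps μ positive) and dropping the constant log y! term:

```python
import tensorflow as tf

def poisson_nll(y_true, log_mu):
    """Mean negative log likelihood of counts y_true under Poisson(mu = exp(log_mu)),
    up to the log(y!) constant, which does not depend on the model parameters."""
    # -log P(Y=y) = mu - y*log(mu) + log(y!)
    return tf.reduce_mean(tf.exp(log_mu) - y_true * log_mu)
```

TensorFlow also ships an equivalent helper, tf.nn.log_poisson_loss, if you prefer not to write it by hand.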
Athens BD Jun2018 | p9
Input Encodings
SKU IDs, calendar days and store locations are OH encoded.
This creates an input space of several hundred or several thousand binary variables, depending on the size of the
assortment and the number of stores.
This becomes a memory issue as soon as the number of training examples exceeds a few thousand
(certain precautions can be taken, though!)
OH Encoding (ohh…)
Vector space encoding
Instead of store IDs we use geospatial coordinates (lat | long). Calendar days
are mapped to R^2 using a VS representation that brings close together the days
around a year's end:
day number j → (cos(2πj/365), sin(2πj/365))
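A sketch of that mapping (day 0 taken to be January 1, an assumption for the example):

```python
import numpy as np

def encode_day_of_year(j):
    """Map day number j (0..364) to a point on the unit circle, so Dec 31 and Jan 1 end up adjacent."""
    angle = 2.0 * np.pi * j / 365.0
    return np.array([np.cos(angle), np.sin(angle)])

print(np.linalg.norm(encode_day_of_year(364) - encode_day_of_year(0)))   # small: year-end days are close
print(np.linalg.norm(encode_day_of_year(182) - encode_day_of_year(0)))   # ~2: mid-year is far from Jan 1
```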
Athens BD Jun2018 | p10
How does it work?
[Network diagram: the category index j selects its row embedding_j from a K-dim embedding table;
this K-dim vector, together with the other inputs, feeds a single hidden layer
(weights W(1), bias b(1), units a(1)…a(n) with activations h(1)…h(n)), whose output is yhat.]
Athens BD Jun2018 | p11
The Tensorflow code (go to Jupyter)
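The notebook itself isn't reproduced here; below is a minimal Keras sketch of the same idea (an SKU embedding, concatenated with the other inputs, one hidden layer, and the Poisson negative log likelihood from earlier). The sizes, names and the choice of tf.keras are assumptions for illustration, not the workshop's actual code.

```python
import tensorflow as tf

n_skus, emb_dim, n_other = 5000, 16, 4            # assumed sizes: assortment, embedding dim, other features

sku_id = tf.keras.Input(shape=(1,), dtype="int32", name="sku_id")
other  = tf.keras.Input(shape=(n_other,), name="other_inputs")     # e.g. lat/long and cos/sin of day

# The embedding layer is the lookup table whose rows we want to learn.
emb_layer = tf.keras.layers.Embedding(n_skus, emb_dim, name="sku_embedding")
emb = tf.keras.layers.Flatten()(emb_layer(sku_id))

hidden = tf.keras.layers.Dense(32, activation="relu")(
    tf.keras.layers.Concatenate()([emb, other]))
log_mu = tf.keras.layers.Dense(1)(hidden)         # predict log(mu), so mu = exp(log_mu) > 0

model = tf.keras.Model([sku_id, other], log_mu)
model.compile(optimizer="adam",
              loss=lambda y, log_mu: tf.reduce_mean(tf.exp(log_mu) - y * log_mu))   # Poisson NLL

# model.fit([sku_ids, other_features], sales_counts) would train it; the learned SKU vectors
# are then emb_layer.get_weights()[0], an array of shape (n_skus, emb_dim).
```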
Athens BD Jun2018 | p12
The gain of SKU embedding
Suppose the objective is to estimate a kind of 'market basket' when cashier transaction data is not
available, i.e. groups of SKUs with approximately the same sales across days and stores.
This is a core problem in assortment planning:
estimate the number | percentage of product items I'll need to stock for the next week | month | season.
Probably more involved is the use of assortments in demand forecasting: estimate a product's sales
for the next period from its sales history.
How is the above related to the learnt VS embeddings of SKUs?
The core insight is that neighboring values in the embedding space have similar sales across stores and days.
Well, not exactly: currently the best theoretical result we have is this:
m·‖e1 − e2‖ ≤ E_x‖yhat(e1, x) − yhat(e2, x)‖ ≤ M·‖e1 − e2‖, with m ≤ M.
Practice shows, though, that the insight holds.
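A sketch of how this gets used in practice, assuming the trained SKU vectors have been exported as in the sketch above (the file name, cluster count and neighbor count are assumptions; KMeans and NearestNeighbors are illustrative choices, not something prescribed by the deck):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# sku_vectors: (n_skus, emb_dim) array taken from the trained embedding layer.
sku_vectors = np.load("sku_vectors.npy")        # hypothetical file with the learned embeddings

# 'Market baskets': groups of SKUs with similar embeddings, hence (empirically) similar sales patterns.
groups = KMeans(n_clusters=50, random_state=0).fit_predict(sku_vectors)

# Or, for a single SKU, its nearest neighbors in the embedding space.
nn = NearestNeighbors(n_neighbors=6).fit(sku_vectors)
_, idx = nn.kneighbors(sku_vectors[:1])         # neighbors of SKU 0 (the first hit is SKU 0 itself)
print(groups[:10], idx)
```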
Athens BD Jun2018 | p13
Embedding projectors
An embedding projector tries to create a 2D or 3D scatterplot from a multidimensional set of
points.
The purpose is to retain as much of the variance in the original set as possible.
PCA is the most widely used method; however, it fails in high-dimensional spaces or with complex
geometries.
The method proposed there is t-SNE.
It learns the positions of the 2|3D points by minimizing the KL divergence between probability distributions it defines
for the original space and its t-SNE image (what a hack!).
The reference examples of MNIST and Word2Vec are on the tensorboard-projector page.
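For our SKU vectors the same projection can be sketched with scikit-learn's t-SNE (the perplexity is just a common default, not a value from the deck):

```python
import numpy as np
from sklearn.manifold import TSNE

sku_vectors = np.load("sku_vectors.npy")                      # hypothetical: the learned embeddings
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(sku_vectors)
print(xy.shape)                                               # (n_skus, 2): ready for a scatterplot
```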
Athens BD Jun2018 | p14
An example from telecoms
Telecom operators exploit the call graph of their subscribers using elementary or more advanced
methods.
Given a log of calls between subscribers (voice and texts) over a period of N days, they define the
strength of the relation between two subscribers by the number and duration of the calls they make to one
another.
Variations take into account the time of day, the day of the week, the uniformity of call frequency, etc.
A subscriber X's network | community is the set of subscribers with the strongest relation with X.
An approach in line with our discussion is to use the call graph to map the subscribers into an embedding
space. A subscriber's community is then their nearest neighbors in the embedding space (obviously).
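A sketch of the elementary relation-strength computation described above, in pandas (the column names, the toy log and the count-plus-minutes weighting are all assumptions):

```python
import pandas as pd

# Hypothetical call log: one row per call, with caller, callee and duration in seconds.
calls = pd.DataFrame({
    "caller":   ["A", "A", "B", "C", "A"],
    "callee":   ["B", "B", "A", "A", "C"],
    "duration": [60, 120, 30, 300, 45],
})

# Treat the relation as undirected: sort each pair so that (A, B) and (B, A) aggregate together.
pair = calls[["caller", "callee"]].apply(lambda r: tuple(sorted(r)), axis=1)
strength = (calls.assign(pair=pair)
                 .groupby("pair")
                 .agg(n_calls=("duration", "size"), total_sec=("duration", "sum")))

# One simple strength definition: call count plus total duration in minutes (an assumed weighting).
strength["score"] = strength["n_calls"] + strength["total_sec"] / 60.0
print(strength.sort_values("score", ascending=False))
```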
Athens BD Jun2018 | p15
An example from telecoms
There are several benefits to this approach:
▪ Embeddings have memory. As soon as a new call record becomes available, a few iterations of the
neural network will accommodate the new information in the existing embedding vectors. This permits
real-time community updates.
▪ Embeddings facilitate the visualization of various customer-level measures on their projected
manifolds: we can view, for example, the distribution of rate plans or rate-plan categories, or the
distribution of customer tenure, over the embedding vectors.
▪ The most useful property, though, is the way embeddings can be used to predict the community of a
new customer for whom there's no call log yet (but a few things are known initially, e.g. the rate plan,
service subscriptions and demographics).
Athens BD Jun2018 | p16
How far can we go?
Word2Vec was the first REALLY impressive use of a certain novel kind of word embedding.
It constructs a language model from a text corpus, i.e. given part of a sentence it will predict the rest of it.
A direct consequence is machine translation: throw in a sentence in Greek and it will translate it into Swahili.
Try this out in Google Translate.
More?
Sunspring, from 2 years ago, was the first movie script written completely by a machine.
Athens BD Jun2018 | p17
Thanx guys
For more pizzas you can track me here:
http://www.mltrain.cc
http://www.linkedin.con/in.cmalliopoulos