A Summary of Distributed Representations of
Words and Phrases and Their Compositionality
Xiangnan YUE
xiangnanyue@gmail.com
Ingénieur Civil @Mines ParisTech
M2 Data Science @Ecole Polytechnique
Road Map
- Part 1 - Learning Word2Vec
- Baseline Skip-gram model
- Hierarchical softmax
- Negative sampling
- Sub-sampling trick
- Part 2 - Learning Phrases
- Part 3 - Empirical Results
- Part 4 - Additive Compositionality
- Part 5 - Conclusion
Background
This paper follows [1], Efficient estimation of word representations in vector space (2013).
It proposes several important improvements to the Skip-gram model, which is widely used for obtaining the now-famous "word vectors". [1] introduced CBOW (Continuous Bag-of-Words) and Skip-gram, both using a 3-layer neural network to train the word vectors.
While CBOW takes the context as input to predict the center word, Skip-gram takes the center word as input to predict the context.
Part 1
- Baseline Skip-Gram Model
The principle of the baseline Skip-gram
- Given a center word as input, the model outputs the conditional probabilities of the surrounding words. The goal is to obtain the hidden-layer "word vectors" by maximizing the likelihood function.
- The objective is to maximize the average log-likelihood:
  $$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)$$
- There is no activation function on the hidden layer; the output layer uses a softmax.
- The parameters of the hidden layer are trained as the "word vectors": a matrix of shape (m, n), where m is the vocabulary size and n is the number of features (vector length).
Figure 1. Architecture of a basic Skip-gram model. Figure adapted from
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
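To make the architecture concrete, here is a minimal NumPy sketch (not the authors' code) of scoring one (center word, context word) pair with the full softmax; the matrices W_in and W_out and the sizes m, n are illustrative placeholders.

```python
import numpy as np

m, n = 10000, 300                            # vocabulary size, vector length
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(m, n))   # hidden layer = the "word vectors"
W_out = rng.normal(scale=0.01, size=(m, n))  # output-layer vectors v'_w

def softmax_prob(center_id, context_id):
    """p(w_O | w_I) with the full softmax: one inner product per vocabulary word."""
    v_in = W_in[center_id]            # hidden activation (no non-linearity)
    scores = W_out @ v_in             # m inner products -- this is the costly part
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

# the Skip-gram objective sums log softmax_prob over all (center, context) pairs
loss = -np.log(softmax_prob(42, 1337))
```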
Remarks for Baseline Skip-gram
- The objective function relies on a naive Bayes assumption: the context words w(t+j) and w(t+i) are conditionally independent given w(t), so the terms p(w(t+j) | w(t)) can be separated by the log(·) operation.
- There are too many output-layer parameters to train. In the softmax
  $$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{m} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}$$
  for any input word vector $v_{w_I}$ we still have to compute the inner product with every output vector $v'_w$ (between the hidden layer and the softmax) to obtain the probability of each output word w_O. The parameter matrix between the hidden layer and the output is of the same size as our word-vector matrix, (m, n), which is impractical when the vocabulary size m is very large.
- Moreover, training such a large number of parameters requires an equally large amount of training data, which makes training much slower: the running time is roughly proportional to the product of the two.
- Therefore, extensions of the Skip-gram model were proposed, such as Hierarchical Softmax and Negative Sampling (one contribution of this paper).
- Stochastic Gradient Descent is used to train the Skip-gram model.
Part 1
- Hierarchical Softmax Method
The Hierarchical Softmax
To reduce the complexity of computing the output-layer weights, Hierarchical Softmax (HS) is used in the Skip-gram model. HS was first proposed by [2] Morin and Bengio (2005).
We first introduce the data structure used in HS: the Huffman tree / Huffman coding. A Huffman tree assigns short codes (paths) to the most frequent words, so that a walk down the tree reaches these words quickly.
Huffman Tree
The Huffman tree was originally proposed for lossless data compression. It is built bottom-up from the frequency table of the words in the corpus.
Words are placed at the leaves (prefix coding). The most frequent words sit close to the root and the least frequent words close to the bottom.
The decoding path starts from the root and stops when word w is found (its leaf is reached).
Figure 2. Huffman tree: an example. Figure from https://en.wikipedia.org/wiki/Huffman_coding
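For illustration only (not the paper's implementation), a minimal bottom-up Huffman construction in Python using heapq; the toy frequency table is hypothetical. Frequent words receive short codes, i.e. short paths from the root.

```python
import heapq
import itertools

def build_huffman_codes(freq):
    """Return {word: binary code string} for a frequency dict, built bottom-up."""
    counter = itertools.count()              # tie-breaker so heapq never compares dicts
    heap = [(f, next(counter), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

codes = build_huffman_codes({"the": 5000, "cat": 300, "sat": 280, "aardvark": 3})
# frequent words get short codes, e.g. len(codes["the"]) < len(codes["aardvark"])
```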
Huffman Tree
Here we use |w| for the total number of words (equal to the number of leaves of the Huffman tree), |f| for the number of features in the hidden layer, and σ(·) for the sigmoid function.
The Huffman tree reduces the computational complexity of the output layer from O(|w| · |f|) to approximately O(log|w| · |f|) per input sample! The softmax function is replaced by:
$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left(\left[\!\left[\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,\right]\!\right] \cdot {v'_{n(w,j)}}^{\top} v_{w_I}\right)$$
where n(w, j) is the j-th node on the path from the root to w, L(w) is the length of that path, and [[x]] is 1 if x is true and -1 otherwise.
This equation describes the process: ch(·) is fixed to be, say, the left child. Then for any target word w, if the path turns left at node j we take $\sigma({v'_{n(w,j)}}^{\top} v_{w_I})$, otherwise we take $1 - \sigma({v'_{n(w,j)}}^{\top} v_{w_I}) = \sigma(-{v'_{n(w,j)}}^{\top} v_{w_I})$. The path from the root to w thus defines a maximum-likelihood problem, and the parameters are updated by backpropagation.
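As a sketch under stated assumptions (hypothetical arrays and node ids, not the paper's code), this is how p(w | w_I) can be evaluated along the Huffman path: one sigmoid per inner node instead of a full softmax over the vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_log_prob(v_in, path_nodes, turn_left, node_vecs):
    """log p(w | w_I): one sigmoid per inner node on the root-to-leaf path."""
    logp = 0.0
    for node_id, left in zip(path_nodes, turn_left):
        score = node_vecs[node_id] @ v_in
        # sigma(score) if the path goes left, 1 - sigma(score) = sigma(-score) otherwise
        logp += np.log(sigmoid(score if left else -score))
    return logp

# toy usage: 3 inner nodes on the path instead of |w| output units
rng = np.random.default_rng(1)
node_vecs = rng.normal(size=(7, 300))        # one vector v'_n per inner node
v_in = rng.normal(size=300)                  # hidden-layer vector of the input word
print(hs_log_prob(v_in, path_nodes=[0, 2, 5], turn_left=[True, False, True], node_vecs=node_vecs))
```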
The number of parameters stays the same, yet the binary Huffman tree reduces the amount of computation to an acceptable level.
Different methods for constructing the tree structure can be found in [3] Mnih and Hinton (2009).
HS had already been used in [1], where its performance was not fully satisfying. This paper compares Hierarchical Softmax with Negative Sampling, and also with Noise Contrastive Estimation (NCE), proposed by [4] Gutmann and Hyvarinen (2012), together with the subsampling trick and with phrases as training samples.
Part 1
- Negative Sampling Method
Negative Sampling Method
The principle is that each training sample adjusts only a small fraction of the parameters. The idea is to select a few negative samples (words that do not occur around the input center word) and to update only the output-layer weights of the selected negative words and of the positive word.
Remarks:
- It is similar in spirit to a dropout layer in a deep neural network, except that dropout acts between hidden layers rather than removing output units.
- In the hidden layer, the number of updated weights is unchanged -- only the feature weights of the input word are modified.
Negative Sampling Method
How to select negative samples: draw the negative words from the unigram distribution Pn(w). According to the paper, in the objective function, log p(w_O | w_I) is replaced by the following expression:
$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$
The unigram distribution is raised to the 3/4 power; this value is empirical.
Note that σ is the sigmoid function σ(x) = 1 / (1 + e^{-x}), so the first term above is the log probability of the positive word and the second term is the log probability of not observing the negative words.
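A minimal sketch of the negative-sampling objective for one (center, context) pair, with hypothetical variable names (W_out, v_in, word_counts) that are not from the paper's code; negatives are drawn from the unigram distribution raised to the 3/4 power.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise_dist(word_counts):
    """Unigram distribution raised to the 3/4 power, renormalized (P_n(w))."""
    p = np.asarray(word_counts, dtype=float) ** 0.75
    return p / p.sum()

def neg_sampling_objective(v_in, pos_id, W_out, noise_dist, k=5):
    """log sigma(v'_pos . v_in) + sum over k negatives of log sigma(-v'_neg . v_in)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    neg_ids = rng.choice(len(noise_dist), size=k, p=noise_dist)
    obj = np.log(sigmoid(W_out[pos_id] @ v_in))
    obj += np.sum(np.log(sigmoid(-(W_out[neg_ids] @ v_in))))
    return obj   # maximized by SGD; only k + 1 rows of W_out receive gradients
```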
Part 1
- Sub-Sampling Method
Subsampling Heuristic
Sub-sampling is important because some words are much more frequent than others. When the sampling window is large enough, these words appear very often in sampling windows yet convey little information (just as a random variable that takes a value with probability 1 carries no entropy).
Heuristically, reducing how often such a word appears, while preserving the frequency ranking over the whole corpus, even improves the final results. The method used in this paper keeps each word occurrence with a probability of the order of $\sqrt{t / f(w)}$.
The discard probability is
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where t is called the subsampling rate (around 10^-5 in the paper) and f(w_i) is the frequency of word w_i in the corpus.
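A short sketch of the discard rule above; the frequency table and the rate t = 1e-5 are illustrative, not tied to a particular corpus.

```python
import random

def keep_word(word, freq, t=1e-5):
    """Randomly drop an occurrence of `word`; freq[word] is its relative frequency."""
    discard_prob = max(0.0, 1.0 - (t / freq[word]) ** 0.5)   # P(w) = 1 - sqrt(t / f(w))
    return random.random() > discard_prob

freq = {"the": 0.05, "aardvark": 1e-7}
corpus = ["the", "aardvark", "the", "the"]
subsampled = [w for w in corpus if keep_word(w, freq)]
# occurrences of "the" are discarded most of the time; "aardvark" (f < t) is always kept
```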
Part 2
- Learning Phrases
Use Phrases
Another important contribution is to use phrases (word pairs) rather than individual words as training tokens. Although this can be regarded as a preprocessing step, it is of great importance, given that:
- The number of training samples is greatly reduced.
- The meaning of the composed phrases is preserved.
The co-occurrence score is computed simply from bigram and unigram counts:
$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$
where δ is a discounting coefficient that prevents very infrequent word pairs from forming phrases. When the score is large enough, the two words are merged into a single phrase token.
The whole training set is passed through this preprocessing algorithm 2-4 times with a decreasing threshold. In the paper, the quality of the resulting phrase vectors is evaluated with an analogical reasoning task (72% accuracy); a sketch of the scoring step is given below.
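A sketch of the bigram-over-unigram scoring pass described above (the delta and threshold values are hypothetical); real runs repeat this pass a few times with a decreasing threshold so that longer phrases can form.

```python
from collections import Counter

def find_phrases(tokens, delta=5.0, threshold=1e-4):
    """Return adjacent word pairs whose phrase score exceeds the threshold."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    phrases = set()
    for (w1, w2), c in bigram.items():
        score = (c - delta) / (unigram[w1] * unigram[w2])
        if score > threshold:
            phrases.add((w1, w2))
    return phrases

# selected pairs such as ("new", "york") would then be replaced by a single
# token "new_york" before the Skip-gram training pass
```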
Part 3
- Empirical Results
Analogical Reasoning Task
An example of analogical reasoning is shown in Figure 3. Given three phrases, to what degree can the fourth phrase be inferred from the first three phrase vectors? Using cosine distance between the vectors, the closest candidate is selected and compared with the true answer.
This evaluation method is also used in [1].
Figure 3. Examples of the analogical reasoning test: in each cell, given the first three words, measure the accuracy of predicting the fourth.
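A sketch of the cosine-based analogy test, assuming a hypothetical dictionary vecs mapping each word or phrase to its NumPy vector:

```python
import numpy as np

def answer_analogy(a, b, c, vecs):
    """Return the word whose vector is closest (cosine) to vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):                  # exclude the question words
            continue
        sim = v @ target / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# accuracy = fraction of questions where answer_analogy(...) equals the true answer,
# e.g. answer_analogy("France", "Paris", "Germany", vecs) should return "Berlin"
```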
Empirical Results (words, no phrase learning)
The empirical results on the analogical reasoning task, as reported in the paper:
- Negative Sampling outperforms Hierarchical Softmax and is slightly better than Noise Contrastive Estimation.
- Subsampling speeds up training and even gives better Semantic and Total accuracy.
Settings: dimensionality = 300; context size = 5; words with frequency < 5 discarded; k = 5 or 15 (the number of negative samples); training set on the order of a billion words.
Empirical Results (using phrases)
The empirical results on the phrase analogical reasoning task, as reported in the paper:
- With subsampling, Hierarchical Softmax outperforms the other methods! Yet without subsampling, HS has the worst accuracy.
- Since using phrases greatly reduces the total number of training samples, more training data is needed (see the next slide).
Settings: dimensionality = 300; context size = 5; words with frequency < 5 discarded; k = 5 or 15 (the number of negative samples).
Empirical Results (using phrases)
The paper reports that with 33 billion training words, dimensionality = 1000, and the entire sentence as the context, the accuracy reaches 72%. The best-performing model is still HS + subsampling.
Part 4
- Additive Compositionality
Additive Compositionality
Additive compositionality is an interesting property of the word vectors. It shows that a non-trivial degree of language understanding can be obtained with basic operations (here, vector addition) on the word representations, e.g. vec("Russia") + vec("river") is close to vec("Volga River"). That is something more than analogical reasoning.
The authors give an intuitive explanation based on the product of distributions: since each word vector is trained to predict its context distribution, the sum of two vectors corresponds roughly to the product of the two context distributions, which acts like an AND operation, so words that are probable in both contexts receive high probability.
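Under the same assumption of a vecs dictionary as in the analogy sketch, additive compositionality can be probed by summing two vectors and listing the nearest remaining entries by cosine similarity (illustrative only):

```python
import numpy as np

def compose(w1, w2, vecs, topn=3):
    """Nearest neighbours of vec(w1) + vec(w2), e.g. 'Russia' + 'river' -> 'Volga River'."""
    query = vecs[w1] + vecs[w2]
    query /= np.linalg.norm(query)
    sims = {w: v @ query / np.linalg.norm(v)
            for w, v in vecs.items() if w not in (w1, w2)}
    return sorted(sims, key=sims.get, reverse=True)[:topn]
```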
Part 5
- Conclusion
Conclusion
The main contributions of this paper:
- Using common word pairs (phrase representations) in place of single words in the training data.
- Subsampling the frequent words, which greatly decreases training time and improves the representations of infrequent words.
- Proposing Negative Sampling.
- Comparing the combinations of the different methods and training the best model up to then on a huge (33 billion word) data set.
- Addressing the additive compositionality of word vectors.
The techniques of this paper can also be applied to the CBOW model of [1], and the code is available at https://code.google.com/archive/p/word2vec/
References
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[2] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246–252, 2005.
[3] Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081–1088, 2009.
[4] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307–361, 2012.
