Fasttext (Enriching Word Vectors with Subword Information) Paper Review
Contents
01 Introduction
02 General Model (skip-gram)
03 Subword Model (SISG)
01 Introduction
• Research background and motivation
§ Existing embedding models could assign each unique word its own vector.
§ However, this approach has inherent limitations as the vocabulary grows or as rare words become numerous.
§ It is hard to obtain good word representations for such words.
§ In particular, word representation techniques to date have not taken the internal structure of words into account.
§ In Spanish or French, most verbs have more than 40 inflected forms, so the rare-word problem will be even more pronounced in such languages.
For morphologically rich languages like these, training with subword information should improve the vector representations.
02 General Model
• skip-gram
[Figure: a sentence with one target word and the context words on either side of it within the window size.]
P(quick, brown | the) = P(quick | the) · P(brown | the)
P(the, brown, fox | quick) = P(the | quick) · P(brown | quick) · P(fox | quick)
P(the, quick, fox, jumps | brown) = P(the | brown) · P(quick | brown) · P(fox | brown) · P(jumps | brown)
P(quick, brown, jumps, over | fox) = P(quick | fox) · P(brown | fox) · P(jumps | fox) · P(over | fox)
Assumption
• Given the target word, the context words are conditionally independent.
∏_{t=1}^{T} ∏_{c∈C_t} p(w_c | w_t)
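To make the windowing concrete, here is a minimal Python sketch of how (target, context) training pairs could be enumerated; the sentence and window size are illustrative, not from the slides:

```python
# Minimal sketch: enumerate (target, context) pairs for the skip-gram model.
tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
window_size = 2

pairs = []
for t, target in enumerate(tokens):
    # Context positions C_t: indices within the window around t, excluding t itself.
    for c in range(max(0, t - window_size), min(len(tokens), t + window_size + 1)):
        if c != t:
            pairs.append((target, tokens[c]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```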
02 General Model
• skip-gram
I_{1×V} · W_{V×N} = H_{1×N},  H_{1×N} · W′_{N×V} = O_{1×V}
(one-hot input I, input weights W, hidden vector H, output weights W′, output scores O; V = vocabulary size, N = embedding dimension)
02 General Model
• skip-gram
[Figure: worked numerical example of the forward pass for the target word "quick". The one-hot input I is multiplied by W to give the hidden vector H; multiplying by W′ and applying softmax gives y_pred, which is compared against the one-hot y_true of each context word ("the", "brown"); the per-word losses are summed into loss_quick.]
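A minimal NumPy sketch of this forward pass; the vocabulary size, embedding dimension, and all weight values are arbitrary:

```python
import numpy as np

V, N = 5, 3                         # vocabulary size, embedding dimension (arbitrary)
rng = np.random.default_rng(0)
W_input = rng.normal(size=(V, N))   # input weight matrix  (V x N)
W_output = rng.normal(size=(N, V))  # output weight matrix (N x V)

x = np.zeros(V)                     # one-hot input vector for the target word
x[1] = 1.0                          # e.g. the index of "quick"

h = x @ W_input                     # hidden vector, shape (N,)
scores = h @ W_output               # one score per vocabulary word, shape (V,)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
```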
02 General Model
• Row Indexing: removing the bottleneck of the X × W_input matrix product
§ The matrix product of X and W_input becomes very expensive as the vocabulary grows.
§ To obtain h, we can instead take the one-hot index of the input vector X and use it as a row index into W_input, reading the row off directly.
§ The result is identical, but the computational cost differs greatly.
The goal is to select and update only the elements of the weight matrix W_input that are actually related to the input vector X.
[Figure: the one-hot input for "quick" selects the corresponding row of W_input directly, yielding the same hidden vector as the full matrix product.]
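The equivalence is easy to check in NumPy; a small sketch under arbitrary shapes:

```python
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W_input = rng.normal(size=(V, N))

x = np.zeros(V)
x[1] = 1.0                       # one-hot input

# Because x is one-hot, the matmul just selects one row of W_input:
h_matmul = x @ W_input           # O(V * N) multiply-adds
h_lookup = W_input[1]            # O(N) row lookup, same result
assert np.allclose(h_matmul, h_lookup)
```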
02 General Model
• Negative Sampling: removing the bottleneck of the h × W_output matrix product and the softmax layer
§ The matrix product of the latent vector h and W_output likewise requires heavy computation as the vocabulary grows.
§ The softmax computation also becomes expensive as the vocabulary grows.
§ However, the context words related to a single target word are only the handful of words inside the window size.
§ In other words, although only a few words related to the input actually need to be updated, the h × W_output product compares the input against every word in the vocabulary, which is inefficient.
[Figure: the hidden vector h multiplied by W′ produces a score for every word in the vocabulary, even though only the few context words of the target matter.]
02 General Model
• Negative Sampling: removing the bottleneck of the h × W_output matrix product and the softmax layer
§ Negative sampling can be used to resolve this.
§ The core idea of negative sampling is to approximate the multi-class classification with binary classification (see the sketch below):
• positive example: a context word of the target word
• negative example: a word that is not a context word of the target word
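A minimal sketch of these per-pair binary losses; the vectors and the sampled negative are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 3
h = rng.normal(size=N)                    # hidden vector of the target word
u_pos = rng.normal(size=N)                # output row of a true context word (label 1)
u_neg = rng.normal(size=N)                # output row of a sampled negative  (label 0)

# Binary cross-entropy on each dot product, instead of a softmax over the vocabulary.
loss_pos = -np.log(sigmoid(h @ u_pos))        # want sigmoid(score) -> 1
loss_neg = -np.log(1.0 - sigmoid(h @ u_neg))  # want sigmoid(score) -> 0
loss = loss_pos + loss_neg
```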
02 General Model
• Negative Sampling: removing the bottleneck of the h × W_output matrix product and the softmax layer
[Figure: negative-sampling example for the target word "quick". The hidden vector h is dotted with the output rows of the true context words "the" (label = 1) and "brown" (label = 1) and of a sampled negative "dog" (label = 0); a binary cross-entropy loss is computed for each dot product and summed into Loss_quick. This replaces the full H_{1×N} · W′_{N×V} product and softmax, where losses were computed against the one-hot y_true of every context word.]
02 General Model
• Negative Sampling: removing the bottleneck of the h × W_output matrix product and the softmax layer
§ How are negatives sampled?
• Words that appear frequently in the corpus should be sampled more often, and rare words less often.
§ The sampling distribution is derived through the equation below:
P(w_i) = f(w_i)^{3/4} / ∑_{j=1}^{W} f(w_j)^{3/4},  f(w_i) = (# of w_i in corpus) / (# of all words in corpus)
§ Why use 3/4?
• To make rarely occurring words slightly easier to sample.
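A small sketch of building this sampling distribution; the corpus counts are made up:

```python
import numpy as np

# Illustrative corpus counts; the 3/4 exponent is the one from the slide.
counts = {"the": 100, "quick": 10, "brown": 5, "fox": 1}
freqs = np.array(list(counts.values()), dtype=float)
freqs /= freqs.sum()                # f(w_i): relative frequency of each word

probs = freqs ** 0.75               # raise to 3/4 (flattens the distribution) ...
probs /= probs.sum()                # ... and renormalize to get P(w_i)

words = list(counts)
negatives = np.random.default_rng(0).choice(words, size=5, p=probs)
print(dict(zip(words, probs.round(3))), negatives)
```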
02 General Model
• skip-gram
§ We start by briefly reviewing the skip-gram model introduced by Mikolov et al.
§ Inspired by the distributional hypothesis (Harris, 1954), word representations are trained to predict well the words that appear in their context.
§ It is assumed that there is no relationship between the context words w_c given the target word w_t (conditional independence):
P(w_{t−1}, w_{t+1} | w_t) = P(w_{t−1} | w_t) · P(w_{t+1} | w_t)
§ Given a large training corpus represented as a sequence of words (w_1, …, w_T), the objective of the skip-gram model is to maximize the following log-likelihood:
∏_{t=1}^{T} ∏_{c∈C_t} p(w_c | w_t) ⟺ ∑_{t=1}^{T} ∑_{c∈C_t} log p(w_c | w_t)
Notation
• W: vocab size
• w ∈ {1, 2, …, W}: each word is identified by its index
• w_c: context word
• w_t: target word
• C_t: set of indices of words surrounding word w_t
• 𝒩_{t,c}: a set of negative examples sampled from the vocabulary
02 General Model
• skip-gram: Objective
§ One possible choice to define the probability of a context word is the softmax:
P(w_c | w_t) = e^{s(w_t, w_c)} / ∑_{j=1}^{W} e^{s(w_t, j)}
§ The problem of predicting context words can instead be framed as a set of independent binary classification tasks.
§ The goal is then to independently predict the presence (or absence) of context words.
§ For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:
log(1 + e^{−s(w_t, w_c)}) + ∑_{n∈𝒩_{t,c}} log(1 + e^{s(w_t, n)}),  where s(w_t, w_c) = u_{w_t}ᵀ v_{w_c}
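A minimal sketch of this loss for one context position; the vectors and the number of negatives are illustrative:

```python
import numpy as np

def log1p_exp(z):
    # Numerically stable log(1 + e^z).
    return np.logaddexp(0.0, z)

rng = np.random.default_rng(0)
N = 3
u_t = rng.normal(size=N)             # target word vector u_{w_t}
v_c = rng.normal(size=N)             # context word vector v_{w_c}
v_negs = rng.normal(size=(5, N))     # vectors of the sampled negatives in N_{t,c}

s_pos = u_t @ v_c                    # s(w_t, w_c) = u^T v
nll = log1p_exp(-s_pos) + sum(log1p_exp(v @ u_t) for v in v_negs)
print(nll)
```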
02 General Model
• skip-gram: Negative-sampling
§ For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:
log(1 + e^{−s(w_t, w_c)}) + ∑_{n∈𝒩_{t,c}} log(1 + e^{s(w_t, n)}),  where s(w_t, w_c) = u_{w_t}ᵀ v_{w_c}
§ For all target words, we can rewrite the objective as:
∑_{t=1}^{T} [ ∑_{c∈C_t} log(1 + e^{−s(w_t, w_c)}) + ∑_{n∈𝒩_{t,c}} log(1 + e^{s(w_t, n)}) ]
Negative sampling results in both faster training and more accurate representations, especially for frequent words (Mikolov et al., 2013)
02 General Model
• skip-gram: Subsampling of frequent words
§ In very large corpora, frequent words usually provide less information value than rare words.
§ The vector representations of frequent words do not change significantly after training on several million examples.
§ To counter the imbalance between rare and frequent words, a simple subsampling approach is used:
• each word w_i in the training set is discarded with probability computed by the formula:
P(w_i) = 1 − √(t / f(w_i)),  f(w_i) = (# of w_i in corpus) / (# of all words in corpus),  t = chosen threshold (recommended 10⁻⁵)
Subsampling results in both faster training and significantly better representations of uncommon words (Mikolov et al., 2013)
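A small sketch of the discard rule; the relative frequencies are made up, and clipping negative probabilities to zero is an implementation detail rather than something stated on the slide:

```python
import numpy as np

t = 1e-5                                      # recommended threshold

def discard_prob(f_w: float) -> float:
    # P(w_i) = 1 - sqrt(t / f(w_i)); clipped at 0 because the formula goes
    # negative for very rare words, which should never be discarded.
    return max(0.0, 1.0 - np.sqrt(t / f_w))

rng = np.random.default_rng(0)
freqs = {"the": 0.05, "fox": 1e-6}            # illustrative relative frequencies
kept = [w for w, f in freqs.items() if rng.random() >= discard_prob(f)]
print(discard_prob(0.05), kept)               # "the" is discarded ~98.6% of the time
```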
03 Subword Model
• SISG, FastText
§ Each word w is represented as a bag of character n-grams.
§ We add special boundary symbols ‘<’ and ‘>’ at the beginning and end of words, allowing prefixes and suffixes to be distinguished from other character sequences.
§ We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams).
§ Taking the word where and n = 3 as an example, it is represented by the character n-grams:
𝒢_where = {<wh, whe, her, ere, re>, <where>}
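A minimal sketch of the n-gram extraction just described; the function name is ours:

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    # Wrap the word in boundary symbols, extract all character n-grams,
    # and include the full bracketed word itself as a special token.
    bounded = f"<{word}>"
    grams = [bounded[i:i + n] for i in range(len(bounded) - n + 1)]
    grams.append(bounded)
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```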
§ We represent a word by the sum of the vector representations of its n-grams, so we obtain the scoring function:
s(w, c) = ∑_{g∈𝒢_w} z_gᵀ v_c
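A small sketch of this scoring function, reusing the n-grams of where from above; all vectors are random placeholders:

```python
import numpy as np

grams = ["<wh", "whe", "her", "ere", "re>", "<where>"]   # G_where from above
rng = np.random.default_rng(0)
dim = 3
z = {g: rng.normal(size=dim) for g in grams}  # one vector z_g per n-gram (illustrative)
v_c = rng.normal(size=dim)                    # context word vector v_c

# s(w, c) = sum over the n-grams g of w of z_g^T v_c
s = sum(z[g] @ v_c for g in grams)
```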
Notation
• 𝒢_w: the set of character n-grams (subwords) of word w
03 Subword Model
• SISG, FastText
pseudo code of computing loss
[Figure: screenshot of the loss-computation code; not preserved in this transcript.]
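The original slide showed a screenshot of the loss-computation code, which did not survive the transcript. As a stand-in, here is a hedged sketch of how the SISG loss for one (target, context) pair could be computed, combining the subword scoring function with the negative-sampling loss above; all names and shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def log1p_exp(z):
    # Numerically stable log(1 + e^z).
    return np.logaddexp(0.0, z)

def sisg_pair_loss(target_grams, context_word, negatives, z, v):
    """Negative-sampling loss for one (target, context) pair, with the target
    represented as the sum of its n-gram vectors (all names illustrative)."""
    u_t = sum(z[g] for g in target_grams)       # subword sum = target vector
    loss = log1p_exp(-(u_t @ v[context_word]))  # positive example
    for n in negatives:                         # sampled negatives
        loss += log1p_exp(u_t @ v[n])
    return loss

# Tiny illustrative usage with random vectors.
rng = np.random.default_rng(0)
grams = ["<wh", "whe", "her", "ere", "re>", "<where>"]
z = {g: rng.normal(size=3) for g in grams}
v = {w: rng.normal(size=3) for w in ["is", "banana"]}
print(sisg_pair_loss(grams, "is", ["banana"], z, v))
```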