Word Embeddings: Why the Hype?
Hady Elsahar
Hady.elsahar@univ-st-etienne.fr
slides available at :
Introduction
● Why vectors for natural language?
● Conventional representations for words and documents
● Methods of dimensionality reduction
Deep learning models:
● Continuous Bag of Words model
● Other models (Skip-Gram model, GloVe)
● Evaluation of Word Vectors
● Readings and references
Introduction: Why Vectors
Document Classification or Clustering :
● Documents are composed of words
● Similar documents will contain similar words
● Machine learning algorithms love vectors
● A machine learning algorithm needs to know
which words are significant for which category
Bag of Words Model
“Represent each document by the bag of words it contains”
d1 : Mary loves Movies, Cinema and Art (Class 1 : Arts)
d2 : John went to the Football game (Class 2 : Sports)
d3 : Robert went for the Movie Delicatessen (Class 1 : Arts)
Vocabulary: Mary, Loves, Movies, Cinema, Art, John, Went, to, the, Delicatessen, Robert, Football, Game, and, for
d1 → 1 for Mary, Loves, Movies, Cinema, Art, and (0 everywhere else)
d2 → 1 for John, Went, to, the, Football, Game (0 everywhere else)
d3 → 1 for Robert, Went, for, the, Movies, Delicatessen (0 everywhere else)
Bag of Words Model
Can a machine learning algorithm figure out that “the” and “for” are unimportant
words ?
● Yes, but it will need lots of labeled training data
What to do ?
● Use hand-crafted features (weighting features for words)
● Make lots of them
● Keep doing this for 50 years
● Regret later .. cry hard
Bag of Words Model + Weighting Features
Weighting features example: TF-IDF
● TF-IDF ~= Term Frequency / Document Frequency
● Motivation : words appearing in a large number of documents are not significant
In the example table the weights become:
d1 : 0.3779 (×5), 0.0001
d2 : 0.4402, 0.001, 0.02, 0.4558, 0.458
d3 : 0.001, 0.01, 0.01, 0.458, 0.0001
The content words of each document get the high weights (roughly 0.38-0.46), while frequent function words such as to, the, and, for get near-zero weights (0.0001-0.02).
Word Vector Representations
Documents can be represented by their words, but how do we represent the words
themselves ?
“You shall know a word by the
company it keeps” (J.R. Firth, 1957)
Word Vector Representations
Use a sliding window over a big corpus of text and count word co-occurrences
within each window.
1. I enjoy flying.
2. I like NLP.
3. I like deep learning.
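As a rough illustration, here is a minimal Python sketch (window size 1 is an arbitrary choice) that builds such a co-occurrence matrix from the three example sentences:

```python
import numpy as np
from itertools import chain

sentences = [["i", "enjoy", "flying"],
             ["i", "like", "nlp"],
             ["i", "like", "deep", "learning"]]

window = 1
vocab = sorted(set(chain.from_iterable(sentences)))
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)), dtype=int)

for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1   # count co-occurrence inside the window

print(vocab)
print(X)   # each row is a (sparse, high-dimensional) vector for one word
```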
Bag of words Representations: Drawbacks
● High dimensionality and very sparse !
● Unable to capture word order
○ “good but expensive” and “expensive but good” will have the same representation.
● Unable to capture semantic similarities (mostly because of sparsity)
○ “boy”, “girl” and “car”
○ “Human”, “Person” and “Giraffe”
Bag of words Representations: Drawbacks
How to overcome this ?
● Keep using hand crafted features
● Make lots of them
● Keep doing this for 50 years
● Regret later .. cry hard
Or … Dimensionality reduction
Dimensionality Reduction using Matrix
factorization
Singular value decomposition
X = U Σ V^T , where σ1 > σ2 > ... > σn > 0
Singular value decomposition
● Lower dimensionality K << |V|
● Take only the most significant projections of your vector space
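A minimal sketch of this reduction with truncated SVD, assuming the vocab list and co-occurrence matrix X from the earlier sketch (k = 2 is an arbitrary choice):

```python
import numpy as np

# X is the |V| x |V| co-occurrence matrix built earlier
U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

k = 2                                   # keep only the k largest singular values
word_vectors = U[:, :k] * s[:k]         # dense k-dimensional word representations

for word, vec in zip(vocab, word_vectors):
    print(f"{word:10s} {vec}")
```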
Latent semantic Indexing / Analysis (1994)
U : dense word vector representations
V : dense document vector representations
LSA / LSI and HAL made huge advancements in document retrieval and
semantic similarity
Deep learning Word Embeddings (2003)
“A Neural Probabilistic Language Model” Bengio et al. 2003
Original task “Language Modeling” :
- Prediction of next word given sequence of previous words.
- Useful in speech recognition, autocompletion, machine translation.
“The Cat Chills on a mat ” , Calculate : P( mat | the, cat, chills, on, a )
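For intuition, a minimal count-based sketch of this prediction task (the toy corpus and the trigram order are assumptions, not the neural model from the paper):

```python
from collections import Counter

corpus = "the cat chills on a mat . the dog sleeps on a mat .".split()
n = 3   # trigram model

history_counts, ngram_counts = Counter(), Counter()
for i in range(len(corpus) - n + 1):
    ngram = tuple(corpus[i:i + n])
    ngram_counts[ngram] += 1
    history_counts[ngram[:-1]] += 1

def p_next(history, word):
    """Maximum-likelihood estimate of P(word | history) from raw counts."""
    history = tuple(history[-(n - 1):])
    return ngram_counts[history + (word,)] / max(history_counts[history], 1)

print(p_next(["on", "a"], "mat"))   # -> 1.0 in this toy corpus
```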
Deep learning Word Embeddings (2003)
“A Neural Probabilistic Language Model” Bengio et al. 2003
Quoting from the paper:
“This is intrinsically difficult because of the curse of dimensionality: a word
sequence on which the model will be tested is likely to be different from all the
word sequences seen during training.”
“We propose to fight the curse of dimensionality by learning a distributed
representation for words”
Continuous Bag of Words model (CBOW)
Tomas Mikolov et al. (2013)
The model predicts the current word given its context.
Scan text in a large corpus with a sliding window.
Input : x0, x1, x3, x4   Output : x2
“The Cat Chills on a mat”  with  x0 = The, x1 = Cat, x2 = Chills, x3 = on, x4 = a, x5 = mat
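A small sketch of how the (context, target) training pairs could be generated by such a window scan (two words on each side, matching the slide's example):

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs from a token list."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

sentence = ["the", "cat", "chills", "on", "a", "mat"]
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# e.g. ['the', 'cat', 'on', 'a'] -> 'chills'   (x0, x1, x3, x4 -> x2)
```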
Continuous Bag of Words model (CBOW) (2013)
| V | : vocabulary size
xi ∈ R^(1 × |V|) : one-hot vector representation of each context word
yi ∈ R^(|V| × 1) : one-hot representation of the correct middle word (expected output)
Example with |V| = 7:
x0 = [0 0 0 0 1 0 0]
x1 = [0 0 0 1 0 0 0]
x3 = [0 0 0 0 0 0 1]
x4 = [0 0 1 0 0 0 0]
The four context vectors enter a “black box” that should output yi.
Continuous Bag of Words model (CBOW) (2013)
| V | : vocabulary size
xi ∈ R^(1 × |V|) : one-hot vector representation of each context word
yi ∈ R^(|V| × 1) : one-hot representation of the correct middle word (expected output)
Architecture: x0, x1, x3, x4 → W(1) → Average → W(2) → softmax → yi
Continuous Bag of Words model (CBOW) (2013)
n : arbitrary length (dimensionality) of our word embeddings
W(1) ∈ R^(|V| × n) : input word matrix, one row per vocabulary word
ui = xi W(1) ∈ R^n : representation of xi after multiplication with the input matrix
Example (|V| = 7, n = 3):
W(1) =
  0 1 3
  1 3 6
  5 0 3
  9 8 0
  2 2 2
  5 6 7
  8 8 8
Each one-hot xi simply selects the corresponding row of W(1):
u0 = [2 2 2], u1 = [9 8 0], u3 = [8 8 8], u4 = [5 0 3]
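Multiplying by a one-hot vector is just a row lookup; a quick numpy check of the slide's toy numbers:

```python
import numpy as np

W1 = np.array([[0, 1, 3],
               [1, 3, 6],
               [5, 0, 3],
               [9, 8, 0],
               [2, 2, 2],
               [5, 6, 7],
               [8, 8, 8]])            # |V| = 7 rows, n = 3 columns

x0 = np.array([0, 0, 0, 0, 1, 0, 0])  # one-hot for the word at index 4
print(x0 @ W1)                        # -> [2 2 2], i.e. u0 = row 4 of W1
print(W1[[4, 3, 6, 2]])               # u0, u1, u3, u4 via direct row lookup
```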
Continuous Bag of Words model (CBOW) (2013)
hi ∈ R^n
hi = average of u0, u1, u3, u4
Example: average([2 2 2], [9 8 0], [8 8 8], [5 0 3]) = [6 4.5 3.25]
Continuous Bag of Words model (CBOW) (2013)
W(2) ∈ R^(n × |V|) : output word matrix
z = hi W(2) ∈ R^|V| : output scores, one per vocabulary word
Example (n = 3, |V| = 7):
W(2) =
  0 1 3 1 3 6 5
  0 3 9 8 0 2 2
  2 5 6 7 8 8 8
z = hi W(2) is a row of |V| scores (illustrative values on the slide: 32, 14, 23, 0.22, 12, 14, 55, 19)
Continuous Bag of Words model (CBOW) (2013)
How do we compare z to yi ?
Does the largest value simply correspond to the correct class? Not directly: we apply a softmax.
Softmax: squashes a K-dimensional vector of arbitrary real values into
a K-dimensional vector of real values in the range (0, 1) that sum to 1
yi = [1 0 0 0 0 0 0 0]
z  = [32 14 23 0.22 2 14 55 19]
Continuous Bag of Words model (CBOW) (2013)
ŷ = softmax( z )
yi ∈ R^(|V| × 1) : one-hot representation of the correct middle word
yi = [1 0 0 0 0 0 0 0]
z  = [32 14 23 0.22 2 14 55 19]
ŷ  = [0.7 0.1 0.02 0.08 0 0 0.1] (illustrative values)
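A minimal, numerically stable softmax sketch applied to the illustrative scores:

```python
import numpy as np

def softmax(z):
    """Squash arbitrary real scores into probabilities in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([32, 14, 23, 0.22, 2, 14, 55, 19], dtype=float)
print(softmax(z))                 # the largest score (55) dominates the distribution
```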
Continuous Bag of Words model (CBOW) (2013)
● We need the estimated distribution ŷ to be as close as possible to the original answer yi
● One common error function is the cross-entropy H(ŷ, y) (why ?):
  H(ŷ, y) = - Σj yj log(ŷj)
● Since y is a one-hot vector, this reduces to H(ŷ, y) = - log(ŷc), where c is the index of the correct middle word
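And the cross-entropy with the one-hot simplification, applied to the illustrative ŷ above:

```python
import numpy as np

def cross_entropy(y_hat, target_index):
    """H(y_hat, y) for a one-hot y reduces to -log of the probability of the correct word."""
    return -np.log(y_hat[target_index])

y_hat = np.array([0.7, 0.1, 0.02, 0.08, 0.0, 0.0, 0.1])
print(cross_entropy(y_hat, 0))    # -> -log(0.7) ≈ 0.357
```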
Continuous Bag of Words model (CBOW) (2013)
A perfect language model would give the correct word probability ŷc = 1,
so the loss would be 0.
Optimization task :
● Learn W(1) and W(2) to minimize the cost function over the whole dataset
● Using backpropagation, update the weights in W(1) and W(2)
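Putting the pieces together, a minimal numpy sketch of one CBOW training step with manually derived gradients (the sizes, learning rate, and initialization are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 7, 3                                # toy vocabulary size and embedding size
W1 = rng.normal(scale=0.1, size=(V, n))    # input word matrix (one row per word)
W2 = rng.normal(scale=0.1, size=(n, V))    # output word matrix
lr = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_step(context_ids, target_id):
    """One SGD step on a single (context, target) training example."""
    global W1, W2
    h = W1[context_ids].mean(axis=0)       # average of context word vectors, shape (n,)
    z = h @ W2                              # scores, shape (V,)
    y_hat = softmax(z)
    loss = -np.log(y_hat[target_id])
    dz = y_hat.copy()
    dz[target_id] -= 1.0                    # dL/dz = y_hat - y
    dW2 = np.outer(h, dz)                   # dL/dW2
    dh = W2 @ dz                            # dL/dh (computed before updating W2)
    W2 -= lr * dW2
    W1[context_ids] -= lr * dh / len(context_ids)   # the averaged rows share the gradient
    return loss

# e.g. context words x0, x1, x3, x4 predicting the middle word x2
print(cbow_step([4, 3, 6, 2], target_id=0))
```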
Continuous Bag of Words model (CBOW) (2013)
W(1) after training over a large corpus (|V| × n):
  0 1 3
  1 3 6
  5 0 3
  9 8 0
  2 2 2
  5 6 7
  8 8 8
● Each row is a dense vector for one word in the vocabulary
● These word vectors contain better semantic and syntactic information than
other dense vectors (will be shown later)
● These word vectors perform better across a wide range of NLP tasks (will be
shown later)
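In practice one would rarely implement this by hand; as a hedged illustration, training CBOW with the gensim library (assuming gensim 4.x is installed; the toy corpus below is an assumption):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "chills", "on", "a", "mat"],
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

# sg=0 selects the CBOW architecture (sg=1 would select Skip-Gram).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["cat"][:5])                  # the learned dense vector (first 5 dims)
print(model.wv.most_similar("cat", topn=3))
```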
Skip Gram model (2013): the mirror image of CBOW, predicting the surrounding context words given the current (middle) word.
GloVe: Global Vectors for Word
Representation, Pennington et al. (2014)
Motivation:
ice - steam = ( solid, gas, water, fashion ) ?
● A distributional model should capture words that
appear with “ice” but not with “steam”.
● Hence it should do well on the semantic analogy task (explained
later)
GloVe: Global Vectors for Word
Representation, Pennington et al. (2014)
Starts from a co-occurrence matrix X:
P( solid | ice ) = X_solid,ice / X_ice
GloVe: Global Vectors for Word
Representation, Pennington et al. (2014)
Optimize the objective function (a weighted least-squares fit to the log co-occurrence counts):
J = Σ_i,j f(X_ij) ( wi · w̃j + bi + b̃j - log X_ij )²
wi : word vector of word i
P_ik : probability that word k occurs in the context of word i
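A small sketch of the weighting function f and a single term of this objective (x_max = 100 and α = 0.75 are the defaults reported in the paper):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(X_ij): down-weights rare and very frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss_term(w_i, w_j, b_i, b_j, x_ij):
    """Contribution of one (i, j) cell of the co-occurrence matrix to the GloVe objective."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
```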
Ok, But are word vectors really good ?!
Evaluation of word vectors :
1. Intrinsic evaluation : make sure they encode semantic
information
2. Extrinsic evaluation : make sure they are useful for other NLP
tasks (the hype)
Intrinsic Evaluation of Word Vectors
Word similarity task
Intrinsic Evaluation of Word Vectors
Word similarity task
Results from : GloVe: Global Vectors for Word Representation, Pennington et al. 2014.
Word similarity dataset “WS353”: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
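A sketch of how such a word similarity evaluation could be computed (the vectors dict and human_scores list are assumed inputs, e.g. loaded from pretrained vectors and WS353; scipy is assumed installed):

```python
import numpy as np
from scipy.stats import spearmanr   # scipy is assumed to be installed

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_correlation(vectors, human_scores):
    """Spearman correlation between model similarities and human judgements.

    vectors: dict mapping word -> np.ndarray
    human_scores: list of (word1, word2, similarity_judgement) triples
    """
    model_sims, gold = [], []
    for w1, w2, score in human_scores:
        if w1 in vectors and w2 in vectors:
            model_sims.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearmanr(model_sims, gold).correlation
```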
Intrinsic Evaluation of Word Vectors
Word Analogy task
Intrinsic Evaluation of Word Vectors
Word Analogy task
Evaluation data : https://word2vec.googlecode.com/svn/trunk/questions-words.txt
: capital-world
Abuja Nigeria Accra Ghana
: gram3-comparative
bad worse big bigger
: gram2-opposite
acceptable unacceptable aware unaware
: gram1-adjective-to-adverb
amazing amazingly apparent apparently
Intrinsic Evaluation of Word Vectors
Word Analogy task
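A sketch of the standard vector-arithmetic approach to these analogy questions (the vectors dict is an assumed input, e.g. loaded from pretrained embeddings):

```python
import numpy as np

def analogy(vectors, a, b, c, topn=1):
    """Solve a : b :: c : ?  by finding the word whose vector is closest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):              # exclude the question words themselves
            continue
        scores[word] = (vec / np.linalg.norm(vec)) @ target
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(vectors, "Abuja", "Nigeria", "Accra") should return something like ["Ghana"]
```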
Extrinsic Evaluation of Word Vectors
Part of Speech Tagging :
input : Word Embeddings are cool
output: Noun Noun Verb Adjective
Named Entity recognition :
input : Nous sommes Charlie Hebdo
output: Out Out Person Person
Extrinsic Evaluation of Word Vectors
* systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with 11 word window, 100 unit hidden
layer – for 7 weeks! – then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER
“Unsupervised Pretraining”
(the secret sauce)
Problem:
1. Task T1 : few training data (D1)
2. Hand-crafted feature representation of the inputs : R1
3. Machine learning algorithm M1 on T1 using R1 performs badly
Solution:
1. Create a task T2 with lots of available training data (D2)
(unsupervised), but it has to have the same input as T1
2. Solve T2 using (D2) and learn a representation of the inputs (R2)
3. R2 + M1 beats R1 + M1 on task T1
“Unsupervised Pretraining”
(the secret sauce)
But what if we also keep updating the pretrained representation R2 while training M1 on T1 ?
Even better results !!
* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised
training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide
Pretrained Word Vectors Ready for Use
word2vec :
https://code.google.com/p/word2vec/
GloVe :
http://nlp.stanford.edu/projects/glove/
Dependency based :
https://levyomer.wordpress.com/...
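A hedged sketch of loading such a pretrained file, assuming a GloVe-style plain-text format (one word followed by its floats per line); the file name is hypothetical:

```python
import numpy as np

def load_vectors(path):
    """Load word vectors from a whitespace-separated text file: one word + its floats per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

vectors = load_vectors("glove.6B.100d.txt")   # hypothetical local file name
print(vectors["ice"].shape)
```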
Other Types of Embeddings:
Other word embeddings :
● Dependency Based Word Embeddings: Levy et al. 2014 : http://www.aclweb.org.....
● Sentiment Analysis Word Embeddings: http://ai.stanford.edu/~ang/pap.....
Knowledge base embeddings :
● Structured Embeddings (SE) (Bordes et al. ‘11)
● Collective Matrix Factorization (RESCAL) (Nickel et al., ’11)
● Neural Tensor Networks (Socher et al. ‘13)
● TATEC (Garcia-Duran et al., ’14)
Other Types of Embeddings:
Joint embeddings (Text + Knowledge bases):
● Joint Learning of Words and Meaning Representations (Bordes et al. ‘12)
● Knowledge Graph and Text Jointly Embedding (Wang et al. ‘14)
References:
Before Word2Vec:
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive Modeling 5 (1988): 3.
http://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf
Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
References:
Word2vec (CBOW and Skip Gram):
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed representations of words and phrases and their compositionality." In Proceedings of NIPS, 2013.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of NAACL HLT, 2013.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." (2014). http://www-nlp.stanford.edu/pubs/glove.pdf
Further Readings:
Negative sampling: http://papers.nips.cc/paper/....
Energy based learning : http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
Joint learning (learning tasks simultaneously): http://ronan.collobert.com/pub...
Learning Resources
Deep Learning for NLP ( Stanford Course )
http://cs224d.stanford.edu/
Deep Learning for Natural Language Processing (without Magic): NAACL 2013 Tutorial
http://nlp.stanford.edu/courses/NAACL2013/