Word embeddings (continued)
• Idea: learn an embedding from words into vectors
• Need to have a function W(word) that returns a
vector encoding that word.
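As a concrete (toy) illustration, W can start out as nothing more than a randomly initialized lookup table. The vocabulary, dimensionality, and values below are made up for illustration; training comes later.

```python
import numpy as np

# Minimal sketch: W as a lookup table from words to dense vectors.
vocab = ["the", "cat", "sat", "on", "floor"]
dim = 50
rng = np.random.default_rng(0)

# One row per word; the rows would be learned later, here just random.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
word_to_index = {w: i for i, w in enumerate(vocab)}

def W(word):
    """Return the embedding vector for `word`."""
    return embedding_matrix[word_to_index[word]]

print(W("cat").shape)  # (50,)
```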
Word embeddings: properties
• Relationships between words correspond to
differences between vectors.
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
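For example, the relation man→woman is approximately a constant difference vector, so king − man + woman lands near queen. A sketch of that test using gensim's pretrained vectors (assuming gensim is installed and the named pretrained model can be downloaded):

```python
# Classic analogy test with small pretrained GloVe vectors via gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen" if man->woman is
# (roughly) a constant difference vector.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```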
Word embeddings: questions
• How big should the embedding space be?
• Trade-offs as in any other machine learning problem:
greater capacity versus efficiency and risk of overfitting.
• How do we find W?
• Often as part of a prediction or classification task
involving neighboring words.
Learning word embeddings
• First attempt:
• Input data is sets of 5 words from a meaningful
sentence, e.g., “one of the best places”. Modify half of
them by replacing the middle word with a random word:
“one of function best places”.
• W is a map (depending on parameters Q) from words to
50-dimensional vectors, e.g., a look-up table or an RNN.
• Feed the 5 embeddings into a module R that classifies the
window as ‘valid’ or ‘invalid’.
• Optimize over Q to improve the predictions (see the sketch
after the references below).
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
https://arxiv.org/ftp/arxiv/papers/1102/1102.1808.pdf
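A minimal sketch of this setup, assuming PyTorch; the window size, dimensions, and toy data are illustrative rather than the original paper's configuration.

```python
# Sketch of the "valid vs. corrupted window" setup.
import torch
import torch.nn as nn

vocab_size, emb_dim, window = 1000, 50, 5

W = nn.Embedding(vocab_size, emb_dim)          # the lookup table W (parameters Q)
R = nn.Sequential(                             # module R: scores a 5-word window
    nn.Linear(window * emb_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def score(window_ids):
    # window_ids: LongTensor of shape (batch, 5)
    emb = W(window_ids).view(window_ids.size(0), -1)  # concatenate the 5 embeddings
    return R(emb).squeeze(-1)                         # higher = more "valid"

# Toy batch: one real window and one with its middle word replaced at random.
real = torch.randint(0, vocab_size, (1, window))
fake = real.clone()
fake[0, window // 2] = torch.randint(0, vocab_size, (1,)).item()

logits = score(torch.cat([real, fake]))
labels = torch.tensor([1.0, 0.0])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()   # gradients flow into both R and the embedding table W
```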
word2vec
• Predict words using context
• Two versions: CBOW (continuous bag of words) and
Skip-gram
https://skymind.ai/wiki/word2vec
CBOW
• Bag of words
• Discards word order; in the discrete case, a text is
represented by counts of the words that appear.
• CBOW
• Takes the vector embeddings of the n words before the target
and the n words after, and adds them (as vectors).
• Also discards word order, but the vector sum is
meaningful enough to deduce the missing word.
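A minimal sketch of the CBOW input computation, with a made-up vocabulary and a randomly initialized embedding matrix:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "floor"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))                 # input embedding matrix

def cbow_hidden(context_words):
    """Average of the context word vectors (word order is discarded)."""
    idx = [word_to_index[w] for w in context_words]
    return W_in[idx].mean(axis=0)

# Predict "sat" from its neighbors in "the cat sat on floor", window = 2:
h = cbow_hidden(["the", "cat", "on", "floor"])
print(h.shape)   # (4,)
```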
Word2vec – Continuous Bag of Words
• E.g., “The cat sat on floor”
• Window size = 2
[Figure: with a window of 2, the context words “the”, “cat”, “on”, “floor” are used to predict the center word “sat”.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: CBOW network for predicting “sat” from the context words “cat” and “on”. Each input word is a V-dimensional one-hot vector, with a 1 at that word’s index in the vocabulary (e.g., the index of “cat”) and 0 elsewhere. The one-hot inputs feed a hidden layer, and the output layer produces “sat”.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: the same network with its weights labeled. A shared V×N matrix W maps each one-hot input (for “cat” and “on”) to the N-dimensional hidden layer, and an N×V matrix W′ maps the hidden layer to the V-dimensional output layer.]
• N will be the size of the word vectors.
• We must learn W and W′.
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: computing the hidden layer. Multiplying the one-hot input by the transposed weight matrix, Wᵀ · x_cat = v_cat, simply selects the column of Wᵀ (equivalently, the row of W) for “cat”, e.g., v_cat = (2.4, 2.6, …, 1.8). The hidden layer is the average of the context vectors: v = (v_cat + v_on) / 2.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
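The matrix–vector product in the figure is just a row lookup; a quick numpy check with toy sizes and random values:

```python
# W^T times a one-hot vector selects the column of W^T (the row of W)
# indexed by that word.
import numpy as np

V, N = 10, 4                      # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # W, one N-dimensional row per word

cat_index = 1
x_cat = np.zeros(V)
x_cat[cat_index] = 1.0            # one-hot vector for "cat"

v_cat = W_in.T @ x_cat            # shape (N,)
assert np.allclose(v_cat, W_in[cat_index])   # same as just indexing row 1
```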
[Figure: the same lookup for the second context word: Wᵀ · x_on = v_on selects the column for “on”, e.g., v_on = (1.8, 2.9, …, 1.9), and again v = (v_cat + v_on) / 2.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: from the hidden layer to the output. The averaged hidden vector v (N-dimensional, where N is the size of the word vectors) is multiplied by the output matrix: W′ · v = z, and ŷ = softmax(z) gives a probability for every word in the vocabulary.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
[Figure: the full output step. W′ · v = z and ŷ = softmax(z), e.g., ŷ = (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, …, 0.00). We would prefer ŷ close to the one-hot target y_sat.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
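Putting the pieces together, a sketch of the full CBOW forward pass and its cross-entropy loss in numpy (toy sizes, random weights):

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))      # input vectors W (one row per word)
W_out = rng.normal(size=(N, V))     # output matrix W'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_idx = [0, 1, 3, 4]          # e.g. "the", "cat", "on", "floor"
target = 2                          # index of "sat"

v = W_in[context_idx].mean(axis=0)  # hidden layer (no nonlinearity)
y_hat = softmax(v @ W_out)          # W' applied to v, then softmax -> (V,)
loss = -np.log(y_hat[target])       # cross-entropy against the one-hot y_sat
print(y_hat.sum(), loss)            # probabilities sum to 1
```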
[Figure: both weight matrices contain word vectors — W (shown as Wᵀ in the figure) and W′ each hold one vector per vocabulary word.]
• We can consider either W or W′ as the word’s representation, or even take the average.
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
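A sketch of the three options with toy matrices; the shapes follow the figures and the values are random:

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))                  # W: one row per word
W_out = rng.normal(size=(N, V))                 # W': one column per word

word_index = 1                                  # e.g. "cat"
rep_from_W = W_in[word_index]                   # option 1: row of W
rep_from_W_prime = W_out[:, word_index]         # option 2: column of W'
rep_avg = (rep_from_W + rep_from_W_prime) / 2   # option 3: average the two
print(rep_avg.shape)                            # (4,)
```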
Some interesting results
[Figure: examples of word2vec results.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
Word analogies
[Figure: word-analogy examples.]
www.cs.ucr.edu/~vagelis/classes/CS242/slides/word2vec.pptx
Skip gram
• Skip gram – alternative to CBOW
• Start from the center word’s embedding and try to predict the
surrounding words.
• A much less well-defined problem, but it works better in
practice (and scales better).
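A minimal training sketch using gensim (assuming gensim 4.x is installed); the two-sentence corpus is only there to show the API, and real training needs far more text:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "floor"],
             ["the", "dog", "sat", "on", "the", "mat"]]

# sg=0 trains CBOW, sg=1 trains skip-gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["cat"].shape)              # (50,)
print(skipgram.wv.similarity("cat", "dog"))  # cosine similarity of two vectors
```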
Skip gram
• Maps from the center word to a probability distribution over
the surrounding words (the linked tutorial shows one
input/output unit).
• There is no activation function on the hidden-layer neurons,
but the output neurons use softmax.
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Skip gram example
• Vocabulary of 10,000 words.
• Embedding vectors with 300 features.
• So the hidden layer is represented by a weight matrix with
10,000 rows and 300 columns (the one-hot input vector
multiplies it on the left); a shape check follows below.
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
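A quick shape check in numpy for this example:

```python
import numpy as np

V, N = 10_000, 300
W_hidden = np.zeros((V, N))          # hidden-layer weight matrix: 10,000 x 300

x = np.zeros(V)
x[42] = 1.0                          # one-hot vector for some word (row 42)
h = x @ W_hidden                     # left-multiplying just selects row 42
print(W_hidden.shape, h.shape)       # (10000, 300) (300,)
```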
Skip gram/CBOW intuition
• Two words with similar “contexts” (that is, the words likely
to appear around them) end up with similar embeddings.
• One way for the network to output similar context
predictions for these two words is if the word
vectors are similar. So, if two words have similar
contexts, then the network is motivated to learn
similar word vectors for these two words!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Word2vec shortcomings
• Problem: 10,000 words and a 300-dimensional embedding give
a large parameter space to learn, and a 10K-word vocabulary
is minimal for real applications.
• Slow to train, and needs lots of data, particularly to
learn uncommon words.
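A back-of-the-envelope count for the example above (two weight matrices, each vocabulary-size by embedding-size):

```python
V, N = 10_000, 300
params = 2 * V * N                   # input matrix W plus output matrix W'
print(f"{params:,}")                 # 6,000,000 weights to learn
```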
Word2vec improvements:
word pairs and phrases
• Idea: Treat common word pairs or phrases as single
“words.”
• E.g., Boston Globe (newspaper) is different from Boston and
Globe separately. Embed Boston Globe as a single
word/phrase.
• Method: make phrases out of words which occur
together often relative to the number of individual
occurrences. Prefer phrases made of infrequent words
in order to avoid making phrases out of common words
like “and the” or “this is”.
• Pros/cons: Increases vocabulary size but decreases
training expense.
• Results: Led to 3 million “words” trained on 100 billion
words from a Google News dataset.
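A sketch of phrase detection with gensim's Phrases model, whose default scoring roughly follows the word2vec paper (pairs that co-occur unusually often relative to their individual counts become one token); the corpus and threshold are toy values:

```python
from gensim.models.phrases import Phrases

sentences = [["the", "boston", "globe", "reported", "the", "story"],
             ["she", "reads", "the", "boston", "globe", "daily"],
             ["boston", "globe", "writers", "won", "awards"]]

# Tiny thresholds because the corpus is tiny; real corpora use larger values.
bigrams = Phrases(sentences, min_count=1, threshold=2.0)
print(bigrams[sentences[0]])   # e.g. ['the', 'boston_globe', 'reported', 'the', 'story']
```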
Word2vec improvements:
subsample frequent words
• Idea: Subsample frequent words to decrease the
number of training examples.
• The probability that we cut the word is related to the word’s
frequency. More common words are cut more.
• Uncommon words (anything < 0.26% of total words) are kept
• E.g., remove some occurrences of “the.”
• Method: for each occurrence of a word, drop it with a
probability that grows with the word’s frequency (one common
formulation is sketched below).
• Benefits: if we have a window size of 10 and we remove a
specific instance of “the” from our text, then as we train on
the remaining words, “the” will not appear in any of their
context windows.
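A sketch of the keep-probability used by the original word2vec C code with its default sample = 1e-3; this is one common formulation (the formula given in the paper differs slightly):

```python
import random

def keep_probability(freq_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word that makes up
    `freq_fraction` of the corpus (word2vec C-code formulation)."""
    return ((freq_fraction / sample) ** 0.5 + 1) * (sample / freq_fraction)

def keep(freq_fraction):
    return random.random() < keep_probability(freq_fraction)

print(keep_probability(0.0026))   # ~1.0: words at/below ~0.26% are always kept
print(keep_probability(0.05))     # ~0.16: very common words are cut most of the time
```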
Word2vec improvements:
selective updates
• Idea: Use “Negative Sampling”, which causes each
training sample to update only a small percentage of
the model’s weights.
• Observation: a “correct output” of the network is a
one-hot vector. That is, one neuron should output a 1,
and all of the other thousands of output neurons should
output a 0.
• Method: with negative sampling, randomly select just
a small number of “negative” words (say, 5) whose weights
will be updated. (In this context, a “negative” word is one
for which we want the network to output a 0.) We still also
update the weights for our “positive” word.
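A sketch of a single skip-gram negative-sampling update in numpy; the indices, sizes, and learning rate are toy values, and the negative words are drawn uniformly here rather than from the smoothed unigram distribution word2vec actually uses:

```python
import numpy as np

V, N, k, lr = 10_000, 300, 5, 0.025        # vocab, dims, negatives, learning rate
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(V, N)) # input word vectors
W_out = np.zeros((V, N))                   # output word vectors (one row per word)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, positive = 42, 7                   # hypothetical center and context word indices
negatives = rng.integers(0, V, size=k)     # k sampled "negative" words

v = W_in[center]
grad_v = np.zeros(N)
for word, label in [(positive, 1.0)] + [(w, 0.0) for w in negatives]:
    score = sigmoid(v @ W_out[word])
    g = score - label                      # gradient of the logistic loss w.r.t. the score
    grad_v += g * W_out[word]              # accumulate gradient for the input vector
    W_out[word] -= lr * g * v              # update only these k+1 output rows
W_in[center] -= lr * grad_v                # one update to the center word's input vector
```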
Word embedding applications
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
• The use of word representations… has
become a key “secret sauce” for the success
of many NLP systems in recent years, across
tasks including named entity recognition,
part-of-speech tagging, parsing, and
semantic role labeling. (Luong et al. (2013))
• Learning a good representation on a task A
and then using it on a task B is one of the
major tricks in the Deep Learning toolbox.
• Pretraining, transfer learning, and multi-task
learning.
• Can allow the representation to learn from more
than one kind of data.
Word embedding applications
• Can learn to map multiple kinds of data
into a single representation.
• E.g., bilingual English and Mandarin
Chinese word-embedding as in Socher et
al. (2013a).
• Embed as above, but words that are
known as close translations should be
close together.
• Words we didn’t know were translations
end up close together!
• Structures of two languages get pulled
into alignment.
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Word embedding applications
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
• Can apply to get a joint embedding of words and
images or other multi-modal data sets.
• New classes map near similar existing classes: e.g.,
if ‘cat’ is unknown, cat images map near dog.
Editor's Notes
• W contains the input word vectors; W′ contains the output word vectors. We can consider either W or W′ as the word's representation, or even take the average.