2. OUTLINE
• Why Neural Methods for Natural Language Processing?
• Local vs. Global Word Representations
• Unsupervised Learning for Word Representations
• Including Sub-word Information
• Language Models for Contextual Representations
4. WHY NEURAL METHODS FOR NATURAL LANGUAGE PROCESSING?
• Natural Language Processing (NLP) has moved largely to neural methods in recent years
5. WHAT IS "MODERN" NATURAL LANGUAGE PROCESSING?
• Natural Language Processing (NLP) has moved largely to neural methods in recent years
• Traditional NLP builds on years of research into language representation
• Theoretical foundations can lead to model rigidity
• Tasks often rely on manually generated and curated dictionaries and thesauruses
• Built upon local word representations
6. WHAT IS "MODERN" NATURAL LANGUAGE PROCESSING?
• Natural Language Processing (NLP) has moved largely to neural methods in recent years
• Few to no assumptions need to be made
• Model architectures purpose-built for tasks
• Very active area of research, with most work open-source
• Ability to learn global and contextualized word representations
18. UNSUPERVISED LEARNING FOR WORD REPRESENTATIONS
• We generate a seemingly endless amount of text data each day
• ~460k Tweets every minute
• ~510k Facebook posts every minute
• We have accumulated vast amounts of text data in online repositories
• Wikipedia has 5.8M (English) articles
https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#36a98a9260ba
19. UNSUPERVISED LEARNING FOR WORD REPRESENTATIONS
• We generate a seemingly endless amount of text data each day
• ~460k Tweets every minute
• ~510k Facebook posts every minute
• We have accumulated vast amounts of text data in online repositories
• Wikipedia has 5.8M (English) articles
1.6M words!
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
32. COMPUTING EFFICIENT WORD REPRESENTATIONS
• Our goal is to learn all of the elements in the dense matrix on the right
33. COMPUTING EFFICIENT WORD REPRESENTATIONS
• To do this, we treat this matrix as a matrix of model weights, which will be optimized
34. COMPUTING EFFICIENT WORD REPRESENTATIONS
• To optimize, we start with a single example from our training set:
(fox, quick)
35. COMPUTING EFFICIENT WORD REPRESENTATIONS
• Grab the current vector of weights for those words:
(fox, quick)
36. COMPUTING EFFICIENT WORD REPRESENTATIONS
• Use these vectors to calculate the probability of the context word, given the center word
(fox, quick)
37. COMPUTING EFFICIENT WORD REPRESENTATIONS
• Then compare that probability to the true label (0 or 1), and update the weights accordingly (a minimal sketch of this training step follows below)
(fox, quick)
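The update described on slides 34–37 can be made concrete with a short NumPy sketch of one skip-gram training step under the negative-sampling (0/1 label) view. The vocabulary, embedding size, learning rate, and the train_step helper below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 8  # illustrative embedding size
# Two weight matrices: one for center words, one for context words
W_center = rng.normal(scale=0.1, size=(len(vocab), dim))
W_context = rng.normal(scale=0.1, size=(len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(center, context, label, lr=0.05):
    """One (center, context) update: predict P(context | center), compare to
    the true label (1 = observed pair, 0 = negative sample), nudge both vectors."""
    c = word_to_id[center]
    o = word_to_id[context]
    v_c, v_o = W_center[c], W_context[o]

    p = sigmoid(v_c @ v_o)   # predicted probability that the pair is real
    error = p - label        # gradient of the logistic loss w.r.t. the score

    W_center[c]  -= lr * error * v_o
    W_context[o] -= lr * error * v_c
    return p

# True pair from the running example, plus one sampled "negative" pair
train_step("fox", "quick", label=1)
train_step("fox", "lazy", label=0)
```

Repeating this step over all training pairs (and sampled negatives) is what gradually fills in the dense weight matrix described on slides 32–33.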
41. INCLUDING SUB-WORD INFORMATION
• Word2vec really opened the doors for computing global representations of words, which can then be used for a variety of different tasks, but…
… is it possible to make it better?
42. INCLUDING SUB-WORD INFORMATION
• Word2vec really opened the doors for computing global representations of words, which can then be used for a variety of different tasks, but…
… is it possible to make it better?
• YES! What if we included sub-word information?
https://fasttext.cc/
45. INCLUDING SUB-WORD INFORMATION
• FastText seeks to learn a vector for each sequence of characters (character n-gram) within a word, plus the bracketed word itself
<quick>
<qu
qui
uic
ick
ck>
46. INCLUDING SUB-WORD INFORMATION
• Each word vector is then the sum of its sub-word vectors (see the sketch below):
v(<quick>) + v(<qu) + v(qui) + v(uic) + v(ick) + v(ck>) = quick
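A minimal NumPy sketch of this decomposition and summation, under illustrative assumptions: fixed 3-character n-grams, randomly initialized sub-word vectors, and hypothetical helper names (the real FastText uses a range of n-gram lengths and learns these vectors during training):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking the word boundaries,
    plus the whole bracketed word itself (as in the <quick> example above)."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

rng = np.random.default_rng(0)
dim = 8
subword_vectors = {}  # hypothetical lookup table; normally learned weights

def subword_vector(gram):
    if gram not in subword_vectors:
        subword_vectors[gram] = rng.normal(scale=0.1, size=dim)
    return subword_vectors[gram]

def word_vector(word):
    """The word's vector is the sum of its sub-word vectors."""
    return np.sum([subword_vector(g) for g in char_ngrams(word)], axis=0)

print(char_ngrams("quick"))        # ['<qu', 'qui', 'uic', 'ick', 'ck>', '<quick>']
print(word_vector("quick").shape)  # (8,)
```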
47. INCLUDING SUB-WORD INFORMATION
(the, quick)
(the, brown)
(quick, the)
(quick, brown)
(quick, fox)
(brown, the)
(brown, quick)
(brown, fox)
(brown, jumps)
(fox, quick)
(fox, brown)
(fox, jumps)
(fox, over)
...
• From there, FastText uses the same skip-gram model for training as word2vec (with low-level optimizations); generating the (center, context) pairs above is sketched below
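The pair list above is consistent with a context window of 2; a small Python sketch of that pair generation (the window size and helper name are inferred, not stated on the slide):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs like those listed above,
    pairing each word with its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(sentence)[:8]:
    print(pair)
# ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...
```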
48. INCLUDING SUB-WORD INFORMATION
• Benefits of FastText
• Learns useful representations of prefixes and suffixes
• Learns useful word roots
• Out-of-vocabulary inference (see the sketch below)!
• Drawbacks of FastText
• Very large number of model parameters
• Known to be difficult to tune
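One way to see the out-of-vocabulary benefit in practice is with the gensim library's FastText implementation (gensim is not mentioned in the slides; the toy corpus and parameter values below are purely illustrative, assuming the gensim 4.x API):

```python
from gensim.models import FastText

# Toy corpus; real models are trained on far larger text collections
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

model = FastText(vector_size=50, window=2, min_count=1, min_n=3, max_n=5)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=len(sentences), epochs=10)

# "quickest" never appears in the corpus, but it shares character n-grams
# with "quick", so FastText can still produce a vector for it
print("quickest" in model.wv.key_to_index)   # False: out of vocabulary
print(model.wv["quickest"].shape)            # (50,): built from sub-word vectors
```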
50. THE NEED FOR CONTEXTUAL REPRESENTATIONS
• Skip-gram models are fantastic for computing a single, fixed representation for a given word or token, but…
… words can have multiple meanings depending on the context…
51. THE NEED FOR CONTEXTUAL REPRESENTATIONS
• Skip-gram models are fantastic for computing a single, fixed representation for a given word or token, but…
… words can have multiple meanings depending on the context…
Context matters.
55. DEEP CONTEXTUAL REPRESENTATIONS
• Neural language models can also be stacked, creating multiple representations
56. DEEP CONTEXTUAL REPRESENTATIONS
• Neural language models can also be stacked, creating multiple representations
fox = (layer 1 vector) + (layer 2 vector) + (layer 3 vector) + …
57. DEEP CONTEXTUAL REPRESENTATIONS
• Neural language models can also be stacked, creating multiple representations
fox = (layer 1 vector) + (layer 2 vector) + (layer 3 vector) + … (combination sketched below)
https://allennlp.org/elmo, https://arxiv.org/abs/1810.04805
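A rough NumPy illustration of combining stacked layer outputs into one token vector, in the spirit of ELMo's softmax-weighted sum over layers; the layer count, dimensions, and random values below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

num_layers, dim = 3, 8          # e.g. an embedding layer plus two stacked LM layers
# Hypothetical per-layer representations of the token "fox" in one sentence
layer_outputs = rng.normal(size=(num_layers, dim))

# Scalar mixing weights (learned in ELMo, random here), softmax-normalized,
# plus a global scale factor gamma
s = rng.normal(size=num_layers)
weights = np.exp(s) / np.exp(s).sum()
gamma = 1.0

fox_vector = gamma * (weights[:, None] * layer_outputs).sum(axis=0)
print(fox_vector.shape)   # (8,): one contextual vector for "fox"
```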