I first gave this presentation at the Data Science UA meetup in August 2018. Link to the video: https://www.youtube.com/watch?v=Ksg_36ljcQ8
3. What do I do at Grammarly?
1. In the past:
a. word order
b. possessive nouns
c. sentence fragments
d. different types of verb mistakes
e. etc.
2. Now:
a. Paragraph-level checks
4. What is language modeling?
Models that assign probabilities to sequences of words are called language models, or LMs.
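To make the definition concrete, here is a toy sketch (not from the talk): a unigram model that scores a word sequence as the product of per-word probabilities. The probability table is invented purely for illustration.

    # Toy unigram LM: P(w1 ... wn) = P(w1) * ... * P(wn).
    # The probabilities below are made up for this example.
    unigram_p = {
        "the": 0.05, "monkey": 0.001, "is": 0.04,
        "eating": 0.002, "a": 0.06, "banana": 0.0005,
    }

    def sequence_probability(words, p=unigram_p, oov=1e-7):
        # Multiply per-word probabilities; unseen words get a tiny floor.
        prob = 1.0
        for w in words:
            prob *= p.get(w.lower(), oov)
        return prob

    print(sequence_probability("The monkey is eating a banana".split()))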
8. Speech recognition
Which one is the most probable?
1. It’s not easy to wreck a nice beach.
2. It’s not easy to recognize speech.
3. It’s not easy to wreck an ice beach.
12. Applications of language models
1. Text prediction
2. Speech recognition
3. Language identification
4. Machine translation
5. Handwriting recognition
6. Error correction
7. etc.
14. Language corpora
A corpus is a body of text.
Some popular English and Ukrainian corpora:
- Gutenberg Dataset (en)
- Wikipedia corpus (en)
- UberText (ua)
- your custom corpus
- ...
16. Assigning a probability to a sentence
Our sentence, s: That monkey made a smart move!
Our corpus contains N sentences (say, 10 000).
P(s) = c(s) / N, where c(s) is the number of times s occurs in the corpus.
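A minimal sketch of this count-based estimate, with a made-up three-sentence corpus standing in for the 10 000:

    # P(s) = c(s) / N: count exact occurrences of the sentence in the corpus.
    corpus = [
        "That monkey made a smart move!",
        "It is raining.",
        "That monkey made a smart move!",
    ]  # stand-in for a real corpus of N sentences

    def sentence_probability(s, corpus):
        return corpus.count(s) / len(corpus)

    print(sentence_probability("That monkey made a smart move!", corpus))  # 2/3

The obvious flaw shows up immediately: any sentence that never occurs verbatim in the corpus, however plausible, gets probability 0.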
36. Ngrams
An n-gram is a sequence of N words.
The monkey is eating a banana!
- unigram: The, monkey, is, eating, a, banana, !
- bigram: <s> The, The monkey, monkey is, is eating, eating a, a banana,...
- trigram: <s> <s> The, <s> The monkey, The monkey is, monkey is eating,...
- ...
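A small sketch of n-gram extraction with the <s> padding used above:

    def ngrams(tokens, n):
        # Left-pad with n-1 sentence-start markers, as on the slide.
        padded = ["<s>"] * (n - 1) + tokens
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    tokens = ["The", "monkey", "is", "eating", "a", "banana", "!"]
    print(ngrams(tokens, 2))  # ('<s>', 'The'), ('The', 'monkey'), ...
    print(ngrams(tokens, 3))  # ('<s>', '<s>', 'The'), ('<s>', 'The', 'monkey'), ...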
37. Ngrams
From the corpus (of size 50 000) we get:
Bigram counts:
<s> The 5 678
The monkey 97
monkey is 65
is eating 3 440
eating a 1 675
... ...
Trigram counts:
<s> <s> The 5 678
<s> The monkey 3
The monkey is 0
monkey is eating 8
is eating a 457
... ...
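From counts like these, a maximum-likelihood model estimates conditional probabilities, e.g. P(is | The monkey) = c(The monkey is) / c(The monkey). A sketch with the numbers above:

    # MLE trigram estimate from the slide's counts.
    bigram_c  = {("The", "monkey"): 97}
    trigram_c = {("The", "monkey", "is"): 0}

    p = trigram_c[("The", "monkey", "is")] / bigram_c[("The", "monkey")]
    print(p)  # 0.0: one unseen trigram zeroes out the whole sentence probability

That zero is exactly the problem smoothing (below) exists to fix.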
45. Statistical LM: challenges
● They do not generalize
○ red car = 2 390, blue car = 1 132, purple car = 0
● They need intricate smoothing techniques
○ e.g., a fixed backoff order has to be designed by hand
● They don't capture long-range dependencies
○ That smart monkey, which I told you about, was also sitting on my car!
● Scaling to larger n-grams is very expensive
○ the number of possible n-grams over a vocabulary V is |V|^n
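As a taste of what smoothing does, here is add-one (Laplace) smoothing for bigrams, one of the simplest schemes; real systems use fancier ones such as Katz backoff or Kneser-Ney. The unigram counts and vocabulary size below are made up:

    # Add-one smoothing: every bigram gets a nonzero probability.
    def laplace_p(w2, w1, bigram_c, unigram_c, vocab_size):
        return (bigram_c.get((w1, w2), 0) + 1) / (unigram_c.get(w1, 0) + vocab_size)

    bigram_c = {("red", "car"): 2390, ("blue", "car"): 1132}  # counts from the slide
    unigram_c = {"red": 5000, "blue": 3000, "purple": 40}     # invented unigram counts
    print(laplace_p("car", "purple", bigram_c, unigram_c, vocab_size=20000))  # small but > 0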
48. One-hot encodings
● Sparse vectors of size |V| (the vocabulary size)
image from https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
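A toy construction of such vectors, with the vocabulary cut down to four words:

    vocab = ["red", "blue", "purple", "car"]

    def one_hot(word, vocab):
        vec = [0] * len(vocab)        # sparse: all zeros...
        vec[vocab.index(word)] = 1    # ...except a single 1
        return vec

    print(one_hot("blue", vocab))  # [0, 1, 0, 0]

Every pair of one-hot vectors is equally far apart, so they encode no notion of similarity; that is what dense word vectors fix on the next slide.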
49. Challenge accepted
● They do not generalize
○ red, blue, and black appear in similar contexts
● Intricate smoothing techniques
○ no need for additional smoothing since we use word vectors and backprop
● They don't capture long-range dependencies
○ That smart monkey, which I told you about, was also sitting on my car!
● Scaling to larger n-grams is very expensive
○ a recurrent model conditions on the whole history, so there is no fixed n to scale
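A made-up illustration of the first point: dense word vectors let the model see that red and blue behave alike, because words used in similar contexts end up with nearby vectors. The 3-d vectors below are invented; real ones are learned during training.

    import math

    vec = {
        "red":  [0.9, 0.1, 0.0],
        "blue": [0.8, 0.2, 0.1],
        "car":  [0.1, 0.9, 0.4],
    }

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    def cosine(a, b):
        return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

    print(cosine(vec["red"], vec["blue"]))  # ~0.98: similar contexts
    print(cosine(vec["red"], vec["car"]))   # ~0.20: different contexts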
50. Neural LM: challenges
● Generalize better :-)
○ brown horse, white horse, green horse ?!?
● Take a long time to train
● Very expensive
51. You try it
● KenLM
○ https://github.com/kpu/kenlm
● Simple RNN language model
○ https://github.com/pytorch/examples/tree/master/word_language_model
● LSTM by Salesforce
○ https://github.com/salesforce/awd-lstm-lm
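If you go the KenLM route, scoring with the Python bindings looks roughly like this (assuming you have installed the kenlm module and trained an ARPA model with KenLM's lmplz tool; model.arpa below is a placeholder path):

    import kenlm

    model = kenlm.Model("model.arpa")  # placeholder: path to your trained model

    # score() returns the log10 probability of the sentence
    # (sentence-start/end markers are added by default).
    print(model.score("it is not easy to recognize speech"))
    print(model.score("it is not easy to wreck a nice beach"))

Higher (less negative) scores mean the model finds the sentence more probable, which is exactly the speech recognition disambiguation from earlier.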
54. Let’s have some fun ;-)
from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
● Baby name generation:
○ Alessia, Mareanne, Chrestina, Hi, Saddie
● Leo Tolstoy’s War and Peace:
○ "Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.
● The complete works of Shakespeare:
○ PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
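Those samples come from character-level recurrent models (see the linked post). As a much simpler stand-in that captures the same train-on-text, sample-one-character-at-a-time flavor, here is a character-level Markov chain sketch; warandpeace.txt is a placeholder for any training text.

    import random
    from collections import defaultdict

    def train(text, order=4):
        # Record which character follows each length-`order` context.
        model = defaultdict(list)
        for i in range(len(text) - order):
            model[text[i:i + order]].append(text[i + order])
        return model

    def sample(model, seed, order=4, length=200):
        out = seed
        for _ in range(length):
            choices = model.get(out[-order:])
            if not choices:
                break
            out += random.choice(choices)
        return out

    # text = open("warandpeace.txt").read()  # placeholder training corpus
    # print(sample(train(text), seed="The "))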
57. Reading list
1. Language Modeling with N-grams, Dan Jurafsky and James H. Martin
2. Course notes for NLP by Michael Collins
3. Smoothing for statistical LM
4. Recurrent Neural Network Tutorial
5. Neural Network Methods for NLP, Yoav Goldberg, chapters 9, 13-15