A Neural Probabilistic
Language Model
Learning a distributed representation for words
2018.08.09 Soo
Contents
● Fundamental Problem of Language Modeling
● Statistical Model of Language
● Neural Probabilistic Language Model
● Result
Fundamental Problem of Language Modeling
● Curse of Dimensionality
○ As the number of features or dimensions grows, the amount of data we need in order to generalize
accurately grows exponentially
The number of free parameters blows up when a discrete space is used to model language
→ Use a continuous space!
Statistical Model of Language
● The conditional probability of the next word given all the previous ones
● The reason for the approximation:
○ Temporally closer words in the word sequence are statistically more dependent → N-gram Models
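In symbols (a standard formulation consistent with the paper, where w_t denotes the t-th word and w_i^j = (w_i, …, w_j)):

P(w_1^T) = \prod_{t=1}^{T} P(w_t \mid w_1^{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}^{t-1})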
Statistical Model of Language
● An N-gram model example
○ How likely is “University” given “New York”?
○ Count all “New York University”
○ Count all “New York ?”: e.g., “New York State”, “New York City”, “New York Fire”, “New York Police”,
“New York Bridges”, …
○ How often does “New York University” occur among these?
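A minimal counting sketch of this estimate in Python (the toy corpus and the helper name trigram_prob are illustrative, not from the slides):

from collections import Counter

def trigram_prob(tokens, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2) from raw trigram counts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    context_total = sum(c for (a, b, _), c in trigrams.items() if (a, b) == (w1, w2))
    if context_total == 0:
        return 0.0  # unseen context: the zero-count problem discussed on the next slide
    return trigrams[(w1, w2, w3)] / context_total

tokens = "new york university is in new york city near new york state".split()
print(trigram_prob(tokens, "new", "york", "university"))  # 1/3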
Problems in Statistical Model of Language
● A new combination of n words appears that was not seen in the training corpus
● Solutions
○ back-off trigram models (Katz, 1987)
○ smoothed (or interpolated) trigram models (Jelinek and Mercer, 1980); see the interpolation example below
● Limits
○ Long-term dependencies: context reaches no farther than 1 or 2 words back
○ No notion of similarity between words
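For example, Jelinek–Mercer interpolation mixes the maximum-likelihood trigram estimate with lower-order estimates (a standard form, not spelled out on the slide):

P_{interp}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_3 \hat{P}(w_t \mid w_{t-2}, w_{t-1}) + \lambda_2 \hat{P}(w_t \mid w_{t-1}) + \lambda_1 \hat{P}(w_t), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1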
Neural Probabilistic Language Model
● Abstract:
○ Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in ℝ^m)
○ express the joint probability function of word sequences in terms of the feature vectors of these
words in the sequence
○ learn simultaneously the word feature vectors and the parameters of that probability function
Neural Probabilistic Language Model
1. Word Feature Vector
○ Each word is associated with a point in a vector space (words are embedded into a vector space)
○ The vector dimension m is much smaller than the vocabulary size |V|
2. Probability function
○ Expressed as a product of conditional probabilities; trained by maximizing the log-likelihood
A sentence is a sequence of words
Objective: model the probability of the next word given the previous ones
Two parts:
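In the paper's notation, a sentence is w_1, …, w_T and the goal is to learn

f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1}),

which is decomposed as f(i, w_{t-1}, \ldots, w_{t-n+1}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n+1})), i.e., an embedding map C and a probability function g.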
Objective:
1. Distributed feature vectors
(= Embedding Layer)
In practice, C is a |V| × m matrix whose row i is the feature vector of word i.
C is shared across all word positions in the context.
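A minimal numpy sketch of the shared lookup table C (the sizes and indices are illustrative, not from the slides):

import numpy as np

V, m, n = 10000, 60, 5              # vocab size, embedding dim, context length (example values)
C = np.random.randn(V, m) * 0.01    # |V| x m matrix; row i is the feature vector of word i

context_ids = [42, 7, 7, 991]       # the n-1 previous words; every position uses the same C
x = np.concatenate([C[i] for i in context_ids])  # (n-1)*m input vector for g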
Objective:
2. Probability function g
(= a two-layer neural network)
Context vectors → Concatenate →
Linear (Affine) → Tanh →
Linear (Affine) → Softmax
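A minimal numpy sketch of g following the pipeline above (variable names and sizes are mine; the optional direct word-to-output connections from the paper are omitted):

import numpy as np

V, m, n, h = 10000, 60, 5, 50                  # illustrative sizes: vocab, embed, context, hidden
rng = np.random.default_rng(0)
C = rng.normal(0.0, 0.01, (V, m))              # shared embedding matrix (part 1)
H = rng.normal(0.0, 0.01, (h, (n - 1) * m))    # hidden-layer weights
d = np.zeros(h)                                # hidden bias
U = rng.normal(0.0, 0.01, (V, h))              # output weights
b = np.zeros(V)                                # output bias

def g(context_ids):
    """Context vectors -> concatenate -> affine -> tanh -> affine -> softmax."""
    x = np.concatenate([C[i] for i in context_ids])   # ((n-1)*m,)
    a = np.tanh(d + H @ x)                            # (h,)
    logits = b + U @ a                                # (|V|,)
    e = np.exp(logits - logits.max())                 # numerically stable softmax
    return e / e.sum()                                # P(w_t = i | context) for every i

p_next = g([42, 7, 7, 991])                    # distribution over the whole vocabulary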
Process:
Sizes:
- |V|: vocab size
- h: hidden size
- m: embed size
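The process sketched above is, in the paper's notation (x is the concatenation of the n−1 context feature vectors; the direct connections W may be set to zero):

y = b + W x + U \tanh(d + H x), \qquad \hat{P}(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}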
Loss:
● f: the neural network (its output is the predicted probability of the next word)
● R: regularization term (= weight decay)
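Concretely, training maximizes the regularized average log-likelihood over the corpus:

L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)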
Parameters:
Total number of free parameters (counted below)
Sizes:
- |V|: vocab size
- h: hidden size
- m: embed size
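Counting the weight matrices of the model (b: |V|, d: h, W: |V| × (n−1)m for the optional direct connections, U: |V| × h, H: h × (n−1)m, C: |V| × m), the total is

|V|(1 + nm + h) + h(1 + (n-1)m),

so the dominant term |V|(nm + h) grows only linearly in |V| and in n.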
Test Measurement: Perplexity
A measurement of how well a probability
distribution or probability model (q) predicts a
sample
Lower perplexity means the model q better predicts the held-out data, i.e., the model is less
surprised by the test sample.
Why the geometric average of inverse probabilities?
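Because the test-set probability is a product of many per-word conditional probabilities, taking the geometric mean (equivalently, exponentiating the average negative log-likelihood) gives a length-normalized, per-word score:

PPL(q) = \left( \prod_{t=1}^{N} \frac{1}{q(w_t \mid w_1^{t-1})} \right)^{1/N} = \exp\!\left( -\frac{1}{N} \sum_{t=1}^{N} \log q(w_t \mid w_1^{t-1}) \right)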
Result
References
● Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.
blog: https://simonjisu.github.io
github: https://github.com/simonjisu
