1. A Neural Probabilistic Language Model
The Problem:
The fundamental problem in probabilistic language modeling is that the joint distribution of a
large number of discrete variables (words) requires an exponentially large number of free
parameters. This is the 'curse of dimensionality'. It motivates modeling with continuous
variables, where generalization is easier to achieve: the learned function is then locally
smooth, so every point (an n-gram sequence) carries significant information about a
combinatorial number of neighboring points.
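A back-of-the-envelope count makes the blow-up concrete (the vocabulary size and context length below are assumed for illustration, not taken from the paper):

# Illustrative only: size of a full joint table over n discrete words.
V = 100_000   # assumed vocabulary size
n = 5         # assumed context length in words
print(f"{V ** n:.1e} cells in the full joint table")   # ~1.0e+25 free parameters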
The Solution:
The paper presents an effective and computationally efficient probabilistic modeling approach
that fights the curse of dimensionality. It also handles the case where a sequence never seen in
the training data is observed. A neural network model is developed whose parameter set includes
both the vector representation of each word and the parameters of the probability function. The
objective is to find the parameters that minimize the perplexity of the training dataset. The
model thereby learns, jointly, a distributed representation for each word and the probability
function of a sequence expressed in terms of those distributed representations. The network has a
hidden layer with tanh activation and a softmax output layer. For each input of (n-1) previous
word indices, the output of the model is the probability of each of the |V| words in the
vocabulary being the next word.
src: Yoshua Bengio et al., A Neural Probabilistic Language Model
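As an illustration of this architecture (not the authors' code), a minimal numpy forward pass is sketched below; all layer sizes are assumptions, and the optional direct connections from the input to the output layer that the paper also allows are omitted here.

import numpy as np

# Minimal forward pass of the neural LM described above (sizes are illustrative).
V, m, h, n = 10_000, 60, 50, 4   # vocab size, embedding dim, hidden units, n-gram order (assumed)

rng = np.random.default_rng(0)
C = rng.normal(0, 0.01, (V, m))            # shared word feature vectors (embeddings)
H = rng.normal(0, 0.01, (h, (n - 1) * m))  # input-to-hidden weights
d = np.zeros(h)                            # hidden bias
U = rng.normal(0, 0.01, (V, h))            # hidden-to-output weights
b = np.zeros(V)                            # output bias

def next_word_probs(context_ids):
    """P(w_t | previous n-1 words) for all |V| words, given (n-1) previous word indices."""
    x = C[context_ids].reshape(-1)         # concatenate the (n-1) word vectors
    a = np.tanh(d + H @ x)                 # tanh hidden layer
    y = b + U @ a                          # unnormalized scores (direct W x term omitted)
    e = np.exp(y - y.max())
    return e / e.sum()                     # softmax over the vocabulary

p = next_word_probs([12, 7, 431])          # three previous word indices for a 4-gram model
assert abs(p.sum() - 1.0) < 1e-6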
2. The Significance:
This model is capable of taking advantage of longer contexts. Traditional n-gram based models
partly mitigate the problem of unseen sequences by gluing together overlapping shorter sequences,
but they can only account for short contexts. With a continuous representation in which each word
has a vector representation, it becomes possible to estimate probabilities for sequences never
seen in the training corpus. The number of parameters of the probability function grows only
linearly with the size of the vocabulary and linearly with the dimension of the vector
representation, so the curse of dimensionality is avoided: an exponential number of free
parameters is no longer needed. An extension of this work presents an architecture that outputs
an energy function instead of probabilities and also handles out-of-vocabulary words.
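To make the scaling claim concrete, here is a rough parameter count for the sketch above (the sizes are again illustrative assumptions, not the paper's configurations):

# Parameter count for the sketch above: linear in |V| and in the embedding dimension m.
V, m, h, n = 10_000, 60, 50, 4
embedding_params = V * m                  # word feature vectors C
hidden_params    = h * (n - 1) * m + h    # H and d
output_params    = V * h + V              # U and b
total = embedding_params + hidden_params + output_params
print(total)   # ~1.1 million, versus V**n = 1e16 cells for a full 4-gram table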
The development of algorithms that enable computers to automatically process text and
natural language has always been one of the great challenges in Artificial Intelligence.
In recent years, this research direction has increasingly gained importance, not least
due to the advent of the World Wide Web, which has amplified the need for intelligent
text and language processing. The demand for computer systems that manage, filter and
search through huge repositories of text documents has created a whole new industry,
as has the demand for smart and personalized interfaces. Consequently, any substantial
progress in this domain will have a strong impact on numerous applications ranging from
information retrieval, information filtering, and intelligent agents, to speech recognition,
machine translation, and human-machine interaction.
There are two schools of thought: On one side, there is the traditional linguistics school,
which assumes that linguistic theory and logic can instruct computers to “learn” a language.
On the other side, there is a statistically-oriented community, which believes that machines
can learn (about) natural language from training data such as document collections and text
corpora. This paper follows the latter approach and presents a novel method for learning the
meaning of words in a purely data-driven fashion. The proposed unsupervised learning
technique called Probabilistic Latent Semantic Analysis (PLSA) aims at identifying and
distinguishing between different contexts of word usage without recourse to a dictionary or
thesaurus. This has at least two important implications: Firstly, it allows us to disambiguate
polysemes, i.e., words with multiple meanings, and essentially every word is polysemous.
Secondly, it reveals topical similarities by grouping together words that are part of a common
context. As a special case this includes synonyms, i.e., words with identical or almost
identical meaning.
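The mechanics behind this are the aspect model P(d, w) = sum_z P(z) P(d|z) P(w|z), fitted with EM. A toy, self-contained sketch (the term-document counts and the number of topics below are invented for illustration, this is not the paper's code) might look like this:

import numpy as np

# PLSA aspect model fitted with EM on a toy term-document count matrix.
rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 6, 8, 2
N = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)    # word counts n(d, w)

# Random initialization of P(z), P(d|z), P(w|z)
Pz  = np.full(n_topics, 1.0 / n_topics)
Pdz = rng.random((n_topics, n_docs));  Pdz /= Pdz.sum(axis=1, keepdims=True)
Pwz = rng.random((n_topics, n_words)); Pwz /= Pwz.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z | d, w) proportional to P(z) P(d|z) P(w|z)
    joint = Pz[:, None, None] * Pdz[:, :, None] * Pwz[:, None, :]    # shape (z, d, w)
    Pz_dw = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
    # M-step: re-estimate the three distributions from expected counts n(d, w) P(z|d, w)
    weighted = N[None, :, :] * Pz_dw
    Pwz = weighted.sum(axis=1); Pwz /= Pwz.sum(axis=1, keepdims=True) + 1e-12
    Pdz = weighted.sum(axis=2); Pdz /= Pdz.sum(axis=1, keepdims=True) + 1e-12
    Pz  = weighted.sum(axis=(1, 2)); Pz /= Pz.sum() + 1e-12

# Words with the highest P(w|z) per topic indicate the induced "contexts of word usage".
print(np.argsort(-Pwz, axis=1)[:, :3])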
Probabilistic Language Models
A popular idea in computational linguistics is to create a probabilistic model of
language. Such a model assigns a probability to every sentence in English in such a
way that more likely sentences (in some sense) get higher probability. If you are
unsure between two possible sentences, pick the higher probability one.
3. Comment: A "perfect" language model is only attainable with true intelligence.
However, approximate language models are often easy to create and good enough for
many applications.
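As a toy illustration of "pick the higher probability one", the sketch below estimates a bigram model from a made-up corpus and scores two candidate sentences (all data here is invented for the example):

from collections import Counter

# Toy bigram model: score two candidate sentences and keep the more probable one.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def p_bigram(w_prev, w):
    # Add-one (Laplace) smoothing so unseen bigrams still get non-zero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def sentence_prob(words):
    p = 1.0
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

s1 = "the cat sat on the mat".split()
s2 = "the mat sat on the cat".split()
print(sentence_prob(s1), sentence_prob(s2))   # the first sentence scores higher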
Some models:
unigram: words generated one at a time, drawn from a fixed distribution.
bigram: probability of word depends on previous word.
tag bigram: probability of part of speech depends on previous part of speech,
probability of word depends on part of speech.
maximum entropy: many other, arbitrary features of the context can contribute.
stochastic context free: words generated by a context-free grammar
augmented with probabilistic rewrite rules.
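As a small illustration of the last item, the sketch below samples a sentence from a toy stochastic context-free grammar; the grammar and its rule probabilities are invented for the example.

import random

# Toy stochastic CFG: each nonterminal rewrites by one of its rules,
# chosen with the given probability.
grammar = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.6, ["Det", "N"]), (0.4, ["N"])],
    "VP":  [(0.7, ["V", "NP"]), (0.3, ["V"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["dog"]), (0.5, ["cat"])],
    "V":   [(0.5, ["sees"]), (0.5, ["chases"])],
}

def generate(symbol="S"):
    """Expand a symbol top-down; the derivation's probability is the product of rule probabilities."""
    if symbol not in grammar:                        # terminal: emit the word
        return [symbol], 1.0
    probs, rhss = zip(*grammar[symbol])
    rhs = random.choices(rhss, weights=probs, k=1)[0]
    p = probs[rhss.index(rhs)]
    words = []
    for sym in rhs:
        sub_words, sub_p = generate(sym)
        words += sub_words
        p *= sub_p
    return words, p

words, p = generate()
print(" ".join(words), p)   # e.g. "the dog chases the cat" with its derivation probability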