This document discusses various probabilistic language models used in natural language processing applications. It covers n-gram models like bigram and trigram models used for tasks like speech recognition. It describes how probabilistic language models assign probabilities to strings of text based on counting word occurrences. It also discusses techniques like additive smoothing and linear interpolation that are used to handle zero probability word pairs in n-gram models. Finally, it introduces probabilistic context-free grammars which use rewrite rules with associated probabilities to model language structure.
2. SYLLABUS
AI applications
Language Models
Probabilistic Language Models
Information Retrieval
Information Extraction
Natural Language Processing
Machine Translation
Speech Recognition
Robot
Hardware – Perception
Planning – Moving
3. PROBABILISTIC LANGUAGE PROCESSING
A corpus-based approach to understanding language.
A corpus (plural corpora) is a large collection of text, such as the billions of pages that make up the World Wide Web.
The text is written by and for humans; the task of the software is to make it easier for the human to find the right information.
This approach implies the use of statistics and learning to take advantage of the corpus, i.e. probabilistic language models that can be learned from data.
5. Learning is just a matter of counting occurrences, and probability can be used to choose the most likely interpretation.
A probabilistic language model defines a probability distribution over a (possibly infinite) set of strings.
E.g., the bigram and trigram language models used in speech recognition.
6. MODELS
A unigram model assigns a probability P(w) to each word in the lexicon.
The model assumes that words are chosen independently, so the probability of a string is just the product of the probabilities of its words:
$\prod_i P(w_i)$
A bigram model assigns a probability $P(w_i \mid w_{i-1})$ to each word given the previous word, so the probability of a string is
$\prod_i P(w_i \mid w_{i-1})$
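To make the counting view concrete, here is a minimal Python sketch (not from the original slides; the toy corpus and function names are invented for illustration) that estimates unigram and bigram probabilities from counts and scores a string as a product of per-word probabilities.

```python
from collections import Counter

# Toy corpus: an assumption made purely for illustration
corpus = "the cat sat on the mat the dog sat on the log".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    # P(w) = count(w) / N
    return unigram_counts[w] / N

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def string_prob_unigram(words):
    # Unigram model: product of P(w_i), words treated as independent
    p = 1.0
    for w in words:
        p *= p_unigram(w)
    return p

def string_prob_bigram(words):
    # Bigram model: first word scored by its unigram probability,
    # then the product of P(w_i | w_{i-1})
    p = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(string_prob_unigram("the cat sat on the mat".split()))
print(string_prob_bigram("the cat sat on the mat".split()))
```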
7. N-GRAM MODEL
an n-gram model conditions on the previous n - 1 words, assigning a probability
$\prod_i P(w_i \mid w_{i-(n-1)} \dots w_{i-1})$
The models themselves agree: each model assigns its own random string a probability of about
$10^{-10}$ for the trigram model,
$10^{-29}$ for the bigram model, and
$10^{-59}$ for the unigram model.
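As a hypothetical sketch of the general case (maximum-likelihood counting is assumed; the function names and toy data are not from the slides), an n-gram estimate can be obtained by dividing n-gram counts by (n-1)-gram counts:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Counts of every length-n window in the token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def p_ngram(tokens, n, context, word):
    # P(word | context) = count(context + word) / count(context),
    # where context is the previous n-1 words
    return ngram_counts(tokens, n)[context + (word,)] / ngram_counts(tokens, n - 1)[context]

tokens = "the cat sat on the mat".split()
print(p_ngram(tokens, 3, ("the", "cat"), "sat"))  # trigram estimate -> 1.0
```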
8. SMOOTHING
Even in a large corpus, many possible word pairs will have a count of zero.
We need some way of smoothing over the zero counts.
ADD-ONE SMOOTHING:
The simplest way to do this is called add-one smoothing: add one to the count of every possible bigram.
So if there are N words in the corpus and B possible bigrams, then each bigram with an actual count of c is assigned a probability estimate of
$(c + 1)/(N + B)$
This method eliminates the problem of zero-probability n-grams, but the assumption that every count should be incremented by exactly one is questionable.
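A minimal sketch of the estimate above (the toy corpus is an assumption, and B is taken here to be the squared vocabulary size):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()   # toy corpus (assumption)
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)

N = len(bigrams)              # bigrams observed in the corpus
B = len(set(tokens)) ** 2     # possible bigrams over the vocabulary

def p_add_one(prev, w):
    # A bigram with actual count c gets probability (c + 1) / (N + B)
    c = counts[(prev, w)]
    return (c + 1) / (N + B)

print(p_add_one("the", "cat"))   # seen bigram
print(p_add_one("cat", "mat"))   # unseen bigram, but no longer zero
```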
9. LINEAR INTERPOLATION SMOOTHING
Another approach combines trigram, bigram, and unigram models by linear interpolation:
$\hat{P}(w_i \mid w_{i-2} w_{i-1}) = c_3 P(w_i \mid w_{i-2} w_{i-1}) + c_2 P(w_i \mid w_{i-1}) + c_1 P(w_i)$
where $c_3 + c_2 + c_1 = 1$.
The parameters $c_i$ can be fixed, or they can be trained with an EM algorithm.
It is also possible to make the values of $c_i$ dependent on the n-gram counts, so that we place a higher weight on the probability estimates derived from higher counts.
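A small illustrative sketch of the interpolated estimate (the weights are arbitrary fixed values, not from the slides):

```python
def p_interpolated(p_tri, p_bi, p_uni, c3=0.6, c2=0.3, c1=0.1):
    # P_hat(w_i | w_{i-2} w_{i-1}) = c3*P_trigram + c2*P_bigram + c1*P_unigram,
    # with c3 + c2 + c1 = 1
    assert abs((c3 + c2 + c1) - 1.0) < 1e-9
    return c3 * p_tri + c2 * p_bi + c1 * p_uni

# A zero trigram estimate is smoothed by the lower-order models:
print(p_interpolated(p_tri=0.0, p_bi=0.2, p_uni=0.05))   # ~0.065
```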
10. VITERBI EQUATION
It takes as input a unigram word probability distribution, P(word), and a string.
Then, for each position i in the string, it stores in best[i] the probability of the most probable word sequence spanning the string from the start up to position i.
It also stores in words[i] the word ending at position i that yielded that best probability.
Once it has built up the best and words arrays in a dynamic-programming fashion, it works backwards through words to find the best path.
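A minimal Python sketch of the procedure described above, assuming a word-segmentation setting; the unigram distribution P and the unsegmented string are hypothetical:

```python
def viterbi_segment(text, P):
    # best[i]: probability of the best word sequence covering text[:i]
    # words[i]: the word ending at position i that achieved best[i]
    n = len(text)
    best = [1.0] + [0.0] * n
    words = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            w = text[j:i]
            if w in P and best[j] * P[w] > best[i]:
                best[i] = best[j] * P[w]
                words[i] = w
    # Work backwards through words to recover the best path
    sequence, i = [], n
    while i > 0:
        sequence.insert(0, words[i])
        i -= len(words[i])
    return sequence, best[n]

# Hypothetical unigram probabilities and an unsegmented string:
P = {"now": 0.02, "is": 0.06, "the": 0.07, "time": 0.01}
print(viterbi_segment("nowisthetime", P))   # (['now', 'is', 'the', 'time'], ~8.4e-07)
```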
11. PROBABILISTIC CONTEXT-FREE GRAMMARS
N-gram models take advantage of co-occurrence statistics in the corpus, but they have no notion of grammar at distances greater than n words.
An alternative language model is the PROBABILISTIC CONTEXT-FREE GRAMMAR (PCFG), which consists of a CFG in which each rewrite rule has an associated probability.
The sum of the probabilities across all rules with the
same left-hand side is 1
12. PROBABILISTIC CONTEXT-FREE GRAMMAR (PCFG) AND LEXICON
Note: the numbers in square brackets indicate the probability that a left-hand-side symbol will be rewritten with the corresponding rule.
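The grammar and lexicon figure from the original slide is not reproduced here; the following is a small hypothetical PCFG (rules and probabilities invented for illustration) showing the bracket-style rule probabilities and the constraint that rules sharing a left-hand side sum to 1.

```python
# Hypothetical PCFG: each rule carries the probability (the bracketed number
# on the slide) that its left-hand-side symbol is rewritten with that rule.
pcfg = {
    "S":  [(("NP", "VP"), 1.00)],
    "NP": [(("Det", "Noun"), 0.60), (("Noun",), 0.40)],
    "VP": [(("Verb", "NP"), 0.70), (("Verb",), 0.30)],
}

# The probabilities of all rules with the same left-hand side sum to 1.
for lhs, rules in pcfg.items():
    assert abs(sum(p for _, p in rules) - 1.0) < 1e-9
```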
13. PARSE TREE
The probability of a string, P(words), is just the sum of the probabilities of its parse trees.
The probability of a given tree is the product of the probabilities of all the rules that make up the nodes of the tree.
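A tiny worked sketch using the hypothetical grammar above (lexicon rule probabilities are omitted for brevity): the probability of one parse tree is the product of the probabilities of the rules at its nodes, and P(words) would sum such products over all parse trees of the string.

```python
# Rules applied at the nodes of one hypothetical parse tree:
#   S -> NP VP [1.00], NP -> Noun [0.40], VP -> Verb [0.30]
tree_rule_probs = [1.00, 0.40, 0.30]

p_tree = 1.0
for p in tree_rule_probs:
    p_tree *= p            # product of the rule probabilities at the tree's nodes
print(p_tree)              # ~0.12
```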