Hypothesis testing, MLE, language models
Based on some slides of Ilan Gronau,
Ydo Wexler, Dan Geiger & Nir Fridman
•Find the best explanation for the observed data
•Helps predict behavior of similar data sets
An example: Binomial experiments
• Model: The unknown parameter: θ=p(H)
• Data Set: series of experiment results, e.g.
D = H H T H T T T H H …
• Main Assumption: each experiment is independent of the others
Using Likelihood Functions
• The likelihood of a given value for θ :
L_D(θ) = p(D | θ)
• Maximum Likelihood Estimation (MLE) :
We wish to find a value for θ which maximizes the likelihood.
• For example: The likelihood of ‘HTTHH’ is:
L_HTTHH(θ) = p(HTTHH | θ) = θ(1-θ)(1-θ)θθ = θ^3 (1-θ)^2
• We only need to know N(H) (number of Heads) and N(T)
(number of Tails).
• These are sufficient statistics: L_D(θ) = θ^{N(H)} (1-θ)^{N(T)}
• A sufficient statistic is a function of the data
that summarizes the relevant information for
computing the likelihood.
• s(D) is a sufficient statistic if for any two
datasets D and D':
s(D) = s(D') ⇒ L_D(θ) = L_{D'}(θ)
• The likelihood can be computed from the statistics alone.
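The coin example can be sketched in a few lines of Python to show that the likelihood depends on the data only through the counts N(H) and N(T). The function names `sufficient_stats` and `likelihood` are illustrative choices, not taken from the slides.

```python
from collections import Counter

def sufficient_stats(data):
    """Return (N(H), N(T)) for a string of 'H'/'T' outcomes."""
    counts = Counter(data)
    return counts["H"], counts["T"]

def likelihood(theta, n_heads, n_tails):
    """Likelihood of theta given the counts: theta^N(H) * (1 - theta)^N(T)."""
    return theta ** n_heads * (1 - theta) ** n_tails

# Two different sequences with the same counts...
d1, d2 = "HTTHH", "HHHTT"
stats1, stats2 = sufficient_stats(d1), sufficient_stats(d2)

# ...have the same likelihood for every theta we try.
for theta in (0.1, 0.5, 0.6, 0.9):
    same = likelihood(theta, *stats1) == likelihood(theta, *stats2)
    print(theta, likelihood(theta, *stats1), same)
```

The order of the tosses is discarded by `sufficient_stats`, which is exactly why the two sequences cannot be told apart by the likelihood.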
Maximum Likelihood Estimation
• Goal: Maximize the likelihood (or log-likelihood)
• In our example:
• L_D(θ) = θ^{N(H)} (1-θ)^{N(T)}
• l_D(θ) = log(L_D(θ)) = N(H)·log(θ) + N(T)·log(1-θ)
– Maximization of the log-likelihood:
• l_D'(θ) = N(H)/θ - N(T)/(1-θ) = 0  ⇒  θ = N(H) / (N(H) + N(T))
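As a sanity check, the closed-form maximizer N(H) / (N(H) + N(T)) can be compared against a brute-force search over a grid of θ values. This is only a sketch: the grid resolution of 0.001 and the counts (3 heads, 2 tails, from the 'HTTHH' example) are assumptions made for illustration.

```python
import math

def log_likelihood(theta, n_heads, n_tails):
    """l_D(theta) = N(H) * log(theta) + N(T) * log(1 - theta)."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

def mle(n_heads, n_tails):
    """Closed-form maximizer of the binomial log-likelihood."""
    return n_heads / (n_heads + n_tails)

n_h, n_t = 3, 2  # counts for 'HTTHH'
theta_hat = mle(n_h, n_t)

# Brute-force check on the open interval (0, 1), avoiding log(0).
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, n_h, n_t))

print(theta_hat, best_on_grid)
```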
MLE with multiple parameters
• What if we have several parameters θ_1, θ_2, …, θ_K that we
wish to learn?
– die toss (K=6)
– Grades (K=100)
• Sufficient statistics [assumption: a series of independent experiments]:
– N_1, N_2, …, N_K: the number of times each outcome was observed
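For independent draws from K outcomes, the standard maximum likelihood estimate is the relative frequency θ_k = N_k / (N_1 + … + N_K). A minimal sketch for the die-toss case follows; the list of rolls is made-up data and `multinomial_mle` is a hypothetical helper name.

```python
from collections import Counter

def multinomial_mle(observations):
    """Estimate theta_k = N_k / N from a sequence of observed outcomes."""
    counts = Counter(observations)   # the sufficient statistics N_1..N_K
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

rolls = [1, 3, 3, 6, 2, 3, 6, 1, 5, 3]   # made-up die tosses
theta_hat = multinomial_mle(rolls)

for face in range(1, 7):
    # Faces that were never observed get probability 0 under MLE.
    print(face, theta_hat.get(face, 0.0))
```

Note that face 4 never occurs in this sample, so its estimate is zero even though a real die can certainly show it.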
From MLE to Bayesian Inference
• Likelihood Goal: maximize p(D| θ)
• Our Goal: maximize p(θ|D)
• Following Bayes' rule:
p(θ | D) = p(D | θ) · p(θ) / p(D)
where p(D | θ) is the likelihood and p(θ) is the prior probability.
• Intuitively, the prior probability captures our prior
knowledge (prejudice) of the model parameters.
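To make the role of the prior concrete, the posterior for the coin example can be approximated on a grid of θ values by multiplying likelihood and prior and normalizing. Everything specific here is an assumption for illustration: the grid of 99 points, the counts (3 heads, 2 tails), and in particular the prior proportional to θ(1-θ), which simply encodes a mild belief that the coin is close to fair.

```python
def posterior_on_grid(n_heads, n_tails, grid, prior):
    """Return p(theta | D) on a grid: likelihood * prior, normalized."""
    unnormalized = [
        p * theta ** n_heads * (1 - theta) ** n_tails
        for theta, p in zip(grid, prior)
    ]
    evidence = sum(unnormalized)  # plays the role of p(D) on the grid
    return [u / evidence for u in unnormalized]

grid = [i / 100 for i in range(1, 100)]

# Hypothetical prior: weight proportional to theta * (1 - theta).
raw = [t * (1 - t) for t in grid]
raw_total = sum(raw)
prior = [r / raw_total for r in raw]

posterior = posterior_on_grid(3, 2, grid, prior)
best_index = max(range(len(grid)), key=posterior.__getitem__)
theta_map = grid[best_index]

print("MLE:", 3 / 5, "posterior mode on grid:", theta_map)
```

With this prior the posterior mode sits between 0.5 and the maximum likelihood estimate of 0.6, which is the pull toward prior belief that the slide describes.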
MLE in Natural Language Processing (NLP)
• Goal: Evaluate the probability of the next word based on the
words prior to it: P(w_i | w_1, …, w_{i-1})
• Importance: speech recognition, handwritten word
recognition, part-of-speech tagging, language identification,
spam detection, etc.
• Markov Assumption: The probability of a word w_i in a
sequence of words depends only on the n-1 words prior to it
in the sequence, where n is a constant:
P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-n+1}, …, w_{i-1})
MLE in NLP
How do we evaluate P(w_i), P(w_i | w_{i-1}), P(w_i | w_{i-2}, w_{i-1})?
• Proposal: MLE
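The slide's own formula is not preserved in the text, so the sketch below uses the usual relative-frequency estimate for the bigram case, P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). The toy corpus and the name `train_bigram_mle` are invented for illustration.

```python
from collections import Counter

def train_bigram_mle(tokens):
    """Return a function prob(word, prev) giving the MLE of P(word | prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count each token as a history, i.e. every position except the last.
    histories = Counter(tokens[:-1])

    def prob(word, prev):
        if histories[prev] == 0:
            return None  # history never seen: the estimate is undefined
        return bigrams[(prev, word)] / histories[prev]

    return prob

tokens = "mary drank milk and bob drank tea".split()
prob = train_bigram_mle(tokens)

print(prob("milk", "drank"))   # seen bigram
print(prob("silk", "drank"))   # unseen bigram after a seen history
print(prob("milk", "john"))    # unseen history
```

The last two calls show the two ways the estimate breaks down on sparse data: an unseen continuation gets probability zero, and an unseen history leaves the estimate undefined.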
Problems with MLE
• Many sequences of length n never appear in the dataset (but do appear in
the real world).
– Task: Speech recognition. We heard a word in a sentence, and wish to decide
between two words: “Milk” and “Silk”
– P(Milk | John drank) >? P(Silk | John drank)
– The word “John” never appeared in the dataset, therefore we cannot decide
• Church and Gale (1991)
– Dataset: 44 million words from newspapers
– Vocabulary: 400,653 different words
– Therefore, about 1.6 × 10^11 possible bigrams
– Very few of them appeared in the dataset.
Most solutions are based on some sort of smoothing:
– Good-Turing
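The slides name Good-Turing without giving details, so the following is only a rough sketch of its two basic ingredients as they are commonly stated: the probability mass reserved for unseen events, N_1 / N, and the adjusted count c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of distinct events seen exactly c times. The fallback to the raw count when N_c or N_{c+1} is zero is a simplification made here; practical implementations smooth the N_c values instead. The toy token list is invented.

```python
from collections import Counter

def good_turing(event_counts):
    """Return (unseen_mass, adjusted) for a Counter of observed events."""
    total = sum(event_counts.values())
    freq_of_freq = Counter(event_counts.values())  # N_c for each count c

    # Total probability reserved for events that were never observed.
    unseen_mass = freq_of_freq[1] / total

    def adjusted(c):
        """Adjusted count c* = (c + 1) * N_{c+1} / N_c."""
        if freq_of_freq[c] == 0 or freq_of_freq[c + 1] == 0:
            return float(c)  # simplification: keep the raw count
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

    return unseen_mass, adjusted

tokens = "a b a b a c".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unseen_mass, adjusted = good_turing(bigram_counts)

print(unseen_mass, adjusted(1), adjusted(2))
```

On a corpus this small the adjusted counts are not meaningful; the point is only to show where the numbers come from.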
Hypothesis Testing
• The null hypothesis, denoted by H0.
• The alternative hypothesis, denoted by H1.
• Should we reject the null hypothesis in favor of the alternative?
– We are given a value drawn from a certain distribution.
– We don't know what the parameter of that distribution is.
– How likely is it that the value we were given came from the
distribution with the parameter predicted by the null hypothesis?
– If it's not very likely, we reject the null hypothesis in favor of the alternative.
• Critical Region
– But what exactly is "not very likely"?
– We choose a region known as the critical region. If the result of our
test lies in this region, then we reject the null hypothesis in favor of
the alternative.
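Tying this back to the coin example, one possible setup is H0: θ = 0.5 (fair coin) against H1: θ ≠ 0.5, with the number of heads in n tosses as the test result. The sketch below builds a symmetric two-sided critical region from the exact binomial distribution. The choices n = 20 and a significance level of 0.05 are assumptions for illustration, as is the function name.

```python
from math import comb

def critical_region_fair_coin(n, alpha=0.05):
    """Symmetric two-sided critical region for H0: theta = 0.5.

    Returns (region, size): the set of head counts that lead to rejecting
    H0, and the probability of landing in that set when H0 is true.
    """
    region, size = set(), 0.0
    # Add the most extreme pair of outcomes (c heads, n - c heads) while
    # the total probability under H0 stays at or below alpha.
    for c in range((n + 1) // 2):
        step = 2 * comb(n, c) / 2 ** n
        if size + step > alpha:
            break
        size += step
        region |= {c, n - c}
    return region, size

region, size = critical_region_fair_coin(20, alpha=0.05)

def reject_h0(n_heads):
    """Reject the null hypothesis if the observed count is in the region."""
    return n_heads in region

print(sorted(region), size)
print(reject_h0(16), reject_h0(12))
```

A smaller alpha shrinks the critical region, so more extreme results are needed before the null hypothesis is rejected.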
Empirical Evaluation methods
• Divide the data into training and test sets
– Leave one out
• Cross validation
– 10 fold cross validation
– 5x2 cross validation
• Never (never never!) perform evaluation on
the training data
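A k-fold split can be sketched as an index generator in which every item is used for testing exactly once and never appears in the training set of its own fold. The shuffling seed, the item count of 25, and the name `k_fold_splits` are assumptions; setting k equal to the number of items gives leave-one-out.

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(25, k=10))
for train, test in splits:
    # Evaluation data must never overlap with training data.
    assert not set(train) & set(test)

print(len(splits), [len(test) for _, test in splits])
```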