MaxEnt (Loglinear) Models - Overview

MaxEnt Model
Overview
Anantharaman Narayana Iyer
30 Jan 2015

MaxEnt Classifier
• This is a powerful model that has equivalence to logistic regression
• Many NLP problems can be reformulated as classification problems
• E.g. Language Modelling, Tagging Problems
• MaxEnt is widely used for various text processing tasks.
• Task is to estimate the probability of a class given the context
• The term context may refer to a single word or group of words
• In a large text corpora contains information on the cooccurrence of
classes and specific contexts

Problem: MaxEnt (Refer paper by
Ratnaparkhi)
• Let p(a, b) be the probability of class a occurring with context b
• Given the sparsity of words in b and also limited training data, it is not
possible to completely specify p(a, b)
• Given the sparse evidence about a’s and b’s our goal is to estimate
the probability model p(a, b)

Representing Evidence
• One way to represent evidence is to encode useful facts as features
and to impose constraints on the values of those feature expectations
• A feature is a binary valued function (indicator function):
• 𝑓𝑖: ε ⟶ 0, 1
• Given k features the constraints have the form:
• Expectation value of the model for the feature fj = Observed Expectation
value for the feature fj
• 𝑥∈𝜖 𝑝 𝑥 𝑓𝑗 x = 𝑥∈𝜖 𝑝 𝑥 𝑓𝑗 x
• The principle of maximum entropy requires:

Motivating Problems for Log-linear Models
• Language Model: Given the context (that is, words w1, w2, …, wi-1 ) predict the word wi
• Consider the examples:
• A natural number (i.e. 1, 2, 3, 4, 5, 6, etc.) is called a prime number (or a prime) if it has exactly two positive divisors, 1
and the number itself. Natural numbers greater than 1 that are not prime are called composite.
• Asked about the speculation that he may be inducted into the Cabinet, Parrikar said, “I can comment on it only after
meeting the Prime Minister. Let the Prime Minister who has invited me comment”
• Prime Minister Narendra Modi is likely to expand his Cabinet on Sunday, according to Times Now
• “The prime focus of this release of our product is to simplify the user interface”
• N-gram models
• Uses the context of previous (n-1) words to predict the nth word
• A trigram model approach uses 2 previous words
• Sometimes the accuracy can be improved if other features of the input are taken in to consideration as opposed to
using only a very limited context
• The n-gram LM techniques are not flexible enough to include additional features, such as the total length of sentence,
presence of certain specific words, identity of the author etc. Note: One might include extra features like author’s
name etc and compute conditional probabilities but such extensions to the conventional trigram approach becomes
quickly unwieldy
• Log-linear models can be used to include the additional features and improve the performance

The general problem
• We have an input domain X
• For example: A sequence of words
• There is a finite label set Y
• For example: The space of all possible words – that is the vocabulary
• Our goal is to determine P(y|x) for any x, y where x is in the input
space and y is in the space of labels
• For example: Given an input sentence (that is x, a sequence of words),
determine the next word in the sequence - that is P(wi | w1..wn)

Feature Vector
• A feature is a function fk(x, y) ∈ ℝ
• Often the features used in Log-linear models for typical NLP
applications are binary functions that are also called indicator
functions: fk(x, y) ∈ {0, 1}
• If we have m features then a feature vector f(x, y) ∈ ℝ 𝑚
• The number and choice of features for a given input is arbitrary. The
system developer can design these with an intuition of the problem
space he is addressing.

Features in Log-Linear Models
• Features are pieces of elementary pieces of evidence that link aspects
of what we observe x with a label y that we want to predict (Ref: C
Manning)
• A feature is a function with a bounded real value
𝑓: 𝑋 ∗ 𝑌 → ℝ
• Example:
• Consider a sentence: “Gandhi was born on 2 October 1869 in Porbandar”
• f1(x, y) = [y = PERSON and wi = isCapitalized and wi+1 = (“was” | “is”) and wi+2 =
VERB]
• f2(x, y) = [y = LOCATION and wi = isCapitalized and wi+1 = (“was” | “is”) and wi+2
= VERB]
• f3(x, y) = [y = DATE and wi = CD and wi-1 = (“on”) and wi-2 = VERB]

Feature Vector Representations
• Consider the examples:
• A natural number (i.e. 1, 2, 3, 4, 5, 6, etc.) is called
a prime number (or a prime) if it has exactly two
positive divisors, 1 and the number itself.
Natural numbers greater than 1 that are
not prime are called composite.
• Asked about the speculation that he may be inducted
into the Cabinet, Parrikar said, “I can comment on it
only after meeting the Prime Minister. Let the Prime
Minister who has invited me comment”
• Prime Minister Narendra Modi is likely to expand his
Cabinet on Sunday, according to Times Now
• “The prime focus of this release of our product is to
simplify the user interface”
• Exercise:
• What are the possible features we may consider for
representing the Trigram LM problem?
• How do we extend this set of trigram features in to a
more powerful set of features?

Parameter Vector
• Given the feature vector f(x, y) ∈ ℝ 𝑚 we can define the parameter
vector v ∈ ℝ 𝑚
• Each (x, y) is mapped to a score which is the dot product of the
parameter vector and the feature vector:
𝑣. 𝑓 𝑥, 𝑦 =
𝑘=1
𝑚
𝑣 𝑘 𝑓𝑘

Log-linear model - definition
• Let the Input domain X and label space Y
• Our goal is to determine P(y|x)
• A feature is a function: 𝑓: 𝑋 × 𝑌 → ℝ
• We have m features that constitute a feature vector: 𝑓 𝑥, 𝑦 ∈ ℝ 𝑚
• We also have the parameter vector: 𝑣 ∈ ℝ 𝑚
• We define the log-linear model as:
𝒑 𝒚 𝒙; 𝒗 =
𝒆 𝒗.𝒇 𝒙,𝒚
𝒚′∈𝒀
𝒆 𝒗.𝒇 𝒙,𝒚′

Refer: Coursera Notes Prof Michael Collins

MaxEnt (Loglinear) Models - Overview

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to MaxEnt (Loglinear) Models - Overview

Similar to MaxEnt (Loglinear) Models - Overview (20)

More from ananth

More from ananth (10)

Recently uploaded

Recently uploaded (20)

MaxEnt (Loglinear) Models - Overview