From Logistic Regression to Linear-Chain CRF
Yow-Bang (Darren) Wang
12/20/2012
Outline
● Introduction
● Logistic Regression
● Log-Linear Model
● Linear-Chain CRF
○ Example: Part of Speech (POS) Tagging
● CRF Training and Testing
○ Example: Part of Speech (POS) Tagging
● Example: Speech Disfluency Detection
Introduction
Introduction
We can approach the theory of CRF from:
1. Maximum Entropy
2. Probabilistic Graphical Model
3. Logistic Regression ← today's talk
Linear Regression
● Input x: real-valued features (RV)
● Output y: Gaussian distribution (RV)
● Model parameter Θ = (a_0, a_1, …, a_N)
● ML (conditional likelihood) estimation of Θ, where {X, Y} are the training data.
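For reference, a standard way to write the model and its conditional-ML estimate (assuming X is the design matrix with a leading column of ones; this restates the usual least-squares result rather than the slide's own equations):

\[
y = a_0 + \sum_{n=1}^{N} a_n x_n + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2),
\qquad
\hat{\Theta}_{ML} = (X^{\top} X)^{-1} X^{\top} Y
\]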
Linear Regression
● Input x: real-valued features (RV)
● Output y: Gaussian distribution (RV)
● Represented with a graphical model:
[Figure: graphical model in which inputs 1, x_1, …, x_N connect to the output y through weights a_0, a_1, …, a_N]
Logistic Regression
Logistic Regression
● Input x: real-valued features (RV)
● Output y: Bernoulli distribution (RV)
● Model parameter Θ = (a_0, a_1, …, a_N)
Q: Why this form?
A: Both sides have a range of values (−∞, ∞)
No analytical solution for ML → gradient descent
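The form in question is the usual logit/sigmoid pair; both sides of the first equation range over (−∞, ∞), which is why this parameterization is chosen:

\[
\log \frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = a_0 + \sum_{n=1}^{N} a_n x_n
\quad\Longleftrightarrow\quad
p(y=1 \mid x) = \frac{1}{1 + \exp\!\big(-(a_0 + \sum_{n=1}^{N} a_n x_n)\big)}
\]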
Logistic Regression
● Input x: real-valued features (RV)
● Output y: Bernoulli distribution (RV)
● Represented with a graphical model:
[Figure: graphical model in which inputs 1, x_1, …, x_N connect through weights a_0, a_1, …, a_N to a sigmoid node producing p]
Logistic Regression
Advantages of Logistic Regression:
1. Correlated features x don't lead to problems (contrast to
Naive Bayes)
2. Well-calibrated probability (contrast to SVM)
3. Not sensitive to unbalanced training data
number of "Y=1"
Multinomial Logistic Regression
● Input x: real-valued features (RV), N-dimensional
● Output y: categorical distribution (RV) over M classes
● Represented with a graphical model:
[Figure: graphical model in which inputs 1, x_1, …, x_N feed a softmax layer producing outputs p_1, …, p_M]
A neural network with 2 layers!
p_m: probability of the m-th class
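The softmax the figure refers to, in standard notation (assuming a separate weight vector a^{(m)} per class):

\[
p_m = p(y = m \mid x) = \frac{\exp\!\big(a_0^{(m)} + \sum_{n=1}^{N} a_n^{(m)} x_n\big)}{\sum_{m'=1}^{M} \exp\!\big(a_0^{(m')} + \sum_{n=1}^{N} a_n^{(m')} x_n\big)}
\]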
Log-Linear Model
Log-Linear Model
An interpretation: Log-Linear Model is a Structured Logistic
Regression
● Structured: allows non-numerical input and output by defining proper feature functions
● Special case: Logistic regression
General form:
● F_j(x, y): j-th feature function
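The general form, as given in [1], with one weight w_j per feature function F_j:

\[
p(y \mid x; w) = \frac{\exp \sum_{j} w_j F_j(x, y)}{\sum_{y'} \exp \sum_{j} w_j F_j(x, y')}
\]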
Log-Linear Model
Note:
1. “Feature” vs. “Feature function”
○ Feature: corresponds to the input only
○ Feature function: corresponds to both input and output
2. Must sum over all possible labels y' in the denominator
→ normalization into [0, 1].
General form:
● F_j(x, y): j-th feature function
Linear-Chain CRF
Conditional Random Field (CRF)
From the probabilistic graphical model perspective:
● CRF is a Markov Random Field with some disjoint RVs observed and some hidden.
[Figure: a Markov Random Field over RVs x, y, z, p, q, r, with some nodes observed and some hidden]
Linear-Chain CRF
From the probabilistic graphical model perspective:
● Linear-Chain CRF: a specific structure of CRF
[Figure: the hidden output RVs form a linear chain over the observed input]
We often refer to "linear-chain CRF" as simply "CRF".
Linear-Chain CRF
From Log-Linear Model point of view: Linear-Chain CRF is a
Log-Linear Model, of which
1. The length L of the output y can vary.
2. The feature function is the sum of "low-level feature functions":
[Figure: hidden output sequence y = (y_1, …, y_L) over the observed input sequence x]
Linear-Chain CRF
From Log-Linear Model point of view: Linear-Chain CRF is a
Log-Linear Model, of which
1. The length L of the output y can vary.
2. The feature function is the sum of "low-level feature functions":
“We can have a fixed set of feature functions F_j for log-linear training, even though the training examples are not fixed-length.” [1]
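Written out in the slide's notation (taking y_0 to be a designated start tag):

\[
F_j(x, y) = \sum_{i=1}^{L} f_j(x, y_i, y_{i-1}, i)
\]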
Input (observed) x: word sequence
Output (hidden) y: POS tag sequence
● For example:
x = "He sat on the mat."
y = "pronoun verb preposition article noun"
[Figure: He/pron. sat/v. on/prep. the/art. mat/n.]
Example: Part of Speech (POS) Tagging
Example: Part of Speech (POS) Tagging
Input (observed) x: word sequence
Output (hidden) y: POS tag sequence
● With CRF we hope to model p(y | x) so that the correct tag sequence is the most probable one, using the CRF form below.
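Presumably the standard linear-chain CRF form, combining the log-linear model with the decomposition above:

\[
p(y \mid x; w) = \frac{\exp \sum_{j} w_j F_j(x, y)}{Z(x, w)},
\qquad
F_j(x, y) = \sum_{i=1}^{L} f_j(x, y_i, y_{i-1}, i),
\qquad
Z(x, w) = \sum_{y'} \exp \sum_{j} w_j F_j(x, y')
\]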
Example: Part of Speech (POS) Tagging
An example of a low-level feature function f_j(x, y_i, y_{i-1}, i):
● "The i-th word in x is capitalized, and POS tag y_i = proper noun." [TRUE(1) or FALSE(0)]
If w_j is positively large: given x and the other conditions fixed, y is more probable if f_j(x, y_i, y_{i-1}, i) is activated.
Note: a feature function may not use all the given information.
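A minimal Python sketch of such a low-level feature function (the function name and the 'NNP' tag are illustrative assumptions, not from the slides; x is taken to be a list of word strings):

def f_cap_propernoun(x, y_i, y_prev, i):
    """Fires (returns 1) when the i-th word of x is capitalized and the
    hypothesized POS tag y_i is 'NNP' (proper noun); otherwise returns 0.
    It ignores y_prev: a feature function need not use all its arguments."""
    return 1 if x[i][0].isupper() and y_i == "NNP" else 0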
CRF Training and Testing
Training
Stochastic Gradient Ascent
● Partial derivative of conditional log-likelihood:
● Update weight by
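In the standard form (as in [1]), for a single training example (x, y) and learning rate λ:

\[
\frac{\partial}{\partial w_j} \log p(y \mid x; w)
= F_j(x, y) - \mathbb{E}_{y' \sim p(y' \mid x; w)}\big[F_j(x, y')\big],
\qquad
w_j \leftarrow w_j + \lambda \Big(F_j(x, y) - \mathbb{E}_{y'}\big[F_j(x, y')\big]\Big)
\]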
Training
Note: if the j-th feature function is not activated by this training example
→ we don't need to update its weight!
→ usually only a few weights need to be updated in each iteration
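A toy sketch of one such update in Python (hypothetical names; the expected feature counts are computed by brute-force enumeration of label sequences, so this only runs for tiny examples, real implementations use the forward-backward algorithm instead):

import math
from itertools import product

def sga_step(x, y, weights, feat_funcs, labels, lr=0.1):
    """One stochastic gradient ascent step on a single (x, y) pair."""
    L = len(x)

    def F(y_seq):
        # Global feature vector: sum of low-level features over all positions
        return [sum(f(x, y_seq[i], y_seq[i - 1] if i > 0 else None, i)
                    for i in range(L)) for f in feat_funcs]

    def score(y_seq):
        return sum(w * Fj for w, Fj in zip(weights, F(y_seq)))

    # Partition function and expected feature counts over all label sequences
    all_seqs = list(product(labels, repeat=L))
    Z = sum(math.exp(score(s)) for s in all_seqs)
    expected = [0.0] * len(feat_funcs)
    for s in all_seqs:
        p = math.exp(score(s)) / Z
        for j, Fj in enumerate(F(s)):
            expected[j] += p * Fj

    # Gradient of log p(y|x): observed minus expected feature counts
    observed = F(y)
    return [w + lr * (o - e) for w, o, e in zip(weights, observed, expected)]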
Testing
For 1-best derivation:
[Table: g_i(y_{i-1}, y_i) scores indexed by tag pairs, e.g. N, V, Adj, …]
For 1-best derivation:
1. Pre-compute g_i(y_{i-1}, y_i) = Σ_j w_j f_j(x, y_i, y_{i-1}, i) as a table for each i
2. Perform dynamic programming to find the best sequence y:
Example: Part of Speech (POS) Tagging
For 1-best derivation:
1. Pre-compute g_i(y_{i-1}, y_i) as a table for each i
2. Perform dynamic programming to find the best sequence y:
● Complexity: O(M²LD)
Example: Part of Speech (POS) Tagging
(M²: build a table over tag pairs; L: for each element in the sequence; D: # of feature functions)
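A Python sketch of the decoding procedure (hypothetical helper names, consistent with the feature-function sketch above; position 0 uses None as the "start" tag):

def viterbi_decode(x, weights, feat_funcs, labels):
    """1-best decoding for a linear-chain CRF.
    Step 1: precompute g_i(y_prev, y) = sum_j w_j * f_j(x, y, y_prev, i)
            as a table for each position i (cost O(M^2 * L * D)).
    Step 2: dynamic programming over the tables (cost O(M^2 * L))."""
    L = len(x)

    def g(i, y_prev, y):
        return sum(w * f(x, y, y_prev, i) for w, f in zip(weights, feat_funcs))

    # Step 1: one M x M table per position i = 1 .. L-1
    tables = [{(yp, y): g(i, yp, y) for yp in labels for y in labels}
              for i in range(1, L)]

    # Step 2: Viterbi recursion
    delta = {y: g(0, None, y) for y in labels}   # best score ending in y at i = 0
    backptr = []
    for tab in tables:
        new_delta, ptrs = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[yp] + tab[(yp, y)])
            new_delta[y] = delta[best_prev] + tab[(best_prev, y)]
            ptrs[y] = best_prev
        delta = new_delta
        backptr.append(ptrs)

    # Trace back the best tag sequence from the best final tag
    seq = [max(labels, key=lambda y: delta[y])]
    for ptrs in reversed(backptr):
        seq.append(ptrs[seq[-1]])
    return list(reversed(seq))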
Testing
For probability estimation:
● must also sum over all possible y (e.g. all possible POS sequences) for the denominator…
Can be calculated by matrix multiplication!
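One standard way to organize that computation (assumed notation, with y_0 a designated start tag and g_i(y', y) = Σ_j w_j f_j(x, y, y', i)): collect the exponentiated scores at each position into an M × M matrix and chain the products:

\[
(M_i)_{y', y} = \exp g_i(y', y),
\qquad
Z(x, w) = \boldsymbol{\pi}^{\top} M_2 M_3 \cdots M_L \mathbf{1},
\qquad
\pi_{y} = \exp g_1(\mathrm{start}, y),
\]

where 1 is the all-ones vector over tags.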
Example: Speech Disfluency Detection
Example: Speech Disfluency Detection
One application of CRF in speech recognition:
Boundary/Disfluency Detection [5]
● Repetition : “It is is Tuesday.”
● Hesitation : “It is uh… Tuesday.”
● Correction: “It is Monday, I mean, Tuesday.”
● etc.
Possible clues: prosody
● Pitch
● Duration
● Energy
● Pause
● etc.
“It is uh… Tuesday.”
● Pitch reset?
● Long duration?
● Low energy?
● Pause existence?
One application of CRF in speech recognition:
Boundary/Disfluency Detection [5]
● CRF Input x: prosodic features
● CRF Output y: boundary/disfluency events
[Diagram: CRF output used for speech recognition rescoring]
Example: Speech Disfluency Detection
References
[1] Charles Elkan, “Log-linear Models and Conditional Random
Fields”
○ Tutorial at CIKM08 (ACM International Conference on Information and
Knowledge Management)
○ Video: http://videolectures.net/cikm08_elkan_llmacrf/
○ Lecture notes: http://cseweb.ucsd.edu/~elkan/250B/cikmtutorial.pdf
[2] Hanna M. Wallach, “Conditional Random Fields: An
Introduction”
[3] Jeremy Morris, “Conditional Random Fields: An Overview”
○ Presented at OSU Clippers 2008, January 11, 2008
References
[4] J. Lafferty, A. McCallum, F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”, in Proc. ICML, 2001.
[5] Liu, Y. and Shriberg, E. and Stolcke, A. and Hillard, D.
and Ostendorf, M. and Harper, M., “Enriching speech
recognition with automatic detection of sentence boundaries
and disfluencies”, in IEEE Transactions on Audio, Speech,
and Language Processing, 2006.
