From Logistic Regression to Linear-Chain CRF

* Introduction
* Logistic Regression
* Log-Linear Model
* Linear-Chain CRF
* Example: Part of Speech (POS) Tagging
* CRF Training and Testing
* Example: Part of Speech (POS) Tagging
* Example: Speech Disfluency Detection


  1. From Logistic Regression to Linear-Chain CRF
     Yow-Bang (Darren) Wang, 12/20/2012
  2. Outline
     ● Introduction
     ● Logistic Regression
     ● Log-Linear Model
     ● Linear-Chain CRF
       ○ Example: Part of Speech (POS) Tagging
     ● CRF Training and Testing
       ○ Example: Part of Speech (POS) Tagging
     ● Example: Speech Disfluency Detection
  3. Introduction
  4. Introduction
     We can approach the theory of CRF from:
     1. Maximum Entropy
     2. Probabilistic Graphical Model
     3. Logistic Regression <- today's talk
  5. Linear Regression
     ● Input x: real-valued features (RV)
     ● Output y: Gaussian distribution (RV)
     ● Model parameter θ = (a_0, a_1, ..., a_N)
     ● ML (conditional likelihood) estimation of θ:
       θ* = argmax_θ p(Y | X; θ), where {X, Y} are the training data
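
For concreteness, here is a minimal sketch of that ML estimate: with Gaussian output noise, maximizing the conditional likelihood reduces to least squares, solved below with NumPy's least-squares solver. The synthetic data and "true" weights are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal sketch: ML (conditional likelihood) estimation for linear
# regression. With Gaussian noise on y, argmax_theta p(Y | X; theta)
# is the least-squares solution. Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 examples, N = 3 features
true_a = np.array([0.5, 2.0, -1.0, 3.0])  # a_0 (bias) plus a_1..a_3
y = true_a[0] + X @ true_a[1:] + rng.normal(scale=0.1, size=100)

X1 = np.hstack([np.ones((100, 1)), X])    # prepend the constant "1" input
a_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(a_hat)                              # close to true_a
```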
  6. Linear Regression
     ● Input x: real-valued features (RV)
     ● Output y: Gaussian distribution (RV)
     ● Represented with a graphical model:
       [Figure: input nodes 1, x_1, ..., x_N with weights a_0, a_1, ..., a_N feeding into the output y]
  7. Logistic Regression
  8. Logistic Regression
     ● Input x: real-valued features (RV)
     ● Output y: Bernoulli distribution (RV)
     ● Model parameter θ = (a_0, a_1, ..., a_N):
       log(p / (1 - p)) = a_0 + Σ_n a_n x_n
     Q: Why this form? A: Both sides have a range of value (-∞, ∞).
     No analytical solution for ML → gradient descent
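
Since there is no analytical ML solution, the weights are fit iteratively. Below is a minimal sketch of gradient ascent on the conditional log-likelihood (equivalent to the gradient descent on its negative that the slide mentions); the synthetic data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

# Minimal sketch: fit logistic regression by gradient ascent on the
# conditional log-likelihood. Synthetic data; hyperparameters arbitrary.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # bias + 2 features
true_a = np.array([-0.5, 1.5, -2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ true_a)))).astype(float)

a = np.zeros(3)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ a)))      # sigmoid maps the logit to (0, 1)
    a += 0.5 * X.T @ (y - p) / len(y)   # gradient of the mean log-likelihood
print(a)                                # roughly recovers true_a
```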
  9. Logistic Regression
     ● Input x: real-valued features (RV)
     ● Output y: Bernoulli distribution (RV)
     ● Represented with a graphical model:
       [Figure: input nodes 1, x_1, ..., x_N with weights a_0, a_1, ..., a_N, passed through a sigmoid to produce p]
  10. Logistic Regression
      Advantages of Logistic Regression:
      1. Correlated features x don't lead to problems (in contrast to Naive Bayes)
      2. Well-calibrated probabilities (in contrast to SVM)
      3. Not sensitive to unbalanced training data (i.e., to the number of "Y = 1" examples)
  11. Multinomial Logistic Regression
      ● Input x: real-valued features (RV), N-dimensional
      ● Output y: multinomial distribution (RV) over M classes
      ● Represented with a graphical model:
        [Figure: input nodes 1, x_1, ..., x_N feeding a softmax layer that outputs p_1, ..., p_M; p_m is the probability of the m-th class. A neural network with 2 layers!]
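
The softmax output in this slide's figure generalizes the sigmoid to M classes. A minimal sketch, with made-up weights and input:

```python
import numpy as np

# Minimal sketch of the softmax layer of multinomial logistic regression:
# one weight vector per class; p[m] is the probability of class m.
def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

A = np.array([[0.2, 1.0, -0.5],   # M = 2 classes; columns: bias, x_1, x_2
              [1.1, -0.3, 0.7]])
x = np.array([1.0, 0.5, -1.2])    # leading 1.0 is the constant input
p = softmax(A @ x)
print(p, p.sum())                 # class probabilities; they sum to 1
```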
  12. Log-Linear Model
  13. Log-Linear Model
      An interpretation: the Log-Linear Model is a structured Logistic Regression.
      ● Structured: allows non-numerical input and output by defining proper feature functions
      ● Special case: logistic regression
      General form: p(y | x; w) = exp(Σ_j w_j F_j(x, y)) / Σ_{y'} exp(Σ_j w_j F_j(x, y'))
      ● F_j(x, y): j-th feature function
  14. Log-Linear Model
      Note:
      1. "Feature" vs. "feature function"
         ○ Feature: corresponds only to the input
         ○ Feature function: corresponds to both input and output
      2. Must sum over all possible labels y' for the denominator → normalization into [0, 1]
      General form: p(y | x; w) = exp(Σ_j w_j F_j(x, y)) / Σ_{y'} exp(Σ_j w_j F_j(x, y'))
      ● F_j(x, y): j-th feature function
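
To make the general form concrete, here is a minimal sketch that evaluates p(y | x; w) for a two-label toy problem, normalizing over all possible labels y' as the slide notes. The feature functions, weights, and inputs are invented for illustration.

```python
import numpy as np

# Minimal sketch of the log-linear general form:
#   p(y | x; w) = exp(sum_j w_j F_j(x, y)) / sum_{y'} exp(sum_j w_j F_j(x, y'))
LABELS = ["noun", "verb"]

def F(x, y):
    # Each entry is one feature function F_j(x, y); note that each looks
    # at both the input x and the candidate output y.
    return np.array([
        1.0 if x.endswith("ing") and y == "verb" else 0.0,
        1.0 if x[0].isupper() and y == "noun" else 0.0,
    ])

w = np.array([2.0, 1.5])  # w_j: weight of the j-th feature function

def prob(y, x):
    scores = {yy: float(w @ F(x, yy)) for yy in LABELS}
    Z = sum(np.exp(s) for s in scores.values())   # sum over all y'
    return np.exp(scores[y]) / Z

print(prob("verb", "running"), prob("noun", "running"))  # ~0.88, ~0.12
```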
  15. Linear-Chain CRF
  16. Conditional Random Field (CRF)
      From the probabilistic graphical model perspective:
      ● CRF is a Markov Random Field with some disjoint RVs observed and some hidden
      [Figure: MRF over nodes x, z, y, q, r, p, with some nodes hidden and some observed]
  17. Linear-Chain CRF
      From the probabilistic graphical model perspective:
      ● Linear-Chain CRF: a specific structure of CRF
      [Figure: chain of hidden nodes, each connected to one observed node]
      We often refer to "linear-chain CRF" as simply "CRF".
  18. Linear-Chain CRF
      From the Log-Linear Model point of view, a Linear-Chain CRF is a Log-Linear Model in which
      1. The length L of the output y can vary.
      2. Each feature function is a sum of "low-level feature functions":
         F_j(x, y) = Σ_{i=1..L} f_j(x, y_i, y_{i-1}, i)
      [Figure: hidden tag sequence y over observed word sequence x]
  19. Linear-Chain CRF
      From the Log-Linear Model point of view, a Linear-Chain CRF is a Log-Linear Model in which
      1. The length L of the output y can vary.
      2. Each feature function is a sum of "low-level feature functions":
         F_j(x, y) = Σ_{i=1..L} f_j(x, y_i, y_{i-1}, i)
      "We can have a fixed set of feature functions F_j for log-linear training, even though the training examples are not fixed-length." [1]
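
A minimal sketch of point 2: the same F_j applies to sequences of any length because it sums a low-level f_j over positions. The START placeholder and the specific f_j (the capitalization feature from a later slide) are illustrative choices.

```python
# Minimal sketch: F_j(x, y) = sum_{i=1..L} f_j(x, y_i, y_{i-1}, i).
START = "<s>"  # assumed placeholder for y_0 at the first position

def f_cap_propernoun(x, y_i, y_prev, i):
    # Fires when the i-th word is capitalized and tagged proper noun.
    return 1.0 if x[i][0].isupper() and y_i == "propernoun" else 0.0

def F(f, x, y):
    # The same fixed F_j works for any sequence length L.
    return sum(f(x, y[i], y[i - 1] if i > 0 else START, i)
               for i in range(len(y)))

x = ["He", "met", "Alice"]
y = ["pronoun", "verb", "propernoun"]
print(F(f_cap_propernoun, x, y))  # 1.0: fires only at "Alice"
```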
  20. Example: Part of Speech (POS) Tagging
      Input (observed) x: word sequence
      Output (hidden) y: POS tag sequence
      ● For example:
        x = "He sat on the mat."
        y = "pronoun verb preposition article noun"
      [Figure: each word aligned with its tag: He/pron. sat/v. on/prep. the/art. mat/n.]
  21. Example: Part of Speech (POS) Tagging
      Input (observed) x: word sequence
      Output (hidden) y: POS tag sequence
      ● With CRF we hope the correct tag sequence receives the highest probability p(y | x; w)
      CRF: p(y | x; w) = exp(Σ_j w_j F_j(x, y)) / Σ_{y'} exp(Σ_j w_j F_j(x, y')),
      where F_j(x, y) = Σ_i f_j(x, y_i, y_{i-1}, i)
  22. Example: Part of Speech (POS) Tagging
      An example of a low-level feature function f_j(x, y_i, y_{i-1}, i):
      ● "The i-th word in x is capitalized, and POS tag y_i = proper noun." [TRUE (1) or FALSE (0)]
      If w_j is positively large: given x and other conditions fixed, y is more probable if f_j(x, y_i, y_{i-1}, i) is activated.
      Note that a feature function may not use all the given information.
  23. CRF Training and Testing
  24. Training
      Stochastic Gradient Ascent
      ● Partial derivative of the conditional log-likelihood:
        ∂/∂w_j log p(y | x; w) = F_j(x, y) - E_{y' ~ p(y'|x;w)}[F_j(x, y')]
      ● Update weight by w_j ← w_j + λ (F_j(x, y) - E[F_j(x, y')])
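
A minimal sketch of one update: the gradient is the observed feature count minus the model's expected feature count. The expectation below is brute-forced over all label sequences, which is only feasible for tiny label sets (real trainers use forward-backward); the toy feature functions and data are assumptions.

```python
import numpy as np
from itertools import product

LABELS = ["N", "V"]
START = "<s>"

def F_vec(x, y):
    # Vector of F_j(x, y), each a sum of a low-level feature function.
    def f0(x, yi, yp, i):  # capitalized word tagged N
        return 1.0 if x[i][0].isupper() and yi == "N" else 0.0
    def f1(x, yi, yp, i):  # tag bigram V -> N
        return 1.0 if yp == "V" and yi == "N" else 0.0
    return np.array([
        sum(f(x, y[i], y[i - 1] if i else START, i) for i in range(len(y)))
        for f in (f0, f1)
    ])

def sga_step(w, x, y, lr=0.1):
    cands = list(product(LABELS, repeat=len(x)))      # all label sequences y'
    feats = np.array([F_vec(x, yy) for yy in cands])
    scores = feats @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                                      # p(y' | x; w)
    grad = F_vec(x, y) - p @ feats                    # observed - expected
    return w + lr * grad

w = sga_step(np.zeros(2), ["Bob", "runs"], ["N", "V"])
print(w)
```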
  25. Training
      Note: if the j-th feature function is not activated by this training example
      → we don't need to update it!
      → usually only a few weights need to be updated in each iteration
  26. Testing
      For 1-best derivation:
      y* = argmax_y p(y | x; w) = argmax_y Σ_j w_j F_j(x, y)
  27. Example: Part of Speech (POS) Tagging
      For 1-best derivation:
      1. Pre-compute g_i(y_{i-1}, y_i) = Σ_j w_j f_j(x, y_{i-1}, y_i, i) as a table for each i
      2. Perform dynamic programming to find the best sequence y
      [Figure: trellis with candidate tags (N, V, Adj, ...) at each position and transitions between adjacent positions]
  28. Example: Part of Speech (POS) Tagging
      For 1-best derivation:
      1. Pre-compute g_i(y_{i-1}, y_i) as an M × M table for each i
      2. Perform dynamic programming to find the best sequence y
      ● Complexity: O(M² L D)
        ○ M²: size of each table to build
        ○ L: for each element in the sequence
        ○ D: number of feature functions
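
A minimal sketch of the decoding step: given precomputed g_i tables, Viterbi dynamic programming finds the best sequence. The random tables and the convention that row 0 of the first table holds the start scores are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of Viterbi decoding over precomputed g_i tables.
# g[i, a, b] stands in for g_i(y_{i-1}=a, y_i=b) = sum_j w_j f_j(x, a, b, i).
def viterbi(g):
    L, M, _ = g.shape
    score = g[0, 0, :].copy()          # assume row 0 of g[0] = start scores
    back = np.zeros((L, M), dtype=int)
    for i in range(1, L):
        cand = score[:, None] + g[i]   # cand[a, b]: best path ending in a, then b
        back[i] = cand.argmax(axis=0)  # remember the best predecessor of each b
        score = cand.max(axis=0)
    y = [int(score.argmax())]          # best final tag
    for i in range(L - 1, 0, -1):
        y.append(int(back[i, y[-1]]))  # follow the back-pointers
    return y[::-1]

g = np.random.default_rng(0).normal(size=(5, 3, 3))  # L = 5, M = 3 tags
print(viterbi(g))                      # best tag-index sequence
```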
  29. Testing
      For probability estimation:
      ● must also compute all possible y (e.g. all possible POS sequences) for the denominator...
      ● can be calculated by matrix multiplication!
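
A minimal sketch of that matrix-multiplication trick: with M_i[a, b] = exp(g_i(a, b)), the sum over all M^L label sequences in the denominator collapses to L matrix-vector products (the forward algorithm). Same random tables and start-state convention as the Viterbi sketch; real implementations work in log space to avoid overflow.

```python
import numpy as np

# Minimal sketch: the partition function Z(x, w) via the forward recursion.
def log_partition(g):
    L, M, _ = g.shape
    alpha = np.exp(g[0, 0, :])         # forward vector at position 0
    for i in range(1, L):
        alpha = alpha @ np.exp(g[i])   # alpha[b] = sum_a alpha[a] * exp(g_i(a, b))
    return np.log(alpha.sum())         # log of the sum over all sequences

g = np.random.default_rng(0).normal(size=(5, 3, 3))
print(log_partition(g))
```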
  30. Example: Speech Disfluency Detection
  31. Example: Speech Disfluency Detection
      One of the applications of CRF in speech recognition: boundary/disfluency detection [5]
      ● Repetition: "It is is Tuesday."
      ● Hesitation: "It is uh... Tuesday."
      ● Correction: "It is Monday, I mean, Tuesday."
      ● etc.
      Possible clues: prosody
      ● Pitch
      ● Duration
      ● Energy
      ● Pause
      ● etc.
      "It is uh... Tuesday." → pitch reset? long duration? low energy? pause existence?
  32. Example: Speech Disfluency Detection
      One of the applications of CRF in speech recognition: boundary/disfluency detection [5]
      ● CRF input x: prosodic features
      ● CRF output y: boundary/disfluency labels, used for speech recognition rescoring
  33. References
      [1] Charles Elkan, "Log-linear Models and Conditional Random Fields"
          ○ Tutorial at CIKM '08 (ACM International Conference on Information and Knowledge Management)
          ○ Video: http://videolectures.net/cikm08_elkan_llmacrf/
          ○ Lecture notes: http://cseweb.ucsd.edu/~elkan/250B/cikmtutorial.pdf
      [2] Hanna M. Wallach, "Conditional Random Fields: An Introduction"
      [3] Jeremy Morris, "Conditional Random Fields: An Overview"
          ○ Presented at OSU Clippers 2008, January 11, 2008
  34. References
      [4] J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", 2001.
      [5] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, M. Harper, "Enriching Speech Recognition with Automatic Detection of Sentence Boundaries and Disfluencies", IEEE Transactions on Audio, Speech, and Language Processing, 2006.
