
# From logistic regression to linear chain CRF

* Introduction
* Logistic Regression
* Log-Linear Model
* Linear-Chain CRF
* Example: Part of Speech (POS) Tagging
* CRF Training and Testing
* Example: Part of Speech (POS) Tagging
* Example: Speech Disfluency Detection


1. From Logistic Regression to Linear-Chain CRF. Yow-Bang (Darren) Wang, 12/20/2012.
2. Outline: Introduction; Logistic Regression; Log-Linear Model; Linear-Chain CRF (example: POS tagging); CRF Training and Testing (example: POS tagging); Example: Speech Disfluency Detection.
3. Introduction
4. Introduction. We can approach the theory of CRF from (1) maximum entropy, (2) probabilistic graphical models, or (3) logistic regression, which is today's talk.
5. Linear Regression
    * Input x: real-valued features (RV)
    * Output y: Gaussian-distributed (RV)
    * Model parameter θ = (a0, a1, ..., aN)
    * ML (conditional likelihood) estimation of θ: θ* = argmax_θ p(Y | X; θ), where {X, Y} are the training data.
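The ML estimate above has a closed form when the noise is Gaussian: maximizing the conditional likelihood reduces to least squares. A minimal sketch with hypothetical toy data (the true weights 2, -1 and bias 3 are made up for illustration):

```python
import numpy as np

# Toy data: y = 2*x1 - 1*x2 + 3 + small Gaussian noise (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.01 * rng.normal(size=100)

# Append the constant "1" input node, so theta = (a1, a2, a0).
X1 = np.hstack([X, np.ones((100, 1))])

# With Gaussian noise, ML estimation of theta is exactly least squares,
# solved here in closed form.
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(theta)  # close to [2, -1, 3]
```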
6. Linear Regression, represented with a graphical model: input nodes 1, x1, ..., xN connect to the output y through weights a0, a1, ..., aN.
7. Logistic Regression
8. Logistic Regression
    * Input x: real-valued features (RV)
    * Output y: Bernoulli-distributed (RV), with log(p / (1 − p)) = a0 + Σ_n an·xn, where p = P(y = 1 | x)
    * Model parameter θ = (a0, a1, ..., aN)
    * Q: Why this form? A: Both sides have range of values (−∞, ∞).
    * There is no analytical solution for ML, so we use gradient descent.
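Since there is no analytical ML solution, the weights are fit iteratively. A minimal sketch of gradient ascent on the conditional log-likelihood, with hypothetical toy data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: y = 1 exactly when x1 + x2 > 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

X1 = np.hstack([X, np.ones((200, 1))])   # constant "1" node for a0
theta = np.zeros(3)

# Gradient of the conditional log-likelihood is X^T (y - p);
# take small ascent steps instead of solving analytically.
for _ in range(500):
    p = sigmoid(X1 @ theta)
    theta += 0.1 * X1.T @ (y - p) / len(y)

print(theta)  # weights on x1 and x2 come out positive
```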
9. Logistic Regression, represented with a graphical model: input nodes 1, x1, ..., xN connect through weights a0, a1, ..., aN to a sigmoid node that outputs p.
10. Advantages of Logistic Regression:
    1. Correlated features x don't lead to problems (in contrast to Naive Bayes).
    2. Well-calibrated probabilities (in contrast to SVM).
    3. Not sensitive to unbalanced training data (e.g. the number of "y = 1" examples).
11. Multinomial Logistic Regression
    * Input x: real-valued features (RV), N-dimensional
    * Output y: multinomially distributed (RV) over M classes; pm is the probability of the m-th class
    * Represented with a graphical model: input nodes 1, x1, ..., xN connect through a softmax to outputs p1, ..., pM. This is a neural network with 2 layers!
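The softmax output layer can be sketched directly; the weight matrix below is hypothetical (M = 3 classes, N = 2 features plus the constant "1" node):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max to stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

# Hypothetical weights: one row of (a1, a2, a0) per class.
W = np.array([[ 1.0, -1.0, 0.0],
              [-1.0,  1.0, 0.0],
              [ 0.0,  0.0, 0.2]])
x = np.array([2.0, -1.0, 1.0])   # features x1, x2 and the constant 1

p = softmax(W @ x)
print(p)   # p[m] = probability of the m-th class; the entries sum to 1
```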
12. Log-Linear Model
13. An interpretation: a Log-Linear Model is a structured Logistic Regression.
    * Structured: allows non-numerical input and output by defining proper feature functions.
    * Special case: logistic regression.
    * General form: p(y | x; w) = exp(Σ_j wj·Fj(x, y)) / Z(x, w), where Fj(x, y) is the j-th feature function.
14. Notes:
    1. "Feature" vs. "feature function": a feature corresponds only to the input; a feature function corresponds to both input and output.
    2. The denominator must sum over all possible labels y', i.e. Z(x, w) = Σ_y' exp(Σ_j wj·Fj(x, y')), which normalizes the probability into [0, 1].
15. Linear-Chain CRF
16. Conditional Random Field (CRF). From the probabilistic graphical model perspective, a CRF is a Markov Random Field in which some disjoint sets of RVs are observed and the others are hidden. [Figure: an MRF over nodes x, z, y, q, r, p, split into hidden and observed nodes.]
17. Linear-Chain CRF: a specific structure of CRF, in which the hidden RVs form a chain and each is connected to an observed RV. We often refer to "linear-chain CRF" as simply "CRF".
18. From the Log-Linear Model point of view, a Linear-Chain CRF is a Log-Linear Model in which:
    1. The length L of the output y can vary.
    2. Each feature function is a sum of "low-level feature functions": Fj(x, y) = Σ_i fj(x, yi, yi−1, i).
19. As a consequence, "we can have a fixed set of feature functions Fj for log-linear training, even though the training examples are not fixed-length." [1]
20. Example: Part-of-Speech (POS) Tagging. Input (observed) x: word sequence; output (hidden) y: POS tag sequence. For example: x = "He sat on the mat.", y = "pronoun verb preposition article noun".
21. With a CRF we hope the correct tag sequence y receives high conditional probability: p(y | x; w) = exp(Σ_j wj·Fj(x, y)) / Z(x, w), where Fj(x, y) = Σ_i fj(x, yi, yi−1, i).
22. An example of a low-level feature function fj(x, yi, yi−1, i): "the i-th word in x is capitalized, and the POS tag yi = proper noun" [TRUE (1) or FALSE (0)]. If wj is positively large, then with x and all other conditions fixed, y is more probable when fj(x, yi, yi−1, i) is activated. Note that a feature function may not use all the given information.
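The capitalization feature function above can be sketched as follows (the tag names and the example sentence are illustrative):

```python
# 1 if the i-th word of x is capitalized AND y_i is the proper-noun tag.
# The previous tag y_prev is accepted but unused: a feature function
# may not use all the given information.
def f_cap_propn(x, y_i, y_prev, i):
    return 1.0 if x[i][0].isupper() and y_i == "propn" else 0.0

x = ["He", "met", "Alice"]
y = ["pron", "verb", "propn"]

# The global feature function F_j sums f_j over all positions i.
F_j = sum(f_cap_propn(x, y[i], y[i - 1] if i > 0 else None, i)
          for i in range(len(x)))
print(F_j)  # 1.0: "He" is capitalized but not tagged propn; only "Alice" fires
```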
23. CRF Training and Testing
24. Training: stochastic gradient ascent.
    * Partial derivative of the conditional log-likelihood: ∂ log p(y | x; w) / ∂wj = Fj(x, y) − E_{y'~p(y'|x;w)}[Fj(x, y')]
    * Update each weight by wj ← wj + λ·(Fj(x, y) − E_{y'}[Fj(x, y')]), where λ is the learning rate.
25. Note: if the j-th feature function is not activated by a training example, we don't need to update its weight, so usually only a few weights need to be updated in each iteration.
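For a tiny tag set, the gradient Fj(x, y) − E[Fj] can be computed by brute-force enumeration of all label sequences, which makes the update easy to check; the two feature functions and the data below are illustrative:

```python
import math, itertools

# Brute-force sketch of the stochastic gradient-ascent update for a tiny
# CRF-style model (2 tags, length-2 output); everything here is illustrative.
TAGS = ["A", "B"]
x, y_true = ["w1", "w2"], ("A", "A")
all_y = list(itertools.product(TAGS, repeat=len(x)))

def F(x, y):
    # Two hypothetical feature functions, each summed over positions i.
    f1 = sum(1.0 for i in range(len(x)) if y[i] == "A")
    f2 = sum(1.0 for i in range(1, len(x)) if y[i - 1] == y[i])
    return [f1, f2]

w = [0.0, 0.0]

def score(y):
    return math.exp(sum(wj * Fj for wj, Fj in zip(w, F(x, y))))

for _ in range(100):
    Z = sum(score(yp) for yp in all_y)
    # E[F_j] under the current model distribution p(y' | x; w):
    E = [sum(score(yp) * F(x, yp)[j] for yp in all_y) / Z for j in range(2)]
    # w_j += lambda * (F_j(x, y) - E[F_j]), with learning rate lambda = 0.1
    for j in range(2):
        w[j] += 0.1 * (F(x, y_true)[j] - E[j])

print(w)  # both weights become positive, making "A A" the most probable y
```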
26. Testing. For the 1-best derivation: y* = argmax_y p(y | x; w) = argmax_y Σ_j wj·Fj(x, y).
27. Example: POS Tagging. For the 1-best derivation:
    1. Pre-compute g_i(yi−1, yi) = Σ_j wj·fj(x, yi, yi−1, i) as a table for each position i.
    2. Perform dynamic programming over the trellis of tags (N, V, Adj, ...) to find the best sequence y.
28. Complexity: O(M²·L·D), where M is the number of tags (each table has M² entries), L is the sequence length (one table per element of the sequence), and D is the number of feature functions.
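A minimal sketch of the dynamic programming (Viterbi) step over precomputed tables g, here filled with hypothetical random scores for M = 3 tags and L = 4 positions; a fixed dummy start tag 0 stands in for y0's predecessor:

```python
import numpy as np

rng = np.random.default_rng(2)
M, L = 3, 4
# g[i, y_prev, y_cur]: hypothetical per-position score tables;
# row 0 of g[0] serves as the start scores (dummy previous tag fixed to 0).
g = rng.normal(size=(L, M, M))

best = g[0, 0, :].copy()               # best score ending in each tag at i = 0
back = np.zeros((L, M), dtype=int)     # backpointers

for i in range(1, L):
    scores = best[:, None] + g[i]      # scores[y_prev, y_cur]
    back[i] = scores.argmax(axis=0)    # best previous tag for each current tag
    best = scores.max(axis=0)

# Follow the backpointers to recover the 1-best tag sequence.
y = [int(best.argmax())]
for i in range(L - 1, 0, -1):
    y.append(int(back[i, y[-1]]))
y.reverse()
print(y)
```

Each position costs O(M²) table lookups, matching the O(M²·L·D) total once the D feature functions per table entry are counted.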
29. For probability estimation, we must also sum over all possible y (e.g. all possible POS sequences) for the denominator Z(x, w). This can be calculated by matrix multiplication!
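The sum over all possible y can indeed be done with one matrix multiplication per position, a forward pass over exponentiated score tables; the tables g here are hypothetical random scores with a fixed dummy start tag 0:

```python
import numpy as np

rng = np.random.default_rng(2)
M, L = 3, 4
g = rng.normal(size=(L, M, M))   # hypothetical score tables g[i, y_prev, y_cur]

# alpha[y] = sum of exp(score) over all partial sequences ending in tag y
alpha = np.exp(g[0, 0, :])
for i in range(1, L):
    alpha = alpha @ np.exp(g[i])  # one matrix multiplication per position
Z = alpha.sum()
print(Z)   # equals the brute-force sum over all M**L tag sequences
```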
30. Example: Speech Disfluency Detection
31. One application of CRF in speech recognition is boundary/disfluency detection [5]:
    * Repetition: "It is is Tuesday."
    * Hesitation: "It is uh... Tuesday."
    * Correction: "It is Monday, I mean, Tuesday."
    * etc.
    Possible clues come from prosody: pitch, duration, energy, pause, etc. For "It is uh... Tuesday.": is there a pitch reset? a long duration? low energy? a pause?
32. Boundary/disfluency detection with CRF [5]: the CRF input x is prosodic features; the CRF output y is used for speech recognition rescoring.
33. References
    * [1] Charles Elkan, "Log-linear Models and Conditional Random Fields." Tutorial at CIKM 2008 (ACM International Conference on Information and Knowledge Management). Video: http://videolectures.net/cikm08_elkan_llmacrf/ ; lecture notes: http://cseweb.ucsd.edu/~elkan/250B/cikmtutorial.pdf
    * [2] Hanna M. Wallach, "Conditional Random Fields: An Introduction."
    * [3] Jeremy Morris, "Conditional Random Fields: An Overview." Presented at OSU Clippers, January 11, 2008.
34. References (cont.)
    * [4] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," 2001.
    * [5] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching Speech Recognition with Automatic Detection of Sentence Boundaries and Disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, 2006.
