Adaptive Parser-Centric Text Normalization

Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for best paper award and presented at ACL 2013.

Adaptive Parser-Centric Text Normalization.
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li.
Proceedings of ACL, pp. 1159–1168, 2013.

Adaptive Parser-Centric Text Normalization

  1. Adaptive Parser-Centric Text Normalization. Congle Zhang* Tyler Baldwin** Howard Ho** Benny Kimelfeld** Yunyao Li** (* University of Washington, ** IBM Research - Almaden)
  2. Public text (web text) and private text flow into text analytics, which powers applications: marketing, financial investment, drug discovery, law enforcement, and more. Sources include social media, news, SEC filings, internal data, subscription data, and USPTO. Text analytics is the key to discovering hidden value from text.
  3. DREAM
  4. REALITY
  5. Image from http://samasource.org
  6. CAN YOU READ THIS ON THE FIRST ATTEMPT?
  7. ay woundent of see ' em. CAN YOU READ THIS ON THE FIRST ATTEMPT? (Answer: I would not have seen them.)
  8. When a machine reads it: results from Google translation. Chinese: 唉看见他们 woundent; Spanish: ay woundent de verlas; Japanese: ローマ法王進呈の AY woundent; Portuguese: ay woundent de vê-los; German: ay woundent de voir 'em
  9. Text Normalization • Informal writing → standard written form. Example: "ay woundent of see ' em" normalizes to "I would not have seen them ."
  10. Challenge: Grammar. Mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form ≠ text normalization: on "ay woundent of see ' em", word-to-word mapping yields "would not of see them", whereas the target is "I would not have seen them."
  11. Challenge: Domain Adaptation. Tailor the same text normalization solution to the different writing styles of different data sources.
  12. Challenge: Evaluation • Previous: word error rate & BLEU score • However: – words are not equally important – non-word information (punctuation, capitalization) can be important – word reordering is important • How does the normalization actually impact the downstream applications?
  13. Adaptive Parser-Centric Text Normalization: grammatical sentences, domain transferability, parsing performance.
  14. Outline • Model • Inference • Learning • Instantiation • Evaluation • Conclusion
  15. Model: Replacement Generator • Replacement <i,j,s>: replace tokens x_i … x_{j-1} with s • Domain customization – generic (cross-domain) replacements – domain-specific replacements. Example on "Ay_1 woudent_2 of_3 see_4 'em_5": <2,3,"would not"> (edit), <1,2,"Ay"> (same), <1,2,"I"> (edit), <1,2,ε> (delete), <6,6,"."> (insert), … (see the sketch below)
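     To make the representation concrete, here is a minimal Python sketch of <i,j,s> replacements and two generic generators. This is not the authors' code; the tuple layout and generator names are illustrative assumptions.

        # Illustrative sketch of <i,j,s> replacements; not the paper's implementation.
        from collections import namedtuple

        # A replacement <i, j, s> rewrites tokens x_i .. x_{j-1} as the string s
        # (s = "" models a deletion; i == j models an insertion).
        Replacement = namedtuple("Replacement", ["i", "j", "s"])

        def leave_intact(tokens):
            """Generic generator: propose keeping every token unchanged."""
            return [Replacement(k, k + 1, tok) for k, tok in enumerate(tokens)]

        def insert_final_period(tokens):
            """Generic generator: propose an <n,n,'.'> insertion at the end."""
            n = len(tokens)
            return [Replacement(n, n, ".")]

        tokens = ["Ay", "woudent", "of", "see", "'em"]
        candidates = leave_intact(tokens) + insert_final_period(tokens)
        # Domain-specific generators (slang dictionaries, etc.) would be appended here.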
  16. Model: Boolean Variables • Associate a unique Boolean variable X_r with each replacement r – X_r = true: replacement r is used to produce the output sentence. Example: <2,3,"would not"> = true → "… would not …"
  17. Model: Normalization Graph • A graphical model over "Ay woudent of see 'em", with nodes *START*, <1,2,"Ay">, <1,2,"I">, <2,4,"would not have">, <2,3,"would">, <3,4,"of">, <4,6,"see him">, <4,5,"seen">, <5,6,"them">, <6,6,".">, and *END* (a construction sketch follows below)
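     One plausible way to assemble such a graph, assuming (as the slide's figure suggests) that an edge connects two replacements exactly when one ends where the other begins; the helper names are mine, not the paper's:

        # Sketch of the normalization graph as a DAG; conventions assumed from
        # the slide's figure, not taken from the authors' code.
        from collections import namedtuple

        Replacement = namedtuple("Replacement", ["i", "j", "s"])

        def build_graph(replacements, n_tokens):
            start = Replacement(0, 0, "*START*")
            end = Replacement(n_tokens, n_tokens, "*END*")
            nodes = [start] + sorted(replacements) + [end]
            # Edge u -> v whenever u ends exactly where v begins; a path from
            # *START* to *END* is then sound (no overlaps) and complete
            # (every token covered), i.e. a legal assignment.
            edges = [(u, v) for u in nodes for v in nodes
                     if u is not v and u.j == v.i
                     and u is not end and v is not start]
            return nodes, edges

        # The slide's graph over "Ay woudent of see 'em" (0-indexed here):
        reps = [Replacement(0, 1, "Ay"), Replacement(0, 1, "I"),
                Replacement(1, 3, "would not have"), Replacement(1, 2, "would"),
                Replacement(2, 3, "of"), Replacement(3, 5, "see him"),
                Replacement(3, 4, "seen"), Replacement(4, 5, "them"),
                Replacement(5, 5, ".")]
        nodes, edges = build_graph(reps, 5)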
  18. Model: Legal Assignment • Soundness – any two true replacements do not overlap – <1,2,"Ay"> and <1,2,"I"> cannot both be true • Completeness – every input token is captured by at least one true replacement
  19. Model: Legal = Path • A legal assignment is a path from *START* to *END* through the normalization graph. Example path output: "I would not have see him."
  20. Model: Assignment Probability • Log-linear model, with feature functions on the edges of the normalization graph (written out below)
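     Written out explicitly, as a standard log-linear formulation consistent with this slide (y is a legal assignment, i.e. a *START*-to-*END* path; phi(e) is the feature vector on edge e; Y(x) is the set of legal assignments for input x):

        P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in y} \theta \cdot \phi(e) \Big),
        \qquad
        Z(x) = \sum_{y' \in \mathcal{Y}(x)} \exp\Big( \sum_{e \in y'} \theta \cdot \phi(e) \Big)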
  21. Outline • Model • Inference • Learning • Instantiation • Evaluation • Conclusion
  22. Inference • Select the assignment with the highest probability • Computationally hard on general graphical models • But in our model it boils down to finding the longest path in a weighted directed acyclic graph
  23. Inference • Weighted longest path over the normalization graph; on the running example, the highest-scoring path yields "I would not have see him." (a dynamic-programming sketch follows below)
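     Because the graph is acyclic, the best assignment falls out of a standard longest-path dynamic program. A generic sketch (not the authors' implementation), assuming nodes arrive in topological order (sorting replacements by position gives one) and edge weights are the theta·phi(e) scores:

        # Generic weighted-longest-path DP on a DAG; a sketch, not the paper's code.
        def longest_path(nodes, out_edges, start, end):
            """nodes: in topological order; out_edges[u]: list of (v, weight)."""
            best = {start: 0.0}
            back = {}
            for u in nodes:
                if u not in best:              # not reachable from start
                    continue
                for v, w in out_edges.get(u, []):
                    if best[u] + w > best.get(v, float("-inf")):
                        best[v] = best[u] + w
                        back[v] = u
            # Recover the best *START*-to-*END* path via backpointers.
            path, node = [end], end
            while node != start:
                node = back[node]
                path.append(node)
            return path[::-1], best[end]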
  24. Outline • Model • Inference • Learning • Instantiation • Evaluation • Conclusion
  25. Learning • Perceptron-style algorithm – update the weights by comparing (1) the most probable output under the current weights with (2) the gold sequence. Input: the informal sentence ("Ay woudent of see 'em"), the gold sentence ("I would not have seen them."), and the graph; output: the feature weights.
  26. Learning: Gold vs. Inferred • On the normalization graph, compare the gold sequence against the most probable sequence under the current θ.
  27. Learning: Update Weights on the Differential Edges • Increase the weights w_i on the edges where the gold path differs from the inferred one, so that the gold sequence becomes "longer" (a sketch of the update follows below)
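     A sketch of that update as standard structured-perceptron bookkeeping (function and argument names are assumptions, not the paper's API). Updating over all edges of both paths is equivalent to updating only the differential edges, since features on shared edges cancel:

        # Perceptron-style weight update; a sketch, not the authors' implementation.
        def perceptron_update(theta, gold_edges, pred_edges, lr=1.0):
            """gold_edges / pred_edges: one feature dict per edge on each path.
            Shared edges cancel out, so only the differential edges move theta."""
            for feats in gold_edges:           # make the gold path "longer"
                for f, v in feats.items():
                    theta[f] = theta.get(f, 0.0) + lr * v
            for feats in pred_edges:           # make the wrong prediction "shorter"
                for f, v in feats.items():
                    theta[f] = theta.get(f, 0.0) - lr * v
            return theta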
  28. Outline • Model • Inference • Learning • Instantiation • Evaluation • Conclusion
  29. Instantiation: Replacement Generators. Generator (from → to): leave intact (good → good); edit distance (bac → back); lowercase (NEED → need); capitalize (it → It); Google spell (dispaear → disappear); contraction (wouldn't → would not); slang language (ima → I am going to); insert punctuation (ε → .); duplicated punctuation (!? → !); delete filler (lmao → ε)
  30. Instantiation: Features • N-gram – frequency of the phrases induced by an edge • Part-of-speech – encourage certain behavior, such as avoiding the deletion of noun phrases • Positional – capitalize words after stop punctuation • Lineage – which generator spawned the replacement (an n-gram example follows below)
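     For instance, the n-gram family might be realized roughly like this; a sketch in which the count table is a made-up placeholder where a real system would query a corpus or language model:

        # Sketch of an n-gram edge feature: log-frequency of the boundary bigram
        # induced by an edge r1 -> r2; counts here are illustrative placeholders.
        import math

        ngram_counts = {("would", "not"): 120000, ("woudent", "of"): 3}

        def ngram_feature(left_phrase, right_phrase):
            """Score the bigram spanning the boundary between two replacements."""
            left, right = left_phrase.split(), right_phrase.split()
            if not left or not right:
                return 0.0
            bigram = (left[-1], right[0])
            return math.log1p(ngram_counts.get(bigram, 0))

        print(ngram_feature("Ay would", "not"))    # seen bigram -> ~11.7
        print(ngram_feature("would not", "have"))  # unseen bigram -> 0.0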
  31. Outline • Model • Inference • Learning • Instantiation • Evaluation • Conclusion
  32. Evaluation Metrics: Compare Parses. The input sentence is normalized twice: by a human expert into a gold sentence and by the normalizer into a normalized sentence. Both are parsed, and the gold parse is compared against the normalized parse, focusing on subjects, verbs, and objects (SVO).
  33. Evaluation Metrics: Example. Test: "I kinda wanna get ipad NEW" → verb(get); subj(get,I), subj(get,wanna), obj(get,NEW). Gold: "I kind of want to get a new iPad." → verb(want), verb(get); subj(want,I), subj(get,I), obj(get,iPad). precision_v = 1/1, recall_v = 1/2; precision_so = 1/3, recall_so = 1/3. (a sketch of the computation follows below)
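     The metric itself is set precision/recall over relation tuples; a small sketch (helper names are mine) that reproduces the slide's verb numbers:

        # SVO-overlap metric as set precision/recall/F1; a sketch of the idea.
        def prf(test, gold):
            test, gold = set(test), set(gold)
            tp = len(test & gold)
            p = tp / len(test) if test else 0.0
            r = tp / len(gold) if gold else 0.0
            f = 2 * p * r / (p + r) if p + r else 0.0
            return p, r, f

        test_verbs = {("verb", "get")}
        gold_verbs = {("verb", "want"), ("verb", "get")}
        print(prf(test_verbs, gold_verbs))   # precision 1/1, recall 1/2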
  34. Evaluation: Baselines • w/oN: without normalization • Google: Google spell checker • w2wN: word-to-word normalization [Han and Baldwin 2011] • Gw2wN: gold-standard word-to-word normalizations from previous work (where available)
  35. Evaluation: Domains • Twitter [Han and Baldwin 2011] – gold: grammatical sentences • SMS [Choudhury et al 2007] – gold: grammatical sentences • Call-Center Log: proprietary – text-based responses about users' experience with a call center for a major company – gold: grammatical sentences
  36. Evaluation: Twitter • Twitter-specific replacement generators – hashtags (#), at-mentions (@), and retweets (RT) – generators that allowed either the initial symbol or the entire token to be deleted
  37. Evaluation: Twitter
      System            Verb (Pre / Rec / F1)    Subject-Object (Pre / Rec / F1)
      w/oN              83.7 / 68.1 / 75.1       31.7 / 38.6 / 34.8
      Google            88.9 / 78.8 / 83.5       36.1 / 46.3 / 40.6
      w2wN              87.5 / 81.5 / 84.4       44.5 / 58.9 / 50.7
      Gw2wN             89.8 / 83.8 / 86.7       46.9 / 61.0 / 53.0
      generic           91.7 / 88.9 / 90.3       53.6 / 70.2 / 60.8
      domain specific   95.3 / 88.7 / 91.9       72.5 / 76.3 / 74.4
      Domain-specific generators yielded the best overall performance.
  38. Evaluation: Twitter (same results table as slide 37). Even without domain-specific generators, our system outperformed the word-to-word normalization approaches.
  39. Evaluation: Twitter (same results table as slide 37). Even perfect word-to-word normalization is not good enough!
  40. Evaluation: SMS. SMS-specific replacement generator: a mapping dictionary of SMS abbreviations.
  41. Evaluation: SMS
      System            Verb (Pre / Rec / F1)    Subject-Object (Pre / Rec / F1)
      w/oN              76.4 / 48.1 / 59.0       19.5 / 21.5 / 20.4
      Google            85.1 / 61.6 / 71.5       22.4 / 26.2 / 24.1
      w2wN              78.5 / 61.5 / 68.9       29.9 / 36.0 / 32.6
      Gw2wN             87.6 / 76.6 / 81.8       38.0 / 50.6 / 43.4
      generic           86.5 / 77.4 / 81.7       35.5 / 47.7 / 40.7
      domain specific   88.1 / 75.0 / 81.0       41.0 / 49.5 / 44.8
  42. Evaluation: Call-Center. Call center-specific generator: a mapping dictionary of call center abbreviations (e.g. "rep." → "representative").
  43. Evaluation: Call-Center
      System            Verb (Pre / Rec / F1)    Subject-Object (Pre / Rec / F1)
      w/oN              98.5 / 97.1 / 97.8       69.2 / 66.1 / 67.6
      Google            99.2 / 97.9 / 98.5       70.5 / 67.3 / 68.8
      generic           98.9 / 97.4 / 98.1       71.3 / 67.9 / 69.6
      domain specific   99.2 / 97.4 / 98.3       87.9 / 83.1 / 85.4
  44. Discussion • Domain transfer with a small amount of effort is possible • Performing normalization is indeed beneficial to dependency parsing – simple word-to-word normalization is not enough
  45. Conclusion • Normalization framework with an eye toward domain adaptation • Parser-centric view of normalization • Our system outperformed competitive baselines over three different domains • Dataset to spur future research – https://www.cs.washington.edu/node/9091/
  46. Team
