Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets



Presentation with audio: https://www.youtube.com/watch?v=heYj8sCmWCo

Finding the names in tweets is difficult. However, with a few simple modifications to handle the noise and variety in tweets, and an automatic post-editor to fix errors made by the automatic system, it becomes easier.
Full paper: http://derczynski.com/sheffield/papers/person_tweets.pdf



  1. Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets
     Leon Derczynski, Kalina Bontcheva
  2. Problem
     ● Finding person NEs in tweets, a diverse genre
       – Need to know who participates in events / claims
     ● Twitter as the D. melanogaster of social media [1]
     ● Newswire: regulated
       – “our most frequently-used corpora [..] written and edited predominantly by working-age white men” [2]
     ● Twitter: wild; many styles
       – Headlines
       – Conversations
       – Colloquial
       – Just “noise” (hashtags, URLs, mentions)
     1. Tufekci, 2014. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls”. Proc. ICWSM.
     2. Eisenstein, 2013. “What to do about bad language on the internet”. Proc. NAACL.
     Image: “Mr.checker”, Wikimedia Commons
  3. Why person entities?
     ● There are many entity types and classification schemes
       – ACE (PER, GPE, ORG); maybe add PROD
       – Freebase top-level (à la Ritter)
     ● Have a long tail, making them “resistant” to gazetteer approaches
     ● Required to mine conversations and claims
     ● Unfortunately, they're difficult to find in tweets:
       Stanford NER on CoNLL news:   92.29 F1
       Stanford NER on Ritter tweets: 63.20 F1
  4. Machine learning for Twitter NER
     ● We know Twitter's diverse & noisy, so let's add word shape (Xxx) and lemma features
     ● Conventional approaches – sequence labelling
     ● Lots of dysfluency; differs from newswire
     ● What if we throw out the whole-sequence idea and only use local context?
       Stanford  72.19 F1 (up from ~63)
       SVM       75.89 F1
       MaxEnt    76.76 F1
       CRF       78.89 F1
     ● Looks like sequence labelling is useful
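The word-shape feature mentioned on this slide can be sketched as below. The `word_shape` helper and its collapse-repeats behaviour are illustrative assumptions, not the paper's exact implementation:

```python
import re

def word_shape(token: str) -> str:
    """Map a token to its shape: uppercase -> X, lowercase -> x,
    digits -> 9, other characters kept as-is.

    A hypothetical helper illustrating the 'Xxx'-style shape feature;
    runs of the same symbol are collapsed so rare and frequent words
    with the same capitalisation pattern share one feature value.
    """
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "9", shape)
    # Collapse repeats so "Barack" and "Obama" both map to "Xx"
    return re.sub(r"(.)\1+", r"\1", shape)

print(word_shape("Obama"))     # Xx
print(word_shape("@user123"))  # @x9
```

Shapes like `Xx` vs `XX` vs `xx` help the classifier generalise over capitalisation patterns it has never seen spelled out.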
  5. Two ML adaptations
     ● SVM/UM (uneven margins)
       – Hyperplane may lie between two unbalanced classes
       – Move it closer to the minority class, to reflect the prior distribution
     ● CRF-PA (passive-aggressive)
       – Passive: when an example's hinge loss is zero, skip the update
       – Aggressive: when hinge loss > 0, scale down the example's weight
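The passive-aggressive idea on this slide can be illustrated with the classic binary-classifier update (Crammer-style PA); this is a minimal sketch of the principle, not the CRF-PA variant used in the paper:

```python
import numpy as np

def pa_update(w: np.ndarray, x: np.ndarray, y: int) -> np.ndarray:
    """One passive-aggressive update for a binary linear classifier.

    Passive: if the hinge loss on (x, y) is zero, leave w unchanged.
    Aggressive: otherwise move w by the smallest step that gives the
    example a margin of at least 1.
    """
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    if loss > 0.0:                       # aggressive step
        tau = loss / float(np.dot(x, x)) # closed-form step size
        w = w + tau * y * x
    return w

w = np.zeros(2)
x = np.array([1.0, 2.0])
w = pa_update(w, x, 1)   # after this, the example has margin exactly 1
```

A second pass over the same example then triggers the passive branch: the margin constraint is already satisfied, so no update happens.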
  6. Single-pass results
     ● Corpus: person entities from the MSM2013, Ritter, UMBC tweet datasets (86k toks, 1.7k ents)

                     P      R      F
       Stanford   90.60  60.00  72.19
       Ritter     77.23  80.18  78.68
       SVM/UM     81.16  74.97  77.94
       CRF-PA     86.85  74.71  80.32

     ● Honourable mention: MaxEnt, precision 91.10
     ● Ritter: good recall, possibly from a huge bootstrapped integrated resource
     ● How can we improve recall without this?
  7. Recall problems
     ● Typical missed entities:
       – “Under Obama 's tax plan , ...”
       – “delighted for you & Dave !”
       – “Strategies for selling in a slow market : by Denise Calaman”
     ● Looks like things we'd find in a gazetteer
     ● How can we include these without reducing precision?
     ● Post-editing can be effective in fixing up MT output
  8. Post-editing
     ● Formulate as a binary discriminative problem
       – Is a given non-entity text actually a person?
     ● Narrow the search space:
       – Does a token in an out-of-entity sequence begin with a known person name?
     ● Confine the window to two tokens
     ● Given a set of triggers, are the tokens in a bigram beginning with a trigger a person?
       Best Ann Coulter quotes
       Under Obama 's tax plan
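The candidate search described on this slide can be sketched as follows. The function name and the `triggers` set are hypothetical stand-ins for the paper's first-name gazetteer:

```python
def person_candidates(tokens, labels, triggers):
    """Yield (position, bigram) pairs for the post-editing classifier.

    Scans tokens the tagger labelled O (outside any entity); where one
    matches a known first-name trigger, proposes the two-token window
    starting there as a possible missed person mention.
    """
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab == "O" and tok in triggers:
            bigram = tokens[i:i + 2]   # confine the window to two tokens
            yield i, bigram

tokens = ["Under", "Obama", "'s", "tax", "plan"]
labels = ["O"] * 5
print(list(person_candidates(tokens, labels, {"Obama", "Dave"})))
# [(1, ['Obama', "'s"])]
```

Each candidate bigram then goes to the binary classifier, which decides from context whether it really is a person.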
  9. Evaluation
     ● Baselines: no editing, gazetteer term, gazetteer term+1
     ● Goal is to improve recall: use a cost-sensitive SVM

                          Missed-entity F1   Overall
       No editing               0.00          80.32
       Term only                5.82          82.58
       Term+1                   6.05          81.67
       SVM Cost 0.1 (P)        78.26          83.07
       SVM Cost 1.5 (R)        92.73          83.83
       Ritter                     -           78.68
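The cost-sensitive setting on this slide can be approximated with class weights in an off-the-shelf linear SVM. A minimal sketch, assuming scikit-learn; the toy features and the 1.5 weight are illustrative, not the paper's configuration:

```python
from sklearn.svm import LinearSVC

# Up-weighting the positive (missed-entity) class trades precision for
# recall, as with the Cost 0.1 (P) vs Cost 1.5 (R) rows in the table.
X = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]  # toy feature vectors
y = [0, 1, 1, 0]                                       # 1 = person
clf = LinearSVC(class_weight={0: 1.0, 1: 1.5})
clf.fit(X, y)
print(clf.predict([[0.8, 0.2]]))
```

Raising the positive-class weight makes false negatives more costly during training, pushing the hyperplane toward the negative class and lifting recall at some cost in precision.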
 10. Error analysis
     ● False positives:
       – Other-class entities (Huff Post, Exodus Porter)
       – Descriptive titles (Millionaire Rob Ford)
       – Names in non-name senses (Marie Claire)
       – Polysemous names (Mark)
     ● False negatives:
       – Capitalisation (charlie gibson, KANYE WEST)
       – Spelling errors (Russel Crowe)
       – Common nouns (Jack Straw)
       – Uncommon names (Spicy Pickle Jr.)
 11. Conclusion
     ● The PA adaptation of CRF helps NER in a diverse domain
     ● Automatic post-editing improves recall
     ● An SVM using context does much better than a gazetteer
     ● The only external resource is first-name lists
 12. Thank you for your time! Do you have any questions?
     Research partially supported by the European Union under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7), grant PHEME (611233).
 13. Entities in tweets

            News                                 Tweets
       PER  Politicians, business leaders,       Sportsmen, actors, TV personalities,
            journalists, celebrities             celebrities, names of friends
       LOC  Countries, cities, rivers, and       Restaurants, bars, local landmarks/areas,
            other places related to current      cities, rarely countries
            affairs
       ORG  Public and private companies,        Bands, internet companies, sports clubs
            government organisations
