Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data


Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.

Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.

Published in: Technology


  1. 1. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data Leon Derczynski Alan Ritter Sam Clark Kalina Bontcheva
  2. 2. Streaming social media is powerful ● It's Big Data! – Velocity: 500M tweets / day – Volume: 20M users / month – Variety: earthquakes, stocks, this guy ● Sample of all human discourse - unprecedented ● Not only where people are & when, but also what they are doing ● Interesting stuff - just ask the NSA!
  3. 3. Tweets are dirty ● You all know what Twitter is, so let's just look at some difficult tweets ● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx ● Fragments: Bonfire tonite. All are welcome, joe included ● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :) ● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
  4. 4. Tough tweets: Do we even care? ● Most tweets are linguistically fairly well-formed ● RT @DesignerDepot: Minimalist Web Design: When Less is More - ● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying .. ● The tweets we find most difficult are those that seem to say the least ● So im in tha chi whts popping tonight? ● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
  5. 5. We do care ● However, there is utility in trivia: – Sadilek: Predict if you will get flu, using spatial co-location and friend network – Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus – Emerging events: tendency to describe briefly ''There's a dead crow in my garden'' @mari: i think im sick ugh..
  6. 6. Problem representation ● Tweets into finite tokens (PTB + URLs, Smileys) ● Put tokens in categories, depending on linguistic function ● Discriminative – cases one by one – e.g. unigram tagger ● Sequence labelling – order matters! – consider neighbouring labels ● Goal: label the whole sequence correctly
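To make the "cases one by one" idea on the slide above concrete, here is a minimal sketch of a unigram tagger: each token gets the tag it was most often seen with in training, ignoring its neighbours. The function name, the toy training data and the `NN` fallback for unseen words are assumptions made for illustration; they are not taken from the authors' software.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences, default_tag="NN"):
    """Return a tagging function that assigns each word its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the single most frequent tag per word.
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tag(tokens):
        # Unseen words fall back to default_tag (an assumption of this sketch).
        return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]

    return tag


# Hypothetical toy training data, PTB-style tags.
toy_training = [
    [("i", "PRP"), ("love", "VBP"), ("tea", "NN")],
    [("green", "JJ"), ("tea", "NN"), ("time", "NN")],
]
tagger = train_unigram(toy_training)
tagged = tagger(["Green", "tea", "tonite"])
```

A sequence labeller differs in that the choice for one token also depends on the labels of its neighbours, which is what the remaining slides exploit.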
  7. 7. Word order still matters.. just ● Hard for tweets: exclamations and fragments ● Whole sequences a bit rare ● @FeeninforPretty making something to eat, aint ate all day ● Peace green tea time!! Happyzone!!!! :))))) ● Sentence structure cues (e.g. caps) often: – absent – over-used
  8. 8. How do current tools do? ● Badly! – Out of the box: – Trained on Twitter, IRC and WSJ data:
  9. 9. Where do they break? ● Continued work extending Stanford Tagger ● Terrible at doing whole sentences – Best was 10% accuracy – SotA on newswire about 55-60% ● Problems on unknown words – this is a good target set to get better performance on – 1 in 5 words completely unseen – 27% token accuracy on this group
  10. 10. What errors occur on unknowns? ● Gold standard errors (dank_UH je_UH → _FW) ● Training lacks IV words (Internet, bake) ● Pre-taggables (URLs, mentions, retweets) ● NN vs. NNP (derek_NN, Bed_NNP) ● Slang (LUVZ, HELLA, 2night) ● Genre-specific (unfollowing) ● Tokenisation errors (ass**sneezes) ● Orthographic (suprising)
  11. 11. Do we have enough data? ● No, it's even worse than normal – Ritter: 15K tokens, PTB, one annotator – Foster: 14K tokens, PTB, low-noise – CMU: 39K tokens, custom, narrow tagset
  12. 12. Tweet PoS-tagging issues ● From analysis, three big issues identified: 1. Many unseen words / orthographies 2. Uncertain sentence structure 3. Not enough annotated data ● Continued with Ritter dataset
  13. 13. Unseen words in tweets ● Two classes: ● Standard token, non-standard orthography: – freinds – KHAAAANNNNNNN! ● Non-standard token, standard orthography: – omg + bieber = omb – Huntington
  14. 14. Unseen words in tweets ● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto – vids → videos – cussin → cursing – hella → very ● No need to bother with e.g. Brown clustering ● 361 entries give 2.3% token error reduction
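The gazetteer lookup described above can be sketched as a plain dictionary substitution applied before tagging. Only the three entries shown on the slide are included; the names `NORMALISATION_GAZETTEER` and `normalise` are hypothetical, and the real 361-entry list is not reproduced here.

```python
# The three example entries from the slide; the full gazetteer has 361 entries.
NORMALISATION_GAZETTEER = {
    "vids": "videos",
    "cussin": "cursing",
    "hella": "very",
}


def normalise(tokens, gazetteer=NORMALISATION_GAZETTEER):
    """Replace known non-standard spellings; leave every other token untouched."""
    # Lookup is case-insensitive, which is an assumption of this sketch.
    return [gazetteer.get(tok.lower(), tok) for tok in tokens]


normalised = normalise(["HELLA", "good", "vids"])
```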
  15. 15. Unseen words in tweets ● The rest can be handled reasonably with word shape and contextual features ● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare ● Features include: – word prefix and suffix shapes – distribution of shape in corpus – shapes of neighbouring words ● Corpus small, so adjust rare threshold ● +5.35% absolute token acc., +18.5% sentence
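As an illustration of what a word-shape feature looks like, the sketch below maps characters to coarse classes and collapses repeats, then collects shapes for a token, its two-character prefix and suffix, and its neighbours. This is a generic formulation written for this transcript; it is not the implementation in `ExtractorFramesRare`, and the function names, the two-character affix length and the boundary markers are assumptions.

```python
import re


def word_shape(token):
    """Map letters to X/x and digits to d, then collapse repeated symbols."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_features(tokens, i):
    """Shape of token i, of its affixes, and of its immediate neighbours."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "shape": word_shape(tok),
        "prefix_shape": word_shape(tok[:2]),
        "suffix_shape": word_shape(tok[-2:]),
        "prev_shape": word_shape(prev_tok),
        "next_shape": word_shape(next_tok),
    }


features = shape_features(["Bed", "2night", "KHAAAANNNNNNN!"], 1)
```

Collapsing repeats means that `KHAAAANNNNNNN!` and a shorter all-caps exclamation share the shape `X!`, so an unseen elongated word can still borrow evidence from seen ones.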
  16. 16. Tweet “sentence” “structure” ● They are structured (sometimes) ● We still do better if we look at global features – Unigram tagger accuracy: 66% ● Sentence-level accuracy is important – Unigram tagger sentence accuracy: 2.3%
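The slide above quotes both token-level and sentence-level accuracy. A minimal sketch of how the two differ, assuming gold and predicted tags are given as parallel lists of tag sequences (the function name and the toy data are hypothetical):

```python
def token_and_sentence_accuracy(gold, predicted):
    """Return (token accuracy, sentence accuracy) as fractions in [0, 1]."""
    total_tokens = correct_tokens = correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        # A tweet only counts if every single token is right.
        if matches == len(gold_tags) == len(pred_tags):
            correct_sentences += 1
    return correct_tokens / total_tokens, correct_sentences / len(gold)


gold = [["NN", "VB"], ["UH"]]
predicted = [["NN", "NN"], ["UH"]]
token_acc, sentence_acc = token_and_sentence_accuracy(gold, predicted)
```

Because one wrong token spoils the whole tweet, sentence accuracy falls much faster than token accuracy as tweets get longer.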
  17. 17. Tweet “sentence” “structure” ● Tweets contain some constrained-form tokens ● Links, hashtags, user mentions, some smileys ● We can fix the label for these tokens ● Knowing P(c_i) constrains both P(c_{i-1} | c_i) and P(c_{i+1} | c_i)
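Fixing labels for constrained-form tokens can be sketched with a handful of regular expressions. The patterns and the tag names (`URL`, `USR`, `HT`, `RT`, and `UH` for smileys) are assumptions chosen for illustration; the exact patterns and labels used in the released tagger may differ.

```python
import re

# (pattern, tag) pairs; both the regexes and the tag names are illustrative.
FIXED_TAG_PATTERNS = [
    (re.compile(r"^https?://\S+$"), "URL"),
    (re.compile(r"^@\w+$"), "USR"),
    (re.compile(r"^#\w+$"), "HT"),
    (re.compile(r"^RT$"), "RT"),
    (re.compile(r"^[:;=]-?[)(DPp]+$"), "UH"),  # a few common smileys only
]


def fixed_labels(tokens):
    """Return a tag for each token whose form determines it, else None."""
    labels = []
    for tok in tokens:
        for pattern, tag in FIXED_TAG_PATTERNS:
            if pattern.match(tok):
                labels.append(tag)
                break
        else:
            labels.append(None)  # left for the statistical tagger to decide
    return labels


labels = fixed_labels(["RT", "@Huddy85", ":", "so", "#bored", ":)", "http://t.co/x"])
```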
  18. 18. Tweet “sentence” “structure” ● This allows us to prune the transition graph of labels in the sequence ● Because the graph is read in both directions, fixing any label point impacts whole tweet ● Setting label priors reduces token error by 5.03%
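The pruning effect can be illustrated with a small first-order Viterbi decoder in which some positions are restricted to a single label. This is a simplified, left-to-right stand-in written for this transcript, not the bidirectional model of the Stanford Tagger; the function name, the two abstract tags and all probabilities are made up for the example.

```python
import math


def constrained_viterbi(n_tokens, tags, log_emit, log_trans, fixed=None):
    """Best tag path; `fixed` maps a position to the only tag allowed there."""
    fixed = fixed or {}

    def allowed(i):
        return [fixed[i]] if i in fixed else tags

    # best[i][tag] = (score of best path ending in tag at i, previous tag)
    best = [{t: (log_emit(0, t), None) for t in allowed(0)}]
    for i in range(1, n_tokens):
        column = {}
        for t in allowed(i):
            prev, score = max(
                ((p, s + log_trans(p, t)) for p, (s, _) in best[-1].items()),
                key=lambda pair: pair[1],
            )
            column[t] = (score + log_emit(i, t), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n_tokens - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with two abstract tags, A and B.
emit = [{"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.5}]
trans = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.9}
log_emit = lambda i, t: math.log(emit[i][t])
log_trans = lambda p, t: math.log(trans[(p, t)])

free_path = constrained_viterbi(2, ["A", "B"], log_emit, log_trans)
pinned_path = constrained_viterbi(2, ["A", "B"], log_emit, log_trans, fixed={1: "B"})
```

In the toy model, pinning the second token to `B` flips the first token from `A` to `B` as well, which is the sense in which one fixed label propagates to its neighbours.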
  19. 19. Not enough data ● Big unlabelled data - 75 000 000 tweets / day (en) ● Bootstrapping sometimes helps in this case ● Problem: initial accuracy is too low ● Solution: consensus with > 1 tagger ● Problem: only one tagger using PTB tags ● Solution: Vote-constrained Bootstrapping ● Emoticons shown tagged as interjections: •︵ _UH, ◕ ◡ ◕ _UH, ⋋〴 _⋌ 〵 _UH, _⊙ ʘ _UH
  20. 20. Vote-constrained bootstrapping ● Not many taggers available for building semi-supervised data ● We chose Ritter's plus the CMU tagger ● Where classes don't map 1:1, create equivalence classes between tags – CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS) – CMU tag ! (interjection) → PTB (UH) ● Coarser tag constrains set of fine-grained tags
  21. 21. Vote-constrained bootstrapping ● Ask both taggers to label the candidate input ● Add tweet to semi-supervised data if both agree ● Lebron_^ + Lebron_NNP → OK, Lebron_NNP ● books_N + books_VBZ → Fail, reject tweet ● Evaluated quality on development set – Agreed on 17.8% of tweets – Of those, 97.4% of tokens correctly PTB labelled – 71.3% of whole tweets correctly labelled
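The agreement test can be sketched as follows: a tweet is kept only if, for every token, the fine-grained PTB tag falls inside the equivalence class of the coarse CMU tag. The `R` and `!` classes come from the slides; the `^`, `N` and `V` classes are assumptions added so the slide's two examples can be run, and the function name `vote` is hypothetical.

```python
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # from the slides
    "!": {"UH"},                       # from the slides
    "^": {"NNP", "NNPS"},              # assumed for this sketch
    "N": {"NN", "NNS"},                # assumed for this sketch
    "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},  # assumed for this sketch
}


def vote(cmu_tagged, ptb_tagged):
    """Return the PTB-tagged tweet if both taggers agree on every token, else None."""
    if len(cmu_tagged) != len(ptb_tagged):
        return None  # tokenisations differ, so the tweet cannot be aligned
    agreed = []
    for (word, coarse), (_, fine) in zip(cmu_tagged, ptb_tagged):
        if fine not in CMU_TO_PTB.get(coarse, set()):
            return None  # one disagreement rejects the whole tweet
        agreed.append((word, fine))
    return agreed


accepted = vote([("Lebron", "^")], [("Lebron", "NNP")])
rejected = vote([("books", "N")], [("books", "VBZ")])
```

Rejecting the whole tweet on a single disagreement trades coverage for precision, which is consistent with the low agreement rate and high token accuracy reported on the slide.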
  22. 22. Vote-constrained bootstrapping ● Results: – Use Trendminer lang ID + data – Collected 1.5M agreed-upon tokens ● Adding this bootstrapped data reduced error by: – Token-level: 13.7% Sentence-level: 4.5%
  23. 23. Final results ● Unknown-word accuracy: from 27.8% to 74.5% ● Accuracy (%), token / sentence: – Baseline (Ritter T-Pos): 84.55 / 9.32 – GATE, eval set: 88.69 / 20.34 (error reduction 26.80 / 12.15) – GATE, dev set: 90.54 / 28.81 (error reduction 38.77 / 21.49)
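The error-reduction figures follow from the accuracies by comparing error rates rather than accuracies. A short sketch of the arithmetic, using only the numbers on the slide (the function name is hypothetical):

```python
def error_reduction(baseline_acc, new_acc):
    """Relative reduction in error rate, in percent, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


eval_token = error_reduction(84.55, 88.69)     # about 26.80
eval_sentence = error_reduction(9.32, 20.34)   # about 12.15
dev_token = error_reduction(84.55, 90.54)      # about 38.77
dev_sentence = error_reduction(9.32, 28.81)    # about 21.49
```

For example, token error falls from 15.45% to 11.31% on the evaluation set, and 4.14 / 15.45 is roughly 26.8%.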
  24. 24. Where do we go next? ● Local tag sequence bounds? ● Better handling of hashtags – I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy – I'm so #bored today ● More data – bootstrapped ● More data – part-bootstrapped (e.g. CMU GS) ● More data – human annotated ● Parsing
  25. 25. Downloadable & Friendly ● As command-line tool; as GATE PR; as Stanford Tagger model ● Included in GATE's TwitIE toolkit (4pm, Europa) ● 1.5M token dataset available ● Updates since submission: – Better handling of contractions – Less sensitive to tokenisation scheme ● Please play!
  26. 26. Thank you for your time! There is hope: Jersey Shore is overrated. studying and history homework then a fat night of sleep! Do you have any questions?
  27. 27. Owoputi et al. ● NAACL'13 paper: 90.5% token perf w/ PTB accuracy ● Advancement of the Gimpel tagger, used for our bootstrapping ● Late discovery: Can be adapted to PTB tagset with good results ● We use disjoint techniques to Owoputi; combining them could give an even better result! ● Our model readily re-usable and integrated into existing NLP tool sets
  28. 28. Capitalisation ● Noisy tweets have unusual capitalisation, right? – Buy Our Widgets Now – ugh I haet u all .. stupd ppl #fml ● Lowercase model with lowercased data allows us to ignore capitalisation noise ● Tried multiple approaches to classifying noisy vs. well-formed capitalisation ● Gain from ignoring case in noisy tweets offset by loss from mis-classified well-cased data