Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data


Published on

Download software:

Original paper:

Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.

Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.

Published in: Technology

  1. 1. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data Leon Derczynski Alan Ritter Sam Clark Kalina Bontcheva
  2. 2. Streaming social media is powerful ● It's Big Data! – Velocity: 500M tweets / day – Volume: 20M users / month – Variety: earthquakes, stocks, this guy ● Sample of all human discourse - unprecedented ● Not only where people are & when, but also what they are doing ● Interesting stuff - just ask the NSA!
  3. 3. Tweets are dirty ● You all know what Twitter is, so let's just look at some difficult tweets ● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx ● Fragments: Bonfire tonite. All are welcome, joe included ● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :) ● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
  4. 4. Tough tweets: Do we even care? ● Most tweets are linguistically fairly well-formed ● RT @DesignerDepot: Minimalist Web Design: When Less is More - ● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying .. ● The tweets we find most difficult are those that seem to say the least ● So im in tha chi whts popping tonight? ● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
  5. 5. We do care ● However, there is utility in trivia: – Sadilek: Predict if you will get flu, using spatial co-location and friend network – Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus – Emerging events: tendency to describe briefly ''There's a dead crow in my garden'' @mari: i think im sick ugh..
  6. 6. Problem representation ● Tweets into finite tokens (PTB + URLs, Smileys) ● Put tokens in categories, depending on linguistic function ● Discriminative – cases one by one – e.g. unigram tagger ● Sequence labelling – order matters! – consider neighbouring labels ● Goal: label the whole sequence correctly
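As a point of comparison for the case-by-case approach named above, here is a minimal sketch of a unigram tagger: each word gets the tag it carried most often in training, with no regard for its neighbours. The function names, the lowercasing step and the `NN` fallback for unseen words are assumptions made for illustration, not details from the slides.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, tokens, default="NN"):
    """Tag tokens one by one; unseen words fall back to an assumed default tag."""
    return [model.get(token.lower(), default) for token in tokens]
```

Because every token is decided in isolation, one unseen or ambiguous word is enough to get the whole tweet wrong, which is why sequence-level accuracy for this kind of tagger tends to be low.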
  7. 7. Word order still matters.. just ● Hard for tweets: exclamations and fragments ● Whole sequences a bit rare ● @FeeninforPretty making something to eat, aint ate all day ● Peace green tea time!! Happyzone!!!! :))))) ● Sentence structure cues (e.g. caps) often: – absent – over-used
  8. 8. How do current tools do? ● Badly! – Out of the box: – Trained on Twitter, IRC and WSJ data:
  9. 9. Where do they break? ● Continued work extending Stanford Tagger ● Terrible at doing whole sentences – Best was 10% accuracy – SotA on newswire about 55-60% ● Problems on unknown words – this is a good target set to get better performance on – 1 in 5 words completely unseen – 27% token accuracy on this group
  10. 10. What errors occur on unknowns? ● Gold standard errors (dank_UH je_UH → _FW) ● Training lacks IV words (Internet, bake) ● Pre-taggables (URLs, mentions, retweets) ● NN vs. NNP (derek_NN, Bed_NNP) ● Slang (LUVZ, HELLA, 2night) ● Genre-specific (unfollowing) ● Tokenisation errors (ass**sneezes) ● Orthographic (suprising)
  11. 11. Do we have enough data? ● No, it's even worse than normal – Ritter: 15K tokens, PTB, one annotator – Foster: 14K tokens, PTB, low-noise – CMU: 39K tokens, custom, narrow tagset
  12. 12. Tweet PoS-tagging issues ● From analysis, three big issues identified: 1. Many unseen words / orthographies 2. Uncertain sentence structure 3. Not enough annotated data ● Continued with Ritter dataset
  13. 13. Unseen words in tweets ● Two classes: ● Standard token, non-standard orthography; – freinds – KHAAAANNNNNNN! ● Non-standard token, standard orthography – omg + bieber = omb – Huntington
  14. 14. Unseen words in tweets ● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto – vids → videos – cussin → cursing – hella → very ● No need to bother with e.g. Brown clustering ● 361 entries give 2.3% token error reduction
  15. 15. Unseen words in tweets ● The rest can be handled reasonably with word shape and contextual features ● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare ● Features include: – word prefix and suffix shapes – distribution of shape in corpus – shapes of neighbouring words ● Corpus small, so adjust rare threshold ● +5.35% absolute token acc., +18.5% sentence
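The two-stage handling of unseen words (gazetteer lookup first, shape features for the rest) can be sketched roughly as follows. This is an illustration in Python, not the Stanford tagger's `ExtractorFramesRare` code: the three gazetteer pairs are the ones quoted on the slides (the full list has 361 entries), while the function names, the shape alphabet (`X`, `x`, `d`) and the choice to squeeze repeated characters are assumptions.

```python
import re

# Three entries quoted on the slides; the real gazetteer has 361.
GAZETTEER = {"vids": "videos", "cussin": "cursing", "hella": "very"}


def normalise(token):
    """Return the gazetteer replacement if there is one, else the token as-is."""
    return GAZETTEER.get(token.lower(), token)


def word_shape(token):
    """Coarse shape: uppercase -> X, lowercase -> x, digit -> d, runs squeezed."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_features(tokens, i):
    """Shape features for token i: its own shape plus its neighbours' shapes."""
    return {
        "shape": word_shape(tokens[i]),
        "prefix_shape": word_shape(tokens[i][:2]),
        "suffix_shape": word_shape(tokens[i][-2:]),
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
    }
```

With this squeezing rule, `KHAAAANNNNNNN!` and any other shouted word followed by an exclamation mark share the shape `X!`, so a small corpus still yields usable statistics for tokens it has never seen.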
  16. 16. Tweet “sentence” “structure” ● They are structured (sometimes) ● We still do better if we look at global features – Unigram tagger accuracy: 66% ● Sentence-level accuracy is important – Unigram tagger sentence accuracy: 2.3%
  17. 17. Tweet “sentence” “structure” ● Tweets contain some constrained-form tokens ● Links, hashtags, user mentions, some smileys ● We can fix the label for these tokens ● Knowing P(c_i) constrains both P(c_{i-1}|c_i) and P(c_{i+1}|c_i)
  18. 18. Tweet “sentence” “structure” ● This allows us to prune the transition graph of labels in the sequence ● Because the graph is read in both directions, fixing any label point impacts whole tweet ● Setting label priors reduces token error 5.03%
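To make the pruning idea concrete, here is a toy sketch in which constrained-form tokens are allowed exactly one label, so the search never explores any other path through them. It uses a plain first-order Viterbi search, which is a stand-in and not the bidirectional model of the Stanford tagger. The tag names (`URL`, `USR`, `HT`), the regular expressions, and the `emit` and `trans` scoring callbacks are all assumptions for illustration.

```python
import re

TAGS = ["NN", "VB", "UH", "URL", "USR", "HT"]

# Hypothetical patterns for constrained-form tokens and the label each one fixes.
FIXED = [
    (re.compile(r"^https?://\S+$"), "URL"),
    (re.compile(r"^@\w+$"), "USR"),
    (re.compile(r"^#\w+$"), "HT"),
]


def allowed_tags(token):
    """A constrained-form token keeps one tag; any other token keeps them all."""
    for pattern, tag in FIXED:
        if pattern.match(token):
            return [tag]
    return TAGS


def viterbi(tokens, emit, trans):
    """Best tag path under additive scores, restricted to allowed_tags()."""
    # Each column maps tag -> (best score ending in that tag, previous tag).
    columns = [{t: (emit(tokens[0], t), None) for t in allowed_tags(tokens[0])}]
    for i in range(1, len(tokens)):
        column = {}
        for tag in allowed_tags(tokens[i]):
            prev, score = max(
                ((p, s + trans(p, tag)) for p, (s, _) in columns[-1].items()),
                key=lambda pair: pair[1],
            )
            column[tag] = (score + emit(tokens[i], tag), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = columns[i][tag][1]
        path.append(tag)
    return path[::-1]
```

A fixed token collapses its column to a single entry, so every path through the tweet must pass through that label. That is how fixing one position influences the labels chosen on either side of it.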
  19. 19. Not enough data ● Big unlabelled data - 75 000 000 tweets / day (en) ● Bootstrapping sometimes helps in this case ● Problem: initial accuracy is too low ● Solution: consensus with > 1 tagger ● Problem: only one tagger using PTB tags ● Solution: Vote-constrained Bootstrapping ● Emoticons shown on the slide with their tags: •︵ _UH, ◕ ◡ ◕ _UH, ⋋〴 _⋌ 〵 _UH, _⊙ ʘ _UH
  20. 20. Vote-constrained bootstrapping ● Not many taggers available for building semi-supervised data ● We chose Ritter's plus the CMU tagger ● Where classes don't map 1:1 ● Create equivalence classes between tags – CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS) – CMU tag ! (interjection) → PTB (UH) ● Coarser tag constrains set of fine-grained tags
  21. 21. Vote-constrained bootstrapping ● Ask both taggers to label the candidate input ● Add tweet to semi-supervised data if both agree ● Lebron_^ + Lebron_NNP → OK, Lebron_NNP ● books_N + books_VBZ → Fail, reject tweet ● Evaluated quality on development set – Agreed on 17.8% of tweets – Of those, 97.4% of tokens correctly PTB labelled – 71.3% whole tweets correctly labelled
  22. 22. Vote-constrained bootstrapping ● Results: – Use Trendminer lang ID + data – Collected 1.5M agreed-upon tokens ● Adding this bootstrapped data reduced error by: – Token-level: 13.7% Sentence-level: 4.5%
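The agreement test behind vote-constrained bootstrapping can be sketched as below. The `R` and `!` rows of the mapping are taken from the slides; the `^` and `N` rows are assumptions chosen to be consistent with the Lebron and books examples, and the full mapping is not reproduced here. The function names are hypothetical, and the sketch assumes both taggers produced the same tokenisation.

```python
# Partial mapping from coarse CMU tags to the PTB tags they permit.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # from the slides
    "!": {"UH"},                       # from the slides
    "^": {"NNP", "NNPS"},              # assumption: proper nouns
    "N": {"NN", "NNS"},                # assumption: common nouns
}


def tokens_agree(cmu_tag, ptb_tag):
    """True if the PTB tag lies in the equivalence class of the CMU tag."""
    return ptb_tag in CMU_TO_PTB.get(cmu_tag, set())


def accept_tweet(cmu_tags, ptb_tags):
    """Keep a tweet only when the two taggers agree on every token."""
    return len(cmu_tags) == len(ptb_tags) and all(
        tokens_agree(c, p) for c, p in zip(cmu_tags, ptb_tags)
    )
```

Accepted tweets keep the fine-grained PTB labels, so the coarse tagger acts only as a veto. One disagreement rejects the whole tweet, which matches the low agreement rate and the high token accuracy reported for the accepted set.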
  23. 23. Final results ● Unknown accuracy rate: from 27.8% to 74.5% ● Accuracy (%) and error reduction relative to the baseline (%):
      System                    Token   Sentence
      Baseline: Ritter T-Pos    84.55     9.32
      GATE: eval set            88.69    20.34
       - error reduction        26.80    12.15
      GATE: dev set             90.54    28.81
       - error reduction        38.77    21.49
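The error-reduction figures in the final results follow from the accuracy figures: the error is 100 minus the accuracy, and the reduction is the share of the baseline error that was removed. A short sketch of that arithmetic follows; the function name is hypothetical.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given two accuracies in percent."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * (new_acc - baseline_acc) / baseline_error
```

For the evaluation set at token level, the baseline error is 100 - 84.55 = 15.45 points and the gain is 88.69 - 84.55 = 4.14 points, so the reduction is 4.14 / 15.45, or about 26.80%.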
  24. 24. Where do we go next? ● Local tag sequence bounds? ● Better handling of hashtags – I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy – I'm so #bored today ● More data – bootstrapped ● More data – part-bootstrapped (e.g. CMU GS) ● More data – human annotated ● Parsing
  25. 25. Downloadable & Friendly ● As command-line tool; as GATE PR; as Stanford Tagger model ● Included in GATE's TwitIE toolkit (4pm, Europa) ● 1.5M token dataset available ● Updates since submission: – Better handling of contractions – Less sensitive to tokenisation scheme ● Please play!
  26. 26. Thank you for your time! There is hope: Jersey Shore is overrated. studying and history homework then a fat night of sleep! Do you have any questions?
  27. 27. Owoputi et al. ● NAACL'13 paper: 90.5% token perf w/ PTB accuracy ● Advancement of the Gimpel tagger, used for our bootstrapping ● Late discovery: Can be adapted to PTB tagset with good results ● We use disjoint techniques to Owoputi; combining them could give an even better result! ● Our model readily re-usable and integrated into existing NLP tool sets
  28. 28. Capitalisation ● Noisy tweets have unusual capitalisation, right? – Buy Our Widgets Now – ugh I haet u all .. stupd ppl #fml ● Lowercase model with lowercased data allows us to ignore capitalisation noise ● Tried multiple approaches to classifying noisy vs. well-formed capitalisation ● Gain from ignoring case in noisy tweets offset by loss from mis-classified well-cased data