
Named Entity Recognition for Twitter Microposts (only) using Distributed Word Representations

As part of the Named Entity Recognition for Twitter Microposts shared task at ACL 2015, we propose a solution that uses only word embeddings. The word embeddings model is trained on 400 million tweets and is available at http://www.fredericgodin.com/software/.



  1. Named Entity Recognition for Twitter Microposts (only) using Distributed Word Representations. Fréderic Godin, Baptist Vandersmissen, Wesley De Neve & Rik Van de Walle, Multimedia Lab (ELIS), Ghent University – iMinds. Find me at: @frederic_godin / www.fredericgodin.com
  2. Introduction (slide header throughout the deck: ELIS – Multimedia Lab | NER in Twitter Microposts using distributed word representations | Fréderic Godin et al. | 31 July 2015). Goal: recognizing 10 types of named entities (NEs) in noisy Twitter microposts. Problem: tweets contain spelling mistakes and slang, and lack uniform grammar rules.
  3. Traditional solutions. Typical features: orthographic features, gazetteers, corpus statistics, or other parsing techniques (PoS tagging and chunking). Typical machine learning techniques: CRF, HMM.
  4. An overview of the used approaches:

     System         POS  Orthographic  Gazetteers  Brown clustering  Word embedding        ML                        F1 (%)
     ousia          X    X             X           –                 GloVe                 entity linking using SVM  56.41
     NLANGP         –    X             X           X                 word2vec & GloVe      CRF++                     51.40
     nrc            –    –             X           X                 word2vec              semi-Markov MIRA          44.74
     multimedialab  –    –             –           –                 word2vec              FFNN                      43.75
     USFD           X    X             X           X                 –                     CRF L-BFGS                42.46
     iitp           X    X             X           –                 –                     CRF++                     39.84
     Hallym         X    –             –           X                 correlation analysis  CRFsuite                  37.21
     lattice        X    X             –           X                 –                     CRF wapiti                16.47
     Baseline       –    X             X           –                 –                     CRFsuite                  31.97
  5. A simple, general but effective neural network architecture. Use word2vec to generate good feature representations for words (= unsupervised learning). Feed those word representations to another neural network (NN) for any classification task (= supervised learning). Pipeline: example → feature representation → machine learning → label(s). Learn the word2vec word representations once in advance; train a new NN for any task.
  6. Word2vec: automatically learning good features. [Figure: a 2D projection of the 400-dimensional space of the top 1,000 words used on Twitter. The model was trained on 400 million tweets containing 5 billion words.]
  7. A simple, general but effective neural network architecture (1). [Diagram: the words W(t-1), W(t), W(t+1) pass through a lookup table to N-dimensional embeddings, which are concatenated (window = 3, giving a 3N-dimensional vector) and fed to a feedforward neural network that outputs Tag(W(t)). Pipeline: example → feature representation → machine learning → label(s).]
  8. A simple, general but effective neural network architecture (2). [Diagram: the same pipeline on the concrete window "from Beijing to": the three words are looked up as N-dimensional embeddings, concatenated into a 3N-dimensional vector, and fed to a feedforward neural network that outputs the label Location.]
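The window-and-concatenate tagger on this slide can be sketched in a few lines of plain Python. The vocabulary, dimensions, and random weights below are invented for illustration; the authors' trained model used 400-dimensional word2vec embeddings and 500 hidden ReLU units.

```python
# Toy sketch of the window-based feedforward tagger, assuming a window of 3.
import random

random.seed(0)
N, H, N_TAGS = 4, 8, 3                      # toy sizes, not the paper's
vocab = {"from": 0, "Beijing": 1, "to": 2}

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), N)              # embedding lookup table (word2vec in the paper)
W1 = rand_matrix(3 * N, H)                  # window of 3 words -> 3N-dim input
W2 = rand_matrix(H, N_TAGS)

def matvec(x, W):
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def tag_scores(window):
    """Look up and concatenate a 3-word window, apply one hidden ReLU layer."""
    x = [v for w in window for v in E[vocab[w]]]   # concatenation (3N-dim)
    h = [max(0.0, z) for z in matvec(x, W1)]       # ReLU hidden layer
    return matvec(h, W2)                           # unnormalized tag scores

scores = tag_scores(["from", "Beijing", "to"])
```

In the actual system the output scores would go through a softmax and the highest-scoring tag would be assigned to the middle word of the window.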
  9. Postprocessing (1). [Diagram: words W(1)–W(3) pass through feature representation (FR) and machine learning (ML) to labels Label(1)–Label(3), which are then postprocessed.] Correct for inconsistencies: an NE starting with an I-tag, and multi-word expressions having different categories.
  10. Postprocessing (2). [Example: for "Manchester United is", the model outputs B-Loc I-sportsteam O; postprocessing corrects this to B-sportsteam I-sportsteam O.] Correct for inconsistencies: an NE starting with an I-tag, and multi-word expressions having different categories.
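The two postprocessing rules on these slides can be sketched as follows. Taking the category of the last token of a multi-word expression is an assumption made here so that the slide's example (B-Loc I-sportsteam → B-sportsteam I-sportsteam) comes out right; the paper may use a different tie-breaking rule.

```python
def postprocess(tags):
    """Fix inconsistent BIO tag sequences, per the two rules on the slides."""
    tags = list(tags)
    # Rule 1: an entity may not start with an I- tag; promote it to B-.
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and (i == 0 or tags[i - 1] == "O"):
            tags[i] = "B-" + tag[2:]
    # Rule 2: give every token of a multi-word expression the same category
    # (here: the category of the last token, an assumed tie-breaker).
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            cats = [t[2:] for t in tags[i:j]]
            if len(set(cats)) > 1:
                tags[i:j] = [t[:2] + cats[-1] for t in tags[i:j]]
            i = j
        else:
            i += 1
    return tags
```

For example, `postprocess(["B-Loc", "I-sportsteam", "O"])` reproduces the slide's correction to `["B-sportsteam", "I-sportsteam", "O"]`.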
  11. Experimental setup. Feature learning: word2vec skip-gram with negative sampling, trained on 400 million raw English tweets (limited preprocessing). Neural network: one hidden layer with 500 hidden units; word embeddings of size 400; a vocabulary of 3 million words; mini-batch SGD and dropout; experiments with tanh and ReLU.
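One ingredient of this setup, dropout, can be sketched in a few lines. The dropout rate of 0.5 below is an assumption; the slide does not state the value used.

```python
# Minimal sketch of inverted dropout on a hidden layer, as used during
# mini-batch SGD training. Rate p=0.5 is assumed, not taken from the slides.
import random

random.seed(42)

def dropout(h, p=0.5, train=True):
    """Zero each unit with probability p and rescale survivors by 1/(1-p),
    so the expected activation is unchanged between training and test time."""
    if not train:
        return list(h)
    return [0.0 if random.random() < p else x / (1.0 - p) for x in h]

h = [1.0] * 500                 # a 500-unit hidden layer, as on this slide
d = dropout(h, p=0.5)
```

At test time `dropout(h, train=False)` is the identity, which is why the inverted-scaling variant needs no correction at inference.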
  12. Word2vec results. [Examples of embedding neighbourhoods covering slang, wrong capitalization (sometimes not in a gazetteer), and spelling variants.]
  13. Normalizing slang words/spelling.
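The normalization idea on this slide amounts to a nearest-neighbour lookup in embedding space. The three toy vectors below are invented for illustration; in the real model they would come from the 400-dimensional word2vec embeddings trained on 400 million tweets.

```python
# Sketch: a slang/misspelled word is "normalized" to its nearest neighbour
# in embedding space. Vectors are made up; real ones come from word2vec.
import math

emb = {
    "tomorrow":  [0.90, 0.10, 0.00],
    "tmrw":      [0.85, 0.15, 0.05],   # slang spelling, lands near "tomorrow"
    "yesterday": [0.10, 0.90, 0.00],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word):
    """Return the vocabulary word closest to `word` by cosine similarity."""
    return max((w for w in emb if w != word), key=lambda w: cosine(emb[word], emb[w]))
```

With these toy vectors, `nearest("tmrw")` returns `"tomorrow"`, which is the behaviour the slide illustrates at scale.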
  14. Dealing with capitalization + gazetteer functionality.
  15. Results:

     System         POS  Orthographic  Gazetteers  Brown clustering  Word embedding        ML                        F1 (%)
     ousia          X    X             X           –                 GloVe                 entity linking using SVM  56.41
     NLANGP         –    X             X           X                 word2vec & GloVe      CRF++                     51.40
     nrc            –    –             X           X                 word2vec              semi-Markov MIRA          44.74
     multimedialab  –    –             –           –                 word2vec              FFNN                      43.75
     USFD           X    X             X           X                 –                     CRF L-BFGS                42.46
     iitp           X    X             X           –                 –                     CRF++                     39.84
     Hallym         X    –             –           X                 correlation analysis  CRFsuite                  37.21
     lattice        X    X             –           X                 –                     CRF wapiti                16.47
     Baseline       –    X             X           –                 –                     CRFsuite                  31.97
  16. Lessons learned. Feature learning: a word2vec window of 1 worked best, giving more syntax-oriented embeddings. Neural networks: multiple layers did not improve the F1-score; dropout and ReLU worked best. Postprocessing: multi-word expressions often have different categories.
  17. Conclusion. An end-to-end, semi-supervised neural network architecture: no feature engineering needed, a reusable architecture, and it beats traditional systems that only use hand-crafted features.
  18. #Questions? The word2vec Twitter model is available at: http://www.fredericgodin.com/software/. Contact: @frederic_godin
