Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets

778 views

Published on

This presentation describes the participation of the UNIBA team
in the Named Entity rEcognition and Linking (NEEL) Chal-
lenge. We propose a knowledge-based algorithm able to
recognize and link named entities in English tweets. The
approach combines the simple Lesk algorithm with informa-
tion coming from both a distributional semantic model and
usage frequency of Wikipedia concepts. The algorithm per-
forms poorly in the entity recognition, while it achieves good
results in the disambiguation step.

  • Be the first to comment

  • Be the first to like this

UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets

  1. 1. UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Linking Entities in Tweets Pierpaolo Basile, Annalina Caputo, Giovanni Semeraro, Fedelucio Narducci {fedelucio.narducci, pierpaolo.basile}@uniba.it #Microposts2015, NEEL Challenge, Florence 18th May 2015
  2. 2. The Challenge Just watched Frozen for the first time ever and knew the words to all the songs... How?! #productplacement Problem: Find and link entities in tweets ProductEntity type
  3. 3. Our Approach • Entity Recognition • using PoS-tag • relying on n-grams • Disambiguation • knowledge-based method that combines a Distributional Semantic Models (DSM) with prior probability assigned to each DBpedia concept • Type • manual map for all types defined in the dbpedia-owl ontology to the respective types in the task
  4. 4. Entity Recognition: Indexing Frozen <dbpedia.org/resource/Frozen_(Madonna_song)> Frozen <dbpedia.org/resource/Frozen_(2013_film)> Apple <dbpedia.org/resource/Apple_Inc.> Apple Inc. <dbpedia.org/resource/Apple_Inc.> Barack Obama <http://dbpedia.org/resource/Barack_Obama> DBpedia titles file and DBpedia NLP resources http://wifo5-04.informatik.uni-mannheim.de/downloads/datasets/ Indexing
  5. 5. Entity Recognition… PoS-tagger N-grams generation Tokenization and Normalization Candidate list of surface forms Tweet
  6. 6. …Entity Recognition Search and Filtering Search Score Levenshtein Distance Jaccard Index Candidate list of surface forms Candidate entities and list of possible concepts
  7. 7. Disambiguation Building the glosses Building the context Semantic Ranking 3-step approach
  8. 8. Disambiguation: Building the glosses "Frozen" is a song by American singer- songwriter Madonna… Frozen is a 2013 American 3D computer- animated musical… DBpedia extended abstracts
  9. 9. Disambiguation: Building the context Just watched Frozen for the first time ever and knew the words to all the songs... How?! #productplacement <just, watched, first, time, knew, words, all, songs, how, product, placement> Context
  10. 10. Disambiguation: Semantic Ranking 1/3 • Words as points in a mathematical space • Close words are similar • Word space is built analyzing word co-occurrences in a large corpus • Vector composition using superposition (+)
  11. 11. Disambiguation: Semantic Ranking 2/3 word2vec: https://code.google.com/p/word2vec/ Distributional Semantic Model built on Wikipedia Context • Cosine similarity between the gloss and the context • Linear combination with a function which takes into account the usage of concepts in Wikipedia
  12. 12. Disambiguation: Semantic Ranking 3/3 Statistics about the usage of concepts in Wikipedia 𝑝 𝑐𝑖𝑗 𝑒𝑖 = 𝑡 𝑒𝑖, 𝑐𝑖𝑗 + 1 #𝑒𝑖 + |𝐶𝑖| Concept probability given the entity
  13. 13. 𝑝 𝑐𝑖𝑗 𝑒𝑖 = 𝑡 𝑒𝑖, 𝑐𝑖𝑗 + 1 #𝑒𝑖 + |𝐶𝑖| Disambiguation: Semantic overlap 3/3 Statistics about the usage of concepts in Wikipedia Number of times ei is linked as cij Number of concepts assigned to ei
  14. 14. Evaluation • Development set • 500 manually annotated tweets • Metrics • SLM: Strong Link Match • STMM: Strong Typed Mention Match • MC: Mention Ceaf • System setup • TweetNLP for tokenization and PoS-tagging • word2vec for DSM building: 400 vector dimensions analyzing only terms that occur at least 25 times • Developed in JAVA
  15. 15. Results • Low performance in entity recognition • Good results in disambiguation: F=0.825 considering correct recognition and no-NIL instances Entity Recognition F-SLM F-STMM F-MC PoS-tag 0.362 0.267 0.389 N-grams 0.258 0.191 0.306

×