
Revealing Entities From Texts With a Hybrid Approach


NLP&DBpedia2015 presentation @ISWC2015



  1. Revealing Entities From Texts With a Hybrid Approach
     Julien Plu, Giuseppe Rizzo, Raphaël Troncy
     {firstname.lastname}@eurecom.fr, @julienplu, @giusepperizzo, @rtroncy
  2. What is Entity Linking?
     "On June 21st, I went to Paris to see the Eiffel Tower and to enjoy the world music day."
     § Goal: link (or disambiguate) the entity mentions one can find in a text to their corresponding entries in a knowledge base (e.g. DBpedia):
     Ø June 21 → db:June_21, Paris → db:Paris, Eiffel Tower → db:Eiffel_Tower, world music day → db:Fête_de_la_Musique
  3. Problems
     § Extract entities from diverse types of textual documents:
     Ø newspaper articles, encyclopaedia articles, microposts (tweets, statuses, photo captions), video subtitles, etc.
     Ø deal with grammar-free and short texts that have little context
     § Adapt what can be extracted depending on guidelines or challenges:
     Ø #Micropost2014 NEEL challenge: link entities that may belong to: Person, Location, Organization, Function, Amount, Animal, Event, Product, Time, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)
     Ø OKE2015 challenge: extract and link entities that must belong to: Person, Location, Organization, and Role
  4. Research Question
     How do we adapt an entity linking system to solve these problems?
  5. ADEL Workflow
     Text → Text Normalization → Entity Extractor → Entity Linking (index) → Pruning
     § Input and output in different formats:
     Ø input: plain text, NIF, Micropost2014 (pruning phase)
     Ø output: NIF, TAC (TSV format), Micropost2014 (TSV format with no offsets)
     § Text is classified according to its provenance: microposts vs. newspaper articles, video subtitles, encyclopaedia articles, ...
     § Text is normalized if necessary: for micropost content, RT symbols (in the case of tweets) and emoticons are removed (a sketch of this step follows this slide)
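A minimal sketch of the normalization step for microposts, in Python. The slides do not specify ADEL's exact rules, so the regular expressions below (leading RT markers and common ASCII emoticons) are illustrative assumptions:

    import re

    # Illustrative rules only: the exact normalization used by ADEL is not
    # detailed in the slides.
    RT_MARKER = re.compile(r"\bRT\b\s*")                      # retweet symbol
    EMOTICON = re.compile(r"[:;=8][\-o*']?[()\[\]dDpP/\\]")   # e.g. :-) ;P =D

    def normalize_micropost(text: str) -> str:
        text = RT_MARKER.sub("", text)
        text = EMOTICON.sub("", text)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize_micropost("RT @julienplu: I love Paris :-)"))
    # -> "@julienplu: I love Paris"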
  6. Entity Extractor
     § Multiple extractors can be used:
     Ø possibility to switch an extractor on and off in order to adapt the system to given guidelines
     Ø extractors can be:
     • unsupervised: Dictionary, Hashtag + Mention Extractor, Number Extractor
     • supervised: Date Extractor, POS Tagger (NNP/NNPS), NER System (Stanford)
     § Overlaps are resolved by choosing the longest extracted mention (a sketch follows this slide):
     Ø e.g. for "June 21", the Date Extractor yields "June 21" and the Number Extractor yields "21"; overlap resolution keeps "June 21"
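The longest-mention rule can be stated in a few lines; the sketch below is a plain transcription of it, where mentions are (start, end, text) tuples and the names are illustrative rather than ADEL's actual API:

    # Keep only the longest span whenever extracted mentions overlap.
    def resolve_overlaps(mentions):
        kept = []
        # Longest spans first, then leftmost for deterministic ties.
        for m in sorted(mentions, key=lambda x: (x[0] - x[1], x[0])):
            if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
                kept.append(m)
        return sorted(kept)

    # "On June 21, ...": the Date Extractor finds "June 21" and the
    # Number Extractor finds "21"; only the longer span survives.
    print(resolve_overlaps([(3, 10, "June 21"), (8, 10, "21")]))
    # -> [(3, 10, 'June 21')]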
  7. How is the index created?
     § From DBpedia:
     Ø PageRank
     Ø Title
     Ø Redirects, Disambiguation
     § From Wikipedia:
     Ø Anchors
     Ø Link references
     For example, the following passage from the EN Wikipedia article about Xabi Alonso:
     "Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were neighbours on the same street while growing up in [[San Sebastián]] and also lived near each other in [[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to transfer to [[Everton F.C.|Everton]] after he told him how happy he was living in [[Liverpool]]."
     yields the index entries: (Arsenal F.C., 1); (Mikel Arteta, 2); (San Sebastián, 1); (Liverpool, 2); (Everton F.C., 1)
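Anchor counts like these can be reproduced from raw wikitext with a small script; the sketch below assumes the standard [[Target|label]] link syntax and counts link targets while ignoring the anchor labels:

    import re
    from collections import Counter

    # Capture the target page of each [[Target]] or [[Target|label]] link.
    WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

    def anchor_counts(wikitext: str) -> Counter:
        return Counter(WIKILINK.findall(wikitext))

    text = ("Alonso and [[Arsenal F.C.|Arsenal]] player [[Mikel Arteta]] were "
            "neighbours in [[San Sebastián]] and lived near each other in "
            "[[Liverpool]]. Alonso convinced [[Mikel Arteta|Arteta]] to move "
            "to [[Everton F.C.|Everton]] while living in [[Liverpool]].")
    print(anchor_counts(text))
    # Counter({'Mikel Arteta': 2, 'Liverpool': 2, 'Arsenal F.C.': 1,
    #          'San Sebastián': 1, 'Everton F.C.': 1})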
  8. Entity Linking
     mention → index query → Candidate Generation → Candidate Filtering → Scoring
     § Generate candidates from a fuzzy match against the index
     § Filter candidates:
     Ø filter out candidates that are not semantically related to the other entities from the same sentence
     § Score each candidate using a linear formula (a Python transcription follows this slide):
     Ø score(cand) = (a * L(m, cand) + b * max(L(m, R(cand))) + c * max(L(m, D(cand)))) * PR(cand)
     Ø L stands for the Levenshtein distance, R for the set of redirects, D for the set of disambiguation pages, and PR for the PageRank
     Ø a, b and c are weights set with a > b > c and a + b + c = 1
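A direct Python transcription of the scoring formula, with one assumption made explicit: since higher scores must indicate better candidates, L is implemented here as a normalized Levenshtein similarity (1.0 for identical strings) rather than the raw edit distance:

    def levenshtein(s: str, t: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (cs != ct)))    # substitution
            prev = cur
        return prev[-1]

    def L(m: str, s: str) -> float:
        # Assumption: normalized similarity, 1.0 when the strings match.
        return 1.0 - levenshtein(m.lower(), s.lower()) / max(len(m), len(s), 1)

    def score(m, cand, R, D, PR, a=0.5, b=0.3, c=0.2):
        # a > b > c and a + b + c = 1 as in the slide; the values are examples.
        return (a * L(m, cand)
                + b * max((L(m, r) for r in R), default=0.0)
                + c * max((L(m, d) for d in D), default=0.0)) * PR

For instance, score("Paris", "Paris", {"Parisien", "Paname"}, {"Paris (disambiguation)"}, 0.9) evaluates the db:Paris candidate of the next slide with an example PageRank of 0.9.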
  9. Entity Linking example
     Sentence: "I went to Paris to see the Eiffel Tower."
     § Generate candidates:
     Ø Paris: db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
     Ø Eiffel Tower: db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
     § Filter candidates:
     Ø db:Paris, db:Paris_Hilton, db:Paris,_Ontario, db:Notre_Dame_de_Paris
     Ø db:Eiffel_Tower, db:Eiffel_Tower_(Paris,_Tennessee)
     § Scoring:
     Ø score(db:Paris) = (a * L("Paris", "Paris") + b * max(L("Paris", R("Parisien", "Paname"))) + c * max(L("Paris", D("Paris (disambiguation)")))) * PR(db:Paris)
     Ø score(db:Notre_Dame_de_Paris) = (a * L("Paris", "Notre Dame de Paris") + b * max(L("Paris", R("Nôtre Dame", "Paris Cathedral"))) + c * max(L("Paris", D("Notre Dame", "Notre Dame de Paris (disambiguation)")))) * PR(db:Notre_Dame_de_Paris)
  10. Pruning
      Training set → ADEL → Create file of features → Train k-NN
      § k-NN machine learning algorithm training process (a training sketch follows this slide):
      Ø run the system on a training set
      Ø classify entities as true/false according to the training set gold standard
      Ø create a file with the features of each entity and its true/false classification
      Ø train a k-NN classifier with this file to get a model
      § Use 10 features for the training:
      • length in number of characters
      • extracted mention
      • title
      • type
      • PageRank
      • HITS
      • number of inLinks
      • number of outLinks
      • number of redirects
      • linking score
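A minimal training sketch using scikit-learn, which the slides do not name; it is only one possible realization. String-valued features such as the extracted mention and the title would need a numeric encoding, which is elided here:

    from sklearn.neighbors import KNeighborsClassifier

    # One row per extracted entity; the 10 columns stand for the features
    # listed above (length, mention, title, type, PageRank, HITS, inLinks,
    # outLinks, redirects, linking score). All values are illustrative.
    X_train = [
        [5, 101, 101, 3, 0.12, 0.40, 120, 30, 4, 0.81],
        [2,   7,   9, 1, 0.01, 0.10,   3,  1, 0, 0.22],
    ]
    y_train = [True, False]  # does the gold standard confirm the entity?

    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_train, y_train)

    # At annotation time, entities classified as False are pruned.
    print(model.predict([[4, 95, 95, 3, 0.10, 0.30, 90, 25, 2, 0.75]]))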
  11. #Micropost2014 NEEL challenge
      § Tweets dataset:
      Ø training set: 2340 tweets
      Ø test set: 1165 tweets
      § Link entities that may belong to one of these ten types:
      Ø Person, Location, Organization, Function, Amount, Animal, Event, Product, Time, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)
      § Twitter user name dereferencing
      § Disambiguate in DBpedia 3.9
  12. Results on #Micropost2014
      § Results of ADEL with and without pruning:

                      Without pruning                  With pruning
                      Precision  Recall  F-measure    Precision  Recall  F-measure
        Extraction    69.17      72.51   70.80        70.00      41.62   52.20
        Linking       47.39      45.23   46.29        48.21      26.74   34.40

      § Results of other systems:

                      E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
        F-measure     70.06  54.93    49.90    46.29  45.37  45.23      39.02
  13. OKE2015 challenge
      § Sentences from Wikipedia:
      Ø training set: 96 sentences
      Ø test set: 101 sentences
      § Extract and link entities that must belong to one of these four types:
      Ø Person, Location, Organization, and Role
      § Must disambiguate co-references
      § Allow emerging entities (NIL)
      § Disambiguate in DBpedia 3.9
  14. Results on OKE2015
      § Results of ADEL with and without pruning:

                       Without pruning                  With pruning
                       Precision  Recall  F-measure    Precision  Recall  F-measure
        Extraction     78.2       65.4    71.2         83.8       9.3     16.8
        Recognition    65.8       54.8    59.8         75.7       8.4     15.1
        Linking        49.4       46.6    48.0         57.9       6.2     11.1

      § Results of other systems (https://github.com/anuzzolese/oke-challenge):

                      ADEL   FOX    FRED
        F-measure     60.75  49.88  34.73
  15. #Micropost2015 NEEL challenge
      § Tweets dataset:
      Ø training set: 3498 tweets
      Ø development set: 500 tweets
      Ø test set: 2027 tweets
      § Extract and link entities that must belong to one of these seven types:
      Ø Person, Location, Organization, Character, Event, Product, and Thing (languages, ethnic groups, nationalities, religions, diseases, sports and astronomical objects)
      § Twitter user name dereferencing
      § Disambiguate in DBpedia 3.9 + NIL
  16. Results on #Micropost2015
      § Results of ADEL without pruning:

                       Precision  Recall  F-measure
        Extraction     68.4       75.2    71.6
        Recognition    62.8       45.5    52.8
        Linking        48.8       47.1    47.9

      § Results of other systems:
      Ø strong typed mention match:

                      ousia  ADEL   uva    acubelab  uniba  ualberta  cen_neel
        F-measure     80.7   52.8   41.2   38.8      36.7   32.9      0

      Ø strong link match (not considering type correctness):

                      ousia  acubelab  ADEL   uniba  ualberta  uva    cen_neel
        F-measure     76.2   52.3      47.9   46.4   41.5      31.6   0
  17. Error Analysis
      § Issue at the extraction stage:
      Ø "FB is a prime number."
      • here "FB" stands for 251 in hexadecimal, but it will be extracted as the Facebook acronym by the wrong extractor
      § Issue at the filtering stage:
      Ø "The series of HP books have been sold million times in France."
      • there is no relation in Wikipedia between Harry Potter and France, so no filtering is applied
      § Issue at the scoring stage:
      Ø "The Spanish football player Alonso played twice for the national team between 1954 and 1960."
      • Xabi Alonso will be selected instead of Juan Alonso because of the PageRank
  18. Conclusion
      § Our system makes it possible to adapt the entity linking task to different kinds of texts
      § Our system makes it possible to adapt the types of extracted entities
      § Results are similar regardless of the kind of text
      § Performance at the extraction stage is similar to (or slightly better than) that of top state-of-the-art systems
      § The big drop in performance at the linking stage is mainly due to our unsupervised approach
  19. Future Work
      § Add more adaptive features: language, knowledge base
      § Improve linking by using a graph-based algorithm:
      Ø finding the common entities linked to each of the extracted entities
      Ø example: "Rafael Nadal is a friend of Alonso". There is no direct link between Rafael Nadal and Alonso in DBpedia (or Wikipedia), but they have the entity Spain in common (a sketch of this idea follows this slide)
      § Improve pruning by:
      Ø adding additional features:
      • relatedness: compute the relation score between one entity and all the others in the text; if there are more than two, compute the average
      • POS tags of the previous and the next token in the sentence
      Ø using other algorithms:
      • Ensemble Learning
      • Unsupervised Feature Learning + Deep Learning
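A toy sketch of the common-entity idea from the Future Work slide. The hard-coded link sets below are illustrative; a real implementation would query the DBpedia/Wikipedia link graph:

    # Outgoing links per entity (illustrative; a real system queries DBpedia).
    links = {
        "db:Rafael_Nadal": {"db:Spain", "db:Tennis"},
        "db:Xabi_Alonso": {"db:Spain", "db:Football"},
    }

    def common_entities(e1: str, e2: str) -> set:
        # No direct edge is required: shared neighbours provide the evidence.
        return links.get(e1, set()) & links.get(e2, set())

    print(common_entities("db:Rafael_Nadal", "db:Xabi_Alonso"))
    # -> {'db:Spain'}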
  20. http://www.slideshare.net/julienplu
      http://xkcd.com/1319/
