
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics

PhD symposium presentation at WWW 2016 in Montréal.

Published in: Software


  1. Julien Plu (julien.plu@eurecom.fr, @julienplu). Knowledge extraction in Web media: at the frontier of NLP, Machine Learning and Semantics
  2. Use Case: Bringing Context to Documents
     Newswires, tweets, search queries, subtitles
     2016/04/14 - PhD Symposium WWW 2016 - Montréal
  3. Use Case: Bringing Context to Documents
     • James Patrick Page, OBE (born 9 January 1944) is an English musician, songwriter, and record producer who achieved international success as the guitarist and founder of the rock band Led Zeppelin.
       Know more: Sort name: Page, Jimmy. Type: Person. Gender: Male. Born: 1944-01-09 (72 years ago). Born in: Heston, Hounslow, London, United Kingdom.
     • (Translated from French) The Yardbirds are a British rock band of the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page.
       Know more: Country of origin: United Kingdom. Musical genre: Blues rock, psychedelic rock. Years active: 1962-1968 and since 1992. Labels: Columbia.
  4. Six Different Problems
     1. Identity of an entity
        • Arena; Arena (magazine); Arena (TV series)
        • Bucks County, Pennsylvania; Milwaukee Bucks
     2. Knowledge bases have different coverage
     3. High computational cost to handle large streams
     4. Various types for an entity (granularity)
        • Yannick Noah is a tennis player and a singer
     5. Different types of documents written in multiple languages
     6. Are all phrases entities? (e.g. dates or roles)
  5. Research Questions
     1. How to adapt an entity linking system depending on different criteria?
     2. How to design an entity linking system in order to be able to process a large amount of data in near real time?
  6. State Of The Art
     • The key role of entities:
       • 70% of search queries contain at least one entity [1]
       • Bring context to videos [2]
       • Help generate summaries [3]
     • Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) are hard to parametrize and often cannot be adapted to even one of the previous criteria
     • Those solutions are often not able to handle large streams of text
     References:
     [1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
     [2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015
     [3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
     [4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010
     [5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
     [6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014
     [7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents. I-SEMANTICS 2011
  7. Methodology
     We have split this thesis into six tasks:
     • (1) Text adaptivity
     • (1) Entity type adaptivity
     • (1) Knowledge base adaptivity
     • (1) Language adaptivity
     • (1-2) ADEL modular framework
     • (2) Distributed and scalable architecture
     [Timeline figure: Start thesis, Today, End thesis]
  8. ADEL: Modular Framework (Extractors)
     • POS Tagger:
       • Bidirectional CMM (left to right and right to left)
     • NER Combiner:
       • Uses a combination of CRF models with Gibbs sampling (Monte Carlo as graph inference method). A simple CRF model could be:
         [Figure: chain of labels (PER, O) over feature nodes X for the French example sentence "Jimmy Page, connaissant le professionnalisme de John Paul Jones" ("Jimmy Page, knowing the professionalism of John Paul Jones")]
       • X is the set of features for the current word: word capitalized, previous word is "de", next word is an NNP, …
       • Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X with PER is a CRF
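The three features named on the extractor slide can be sketched as a small function. This is only an illustration: the function name `word_features`, the feature keys, and the Penn-style POS tags assigned to the French example are assumptions, not ADEL's actual feature set.

```python
def word_features(tokens, pos_tags, i):
    """Build the feature set X for the token at position i (hypothetical names)."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else None
    next_pos = pos_tags[i + 1] if i + 1 < len(tokens) else None
    return {
        "is_capitalized": word[:1].isupper(),
        "prev_is_de": prev_word is not None and prev_word.lower() == "de",
        "next_is_nnp": next_pos == "NNP",
    }


# Example sentence from the slide; the POS tags are assumed for illustration.
tokens = ["Jimmy", "Page", ",", "connaissant", "le",
          "professionnalisme", "de", "John", "Paul", "Jones"]
pos_tags = ["NNP", "NNP", ",", "VBG", "DT", "NN", "IN", "NNP", "NNP", "NNP"]

features = [word_features(tokens, pos_tags, i) for i in range(len(tokens))]
```

For the token "John", all three features fire: it is capitalized, preceded by "de", and followed by a proper noun.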
  9. ADEL: Modular Framework (Overlap Resolution)
     • Detect overlaps among extractors using the boundaries of the entities
     • Different heuristics can be applied:
       • Merge (default behavior): "United States" and "States of America" => "United States of America"
       • Simple Substring: "Florence" and "Florence May Harding" => "Florence" and "May Harding"
       • Smart Substring: "Giants of New York" and "New York" => "Giants" and "New York"
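A minimal sketch of the three overlap heuristics, working on (start, end) character offsets. The slides do not say how the "smart" variant decides what to trim, so the connective-word list `CONNECTIVES` and all function names here are assumptions chosen to reproduce the three examples above.

```python
CONNECTIVES = {"of", "the"}  # assumption: words trimmed by the "smart" variant


def merge(text, a, b):
    """Merge two overlapping spans into the single span that covers both."""
    return [text[min(a[0], b[0]):max(a[1], b[1])]]


def _split_around(text, outer, inner):
    """Return the parts of `outer` before, inside, and after `inner`."""
    before = text[outer[0]:inner[0]].strip()
    middle = text[inner[0]:inner[1]]
    after = text[inner[1]:outer[1]].strip()
    return before, middle, after


def simple_substring(text, outer, inner):
    """Keep `inner` and turn the rest of `outer` into separate mentions."""
    return [part for part in _split_around(text, outer, inner) if part]


def _trim(part):
    """Drop connective words from both ends of a leftover piece."""
    words = part.split()
    while words and words[0].lower() in CONNECTIVES:
        words.pop(0)
    while words and words[-1].lower() in CONNECTIVES:
        words.pop()
    return " ".join(words)


def smart_substring(text, outer, inner):
    """Like simple_substring, but trim connectives from the leftover pieces."""
    before, middle, after = _split_around(text, outer, inner)
    return [part for part in (_trim(before), middle, _trim(after)) if part]


merged = merge("United States of America", (0, 13), (7, 24))
simple = simple_substring("Florence May Harding", (0, 20), (0, 8))
smart = smart_substring("Giants of New York", (0, 18), (10, 18))
```

Working on offsets rather than strings keeps the heuristics independent of which extractor produced each mention.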
  10. Modular Framework: Indexing
      • Create an index from DBpedia and Wikipedia
      • Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
  11. ADEL: Modular Framework (Linking)
      • Generate candidate links for all extracted mentions:
        • If there are any, they go to the linking method
        • If not, the mention is linked to NIL
      • Linking method: ADEL linear formula
        $r(l) = \left( a \cdot L(m, title) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D)) \right) \cdot PR(l)$
        • r(l): the score of the candidate l
        • L: the Levenshtein distance
        • m: the extracted mention
        • title: the title of the candidate l
        • R: the set of redirect pages associated to the candidate l
        • D: the set of disambiguation pages associated to the candidate l
        • PR: the PageRank associated to the candidate l
        • a, b and c are weights with the properties a > b > c and a + b + c = 1
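The linear formula can be sketched in a few lines of Python. Two points are assumptions rather than statements from the slides: the string term is computed as a normalized similarity derived from the Levenshtein distance (so that closer strings score higher), and the weights 0.5, 0.3 and 0.2 are illustrative values that merely satisfy a > b > c and a + b + c = 1. PageRank is read as multiplying the whole weighted sum.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Assumption: normalized Levenshtein similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def score(mention, title, redirects, disambiguations, pagerank,
          a=0.5, b=0.3, c=0.2):
    """r(l) = (a*L(m,title) + b*max L(m,R) + c*max L(m,D)) * PR(l)."""
    best_redirect = max((similarity(mention, r) for r in redirects), default=0.0)
    best_disamb = max((similarity(mention, d) for d in disambiguations), default=0.0)
    weighted = a * similarity(mention, title) + b * best_redirect + c * best_disamb
    return weighted * pagerank


# Hypothetical candidate data, for illustration only.
exact = score("Jimmy Page", "Jimmy Page", ["James Page"], [], pagerank=2.0)
loose = score("Jimmy Page", "Page (paper)", [], [], pagerank=2.0)
```

Because a is the largest weight, a close match on the page title dominates a match found only through a redirect or a disambiguation page.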
  12. ADEL: Modular Framework (Pruning)
      • k-NN machine learning algorithm
      • Why a pruning module?
        • Useful to correct the errors from the extractor by removing wrong annotations. Example:
          • France played against Russia for a friendly match.
          • Yesterday, I went to see Against in concert.
        • Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges, 2014 NEEL, which counts dates as entities, and OKE2015, which does not.
          • 1st challenge: Jimmy Page was born on January 9th, 1944. (the date is annotated)
          • 2nd challenge: Jimmy Page was born on January 9th, 1944. (the date is not annotated)
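As a rough illustration of how a k-NN pruner could decide whether to keep an annotation, the sketch below votes among the k nearest labeled examples. The slides do not list the features ADEL uses, so the two-dimensional feature vectors (a linking score and a capitalization flag), the toy training set, and the name `knn_keep` are all hypothetical.

```python
import math
from collections import Counter


def knn_keep(candidate, training, k=3):
    """Return the majority label (keep or discard) among the k nearest examples."""
    nearest = sorted(training, key=lambda ex: math.dist(candidate, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: (linking score, is_capitalized) -> keep the annotation?
training = [
    ((0.90, 1.0), True),
    ((0.80, 1.0), True),
    ((0.70, 1.0), True),
    ((0.10, 0.0), False),
    ((0.20, 0.0), False),
    ((0.15, 0.0), False),
]

keep_entity = knn_keep((0.85, 1.0), training)   # resembles the kept examples
keep_against = knn_keep((0.12, 0.0), training)  # resembles the discarded ones
```

Retraining the same classifier on annotations from a different challenge is one way the pruner could follow a different guideline without touching the extractors.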
  13. Text Adaptivity
      • Experiments on different kinds of text by benchmarking ADEL over different challenges
        • Tweets: NEEL2014, NEEL2015 and NEEL2016
        • News articles: OKE2015 and OKE2016
      • Need to adapt the extractors to use a proper model to handle different kinds of text
        • Retrain the NER extractor with a training dataset
  14. Type Adaptivity
      • Challenges have their own definition of types
      • In ADEL, types come from the NER extractor and the knowledge base in use
        • NER types are different from KB types
        • NER types and KB types are different from challenge types
      • Need a mapping between those different types. It is currently built manually.
      • Challenge types:
        • OKE2015 and OKE2016: Person, Place, Organization, Role
        • NEEL2015 and NEEL2016: Person, Location, Organization, Product, Event, Thing
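A manually built type mapping of the kind described above can be as simple as one dictionary per challenge. In this sketch the source labels (`PERSON`, `LOCATION`, `ORGANIZATION`) are assumed NER output labels, and the rule of dropping annotations whose type has no mapping is also an assumption; only the target type names come from the slide.

```python
# Hypothetical NER labels on the left, challenge types from the slide on the right.
NER_TO_OKE = {
    "PERSON": "Person",
    "LOCATION": "Place",
    "ORGANIZATION": "Organization",
}
NER_TO_NEEL = {
    "PERSON": "Person",
    "LOCATION": "Location",
    "ORGANIZATION": "Organization",
}


def adapt_types(annotations, mapping):
    """Rewrite each (mention, type) pair; drop pairs whose type is unmapped."""
    return [
        (mention, mapping[source_type])
        for mention, source_type in annotations
        if source_type in mapping
    ]


annotations = [("Jimmy Page", "PERSON"), ("London", "LOCATION"), ("1944", "DATE")]
oke_view = adapt_types(annotations, NER_TO_OKE)
neel_view = adapt_types(annotations, NER_TO_NEEL)
```

Keeping one mapping per challenge means the same extractor output can be re-targeted by swapping a dictionary.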
  15. Knowledge Base Adaptivity
      • Joint work with Vrije Universiteit Amsterdam
      • ReCon: define several heuristics in order to re-rank candidate links provided by our system on newswire articles
        • H1: process the article text first and disambiguate the article title at the end because titles are often too ambiguous
        • H2: detect co-referential entities throughout the article
        • H3: topic modeling to exploit a contextual knowledge base about the found topic
  16. Language Adaptivity
      • No results yet. The goal is to let the user choose the natural language used in the text
      • Test the framework on ETAPE, a NER challenge on French TV content from 2012
  17. Distributed and Scalable Architecture
      • No results yet. The goal is to be able to deploy the framework so that the tasks run in a distributed and scalable way
      • Make each task (extraction, linking and pruning) independent of the others and move them out of the global architecture (taking the way Docker is developed as a model)
      • Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks
  18. Evaluation Over Multiple Datasets in Linking
      • 2014 NEEL Challenge with ADEL v1 using the neleval scorer
      • 2015 NEEL Challenge with ADEL v1 using the neleval scorer
      • 2016 NEEL Challenge with ADEL v2 using the neleval scorer
      • OKE2015 Challenge with ADEL v1 using the GERBIL scorer
      • OKE2016 Challenge with ADEL v2 using the neleval scorer
      Result tables, in the order they appear on the slide:

      System     E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
      F-measure  70.06  54.93    49.9     46.29  45.37  45.23      39.02

      System     ADEL   FOX    FRED
      F-measure  60.75  49.88  34.73

      System     ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
      F-measure  76.2   52.3      47.9  46.4   41.5      31.6  0

      System     ADEL   kea    Insight  mit    ju     unimib
      F-measure  61.98  54.86  38.28    36.09  35.48  33.53

      System     ADEL
      F-measure  56.5
  19. Conclusions
      • Combining multiple techniques coming from different domains for entity recognition and linking
      • Having developed different methods in order to make an entity linking system adaptive to one or multiple criteria
      • Bringing a new approach with ADEL while also reusing existing approaches with the POS and NER extractors
      • Testing ADEL over different datasets and participating in challenges
  20. Future Work
      • Knowledge base adaptivity
        • Further evaluate the knowledge base and text adaptive features using the ERD dataset
        • Evaluate the knowledge base adaptive feature using the TAC KBP dataset
        • Experiment with the knowledge base adaptive feature using 3cixty and an ad-hoc tourism dataset
      • Language adaptivity
        • Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
      • Modular Framework
        • Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
      • Type adaptivity
        • Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will raise more issues, especially with the scorers
      • Engineer and evaluate a distributed and scalable architecture on large data streams
  21. Questions? Thank you for listening!
