
SANAPHOR: Ontology-based Coreference Resolution

eXascale Infolab
Oct. 14, 2015

  1. SANAPHOR: Ontology-Based Coreference Resolution Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Difallah and Philippe Cudré-Mauroux eXascale Infolab University of Fribourg, Switzerland October 14th, ISWC’15 Bethlehem PA, USA 1
  2. Motivations and Task Overview 2 Task: identify groups (cluster) of co-referring mentions. Example: “Xi Jinping was due to arrive in Washington for a dinner with Barack Obama on Thursday night, in which he will aim to reassure the US president about a rising China. The Chinese president said he favors a “new model of major country relationship" built on understanding, rather than suspicion.” http://www.telegraph.co.uk/ Benefits: • identification of a specific type of an unknown entity • extract more relationships between named entities
  3. State of the Art in Coreference Resolution The best approaches use a generic multi-step algorithm: 1. Pre-processing (POS tagging, parsing, NER) 2. Identification of referring expressions (e.g., pronouns) 3. Anaphoricity determination (“it rains” vs. “he took it”) 4. Generation of antecedent candidates 5. Searching/clustering of candidates Lee et al., Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task 3
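The multi-pass idea above can be sketched in a few lines. This is our own illustrative toy, not the Stanford system: each "sieve" is a function that merges mention clusters, and sieves are applied in order from highest to lowest precision. The single sieve shown (case-insensitive exact match) is an assumed, simplified rule.

```python
# Toy multi-pass sieve: each sieve takes a list of mention clusters and
# returns a (possibly smaller) list after merging some of them.

def exact_match_sieve(clusters):
    """Merge clusters whose first mentions are string-identical, ignoring case."""
    merged, seen = [], {}
    for cluster in clusters:
        key = cluster[0].lower()
        if key in seen:
            seen[key].extend(cluster)      # merge into the earlier cluster
        else:
            seen[key] = list(cluster)
            merged.append(seen[key])
    return merged

def run_sieves(clusters, sieves):
    """Apply sieves in order, from highest to lowest precision."""
    for sieve in sieves:
        clusters = sieve(clusters)
    return clusters

clusters = [["Barack Obama"], ["barack obama"], ["Australia"]]
result = run_sieves(clusters, [exact_match_sieve])
# the two Obama mentions end up in one cluster, Australia stays alone
```

In the real system each later sieve uses progressively weaker evidence (head match, pronoun resolution, etc.), but the control flow is the same composition of passes.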
  4. Motivations for a rich semantic layer 4 http://www.telegraph.co.uk/ “Xi Jinping was due to arrive in Washington for a dinner with Barack Obama on Thursday night, in which he will aim to reassure the US president about a rising China. The Chinese president said he favors a “new model of major country relationship" built on understanding, rather than suspicion.” Syntactic approaches are not able to differentiate between the names of the city and the province.
  5. Semantic layer on top of an existing system 5 Stanford Coref Deterministic Coreference Resolution [US President] [Barack Obama] [Australia] [Quintex Australia] [Quintex ltd.] Documents
  6. Generic overview of the approach Key techniques Split and merge clusters based on their semantics. 6 Clusters produced by Stanford Coref Entity/Type Linking Split clusters Merge clusters SANAPHOR
  7. Pre-Processing: Entity Linking 7 Entity Linking US President Barack Obama Australia Quintex Australia Quintex ltd. US President e1: Barack Obama e2: Australia e3: Quintex Australia e3: Quintex ltd.
  8. Pre-Processing: Semantic Typing 8 Semantic Typing: recognized entities are typed; other mentions are typed by string similarity with YAGO. YAGO Index US President e1: Barack Obama t1: US President e1: Barack Obama
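A minimal sketch of typing-by-string-similarity, assuming a tiny in-memory index in place of the actual YAGO index, and Jaccard token overlap as the similarity measure (our assumption; the paper does not commit to this exact measure here):

```python
# Hypothetical stand-in for the YAGO type index: surface form -> type id.
TYPE_INDEX = {
    "US President": "t1",
    "Chinese president": "t2",
}

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, ignoring case."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def type_of(mention, threshold=0.5):
    """Return the best-matching type id, or None below the threshold."""
    best_type, best_score = None, 0.0
    for label, type_id in TYPE_INDEX.items():
        score = jaccard(mention, label)
        if score > best_score:
            best_type, best_score = type_id, score
    return best_type if best_score >= threshold else None

type_of("the US president")  # matches t1 via the shared tokens "us president"
```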
  9. Cluster splits 9 Entity- and Type-based splitting on clusters (e2: Australia) (e3: Quintex Australia) (e3: Quintex ltd.) e3: Quintex Australia e3: Quintex ltd. e2: Australia
  10. Cluster splits: heuristics 10 1. Non-identified mention assignment – based on exclusive words in each cluster: Obama ⇒ Barack Obama Jinping ⇒ Xi Jinping 2. Ignore complete subsets of other identified mentions: ✕ Aspen (“Aspen Airways”) ✕ Obama (“Barack Obama”)
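The split step and heuristic 1 above can be sketched as follows. This is our own simplification: a Stanford cluster is partitioned by linked entity, and non-linked mentions are then assigned to the split whose exclusive words they share (e.g. "Obama" joins the "Barack Obama" split).

```python
def split_cluster(cluster, entity_of):
    """Partition a cluster by linked entity; entity_of maps mention -> entity id."""
    splits, unlinked = {}, []
    for mention in cluster:
        entity = entity_of.get(mention)
        if entity is None:
            unlinked.append(mention)
        else:
            splits.setdefault(entity, []).append(mention)

    # A word is "exclusive" to a split if it occurs in that split only.
    word_owner = {}
    for entity, mentions in splits.items():
        for word in {w for m in mentions for w in m.lower().split()}:
            word_owner[word] = None if word in word_owner else entity

    # Assign each non-linked mention via its exclusive words, if unambiguous.
    for mention in unlinked:
        owners = {word_owner.get(w) for w in mention.lower().split()}
        owners.discard(None)
        target = owners.pop() if len(owners) == 1 else None
        if target is not None:
            splits[target].append(mention)
        else:
            splits.setdefault(None, []).append(mention)  # leave unassigned
    return list(splits.values())

cluster = ["Barack Obama", "Xi Jinping", "Obama"]
entity_of = {"Barack Obama": "e1", "Xi Jinping": "e4"}
splits = split_cluster(cluster, entity_of)
# "Obama" joins the e1 split through the exclusive word "obama"
```

Heuristic 2 (ignoring mentions that are complete subsets of other identified mentions, like "Aspen" inside "Aspen Airways") would be an additional filter before the assignment loop; it is omitted here for brevity.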
  11. Cluster merges 11 Merge different clusters that contain the same types/entities t1: US President e1: Barack Obama (e1: Barack Obama) (t1:US President)
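The merge step can be sketched as a union of clusters annotated with the same entity or type id. The pair-based representation below is our assumption for illustration, not the actual SANAPHOR data structures; labels like "e1" mirror the slide.

```python
def merge_clusters(annotated):
    """annotated: list of (label, mentions) pairs, label = shared entity/type id.
    Clusters carrying the same label are merged, preserving first-seen order."""
    merged, order = {}, []
    for label, mentions in annotated:
        if label in merged:
            merged[label].extend(mentions)
        else:
            merged[label] = list(mentions)
            order.append(label)
    return [merged[label] for label in order]

clusters = [("e1", ["Barack Obama"]), ("e1", ["US President"])]
merge_clusters(clusters)  # -> [["Barack Obama", "US President"]]
```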
  12. Evaluation CoNLL-2012 Shared Task on Coreference Resolution: • over 1M words • 3 parts: development, training and test. Design methods based on dev, evaluate on test. Metrics: • Precision/Recall/F1 for the case of clustering • Evaluate noun-only clusters separately (no pronouns) 12
  13. Cluster linking statistics 13 Total clusters (Stanford Coref): 5078
      Entities linked per cluster:
                           0 entities | 1 entity | 2 entities | 3 entities
      All Clusters              4175  |     849  |        49  |         5
      Noun-Only Clusters        1208  |     502  |        33  |         2
      Clusters to re-arrange:
                           To be merged | To be split
      All Clusters                 270  |       118
      Noun-Only Clusters            77  |        52
  14. Cluster optimization results 14 • System improves on top of Stanford Coref in both split and merge tasks. • Greater improvement in split task for noun-only clusters, since we do not re-assign pronouns.
  15. Conclusions • Leveraging semantic information improves coreference resolution on top of existing NLP systems. • The performance improves with the improvement of entity and type linking. • Complete evaluation code available at: https://github.com/xi-lab/sanaphor 15 Roman Prokofyev (@rprokofyev) eXascale Infolab (exascale.info), University of Fribourg, Switzerland http://www.slideshare.net/eXascaleInfolab/
  16. Anaphora vs. Coreference “Do you have a cat? I love them.” “a cat” is not an antecedent of “them”. 16
  17. Metrics • True positive (TP) - two similar documents assigned to the same cluster. • True negative (TN) - two dissimilar documents assigned to different clusters. • False positive (FP) - two dissimilar documents assigned to the same cluster. • False negative (FN) - two similar documents assigned to different clusters. 17
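The TP/FP/FN counts above yield pairwise clustering precision, recall, and F1: every pair of items is counted once, comparing system clusters against gold clusters. A minimal sketch (our own helper, assuming clusters are lists of item ids):

```python
from itertools import combinations

def pairwise_prf(gold, system):
    """Pairwise precision/recall/F1 of system clusters against gold clusters."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}
    gold_pairs, sys_pairs = pairs(gold), pairs(system)
    tp = len(gold_pairs & sys_pairs)                      # correctly co-clustered pairs
    precision = tp / len(sys_pairs) if sys_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["a", "b", "c"], ["d"]]
system = [["a", "b"], ["c", "d"]]
pairwise_prf(gold, system)  # precision 0.5, recall 1/3
```

Note that the CoNLL shared task actually averages several cluster-level metrics (MUC, B³, CEAF); the pairwise version shown here is only the simplest instance of the TP/FP/FN definitions on the slide.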

Editor's Notes

  1. Welcome everyone, my name is Roman Prokofyev, and I’m a PhD student at the eXascale Infolab at the University of Fribourg, Switzerland. Today I will present our joint work on ontology-based coreference resolution.
  2. I’ll start with an overview of the task we are solving here.
  3. So, currently, the standard way to resolve coreferences is by means of a multi-step approach which was developed over the years.
  4. However, NLP-based approaches fail to determine the correct coreference clusters when the referring phrases are somewhat ambiguous.
  5. In our work we introduce a semantic layer on top of an existing system that allows us to rearrange coreference clusters based on their semantics. Stanford produces so-called clusters…
  6. Thus, we have designed the following pipeline for our system. Let’s see how each box operates in detail.
  7. The first step of our pipeline is entity linking, … Spotlight – a decent technology
  8. Beyond entity linking, the next pre-processing step is semantic typing…
  9. Now, after we have completed the necessary pre-processing steps, we start re-arranging the coreference clusters. The first step is to split semantically unrelated clusters, i.e., clusters that contain either different entities or types from different branches of the type hierarchy.
  10. We identified the following problems.
  11. The second step is cluster merging, that is, merging clusters that either contain the same entities, or exactly the same types, or, in case there is a mix of types and entities,…
  12. OntoNotes 5: available from the LDC for free; 1M words from newswire, magazine articles, and web data.
  13. First, we evaluate the quality of our entity linking step
  14. We notice that the absolute increase in F1 score for the split task is greater for the Noun-Only case (+10.54% vs. +2.94%). This results from the fact that All Clusters also contain non-noun mentions, such as pronouns, which we don’t directly tackle in this work but which nevertheless have to be assigned to one of the splits. Our approach in that context is to keep the non-noun mentions with the first noun mention in the cluster, which seems to be suboptimal for this case. For the merge task, the difference between All and Noun-Only clusters is much smaller (+27.03% for All Clusters vs. +18.96% for the Noun-Only case). In this case, non-noun words do not have any effect, since we merge whole clusters and thereby include all other mentions.