
ROSeAnn Presentation


A growing number of resources are available for enriching documents with semantic annotations. While originally focused on a few standard classes of annotations, the ecosystem of annotators is now becoming increasingly diverse. Although annotators often have very different vocabularies, with both high-level and specialist
concepts, they also have many semantic interconnections. We will show that both the overlap and the diversity in annotator vocabularies motivate the need for semantic annotation integration: middleware that produces a unified annotation on top of diverse semantic annotators. On the one hand, the diversity of vocabularies lets applications
benefit from the much richer vocabulary available in
an integrated annotator. On the other hand, we present evidence that the most widely-used annotators on the web suffer from serious accuracy deficiencies: the overlap in the vocabularies of individual annotators allows an integrated annotator to boost accuracy by exploiting inter-annotator agreement and disagreement.

The integration of semantic annotations leads to new challenges, both compared to usual data integration scenarios and to standard aggregation of machine learning tools. We overview an approach to these challenges that performs ontology-aware aggregation. We
introduce an approach that requires no training data, making use of ideas from database repair. We experimentally compare this with a supervised approach, which adapts maximum entropy Markov models to the setting of ontology-based annotations. We further experimentally compare both of these approaches with ontology-unaware
supervised approaches, and with individual annotators.


  1. ROSeAnn: Aggregating Semantic Annotators. Luying Chen, Stefano Ortona, Giorgio Orsi, and Michael Benedikt, Department of Computer Science, The University of Oxford. DIADEM: domain-centric, intelligent, automated data extraction methodology.
  2. Plenty of data on the web (repeated across slides 2–5)
  6. But the web is also text: news feeds; posts and tweets; PDFs. Example PDF excerpt: "155. Specific events and factors were of particular importance in the decline of ABCPs. Firstly, some conduits had large ABS holdings that experienced huge declines. When investors stopped rolling over ABCPs, these conduits had to rely on guarantees provided by banks which were too large for the banks providing them. While these banks received support to meet their obligations, investor confidence was nonetheless damaged. Secondly, structures in other ABCP markets around the world unsettled investors, including different guarantee agreements and single-seller extendible mortgage conduits. Thirdly, general concerns about the banking sector have caused investors to buy less bank-related product."

     Table 3 - European ABCP issuance
             Q1      Q2      Q3      Q4      Total
     2004    34.7    36.2    44.5    51.3    166.7
     2005    58.1    63.4    61.6    55.2    238.4
     2006    74.7    84.1    96.5    111.8   367.1
  7. Entity recognition ecosystem
  8. Understanding their behaviour. Collect all original entity types (Company, Country, Movie, Person, Organization, Location, City, StateOrCounty, …); organise them into a taxonomy (e.g. Company ⊑ Organization, City and StateOrCounty ⊑ Location, everything ⊑ Thing); add disjointness constraints: Organization disjointWith Person, Organization disjointWith Location, Movie disjointWith Person, Person disjointWith Location.
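The taxonomy-plus-disjointness structure on this slide can be sketched in a few lines. This is a minimal illustration, not ROSeAnn's actual data structure; the subclass edges shown here (e.g. whether Country sits under Location or directly under Thing) are assumptions beyond what the slide states.

```python
SUBCLASS = {  # child -> parent (assumed edges, following slide 8)
    "Company": "Organization",
    "City": "Location",
    "StateOrCounty": "Location",
    "Organization": "Thing",
    "Location": "Thing",
    "Person": "Thing",
    "Movie": "Thing",
}

DISJOINT = {  # unordered disjoint pairs from slide 8
    frozenset({"Organization", "Person"}),
    frozenset({"Organization", "Location"}),
    frozenset({"Movie", "Person"}),
    frozenset({"Person", "Location"}),
}

def ancestors(c):
    """Return c together with all of its superclasses."""
    out = {c}
    while c in SUBCLASS:
        c = SUBCLASS[c]
        out.add(c)
    return out

def consistent(types):
    """True iff no two inherited types violate a disjointness constraint."""
    closure = set().union(*(ancestors(t) for t in types))
    return not any(frozenset({a, b}) in DISJOINT
                   for a in closure for b in closure if a != b)
```

For example, `{"Person", "Company"}` is inconsistent because Company inherits Organization, which is disjoint with Person.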
  9. Entity extractors: observations. [Two bar charts of precision, recall, and F-score per entity type: Person, Date, Movie and Location, Sport, Movie; y-axis 0–1.] Observation 1: low accuracy. (*) Results obtained on Reuters.
  10. Entity extractors: observations. Observation 2: vocabulary is limited and overlapping. [Venn diagram of annotator vocabularies (AlchemyAPI, Saplo, Extractiv, Zemanta, Lupedia) with types such as Region, Museum, Person, Country, Scientist, Planet, Brand, Product, Ocean, Company.]
  11. Analysis of conflicts. Observation 3: annotators disagree on both concepts and spans, and conflicts are frequent → reconciliation is needed.
  12. ROSeAnn: Reconcile Opinions of Semantic Annotators. Goals: compute logically consistent annotations; maximise the agreement among annotators.
  13. Supervised: MEMM. Train a MEMM sequence labeller. Features (token-based): entity type; subclass / disjointness; span (B/I/O encoding). Inference: the most likely and logically-consistent labelling for the sequence (Viterbi + pruning).
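The "Viterbi + pruning" inference step can be illustrated as follows. This is a minimal sketch with hypothetical score tables, not ROSeAnn's implementation: labels that are logically inconsistent at a token (per the `allowed` predicate) are pruned by assigning them a score of −∞ before the usual dynamic-programming recursion.

```python
import math

def viterbi(obs_scores, trans, labels, allowed):
    """Most likely label sequence, with inconsistent labels pruned.

    obs_scores[t][l] : log-score of label l at token t (MEMM features)
    trans[(l1, l2)]  : log-score of transitioning from l1 to l2
    allowed(t, l)    : False if label l is logically inconsistent at
                       token t -- such labels are pruned (score -inf)
    """
    V = [{l: (obs_scores[0][l] if allowed(0, l) else -math.inf, None)
          for l in labels}]
    for t in range(1, len(obs_scores)):
        row = {}
        for l in labels:
            if not allowed(t, l):          # pruning step
                row[l] = (-math.inf, None)
                continue
            prev, score = max(
                ((p, V[t - 1][p][0] + trans.get((p, l), -math.inf))
                 for p in labels),
                key=lambda x: x[1])
            row[l] = (score + obs_scores[t][l], prev)
        V.append(row)
    # backtrack from the best final label
    best = max(V[-1], key=lambda l: V[-1][l][0])
    path = [best]
    for t in range(len(V) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return path[::-1]
```

In the real system the B/I/O-encoded entity types and the ontology's subclass/disjointness constraints would drive `labels` and `allowed`.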
  14. Unsupervised: Weighted Repair. Judgement aggregation: experts give opinions about a set of (logical) statements; compute a logically-consistent, aggregated judgement. Database repairs / consistent query answering: a database instance plus constraints (schema, dependencies); answers are computed on (minimal) repairs.
  15. Unsupervised: Weighted Repair. Propositions: the ontological constraints Σ and the annotations (as facts). Base support for a concept C on a span S: each annotator Ai votes by annotating S with a concept in its vocabulary, or by failing to annotate S. AtomicScore(C) = +1 for every Ai annotating S with some C' ⊑ C, and −1 for every Ai annotating S with some C' such that C' ⊓ C ⊑ ⊥.
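The base-support computation can be sketched directly from this definition. The taxonomy below is the one from the worked example two slides later (Chef ⊑ Person ⊑ LegalEntity, Organisation ⊑ LegalEntity, Person ⊓ Organisation ⊑ ⊥); the helper tables and lambdas are illustrative, not ROSeAnn's API.

```python
def atomic_score(c, annotations, subsumed_by, disjoint):
    """+1 for each annotator whose concept C' satisfies C' ⊑ c,
    -1 for each annotator whose concept C' satisfies C' ⊓ c ⊑ ⊥."""
    score = 0
    for c_prime in annotations.values():
        if subsumed_by(c_prime, c):
            score += 1
        elif disjoint(c_prime, c):
            score -= 1
    return score

# Taxonomy of the worked example; disjointness propagates to subclasses,
# so Chef (⊑ Person) is also disjoint with Organisation.
SUP = {"Chef": {"Chef", "Person", "LegalEntity"},
       "Person": {"Person", "LegalEntity"},
       "Organisation": {"Organisation", "LegalEntity"}}
DIS = {frozenset({"Person", "Organisation"}),
       frozenset({"Chef", "Organisation"})}

ann = {"A1": "Person", "A2": "Organisation", "A3": "Chef"}
sub = lambda a, b: b in SUP.get(a, {a})
dis = lambda a, b: frozenset({a, b}) in DIS

atomic_score("LegalEntity", ann, sub, dis)   # +3
atomic_score("Person", ann, sub, dis)        # +1
atomic_score("Organisation", ann, sub, dis)  # -1
atomic_score("Chef", ann, sub, dis)          # 0
```

These are exactly the four AtomicScore values shown on the example slide.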
  16. Unsupervised: Weighted Repair. Initial solution: the conjunction of all types, φ: C1 ∧ C2 ∧ … ∧ Cn. Repair operations (op): ins(Ci), insertion of a Ci not already in φ; del(Ci), deletion of a Ci from φ (and of all its subclasses) together with ins(¬Ci). Solution S: non-conflicting (ops in S do not "override" each other); non-redundant (no insertion/deletion of non-implied types); consistent with Σ; maximally agreed: maximise Σ_ins(C)∈S AtomicScore(C) − Σ_del(C)∈S AtomicScore(C).
  17. Weighted Repair: Example. Text: "text text text, Jamie Oliver and some text here"; annotations: A1 = Person, A2 = Organisation, A3 = Chef; φ: Person ∧ Organisation ∧ Chef. Σ: Person ⊓ Organisation ⊑ ⊥; Chef ⊑ Person; Person ⊑ LegalEntity; Organisation ⊑ LegalEntity. AtomicScore(LegalEntity) = +3 {A1,A2,A3} = +3; AtomicScore(Person) = +2 {A1,A3} − 1 {A2} = +1; AtomicScore(Organisation) = +1 {A2} − 2 {A1,A3} = −1; AtomicScore(Chef) = +1 {A3} − 1 {A2} = 0.
  18. Weighted Repair: Example. φ: Person ∧ Organisation ∧ Chef. S1: ins(LegalEntity), del(Chef), del(Person), del(Organisation) → φ1: LegalEntity ∧ ¬Person ∧ ¬Organisation ∧ ¬Chef; Agr(S1) = +3 − 1 + 1 − 0 = +3. S2: ins(LegalEntity), del(Organisation) → φ2: LegalEntity ∧ Person ∧ ¬Organisation ∧ Chef; Agr(S2) = +3 + 1 + 1 + 0 = +5 (most specific surviving type: Chef).
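The agreement values on this slide can be reproduced by reading Agr off the final repaired formula: each asserted type contributes its atomic score, each negated (deleted) type subtracts its score. A minimal sketch, with the slide-17 AtomicScores hard-coded:

```python
def agreement(asserted, negated, scores):
    """Agr of a repaired formula: asserted types add their atomic
    score, negated (deleted) types subtract theirs."""
    return (sum(scores[c] for c in asserted)
            - sum(scores[c] for c in negated))

# AtomicScore values from the slide-17 example
scores = {"LegalEntity": 3, "Person": 1, "Organisation": -1, "Chef": 0}

# S1: LegalEntity ∧ ¬Person ∧ ¬Organisation ∧ ¬Chef
agreement({"LegalEntity"}, {"Person", "Organisation", "Chef"}, scores)  # +3
# S2: LegalEntity ∧ Person ∧ ¬Organisation ∧ Chef
agreement({"LegalEntity", "Person", "Chef"}, {"Organisation"}, scores)  # +5
```

S2 wins with agreement +5, so the span is labelled with its most specific surviving type, Chef.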
  19. Weighted Repair: Breaking ties. φ: Person ∧ Organisation ∧ Chef. S2: ins(LegalEntity), del(Organisation) → LegalEntity ∧ Person ∧ ¬Organisation ∧ Chef (most specific type: Chef). S3: ins(LegalEntity), del(Organisation), del(Chef) → LegalEntity ∧ Person ∧ ¬Organisation ∧ ¬Chef (most specific type: Person). Same agreement: prefer the solution with fewer operations.
  20. Entity extractors: evaluation. Corpora: MUC7 NER task [300 docs, 7 types, ~18k entities]; Reuters (sample) [250 docs, 215 types, ~50k entities]; FOX (Leipzig) [100 docs, 3 types, 395 entities]; Web [20 docs, 5 types, 624 entities]. Evaluation: PrecisionΩ = |InstAN(C+) ∩ InstGS(C+)| / |InstAN(C+)|; RecallΩ = |InstAN(C+) ∩ InstGS(C+)| / |InstGS(C+)|. 10-fold validation; micro and macro averages.
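For exact span-and-type matches, the PrecisionΩ / RecallΩ formulas reduce to ordinary set-based precision and recall over typed spans. A simplified sketch (the slide's InstAN(C+) / InstGS(C+) additionally close types under the ontology's superclass relation, which is omitted here):

```python
def precision_recall(annotated, gold):
    """Set-based precision/recall over (span, type) pairs:
    Precision = |AN ∩ GS| / |AN|, Recall = |AN ∩ GS| / |GS|."""
    tp = len(annotated & gold)
    return tp / len(annotated), tp / len(gold)

# Hypothetical example: spans are (start, end) character offsets.
annotated = {((0, 5), "Person"), ((6, 9), "City")}
gold = {((0, 5), "Person"), ((10, 12), "Date")}
precision_recall(annotated, gold)  # (0.5, 0.5)
```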
  21. Evaluation: Individual vs Aggregated. (*) Full comparison results at
  22. Evaluation: Aggregators. (*) Full comparison results at
  23. Evaluation: Performance. WR: runtime ∝ number of annotations covering a span; MEMM: runtime ∝ number of concepts in the ontology.
  24. Performance: MEMM Training. Training time ∝ number of entity types in the ontology.
  25. ROSeAnn @work: Text
  26. ROSeAnn @work: Web
  27. ROSeAnn @work: PDF
  28. Summary. Not discussed: resolution of conflicting spans; relationships with consistent QA / argumentation frameworks; WR with weights / bootstrapping; web and PDF structural NERs (SNER); MEMM vs CRF. Future work: automatic maintenance of the ontology; probabilistic and ontological querying of annotations; relation, attribute, and sentiment extraction; entity disambiguation and linking.
  29. Get ROSeAnn at: Try out our REST endpoints: