Ontology-based information extraction in the DERI Reading Group

The DERI Reading Group (10.11.2010)

http://www.deri.ie/teaching/reading-groups/archive/

    Presentation Transcript

    • The DERI Reading Group
      Ontology-based information extraction: An Overview & Survey
      (2010, Wimalasuriya and Dou)
      Tobias Wunner, UNLP Group
      • Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar
    • Definition - Motivation
      Create content for the Semantic Web
      • convert existing websites into ontologies
      Improve quality of existing ontologies
      • Test criterion: OBIE task
      • OBIE good => ontology good
    • Overview
      • Access to information…
      Ontology-based Information Extraction (OBIE):
      “A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”
    • Overview
      • ESWC dogfood OBIE-related topics
      New!
    • Problem – two scenarios
      • 1. Text only:
      • Extract conceptualization and instances
      • conceptualization can be too specific / generic
      • wrong conceptualization
      [Figure: 1. conceptualization: "Building" and "building with café and football table" linked by is-a, "County", top node T; 2. instances: "DERI building", "Galway"]
    • Problem – two scenarios
      • 2. Domain ontology & text:
      • extract instances only
      • less generic but more semantically stable
      [Figure: conceptualization given by the domain ontology: Building "located in" City, top node T; 2. instances: "DERI building", "Galway"]
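
The second scenario (conceptualization fixed by a domain ontology, only instances extracted from text) can be sketched as follows. This is a minimal illustration, not the method of any system discussed in the talk: the `ONTOLOGY` dictionary, the `locatedIn` relation name, the `populate` function, and the sentence pattern are all assumptions made for the example.

```python
import re

# Hypothetical domain ontology: one relation with a typed domain and range.
ONTOLOGY = {"locatedIn": ("Building", "City")}

# Hypothetical surface pattern for the relation ("X is located in Y").
LOCATED_IN = re.compile(r"(?:The\s+)?([A-Z][\w ]*?) is located in ([A-Z]\w*)")

def populate(text):
    """Return instance triples; the classes come from the ontology, not from the text."""
    domain, range_ = ONTOLOGY["locatedIn"]
    triples = []
    for match in LOCATED_IN.finditer(text):
        subj, obj = match.group(1), match.group(2)
        triples.append((subj, "rdf:type", domain))
        triples.append((obj, "rdf:type", range_))
        triples.append((subj, "locatedIn", obj))
    return triples

triples = populate("The DERI building is located in Galway.")
```

Because the classes are supplied by the ontology, the extractor cannot invent an overly specific class such as "building with café and football table"; it only adds instances.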
    • Definition – key characteristics
      Process unstructured / semi-structured natural language text
      “guided” by an ontology
      Present output in ontology
      [Figure: text source → information extractor → ontology; the extractor is guided by the ontology]
    • Definition – ontology learning or population?
      • Ontology population ⊂ OBIE
      • “OBIE is Open information extraction” (Etzioni)
      • alternative: semantics given by ontology!
      • extractors can be inside / outside ontology
      [Figure: text source → information extractor → ontology; the extractor is guided by the ontology]
    • Methods
      • Information extractors
      1. Linguistic rules
      2. Gazetteer lists
      3. Classification (classical / structure-aware)
      4. Partial parse trees
      5. Structured data analyzers
      6. Web querying
    • Linguistic Rules - Methods
      • Regular expressions
      • <COMPANY> .* revenue <Number> <currency>
      “Tesco’s revenue in 2009 was 3.4 billion GBP.”
      • Extraction ontologies
      • combination of ontology and lexicon (Mädche, Embley, Buitelaar)
      • manual construction
      • High precision
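
The slide's rule `<COMPANY> .* revenue <Number> <currency>` can be approximated with an ordinary regular expression. The sketch below is only an illustration under stated assumptions: the company and currency lists, the optional "million/billion" scale word, and the function name `extract_revenue` are hypothetical and not taken from the talk.

```python
import re

# Hypothetical mini-lexicons standing in for the <COMPANY> and <currency> slots.
COMPANIES = ["Tesco", "Siemens", "SAP"]
CURRENCIES = ["GBP", "EUR", "USD"]

PATTERN = re.compile(
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")"
    r".*?revenue.*?"
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<scale>million|billion)?\s*"
    r"(?P<currency>" + "|".join(CURRENCIES) + r")"
)

def extract_revenue(sentence):
    """Return the filled slots as a dict, or None if the rule does not fire."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "amount": float(match.group("amount")),
        "scale": match.group("scale"),
        "currency": match.group("currency"),
    }

result = extract_revenue("Tesco’s revenue in 2009 was 3.4 billion GBP.")
```

The year 2009 is skipped because a number only counts as the amount when a currency follows it, which is one reason hand-written rules of this kind tend towards high precision.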
    • Gazetteer Methods
      • 2. Gazetteer lists
      • Phrases / words instead of patterns
      • Named-Entity Recognition
      • Requirements:
      Specify what is being extracted
      Specify sources and avoid manual creation
      [Figure: example phrases “The software giant SAP…”, “Tesco a UK supermarket …”, “Siemens energy revenue…”, “… wind energy company Vestas” matched against industry classes: Semantic Web, Software, Energy, Supermarket]
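
A gazetteer lookup replaces patterns with lists of known words or phrases. The sketch below is a minimal illustration; the `GAZETTEER` entries and their class assignments are assumptions loosely based on the slide's example phrases, and `annotate` is a hypothetical helper name.

```python
import re

# Hypothetical gazetteer: surface form -> industry class.
GAZETTEER = {
    "SAP": "Software",
    "Tesco": "Supermarket",
    "Siemens": "Energy",
    "Vestas": "Energy",
}

def annotate(text, gazetteer=GAZETTEER):
    """Return (surface form, class, start offset) for every gazetteer hit, in text order."""
    hits = []
    for term, label in gazetteer.items():
        # Whole-word match so that a term does not fire inside a longer word.
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            hits.append((term, label, match.start()))
    return sorted(hits, key=lambda hit: hit[2])

hits = annotate("The software giant SAP and the wind energy company Vestas")
```

In line with the second requirement on the slide, such a dictionary would ideally be generated from an existing source rather than written by hand as it is here.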

    • Classification Methods
      • 3. Classification techniques
      • Break down the IE task into a set of binary classification tasks
      [Figure: features (pos, semTag, ..) → classifier → one decision per class c1, c2, .., cn]
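
One common way to break a multi-class labelling task into binary tasks is a one-vs-rest split. The slide does not say which decomposition is meant, so the sketch below is an assumption; the feature dictionaries, labels, and the name `to_binary_tasks` are hypothetical.

```python
def to_binary_tasks(examples, classes):
    """Split a multi-class labelling into one binary (one-vs-rest) task per class."""
    return {
        cls: [(features, label == cls) for features, label in examples]
        for cls in classes
    }

# Hypothetical training examples: (features, class label).
examples = [
    ({"token": "Galway", "pos": "NNP"}, "City"),
    ({"token": "Siemens", "pos": "NNP"}, "Energy"),
]

tasks = to_binary_tasks(examples, ["City", "Energy"])
```

Each entry of `tasks` can then be handed to any binary classifier, with the per-class decisions combined afterwards.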
    • Classification Methods
      • Classical
      • Misclassification does not consider structure! (equal cost 1/6)
      [Figure: classes Industry, Location, Energy, SW, City, Country; instances Tesco, Claddagh, DERI, Galway, Germany, Siemens, GE, Ireland, Munich, CITEC]
    • Classification Methods
      • Structure aware
      • Classifier should consider taxonomy structure!
      [Figure: taxonomy over the instances Tesco, Claddagh, DERI, Galway, Germany, Siemens, GE, Ireland, Munich, CITEC, with an example weight W1,6 = 3]
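
One way to make the misclassification cost depend on the taxonomy is to use the path length between the predicted and the true class. The slide does not define its weights, so tree distance is a swapped-in assumption here, and the `PARENT` hierarchy (with root `T`) is a hypothetical arrangement of the class labels shown in the figures.

```python
# Hypothetical taxonomy as child -> parent.
PARENT = {
    "Industry": "T", "Location": "T",
    "Energy": "Industry", "SW": "Industry",
    "City": "Location", "Country": "Location",
}

def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges on the taxonomy path between classes a and b."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(n for n in up_a if n in up_b)  # lowest common ancestor
    return up_a.index(common) + up_b.index(common)

near = tree_distance("Energy", "SW")    # siblings under Industry
far = tree_distance("Energy", "City")   # different branches of the taxonomy
```

Under this cost, confusing Energy with SW is penalised less than confusing Energy with City, whereas the classical setting charges the same for both errors.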
    • Other methods
      • 4. Partial parse trees
      • TACITUS, SMES, LTAG
      • 5. Analyze structured data
      • Wikipedia Infoboxes
      • 6. Web querying
      • C-PANKOW
      • “Towards the self-annotating web”
    • Technologies used in implementation
      • Shallow NLP (GATE, SProUT, StanfordNLP)
      • POS, sentence splitting, regular expression
      • Semantic lexicons (WordNet, GermaNet)
      • synonym, meronym, hypernym
      • Semantic Annotation (OCAT, iDocument, PIMO)
      Missing:
      • Terminological tools (UMLS, bio terminologies)
      • Thesauri, translation memory
    • Data sets & evaluation
      Data sets (corpora)
      Message Understanding Conference (MUC-7)
      Automatic Content Extraction (ACE)
      • => more on classical IR, IE, NLP tracks
      • => no data set with given semantics (ontology)
      Evaluation
      • Precision & recall
      • Only used for population task
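
For the population task, precision and recall compare the extracted instance assertions with a gold standard. A minimal sketch, assuming each assertion is an (instance, class) pair; the example data is made up for illustration.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted assertions against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold standard and system output.
gold = {("DERI building", "Building"), ("Galway", "City"), ("Ireland", "Country")}
extracted = {("DERI building", "Building"), ("Galway", "Country")}

precision, recall = precision_recall(extracted, gold)
```

Here one of the two extracted assertions is correct and one of the three gold assertions is found.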
    • Recent Open IE argument
      • Con: Weikum, From Information to Knowledge – Harvesting Entities and Relationships from Web Sources [3]
      • Disambiguation
      • NL relations are not well defined (well defined arguments)
      • Pro: Weld, Using Wikipedia to bootstrap open information extraction [4]
      • Relation-targeted:
      • learn extractor per relation -> lower recall
      • Structure-targeted:
      • general extraction engine -> lower precision
    • Conclusion and Outlook
      • No established / agreed methods yet
      • Is OBIE also ontology learning?
      • Data sets
      • Methods for best extractors
      • Semantic Web contribution?
      • e.g. Gazetteers from DBpedia
      • Cross-lingual OBIE -> CLOBIE
    • References
      [1] Wimalasuriya, Dou, Ontology-based information extraction: An introduction and survey of current approaches, Journal of Information Science, June 2010
      [2] Buitelaar et al., Towards linguistically grounded ontologies, ESWC, Springer, 200
      [3] Weikum et al., From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010
      [4] Weld et al., Using Wikipedia to bootstrap open information extraction, SIGMOD Record, 2008