Ontology-based information extraction in the DERI Reading Group

The DERI Reading Group (10.11.2010)

http://www.deri.ie/teaching/reading-groups/archive/

Published in: Technology, Education

  1. The DERI Reading Group
     • Ontology-based information extraction: An Overview & Survey (2010, Wimalasuriya and Dou)
     • Tobias Wunner, UNLP Group; Paul Buitelaar
     • Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  2. Definition - Motivation
     a) Create content for the Semantic Web
        • convert existing websites into ontologies
     b) Improve quality of existing ontologies
        • Test criterion: OBIE task
        • OBIE good => ontology good
  3. Overview
     • Access to information…
  4. Overview
     • Access to information…
     • Ontology-based Information Extraction (OBIE): “A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”
  5. Overview
     • ESWC dogfood: OBIE-related topics (New!)
  6. Problem – two scenarios
     • Scenario 1, text only: extract conceptualization and instances
     • Diagram: text T; (1) conceptualization: “County building with café and football table” is-a “Building”; (2) instances: “Galway DERI building”
  7. Problem – two scenarios
     • Scenario 1, text only: extract conceptualization and instances
     • conceptualization can be too specific / generic
     • wrong conceptualization
     • Diagram: text T; (1) conceptualization: “County building with café and football table” is-a “Building”; (2) instances: “Galway DERI building”
  8. Problem – two scenarios
     • Scenario 2, domain ontology & text: extract instances only
     • Diagram: text T; conceptualization by domain ontology: “Building” located in “City”; (2) instances: “Galway DERI building”
  9. Problem – two scenarios
     • Scenario 2, domain ontology & text: extract instances only
     • less generic but semantically more stable
     • Diagram: text T; conceptualization by domain ontology: “Building” located in “City”; (2) instances: “Galway DERI building”
  10. Definition – key characteristics
      a) Process unstructured / semi-structured text
      b) “Guided” by an ontology
      c) Present output in an ontology
      • Diagram labels: Text Source, Information Extractor, Ontology (“guided by”)
  11. Definition – ontology learning or population?
      • Ontology population ⊂ OBIE
      • “OBIE is Open information extraction” (Etzioni)
      • alternative: semantics given by ontology!
      • extractors can be inside / outside ontology
      • Diagram labels: Text Source, Information Extractor, Ontology (“guided by”)
  12. Methods
      • Information extractors
        1. Linguistic rules
        2. Gazetteer lists
        3. Classification (classical / structure-aware)
        4. Partial parse trees
        5. Structured data analyzers
        6. Web querying
  13. Linguistic Rules - Methods
      • Regular expressions
        • <COMPANY> .* revenue <Number> <currency>
        • “Tesco’s revenue in 2009 was 3.4 billion GBP.”
      • Extraction ontologies
        • combination of ontology and lexicon (Mädche, Embley, Buitelaar)
        • manual construction
      • High precision
  14. Gazetteer Methods
      • 2. Gazetteer lists
      • Phrases / words instead of patterns
      • Named-Entity Recognition
      • Requirements:
        1) Specify what is being extracted
        2) Specify sources and avoid manual creation
      • Diagram: gazetteer for “industry” (Semantic Web, Software, Energy, Supermarket, …) next to example text: “The software giant SAP…”, “Tesco a UK supermarket …”, “Siemens energy revenue…”, “… wind energy company Vestas”
  15. Classification Methods
      • 3. Classification techniques
      • Break down the IE task into a set of binary tasks
      • Diagram labels: features (pos, semTag), Classifier, classes c1, c2, .., cn
  16. Classification Methods
      • Classical
      • misclassification does not consider structure! (equal cost 1/6)
      • Diagram labels: classes Location, City, Country, Industry, SW, Energy; instances Galway, Munich, Claddagh, Germany, Ireland, DERI, CITEC, Siemens, GE, Tesco
  17. Classification Methods
      • Structure-aware
      • Classifier should consider taxonomy structure!
      • W1,6 = 3
      • Diagram labels: instances Galway, Munich, Claddagh, Germany, Ireland, DERI, CITEC, Siemens, GE, Tesco
  18. Other methods
      • 4. Partial parse trees
        • TACITUS, SMES, LTAG
      • 5. Analyze structured data
        • Wikipedia Infoboxes
      • 6. Web querying
        • C-PANKOW
        • “Towards the self annotating web”
  19. Technologies used in implementation
      • Shallow NLP (GATE, sProUT, StanfordNLP)
        • POS, sentence splitting, regular expression
      • Semantic lexicons (WordNet, GermaNet)
        • synonym, meronym, hypernym
      • Semantic Annotation (OCAT, iDocument, PIMO)
      • Missing
        • Terminological tools (UMLS, bio terminologies)
        • Thesauri, translation memory
  20. Data sets & evaluation
      • Data sets (corpora)
        1) Message Understanding Conference (MUC-7)
        2) Automatic Content Extraction (ACE)
        • => more on classical IR, IE, NLP tracks
        • => no data set with given semantics (ontology)
      • Evaluation
        • Precision & recall
        • Only used for population task
  21. Recent Open IE argument
      • Con: Weikum, From Information to Knowledge - Harvest Web Resources for IE
        • Disambiguation
        • NL relations are not well defined (well defined arguments)
      • Pro: Weld, Using Wiki to Bootstrap Open IE
        • Relation targeted: learn extractor per relation -> lower recall
        • Structural targeted: general extraction engine -> lower precision
  22. Conclusion and Outlook
      • No established / agreed methods yet
      • Is OBIE also ontology learning?
      • Data sets
      • Methods for best extractors
      • Semantic Web contribution?
        • e.g. Gazetteers from DBpedia
      • Cross-lingual OBIE -> CLOBIE
  23. References
      [1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and Survey of Current Approaches, Journal of Information Science, June 2010
      [2] Buitelaar et al., Towards linguistically grounded ontologies, ESWC, Springer, 200
      [3] Weikum et al., From Information to Knowledge – Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010
      [4] Weld et al., Using Wikipedia to bootstrap open information extraction, SIGMOD Record, 2008
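
The regular-expression rule on slide 13 (`<COMPANY> .* revenue <Number> <currency>`) could be sketched in Python roughly as follows. This is a minimal illustration, not the method of any system named in the slides: the company and currency lists, the optional scale word, and the `extract_revenue` helper are assumptions introduced here, and a gap is allowed after "revenue" so that the slide's example sentence matches.

```python
import re

# Hypothetical lookup lists; a real system would take these from a gazetteer or ontology.
COMPANIES = ["Tesco", "Siemens", "SAP", "Vestas"]
CURRENCIES = ["GBP", "EUR", "USD"]

# <COMPANY> .* revenue .* <Number> [scale] <currency>
REVENUE_RULE = re.compile(
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")\b"
    r".*?\brevenue\b.*?"
    r"(?P<amount>\d+(?:\.\d+)?)\s*"
    r"(?P<scale>million|billion)?\s*"
    r"(?P<currency>" + "|".join(CURRENCIES) + r")\b"
)


def extract_revenue(sentence):
    """Return the slots filled by the rule, or None if the rule does not fire."""
    match = REVENUE_RULE.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "amount": float(match.group("amount")),
        "scale": match.group("scale"),
        "currency": match.group("currency"),
    }


if __name__ == "__main__":
    print(extract_revenue("Tesco’s revenue in 2009 was 3.4 billion GBP."))
```

The year "2009" is skipped because no currency follows it, so the rule only fires on the number that is followed by a currency. Hand-written rules of this kind match few sentences but rarely match the wrong ones, which fits the "high precision" note on the slide.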

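Slides 16 and 17 contrast a flat misclassification cost with one that respects the taxonomy. One common way to make the cost structure-aware, sketched here as an assumption rather than as the weighting scheme the slides had in mind, is to charge the number of taxonomy edges between the predicted and the gold class. The `PARENT` table is a hypothetical taxonomy loosely built from the class labels on the slides; the root `Thing` and the function names are invented for illustration.

```python
# Hypothetical taxonomy (child -> parent); "Thing" is an invented root.
PARENT = {
    "Location": "Thing",
    "Industry": "Thing",
    "City": "Location",
    "Country": "Location",
    "SW": "Industry",
    "Energy": "Industry",
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        path.append(cls)
    return path


def taxonomy_cost(predicted, gold):
    """Number of edges between two classes via their lowest common ancestor."""
    gold_depth = {c: i for i, c in enumerate(ancestors(gold))}
    for steps, c in enumerate(ancestors(predicted)):
        if c in gold_depth:
            return steps + gold_depth[c]
    raise ValueError("classes share no common ancestor")


if __name__ == "__main__":
    print(taxonomy_cost("City", "Country"))  # sibling classes
    print(taxonomy_cost("City", "Energy"))   # classes in different branches
```

With this table, confusing City with Country costs 2, while confusing City with Energy costs 4, so errors that stay within the right branch are penalised less than errors that cross branches.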