Successfully reported this slideshow.

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

1

Share

1 of 42
1 of 42

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

1

Share

Download to read offline

Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where
data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts,
however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous.
These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a seman-
tic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a
pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency,
context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information ex-
traction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.

Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where
data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts,
however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous.
These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a seman-
tic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a
pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency,
context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information ex-
traction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.

More Related Content

Related Audiobooks

Free with a 14 day trial from Scribd

See all

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

  1. 1. Digital Enterprise Research Institute www.deri.ie A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia André Freitas, Danilo Carvalho, J. C. P. da Silva, Sean O’Riain, Edward Curry © Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
  2. 2. Outline Digital Enterprise Research Institute www.deri.ie  Motivation  Representation  Requirements  Semantic Best-effort Representation  Extraction  Graphia Extractor  Preliminary Evaluation  Extraction Examples  Conclusion
  3. 3. Digital Enterprise Research Institute www.deri.ie Motivation
  4. 4. Motivation Digital Enterprise Research Institute www.deri.ie
  5. 5. Motivation Digital Enterprise Research Institute www.deri.ie
  6. 6. Motivation Digital Enterprise Research Institute www.deri.ie
  7. 7. Motivation Digital Enterprise Research Institute www.deri.ie Natural language texts Linked Data  No terminological or structural regularity  Terminological and structural regularity  Highly contextualized  Shared semantic agreement  Complex semantic dependency between data consumers relations  Ambiguity Information selection/normalization - vocabulary constraints + entity-centric + pay-as-you-go data semantics = semantic best-effort
  8. 8. Motivation Digital Enterprise Research Institute www.deri.ie  Vocabulary-independent (schema-free queries)  How to abstract users from knowing the data representation?  Semantic matching  Schemaless databases in the limit demands vocabulary-independency  How information extraction is reshaped in this scenario?
  9. 9. Motivational Scenario Digital Enterprise Research Institute www.deri.ie What is the relationship between Barack Obama and Indonesia? Semantic Best- effort Extraction Sentence: From age six to ten, Obama attended local schools in Jakarta, Entity-centric text including Besuki Public representation School and St Francis Assisi School.
  10. 10. Digital Enterprise Research Institute www.deri.ie Representation
  11. 11. Computational Linguistics Perspective Digital Enterprise Research Institute www.deri.ie  What is already there to represent NL?  Discourse Representation Theory (DRT)  Semantic Role Labeling (SRL)
  12. 12. Discourse Representation Theory (DRT) Digital Enterprise Research Institute www.deri.ie  “The key idea behind (...) Discourse Representation Theory is that each new sentence of a discourse is interpreted in the context provided by the sentences preceding it.” van Eijck and Kamp  Models propositions in discourse (multiple sentences).  Discourse representation structures (DRS). John enters a card. Every card is green.
  13. 13. Semantic Role Labeling (SRL) Digital Enterprise Research Institute www.deri.ie  Shallow semantic parsing.  Detection of arguments associated with a predicate.  Associated semantic types to arguments. Bill cut his hair with a razor [Agent Bill] cut [Patient his hair] [Instrument with a razor.]
  14. 14. Semantic Best-Effort Digital Enterprise Research Institute www.deri.ie  Objectives:  Entity-centric & Standardized: easier to integrate with other resources  Remove the formal constraints and the ‘baggage’ from existing approaches  Representation robust to extraction limitations/errors
  15. 15. Semantic Best-Effort Requirements Digital Enterprise Research Institute www.deri.ie  Text segmentation into (s,p,o)s  Context representation  Conceptual model independency  Resolve co-references (pay-as-you-go)  Represent recurrent discourse structures  Standardized representation (RDF(S))  Principled interpretation (compositionality)
  16. 16. Examples Digital Enterprise Research Institute www.deri.ie - Text segmentation into (s,p,o)s - Context representation - Resolve co-references (pay-as-you-go) - Conceptual model independency
  17. 17. Examples Digital Enterprise Research Institute www.deri.ie - Context representation
  18. 18. Examples Digital Enterprise Research Institute www.deri.ie
  19. 19. Examples Digital Enterprise Research Institute www.deri.ie - Represent recurrent discourse structures
  20. 20. Examples Digital Enterprise Research Institute www.deri.ie - Represent recurrent discourse structures - Resolve co-references (pay-as-you-go)
  21. 21. Examples Digital Enterprise Research Institute www.deri.ie - Represent recurrent discourse structures
  22. 22. SDG Elements Digital Enterprise Research Institute www.deri.ie  Named, non-named entities and properties  Quantifiers & operators  Triple Trees  Context elements  Co-Referential elements  Resolved & normalized entities
  23. 23. Graph Patterns Digital Enterprise Research Institute www.deri.ie
  24. 24. [[Interpretation]] Digital Enterprise Research Institute www.deri.ie  Graph traversal – deref sequence
  25. 25. Digital Enterprise Research Institute www.deri.ie Extraction
  26. 26. SBE Graph Extraction Tool Digital Enterprise Research Institute www.deri.ie
  27. 27. Extraction Pipeline Architecture Digital Enterprise Research Institute www.deri.ie  Subject  Predicate  Object  Prepositional phrase & Noun complement  Reification  Time
  28. 28. Preliminary Evaluation Digital Enterprise Research Institute www.deri.ie  1033 relations (triples) from 150 sentences from 5 randomly selected Wikipedia articles  Manually classified the graphs: error categories and accuracy.
  29. 29. Preliminary Evaluation Digital Enterprise Research Institute www.deri.ie
  30. 30. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  31. 31. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  32. 32. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  33. 33. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  34. 34. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  35. 35. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  36. 36. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  37. 37. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  38. 38. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  39. 39. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  40. 40. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  41. 41. Other Extraction Examples Digital Enterprise Research Institute www.deri.ie
  42. 42. Conclusion Digital Enterprise Research Institute www.deri.ie  Main direction for improvement is completeness  Aligned with the pay-as-you-go scenario  Still need to define clear criteria for what you can’t extract  There is still a long way to go (e.g. complex subordination)  Investigation using existing n-ary relations patterns  Context (reification) should be a first-class citizen in the representation of natural language  Focus on getting the semantic pivots (rigid designators) right  Worth putting effort on enumerable patterns (timestamps, operators)

Editor's Notes

  • Change to horizontal
  • Change to horizontal
  • Change to horizontal
  • Change to horizontal
  • Emphasize entity
  • Logic and linguistics have had lively connections from Antiquity right until today
  • Limitations. Categories of requirements.
  • Relate with req
  • ×