A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Digital Enterprise Research Institute www.deri.ie

A Semantic Best-Effort Approach for
Extracting Structured Discourse
Graphs from Wikipedia
André Freitas, Danilo Carvalho, J. C. P. da Silva,
Sean O’Riain, Edward Curry

© Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Outline

 Motivation
 Representation
 Requirements
 Semantic Best-effort Representation
 Extraction
 Graphia Extractor
 Preliminary Evaluation
 Extraction Examples
 Conclusion


Motivation

Motivation

Natural language texts Linked Data
 No terminological or structural
regularity
 Terminological and
structural regularity
 Highly contextualized
 Shared semantic agreement
 Complex semantic dependency between data consumers
relations
 Ambiguity
Information selection/normalization

- vocabulary constraints
+ entity-centric
+ pay-as-you-go data semantics
= semantic best-effort

Motivation

 Vocabulary-independent (schema-free queries)
 How to abstract users from knowing the data
representation?
 Semantic matching

 Schemaless databases in the limit demands
vocabulary-independency

 How information extraction is reshaped in this
scenario?

Motivational Scenario

What is the relationship between Barack Obama and Indonesia?

Semantic Best-
effort Extraction

Sentence: From age six
to ten, Obama attended
local schools in Jakarta, Entity-centric text
including Besuki Public representation
School and St Francis
Assisi School.


Representation

Computational Linguistics Perspective

 What is already there to represent NL?
 Discourse Representation Theory (DRT)
 Semantic Role Labeling (SRL)

Discourse Representation Theory (DRT)

 “The key idea behind (...) Discourse Representation Theory is
that each new sentence of a discourse is interpreted in the
context provided by the sentences preceding it.”
van Eijck and Kamp
 Models propositions in discourse (multiple sentences).
 Discourse representation structures (DRS).

John enters a card. Every card is green.

Semantic Role Labeling (SRL)

 Shallow semantic parsing.
 Detection of arguments associated with a predicate.
 Associated semantic types to arguments.

Bill cut his hair with a razor
[Agent Bill] cut [Patient his hair] [Instrument with a razor.]

Semantic Best-Effort

 Objectives:
 Entity-centric & Standardized: easier to integrate with
other resources
 Remove the formal constraints and the ‘baggage’ from
existing approaches
 Representation robust to extraction limitations/errors

Semantic Best-Effort Requirements

 Text segmentation into (s,p,o)s
 Context representation
 Conceptual model independency
 Resolve co-references (pay-as-you-go)
 Represent recurrent discourse structures
 Standardized representation (RDF(S))
 Principled interpretation (compositionality)

Examples

- Text segmentation into (s,p,o)s
- Context representation
- Resolve co-references (pay-as-you-go)
- Conceptual model independency

Examples

- Context representation

Examples

Examples

- Represent recurrent discourse structures

Examples

- Represent recurrent discourse structures
- Resolve co-references (pay-as-you-go)

SDG Elements

 Named, non-named entities and properties
 Quantiﬁers & operators
 Triple Trees
 Context elements
 Co-Referential elements
 Resolved & normalized entities

Graph Patterns

[[Interpretation]]

 Graph traversal – deref sequence


Extraction

SBE Graph Extraction Tool

Extraction Pipeline Architecture

 Subject
 Predicate
 Object
 Prepositional phrase & Noun complement
 Reification
 Time

Preliminary Evaluation

 1033 relations (triples) from 150 sentences from 5
randomly selected Wikipedia articles
 Manually classified the graphs: error categories
and accuracy.

Preliminary Evaluation

Other Extraction Examples

Conclusion

 Main direction for improvement is completeness
 Aligned with the pay-as-you-go scenario
 Still need to define clear criteria for what you can’t extract
 There is still a long way to go (e.g. complex subordination)
 Investigation using existing n-ary relations patterns
 Context (reification) should be a first-class citizen in the
representation of natural language
 Focus on getting the semantic pivots (rigid designators) right
 Worth putting effort on enumerable patterns (timestamps,
operators)

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (6)

Similar to A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Similar to A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia (20)

More from Andre Freitas

More from Andre Freitas (20)

A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Editor's Notes