Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where
data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts,
however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous.
These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a seman-
tic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a
pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency,
context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information ex-
traction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.
7. Motivation
Digital Enterprise Research Institute www.deri.ie
Natural language texts Linked Data
No terminological or structural
regularity
Terminological and
structural regularity
Highly contextualized
Shared semantic agreement
Complex semantic dependency between data consumers
relations
Ambiguity
Information selection/normalization
- vocabulary constraints
+ entity-centric
+ pay-as-you-go data semantics
= semantic best-effort
8. Motivation
Digital Enterprise Research Institute www.deri.ie
Vocabulary-independent (schema-free queries)
How to abstract users from knowing the data
representation?
Semantic matching
Schemaless databases in the limit demands
vocabulary-independency
How information extraction is reshaped in this
scenario?
9. Motivational Scenario
Digital Enterprise Research Institute www.deri.ie
What is the relationship between Barack Obama and Indonesia?
Semantic Best-
effort Extraction
Sentence: From age six
to ten, Obama attended
local schools in Jakarta, Entity-centric text
including Besuki Public representation
School and St Francis
Assisi School.
11. Computational Linguistics Perspective
Digital Enterprise Research Institute www.deri.ie
What is already there to represent NL?
Discourse Representation Theory (DRT)
Semantic Role Labeling (SRL)
12. Discourse Representation Theory (DRT)
Digital Enterprise Research Institute www.deri.ie
“The key idea behind (...) Discourse Representation Theory is
that each new sentence of a discourse is interpreted in the
context provided by the sentences preceding it.”
van Eijck and Kamp
Models propositions in discourse (multiple sentences).
Discourse representation structures (DRS).
John enters a card. Every card is green.
13. Semantic Role Labeling (SRL)
Digital Enterprise Research Institute www.deri.ie
Shallow semantic parsing.
Detection of arguments associated with a predicate.
Associated semantic types to arguments.
Bill cut his hair with a razor
[Agent Bill] cut [Patient his hair] [Instrument with a razor.]
14. Semantic Best-Effort
Digital Enterprise Research Institute www.deri.ie
Objectives:
Entity-centric & Standardized: easier to integrate with
other resources
Remove the formal constraints and the ‘baggage’ from
existing approaches
Representation robust to extraction limitations/errors
15. Semantic Best-Effort Requirements
Digital Enterprise Research Institute www.deri.ie
Text segmentation into (s,p,o)s
Context representation
Conceptual model independency
Resolve co-references (pay-as-you-go)
Represent recurrent discourse structures
Standardized representation (RDF(S))
Principled interpretation (compositionality)
16. Examples
Digital Enterprise Research Institute www.deri.ie
- Text segmentation into (s,p,o)s
- Context representation
- Resolve co-references (pay-as-you-go)
- Conceptual model independency
22. SDG Elements
Digital Enterprise Research Institute www.deri.ie
Named, non-named entities and properties
Quantifiers & operators
Triple Trees
Context elements
Co-Referential elements
Resolved & normalized entities
27. Extraction Pipeline Architecture
Digital Enterprise Research Institute www.deri.ie
Subject
Predicate
Object
Prepositional phrase & Noun complement
Reification
Time
28. Preliminary Evaluation
Digital Enterprise Research Institute www.deri.ie
1033 relations (triples) from 150 sentences from 5
randomly selected Wikipedia articles
Manually classified the graphs: error categories
and accuracy.
42. Conclusion
Digital Enterprise Research Institute www.deri.ie
Main direction for improvement is completeness
Aligned with the pay-as-you-go scenario
Still need to define clear criteria for what you can’t extract
There is still a long way to go (e.g. complex subordination)
Investigation using existing n-ary relations patterns
Context (reification) should be a first-class citizen in the
representation of natural language
Focus on getting the semantic pivots (rigid designators) right
Worth putting effort on enumerable patterns (timestamps,
operators)
Editor's Notes
Change to horizontal
Change to horizontal
Change to horizontal
Change to horizontal
Emphasize entity
Logic and linguistics have had lively connections from Antiquity right until today