Formal grammars are widely used to represent patterns in Information Extraction, but they do not permit the use of several types of features. Finite-state transducers, which are based on regular grammars, solve this issue, but they have other disadvantages, such as limited expressiveness and a rigid matching priority. As an alternative, we propose Information Extraction Grammars. This model, grounded in formal language theory, permits the use of several features, solves some of the problems of finite-state transducers, and has the same computational complexity in recognition as formal grammars, whether they describe regular or context-free languages.
Information Extraction Grammars
[Figure: Regular Languages and Context-Free Languages]
ECIR 2015 Vienna, March 30th
Mónica Marrero
National Supercomputing Center, Spain
Julián Urbano
Universitat Pompeu Fabra, Spain
Problem: Grammar-based Named Entity (NE) Recognition Patterns
Features
Part of speech
Case
Gazetteers
Stem
[etc.]
(Semi-)automatic Learning Method
Three candidate pattern models: Regular Expressions, Cascade Grammars, Context-Free Grammars
Questions for comparing them:
More than one feature?
Natural/markup language expressiveness?
Avoid extra ambiguity?
Human-readable and based on standards
NE: Person NE: Time NE: Location
Information Extraction systems should be capable of adapting to different entities and domains.
How can we decide what is the best model for a Named Entity Recognition system?
Proposal: Information Extraction Grammars for Named Entity Recognition
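Formally, $IEG = (\mathcal{V}, S, \Sigma, \mathcal{P}, \mathcal{C})$
$\mathcal{V}$: set of non-terminals
$S \in \mathcal{V}$: initial symbol
$\Sigma$: input alphabet
$\mathcal{P}$: set of production rules
$\mathcal{C}$: set of condition sets assigned to non-terminals, expressed as function-value pairs $\langle f, y \rangle$
$G = (\mathcal{V}, S, \Sigma, \mathcal{P})$: the underlying context-free grammar
All derivations must meet:
$A \overset{*}{\underset{IEG}{\Rightarrow}} \omega \;:=\; A \overset{*}{\underset{G}{\Rightarrow}} \omega$ and $\forall \langle f, y \rangle \in \mathcal{C}(A) : f(\omega) = y$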
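The condition part of the definition can be sketched in code. This is a minimal illustration, not the authors' implementation: the class name `IEG`, its field names, and the method `conditions_hold` are all hypothetical. It checks only that every function-value pair attached to a non-terminal holds for a derived string; deriving the string in the underlying grammar G is assumed to be done by a separate parser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# A feature function f maps a derived string omega to a feature value y.
Feature = Callable[[str], object]


@dataclass
class IEG:
    """Hypothetical container for the 5-tuple (V, S, Sigma, P, C)."""
    nonterminals: Set[str]                          # V
    start: str                                      # S
    alphabet: Set[str]                              # Sigma
    productions: Dict[str, List[Tuple[str, ...]]]   # P
    conditions: Dict[str, List[Tuple[Feature, object]]] = field(
        default_factory=dict
    )                                               # C

    def conditions_hold(self, nonterminal: str, omega: str) -> bool:
        """True when f(omega) == y for every pair (f, y) in C(nonterminal).

        A non-terminal without conditions accepts any derived string.
        """
        pairs = self.conditions.get(nonterminal, [])
        return all(f(omega) == y for f, y in pairs)


# Toy grammar: A derives "a"-strings, and must be upper case.
toy = IEG(
    nonterminals={"A"},
    start="A",
    alphabet={"a", "A"},
    productions={"A": [("a",)]},
    conditions={"A": [(str.isupper, True)]},
)
```

A parser for G would call `conditions_hold` each time it completes a non-terminal and discard the derivation when the check fails.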
IEG for the recognition of full person names
using First/Last name gazetteers
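$S \to F\,L\,L \qquad S \to F\,L \qquad S \to F$
$F \to T \qquad L \to T \qquad T \to$ [a-zA-Z0-9]+
$\mathcal{C}(F) = \{\langle FirstGaz, true \rangle, \langle Case, upper \rangle, \langle POS, NP \rangle\}$
$\mathcal{C}(L) = \{\langle LastGaz, true \rangle, \langle Case, upper \rangle, \langle POS, NP \rangle\}$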
Lisa Brown Smith will present at 4 pm in Foyer room
Similar to synthesized attributes in S-attributed grammars, but in this case the values of the attributes are given upfront and they are used to constrain the parsing.
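The person-name grammar can be tried on the example sentence with a small sketch. Everything here is an assumption made for illustration: the gazetteer contents, the toy part-of-speech lookup (a real system would use a tagger), whitespace tokenization, checking last names against a separate last-name gazetteer, and trying the longest alternative of S first. It is not the recognition method used by the authors.

```python
import re

# Hypothetical feature resources; a real system would load these.
FIRST_GAZ = {"Lisa", "John"}
LAST_GAZ = {"Brown", "Smith"}
POS_TAGS = {"Lisa": "NP", "Brown": "NP", "Smith": "NP", "Foyer": "NP"}

TOKEN = re.compile(r"[a-zA-Z0-9]+")  # T -> [a-zA-Z0-9]+


def case(tok):
    """Case feature: 'upper' when the token starts with a capital."""
    return "upper" if tok[0].isupper() else "lower"


def pos(tok):
    """Toy part-of-speech feature backed by a lookup table."""
    return POS_TAGS.get(tok, "OTHER")


# Condition sets C(F) and C(L) as (function, expected value) pairs.
CONDITIONS = {
    "F": [(lambda t: t in FIRST_GAZ, True), (case, "upper"), (pos, "NP")],
    "L": [(lambda t: t in LAST_GAZ, True), (case, "upper"), (pos, "NP")],
}

# Alternatives for S, longest first (an assumed matching priority).
RULES = [("F", "L", "L"), ("F", "L"), ("F",)]


def matches(nonterminal, tok):
    """Token derives from T and satisfies every condition of the non-terminal."""
    if TOKEN.fullmatch(tok) is None:
        return False
    return all(f(tok) == y for f, y in CONDITIONS[nonterminal])


def find_person_names(text):
    """Scan left to right and collect token spans derivable from S."""
    tokens = text.split()
    names, i = [], 0
    while i < len(tokens):
        for rule in RULES:
            span = tokens[i:i + len(rule)]
            if len(span) == len(rule) and all(
                matches(nt, tok) for nt, tok in zip(rule, span)
            ):
                names.append(" ".join(span))
                i += len(rule)
                break
        else:
            i += 1  # no alternative of S matched at this position
    return names


print(find_person_names("Lisa Brown Smith will present at 4 pm in Foyer room"))
# prints ['Lisa Brown Smith']
```

"Foyer" is capitalized and tagged NP in the toy lookup, but it fails the gazetteer condition, so it is not reported as a person name.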
Computational Complexity
Regular languages:
Regular Expression: $O(ns^2)$
Cascade Grammar: $O(mns^2)$
IEG: $O(n(tm+s^2))$
Context-free languages:
Context-Free Grammar: $O(n^3)$
IEG: $O(n^3)$
Sizes: $n$ = input, $m$ = features, $s$ = states in the automata, $t$ = non-terminals with conditions associated
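As a rough reading of the two feature-aware bounds for regular languages, the following short derivation shows when the IEG expression is no larger than the cascade grammar expression. It is derived only from the bounds stated above, assumes $n > 0$, and is not a result reported by the authors; since both are asymptotic bounds, the comparison is indicative only.

```latex
\begin{aligned}
n(tm + s^2) \le mns^2
  &\iff tm + s^2 \le ms^2 \\
  &\iff tm \le (m-1)\,s^2 \\
  &\iff t \le \frac{m-1}{m}\,s^2 .
\end{aligned}
```

With at least two features ($m \ge 2$), $\frac{m-1}{m} \ge \frac{1}{2}$, so $t \le s^2/2$ is sufficient. With a single feature ($m = 1$), the condition holds only when $t = 0$, in which case the two expressions coincide.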
Summary and Future Work
• Information Extraction Grammars
- Based on standards
- Expressiveness of context-free grammars
- Support for custom features
- Competitive complexity using standard
recognition methods
• Contributes to the flexibility of Information
Extraction tools that can work independently of
the kind of features and the expressiveness of the
language to recognize
• Future work: optimization of the recognition
methods and use of probabilities in the conditions