Information Extraction
Joan Hurtado
Ignacio Delgado
Contents
• INTRODUCTION
• IE TASKS
• IE WITH CASCADED FINITE-STATE TRANSDUCERS
• LEARNING-BASED APPROACHES TO IE
• HOW GOOD IS IE
INTRODUCTION
What is IE?
 Information Extraction is the
process of scanning text for
information relevant to
some interest
 Extract:
 Entities
 Relations
 Events
 Who did what to whom
when and where
Why IE?
 Need for efficient processing of texts in
specialized domains
 Focus on relevant parts, ignore the rest
 Typical applications:
 Gleaning business, government, and military intelligence
 WWW searches (more specific than keywords)
 Scientific literature searches
 …
Most common uses
 Named Entity Recognition
 Identify names, special
entities (dates, times)
 Uses textual patterns
 Important in biomedical
applications
 IE is more than NER
 Recognition of events
and their participants
How to measure
performance
 Recall
 What percentage of the correct answers did the
system get
 Precision
 What percentage of the system’s answers were
correct
 F-score
 Weighted harmonic mean of recall and
precision (see the sketch below)
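A minimal sketch of these three measures in Python (the function name and the toy counts are my own illustrations, not from the slides):

def precision_recall_f(system_answers, answer_key, beta=1.0):
    """Precision, recall, and their weighted harmonic mean (F-score)."""
    system_answers, answer_key = set(system_answers), set(answer_key)
    correct = len(system_answers & answer_key)          # system answers that match the key
    precision = correct / len(system_answers) if system_answers else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# 8 of 10 system answers are correct, out of 12 answers in the key:
# precision = 0.80, recall = 0.67, F1 = 0.73
print(precision_recall_f(range(10), range(2, 14)))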
IE TASKS
Unstructured vs. Semi-structured
text
Unstructured
 Natural language
sentences
 Semantics depends on
linguistic analysis
 Examples:
 News stories
 Magazine articles
 Books
 …
Semi-structured
 Structured data
 Semantics defined by its
organization
 Physical layout plays a role in
interpretation
 Examples:
 Job postings
 Rental ads
 …
Single-document vs. Multi-document
 Originally, IE systems were designed for individual
documents
 Nowadays, newer systems extract facts from the WWW
 Both use similar techniques
 Distinguishing issue: redundancy
 Multi-document can exploit redundancy
 However, they must tackle the challenge of cross-document
coreference resolution
 Multi-document IE systems are also referred to as
open-domain
Assumptions about
Incoming Documents
 Only relevant documents (each incoming document is assumed to be relevant)
 Single-event documents (each document describes one event)
IE WITH CASCADED FINITE-STATE TRANSDUCERS
Complex Words
Basic Phrases
Complex Phrases
Domain Events
Template Generation
Complex Words
 Identify multiwords, company names, people
names, locations, dates, times and basic entities
 Recognition strategies (see the sketch below):
 Patterns
 Dictionaries
 Context
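A hedged sketch of this stage in Python; the patterns, the small location dictionary, and the example sentence are illustrations of the idea, not taken from the slides:

import re

COMPANY = re.compile(r"\b[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)* (?:Inc|Corp|Ltd|Co)\.?")
DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
LOCATIONS = {"New York", "London", "Tokyo"}            # dictionary lookup

def complex_words(text):
    """Return (type, string) pairs for company names, dates, and known locations."""
    found = [("COMPANY", m.group(0)) for m in COMPANY.finditer(text)]
    found += [("DATE", m.group(0)) for m in DATE.finditer(text)]
    found += [("LOCATION", loc) for loc in LOCATIONS if loc in text]
    return found

print(complex_words("Acme Corp. opened an office in Tokyo on 12/03/2021."))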
Basic Phrases
 Some syntactic constructs can be
identified with reasonable
reliability:
 Noun group
 Verb group
 Strategies:
 Simple finite-state grammars (see the sketch below)
 Ambiguities
 Noun-verb ambiguity
 Verbs locally ambiguous
 Problems
 Not all languages have a sharp
distinction between noun and
verb groups
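As a sketch of this stage, the fragment below uses NLTK's regular-expression chunker as a simple finite-state grammar over pre-tagged words; the library choice, the Penn Treebank tags, and the example sentence are my own assumptions:

import nltk

# Minimal finite-state chunk grammar for noun and verb groups
grammar = r"""
  NG: {<DT>?<JJ>*<NN.*>+}      # noun group: optional determiner, adjectives, nouns
  VG: {<MD>?<VB.*>+}           # verb group: optional modal plus verbs
"""
chunker = nltk.RegexpParser(grammar)

# The sentence is supplied already POS-tagged, so no tagger model is needed here
sentence = [("The", "DT"), ("joint", "JJ"), ("venture", "NN"),
            ("will", "MD"), ("start", "VB"), ("production", "NN")]
print(chunker.parse(sentence))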
Complex Phrases
 Recognize complex noun and verb groups
 Complex noun groups
 Appositives
 Measure phrases
 Prepositional attachments (of, for; see the sketch below)
 Noun group conjunction
 Complex verb groups
 Verb conjunction
 Verb groups with same significance
 Domain-relevant entities can be recognized
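A small sketch of one such rule in Python, applied to the chunks from the previous stage; the chunk representation and the example phrases are assumptions for illustration:

def attach_of_phrases(chunks):
    """Merge NG + "of" + NG into one complex noun group (prepositional attachment)."""
    merged, i = [], 0
    while i < len(chunks):
        label, text = chunks[i]
        if (label == "NG" and i + 2 < len(chunks)
                and chunks[i + 1] == ("WORD", "of") and chunks[i + 2][0] == "NG"):
            merged.append(("NG", f"{text} of {chunks[i + 2][1]}"))
            i += 3                                   # consume NG, "of", NG as one group
        else:
            merged.append(chunks[i])
            i += 1
    return merged

print(attach_of_phrases([("NG", "the production"), ("WORD", "of"),
                         ("NG", "20,000 cars"), ("WORD", "per"), ("NG", "year")]))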
Domain Events
 Ignore anything not identified in previous phases
 Domain events require domain-specific patterns
for identification
 Strategy:
 Finite-state machines (see the sketch below)
 A certain kind of “pseudo-syntax” can be done
 Nowadays, IE systems are beginning to rely on
full-sentence parsing
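A hedged sketch of this stage: a domain-specific pattern matched against the phrase sequence produced by the earlier phases. The "set-up" event, its slots, and the example are illustrative (in the spirit of the classic joint-venture example), not a prescribed implementation:

def find_setup_events(chunks):
    """chunks: (label, text) pairs produced by the earlier phases."""
    events = []
    for i in range(len(chunks) - 2):
        (l1, t1), (l2, t2), (l3, t3) = chunks[i], chunks[i + 1], chunks[i + 2]
        # COMPANY  VG containing "set up"  NG  ->  a SET-UP event with agent and object slots
        if l1 == "COMPANY" and l2 == "VG" and "set up" in t2 and l3 == "NG":
            events.append({"event": "SET-UP", "agent": t1, "object": t3})
    return events

print(find_setup_events([("COMPANY", "Bridgestone Sports Co."), ("VG", "set up"),
                         ("NG", "a joint venture in Taiwan")]))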
Template Generation:
Merging Structures
 Previous stages operate within the bounds of single
sentences
 Operate over the whole text to combine previously
collected information into a unified whole (see the sketch below)
 If recognizing multiple events:
 Determine how many distinct events
 Assign each entity to appropriate event
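A minimal sketch of the merging step in Python; the slot names and the two sentence-level structures are illustrative assumptions:

def compatible(a, b):
    """Two partial events may describe the same event if no filled slot conflicts."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def merge_events(partial_events):
    merged = []
    for event in partial_events:
        for existing in merged:
            if compatible(event, existing):
                existing.update(event)       # fold new slots into the existing template
                break
        else:
            merged.append(dict(event))       # no compatible template: a distinct event
    return merged

sentence_level = [
    {"event": "SET-UP", "agent": "Bridgestone Sports Co."},
    {"event": "SET-UP", "object": "a joint venture", "location": "Taiwan"},
]
print(merge_events(sentence_level))          # -> one unified SET-UP template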
LEARNING-BASED APPROACHES TO IE
Supervised Learning of
Extraction patterns & rules
 Reduce knowledge engineering bottleneck
required to create an IE system for a new domain
 Examples:
 AutoSlog → creates lexico-syntactic patterns
 PALKA → patterns generalized based on word
semantics
 LIEP → identifies syntactic paths related to roles
 CRYSTAL → “concept nodes” with lexical, syntactic
and semantic constraints
 WHISK → learns regular expressions
 Many others: SRV, RAPIER, …
Supervised Learning of
sequential classifier models
 View IE as a classification problem that can be
tackled using sequential learning models
 Read sequentially and label each word as an
extraction or a non-extraction
 Typical labeling scheme: IOB (see the sketch below)
 Inside
 Outside
 Beginning of desired extraction
 Strategies:
 Hidden Markov Models
 Maximum Entropy Classifiers
 Support Vector Machines
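A small sketch of the IOB view in Python: gold extractions given as token spans are turned into per-token B/I/O labels, which a sequential classifier (HMM, MaxEnt, SVM) can then learn to predict. The span format and the labels are illustrative:

def spans_to_iob(tokens, spans):
    """spans: {(start, end): label} in token offsets, end exclusive (assumed format)."""
    tags = ["O"] * len(tokens)                 # Outside by default
    for (start, end), label in spans.items():
        tags[start] = f"B-{label}"             # Beginning of a desired extraction
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"             # Inside the extraction
    return list(zip(tokens, tags))

tokens = ["Bridgestone", "Sports", "Co.", "set", "up", "a", "joint", "venture"]
print(spans_to_iob(tokens, {(0, 3): "COMPANY", (5, 8): "VENTURE"}))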
Weakly supervised and
unsupervised approaches
 Annotating training text still requires significant time and
effort
 Further techniques learn extraction patterns using weakly
supervised and unsupervised systems
 Examples
 AutoSlog-TS (preclassified corpus in which texts are identified
as relevant or irrelevant)
 Ex-Disco (manually defined seed; patterns ranked; best
patterns selected and added to the seed)
 Meta-bootstrapping (seed nouns that belong to a
semantic class; see the sketch below)
 On-Demand Information Extraction (dynamically learns
from queries)
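A hedged sketch of the bootstrapping idea shared by Ex-Disco and meta-bootstrapping, with toy documents, toy patterns, and a single seed (all of them my own illustrations): patterns are scored by how many of their extractions are already trusted, and the best pattern's extractions grow the seed set.

import re

docs = [
    "Offices were opened in London and Madrid.",
    "Offices were opened in Lyon.",
    "The company hired analysts in London.",
]
patterns = [r"opened in (\w+)", r"analysts in (\w+)", r"hired (\w+)"]
seeds = {"London"}                                       # manually defined seed

for _ in range(2):                                       # a couple of bootstrapping rounds
    scored = []
    for p in patterns:
        hits = {m for d in docs for m in re.findall(p, d)}
        scored.append((len(hits & seeds), p, hits))      # rank patterns by overlap with seeds
    score, best, hits = max(scored)
    if score == 0:
        break
    seeds |= hits                                        # best pattern's extractions become seeds

print(seeds)                                             # grows beyond the single seed name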
Discourse-oriented
approaches to IE
 Most IE systems’ patterns focus only on the local
context surrounding an extraction
 Extend systems to have a more global view
 Strategy:
 Add constraints to connect entities in different
clauses
 Decision trees (WRAP-UP)
 Set of classifiers to identify new templates (ALICE)
HOW GOOD IS IE
How are IE systems
progressing?
 The 60% barrier in performance
 Biggest mistakes in entity and event coreference
 Implicit knowledge in natural language is not stated explicitly in the text
 Mismatches between the training data and the test data
 Good IE systems typically recognize 90% of entities
 An event requires about 4 entities
 0.9*0.9*0.9*0.9 = 65.61%
THANKS

Editor's Notes

• #10 Semi-structured -> classified ads