Information Extraction
Joan Hurtado
Ignacio Delgado
Contents
• INTRODUCTION
• IE TASKS
• IE WITH CASCADED FINITE-STATE TRANSDUCERS
• LEARNING-BASED APPROACHES TO IE
• HOW GOOD IS IE
INTRODUCTION
What is IE?
 Information Extraction is the
process of scanning text for
information relevant to
some interest
 Extract:
 Entities
 Relations
 Events
 Who did what to whom
when and where
Why IE?
 Need for efficient processing of texts in
specialized domains
 Focus on relevant parts, ignore the rest
 Typical applications:
 Gleaning business, government, and military intelligence
 WWW searches (more specific than keywords)
 Scientific literature searches
 …
Most common uses
 Named Entity Recognition
 Identify names, special
entities (dates, times)
 Uses textual patterns
 Important in biomedical
applications
 IE is more than NER
 Recognition of events
and their participants
How to measure
performance
 Recall
 What percentage of the correct answers did the
system get
 Precision
 What percentage of the system’s answers were
correct
 F-score
 Weighted harmonic mean of recall and
precision (see the sketch below)
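A minimal sketch of these three measures in Python (the function name and the toy counts are my own illustrations, not from the slides):

def precision_recall_f(system_answers, answer_key, beta=1.0):
    """Precision, recall, and their weighted harmonic mean (F-score)."""
    system_answers, answer_key = set(system_answers), set(answer_key)
    correct = len(system_answers & answer_key)          # system answers that match the key
    precision = correct / len(system_answers) if system_answers else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# 8 of 10 system answers are correct, out of 12 answers in the key:
# precision = 0.80, recall = 0.67, F1 = 0.73
print(precision_recall_f(range(10), range(2, 14)))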
IE TASKS
Unstructured vs. Semi-structured
text
Unstructured
 Natural language
sentences
 Semantics depends on
linguistic analysis
 Examples:
 News stories
 Magazine articles
 Books
 …
Semi-structured
 Structured data
 Semantics defined by its
organization
 Physical layout plays a role in
interpretation
 Examples:
 Job postings
 Rental ads
 …
Single-document vs. Multi-document
 Originally, IE systems were designed for individual
documents
 Nowadays, newer systems extract facts from the WWW
 Both use similar techniques
 Distinguishing issue: redundancy
 Multi-document can exploit redundancy
 However, they must tackle the challenge of cross-document
coreference resolution
 Multi-document IE systems are also referred to as
open-domain
Assumptions about
Incoming Documents
 Only relevant documents (each incoming document is assumed to be relevant)
 Single-event documents (each document describes one event)
IE WITH CASCADED FINITE-STATE TRANSDUCERS
Complex Words
Basic Phrases
Complex Phrases
Domain Events
Template Generation
Complex Words
 Identify multiwords, company names, people
names, locations, dates, times and basic entities
 Recognition strategies (see the sketch below):
 Patterns
 Dictionaries
 Context
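A hedged sketch of this stage in Python; the patterns, the small location dictionary, and the example sentence are illustrations of the idea, not taken from the slides:

import re

COMPANY = re.compile(r"\b[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)* (?:Inc|Corp|Ltd|Co)\.?")
DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
LOCATIONS = {"New York", "London", "Tokyo"}            # dictionary lookup

def complex_words(text):
    """Return (type, string) pairs for company names, dates, and known locations."""
    found = [("COMPANY", m.group(0)) for m in COMPANY.finditer(text)]
    found += [("DATE", m.group(0)) for m in DATE.finditer(text)]
    found += [("LOCATION", loc) for loc in LOCATIONS if loc in text]
    return found

print(complex_words("Acme Corp. opened an office in Tokyo on 12/03/2021."))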
Basic Phrases
 Some syntactic constructs can be
identified with reasonable
reliability:
 Noun group
 Verb group
 Strategies:
 Simple finite-state grammars (see the sketch below)
 Ambiguities
 Noun-verb ambiguity
 Verbs locally ambiguous
 Problems
 Not all languages have a sharp
distinction between noun and
verb groups
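As a sketch of this stage, the fragment below uses NLTK's regular-expression chunker as a simple finite-state grammar over pre-tagged words; the library choice, the Penn Treebank tags, and the example sentence are my own assumptions:

import nltk

# Minimal finite-state chunk grammar for noun and verb groups
grammar = r"""
  NG: {<DT>?<JJ>*<NN.*>+}      # noun group: optional determiner, adjectives, nouns
  VG: {<MD>?<VB.*>+}           # verb group: optional modal plus verbs
"""
chunker = nltk.RegexpParser(grammar)

# The sentence is supplied already POS-tagged, so no tagger model is needed here
sentence = [("The", "DT"), ("joint", "JJ"), ("venture", "NN"),
            ("will", "MD"), ("start", "VB"), ("production", "NN")]
print(chunker.parse(sentence))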
Complex Phrases
 Recognize complex noun and verb groups
 Complex noun groups
 Appositives
 Measure phrases
 Prepositional attachments (of, for; see the sketch below)
 Noun group conjunction
 Complex verb groups
 Verb conjunction
 Verb groups with same significance
 Domain-relevant entities can be recognized
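A small sketch of one such rule in Python, applied to the chunks from the previous stage; the chunk representation and the example phrases are assumptions for illustration:

def attach_of_phrases(chunks):
    """Merge NG + "of" + NG into one complex noun group (prepositional attachment)."""
    merged, i = [], 0
    while i < len(chunks):
        label, text = chunks[i]
        if (label == "NG" and i + 2 < len(chunks)
                and chunks[i + 1] == ("WORD", "of") and chunks[i + 2][0] == "NG"):
            merged.append(("NG", f"{text} of {chunks[i + 2][1]}"))
            i += 3                                   # consume NG, "of", NG as one group
        else:
            merged.append(chunks[i])
            i += 1
    return merged

print(attach_of_phrases([("NG", "the production"), ("WORD", "of"),
                         ("NG", "20,000 cars"), ("WORD", "per"), ("NG", "year")]))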
Domain Events
 Ignore anything not identified in previous phases
 Domain events require domain-specific patterns
for identification
 Strategy:
 Finite-state machines (see the sketch below)
 A certain kind of “pseudo-syntax” can be done
 Nowadays, IE systems are beginning to rely on
full-sentence parsing
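A hedged sketch of this stage: a domain-specific pattern matched against the phrase sequence produced by the earlier phases. The "set-up" event, its slots, and the example are illustrative (in the spirit of the classic joint-venture example), not a prescribed implementation:

def find_setup_events(chunks):
    """chunks: (label, text) pairs produced by the earlier phases."""
    events = []
    for i in range(len(chunks) - 2):
        (l1, t1), (l2, t2), (l3, t3) = chunks[i], chunks[i + 1], chunks[i + 2]
        # COMPANY  VG containing "set up"  NG  ->  a SET-UP event with agent and object slots
        if l1 == "COMPANY" and l2 == "VG" and "set up" in t2 and l3 == "NG":
            events.append({"event": "SET-UP", "agent": t1, "object": t3})
    return events

print(find_setup_events([("COMPANY", "Bridgestone Sports Co."), ("VG", "set up"),
                         ("NG", "a joint venture in Taiwan")]))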
Template Generation:
Merging Structures
 Previous stages operate within the bounds of single
sentences
 Operate over the whole text to combine previously
collected information into a unified whole (see the sketch below)
 If recognizing multiple events:
 Determine how many distinct events
 Assign each entity to appropriate event
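A minimal sketch of the merging step in Python; the slot names and the two sentence-level structures are illustrative assumptions:

def compatible(a, b):
    """Two partial events may describe the same event if no filled slot conflicts."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def merge_events(partial_events):
    merged = []
    for event in partial_events:
        for existing in merged:
            if compatible(event, existing):
                existing.update(event)       # fold new slots into the existing template
                break
        else:
            merged.append(dict(event))       # no compatible template: a distinct event
    return merged

sentence_level = [
    {"event": "SET-UP", "agent": "Bridgestone Sports Co."},
    {"event": "SET-UP", "object": "a joint venture", "location": "Taiwan"},
]
print(merge_events(sentence_level))          # -> one unified SET-UP template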
LEARNING-BASED APPROACHES TO IE
Supervised Learning of
Extraction patterns & rules
 Reduce knowledge engineering bottleneck
required to create an IE system for a new domain
 Examples:
 AutoSlog → creates lexico-syntactic patterns
 PALKA → patterns generalized based on word
semantics
 LIEP → identifies syntactic paths related to roles
 CRYSTAL → “concept nodes” with lexical, syntactic
and semantic constraints
 WHISK → learns regular expressions
 Many others: SRV, RAPIER, …
Supervised Learning of
sequential classifier models
 View IE as a classification problem that can be
tackled using sequential learning models
 Read sequentially and label each word as an
extraction or a non-extraction
 Typical labeling scheme: IOB (see the sketch below)
 Inside
 Outside
 Beginning of desired extraction
 Strategies:
 Hidden Markov Models
 Maximum Entropy Classifiers
 Support Vector Machines
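A small sketch of the IOB view in Python: gold extractions given as token spans are turned into per-token B/I/O labels, which a sequential classifier (HMM, MaxEnt, SVM) can then learn to predict. The span format and the labels are illustrative:

def spans_to_iob(tokens, spans):
    """spans: {(start, end): label} in token offsets, end exclusive (assumed format)."""
    tags = ["O"] * len(tokens)                 # Outside by default
    for (start, end), label in spans.items():
        tags[start] = f"B-{label}"             # Beginning of a desired extraction
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"             # Inside the extraction
    return list(zip(tokens, tags))

tokens = ["Bridgestone", "Sports", "Co.", "set", "up", "a", "joint", "venture"]
print(spans_to_iob(tokens, {(0, 3): "COMPANY", (5, 8): "VENTURE"}))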
Weakly supervised and
unsupervised approaches
 Annotating training text still requires significant time and
effort
 Further techniques learn extraction patterns using weakly
supervised and unsupervised systems
 Examples
 AutoSlog-TS (preclassified corpus in which texts are identified
as relevant or irrelevant)
 Ex-Disco (manually defined seed; patterns ranked; best
patterns selected and added to the seed)
 Meta-bootstrapping (seed nouns that belong to a
semantic class; see the sketch below)
 On-Demand Information Extraction (dynamically learns
from queries)
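A hedged sketch of the bootstrapping idea shared by Ex-Disco and meta-bootstrapping, with toy documents, toy patterns, and a single seed (all of them my own illustrations): patterns are scored by how many of their extractions are already trusted, and the best pattern's extractions grow the seed set.

import re

docs = [
    "Offices were opened in London and Madrid.",
    "Offices were opened in Lyon.",
    "The company hired analysts in London.",
]
patterns = [r"opened in (\w+)", r"analysts in (\w+)", r"hired (\w+)"]
seeds = {"London"}                                       # manually defined seed

for _ in range(2):                                       # a couple of bootstrapping rounds
    scored = []
    for p in patterns:
        hits = {m for d in docs for m in re.findall(p, d)}
        scored.append((len(hits & seeds), p, hits))      # rank patterns by overlap with seeds
    score, best, hits = max(scored)
    if score == 0:
        break
    seeds |= hits                                        # best pattern's extractions become seeds

print(seeds)                                             # grows beyond the single seed name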
Discourse-oriented
approaches to IE
 Most IE systems’ patterns focus only on the local
context surrounding an extraction
 Extend systems to have a more global view
 Strategy:
 Add constraints to connect entities in different
clauses
 Decision trees (WRAP-UP)
 Set of classifiers to identify new templates (ALICE)
HOW GOOD IS IE
How are IE systems
progressing?
 The 60% barrier in performance
 Biggest mistakes in entity and event coreference
 Implicit knowledge in natural language is not stated explicitly in the text
 Mismatches between the training data and the test data
 Good IE systems typically recognize 90% of entities
 An event requires about 4 entities
 0.9*0.9*0.9*0.9 = 65.61%
THANKS

Editor's Notes

• #10 Semi-structured -> classified ads