Information Extraction
Joan Hurtado
Ignacio Delgado

Presentation given during the Computational Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic:
Information Extraction
Jerry R. Hobbs, University of Southern California; Ellen Riloff, University of Utah

Published in: Education, Technology

  • Semi-structured -> classified ads
  • Information Extraction

    1. Information Extraction, by Joan Hurtado and Ignacio Delgado
    2. Contents
        • INTRODUCTION
        • IE TASKS
        • IE WITH CASCADED FINITE-STATE TRANSDUCERS
        • LEARNING-BASED APPROACHES TO IE
        • HOW GOOD IS IE
    3. INTRODUCTION
    4. What is IE?
        • Information Extraction (IE) is the process of scanning text for information relevant to some interest
        • Extract:
            • Entities
            • Relations
            • Events
        • Who did what to whom, when and where
    5. Why IE?
        • Need for efficient processing of texts in specialized domains
        • Focus on relevant parts, ignore the rest
        • Typical applications:
            • Gleaning business, government, or military intelligence
            • WWW searches (more specific than keywords)
            • Scientific literature searches
            • …
    6. Most common uses
        • Named Entity Recognition (NER)
            • Identify names and special entities (dates, times)
            • Uses textual patterns
            • Important in biomedical applications
        • IE is more than NER
            • Recognition of events and their participants
    7. How to measure performance
        • Recall
            • What percentage of the correct answers did the system get?
        • Precision
            • What percentage of the system's answers were correct?
        • F-score
            • Weighted harmonic mean of recall and precision
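The three measures on slide 7 can be sketched in a few lines of Python. This is a minimal sketch, assuming answers are compared by exact match as (type, text) pairs; the function name `precision_recall_f`, the `beta` weight parameter, and the sample entities are illustrative assumptions, not taken from the slides.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Return (precision, recall, F-score) for two collections of answers.

    gold      -- the correct answers (hypothetical annotation key)
    predicted -- the answers produced by the system
    beta      -- weight of recall relative to precision (1.0 gives F1)
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # exact-match comparison (an assumption)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Made-up example: 4 correct answers, 3 system answers, 2 of them right.
gold = {("ORG", "Acme Corp"), ("PER", "Jane Doe"), ("DATE", "May 3"), ("LOC", "Utah")}
predicted = {("ORG", "Acme Corp"), ("PER", "Jane Doe"), ("LOC", "Boston")}
p, r, f1 = precision_recall_f(gold, predicted)
# precision = 2/3, recall = 2/4, F1 = 4/7
```

With `beta=1.0` the F-score is the plain harmonic mean; values above 1 weight recall more heavily, values below 1 weight precision more heavily.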
    8. IE TASKS
    9. Unstructured vs. Semi-structured text
        • Unstructured
            • Natural language sentences
            • Semantics depends on linguistic analysis
            • Examples:
                • News stories
                • Magazine articles
                • Books
                • …
        • Semi-structured
            • Structured data
            • Semantics defined by its organization
            • Physical layout plays a role in interpretation
            • Examples:
                • Job postings
                • Rental ads
                • …
    10. Single-document vs. Multi-document
        • IE systems were originally designed for individual documents
        • Newer systems extract facts from the WWW
        • Both use similar techniques
        • Distinguishing issue: redundancy
            • Multi-document systems can exploit redundancy
            • However, they face the challenge of cross-document coreference resolution
        • Multi-document IE systems are also referred to as open-domain
    11. Assumptions about Incoming Documents
        • Relevant-only documents
        • Single-event documents
    12. IE WITH CASCADED FINITE-STATE TRANSDUCERS
    13. Complex Words, Basic Phrases, Complex Phrases, Domain Events, Template Generation
    14. Complex Words
        • Identify multiwords, company names, people names, locations, dates, times and basic entities
        • Recognition strategies:
            • Patterns
            • Dictionaries
            • Context
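Two of the recognition strategies on slide 14, patterns and dictionaries, can be illustrated with regular expressions and a lookup set. This is only a sketch: the patterns, the `LOCATIONS` dictionary, the `recognize` function and the sample sentence are all hypothetical, and the context strategy is not covered.

```python
import re

# Hypothetical pattern: month name, day, comma, four-digit year.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
# Hypothetical pattern: capitalized words followed by a company suffix.
COMPANY_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\.?")
# Hypothetical dictionary of known locations.
LOCATIONS = {"Barcelona", "Salt Lake City"}


def recognize(text):
    """Return (label, string) pairs found by the patterns and the dictionary."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        found.append(("DATE", match.group()))
    for match in COMPANY_PATTERN.finditer(text):
        found.append(("COMPANY", match.group()))
    for location in sorted(LOCATIONS):
        # Dictionary lookup: match the entry as a whole word or phrase.
        if re.search(r"\b" + re.escape(location) + r"\b", text):
            found.append(("LOCATION", location))
    return found


sentence = "Bridgestone Sports Corp. opened an office in Barcelona on March 5, 1990."
entities = recognize(sentence)
```

Real systems would combine far larger dictionaries with contextual cues (for example, a title such as "Mr." before a person name); the sketch only shows the mechanics.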
    15. Basic Phrases
        • Some syntactic constructs can be identified with reasonable reliability:
            • Noun group
            • Verb group
        • Strategies:
            • Simple finite-state grammars
        • Ambiguities
            • Noun-verb ambiguity
            • Verbs locally ambiguous
        • Problems
            • Not all languages distinguish clearly between noun and verb groups
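A simple finite-state grammar for noun and verb groups, as mentioned on slide 15, can be approximated with a regular expression over part-of-speech tags. The sketch below assumes the input is already tagged; the tag names, the one-letter encoding, the two grammar rules and the sample sentence are illustrative assumptions rather than the grammar of any particular system.

```python
import re

# Map each (hypothetical) POS tag to one character so that a regular
# expression over the tag string acts as a finite-state grammar.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "AUX": "X", "VERB": "V"}

# Noun group: optional determiner, any adjectives, one or more nouns.
# Verb group: any auxiliaries followed by a main verb.
GROUP_GRAMMAR = re.compile(r"(?P<NG>D?A*N+)|(?P<VG>X*V)")


def chunk(tagged_tokens):
    """Return (group_label, phrase) pairs for a list of (word, tag) tuples."""
    # Tags not in TAG_CODES become "O" and never take part in a group.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    groups = []
    for match in GROUP_GRAMMAR.finditer(codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        groups.append((match.lastgroup, " ".join(words)))
    return groups


tagged = [
    ("the", "DET"), ("local", "ADJ"), ("concern", "NOUN"),
    ("has", "AUX"), ("set", "VERB"), ("up", "PRT"),
    ("a", "DET"), ("joint", "ADJ"), ("venture", "NOUN"),
]
groups = chunk(tagged)
```

The noun-verb ambiguity listed on the slide is exactly what such a grammar cannot resolve by itself: it relies on the tags being right before it runs.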
    16. Complex Phrases
        • Recognize complex noun and verb groups
        • Complex noun groups
            • Appositives
            • Measure phrases
            • Prepositional attachments (of, for)
            • Noun group conjunction
        • Complex verb groups
            • Verb conjunction
            • Verb groups with the same meaning
        • Domain-relevant entities can be recognized
    17. Domain Events
        • Ignore anything not identified in previous phases
        • Domain events require domain-specific patterns for identification
        • Strategy:
            • Finite-state machines
            • A certain kind of “pseudo-syntax” can be handled
        • IE systems are now beginning to rely on full-sentence parsing
    18. Template Generation: Merging Structures
        • Previous stages operate within the bounds of single sentences
        • This stage operates over the whole text to combine previously collected information into a unified whole
        • If recognizing multiple events:
            • Determine how many distinct events there are
            • Assign each entity to the appropriate event
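The merging step on slide 18 can be illustrated with a deliberately naive rule: two partial templates describe the same event if they do not disagree on any slot they both fill. This is an assumed simplification for illustration, not the merging criterion of any specific system; the slot names and values are made up.

```python
def compatible(template_a, template_b):
    """Two templates are compatible if every shared slot has the same value."""
    shared_slots = template_a.keys() & template_b.keys()
    return all(template_a[slot] == template_b[slot] for slot in shared_slots)


def merge_templates(partial_templates):
    """Greedily merge sentence-level templates into document-level events."""
    events = []
    for partial in partial_templates:
        for event in events:
            if compatible(event, partial):
                event.update(partial)  # same event: pool the slot fillers
                break
        else:
            events.append(dict(partial))  # no match: start a new event
    return events


# Hypothetical sentence-level output for a three-sentence text.
partials = [
    {"type": "joint venture", "company": "Acme Corp."},
    {"type": "joint venture", "product": "golf clubs"},
    {"type": "joint venture", "company": "Globex Ltd."},
]
events = merge_templates(partials)
# Two distinct events: the first two partials merge, the third conflicts on "company".
```

The sketch answers both questions on the slide at once: the length of `events` is the number of distinct events, and each entity ends up in the event whose slots it does not contradict.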
    19. LEARNING-BASED APPROACHES TO IE
    20. Supervised Learning of Extraction patterns & rules
        • Reduce the knowledge engineering bottleneck involved in creating an IE system for a new domain
        • Examples:
            • AutoSlog: creates lexico-syntactic patterns
            • PALKA: generalizes patterns based on word semantics
            • LIEP: identifies syntactic paths related to roles
            • CRYSTAL: “concept nodes” with lexical, syntactic and semantic constraints
            • WHISK: learns regular expressions
            • Many others: SRV, RAPIER, …
    21. Supervised Learning of sequential classifier models
        • View IE as a classification problem that can be tackled using sequential learning models
        • Read the text sequentially and label each word as an extraction or a non-extraction
        • Typical labeling scheme: IOB
            • Inside
            • Outside
            • Beginning of desired extraction
        • Strategies:
            • Hidden Markov Models
            • Maximum Entropy Classifiers
            • Support Vector Machines
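The IOB labeling scheme on slide 21 is easiest to see on a concrete sentence. The sketch below converts token spans into B/I/O tags and back; the function names, the span format (token indices with an exclusive end) and the sample sentence are assumptions made for illustration. A sequential classifier would predict the tags, whereas here they come from hand-written spans.

```python
def to_iob(tokens, spans):
    """Tag tokens as B (begin), I (inside) or O (outside) an extraction.

    spans -- list of (start, end) token indices, end exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def from_iob(tokens, tags):
    """Recover the extracted phrases from a sequence of IOB tags."""
    extractions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                extractions.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open extraction
            if current:
                extractions.append(" ".join(current))
            current = []
    if current:
        extractions.append(" ".join(current))
    return extractions


tokens = "The bomb exploded near the US embassy in Lima".split()
tags = to_iob(tokens, [(5, 7), (8, 9)])  # "US embassy" and "Lima"
phrases = from_iob(tokens, tags)
```

The separate B tag is what lets two adjacent extractions be told apart, which a plain inside/outside scheme cannot do.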
    22. Weakly supervised and unsupervised approaches
        • Annotating training text still requires time and effort
        • Further techniques learn extraction patterns using weakly supervised and unsupervised systems
        • Examples
            • AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
            • Ex-Disco (manually defined seed; patterns are ranked and the best ones are added to the seed)
            • Meta-bootstrapping (seed nouns that belong to a semantic class)
            • On-Demand Information Extraction (dynamically learns from queries)
    23. Discourse-oriented approaches to IE
        • Most IE system patterns focus only on the surrounding local context
        • Extend systems to have a more global view
        • Strategy:
            • Add constraints to connect entities in different clauses
            • Decision trees (WRAP-UP)
            • Set of classifiers to identify new templates (ALICE)
    24. HOW GOOD IS IE
    25. How are IE systems progressing?
        • The 60% barrier in performance
        • Biggest mistakes are in entity and event coreference
        • Implicit knowledge in natural language is not spelled out in texts
        • Problems in training data are not found in test data
        • Good IE systems typically recognize 90% of entities
        • An event requires about 4 entities
        • 0.9 × 0.9 × 0.9 × 0.9 = 0.6561 (65.61%)
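The arithmetic on slide 25 can be reproduced directly. The sketch assumes, as the slide's multiplication implicitly does, that the entities of an event are recognized independently of one another; the function name `event_accuracy` is illustrative.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized correctly,
    assuming each entity is recognized independently (an assumption)."""
    return entity_accuracy ** entities_per_event


# The slide's figures: 90% entity accuracy, about 4 entities per event.
slide_estimate = event_accuracy(0.9, 4)  # about 0.6561

# How the estimate falls as events need more entities.
for n in (1, 2, 4, 6):
    print(n, round(event_accuracy(0.9, n), 4))
```

Under this assumption, even small gains in entity recognition compound: raising entity accuracy from 0.90 to 0.95 lifts the four-entity estimate from about 0.66 to about 0.81.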
    26. THANKS