Information Extraction

Presentation given during the Computer Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic:
Information Extraction
Jerry R. Hobbs, University of Southern California; Ellen Riloff, University of Utah

  • Semi-structured -> classified ads

Transcript

  • 1. Information Extraction
    Joan Hurtado
    Ignacio Delgado
  • 2. Contents
    • INTRODUCTION
    • IE TASKS
    • IE WITH CASCADED FINITE-STATE TRANSDUCERS
    • LEARNING-BASED APPROACHES TO IE
    • HOW GOOD IS IE
  • 0
    INTRODUCTION
  • 7. What is IE?
    Information Extraction is the process of scanning text for information relevant to some interest
    Extract:
    Entities
    Relations
    Events
    Who did what to whom when and where
  • 8. Why IE?
    Need for efficient processing of texts in specialized domains
    Focus on relevant parts, ignore the rest
    Typical applications:
    Gleaning business, government, or military intelligence
    WWW searches (more specific than keywords)
    Scientific literature searches

  • 9. Most common uses
    Named Entity Recognition
    Identify names, special entities (dates, times)
    Uses textual patterns
    Important in biomedical applications
    IE is more than NER
    Recognition of events and their participants
  • 10. How to measure performance
    Recall
    What percentage of the correct answers did the system get
    Precision
    What percentage of the system’s answers were correct
    F-score
    Weighted harmonic mean between recall and precision
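
The three measures above can be sketched in a few lines. This is a minimal illustration, assuming the gold answers and the system answers are comparable as sets; the function name `precision_recall_f` and the `beta` weight parameter are choices made for this example, not part of the slides.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Return (precision, recall, F-score) for two collections of answers.

    beta weights recall relative to precision; beta=1.0 gives the
    balanced harmonic mean (F1).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # Precision: share of the system's answers that were correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of the correct answers that the system found.
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Toy data: 4 correct answers exist, the system returns 3, of which 2 are right.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "x"})
print(round(p, 3), round(r, 3), round(f, 3))
```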
  • 11. 1
    IE TASKS
  • 12. Unstructured vs. Semi-structured text
    Unstructured
    Natural language sentences
    Semantics depends on linguistic analysis
    Examples:
    News stories
    Magazine articles
    Books

    Semi-structured
    Structured data
    Semantics defined by its organization
    Physical layout plays role in interpretation
    Examples:
    Job postings
    Rental ads
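
To illustrate how layout carries meaning in semi-structured text, here is a small sketch that reads a rental ad laid out as `Field: value` lines. The ad text, the field names, and the `extract_fields` helper are all invented for this example; real ads are far less regular.

```python
import re

# Hypothetical rental ad; every line follows a "Field: value" layout.
AD = """\
Title: Sunny 2-bedroom flat
Location: Barcelona
Rent: 950 EUR/month
Contact: 555-0100
"""

# One field per line: a name made of letters and spaces, a colon, then the value.
FIELD = re.compile(r"^(?P<name>[A-Za-z ]+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text):
    """Map each lower-cased field name to its value."""
    return {
        m.group("name").strip().lower(): m.group("value").strip()
        for m in FIELD.finditer(text)
    }


print(extract_fields(AD))
```

No linguistic analysis is needed here: the position of the colon and the line breaks are enough to interpret the text.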

  • 13. Single-document vs. Multi-document
    Originally, IE systems were designed for individual documents
    Nowadays, new systems extract facts from the WWW
    Both use similar techniques
    Distinguishing issue: redundancy
    Multi-document systems can exploit redundancy
    However, they must tackle cross-document coreference resolution
    Multi-document IE systems are also referred to as open-domain
  • 14. Assumptions about Incoming Documents
    Relevant-only documents
    Single-event documents
  • 15. 2
    IE WITH CASCADED FINITE-STATE TRANSDUCERS
  • 16.
  • 17. Complex Words
    Identify multiwords, company names, people names, locations, dates, times and basic entities
    Recognition strategies:
    Patterns
    Dictionaries
    Context
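
The three recognition strategies can be combined in a toy recognizer. The sketch below is only an illustration under simplifying assumptions: the regular expressions, the `LOCATIONS` dictionary, and the `recognize` function are invented for this example and would miss most real-world names.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Pattern strategy: dates such as "3 March 2010".
DATE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")
# Pattern strategy: capitalized words followed by a company suffix.
COMPANY = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\.?")
# Context strategy: a title such as "Dr." signals that a person name follows.
PERSON = re.compile(r"\b(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
# Dictionary strategy: a tiny, hypothetical gazetteer of locations.
LOCATIONS = {"Barcelona", "Utah", "California"}


def recognize(text):
    """Return a list of (type, string) pairs found in the text."""
    found = [("DATE", m.group(0)) for m in DATE.finditer(text)]
    found += [("COMPANY", m.group(0)) for m in COMPANY.finditer(text)]
    found += [("PERSON", m.group(1)) for m in PERSON.finditer(text)]
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if m.group(0) in LOCATIONS:
            found.append(("LOCATION", m.group(0)))
    return found


print(recognize("Dr. Jane Smith visited Acme Widgets Inc. in Barcelona on 3 March 2010."))
```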
  • 18. Basic Phrases
    Some syntactic constructs can be identified with reasonable reliability:
    Noun group
    Verb group
    Strategies:
    Simple finite-state grammars
    Ambiguities
    Noun-verb ambiguity
    Verbs locally ambiguous
    Problems
    Not all languages make a clear distinction between noun and verb groups
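
A simple finite-state grammar for basic phrases can be sketched over part-of-speech tags. The example assumes the input is already tagged with a small toy tagset (DT, JJ, NN, MD, RB, VB, IN); the grammar and the function names are illustrative only and do not handle the ambiguities listed above.

```python
def match_noun_group(tagged, i):
    """Noun group: optional determiner, any adjectives, one or more nouns.

    Returns the index just past the group, or i if no group starts at i.
    """
    j = i
    if j < len(tagged) and tagged[j][1] == "DT":
        j += 1
    while j < len(tagged) and tagged[j][1] == "JJ":
        j += 1
    k = j
    while k < len(tagged) and tagged[k][1] == "NN":
        k += 1
    return k if k > j else i


def match_verb_group(tagged, i):
    """Verb group: optional modal, any adverbs, one or more verbs."""
    j = i
    if j < len(tagged) and tagged[j][1] == "MD":
        j += 1
    while j < len(tagged) and tagged[j][1] == "RB":
        j += 1
    k = j
    while k < len(tagged) and tagged[k][1] == "VB":
        k += 1
    return k if k > j else i


def chunk(tagged):
    """Split (word, tag) pairs into noun groups (NG), verb groups (VG), other (O)."""
    chunks, i = [], 0
    while i < len(tagged):
        for label, matcher in (("NG", match_noun_group), ("VG", match_verb_group)):
            end = matcher(tagged, i)
            if end > i:
                chunks.append((label, [word for word, _ in tagged[i:end]]))
                i = end
                break
        else:
            # No group starts here: emit the word on its own.
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


SENTENCE = [
    ("the", "DT"), ("new", "JJ"), ("company", "NN"), ("will", "MD"),
    ("soon", "RB"), ("open", "VB"), ("in", "IN"), ("Barcelona", "NN"),
]
print(chunk(SENTENCE))
```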
  • 19. Complex Phrases
    Recognize complex noun and verb groups
    Complex noun groups
    Appositives
    Measure phrases
    Prepositional attachments (of, for)
    Noun group conjunction
    Complex verb groups
    Verb conjunction
    Verb groups with same significance
    Domain-relevant entities can be recognized
  • 20. Domain Events
    Ignore anything not identified in previous phases
    Domain events require domain-specific patterns for identification
    Strategy:
    Finite-state machines
    A certain kind of “pseudo-syntax” can be handled
    Nowadays, IE systems are beginning to rely on full-sentence parsing
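
A domain-specific pattern can be written over the phrases produced by the earlier stages. The sketch below assumes those stages output a flat list of `(type, text)` phrases; the joint-venture pattern, the trigger verbs, and the company names are all invented for illustration.

```python
# Hypothetical trigger verbs for a "joint venture" event.
TRIGGER_VERBS = {"formed", "set up", "established"}


def match_joint_venture(phrases):
    """Find COMPANY + trigger verb + "... joint venture" + "with" + COMPANY."""
    events = []
    for i in range(len(phrases) - 4):
        a, verb, obj, prep, b = phrases[i:i + 5]
        if (
            a[0] == "COMPANY"
            and verb[0] == "VG" and verb[1] in TRIGGER_VERBS
            and obj[0] == "NG" and obj[1].endswith("joint venture")
            and prep == ("P", "with")
            and b[0] == "COMPANY"
        ):
            events.append({"event": "JOINT-VENTURE", "partners": [a[1], b[1]]})
    return events


PHRASES = [
    ("COMPANY", "Acme Corp."), ("VG", "set up"), ("NG", "a joint venture"),
    ("P", "with"), ("COMPANY", "Globex Ltd."), ("P", "in"), ("LOCATION", "Barcelona"),
]
print(match_joint_venture(PHRASES))
```

Anything that does not take part in a pattern, such as the trailing location here, is simply ignored at this stage.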
  • 21. Template Generation: Merging Structures
    Previous stages operate within bounds of single sentences
    Operate over the whole text to combine previously collected information into a unified whole
    If recognizing multiple events:
    Determine how many distinct events
    Assign each entity to appropriate event
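
One simple way to picture merging is slot-wise unification of partial templates. The sketch below assumes each sentence yields a dictionary with the same slots, using `None` for unfilled ones; two templates merge when no filled slot conflicts. The slot names and the greedy strategy are assumptions for this example, and real systems also need coreference resolution to decide compatibility.

```python
def compatible(a, b):
    """Two templates are compatible if no shared slot holds conflicting values."""
    return all(
        a[slot] is None or b[slot] is None or a[slot] == b[slot]
        for slot in a.keys() & b.keys()
    )


def merge(a, b):
    """Fill the empty slots of a with values from b."""
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot) is None:
            merged[slot] = value
    return merged


def merge_templates(templates):
    """Greedily merge each partial template into the first compatible event."""
    events = []
    for template in templates:
        for i, event in enumerate(events):
            if compatible(event, template):
                events[i] = merge(event, template)
                break
        else:
            # Conflicts with every known event: treat it as a distinct event.
            events.append(dict(template))
    return events


PARTIAL = [
    {"type": "ATTACK", "perpetrator": "guerrillas", "target": None, "date": None},
    {"type": "ATTACK", "perpetrator": None, "target": "the embassy", "date": "Tuesday"},
    {"type": "ATTACK", "perpetrator": None, "target": "a bridge", "date": None},
]
print(merge_templates(PARTIAL))
```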
  • 22. 3
    LEARNING-BASED APPROACHES TO IE
  • 23. Supervised Learning of Extraction patterns & rules
    Reduce the knowledge-engineering bottleneck involved in creating an IE system for a new domain
    Examples:
    AutoSlog create lexico-syntactic patterns
    PALKA  patterns generalized based on words semantics
    LIEP  identify syntactic paths related to roles
    CRYSTAL  “concept nodes” with lexical, syntactic and semantic constrains
    WHISK  learn regular expressions
    Many others: SRV, RAPIER, …
  • 24. Supervised Learning of sequential classifier models
    View IE as a classification problem that can be tackled using sequential learning models
    Read sequentially and label each word as an extraction or a non-extraction
    Typical labeling scheme: IOB
    Inside
    Outside
    Beginning of desired extraction
    Strategies:
    Hidden Markov Models
    Maximum Entropy Classifiers
    Support Vector Machines
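
The IOB scheme itself is easy to show in code, independently of which classifier predicts the labels. The sketch below converts token spans to B/I/O labels and back; the function names and the example sentence are made up for illustration, and no learning model is included.

```python
def to_iob(tokens, spans):
    """Label tokens given extraction spans as (start, end) token indices, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"              # Beginning of an extraction
        for i in range(start + 1, end):
            labels[i] = "I"              # Inside the same extraction
    return labels


def from_iob(tokens, labels):
    """Recover the extracted strings from a sequence of B/I/O labels."""
    extractions, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                extractions.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                extractions.append(" ".join(current))
            current = []
    if current:
        extractions.append(" ".join(current))
    return extractions


TOKENS = ["Jane", "Smith", "joined", "Acme", "Corp", "yesterday"]
LABELS = to_iob(TOKENS, [(0, 2), (3, 5)])
print(LABELS)
print(from_iob(TOKENS, LABELS))
```

A sequential model such as an HMM would be trained to predict the label sequence; the decoding step stays the same.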
  • 25. Weakly supervised and unsupervised approaches
    Annotating training text is still time-consuming and complex
    Further techniques learn extraction patterns with weakly supervised and unsupervised systems
    Examples
    AutoSlog-TS (preclassified corpus in which texts are labeled as relevant or irrelevant)
    Ex-Disco (manually defined seed; patterns are ranked and the best ones are added to the seed)
    Meta-bootstrapping (seed nouns that belong to a semantic class)
    On-Demand Information Extraction (dynamically learns from queries)
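
The seed-rank-add loop can be sketched as follows. This is a loose illustration of the general bootstrapping idea, not the published algorithm of any system named above: documents are reduced to sets of pattern strings, and the scoring formula (relevance rate times log2 of the hit count) is an assumption chosen for this example.

```python
import math

A = "<company> appointed <person>"   # seed pattern
R = "<person> resigned"
S = "<person> succeeded <person>"
P = "<company> reported profits"

# Toy corpus: each document is the set of patterns that match in it.
DOCS = [{A, R}, {A, R, P}, {R, S}, {R, S}, {P}, {P}]


def bootstrap(docs, seed, rounds=2):
    """Grow a pattern set from a seed, adding the best-scoring pattern per round."""
    accepted = set(seed)
    for _ in range(rounds):
        # Documents matched by any accepted pattern count as relevant.
        relevant = [d for d in docs if d & accepted]
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for pattern in sorted(candidates):
            total = sum(1 for d in docs if pattern in d)
            hits = sum(1 for d in relevant if pattern in d)
            # Assumed heuristic: favor patterns concentrated in relevant documents.
            score = (hits / total) * math.log2(hits) if hits > 1 else 0.0
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break
        accepted.add(best)
    return accepted


print(sorted(bootstrap(DOCS, {A})))
```

On this toy corpus the loop adds the resignation pattern first and the succession pattern second, while the unrelated profits pattern is never selected.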
  • 26. Discourse-oriented approaches to IE
    Most IE system patterns focus only on the local context surrounding an extraction
    Extend systems to have a more global view
    Strategy:
    Add constraints to connect entities in different clauses
    Decision trees (WRAP-UP)
    Set of classifiers to identify new templates (ALICE)
  • 27. 4
    HOW GOOD IS IE
  • 28. How are IE systems progressing?
    The 60% barrier in performance
    Biggest mistakes in entity and event coreference
    Implicit knowledge in natural language is not stated explicitly in texts
    Problems in the training data may not match those found in the test data
    Good IE systems typically recognize 90% of entities
    An event requires about 4 entities
    0.9 × 0.9 × 0.9 × 0.9 = 0.6561 (65.61%)
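
The arithmetic above can be checked with a one-line function. It assumes, as the slide implicitly does, that the entities of an event are recognized independently of each other; the function name is chosen for this example.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance of getting every entity of an event right, assuming independence."""
    return entity_accuracy ** entities_per_event


print(f"{event_accuracy(0.9, 4):.4f}")
```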
  • 29. THANKS