
Information Extraction

Presentation given during the Computer Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic:
Information Extraction
Jerry R. Hobbs, University of Southern California; Ellen Riloff, University of Utah

Published in: Education, Technology
  • Semi-structured -> classified ads
  • Transcript

    • 1. Information Extraction (Joan Hurtado, Ignacio Delgado)
    • 2. Contents
        • INTRODUCTION
        • IE TASKS
        • IE WITH CASCADED FINITE-STATE TRANSDUCERS
        • LEARNING-BASED APPROACHES TO IE
        • HOW GOOD IS IE
    • 3. INTRODUCTION
    • 4. What is IE?
        • Information Extraction is the process of scanning text for information relevant to some interest
        • Extract:
            • Entities
            • Relations
            • Events
        • Who did what to whom, when and where
    • 5. Why IE?
        • Need for efficient processing of texts in specialized domains
        • Focus on relevant parts, ignore the rest
        • Typical applications:
            • Gleaning business, government, or military intelligence
            • WWW searches (more specific than keywords)
            • Scientific literature searches
            • …
    • 6. Most common uses
        • Named Entity Recognition (NER)
            • Identify names and special entities (dates, times)
            • Uses textual patterns
            • Important in biomedical applications
        • IE is more than NER
            • Recognition of events and their participants
    • 7. How to measure performance
        • Recall
            • What percentage of the correct answers did the system get?
        • Precision
            • What percentage of the system's answers were correct?
        • F-score
            • Weighted harmonic mean of recall and precision
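
The three measures on this slide can be computed from two sets of extractions. The following is a minimal sketch, assuming answers are compared by exact match; the function name, the `beta` weight parameter, and the example entities are illustrative and do not come from the slides.

```python
def precision_recall_f(system_answers, correct_answers, beta=1.0):
    """Return (precision, recall, F-score) for two collections of extractions."""
    system = set(system_answers)
    gold = set(correct_answers)
    true_positives = len(system & gold)

    # Precision: share of the system's answers that are correct.
    precision = true_positives / len(system) if system else 0.0
    # Recall: share of the correct answers that the system found.
    recall = true_positives / len(gold) if gold else 0.0

    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # Weighted harmonic mean; beta = 1 weights precision and recall equally.
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical example: 3 system answers, 4 correct answers, 2 in common.
gold = {"Acme Corp", "John Smith", "Boston", "2010-05-01"}
system = {"Acme Corp", "John Smith", "Chicago"}
p, r, f = precision_recall_f(system, gold)
```

Here precision is 2/3, recall is 2/4, and the balanced F-score is 4/7. Raising `beta` above 1 shifts the weight toward recall.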
    • 8. IE TASKS
    • 9. Unstructured vs. Semi-structured text
        • Unstructured
            • Natural language sentences
            • Semantics depends on linguistic analysis
            • Examples: news stories, magazine articles, books, …
        • Semi-structured
            • Structured data
            • Semantics defined by its organization
            • Physical layout plays a role in interpretation
            • Examples: job postings, rental ads, …
    • 10. Single-document vs. Multi-document
        • Originally, IE systems were designed for individual documents
        • Newer systems extract facts from the WWW
        • Both use similar techniques
        • Distinguishing issue: redundancy
            • Multi-document IE can exploit redundancy
            • However, it must address the challenge of cross-document coreference resolution
        • Multi-document IE systems are also referred to as open-domain
    • 11. Assumptions about Incoming Documents
        • Relevant-only documents
        • Single-event documents
    • 12. IE WITH CASCADED FINITE-STATE TRANSDUCERS
    • 13. Complex Words -> Basic Phrases -> Complex Phrases -> Domain Events -> Template Generation
    • 14. Complex Words
        • Identify multiwords, company names, people's names, locations, dates, times and basic entities
        • Recognition strategies:
            • Patterns
            • Dictionaries
            • Context
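
The three recognition strategies can be illustrated with a small sketch. Everything here is an assumption made for illustration: the regular expressions, the two-entry company dictionary, and the use of a preceding title ("Dr.", "Mr.", "Ms.") as context are simplifications, not the rules of any particular IE system.

```python
import re

# Dictionary strategy: a hypothetical list of known company names.
COMPANY_DICTIONARY = {"Acme Corp", "Globex"}

# Pattern strategy: dates such as "12 March 2009" and 24-hour times such as "14:30".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")
TIME_PATTERN = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")

# Context strategy: capitalized words that follow a title are taken as a person name.
PERSON_CONTEXT = re.compile(r"\b(?:Mr|Ms|Dr)\. ((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)")


def recognize_complex_words(text):
    """Return a list of (entity_type, surface_string) pairs found in text."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        found.append(("DATE", match.group(0)))
    for match in TIME_PATTERN.finditer(text):
        found.append(("TIME", match.group(0)))
    for match in PERSON_CONTEXT.finditer(text):
        found.append(("PERSON", match.group(1)))
    for name in sorted(COMPANY_DICTIONARY):
        if name in text:
            found.append(("COMPANY", name))
    return found


text = "Dr. Jane Doe joined Acme Corp on 12 March 2009 at 14:30."
entities = recognize_complex_words(text)
```

On the example sentence this yields one date, one time, one person, and one company. Real systems combine far larger dictionaries with many more patterns.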
    • 15. Basic Phrases
        • Some syntactic constructs can be identified with reasonable reliability:
            • Noun group
            • Verb group
        • Strategies:
            • Simple finite-state grammars
        • Ambiguities
            • Noun-verb ambiguity
            • Verbs locally ambiguous
        • Problems
            • Not all languages have a clear distinction between noun and verb groups
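
A simple finite-state grammar for noun and verb groups can be sketched as a regular expression over part-of-speech tags. This sketch assumes the tokens arrive already tagged by an upstream tagger; the coarse tag set, the one-letter codes, and the two grammar rules are illustrative assumptions, and the ambiguities listed on the slide are not handled.

```python
import re

# One character per coarse POS tag, so that a regex can act as the grammar.
TAG_CODES = {"DET": "D", "NUM": "M", "ADJ": "A", "NOUN": "N",
             "AUX": "X", "ADV": "R", "VERB": "V"}

# Noun group (NG): optional determiner, optional numeral, adjectives, nouns.
# Verb group (VG): optional auxiliaries and adverbs, then a main verb.
GROUP_GRAMMAR = re.compile(r"(?P<NG>D?M?A*N+)|(?P<VG>X*R*V)")


def chunk(tagged_tokens):
    """Return (group_type, text) pairs for each noun or verb group found."""
    # Tags outside the table become "O" and never take part in a group.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    groups = []
    for match in GROUP_GRAMMAR.finditer(codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        groups.append((match.lastgroup, " ".join(words)))
    return groups


sentence = [("The", "DET"), ("new", "ADJ"), ("company", "NOUN"),
            ("has", "AUX"), ("recently", "ADV"), ("hired", "VERB"),
            ("two", "NUM"), ("engineers", "NOUN")]
groups = chunk(sentence)
```

Because each token maps to exactly one character, match offsets in the code string are also token offsets, which keeps the mapping back to words trivial.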
    • 16. Complex Phrases
        • Recognize complex noun and verb groups
        • Complex noun groups
            • Appositives
            • Measure phrases
            • Prepositional attachments (of, for)
            • Noun group conjunction
        • Complex verb groups
            • Verb conjunction
            • Verb groups with the same significance
        • Domain-relevant entities can be recognized
    • 17. Domain Events
        • Ignore anything not identified in previous phases
        • Domain events require domain-specific patterns for identification
        • Strategy:
            • Finite-state machines
            • A certain kind of "pseudo-syntax" can be done
        • IE systems are now beginning to rely on full-sentence parsing
    • 18. Template Generation: Merging Structures
        • Previous stages operate within the bounds of single sentences
        • This stage operates over the whole text to combine previously collected information into a unified whole
        • If recognizing multiple events:
            • Determine how many distinct events there are
            • Assign each entity to the appropriate event
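
The merging step can be sketched as a greedy loop over sentence-level partial templates. This is a simplified illustration under a strong assumption: two partial templates describe the same event whenever none of their shared slots conflict. The slot names and example values are hypothetical, and real systems rely on coreference resolution and richer compatibility tests.

```python
def compatible(a, b):
    """Two partial templates are compatible if no shared slot has conflicting values."""
    return all(a[slot] == b[slot] for slot in a.keys() & b.keys())


def merge_templates(partials):
    """Greedily fold sentence-level partial templates into distinct events."""
    events = []
    for partial in partials:
        for event in events:
            if compatible(event, partial):
                event.update(partial)  # same event: fill in the new slots
                break
        else:
            events.append(dict(partial))  # no compatible event: start a new one
    return events


# Hypothetical partial templates, one per sentence of a document.
partials = [
    {"type": "hire", "company": "Acme Corp"},
    {"type": "hire", "person": "Jane Doe"},
    {"type": "hire", "company": "Globex", "person": "John Roe"},
]
events = merge_templates(partials)
```

The first two partial templates merge into one event, while the third conflicts on the company slot and becomes a second event, which answers both questions on the slide: how many events there are and which entity belongs to which.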
    • 19. LEARNING-BASED APPROACHES TO IE
    • 20. Supervised Learning of Extraction patterns & rules
        • Reduce the knowledge engineering bottleneck required to create an IE system for a new domain
        • Examples:
            • AutoSlog: creates lexico-syntactic patterns
            • PALKA: patterns generalized based on word semantics
            • LIEP: identifies syntactic paths related to roles
            • CRYSTAL: "concept nodes" with lexical, syntactic and semantic constraints
            • WHISK: learns regular expressions
            • Many others: SRV, RAPIER, …
    • 21. Supervised Learning of sequential classifier models
        • View IE as a classification problem that can be tackled using sequential learning models
        • Read sequentially and label each word as an extraction or a non-extraction
        • Typical labeling scheme: IOB
            • Inside
            • Outside
            • Beginning of desired extraction
        • Strategies:
            • Hidden Markov Models
            • Maximum Entropy Classifiers
            • Support Vector Machines
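
Once a sequential classifier has labeled every word, the extractions are read off the IOB tags. The sketch below covers only that decoding step; the tags are assumed to come from some trained model (for example one of the strategies listed on the slide), and the example sentence is hypothetical.

```python
def iob_to_spans(tokens, tags):
    """Collect the token spans marked by B (beginning) and I (inside) tags."""
    spans = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new extraction starts; close any extraction still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open extraction: close and skip.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Acme", "Corp", "hired", "Jane", "Doe", "yesterday"]
tags = ["B", "I", "O", "B", "I", "O"]
spans = iob_to_spans(tokens, tags)
```

The separate B tag is what allows two adjacent extractions to be told apart, which a plain inside/outside labeling could not do.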
    • 22. Weakly supervised and unsupervised approaches
        • Annotating training text is still time-consuming and complex
        • Further techniques learn extraction using weakly supervised and unsupervised systems
        • Examples
            • AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
            • Ex-Disco (manually defined seed; patterns are ranked and the best ones are added to the seed)
            • Meta-bootstrapping (seed nouns that belong to a semantic class)
            • On-Demand Information Extraction (dynamically learns from queries)
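
The seed-and-grow idea shared by several of these systems can be sketched as a toy bootstrapping loop. This is not the published Meta-bootstrapping or Ex-Disco algorithm, both of which use more refined scoring; the function, the scoring rule (count of known nouns a pattern extracts), and the `(pattern, noun)` data are assumptions made for illustration.

```python
def bootstrap(seed_nouns, extractions, rounds=2):
    """Grow a noun lexicon from seeds.

    extractions: (pattern, noun) pairs observed in an unannotated corpus.
    Returns the grown lexicon and the patterns selected, in order.
    """
    lexicon = set(seed_nouns)
    chosen = []
    for _ in range(rounds):
        # Score each unused pattern by how many known nouns it extracts.
        scores = {}
        for pattern, noun in extractions:
            if pattern not in chosen and noun in lexicon:
                scores[pattern] = scores.get(pattern, 0) + 1
        if not scores:
            break
        # Pick the best pattern; sorting first makes ties deterministic.
        best = max(sorted(scores), key=scores.get)
        chosen.append(best)
        # Everything the best pattern extracts joins the lexicon.
        lexicon.update(noun for pattern, noun in extractions if pattern == best)
    return lexicon, chosen


# Hypothetical corpus observations for the semantic class "location".
extractions = [
    ("offices in <X>", "boston"), ("offices in <X>", "paris"),
    ("offices in <X>", "tokyo"), ("flew to <X>", "tokyo"),
    ("flew to <X>", "lima"), ("ate <X>", "pizza"),
]
lexicon, chosen = bootstrap({"boston", "paris"}, extractions)
```

Starting from two seed nouns, the loop first selects the pattern that covers both seeds, learns "tokyo" from it, and then uses "tokyo" to reach a second pattern and "lima". The unrelated pattern never scores and "pizza" stays out of the lexicon.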
    • 23. Discourse-oriented approaches to IE
        • Most IE systems' patterns focus only on the surrounding local context
        • Extend systems to have a more global view
        • Strategy:
            • Add constraints to connect entities in different clauses
            • Decision trees (WRAP-UP)
            • Set of classifiers to identify new templates (ALICE)
    • 24. HOW GOOD IS IE
    • 25. How are IE systems progressing?
        • The 60% barrier in performance
        • Biggest mistakes are in entity and event coreference
        • Implicit knowledge in natural language is not spelled out in texts
        • Problems in the training data are not found in the test data
        • Good IE systems typically recognize 90% of entities
        • An event requires about 4 entities
        • 0.9 × 0.9 × 0.9 × 0.9 = 0.6561 = 65.61%
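
The arithmetic behind the last three bullets can be checked with a short sketch. It assumes that entity recognition errors are independent of one another, which is what multiplying the per-entity rates implies; the function name is illustrative.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized, assuming independent errors."""
    return entity_accuracy ** entities_per_event


rate = event_accuracy(0.9, 4)
print(f"{rate:.2%}")  # prints 65.61%
```

Under the same assumption, an event with six entities would drop to about 53%, which illustrates why event-level scores sit well below entity-level scores.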
    • 26. THANKS
