Information Extraction
Presentation given during the Computer Linguistics course at UPF (Universitat Pompeu Fabra), covering the following topic:
Information Extraction
Jerry R. Hobbs, University of Southern California; Ellen Riloff, University of Utah

Published in: Education, Technology

  • Semi-structured -> classified ads
  • Transcript

    • 1. Information Extraction Joan Hurtado Ignacio Delgado
    • 2. Contents
        • INTRODUCTION
        • IE TASKS
        • IE WITH CASCADED FINITE-STATE TRANSDUCERS
        • LEARNING-BASED APPROACHES TO IE
        • HOW GOOD IS IE
    • 3. INTRODUCTION
    • 4. What is IE?
        • Information Extraction is the process of scanning text for information relevant to some interest
        • Extract:
            • Entities
            • Relations
            • Events
        • Who did what to whom, when, and where
    • 5. Why IE?
        • Need for efficient processing of texts in specialized domains
        • Focus on relevant parts, ignore the rest
        • Typical applications:
            • Gleaning business, government, and military intelligence
            • WWW searches (more specific than keywords)
            • Scientific literature searches
            • …
    • 6. Most common uses
        • Named Entity Recognition (NER)
            • Identify names and special entities (dates, times)
            • Uses textual patterns
            • Important in biomedical applications
        • IE is more than NER
            • Recognition of events and their participants
    • 7. How to measure performance
        • Recall
            • What percentage of the correct answers did the system get?
        • Precision
            • What percentage of the system’s answers were correct?
        • F-score
            • Weighted harmonic mean of recall and precision
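The three measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming extractions are compared as exact-match sets; the function name `precision_recall_f`, the `beta` weight, and the sample entities are hypothetical and do not come from the slides.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Score predicted extractions against a gold-standard set (exact match)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # Precision: share of the system's answers that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of the correct answers that the system found.
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # F-score: weighted harmonic mean; beta > 1 favours recall.
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Made-up example data: four gold entities, three system answers, two correct.
gold = {("ORG", "Acme Corp"), ("PER", "Jane Doe"),
        ("LOC", "Barcelona"), ("DATE", "12 May")}
predicted = {("ORG", "Acme Corp"), ("PER", "Jane Doe"), ("LOC", "Madrid")}
p, r, f1 = precision_recall_f(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With `beta=1.0` precision and recall are weighted equally, which gives the commonly reported F1.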
    • 8. IE TASKS
    • 9. Unstructured vs. Semi-structured text
        • Unstructured
            • Natural language sentences
            • Semantics depends on linguistic analysis
            • Examples: news stories, magazine articles, books, …
        • Semi-structured
            • Structured data
            • Semantics defined by its organization
            • Physical layout plays a role in interpretation
            • Examples: job postings, rental ads, …
    • 10. Single-document vs. Multi-document
        • Originally, IE systems were designed for individual documents
        • Nowadays, new systems extract facts from the WWW
        • Both use similar techniques
        • Distinguishing issue: redundancy
            • Multi-document IE can exploit redundancy
            • However, it has to face the challenge of cross-document coreference resolution
        • Multi-document IE systems are also referred to as open-domain
    • 11. Assumptions about Incoming Documents
        • Relevant-only documents
        • Single-event documents
    • 12. IE WITH CASCADED FINITE-STATE TRANSDUCERS
    • 13. Complex Words -> Basic Phrases -> Complex Phrases -> Domain Events -> Template Generation
    • 14. Complex Words
        • Identify multiwords, company names, people's names, locations, dates, times and basic entities
        • Recognition strategies:
            • Patterns
            • Dictionaries
            • Context
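The pattern and dictionary strategies for complex words can be sketched with regular expressions. This is a toy illustration, not the recognizer used in any real IE system: the suffix dictionary, the two patterns, the function name `find_complex_words`, and the sample sentence are all assumptions made for the example.

```python
import re

# Tiny illustrative dictionary of company-name suffixes (assumption).
COMPANY_SUFFIXES = ("Inc", "Corp", "Ltd")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Pattern: one or more capitalized words followed by a known suffix.
COMPANY_RE = re.compile(
    r"\b((?:[A-Z][a-z]+ )+(?:%s))\b" % "|".join(COMPANY_SUFFIXES))
# Pattern: day, month name, four-digit year.
DATE_RE = re.compile(r"\b\d{1,2} (?:%s) \d{4}\b" % "|".join(MONTHS))


def find_complex_words(text):
    """Return (type, string) pairs for company names and dates found in text."""
    found = [("COMPANY", m.group(1)) for m in COMPANY_RE.finditer(text)]
    found += [("DATE", m.group(0)) for m in DATE_RE.finditer(text)]
    return found


sentence = "Bridgestone Sports Corp said on 14 March 1990 that it set up a venture."
entities = find_complex_words(sentence)
print(entities)
```

The context strategy named on the slide (using surrounding words to decide what an unknown name is) is not covered by this sketch.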
    • 15. Basic Phrases
        • Some syntactic constructs can be identified with reasonable reliability:
            • Noun group
            • Verb group
        • Strategies:
            • Simple finite-state grammars
        • Ambiguities
            • Noun-verb ambiguity
            • Verbs locally ambiguous
        • Problems
            • Not all languages have a clear distinction between noun and verb groups
    • 16. Complex Phrases
        • Recognize complex noun and verb groups
        • Complex noun groups
            • Appositives
            • Measure phrases
            • Prepositional attachments (of, for)
            • Noun group conjunction
        • Complex verb groups
            • Verb conjunction
            • Verb groups with the same significance
        • Domain-relevant entities can be recognized
    • 17. Domain Events
        • Ignore anything not identified in previous phases
        • Domain events require domain-specific patterns for identification
        • Strategy:
            • Finite-state machines
            • A certain kind of “pseudo-syntax” can be done
        • Nowadays, IE systems are beginning to rely on full-sentence parsing
    • 18. Template Generation: Merging Structures
        • Previous stages operate within the bounds of single sentences
        • Operate over the whole text to combine previously collected information into a unified whole
        • If recognizing multiple events:
            • Determine how many distinct events there are
            • Assign each entity to the appropriate event
    • 19. LEARNING-BASED APPROACHES TO IE
    • 20. Supervised Learning of Extraction Patterns & Rules
        • Reduce the knowledge-engineering bottleneck required to create an IE system for a new domain
        • Examples:
            • AutoSlog: creates lexico-syntactic patterns
            • PALKA: patterns generalized based on word semantics
            • LIEP: identifies syntactic paths related to roles
            • CRYSTAL: “concept nodes” with lexical, syntactic and semantic constraints
            • WHISK: learns regular expressions
            • Many others: SRV, RAPIER, …
    • 21. Supervised Learning of Sequential Classifier Models
        • View IE as a classification problem that can be tackled using sequential learning models
        • Read sequentially and label each word as an extraction or a non-extraction
        • Typical labeling scheme: IOB
            • Inside
            • Outside
            • Beginning of desired extraction
        • Strategies:
            • Hidden Markov Models
            • Maximum Entropy Classifiers
            • Support Vector Machines
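To make the IOB labeling scheme concrete, the sketch below turns a sequence of per-word labels back into extracted phrases. It covers only the decoding step and assumes the labels have already been produced by some sequential classifier; the function name `iob_to_spans` and the sample sentence are hypothetical.

```python
def iob_to_spans(tokens, labels):
    """Group tokens labeled B (beginning) / I (inside) into extracted phrases.

    Tokens labeled O (outside) are skipped and close any open extraction.
    """
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new extraction starts; close the previous one if still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open extraction.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Jane", "Doe", "joined", "Acme", "Corp", "yesterday"]
labels = ["B", "I", "O", "B", "I", "O"]
spans = iob_to_spans(tokens, labels)
print(spans)
```

The separate B label is what lets two adjacent extractions be told apart, which a plain inside/outside scheme cannot do.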
    • 22. Weakly supervised and unsupervised approaches
        • Annotating training text is still time-consuming and complex
        • Further techniques learn extraction patterns using weakly supervised and unsupervised systems
        • Examples
            • AutoSlog-TS (preclassified corpus in which texts are identified as relevant or irrelevant)
            • Ex-Disco (manually defined seed; patterns are ranked and the best ones are added to the seed)
            • Meta-bootstrapping (seed nouns that belong to a semantic class)
            • On-Demand Information Extraction (dynamically learns from queries)
    • 23. Discourse-oriented approaches to IE
        • Most IE systems' patterns focus only on the local surrounding context
        • Extend systems to have a more global view
        • Strategy:
            • Add constraints to connect entities in different clauses
            • Decision trees (WRAP-UP)
            • Set of classifiers to identify new templates (ALICE)
    • 24. HOW GOOD IS IE
    • 25. How are IE systems progressing?
        • The 60% barrier in performance
        • Biggest mistakes in entity and event coreference
        • The implicit knowledge in NL is not translated to texts
        • Problems in training data not found in test data
        • Good IE systems typically recognize 90% of entities
        • An event requires about 4 entities
        • 0.9 * 0.9 * 0.9 * 0.9 = 0.6561 (65.61%)
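The arithmetic behind the 60% barrier can be checked directly. The sketch assumes, as the slide's multiplication implicitly does, that the entities of an event are recognized independently of one another; the function name `event_accuracy` is hypothetical.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized correctly,
    assuming each entity is recognized independently."""
    return entity_accuracy ** entities_per_event


# Figures from the slide: 90% entity recognition, about 4 entities per event.
estimate = event_accuracy(0.9, 4)
print(f"{estimate:.4f}")
```

Under the same assumption, raising entity recognition to 95% would only lift the event-level estimate to about 81%, which suggests why progress at the event level is slow.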
    • 26. THANKS