osm.cs.byu.edu Presentation Transcript

  • Information Extraction from Web Documents
    • CS 652 Information Extraction and Integration
    • Li Xu, Yihong Ding
  • IR and IE
    • IR (Information Retrieval)
      • Retrieves relevant documents from collections
      • Information theory, probability theory, and statistics
    • IE (Information Extraction)
      • Extracts relevant information from documents
      • Machine learning, computational linguistics, and natural language processing
  • History of IE
    • Large amount of both online and offline textual data.
    • Message Understanding Conference (MUC)
      • Quantitative evaluation of IE systems
      • Tasks
        • Latin American terrorism
        • Joint ventures
        • Microelectronics
        • Company management changes
  • Evaluation Metrics
    • Precision
    • Recall
    • F-measure
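The three metrics listed above can be made concrete with a small calculation. The sketch below assumes that extracted items are compared with the answer key as exact-match sets, which is a simplification; the function name and the sample data are illustrative and do not come from the slides.

```python
def precision_recall_f(extracted, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items."""
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of relevant items that were extracted.
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: weighted harmonic mean; beta=1 weights both equally.
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical system output and answer key.
extracted = {"Carlos", "Parliament building", "Lima", "1993"}
relevant = {"Carlos", "Parliament building", "1993",
            "bombing", "Peru", "Shining Path"}

p, r, f = precision_recall_f(extracted, relevant)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Here three of the four extracted items are correct and three of the six relevant items are found, so precision is higher than recall and the F-measure falls between them.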
  • Web Documents
    • Unstructured (Free) Text
      • Regular sentences and paragraphs
      • Linguistic techniques, e.g., NLP
    • Structured Text
      • Itemized information
      • Uniform syntactic clues, e.g., table understanding
    • Semistructured Text
      • Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
      • Specialized programs, e.g., wrappers
  • Approaches to IE
    • Knowledge Engineering
      • Grammars are constructed by hand
      • Domain patterns are discovered by human experts through introspection and inspection of a corpus
      • Much laborious tuning and “hill climbing”
    • Machine Learning
      • Use statistical methods when possible
      • Learn rules from annotated corpora
      • Learn rules from interaction with user
  • Knowledge Engineering
    • Advantages
      • With skill and experience, well-performing systems are not conceptually hard to develop.
      • The best performing systems have been hand crafted.
    • Disadvantages
      • Very laborious development process
      • Some changes to specifications can be hard to accommodate
      • Required expertise may not be available
  • Machine Learning
    • Advantages
      • Domain portability is relatively straightforward
      • System expertise is not required for customization
      • “Data-driven” rule acquisition ensures full coverage of examples
    • Disadvantages
      • Training data may not exist, and may be very expensive to acquire
      • Large volume of training data may be required
      • Changes to specifications may require reannotation of large quantities of training data
  • Wrapper
    • A specialized program that identifies data of interest and maps them to some suitable format (e.g. XML or relational tables)
    • Challenge: recognizing the data of interest among many other irrelevant pieces of text
    • Tasks
      • Source understanding
      • Data processing
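A minimal sketch of the wrapper idea described above: a hand-written pattern picks the data of interest out of semistructured text, and the matches are mapped to XML. The ad text, the pattern, and the element names are assumptions made for illustration; a real wrapper would be tailored to (or learned for) a specific source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semistructured source: one telegraphic car ad per line.
SAMPLE = (
    "1998 Honda Civic, low miles, $4,500\n"
    "2001 Toyota Corolla, one owner, $6,200"
)

# Pattern for the data of interest; everything else on the line is skipped.
AD_PATTERN = re.compile(
    r"(?P<year>\d{4})\s+(?P<make>[A-Z][a-z]+)\s+(?P<model>\w+)"
    r".*?\$(?P<price>[\d,]+)"
)


def wrap(text):
    """Map every ad found in `text` to an <ad> element and return XML."""
    root = ET.Element("ads")
    for match in AD_PATTERN.finditer(text):
        ad = ET.SubElement(root, "ad")
        for field, value in match.groupdict().items():
            ET.SubElement(ad, field).text = value
    return ET.tostring(root, encoding="unicode")


ads_xml = wrap(SAMPLE)
print(ads_xml)
```

The same matches could just as well be written to relational tables; only the last step of `wrap` would change.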
  • Free Text
    • AutoSlog
    • LIEP
    • PALKA
    • HASTEN
    • CRYSTAL
      • Webfoot
    • WHISK
  • AutoSlog [1993]
    • The Parliament building was bombed by Carlos.
  • LIEP [1995]
    • The Parliament building was bombed by Carlos.
  • PALKA [1995]
    • The Parliament building was bombed by Carlos.
  • HASTEN [1995]
    • The Parliament building was bombed by Carlos.
    • Egraphs
    • (SemanticLabel, StructuralElement)
  • CRYSTAL [1995]
    • The Parliament building was bombed by Carlos.
  • CRYSTAL + Webfoot [1997]
  • WHISK [1999]
    • The Parliament building was bombed by Carlos.
    • WHISK Rule:
      • *( PhyObj )*@passive *F ‘bombed’ * {PP ‘by’ *F ( Person )}
    • Context-based patterns
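The rule above can be approximated with a regular expression to show how a context-based pattern fills two slots from the example sentence. This is only a rough sketch: WHISK relies on syntactic analysis and semantic tagging, whereas the stand-ins below for the `PhyObj` and `Person` classes are crude surface patterns assumed for illustration.

```python
import re

# Passive verb 'bombed' followed by a 'by' phrase, as in the rule above.
# phys_obj: any text before the passive verb (stand-in for PhyObj).
# person:   one or more capitalized words (stand-in for Person).
BOMBING_RULE = re.compile(
    r"^(?P<phys_obj>.+?)\s+(?:was|were)\s+bombed\s+by\s+"
    r"(?P<person>[A-Z]\w*(?:\s+[A-Z]\w*)*)"
)


def extract_bombing(sentence):
    """Return {'target': ..., 'perpetrator': ...} or None if no match."""
    match = BOMBING_RULE.search(sentence)
    if match is None:
        return None
    return {
        "target": match.group("phys_obj"),
        "perpetrator": match.group("person"),
    }


frame = extract_bombing("The Parliament building was bombed by Carlos.")
print(frame)
```

An active-voice sentence such as "Carlos bombed the Parliament building." does not match, which is why systems of this kind need a separate pattern for each syntactic context.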
  • Web Documents
    • Semistructured and Unstructured
      • RAPIER (E. Califf, 1997)
      • SRV (D. Freitag, 1998)
      • WHISK (S. Soderland, 1998)
    • Semistructured and Structured
      • WIEN (N. Kushmerick, 1997)
      • SoftMealy (C-H. Hsu, 1998)
      • STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
  • Inductive Learning
    • Task
    • Inductive Inference
    • Learning Systems
      • Zero-order
      • First-order, e.g., Inductive Logic Programming (ILP)
  • RAPIER [1997]
    • Inductive Logic Programming
    • Extraction Rules
      • Syntactic information
      • Semantic information
    • Advantage
      • Efficient learning (bottom-up)
    • Drawback
      • Single-slot extraction
  • RAPIER Rule
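RAPIER rules are usually described as three patterns: a pre-filler pattern for the text before the slot value, a filler pattern for the value itself, and a post-filler pattern for the text after it. The sketch below is not the rule from the original slide; it assumes tokens that already carry part-of-speech tags, supports only fixed-length patterns with word and tag constraints, and uses hypothetical names throughout.

```python
def element_matches(element, token):
    """Check one pattern element's word and tag constraints on a token."""
    word, tag = token
    if "words" in element and word.lower() not in element["words"]:
        return False
    if "tags" in element and tag not in element["tags"]:
        return False
    return True


def match_at(pattern, tokens, start):
    """True if every element of `pattern` matches the tokens at `start`."""
    end = start + len(pattern)
    if start < 0 or end > len(tokens):
        return False
    return all(element_matches(e, t)
               for e, t in zip(pattern, tokens[start:end]))


def apply_rule(rule, tokens):
    """Return every filler whose left and right context also match."""
    pre, filler, post = rule["pre"], rule["filler"], rule["post"]
    found = []
    for i in range(len(tokens)):
        if (match_at(pre, tokens, i - len(pre))
                and match_at(filler, tokens, i)
                and match_at(post, tokens, i + len(filler))):
            found.append(" ".join(w for w, _ in tokens[i:i + len(filler)]))
    return found


# Hypothetical single-slot rule for a "city" slot.
CITY_RULE = {
    "pre": [{"words": {"in"}, "tags": {"in"}}],   # syntactic + lexical
    "filler": [{"tags": {"nnp"}}],                # proper noun
    "post": [{"words": {","}}],
}

tokens = [("located", "vbn"), ("in", "in"), ("Atlanta", "nnp"),
          (",", ","), ("Georgia", "nnp"), (".", ".")]
cities = apply_rule(CITY_RULE, tokens)
print(cities)
```

Because the rule fills only one slot, extracting both city and state would need a second rule, which is the single-slot drawback noted for RAPIER.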
  • SRV [1998]
    • Relational Algorithm (top-down)
    • Features
      • Simple features (e.g., length, character type, …)
      • Relational features (e.g., next-token, …)
    • Advantages
      • Expressive rule representation
    • Drawbacks
      • Single-slot rule generation
      • Requires a large volume of training data
  • SRV Rule
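The distinction between simple and relational features can be sketched as follows: a simple feature maps a token to a value, while a relational feature maps a token to a neighboring token whose simple features can then be tested. This is a hedged illustration, not the rule shown on the original slide; the feature names, predicate helpers, and the course-number rule are assumptions.

```python
# Simple features: token -> value.
SIMPLE_FEATURES = {
    "numeric": str.isdigit,
    "all_upper": str.isupper,
    "capitalized": lambda tok: tok[:1].isupper(),
}


# Relational features: token position -> related token position (or None).
def prev_token(tokens, i):
    return i - 1 if i > 0 else None


def next_token(tokens, i):
    return i + 1 if i + 1 < len(tokens) else None


def length_eq(n):
    """Predicate: the candidate fragment has exactly n tokens."""
    return lambda tokens, start, end: end - start == n


def every(feature, value):
    """Predicate: every token in the fragment has feature == value."""
    check = SIMPLE_FEATURES[feature]
    return lambda tokens, start, end: all(
        check(tokens[i]) == value for i in range(start, end))


def some(feature, value, relation=None):
    """Predicate: some token (or its related token) has feature == value."""
    check = SIMPLE_FEATURES[feature]

    def predicate(tokens, start, end):
        for i in range(start, end):
            j = relation(tokens, i) if relation else i
            if j is not None and check(tokens[j]) == value:
                return True
        return False
    return predicate


def extract(tokens, rule):
    """Return every fragment that satisfies all predicates of the rule."""
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if all(pred(tokens, start, end) for pred in rule):
                hits.append(" ".join(tokens[start:end]))
    return hits


# Hypothetical single-slot rule: a one-token number preceded by an
# all-upper-case token.
COURSE_NUMBER = [
    length_eq(1),
    every("numeric", True),
    some("all_upper", True, relation=prev_token),
]

tokens = "CS 652 Information Extraction and Integration".split()
numbers = extract(tokens, COURSE_NUMBER)
print(numbers)
```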
  • WHISK [1998]
    • Covering Algorithm (top-down)
    • Advantages
      • Learns multi-slot extraction rules
      • Handles various orderings of the items to be extracted
      • Handles document types from free text to structured text
    • Drawbacks
      • Must see all the permutations of items
      • Less expressive feature set
      • Needs a large volume of training data
  • WHISK Rule
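A multi-slot rule fills several related slots with one match, which keeps the values of one record together. The sketch below imitates that behavior with a regular expression over an invented rental ad; it is not the rule from the original slide, and the ad text, slot names, and pattern are assumptions made for illustration.

```python
import re

# One match fills both slots: a bedroom count followed later by a price.
RENTAL_RULE = re.compile(r"(\d+)\s*br\b.*?\$(\d+)", re.IGNORECASE)


def extract_rentals(text):
    """Return one {'bedrooms', 'price'} record per match of the rule."""
    return [
        {"bedrooms": bedrooms, "price": price}
        for bedrooms, price in RENTAL_RULE.findall(text)
    ]


ad = ("Capitol Hill: 1 br twnhme, undrgrnd pkg incl $675. "
      "3 BR upper flr, grt loc $995.")
rentals = extract_rentals(ad)
print(rentals)

# The rule encodes one item order only. An ad that lists the price first
# is missed, which illustrates why all permutations must be seen in training.
reordered = extract_rentals("$850 for this 2 br apt")
print(reordered)
```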
  • WIEN [1997]
    • Assumes
      • Items are always in fixed, known order
    • Introduces several types of wrappers
    • Advantages
      • Fast to learn and extract
    • Drawbacks
      • Cannot handle permutations and missing items
      • Must label entire pages
      • Does not use semantic classes
  • WIEN Rule
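The simplest WIEN wrapper class, usually called LR (left-right), can be sketched as a list of left and right delimiters, one pair per item, applied in a fixed order. The code below is an illustration under that assumption rather than the rule from the original slide; the sample page and function name are hypothetical.

```python
def lr_extract(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per item.

    Items are assumed to appear in the same fixed order in every record.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


page = ("<HTML><BODY>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "</BODY></HTML>")

# Country names sit between <B> and </B>, country codes between <I> and </I>.
countries = lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])
print(countries)
```

Because the delimiters are scanned strictly in order, a record with a missing or reordered item would shift every later value into the wrong slot, which matches the drawbacks listed for WIEN.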
  • SoftMealy [1998]
    • Learns a transducer
    • Advantages
      • Learns order of items
      • Allows item permutations and missing items
      • Allows both the use of semantic classes and disjunctions
    • Drawbacks
      • Must see all possible permutations
      • Cannot use delimiters that do not immediately precede and follow the relevant items
  • SoftMealy Rule
  • STALKER [1998,1999,2001]
    • Hierarchical Information Extraction
    • Embedded Catalog Tree (ECT) Formalism
    • Advantages
      • Extracts nested data
      • Allows item permutations and missing items
      • Need not see all of the permutations
      • One hard-to-extract item does not affect others
    • Drawbacks
      • Does not exploit item order
  • STALKER Rule
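STALKER rules are commonly written as sequences of SkipTo(landmark) steps: a start rule consumes text from the beginning of the parent item, and an end rule works backward from its end. The sketch below assumes plain string landmarks (the real system uses tokens and wildcards) and is not the rule from the original slide; the page and rule contents are hypothetical.

```python
def skip_to(text, landmarks):
    """Apply SkipTo steps left to right; return the index after the last."""
    pos = 0
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx == -1:
            return None
        pos = idx + len(landmark)
    return pos


def skip_back_to(text, landmarks):
    """Apply SkipTo steps right to left; return the index where the last begins."""
    pos = len(text)
    for landmark in landmarks:
        idx = text.rfind(landmark, 0, pos)
        if idx == -1:
            return None
        pos = idx
    return pos


def extract(text, start_rule, end_rule):
    """Return the text between the start and end positions, or None."""
    begin = skip_to(text, start_rule)
    end = skip_back_to(text, end_rule)
    if begin is None or end is None or end < begin:
        return None
    return text[begin:end]


page = ("Name: <b>Yala</b><p>Cuisine: Thai<p>"
        "Address: <i>4000 Colfax, Phoenix, AZ 85258</i><p>"
        "Phone: <i>(602) 508-1570</i>")

address = extract(page, ["Address", "<i>"], ["Phone", "</i>"])
print(address)

# The same rules still work when another item (Cuisine) is missing.
shorter = "Name: <b>Pico</b><p>Address: <i>12 Pico St.</i><p>Phone: <i>555</i>"
print(extract(shorter, ["Address", "<i>"], ["Phone", "</i>"]))
```

Each item has its own pair of rules, so one hard-to-extract item does not affect the others, as the advantages above note.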
  • Web IE Tools (main technique used)
    • Wrapper languages (TSIMMIS, WebOQL)
    • HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
    • NLP-based (RAPIER, SRV, WHISK)
    • Inductive learning (WIEN, SoftMealy, STALKER)
    • Modeling-based (NoDoSE, DEByE)
    • Ontology-based (BYU ontology)
  • Degree of Automation
    • Trade-off: a higher degree of automation makes a tool more dependent on page layout
    • RoadRunner
      • Assumes target pages were automatically generated from some data source
      • The only fully automatic wrapper generator
    • BYU ontology
      • Manually created with graphical editing tool
      • Extraction process fully automatic
  • Support of Complex Objects
    • Complex objects: nested objects, graphs, trees, complex tables, …
    • Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects.
    • BYU ontology
      • Supports complex objects
  • Page Contents
    • Semistructured data (table type, richly tagged)
    • Semistructured text (text type, rarely tagged)
    • NLP-based tools: text type only
    • Other tools (except ontology-based): table type only
    • BYU ontology: both types
  • Ease of Use
    • HTML-aware tools, easiest to use
    • Wrapper languages, hardest to use
    • Other tools, in the middle
  • Output
    • XML is the best output format for data sharing on the Web.
  • Support for Non-HTML Sources
    • NLP-based and ontology-based tools support them automatically
    • Other tools may support them but need additional helpers, such as syntactic and semantic analyzers
    • BYU ontology
      • Supports non-HTML sources
  • Resilience and Adaptiveness
    • Resilience: continuing to work properly when the target pages change
    • Adaptiveness: working properly with pages from other sources in the same application domain
    • Only the BYU ontology has both features.
  • Summary of Qualitative Analysis
  • Graphical Perspective of Qualitative Analysis
  • Legend: X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus accommodates the required training data; ? means the system has the capability to some degree; * means the extraction pattern itself does not show the capability, but the overall system has it.
    • Columns: Name, Structure, Semi, Single-slot, Multi-slot, Missing items, Permutations, Resilient, Free, Nested data
    • Systems: WIEN, SoftMealy, STALKER, RAPIER, SRV, WHISK, BYU Onto, AutoSlog, RoadRunner
    • [Table: per-system capability marks; cell alignment not recoverable from the transcript]
  • Problem of IE (unstructured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • [Diagram labels: Source, Target, Information Extraction]
  • Problem of IE (structured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • [Diagram labels: Source, Target, Information Extraction]
  • Problem of IE (semistructured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • [Diagram labels: Source, Target, Information Extraction]
  • Solution of IE (the Semantic Web)
    • Meaning
    • Knowledge
    • Information
    • Data
    • [Diagram labels: Source, Target, Information Extraction]