osm.cs.byu.edu
Transcript

  • 1. Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding
  • 2. IR and IE
    • IR (Information Retrieval)
      • Retrieves relevant documents from collections
      • Information theory, probability theory, and statistics
    • IE (Information Extraction)
      • Extracts relevant information from documents
      • Machine learning, computational linguistics, and natural language processing
  • 3. History of IE
    • Large amount of both online and offline textual data.
    • Message Understanding Conference (MUC)
      • Quantitative evaluation of IE systems
      • Tasks
        • Latin American terrorism
        • Joint ventures
        • Microelectronics
        • Company management changes
  • 4. Evaluation Metrics
    • Precision
    • Recall
    • F-measure
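The three metrics on this slide can be sketched in a few lines. This is a minimal illustration, assuming extraction results are compared as sets of exact-match items; the function name `precision_recall_f`, the `beta` parameter, and the sample data are hypothetical and not taken from the slides. The F-measure is computed as the weighted harmonic mean (1 + β²)·P·R / (β²·P + R).

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score extracted items against a gold-standard answer key (exact match)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of gold-standard items that were extracted.
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical data: 4 items extracted, 3 of them correct, 5 in the answer key.
p, r, f = precision_recall_f({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"})
print(p, r, f)
```

With `beta=1.0` precision and recall are weighted equally; a larger `beta` favors recall.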
  • 5. Web Documents
    • Unstructured (Free) Text
      • Regular sentences and paragraphs
      • Linguistic techniques, e.g., NLP
    • Structured Text
      • Itemized information
      • Uniform syntactic clues, e.g., table understanding
    • Semistructured Text
      • Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
      • Specialized programs, e.g., wrappers
  • 6. Approaches to IE
    • Knowledge Engineering
      • Grammars are constructed by hand
      • Domain patterns are discovered by human experts through introspection and inspection of a corpus
      • Much laborious tuning and “hill climbing”
    • Machine Learning
      • Use statistical methods when possible
      • Learn rules from annotated corpora
      • Learn rules from interaction with user
  • 7. Knowledge Engineering
    • Advantages
      • With skill and experience, well-performing systems are not conceptually hard to develop.
      • The best performing systems have been hand crafted.
    • Disadvantages
      • Very laborious development process
      • Some changes to specifications can be hard to accommodate
      • Required expertise may not be available
  • 8. Machine Learning
    • Advantages
      • Domain portability is relatively straightforward
      • System expertise is not required for customization
      • “Data-driven” rule acquisition ensures full coverage of examples
    • Disadvantages
      • Training data may not exist, and may be very expensive to acquire
      • Large volume of training data may be required
      • Changes to specifications may require reannotation of large quantities of training data
  • 9. Wrapper
    • A specialized program that identifies data of interest and maps them to some suitable format (e.g. XML or relational tables)
    • Challenge: recognizing the data of interest among many other uninteresting pieces of text
    • Tasks
      • Source understanding
      • Data processing
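To make the wrapper definition concrete, here is a minimal hand-written sketch that identifies data of interest in a page and maps it to XML. The page content, the `ITEM` pattern, and the element names (`cars`, `car`, `model`, `price`) are all hypothetical; a real wrapper would be tailored to its own source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: two records of interest surrounded by irrelevant text.
PAGE = """
<html><body>
<h1>Used cars</h1>
<p>Call us today!</p>
<li><b>Honda Civic</b> <i>$4,500</i></li>
<li><b>Ford Focus</b> <i>$3,200</i></li>
</body></html>
"""

# Source understanding: one record is a list item with a bold model and an italic price.
ITEM = re.compile(r"<li><b>(?P<model>.*?)</b>\s*<i>\$(?P<price>[\d,]+)</i></li>")


def wrap(page):
    """Data processing: map every matched record to an XML element."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        ET.SubElement(car, "model").text = match.group("model")
        ET.SubElement(car, "price").text = match.group("price").replace(",", "")
    return root


xml_text = ET.tostring(wrap(PAGE), encoding="unicode")
print(xml_text)
```

The heading and the advertising paragraph are skipped because they do not match the record pattern, which is the "recognizing the data of interest" challenge in miniature.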
  • 10. Free Text
    • AutoSlog
    • LIEP
    • PALKA
    • HASTEN
    • CRYSTAL
      • Webfoot
    • WHISK
  • 11. AutoSlog [1993] The Parliament building was bombed by Carlos.
  • 12. LIEP [1995] The Parliament building was bombed by Carlos.
  • 13. PALKA [1995] The Parliament building was bombed by Carlos.
  • 14. HASTEN [1995] The Parliament building was bombed by Carlos.
    • Egraphs
    • (SemanticLabel, StructuralElement)
  • 15. CRYSTAL [1995] The Parliament building was bombed by Carlos.
  • 16. CRYSTAL + Webfoot [1997]
  • 17. WHISK [1999]
    • The Parliament building was bombed by Carlos.
    • WHISK Rule:
      • *( PhyObj )*@passive *F ‘bombed’ * {PP ‘by’ *F ( Person )}
    • Context-based patterns
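The idea of a context-based pattern can be approximated with an ordinary regular expression. This is only a rough analog and not WHISK itself: the rule above relies on semantic classes (PhyObj, Person) and syntactic tags (@passive, PP) that a plain regex does not model. The slot names `target` and `perpetrator` are assumptions made for illustration.

```python
import re

# The literal context " was bombed by " anchors the two slots on either side.
PATTERN = re.compile(r"^(?P<target>.+?) was bombed by (?P<perpetrator>.+?)\.$")


def extract(sentence):
    """Return the filled slots as a dict, or None if the context is absent."""
    match = PATTERN.match(sentence.strip())
    return match.groupdict() if match else None


result = extract("The Parliament building was bombed by Carlos.")
print(result)
```

An active-voice sentence such as "Carlos bombed the Parliament building." does not match, which is why real systems add syntactic analysis or learn several patterns per event type.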
  • 18. Web Documents
    • Semistructured and Unstructured
      • RAPIER (E. Califf, 1997)
      • SRV (D. Freitag, 1998)
      • WHISK (S. Soderland, 1998)
    • Semistructured and Structured
      • WIEN (N. Kushmerick, 1997)
      • SoftMealy (C-H. Hsu, 1998)
      • STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
  • 19. Inductive Learning
    • Task
    • Inductive Inference
    • Learning Systems
      • Zero-order
      • First-order, e.g., Inductive Logic Programming (ILP)
  • 20. RAPIER [1997]
    • Inductive Logic Programming
    • Extraction Rules
      • Syntactic information
      • Semantic information
    • Advantage
      • Efficient learning (bottom-up)
    • Drawback
      • Single-slot extraction
  • 21. RAPIER Rule
  • 22. SRV [1998]
    • Relational Algorithm (top-down)
    • Features
      • Simple features (e.g., length, character type, …)
      • Relational features (e.g., next-token, …)
    • Advantages
      • Expressive rule representation
    • Drawbacks
      • Single-slot rule generation
      • Requires a large volume of training data
  • 23. SRV Rule
  • 24. WHISK [1998]
    • Covering Algorithm (top-down)
    • Advantages
      • Learns multi-slot extraction rules
      • Handles various orderings of the items to be extracted
      • Handles document types from free text to structured text
    • Drawbacks
      • Must see all the permutations of items
      • Less expressive feature set
      • Needs a large volume of training data
  • 25. WHISK Rule
  • 26. WIEN [1997]
    • Assumes
      • Items are always in fixed, known order
    • Introduces several types of wrappers
    • Advantages
      • Fast to learn and extract
    • Drawbacks
      • Cannot handle permutations and missing items
      • Must label entire pages
      • Does not use semantic classes
  • 27. WIEN Rule
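The simplest wrapper class associated with WIEN, usually called LR (left-right), assigns each attribute a left and a right delimiter and scans the page in a fixed attribute order. The sketch below shows how such a wrapper could be executed once the delimiters are known; it does not cover how WIEN learns them. The function name `lr_extract` and the sample page are hypothetical.

```python
def lr_extract(page, delimiters):
    """Execute an LR-style wrapper.

    delimiters: list of (left, right) string pairs, one per attribute,
    in the fixed order the attributes appear in every record.
    """
    if not delimiters:
        return []
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples  # unterminated item: stop
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


# Hypothetical page listing countries and dialing codes.
page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
rows = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
print(rows)
```

Because the scan simply looks for the next delimiter in a fixed order, a record with a missing or reordered item would misalign the extracted fields, which matches the drawbacks listed for WIEN.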
  • 28. SoftMealy [1998]
    • Learns a transducer
    • Advantages
      • Learns order of items
      • Allows item permutations and missing items
      • Allows both the use of semantic classes and disjunctions
    • Drawbacks
      • Must see all possible permutations
      • Cannot use delimiters that do not immediately precede and follow the relevant items
  • 29. SoftMealy Rule
  • 30. STALKER [1998,1999,2001]
    • Hierarchical Information Extraction
    • Embedded Catalog Tree (ECT) Formalism
    • Advantages
      • Extracts nested data
      • Allows item permutations and missing items
      • Need not see all of the permutations
      • One hard-to-extract item does not affect others
    • Drawbacks
      • Does not exploit item order
  • 31. STALKER Rule
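STALKER extraction rules are commonly described as sequences of SkipTo(landmark) steps, with a separate rule for each item. The sketch below is a simplified illustration under several assumptions: landmarks are plain strings rather than token sequences with wildcards, the end of an item is found by a forward search instead of a rule applied backward from the end of the document, and the names `skip_to` and `extract` as well as the sample listings are hypothetical.

```python
def skip_to(text, landmarks, start=0):
    """Consume the landmarks in order; return the index after the last one, or None."""
    pos = start
    for landmark in landmarks:
        index = text.find(landmark, pos)
        if index == -1:
            return None
        pos = index + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Extract one item: skip to its start, then read up to the end landmark."""
    begin = skip_to(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# Hypothetical restaurant listings with the same items in different orders.
doc1 = "Name: Yala<p>Cuisine: Thai<p>Phone: <b>(800) 111-1717</b>"
doc2 = "Phone: <b>(800) 111-1717</b><p>Name: Yala<p>Cuisine: Thai<p>"

phone1 = extract(doc1, ["Phone:", "<b>"], "</b>")
name2 = extract(doc2, ["Name: "], "<p>")
print(phone1, name2)
```

Each item is located independently of the others, so a permuted or missing item does not disturb the rest; the cost, as the slide notes, is that item order is not exploited.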
  • 32. Web IE Tools (main technique used)
    • Wrapper languages (TSIMMIS, Web-OQL)
    • HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
    • NLP-based (RAPIER, SRV, WHISK)
    • Inductive learning (WIEN, SoftMealy, STALKER)
    • Modeling-based (NoDoSE, DEByE)
    • Ontology-based (BYU ontology)
  • 33. Degree of Automation
    • Trade-off: dependence on page layout
    • RoadRunner
      • Assumes target pages were automatically generated from some data source
      • The only fully automatic wrapper generator
    • BYU ontology
      • Manually created with graphical editing tool
      • Extraction process fully automatic
  • 34. Support of Complex Objects
    • Complex objects: nested objects, graphs, trees, complex tables, …
    • Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extracting complex objects.
    • BYU ontology
      • Supports complex objects
  • 35. Page Contents
    • Semistructured data (table type, richly tagged)
    • Semistructured text (text type, rarely tagged)
    • NLP-based tools: text type only
    • Other tools (except ontology-based): table type only
    • BYU ontology: both types
  • 36. Ease of Use
    • HTML-aware tools, easiest to use
    • Wrapper languages, hardest to use
    • Other tools, in the middle
  • 37. Output
    • XML is the best output format for data sharing on the Web.
  • 38. Support for Non-HTML Sources
    • NLP-based and ontology-based tools support non-HTML sources automatically
    • Other tools may support them but need additional helpers such as syntactic and semantic analyzers
    • BYU ontology
      • Supports non-HTML sources
  • 39. Resilience and Adaptiveness
    • Resilience: continuing to work properly when the target pages change
    • Adaptiveness: working properly with pages from other sources in the same application domain
    • Only the BYU ontology approach has both features.
  • 40. Summary of Qualitative Analysis
  • 41. Graphical Perspective of Qualitative Analysis
  • 42. Legend
    • X: the information extraction system has the capability
    • X*: the system has the capability as long as the training corpus can accommodate the required training data
    • ?: the system has the capability to some degree
    • *: the extraction pattern itself does not show the capability, but the overall system has it
    • Columns: Nested data, Free, Resilient, Permutations, Missing items, Multi-slot, Single-slot, Semi, Structure, Name
    • Rows: RoadRunner, AutoSlog, BYU Onto, WHISK, SRV, RAPIER, STALKER, SoftMealy, WIEN
    • [Table cells: the alignment of the individual marks is not recoverable from this transcript]
  • 43. Problem of IE (unstructured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • Diagram labels: Source, Target, Information Extraction
  • 44. Problem of IE (structured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • Diagram labels: Source, Target, Information Extraction
  • 45. Problem of IE (semistructured documents)
    • Meaning
    • Knowledge
    • Information
    • Data
    • Diagram labels: Source, Target, Information Extraction
  • 46. Solution of IE (the Semantic Web)
    • Meaning
    • Knowledge
    • Information
    • Data
    • Diagram labels: Source, Target, Information Extraction