Information Extraction from Web Documents. CS 652: Information Extraction and Integration. Li Xu, Yihong Ding
osm.cs.byu.edu

  1. Information Extraction from Web Documents. CS 652: Information Extraction and Integration. Li Xu, Yihong Ding
  2. IR and IE
     - IR (Information Retrieval)
       - Retrieves relevant documents from collections
       - Information theory, probabilistic theory, and statistics
     - IE (Information Extraction)
       - Extracts relevant information from documents
       - Machine learning, computational linguistics, and natural language processing
  3. History of IE
     - Large amount of both online and offline textual data
     - Message Understanding Conference (MUC)
       - Quantitative evaluation of IE systems
       - Tasks
         - Latin American terrorism
         - Joint ventures
         - Microelectronics
         - Company management changes
  4. Evaluation Metrics
     - Precision
     - Recall
     - F-measure
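The slide only names the three metrics. As a minimal sketch, assuming extractions and the answer key are compared as sets of exact-match items, they can be computed as follows. The function name `precision_recall_f` and the `beta` weighting parameter are illustrative choices, not something the slides define.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Return (precision, recall, F-measure) for exact-match extractions.

    extracted: items the system produced (hypothetical input format)
    gold: items in the answer key
    beta: weight of recall relative to precision (1.0 gives the balanced F1)
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of answer-key items that were extracted.
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
    print(p, r, f)
```

With four extracted items, three answer-key items, and two in common, precision is 2/4 and recall is 2/3.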
  5. Web Documents
     - Unstructured (Free) Text
       - Regular sentences and paragraphs
       - Linguistic techniques, e.g., NLP
     - Structured Text
       - Itemized information
       - Uniform syntactic clues, e.g., table understanding
     - Semistructured Text
       - Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
       - Specialized programs, e.g., wrappers
  6. Approaches to IE
     - Knowledge Engineering
       - Grammars are constructed by hand
       - Domain patterns are discovered by human experts through introspection and inspection of a corpus
       - Much laborious tuning and “hill climbing”
     - Machine Learning
       - Use statistical methods when possible
       - Learn rules from annotated corpora
       - Learn rules from interaction with the user
  7. Knowledge Engineering
     - Advantages
       - With skill and experience, well-performing systems are not conceptually hard to develop
       - The best-performing systems have been handcrafted
     - Disadvantages
       - Very laborious development process
       - Some changes to specifications can be hard to accommodate
       - Required expertise may not be available
  8. Machine Learning
     - Advantages
       - Domain portability is relatively straightforward
       - System expertise is not required for customization
       - “Data-driven” rule acquisition ensures full coverage of examples
     - Disadvantages
       - Training data may not exist, and may be very expensive to acquire
       - A large volume of training data may be required
       - Changes to specifications may require reannotation of large quantities of training data
  9. Wrapper
     - A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables)
     - Challenge: recognizing the data of interest among many other uninteresting pieces of text
     - Tasks
       - Source understanding
       - Data processing
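To make the wrapper definition concrete, here is a minimal hand-written sketch that identifies data of interest in a page and maps it to XML. The page snippet, the regular expression, and the element names (`books`, `book`, `title`, `price`) are all hypothetical and are not taken from the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical source page: the data of interest sits among other markup.
PAGE = (
    "<h1>Today's offers</h1>"
    "<ul><li>Title: Dune | Price: $9.99</li>"
    "<li>Title: Emma | Price: $4.50</li></ul>"
    "<p>Contact us for details.</p>"
)

# Hand-written pattern that picks out one record (title and price).
RECORD = re.compile(r"<li>Title: (.*?) \| Price: \$(.*?)</li>")


def wrap(page):
    """Identify the records in the page and map them to an XML string."""
    root = ET.Element("books")
    for title, price in RECORD.findall(page):
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = title
        ET.SubElement(book, "price").text = price
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(wrap(PAGE))
```

A wrapper of this kind is tied to one page layout: if the site changes its markup, the pattern has to be rewritten.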
  10. Free Text
     - AutoSlog
     - LIEP
     - PALKA
     - HASTEN
     - CRYSTAL
       - Webfoot
     - WHISK
  11. AutoSlog [1993]
     - The Parliament building was bombed by Carlos.
  12. LIEP [1995]
     - The Parliament building was bombed by Carlos.
  13. PALKA [1995]
     - The Parliament building was bombed by Carlos.
  14. HASTEN [1995]
     - The Parliament building was bombed by Carlos.
     - Egraphs
     - (SemanticLabel, StructuralElement)
  15. CRYSTAL [1995]
     - The Parliament building was bombed by Carlos.
  16. CRYSTAL + Webfoot [1997]
  17. WHISK [1999]
     - The Parliament building was bombed by Carlos.
     - WHISK Rule:
       - *( PhyObj )*@passive *F ‘bombed’ * {PP ‘by’ *F ( Person )}
     - Context-based patterns
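The rule above extracts two slots, a physical object and a person, from the context around ‘bombed’ and ‘by’. The following is only a plain regular-expression analog of that idea, not WHISK itself: it ignores the syntactic tags (`@passive`, `PP`) and the semantic classes (`PhyObj`, `Person`) and matches the surface string alone. The group names `target` and `perpetrator` are assumptions made for illustration.

```python
import re

# Surface-string analog of the context pattern: whatever precedes
# "was bombed by" fills the first slot, whatever follows fills the second.
RULE = re.compile(r"^(?P<target>.+?) was bombed by (?P<perpetrator>.+?)\s*\.$")


def apply_rule(sentence):
    """Return both slots as a dict, or None if the sentence does not match."""
    match = RULE.match(sentence.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(apply_rule("The Parliament building was bombed by Carlos."))
```

Because it has no semantic classes, this analog would also accept fillers that are not a physical object or a person, which is the kind of constraint the original rule expresses.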
  18. Web Documents
     - Semistructured and Unstructured
       - RAPIER (E. Califf, 1997)
       - SRV (D. Freitag, 1998)
       - WHISK (S. Soderland, 1998)
     - Semistructured and Structured
       - WIEN (N. Kushmerick, 1997)
       - SoftMealy (C-H. Hsu, 1998)
       - STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
  19. Inductive Learning
     - Task
     - Inductive Inference
     - Learning Systems
       - Zero-order
       - First-order, e.g., Inductive Logic Programming (ILP)
  20. RAPIER [1997]
     - Inductive Logic Programming
     - Extraction Rules
       - Syntactic information
       - Semantic information
     - Advantage
       - Efficient learning (bottom-up)
     - Drawback
       - Single-slot extraction
  21. RAPIER Rule
  22. SRV [1998]
     - Relational Algorithm (top-down)
     - Features
       - Simple features (e.g., length, character type, …)
       - Relational features (e.g., next-token, …)
     - Advantages
       - Expressive rule representation
     - Drawbacks
       - Single-slot rule generation
       - Needs a large volume of training data
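The two feature kinds named on the slide can be illustrated with a small sketch. This shows only what such features might look like on a tokenized text; it is not SRV's learning algorithm, and the function names and the particular features chosen are assumptions.

```python
def simple_features(token):
    """Simple features: properties of a single token (length, character type)."""
    return {
        "length": len(token),
        "capitalized": token[:1].isupper(),
        "numeric": token.isdigit(),
        "all_upper": token.isupper(),
    }


def next_token(tokens, i):
    """Relational feature: maps the token at position i to its right neighbor."""
    return tokens[i + 1] if i + 1 < len(tokens) else None


def prev_token(tokens, i):
    """Relational feature: maps the token at position i to its left neighbor."""
    return tokens[i - 1] if i > 0 else None


if __name__ == "__main__":
    tokens = "Seminar in Room 3405 at noon".split()
    for i, tok in enumerate(tokens):
        # A test that combines both kinds of feature: a numeric token
        # whose previous token is capitalized.
        prev = prev_token(tokens, i)
        if simple_features(tok)["numeric"] and prev and simple_features(prev)["capitalized"]:
            print("candidate:", tok)
```

Relational features let a rule test properties of neighboring tokens by composing them with simple features, as the final check in the usage example does.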
  23. SRV Rule
  24. WHISK [1998]
     - Covering Algorithm (top-down)
     - Advantages
       - Learns multi-slot extraction rules
       - Handles various orders of the items to be extracted
       - Handles document types from free text to structured text
     - Drawbacks
       - Must see all the permutations of items
       - Less expressive feature set
       - Needs a large volume of training data
  25. WHISK Rule
  26. WIEN [1997]
     - Assumes
       - Items are always in a fixed, known order
     - Introduces several types of wrappers
     - Advantages
       - Fast to learn and extract
     - Drawbacks
       - Cannot handle permutations and missing items
       - Must label entire pages
       - Does not use semantic classes
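The fixed-order assumption can be illustrated with a sketch of a left-right delimiter wrapper, in which each item is located by the string that precedes it and the string that follows it. The sketch covers only the extraction step with delimiters given by hand; learning the delimiters from labeled pages is not shown, and the page and delimiter strings are hypothetical.

```python
def lr_extract(page, delimiters):
    """Extract tuples from a page using (left, right) delimiter pairs.

    delimiters: one (left, right) pair per item, in the fixed order in
    which the items appear in every record.
    """
    if not delimiters:
        return []
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated item: stop
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


if __name__ == "__main__":
    page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
    print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
```

The sketch also shows why missing items are a problem for this approach: if one record lacked its `<i>` item, the scan would run ahead to the next record's `<i>` and misalign the tuples.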
  27. WIEN Rule
  28. SoftMealy [1998]
     - Learns a transducer
     - Advantages
       - Learns order of items
       - Allows item permutations and missing items
       - Allows the use of both semantic classes and disjunctions
     - Drawbacks
       - Must see all possible permutations
       - Cannot use delimiters that do not immediately precede and follow the relevant items
  29. SoftMealy Rule
  30. STALKER [1998, 1999, 2001]
     - Hierarchical Information Extraction
     - Embedded Catalog Tree (ECT) Formalism
     - Advantages
       - Extracts nested data
       - Allows item permutations and missing items
       - Need not see all of the permutations
       - One hard-to-extract item does not affect others
     - Drawbacks
       - Does not exploit item order
  31. STALKER Rule
  32. Web IE Tools (main technique used)
     - Wrapper languages (TSIMMIS, Web-OQL)
     - HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
     - NLP-based (RAPIER, SRV, WHISK)
     - Inductive learning (WIEN, SoftMealy, STALKER)
     - Modeling-based (NoDoSE, DEByE)
     - Ontology-based (BYU ontology)
  33. Degree of Automation
     - Trade-off: page-layout dependent
     - RoadRunner
       - Assumes target pages were automatically generated from some data sources
       - The only fully automatic wrapper generator
     - BYU ontology
       - Manually created with a graphical editing tool
       - Extraction process fully automatic
  34. Support of Complex Objects
     - Complex objects: nested objects, graphs, trees, complex tables, …
     - Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects
     - BYU ontology
       - Supported
  35. Page Contents
     - Semistructured data (table type, richly tagged)
     - Semistructured text (text type, rarely tagged)
     - NLP-based tools: text type only
     - Other tools (except ontology-based): table type only
     - BYU ontology: both types
  36. Ease of Use
     - HTML-aware tools: easiest to use
     - Wrapper languages: hardest to use
     - Other tools: in the middle
  37. Output
     - XML is the best output format for data sharing on the Web
  38. Support for Non-HTML Sources
     - NLP-based and ontology-based tools: supported automatically
     - Other tools: may support them but need additional helpers, such as syntactic and semantic analyzers
     - BYU ontology
       - Supported
  39. Resilience and Adaptiveness
     - Resilience: continuing to work properly when the target pages change
     - Adaptiveness: working properly with pages from other sources in the same application domain
     - Only the BYU ontology has both features
  40. Summary of Qualitative Analysis
  41. Graphical Perspective of Qualitative Analysis
  42. Legend
     - X: the information extraction system has the capability
     - X*: the system has the capability as long as the training corpus can accommodate the required training data
     - ?: the system has the capability to some degree
     - *: the extraction pattern itself does not show the capability, but the overall system has it
     - Column headings as extracted: Nested data, Free, Resilient, Permutations, Missing items, Multi-slot, Single-slot, Semi, Structure, Name
     - Cell values as extracted: X X X ? X X ? ? X X X X RoadRunner X X X AutoSlog X X X X X X X BYU Onto ? X* X X X X X WHISK ? X X X X X SRV ? X X X X X RAPIER X X * X X X STALKER X* X X X X X SoftMealy X X X WIEN
  43. Problem of IE (unstructured documents)
     - Meaning
     - Knowledge
     - Information
     - Data
     - Diagram labels: Source, Target, Information Extraction
  44. Problem of IE (structured documents)
     - Meaning
     - Knowledge
     - Information
     - Data
     - Diagram labels: Source, Target, Information Extraction
  45. Problem of IE (semistructured documents)
     - Meaning
     - Knowledge
     - Information
     - Data
     - Diagram labels: Source, Target, Information Extraction
  46. Solution of IE (the Semantic Web)
     - Meaning
     - Knowledge
     - Information
     - Data
     - Diagram labels: Source, Target, Information Extraction
