IE for Semi-structured Document: Supervised Approach


  • For the 15-minute CONALD talk, skip slides 5, 9, 12, 13, 17-19, 21. For the 25-minute AIII talk, use all slides.
  • How many Web sites have these problems? 9 out of the 30 sites surveyed by Kushmerick in 1997. Semistructured data (see e.g. Buneman, PODS-97) and increasingly sophisticated Web CGI software mean the percentage will increase quickly, so a more powerful wrapper representation is needed.

    1. 1. Machine-learning based Semi-structured IE. Chia-Hui Chang, Department of Computer Science & Information Engineering, National Central University
    2. 2. Wrapper Induction <ul><li>Wrapper </li></ul><ul><ul><li>A program that extracts the desired information from Web pages. </li></ul></ul><ul><ul><li>Semi-structured document → (wrapper) → structured information. </li></ul></ul><ul><li>Web wrappers wrap... </li></ul><ul><ul><li>“Query-able” or “search-able” Web sites </li></ul></ul><ul><ul><li>Web pages with large itemized lists </li></ul></ul><ul><li>The primary issues are: </li></ul><ul><ul><li>How to build the extractor quickly? </li></ul></ul>
    3. 3. Semi-structured IE <ul><li>Developed independently of traditional IE </li></ul><ul><li>Driven by the need to extract and integrate data from multiple Web-based sources </li></ul>
    4. 4. Machine-Learning Based Approach <ul><li>A key component of IE systems is </li></ul><ul><ul><li>a set of extraction patterns </li></ul></ul><ul><ul><li>that can be generated by machine learning algorithms. </li></ul></ul>
    5. 5. Related Work <ul><li>Shopbot </li></ul><ul><ul><li>Doorenbos, Etzioni, Weld, AA-97 </li></ul></ul><ul><li>Ariadne </li></ul><ul><ul><li>Ashish, Knoblock, CoopIS-97 </li></ul></ul><ul><li>WIEN </li></ul><ul><ul><li>Kushmerick, Weld, Doorenbos, IJCAI-97 </li></ul></ul><ul><li>SoftMealy wrapper representation </li></ul><ul><ul><li>Hsu, IJCAI-99 </li></ul></ul><ul><li>STALKER </li></ul><ul><ul><li>Muslea, Minton, Knoblock, AA-99 </li></ul></ul><ul><ul><li>A hierarchical FST </li></ul></ul>
    6. 6. WIEN: N. Kushmerick, D. S. Weld, R. Doorenbos, University of Washington, 1997
    7. 7. Example 1
    8. 8. Extractor for Example 1
    9. 9. HLRT
    10. 10. Wrapper Induction <ul><li>Induction: </li></ul><ul><ul><li>The task of generalizing from labeled examples to a hypothesis </li></ul></ul><ul><li>Instances: pages </li></ul><ul><li>Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)} </li></ul><ul><li>Hypotheses: </li></ul><ul><ul><li>E.g. (<p>, <HR>, <B>, </B>, <I>, </I>) </li></ul></ul>
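A minimal sketch of how an HLRT hypothesis such as the one above could be executed against a page. The page text is an assumption reconstructed from the labels on the slide (the original example page is not reproduced in this transcript), and the function name `exec_hlrt` is illustrative rather than WIEN's own code.

```python
def exec_hlrt(page, head, tail, delimiters):
    """Extract tuples from `page` with an HLRT wrapper.

    head/tail bound the region that holds the tuples; `delimiters` is a list
    of (left, right) pairs, one pair per attribute.
    """
    tuples = []
    # Skip the page head: scanning starts just past the first `head`.
    pos = page.index(head) + len(head)
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            start = page.index(left, pos) + len(left)
            end = page.index(right, start)
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Assumed example page, built to match the labels on the slide.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)

# Hypothesis (h, t, l1, r1, l2, r2) = (<P>, <HR>, <B>, </B>, <I>, </I>)
result = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(result)
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```

The head delimiter keeps the bold page title from being read as a country, and the tail delimiter keeps the trailing `<B>End</B>` from starting a fifth tuple.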
    11. 11. BuildHLRT
    12. 12. Other Family <ul><li>OCLR (Open-Close-Left-Right) </li></ul><ul><ul><li>Use Open and Close as delimiters for each tuple </li></ul></ul><ul><li>HOCLRT </li></ul><ul><ul><li>Combine OCLR with Head and Tail </li></ul></ul><ul><li>N-LR and N-HLRT </li></ul><ul><ul><li>Nested LR </li></ul></ul><ul><ul><li>Nested HLRT </li></ul></ul>
    13. 13. Terminology <ul><li>Oracles </li></ul><ul><ul><li>Page Oracle </li></ul></ul><ul><ul><li>Label Oracle </li></ul></ul><ul><li>PAC analysis </li></ul><ul><ul><li>Determines how many examples are necessary to build a wrapper that meets two parameters: accuracy $\epsilon$ and confidence $\delta$: </li></ul></ul><ul><ul><li>$\Pr[E(w) < \epsilon] > 1 - \delta$, or equivalently $\Pr[E(w) > \epsilon] < \delta$ </li></ul></ul>
    14. 14. Probably Approximately Correct (PAC) Analysis <ul><li>With $\epsilon = 0.1$, $\delta = 0.1$, $K = 4$, and an average of 5 tuples per page, BuildHLRT must examine at least 72 examples </li></ul>
    15. 15. Empirical Evaluation <ul><ul><li>Extracts 48% of the Web pages successfully. </li></ul></ul><ul><ul><li>Weakness: </li></ul></ul><ul><ul><ul><li>Missing attributes, attributes not in order, tabular data, etc. </li></ul></ul></ul>
    16. 16. SoftMealy: Chun-Nan Hsu, Ming-Tzung Dung, Arizona State University, 1998
    17. 17. SoftMealy Architecture <ul><li>Finite-State Transducers for Semi-Structured Text Mining </li></ul><ul><ul><li>Labeling: use an interface to label examples manually. </li></ul></ul><ul><ul><li>Learner: FST (Finite-State Transducer) </li></ul></ul><ul><ul><li>Extractor: </li></ul></ul><ul><ul><li>Demonstration </li></ul></ul>
    18. 18. SoftMealy Wrapper <ul><li>SoftMealy wrapper representation </li></ul><ul><ul><li>Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path </li></ul></ul><ul><ul><li>Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes </li></ul></ul>
    19. 19. Example
    20. 20. Label the Answer Key: 4 cases
    21. 21. Finite State Transducer [Figure: transducer with states b, U, -U, N, -N, A, -A, M and e; transitions are labeled extract or skip.] Handles two additional cases: (N, M) and (N, A, M)
    22. 22. Find the starting position -- Single Pass <ul><li>Newly added definitions </li></ul>
    23. 23. Contextual based Rule Learning <ul><li>Tokens </li></ul><ul><li>Separators </li></ul><ul><ul><li>S L ::= … Punc(,) Spc(1) Html(<I>) </li></ul></ul><ul><ul><li>S R ::= C1Alph(Professor) Spc(1) OAlph(of) … </li></ul></ul><ul><li>Rule generalization </li></ul><ul><ul><li>Taxonomy Tree </li></ul></ul>
    24. 24. Tokens <ul><ul><li>All-uppercase string: CAlph </li></ul></ul><ul><ul><li>An uppercase letter followed by at least one lowercase letter: C1Alph </li></ul></ul><ul><ul><li>A lowercase letter followed by zero or more characters: OAlph </li></ul></ul><ul><ul><li>HTML tag: Html </li></ul></ul><ul><ul><li>Punctuation symbol: Punc </li></ul></ul><ul><ul><li>Control characters with a repeat count, e.g. NL(1), Tab(4), Spc(3) </li></ul></ul>
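The token classes above can be sketched as a small regex tokenizer. The regular expressions, the `Num` class (not listed on the slide, added so digits have a class) and the function name `tokenize` are assumptions made for illustration, and the sketch assumes ASCII input; it is not SoftMealy's own tokenizer.

```python
import re

# Token classes from the slide; the patterns themselves are assumptions.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("Html", r"<[^<>]+>"),            # HTML tag
    ("NL", r"\n+"),                   # control characters carry a count
    ("Tab", r"\t+"),
    ("Spc", r" +"),
    ("C1Alph", r"[A-Z][a-z]+"),       # uppercase letter + lowercase letters
    ("CAlph", r"[A-Z]+"),             # all-uppercase string
    ("OAlph", r"[a-z][A-Za-z]*"),     # starts with a lowercase letter
    ("Num", r"[0-9]+"),               # assumption: not listed on the slide
    ("Punc", r"[^A-Za-z0-9 \t\n]"),   # any other single character
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return tokens written as Class(argument), as on the slides."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind in ("NL", "Tab", "Spc"):
            tokens.append(f"{kind}({len(value)})")   # argument is the count
        else:
            tokens.append(f"{kind}({value})")        # argument is the string
    return tokens


print(tokenize(", <I>Professor of"))
# ['Punc(,)', 'Spc(1)', 'Html(<I>)', 'C1Alph(Professor)', 'Spc(1)', 'OAlph(of)']
```

The printed tokens correspond to the separator contexts shown on the previous slide: `Punc(,) Spc(1) Html(<I>)` on the left and `C1Alph(Professor) Spc(1) OAlph(of)` on the right.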
    25. 25. Rule Generalization
    26. 26. Learning Algorithm <ul><li>Generalize each column by replacing its tokens with their least common ancestor in the taxonomy tree </li></ul>
    27. 27. Taxonomy Tree
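Generalizing a column to the least common ancestor can be sketched as follows. The tree in `PARENT` is a hypothetical stand-in, since the slide's figure is not reproduced in this transcript, and the intermediate class names `Alph`, `Control` and `Any` are assumptions. The sketch also assumes the example rules are already aligned column by column.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "NL": "Control", "Tab": "Control", "Spc": "Control",
    "Alph": "Any", "Control": "Any", "Html": "Any", "Punc": "Any",
}


def ancestors(node):
    """Path from `node` up to the root, most specific first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def least_common_ancestor(nodes):
    """Lowest class in the taxonomy that covers every node."""
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        others = set(ancestors(node))
        common = [a for a in common if a in others]
    return common[0]


def generalize_column(tokens):
    """tokens: (class, argument) pairs that occupy the same column."""
    if len(set(tokens)) == 1:
        kind, value = tokens[0]
        return f"{kind}({value})"          # identical tokens stay as they are
    return least_common_ancestor([kind for kind, _ in tokens])


def generalize_rule(examples):
    """Generalize aligned example rules one column at a time."""
    return [generalize_column(list(column)) for column in zip(*examples)]


examples = [
    [("Punc", ","), ("Spc", 1), ("Html", "<I>"), ("C1Alph", "Professor")],
    [("Punc", ","), ("Spc", 1), ("Html", "<I>"), ("OAlph", "assistant")],
]
print(generalize_rule(examples))
# ['Punc(,)', 'Spc(1)', 'Html(<I>)', 'Alph']
```

Only the last column differs between the two examples, so only that column is lifted, and it is lifted no higher than the first class that covers both tokens.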
    28. 28. Generating to Extract the Body <ul><li>The contextual rules for the head and tail separators are: </li></ul><ul><li>h L ::= C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>) NL(1) Html(<UL>) </li></ul><ul><li>t R ::= Html(</UL>) NL(1) Html(<HR>) NL(1) Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please) </li></ul>
    29. 29. More Expressive Power <ul><li>Softmealy allows </li></ul><ul><ul><li>Disjunction </li></ul></ul><ul><ul><li>Multiple attribute orders within tuples </li></ul></ul><ul><ul><li>Missing attributes </li></ul></ul><ul><ul><li>Features of candidate strings </li></ul></ul>
    30. 30. STALKER <ul><ul><li>I. Muslea, S. Minton, C. Knoblock </li></ul></ul><ul><ul><li>University of Southern California </li></ul></ul>
    31. 31. STALKER <ul><li>Embedded Catalog Tree </li></ul><ul><ul><li>Leaves (primitive items): the data to be extracted. </li></ul></ul><ul><ul><li>Internal nodes (items): </li></ul></ul><ul><ul><ul><li>Homogeneous list, or </li></ul></ul></ul><ul><ul><ul><li>Heterogeneous tuple. </li></ul></ul></ul>
    32. 32. EC Tree of a page
    33. 33. Extracting Data from a Document <ul><li>For each node in the EC tree, the wrapper needs a rule that extracts that particular node from its parent. </li></ul><ul><li>Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples. </li></ul><ul><li>Advantages: </li></ul><ul><ul><li>First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data. </li></ul></ul><ul><ul><li>Second, as each node is extracted independently of its siblings, the approach does not rely on a fixed ordering of the items, and it easily handles documents that have missing items or items that appear in various orders. </li></ul></ul>
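A rough sketch of hierarchical extraction over an EC tree, assuming each node carries one extraction rule and each list node carries one iteration rule. The `Node` class, the `between` rule builder, the regex used as the iteration rule and the sample document are all hypothetical; real STALKER rules are landmark automata, not literal string searches.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    extract: Callable[[str], Optional[str]]    # rule: parent text -> own text
    children: List["Node"] = field(default_factory=list)
    iterate: Optional[Callable[[str], List[str]]] = None  # list nodes only


def between(start, end):
    """Hypothetical rule: the text between two literal landmarks, or None."""
    def rule(text):
        i = text.find(start)
        if i == -1:
            return None
        i += len(start)
        j = text.find(end, i)
        return None if j == -1 else text[i:j]
    return rule


def extract_node(node, parent_text):
    text = node.extract(parent_text)
    if text is None:
        return None                            # missing item
    if node.iterate is not None:               # list: split, then recurse
        return [
            {child.name: extract_node(child, item) for child in node.children}
            for item in node.iterate(text)
        ]
    if node.children:                          # tuple: recurse into children
        return {child.name: extract_node(child, text) for child in node.children}
    return text                                # leaf: the extracted data


DOC = (
    "<h1>Yala</h1><ul>"
    "<li><b>Phoenix</b> <i>508-1570</i></li>"
    "<li><i>555-0100</i> <b>Tempe</b></li>"
    "<li><b>Mesa</b></li>"
    "</ul>"
)

tree = Node("restaurant", lambda text: text, children=[
    Node("name", between("<h1>", "</h1>")),
    Node("branches", between("<ul>", "</ul>"),
         iterate=lambda text: re.findall(r"<li>(.*?)</li>", text),
         children=[Node("city", between("<b>", "</b>")),
                   Node("phone", between("<i>", "</i>"))]),
])

result = extract_node(tree, DOC)
```

Because `city` and `phone` are each extracted from the list item independently, the second branch (phone before city) and the third branch (no phone, which comes back as `None`) need no special handling.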
    34. 34. Extraction Rules as Finite Automata <ul><li>Landmarks </li></ul><ul><ul><li>A sequence of tokens and wildcards </li></ul></ul><ul><li>Landmark automata </li></ul><ul><ul><li>A non-deterministic finite automaton </li></ul></ul>
    35. 35. Landmark Automata <ul><ul><li>A linear LA has one accepting state; </li></ul></ul><ul><ul><li>from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state; </li></ul></ul><ul><ul><li>each non-looping transition is labeled by a landmark; </li></ul></ul><ul><ul><li>all looping transitions have the meaning “consume all tokens until you encounter the landmark that leads to the next state”. </li></ul></ul>
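A minimal sketch of running a linear landmark automaton over a token sequence. The wildcard definitions in `WILDCARDS`, the function names and the sample tokens are assumptions for illustration; the wildcard names themselves are the ones that appear on the rule-generation slide.

```python
# Assumed wildcard semantics; only the wildcard names come from the slides.
WILDCARDS = {
    "_HtmlTag_": lambda tok: tok.startswith("<") and tok.endswith(">"),
    "_Symbol_": lambda tok: len(tok) == 1 and not tok.isalnum(),
    "_Word_": lambda tok: tok.isalpha(),
}


def matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def landmark_at(tokens, i, landmark):
    """True if the whole landmark matches the tokens starting at index i."""
    if i + len(landmark) > len(tokens):
        return False
    return all(matches(p, tokens[i + k]) for k, p in enumerate(landmark))


def run_linear_la(tokens, landmarks):
    """Return the index just past the last landmark, or None if rejected."""
    pos = 0
    for landmark in landmarks:        # one non-looping transition per landmark
        # Looping transition: consume tokens until the landmark is found.
        while not landmark_at(tokens, pos, landmark):
            pos += 1
            if pos >= len(tokens):
                return None           # input exhausted before accepting
        pos += len(landmark)          # take the transition to the next state
    return pos


tokens = ["Name", ":", "<b>", "Yala", "</b>", "<p>",
          "Cuisine", ":", "Thai", "<p>", "<i>", "4000", "Colfax"]

start = run_linear_la(tokens, [["Cuisine", "_Symbol_"], ["_HtmlTag_", "<i>"]])
print(start, tokens[start])
# 11 4000
```

Each entry in the landmark list corresponds to one state-to-state transition, so a rule with two landmarks is an automaton with two non-accepting states followed by the accepting state.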
    36. 36. Rule Generating (extracting the Credit info). 1st iteration: terminals: {; reservation _Symbol_ _Word_}; candidates: {; <i> _Symbol_ _HtmlTag_}; perfect disjuncts: {<i> _HtmlTag_}; positive examples: D3, D4. 2nd iteration: uncovered: {D1, D2}; candidates: {; _Symbol_}.
    37. 37. Possible Rules
    38. 40. The STALKER Algorithm
    39. 43. Features <ul><li>Process is performed in a hierarchical manner. </li></ul><ul><li>No problem with attributes that are not in order. </li></ul><ul><li>Disjunctive rules can handle missing attributes. </li></ul>
    40. 44. Multi-pass SoftMealy: Chun-Nan Hsu and Chian-Chi Chang, Institute of Information Science, Academia Sinica, Taipei, Taiwan
    41. 45. Multi-pass
    42. 46. Tabular style document (Quote Server)
    43. 47. Tagged-list style document (Internet Address Finder)
    44. 48. Layout styles and learnability <ul><li>Tabular style </li></ul><ul><ul><li>missing attributes, ordering as hints </li></ul></ul><ul><li>Tagged-list style </li></ul><ul><ul><li>variant ordering, tags as hints </li></ul></ul><ul><li>Prediction </li></ul><ul><ul><li>single-pass for tabular style </li></ul></ul><ul><ul><li>multi-pass for tagged-list style </li></ul></ul>
    45. 49. Tabular result (Quote Server)
    46. 50. Tagged-list result (Internet Address Finder)
    47. 51. Comparison <ul><li>Both: </li></ul><ul><ul><li>Can handle irregular missing attributes. </li></ul></ul><ul><ul><li>Need training for attributes not seen before </li></ul></ul><ul><li>Single-pass: </li></ul><ul><ul><li>Only a limited set of attribute permutations is allowed </li></ul></ul><ul><ul><li>Single-pass is good for tabular pages </li></ul></ul><ul><ul><li>Faster </li></ul></ul><ul><li>Multi-pass: </li></ul><ul><ul><li>Attribute permutations have no effect </li></ul></ul><ul><ul><li>Multi-pass is good for tagged-list pages </li></ul></ul><ul><ul><li>Slower </li></ul></ul>
    48. 52. Comparison <ul><li>Quote Server </li></ul><ul><ul><li>STALKER: 10 example tuples, 79%, 500 test </li></ul></ul><ul><ul><li>WIEN: the collection is beyond the learner's capability </li></ul></ul><ul><ul><li>SoftMealy: multi-pass 85%, single-pass 97% </li></ul></ul><ul><li>Internet Address Finder </li></ul><ul><ul><li>STALKER: 80% ~ 100%, 500 test </li></ul></ul><ul><ul><li>WIEN: the collection is beyond the learner's capability </li></ul></ul><ul><ul><li>SoftMealy: multi-pass 68%, single-pass 41% </li></ul></ul>
    49. 53. Comparison <ul><li>Okra (tabular pages) </li></ul><ul><ul><li>STALKER: 97%, 1 example tuple </li></ul></ul><ul><ul><li>WIEN: 100%, 13 example tuples, 30 test </li></ul></ul><ul><ul><li>SoftMealy: single-pass 100%, 1 example tuple, 30 test </li></ul></ul><ul><li>Big-book (tagged-list pages) </li></ul><ul><ul><li>STALKER: 97%, 8 example tuples </li></ul></ul><ul><ul><li>WIEN: perfect, 18 example tuples, 30 test </li></ul></ul><ul><ul><li>SoftMealy: single-pass 97%, 4 examples, 30 test </li></ul></ul><ul><ul><li>multi-pass 100%, 6 examples, 30 test </li></ul></ul>
    50. 54. References <ul><li>N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, Special Issue on Intelligent Internet Systems, 2000. </li></ul><ul><li>Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems, 23(8):521-538, Special Issue on Semistructured Data, 1998. </li></ul><ul><li>Ion Muslea, Steve Minton, and Craig Knoblock. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001. </li></ul>