Information Extraction from the Web - Algorithms and Tools

Published in: Technology, Education

  1. 1. Information Extraction from the Web: Algorithms and Tools. Benjamin Habegger, University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205. Seminar on Information Extraction from the Web, ENSIAS, Rabat, Morocco, June 19, 2013
  2. 2. About Me @b_habegger http://www.linkedin.com/in/benjaminhabegger benjamin.habegger@insa-lyon.fr
  3. 3. Overview ● Fundamentals of information extraction from the web – Document representations – Approaches ● Algorithms to extract information from semi-structured web content – Wien, Stalker, DIPRE, IERel ● Tools to describe and run web scrapers – WetDL, WebSource ● Applications and extensions of information extraction – Making our human web smarter – Learning mappings for data integration
  4. 4. What types of data are we talking about?
  5. 5. Types of data on the Web ● Structured ● Unstructured ● Semi-structured
  8. 8. Semi-structured data ● Usually, but not limited to, data from a database formatted as HTML ● Listings of entities ● Presented in a “regular” presentation format
  9. 9. Multiple possible representations ● Rendered page ● (DOM) Tree ● HTML string: <tr class="participant"> <td class="pname" id="part1968752570"> […] <div class="pname">Benjamin</div> </td> […] </tr>
  10. 10. What do we want to do with those documents?
  11. 11. Information extraction from the web ● Sources: monster.fr, apec.fr, remixjobs.com ● Target: Job Database
  12. 12. Information extraction from the web ● Extract data from one or more web sites ● Wrap it into a predefined target format
  13. 13. How do we do this?
  14. 14. Wrappers (scrapers) ● Sources: monster.fr, apec.fr, remixjobs.com ● Target: Job Database
  15. 15. Algorithms to learn wrappers ● Wien ● Stalker ● SoftMealy ● IEPad ● RoadRunner ● DIPRE ● IERel ● TreePat Miner ● Squirrel
  16. 16. Wrapper representations ● A program ● A transducer (string or tree) ● A regular expression ● A tree pattern ● A query
  17. 17. Document and wrapper representations
      Algorithm               | Document Model  | Query/Wrapper Model
      Wien [Kushmerick]       | String          | LR-Patterns
      Stalker [Muslea]        | String          | Delimiter-rules
      SoftMealy [Hsu & Dung]  | Analysed String | Transducer
      IERel [Habegger]        | HTML String     | *-Patterns
      Squirrel [Carme]        | DOM Tree        | Tree Automata
      Habegger & Debarbieux   | DOM Tree        | Tree-Pattern Queries
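
To make the "String / LR-Patterns" row concrete, here is a minimal sketch of how an LR wrapper could be applied once it has been learned. It assumes one (left, right) delimiter pair per attribute; the function name `lr_extract` and the sample page are illustrative and are not taken from the Wien implementation.

```python
def lr_extract(page, delimiters):
    """Apply an LR wrapper: for each attribute, return the text found
    between its left and right delimiter, scanning the page left to right."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further tuple on the page
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated attribute: stop
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page listing (country, code) pairs.
page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
rows = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
print(rows)
```

Learning the wrapper then amounts to finding delimiter pairs that are consistent with every labeled example, which is why this family of algorithms needs fully labeled pages.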
  18. 18. SoftMealy
  19. 19. SoftMealy ● Input: – Completely labeled document ● Preprocessing: – Tokenize input string ● Output: – A transducer
  20. 20. SoftMealy: Document Representation
      Symbol    | Description
      CAlph(x)  | String composed of only capitals
      C1Alph(x) | String starting with a capital
      Num(x)    | Numerical string
      Html(x)   | An HTML tag
      OAlph(x)  | String of alpha-numerical characters
      Punc(x)   | Punctuation symbol
      NL(n)     | n line feeds
      Tab(n)    | n tabulations
      Spc(n)    | n spaces
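
The symbol classes above suggest a tokenization step along the following lines. This is only a sketch under assumptions: the regular expression, the precedence between classes (Num, then CAlph, then C1Alph, then OAlph) and the names `TOKEN_RE`, `classify` and `tokenize` are choices made for illustration, not the SoftMealy implementation.

```python
import re

# One alternative per coarse token kind; words are refined by classify().
TOKEN_RE = re.compile(
    r"(?P<Html><[^>]+>)|(?P<NL>\n+)|(?P<Tab>\t+)|(?P<Spc> +)"
    r"|(?P<Word>[A-Za-z0-9]+)|(?P<Punc>.)"
)


def classify(word):
    """Map an alphanumeric word to one of the slide's symbol classes."""
    if word.isdigit():
        return "Num"
    if word.isalpha() and word.isupper():
        return "CAlph"
    if word[0].isupper():
        return "C1Alph"
    return "OAlph"


def tokenize(text):
    """Turn a raw HTML string into a list of symbols such as 'Num(1998)'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind in ("NL", "Tab", "Spc"):
            tokens.append(f"{kind}({len(value)})")   # whitespace keeps its count
        elif kind == "Word":
            tokens.append(f"{classify(value)}({value})")
        else:
            tokens.append(f"{kind}({value})")
    return tokens


tokens = tokenize("<b>Prof.</b> Hsu  1998 MIT")
print(tokens)
```

The transducer is then learned over these symbols rather than over raw characters, which is what lets the patterns take formatting into account.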
  21. 21. SoftMealy: Algorithm N E O
  22. 22. SoftMealy: Results
  23. 23. SoftMealy: Conclusion ● String-based wrapper induction algorithm ● Patterns which take format into account → Improvement over WIEN ● Like WIEN and Stalker – requires extensive labeling – “batch” approach
  24. 24. RoadRunner
  25. 25. RoadRunner ● Input: – Collection of sample pages ● Algorithm – Induce structural pattern from the pages ● Output – A DTD-like schema structure for the documents
  26. 26. RoadRunner: Example
  27. 27. RoadRunner: Results
  28. 28. RoadRunner ● Wraps regularities into a page pattern – Compacts structure ● Structural items of the inferred schema are NOT mapped to a target schema ● Option: use the output as input to a mapping-mining algorithm
  29. 29. DIPRE
  30. 30. DIPRE [Brin1998] ● Input: – Example instances of a relation to be extracted – A collection of web documents ● Output: – Patterns to be applied to the collection – (New) instances extracted using the patterns
  31. 31. DIPRE: Relation extraction from a web cache Web Cache Relation Instances Very Basic Extraction Patterns
  32. 32. DIPRE ● Interesting cyclic process ● Very (too) simple patterns for IE ● Problem of over-generalization ● The pattern set drifts away from its extraction target
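
The cyclic process can be sketched as follows. This is a deliberately simplified illustration, not Brin's original algorithm: a pattern here is just a (prefix, middle, suffix) triple of literal strings, the context width and the 20-character limit on the middle are arbitrary assumptions, and the function names and sample document are hypothetical.

```python
import re


def find_patterns(docs, pairs, width=4):
    """Locate known (a, b) pairs and keep the text around and between them."""
    patterns = set()
    for doc in docs:
        for a, b in pairs:
            occurrence = re.escape(a) + r"(.{1,20}?)" + re.escape(b)
            for m in re.finditer(occurrence, doc):
                prefix = doc[max(0, m.start() - width):m.start()]
                suffix = doc[m.end():m.end() + width]
                patterns.add((prefix, m.group(1), suffix))
    return patterns


def apply_patterns(docs, patterns):
    """Use each pattern as a regular expression to extract new pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle)
            + r"(.+?)" + re.escape(suffix)
        )
        for doc in docs:
            found.update(regex.findall(doc))
    return found


def dipre(docs, seeds, rounds=2):
    """Alternate pattern induction and extraction for a fixed number of rounds."""
    pairs = set(seeds)
    for _ in range(rounds):
        pairs |= apply_patterns(docs, find_patterns(docs, pairs))
    return pairs


docs = ["<li><b>Isaac Asimov</b> wrote <i>Foundation</i></li>"
        "<li><b>Frank Herbert</b> wrote <i>Dune</i></li>"]
pairs = dipre(docs, {("Isaac Asimov", "Foundation")})
print(sorted(pairs))
```

Because nothing in the loop checks the quality of a pattern, one overly general pattern can add wrong pairs that in turn produce more bad patterns, which is the drift mentioned above.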
  33. 33. IERel
  34. 34. IERel ● Input: – Examples of a relation to be extracted ● Algorithm – Extract patterns & generalize them ● Output – Extraction patterns
  35. 35. IERel: Document representation ● HTML string: <tr class="participant"> <td class="pname" id="part1968752570"> […] <div class="pname"> B e n j a m i n </div> </td> […] </tr> ● Token sequence: §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§
  36. 36. IERel: Generalization ● HTML string: <tr class="participant"> <td class="pname" id="part825438027"> […] <div class="pname"> M o h a m e d </div> </td> […] </tr> ● Token sequence: §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§
  37. 37. IERel: Generalization ● Example 1: §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§ ● Example 2: §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§ ● Generalized pattern: §1§ * […] §3§ * §4§ §5§ […] §6§
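
One way to picture this generalization step is to align the two token sequences and replace every region where they differ by a single `*`. The sketch below uses Python's `difflib.SequenceMatcher` for the alignment, which is a stand-in chosen for illustration and not necessarily how IERel aligns sequences. It also treats each text node as one token, whereas the slides show character-level tokens.

```python
from difflib import SequenceMatcher


def generalize(a, b):
    """Keep the tokens shared by both sequences; collapse each differing
    region into a single '*' wildcard."""
    pattern = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            pattern.extend(a[i1:i2])
        elif not pattern or pattern[-1] != "*":
            pattern.append("*")          # avoid emitting consecutive wildcards
    return pattern


# Tag tokens are numbered as on the slides; the elided parts ([…]) are left out.
example_1 = ["§1§", "§2§", "§3§", "Benjamin", "§4§", "§5§", "§6§"]
example_2 = ["§1§", "§7§", "§3§", "Mohamed", "§4§", "§5§", "§6§"]
pattern = generalize(example_1, example_2)
print(" ".join(pattern))
```

Note that the tag carrying the `id` attribute is generalized away along with the name itself, since its token differs between the two examples.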
  38. 38. IERel: Interactive Learning ● Examples ● Extracted Results ● Patterns ● Refined Patterns ● New examples / Negate wrong ones ● Results using refined patterns
  39. 39. Coping with over-generalization ● Learn a set of patterns, i.e. a disjunction of conjunctions
  40. 40. IERel: Pattern construction
  41. 41. IERel: Evaluation ● Multiple tested domains – Online directories – Search engine results – Product catalogs
  42. 42. Demo
  43. 43. IERel: Example entropy
  44. 44. IERel: Conclusion ● Labeling can be limited ● Highlights the value of interactive learning
  45. 45. Other representations
  46. 46. Learning Tree Pattern Queries
  47. 47. Maximal weight generalization
  48. 48. Other algorithms on trees ● Carme et al. – Inducing node-selecting tree automata ● Marty et al. – Tabular descriptions of nodes to be selected – Using classification techniques
  49. 49. We can extract data from the web. Now what?
  50. 50. Extraction is not everything
  51. 51. WetDL ● Workflow description of web navigation patterns ● An execution model ● A collection of meta-operators – Query – Fetch – Parse – Extract – Transform – External
  52. 52. Semantics of a WetDL workflow ● Nodes are processors – Receive messages through a queue – Process and dispatch the result messages ● A processor may generate 0, 1 or n messages ● Workflow terminates when all queues are empty
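
These semantics can be illustrated with a small queue-driven loop. The sketch is an assumption-laden toy, not WebSource code: the `Processor` class, the `run` function and the three example processors are hypothetical names, and real WetDL operators (Fetch, Parse, Extract, ...) would replace the toy functions.

```python
from collections import deque


class Processor:
    """A workflow node: an input queue, a function, and downstream targets."""

    def __init__(self, name, func):
        self.name = name
        self.func = func          # message -> iterable of 0, 1 or n results
        self.queue = deque()
        self.targets = []


def run(processors):
    """Process messages until every queue in the workflow is empty."""
    while any(p.queue for p in processors):
        for p in processors:
            while p.queue:
                message = p.queue.popleft()
                for result in p.func(message):
                    for target in p.targets:
                        target.queue.append(result)


results = []


def collect(number):
    """Terminal processor: stores the message and emits nothing."""
    results.append(number)
    return []


split = Processor("split", lambda text: text.split())                    # 1 -> n
numbers = Processor("numbers", lambda w: [int(w)] if w.isdigit() else [])  # 1 -> 0 or 1
sink = Processor("sink", collect)

split.targets = [numbers]
numbers.targets = [sink]
split.queue.append("a 1 b 22")
run([split, numbers, sink])
print(results)
```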
  53. 53. WebSource: execute WetDL flows ● Each node can: – enqueue data (push) – generate data (pull) ● Processing can occur: – on push (forward chaining) – on pull (backward chaining)
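
The difference between the two processing modes can be shown with plain Python, assuming a single "double the value" step as the workflow. The function names and the `log` lists are illustrative only and do not correspond to the WebSource API.

```python
from itertools import islice


def pull_pipeline(items, log):
    """Backward chaining: an input is consumed only when a result is requested."""
    for item in items:
        log.append(item)          # record which inputs were actually read
        yield item * 2


def push_pipeline(items, log, sink):
    """Forward chaining: every input is processed as soon as it is enqueued."""
    for item in items:
        log.append(item)
        sink.append(item * 2)


pull_log = []
first_two = list(islice(pull_pipeline([1, 2, 3, 4], pull_log), 2))

push_log, pushed = [], []
push_pipeline([1, 2, 3, 4], push_log, pushed)

print(first_two, pull_log)   # only the inputs needed for two results were read
print(pushed, push_log)      # all inputs were processed
```

On pull, asking for two results touches only two inputs, which matters when each input is a web page to fetch.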
  54. 54. WetDL ● Simple description of navigation patterns – Straightforward operators in the context of IE ● Powerful expressiveness (in particular for IE) – We can describe most (if not all) web information extraction tasks
  55. 55. WebSource Open-source WetDL interpreter http://websource.sf.net/
  56. 56. Applications and extensions
  57. 57. Semabot: Motivation What does the following query return? “lyon informatique emploi”
  58. 58. Semabot: Motivation A list of documents containing the terms “lyon” “informatique” “emploi”
  59. 59. Semabot: Objectives The query “lyon informatique emploi” should give: A list of computer engineer job offers
  60. 60. Semabot ● Registry of “object” schemas and wrappers ● Wrappers generate “objects” – Job offers, People, Products, etc. ● Crawler wraps pages and indexes objects
  61. 61. Semabot: Open problems ● Wrap the web into objects – i.e. what we have seen in this seminar ;) ● Interpret (some of) the terms of the query – “lyon” => http://en.wikipedia.org/wiki/Lyon – “emploi” => http://en.wikipedia.org/wiki/Job_(role)
  62. 62. Information Extraction ● WHAT? – Turn content designed for human consumption into content consumable through a target schema ● HOW? – Using machine learning approaches
  63. 63. Data Integration ● WHAT? – Turn content adapted to a source schema into content consumable through a target schema ● HOW? – Using machine learning approaches
  64. 64. Data Integration DB 1 Schema 1 App A Schema 2 Mappings Query Rewriting
  65. 65. Extracting = Mapping Data model Query Super Model String Regular Expressions / Automata Tree XPath Expressions Relational data SQL/SPARQL Expressions
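
As an illustration of the String and Tree rows, the same value can be extracted with a regular expression over the HTML string and with a path expression over the parsed tree. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; the fragment is the participant row from the earlier slides with the elided parts removed.

```python
import re
import xml.etree.ElementTree as ET

doc = ('<tr class="participant"><td class="pname">'
       '<div class="pname">Benjamin</div></td></tr>')

# String model: the wrapper is a regular expression.
by_regex = re.findall(r'<div class="pname">(.*?)</div>', doc)

# Tree model: the wrapper is a path query over the parsed document.
tree = ET.fromstring(doc)
by_path = [div.text for div in tree.findall(".//div[@class='pname']")]

print(by_regex, by_path)
```

Both wrappers play the same role as a mapping in data integration: a query in the source model whose results populate the target schema.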
  66. 66. Wrapping HTML to RDF ● HTML: <li id="gs2"> <b>Samsung Galaxy S II</b> <i>300 EUR</i> <br /> Vendor: charly@example.com </li> ● Rendered: Samsung Galaxy S II 300 EUR Vendor: charly@example.com ● Resource: http://phones.example.com/samsung/charly/#gs2 – name: Samsung Galaxy S II – price: 300 EUR – vendor: charly@example.com
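
A hand-written wrapper for this example might look like the sketch below. It assumes the fragment is well-formed enough for `xml.etree.ElementTree`, takes the base URL from the slide, and returns plain (subject, predicate, object) tuples with the short property names shown above rather than full RDF property URIs. The function name `wrap` is hypothetical.

```python
import xml.etree.ElementTree as ET

BASE = "http://phones.example.com/samsung/charly/"

html = ('<li id="gs2"> <b>Samsung Galaxy S II</b> <i>300 EUR</i> <br /> '
        'Vendor: charly@example.com </li>')


def wrap(fragment, base=BASE):
    """Turn one <li> product entry into (subject, predicate, object) triples."""
    li = ET.fromstring(fragment)
    subject = f"{base}#{li.get('id')}"
    # The vendor is the text following <br />, after the 'Vendor:' label.
    vendor = li.find("br").tail.split(":", 1)[1].strip()
    return [
        (subject, "name", li.find("b").text),
        (subject, "price", li.find("i").text),
        (subject, "vendor", vendor),
    ]


triples = wrap(html)
for triple in triples:
    print(triple)
```

The learning algorithms from the first part of the seminar aim to induce this kind of wrapper from examples instead of writing it by hand.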
  67. 67. Wrap-up ● Tour of information extraction – Learning wrappers – Building IE tasks ● Link with semantic web/open data ● Link with data integration
  68. 68. Perspectives ● Further explore the potential of interactive learning ● Learning navigation patterns ● Search of “objects” rather than documents ● Extension of interaction cycle – pattern generation – some form of automated pattern evaluation – continuous (re)learning
  69. 69. Thank you @b_habegger http://www.linkedin.com/in/benjaminhabegger benjamin.habegger@insa-lyon.fr
