Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools Presentation Transcript

  • 1. Information Extraction from the Web: Algorithms and Tools. Benjamin Habegger, University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205. Seminar on Information Extraction from the Web, ENSIAS, Rabat, Morocco - June 19, 2013
  • 2. About Me @b_habegger http://www.linkedin.com/in/benjaminhabegger benjamin.habegger@insa-lyon.fr
  • 3. Overview ● Fundamentals of information extraction from the web – Document representations – Approaches ● Algorithms to extract information from semi-structured web content – Wien, Stalker, DIPRE, IERel ● Tools to describe and run web scrapers – WetDL, WebSource ● Applications and extensions of information extraction – Making our human web smarter – Learning mappings for data integration
  • 4. What types of data are we talking about?
  • 5. Types of data on the Web ● Structured ● Unstructured ● Semi-structured
  • 6. Types of data on the Web ● Structured ● Unstructured ● Semi-structured
  • 7. Types of data on the Web ● Structured ● Unstructured ● Semi-structured
  • 8. Semi-structured data ● Usually, but not limited to, data from a database formatted as HTML ● Listings of entities ● Presented in a “regular” presentation format
  • 9. Multiple possible representations ● (DOM) tree ● Rendered page ● HTML string: <tr class="participant"> <td class="pname" id="part1968752570"> […] <div class="pname">Benjamin</div> </td> […] </tr>
  • 10. What do we want to do with those documents?
  • 11. Information extraction from the web monster.fr, apec.fr, remixjobs.com → Job Database
  • 12. Information extraction from the web ● Extract data from one or more web sites ● Wrap it into a predefined target format
  • 13. How do we do this?
  • 14. Wrappers (scrapers) monster.fr, apec.fr, remixjobs.com → Job Database
  • 15. Algorithms to learn wrappers ● Wien ● Stalker ● SoftMealy ● IEPad ● RoadRunner ● DIPRE ● IERel ● TreePat Miner ● Squirrel
  • 16. Wrapper representations ● A program ● A transducer (string or tree) ● A regular expression ● A tree pattern ● A query
  • 17. Document and wrapper representations
      Algorithm | Document model | Query/wrapper model
      Wien [Kushmerick] | String | LR-patterns
      Stalker [Muslea] | String | Delimiter rules
      SoftMealy [Hsu & Dung] | Analysed string | Transducer
      IERel [Habegger] | HTML string | *-patterns
      Squirrel [Carme] | DOM tree | Tree automata
      Habegger & Debarbieux | DOM tree | Tree-pattern queries
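To make the simplest row of this comparison concrete, here is a minimal sketch of what an LR-pattern (left/right delimiter) wrapper over a string document could look like. The function name `lr_extract` and the sample page are assumptions made for illustration; this is not Wien's learning algorithm, only the kind of wrapper it outputs.

```python
def lr_extract(page: str, left: str, right: str) -> list[str]:
    """Return every substring found between a left and a right delimiter."""
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break                      # no further left delimiter
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break                      # unterminated value: stop
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Hypothetical page fragment modelled on the participant rows of slide 9.
page = '<div class="pname">Benjamin</div><div class="pname">Mohamed</div>'
names = lr_extract(page, '<div class="pname">', '</div>')
print(names)
```

Learning such a wrapper then amounts to finding delimiters that work on every labeled example, which is where the algorithms in the table differ.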
  • 18. SoftMealy
  • 19. SoftMealy ● Input: – Completely labeled document ● Preprocessing: – Tokenize input string ● Output: – A transducer
  • 20. SoftMealy: Document Representation
      Symbol | Description
      CAlph(x) | String composed of only capitals
      C1Alph(x) | String starting with a capital
      Num(x) | Numerical string
      Html(x) | An HTML tag
      OAlph(x) | String of alpha-numerical characters
      Punc(x) | Punctuation symbol
      NL(n) | n line feeds
      Tab(n) | n tabulations
      Spc(n) | n spaces
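The preprocessing step can be sketched as a small tokenizer that maps an input string to the symbol classes listed above. This is an illustrative approximation restricted to ASCII text, not SoftMealy's actual tokenizer; the regular expression and the function name `tokenize` are assumptions.

```python
import re

# One alternative per symbol class; "Word" is refined into CAlph/C1Alph/OAlph.
TOKEN_RE = re.compile(
    r"(?P<Html><[^>]+>)"
    r"|(?P<Num>\d+)"
    r"|(?P<Word>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<NL>\n+)"
    r"|(?P<Tab>\t+)"
    r"|(?P<Spc> +)"
    r"|(?P<Punc>[^\sA-Za-z0-9])"
)


def tokenize(text):
    """Return a list of (symbol, value) pairs; whitespace symbols carry a count."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "Word":
            if value.isalpha() and value.isupper():
                kind = "CAlph"       # only capitals
            elif value[0].isupper():
                kind = "C1Alph"      # starts with a capital
            else:
                kind = "OAlph"       # any other alpha-numerical string
        if kind in ("NL", "Tab", "Spc"):
            tokens.append((kind, len(value)))
        else:
            tokens.append((kind, value))
    return tokens


tokens = tokenize("<b>Benjamin</b>  INSA 2013, lyon")
for token in tokens:
    print(token)
```

The transducer is then learned over these symbols rather than over raw characters, which is what lets the patterns take formatting into account.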
  • 21. SoftMealy: Algorithm N E O
  • 22. SoftMealy: Results
  • 23. SoftMealy: Conclusion ● String-based wrapper induction algorithm ● Patterns which take format into account → Improvement over WIEN ● Like WIEN and Stalker: – requires a lot of labeling – “batch” approach
  • 24. RoadRunner
  • 25. RoadRunner ● Input: – Collection of sample pages ● Algorithm – Induce structural pattern from the pages ● Output – A DTD-like schema structure for the documents
  • 26. RoadRunner: Example
  • 27. RoadRunner: Results
  • 28. RoadRunner ● Wraps regularities into a page pattern – Compacts structure ● Structural items of the found schema are NOT mapped to a target schema ● Option: use the output as input to a mapping mining algorithm
  • 29. DIPRE
  • 30. Dipre [Brin1998] ● Input: – Example instances of a relation to be extracted – A collection of web documents ● Output: – Patterns to be applied to the collection – (New) instances extracted using the patterns
  • 31. DIPRE: Relation extraction from a web cache Web Cache Relation Instances Very Basic Extraction Patterns
  • 32. DIPRE ● Interesting cyclic process ● Very (too) simple patterns for IE ● Problem of over-generalization ● The pattern set drifts away from its extraction target
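The cyclic process can be sketched as a loop that alternates between finding contexts of known instances and applying those contexts as patterns. This is a deliberately simplified illustration, not Brin's implementation: the (prefix, middle, suffix) pattern shape, the 3-character context, the 20-character limit on the middle part, and all names are assumptions, and the document collection is a single hard-coded string.

```python
import re


def find_patterns(docs, pairs, ctx=3):
    """Collect (prefix, middle, suffix) contexts around known instance pairs."""
    patterns = set()
    for doc in docs:
        for first, second in pairs:
            occurrence = re.escape(first) + r"(.{1,20}?)" + re.escape(second)
            for m in re.finditer(occurrence, doc):
                prefix = doc[max(0, m.start() - ctx):m.start()]
                suffix = doc[m.end():m.end() + ctx]
                if prefix and suffix:   # skip unanchored, over-general contexts
                    patterns.add((prefix, m.group(1), suffix))
    return patterns


def apply_patterns(docs, patterns):
    """Extract every pair that appears in one of the learned contexts."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle)
            + r"(.+?)" + re.escape(suffix)
        )
        for doc in docs:
            found.update(regex.findall(doc))
    return found


def dipre(docs, seeds, rounds=2):
    """Alternate pattern discovery and extraction for a fixed number of rounds."""
    pairs = set(seeds)
    for _ in range(rounds):
        pairs |= apply_patterns(docs, find_patterns(docs, pairs))
    return pairs


docs = [
    "<li><b>Isaac Asimov</b>, <i>Foundation</i></li>"
    "<li><b>Frank Herbert</b>, <i>Dune</i></li>"
]
result = dipre(docs, {("Isaac Asimov", "Foundation")})
print(sorted(result))
```

Because every extracted pair feeds the next round without any check, one bad pattern is enough to pull in wrong instances, which is the drift problem noted on the slide.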
  • 33. IERel
  • 34. IERel ● Input: – Examples of a relation to be extracted ● Algorithm – Extract patterns & generalize them ● Output – Extraction patterns
  • 35. IERel: Document representation
      HTML: <tr class="participant"> <td class="pname" id="part1968752570"> […] <div class="pname"> B e n j a m i n </div> </td> […] </tr>
      Token sequence: §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§
  • 36. IERel: Generalization
      HTML: <tr class="participant"> <td class="pname" id="part825438027"> […] <div class="pname"> M o h a m e d </div> </td> […] </tr>
      Token sequence: §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§
  • 37. IERel: Generalization
      Example 1: §1§ §2§ […] §3§ B e n j a m i n §4§ §5§ […] §6§
      Example 2: §1§ §7§ […] §3§ M o h a m e d §4§ §5§ […] §6§
      Generalized pattern: §1§ * […] §3§ * §4§ §5§ […] §6§
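The generalization of the two participant rows can be reproduced with a short sketch. It is not the IERel algorithm itself: the alignment uses Python's `difflib.SequenceMatcher` as a stand-in, each text run is kept as one token instead of one token per character, and the elided `[…]` parts of the rows are left out. The names `tokenize` and `generalize` are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(html, tag_ids):
    """Map each distinct tag to a §n§ symbol; keep each text run as one token."""
    tokens = []
    for part in re.findall(r"<[^>]+>|[^<]+", html):
        if part.startswith("<"):
            tag_ids.setdefault(part, f"§{len(tag_ids) + 1}§")
            tokens.append(tag_ids[part])
        elif part.strip():
            tokens.append(part.strip())
    return tokens


def generalize(a, b):
    """Keep tokens shared by both sequences; replace each differing span by '*'."""
    pattern = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _, _ in matcher.get_opcodes():
        if op == "equal":
            pattern.extend(a[i1:i2])
        else:
            pattern.append("*")
    return pattern


tag_ids = {}  # shared so that identical tags get the same symbol in both rows
row1 = ('<tr class="participant"><td class="pname" id="part1968752570">'
        '<div class="pname">Benjamin</div></td></tr>')
row2 = ('<tr class="participant"><td class="pname" id="part825438027">'
        '<div class="pname">Mohamed</div></td></tr>')
seq1 = tokenize(row1, tag_ids)
seq2 = tokenize(row2, tag_ids)
pattern = generalize(seq1, seq2)
print(" ".join(pattern))
```

The differing `<td>` tags (§2§ and §7§) and the two names each collapse to a wildcard, while the shared tags stay as anchors.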
  • 38. IERel: Interactive Learning Examples Extracted Results Patterns Refined Patterns Refined Patterns New examples / Negate wrong ones Results using refined patterns
  • 39. Coping with over-generalization Learn a set of patterns i.e. a disjunction of conjunctions
  • 40. IERel: Pattern construction
  • 41. IERel: Evaluation ● Multiple tested domains – Online directories – Search engine results – Product catalogs
  • 42. Demo
  • 43. IERel: Example entropy
  • 44. IERel: Conclusion ● Labeling can be limited ● Underlines the interest for interactive learning
  • 45. Other representations
  • 46. Learning Tree Pattern Queries
  • 47. Maximal weight generalization
  • 48. Other algorithms on trees ● Carme et al. – inducing node selecting tree automata ● Marty et al. – Tabular descriptions of nodes to be selected – Using classification techniques
  • 49. We can extract data from the web. Now what?
  • 50. Extraction is not all
  • 51. WetDL – Query – Fetch – Parse – Extract – Transform – External ● Workflow description of web navigation patterns ● An execution model ● A collection of meta-operators
  • 52. Semantics of a WetDL workflow ● Nodes are processors – Receive messages through a queue – Process and dispatch the result messages ● A processor may generate 0, 1 or n messages ● Workflow terminates when all queues are empty
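These semantics can be illustrated with a toy message-passing loop. This is a sketch of the execution model as stated on the slide, not WebSource code: the `Processor` class, the `run` function, and the fake `PAGES` dictionary standing in for real HTTP fetches are all assumptions.

```python
from collections import deque


class Processor:
    """A workflow node: consumes one message, emits 0, 1 or n messages."""

    def __init__(self, name, func):
        self.name = name
        self.func = func          # message -> iterable of result messages
        self.queue = deque()      # incoming messages
        self.targets = []         # downstream processors

    def step(self):
        message = self.queue.popleft()
        for result in self.func(message):
            for target in self.targets:
                target.queue.append(result)


def run(processors):
    """Run until every queue is empty (the termination rule of the workflow)."""
    while any(p.queue for p in processors):
        for p in processors:
            if p.queue:
                p.step()


# Hypothetical pages replacing a real Fetch operator.
PAGES = {"jobs?page=1": "dev;admin", "jobs?page=2": "tester"}
collected = []

query = Processor("query", lambda n: [f"jobs?page={i}" for i in range(1, n + 1)])
fetch = Processor("fetch", lambda url: [PAGES[url]])
extract = Processor("extract", lambda page: page.split(";"))
sink = Processor("sink", lambda item: collected.append(item) or [])

query.targets, fetch.targets, extract.targets = [fetch], [extract], [sink]
query.queue.append(2)             # initial message: number of result pages
run([query, fetch, extract, sink])
print(collected)
```

Here `query` turns one message into two, `extract` turns one page into several items, and `sink` emits nothing, covering the 0, 1 and n cases.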
  • 53. WebSource: execute WetDL flows ● Each node can: – enqueue data (push) – generate data (pull) ● Processing can occur: – on push (forward chaining) – on pull (backward chaining)
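One way to picture processing on pull (backward chaining) is a chain of Python generators, where asking the last node for a result is what triggers work upstream. This is an analogy under stated assumptions, not how WebSource is implemented; the operator functions and the `PAGES` dictionary are hypothetical.

```python
from itertools import islice

# Hypothetical pages replacing real HTTP fetches.
PAGES = {"jobs?page=1": "dev;admin", "jobs?page=2": "tester"}
fetched = []                       # records which URLs were actually fetched


def query(max_pages):
    for i in range(1, max_pages + 1):
        yield f"jobs?page={i}"


def fetch(urls):
    for url in urls:
        fetched.append(url)
        yield PAGES[url]


def extract(documents):
    for document in documents:
        yield from document.split(";")


pipeline = extract(fetch(query(2)))
first_two = list(islice(pipeline, 2))   # pull only two results
print(first_two, fetched)
```

Only the first page is fetched, because the consumer stopped pulling before the second one was needed. A push-based run would instead fetch every page as soon as its URL was produced.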
  • 54. WetDL ● Simple description of navigation patterns – Straightforward operators in the context of IE ● Powerful expressiveness (in particular for IE) – We can describe most (if not all) web information extraction tasks
  • 55. WebSource Open-source WetDL interpreter http://websource.sf.net/
  • 56. Applications and extensions
  • 57. Semabot: Motivation What does the following query give ? “lyon informatique emploi”
  • 58. Semabot: Motivation A list of documents containing the terms “lyon” “informatique” “emploi”
  • 59. Semabot: Objectives The query “lyon informatique emploi” should give: A list of computer engineer job offers
  • 60. Semabot ● Registry of “object” schemas and wrappers ● Wrappers generate “objects” – Job offers, People, Products, etc. ● Crawler wraps pages and indexes objects
  • 61. Semabot: Open problems ● Wrap the web into objects – i.e. what we have seen in this seminar ;) ● Interpret (some of) the terms of the query – “lyon” => http://en.wikipedia.org/wiki/Lyon – “emploi” => http://en.wikipedia.org/wiki/Job_(role)
  • 62. Information Extraction ● WHAT? – Turn content made for human consumption into content that conforms to a target schema ● HOW? – Using machine learning approaches
  • 63. Data Integration ● WHAT? – Turn content that conforms to a source schema into content that conforms to a target schema ● HOW? – Using machine learning approaches
  • 64. Data Integration DB 1 Schema 1 App A Schema 2 Mappings Query Rewriting
  • 65. Extracting = Mapping
      Super Model
      Data model | Query
      String | Regular expressions / automata
      Tree | XPath expressions
      Relational data | SQL/SPARQL expressions
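As a small illustration of the first two rows, the same extraction can be written as a regular expression over the string view and as an XPath expression over the tree view. The sample markup is a hypothetical, well-formed fragment based on the participant rows shown earlier; `xml.etree.ElementTree` supports only a subset of XPath, which is sufficient here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment (real HTML would need an HTML parser).
ROWS = (
    "<table>"
    '<tr class="participant"><td class="pname">'
    '<div class="pname">Benjamin</div></td></tr>'
    '<tr class="participant"><td class="pname">'
    '<div class="pname">Mohamed</div></td></tr>'
    "</table>"
)

# String data model: the query is a regular expression.
by_regex = re.findall(r'<div class="pname">([^<]*)</div>', ROWS)

# Tree data model: the query is an XPath expression.
tree = ET.fromstring(ROWS)
by_xpath = [node.text for node in tree.findall(".//div[@class='pname']")]

print(by_regex, by_xpath)
```

Both queries return the same names; what changes is the document model the query is written against, which is the point of the table.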
  • 66. Wrapping HTML to RDF
      HTML: <li id="gs2"> <b>Samsung Galaxy S II</b> <i>300 EUR</i> <br /> Vendor: charly@example.com </li>
      Rendered: ● Samsung Galaxy S 300 EUR Vendor: charly@example.com
      Resource: http://phones.example.com/samsung/charly/#gs2
      name | price | vendor
      Samsung Galaxy S II | 300 EUR | charly@example.com
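A hand-written wrapper for this example could look like the sketch below, which turns the list item into (subject, property, value) triples. It is only an illustration: the `ItemParser` class, the rule that `<b>` holds the name and `<i>` the price, and the bare property names are assumptions, and a real RDF output would use full property URIs.

```python
from html.parser import HTMLParser


class ItemParser(HTMLParser):
    """Collect the id of the <li> and its name, price and vendor fields."""

    def __init__(self):
        super().__init__()
        self.item_id = None
        self.fields = {}
        self._current = None      # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.item_id = dict(attrs).get("id")
        elif tag == "b":
            self._current = "name"
        elif tag == "i":
            self._current = "price"

    def handle_endtag(self, tag):
        if tag in ("b", "i"):
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current:
            self.fields[self._current] = text
        elif text.startswith("Vendor:"):
            self.fields["vendor"] = text[len("Vendor:"):].strip()


def to_triples(html, base_url):
    """Return (subject, property, value) triples for one list item."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    subject = f"{base_url}#{parser.item_id}"
    return [(subject, prop, value) for prop, value in parser.fields.items()]


html = ('<li id="gs2"> <b>Samsung Galaxy S II</b> <i>300 EUR</i> <br /> '
        'Vendor: charly@example.com </li>')
triples = to_triples(html, "http://phones.example.com/samsung/charly/")
for triple in triples:
    print(triple)
```

The subject URI is built from the page URL and the element id, matching the resource shown on the slide.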
  • 67. Wrap-up ● Tour of information extraction – Learning wrappers – Building IE tasks ● Link with semantic web/open data ● Link with data integration
  • 68. Perspectives ● Further explore the potential of interactive learning ● Learning navigation patterns ● Search of “objects” rather than documents ● Extension of interaction cycle – pattern generation – some form of automated pattern evaluation – continuous (re)learning
  • 69. Thank you @b_habegger http://www.linkedin.com/in/benjaminhabegger benjamin.habegger@insa-lyon.fr