Semi structure data extraction

Published in: Technology

  1. 1. SEMI-STRUCTURED DATA EXTRACTION Rajendra Akerkar (with David Camacho, Maria D. R-Moreno, David F. Barrero) Bonn, June 2007
  2. 2. INDEX: Introduction; Semantic Generators; The WebMantic architecture; A practical example; Some experimental issues; Conclusions
  3. 3. INTRODUCTION
  4. 4. INTRODUCTION Web information: unstructured; non-semantic; designed for humans, not for crawlers. Problems: representation (HTML vs XML); extract, filter and reuse data; share information; volatility; fault tolerance.
  5. 5. INTRODUCTION Information Extraction techniques: machine learning; pattern recognition; wrapper technologies. Tools for automatic and semi-automatic Web data extraction. This work presents: a rule-based method for data identification; an approach to Web data extraction; a particular implementation of the previous method.
  6. 6. SEMANTIC GENERATORS
  7. 7. SEMANTIC GENERATORS Def: A Semantic Generator (Sg) is a non-empty set of rules (HTML2XML) that can be used to translate HTML documents into XML documents. A Semantic Generator (Sg) is built from several rules which transform a set of non-semantic HTML tags into a set of semantic XML tags. HTML2XML rule format: HTML2XML_i = <header> IS <body> #num
  8. 8. SEMANTIC GENERATORS HTML2XML: <table.tr.td> IS <my-xml-tag>. Tags such as <table>, <tr>, <td>, <A href…>, etc. will be removed; only data will be extracted. #num: provides the number of cells to be processed. Example: <my-xml-tag> Madrid </my-xml-tag>
  9. 9. SEMANTIC GENERATORS Semantic generator
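The rule format `<header> IS <body> #num` can be illustrated with a minimal Python sketch. It is not the WebMantic implementation: the names `parse_rule`, `RuleApplier` and `apply_rule` are hypothetical, the header is read as a dot-separated tag path, and `#num` is assumed to cap how many matching cells are converted. The slides do not confirm either reading.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumed textual form of a rule: "<table.tr.td> IS <my-xml-tag> #3"
RULE_RE = re.compile(r"^\s*<([\w.\-]+)>\s+IS\s+<([\w\-]+)>\s*(?:#(\d+))?\s*$")


def parse_rule(text):
    """Split a rule string into (tag path, XML tag, cell count or None)."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    path = tuple(match.group(1).lower().split("."))
    num = int(match.group(3)) if match.group(3) else None
    return path, match.group(2), num


class RuleApplier(HTMLParser):
    """Wrap the text of every element whose tag path ends with the rule header."""

    def __init__(self, path, xml_tag, num):
        super().__init__()
        self.path, self.xml_tag, self.num = path, xml_tag, num
        self.stack = []      # currently open HTML tags
        self.buffer = None   # text pieces of the cell being read, or None
        self.output = []     # generated XML fragments

    def _matches(self):
        return tuple(self.stack[-len(self.path):]) == self.path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        limit_reached = self.num is not None and len(self.output) >= self.num
        if self.buffer is None and not limit_reached and self._matches():
            self.buffer = []

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        while self.stack[-1] != tag:
            self.stack.pop()  # drop unclosed void tags such as <br>
        if self.buffer is not None and self._matches():
            text = " ".join("".join(self.buffer).split())
            self.output.append(f"<{self.xml_tag}>{escape(text)}</{self.xml_tag}>")
            self.buffer = None
        self.stack.pop()


def apply_rule(rule_text, html):
    """Return the XML fragments produced by one rule on one HTML document."""
    applier = RuleApplier(*parse_rule(rule_text))
    applier.feed(html)
    applier.close()
    return applier.output
```

Calling `apply_rule("<table.tr.td> IS <city> #2", html)` on a one-row table returns the first two cells wrapped in `<city>` tags. Inner markup such as `<a href>` is dropped and only the cell text is kept, as slide 8 describes.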
  10. 10. THE WEBMANTIC ARCHITECTURE
  11. 11. WEBMANTIC ARCHITECTURE WebMantic allows: automatic generation of Sg; generalization of HTML2XML rules; guiding the extraction process; automatic generation of wrappers.
  12. 12. WEBMANTIC ARCHITECTURE
  13. 13. WEBMANTIC ARCHITECTURE Tidy HTML parser (http://tidy.sourceforge.net). It translates HTML documents into well-formed HTML documents. The HTML Tidy program (HTML parser and pretty printer) has been integrated as the first preprocessing module in WebMantic. Tree generator module. Once the HTML page is preprocessed by the Tidy parser, a tree representation of the structures stored in the page is built. In this representation any table or list tag generates a node, and the leaves of the tree are: cells for tables (th,td,tr) or items for lists (li,lo).
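The tree generator step can be sketched as follows, assuming the input is already well-formed (the Tidy preprocessing is not reproduced here). `Node`, `TreeBuilder` and `build_tree` are hypothetical names. Treating `table`, `ul` and `ol` as node tags and `th`, `td` and `li` as leaf tags is an assumption of this sketch, not something the slides specify in full.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

STRUCTURES = {"table", "ul", "ol"}  # assumed: tags that open a tree node
LEAVES = {"th", "td", "li"}         # assumed: tags whose text becomes a leaf


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node objects or leaf strings


class TreeBuilder(HTMLParser):
    """Build a tree of table/list structures; text outside them is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.open_nodes = [self.root]  # innermost open structure is last
        self.text = None               # text pieces of the current leaf, or None

    def _flush(self):
        # Store the pending leaf text (if any) under the innermost structure.
        if self.text is not None:
            leaf = " ".join("".join(self.text).split())
            if leaf:
                self.open_nodes[-1].children.append(leaf)
            self.text = None

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURES:
            self._flush()
            node = Node(tag)
            self.open_nodes[-1].children.append(node)
            self.open_nodes.append(node)
        elif tag in LEAVES and len(self.open_nodes) > 1:
            self._flush()
            self.text = []

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in LEAVES:
            self._flush()
        elif tag in STRUCTURES and self.open_nodes[-1].tag == tag:
            self._flush()
            self.open_nodes.pop()


def build_tree(html):
    """Parse (already tidied) HTML and return the root of the structure tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Nested structures, such as a list inside a table cell, become child nodes of the enclosing table node, which gives the nesting depth that the rule generator can later turn into tag paths.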
  14. 14. WEBMANTIC ARCHITECTURE
  15. 15. WEBMANTIC ARCHITECTURE HTML2XML Rule generator module. The tree representation obtained is used by this module to generate a set of rules (Sg) that represent the information to be translated. (Figure: HTML2XML rules)
  16. 16. WEBMANTIC ARCHITECTURE
  17. 17. WEBMANTIC ARCHITECTURE Subsumption module. The previous module generates a rule for each structure to be translated. However, some of those rules can be generalized if the XML-tag represents the same concept (e.g. the rules in the previous example that represent the concepts of <data-record> and <country>).
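The slides do not spell out how subsumption merges rules, so the following is only one possible reading. Rules are represented as hypothetical `(path, xml_tag, num)` tuples, and rules that share both the HTML tag path and the XML tag are collapsed into a single rule whose `#num` is the sum of the originals. The function name `subsume` and the summing behaviour are assumptions.

```python
def subsume(rules):
    """Merge rules with the same (path, xml_tag), summing their cell counts.

    Assumption: two rules describe "the same concept" when both the HTML tag
    path and the XML tag are identical. The first occurrence fixes the order.
    """
    merged = {}
    for path, xml_tag, num in rules:
        key = (path, xml_tag)
        merged[key] = merged.get(key, 0) + num
    return [(path, xml_tag, num) for (path, xml_tag), num in merged.items()]


cell = ("table", "tr", "td")
rules = [
    (cell, "country", 1),
    (cell, "population", 1),
    (cell, "country", 1),
]
generalized = subsume(rules)
```

Under these assumptions the two `country` rules collapse into one rule covering two cells, while the `population` rule is left unchanged.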
  18. 18. WEBMANTIC ARCHITECTURE
  19. 19. WEBMANTIC ARCHITECTURE XML Parser module. This module receives both the Semantic Generator obtained in the previous module and the (well-formed) HTML document. (Figure labels: Semantic Generator; Yahoo! Weather; XML Parser)
  20. 20. A PRACTICAL EXAMPLE
  21. 21. WEBMANTIC GUI WebMantic’s GUI
  22. 22. WEBMANTIC GUI www.citypopulation.de
  23. 23. WEBMANTIC GUI www.citypopulation.de
  24. 24. WEBMANTIC GUI First tables & list are rejected
  25. 25. WEBMANTIC GUI First data-table is rejected
  26. 26. WEBMANTIC GUI data-table target
  27. 27. WEBMANTIC GUI XML tags generation (user interaction)
  28. 28. WEBMANTIC GUI XML tags & HTML2XML rules
  29. 29. WEBMANTIC HTML PROCESSING Tree generated from HTML document. Relation between the HTML tree and the XML-tags provided by the user.
  30. 30. WEBMANTIC HTML PROCESSING HTML2XML rules Semantic Generator: HTML2XML subsumed rules
  31. 31. EXPERIMENTAL RESULTS
  32. 32. EXPERIMENTAL RESULTS Experimental tests (Web sites used):  Population (www.citypopulation.de)
  33. 33. EXPERIMENTAL RESULTS Experimental tests (Web sites used):  Yahoo Weather (weather.yahoo.com)
  34. 34. EXPERIMENTAL RESULTS Experimental tests (Web sites used):  Iberia airlines (www.iberia.com)
  35. 35. EXPERIMENTAL RESULTS Several parameters have been evaluated: 1. Number of pages tested from each Web site 2. Number of accessible structures 3. Maximum nested structure 4. Average number of HTML2XML rules for each Semantic Generator (Sg), once the subsumption process has finished 5. Average time (seconds) to generate the Sg (Time Sg) 6. Average time (seconds) to translate from HTML to XML for the set of training pages (transformation time)
  36. 36. EXPERIMENTAL RESULTS
  37. 37. CONCLUSIONS
  38. 38. CONCLUSIONS AND FUTURE WORK Conclusions: We define a technique which is able to provide a semantic representation (using XML-tags) to semi-structured (tables and lists) Web pages through a set of rules (encapsulated in a Semantic Generator). Rules are created and automatically generalized. These rules can be used to preprocess Web pages with a similar structure, and convert them into XML documents with semantic tags. These can be integrated into information agents.
  39. 39. CONCLUSIONS AND FUTURE WORK In the near future: other Web technologies such as DOM; ontologies; machine learning algorithms to automatically learn new (similar) Web pages; statistical knowledge extraction.
