prie.ppt

  1. Information extraction from HTML product catalogues: Coupling quantitative and knowledge-based approaches <ul><li>Martin Labský <sup>1</sup>, Vojtěch Svátek <sup>1</sup>, Pavel Praks <sup>2</sup>, Ondřej Šváb <sup>1</sup></li></ul><ul><li>{labsky, svatek, xsvao06}@vse.cz, pavel.praks@vsb.cz</li></ul><ul><li>rainbow.vse.cz</li></ul><ul><li><sup>1</sup> Dept. of Information and Knowledge Engineering, Prague University of Economics</li></ul><ul><li><sup>2</sup> Dept. of Applied Mathematics, Technical University of Ostrava</li></ul>
  2. Agenda <ul><li>Overview of the Rainbow project </li></ul><ul><li>Extraction of product offers </li></ul><ul><ul><li>Annotation using HMMs </li></ul></ul><ul><ul><li>Impact of image information </li></ul></ul><ul><ul><li>Ontology-based instance extraction </li></ul></ul><ul><ul><li>Search interface </li></ul></ul><ul><li>Future work </li></ul>
  3. Rainbow overview <ul><li>Goal </li></ul><ul><ul><li>to present the content and structure of legacy websites to a user or computer agent </li></ul></ul><ul><li>How </li></ul><ul><ul><li>multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge </li></ul></ul><ul><li>Modular architecture, web services </li></ul><ul><ul><li>information extraction (HMMs) </li></ul></ul><ul><ul><li>discovery of website navigation structure (link graph) </li></ul></ul><ul><ul><li>image classifiers (histograms, dimensions, similarity) </li></ul></ul><ul><ul><li>URL classifier (rule-based) </li></ul></ul><ul><ul><li>extractor of summarizing sentences (bootstrapped indicator keywords) </li></ul></ul>
  4. Application of Rainbow
  5. Extraction of product offers <ul><li>Combines </li></ul><ul><ul><li>automatic document annotation using HMMs </li></ul></ul><ul><ul><li>image classifier </li></ul></ul><ul><ul><li>ontology-based instance composition </li></ul></ul><ul><ul><li>URL classifier for focused crawling </li></ul></ul><ul><ul><li>structured search interface powered by Sesame </li></ul></ul><ul><li>The data </li></ul><ul><ul><li>over 1000 bicycle offers (labeled using 15 attributes) </li></ul></ul><ul><ul><li>in 100 pages from different websites </li></ul></ul>
  6. Sample data
  7. Preprocessing <ul><li>HTML cleanup </li></ul><ul><ul><li>conversion to valid XHTML </li></ul></ul><ul><li>Only potentially relevant blocks kept </li></ul><ul><ul><li>blocks that do not directly contain text or images omitted </li></ul></ul><ul><li>Formatting tags </li></ul><ul><ul><li>attributes removed </li></ul></ul><ul><ul><li>several rules matching common constructions (add-to-basket form, choose-amount button) </li></ul></ul><ul><li>Images </li></ul><ul><ul><li>baseline: all images treated as a single token </li></ul></ul>
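The attribute removal and the single-image-token baseline from the preprocessing slide can be sketched as follows. This is only an illustration built on Python's standard `html.parser`; the names `CatalogueTokenizer`, `tokenize` and `IMAGE_TOKEN` are assumptions, and the block filtering and the rules for add-to-basket forms are not covered.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "<IMG>"  # hypothetical name for the single token shared by all images


class CatalogueTokenizer(HTMLParser):
    """Flattens HTML into tokens: bare tags, words, and one image token."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Baseline from the slide: every image becomes the same token.
            self.tokens.append(IMAGE_TOKEN)
        else:
            # Attributes of formatting tags are dropped, only the tag name is kept.
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words become individual tokens.
        self.tokens.extend(data.split())


def tokenize(html):
    """Return the token sequence for an HTML fragment."""
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

A token sequence of this kind is what a sequence model such as an HMM would then label.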
  8. Annotation using HMMs <ul><li>HMM structure </li></ul><ul><ul><li>target, prefix, suffix and background states </li></ul></ul><ul><ul><li>adopted from [Freitag, McCallum 99] </li></ul></ul><ul><li>Single tag trigram model for all tags </li></ul><ul><li>F-measures </li></ul><ul><ul><li>83% for name, 89% for price </li></ul></ul><ul><ul><li>56% average for 13 other attributes (range 17-90%) </li></ul></ul><ul><li>Variations </li></ul><ul><ul><li>word n-gram models for lexical probabilities of target states </li></ul></ul><ul><ul><li>state substructures instead of single target states, learned by EM </li></ul></ul>
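To make the role of the target, prefix, suffix and background states concrete, here is a minimal Viterbi decoding sketch. It assumes a first-order HMM, which is simpler than the trigram model named on the slide, and every probability in the toy model is invented for illustration; `viterbi` and the `TOY_*` tables are hypothetical names, not part of Rainbow.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely state sequence under a first-order HMM, computed in log space."""
    def lp(p):
        return math.log(max(p, floor))  # floor stands in for smoothing of unseen events

    # best[i][s] = (log-probability of the best path ending in state s at token i, previous state)
    best = [{s: (lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(tokens[0], 0.0)), None)
             for s in states}]
    for tok in tokens[1:]:
        column = {}
        for s in states:
            score, prev = max((best[-1][r][0] + lp(trans_p[r].get(s, 0.0)), r)
                              for r in states)
            column[s] = (score + lp(emit_p[s].get(tok, 0.0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model for a single "price" attribute; all numbers are made up.
STATES = ["background", "prefix", "target", "suffix"]
TOY_START = {"background": 1.0}
TOY_TRANS = {
    "background": {"background": 0.8, "prefix": 0.2},
    "prefix": {"prefix": 0.5, "target": 0.5},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"background": 1.0},
}
TOY_EMIT = {
    "background": {"buy": 0.5, "now": 0.5},
    "prefix": {"price": 0.5, ":": 0.5},
    "target": {"499": 1.0},
    "suffix": {"EUR": 1.0},
}

path = viterbi(["buy", "price", ":", "499", "EUR", "now"],
               STATES, TOY_START, TOY_TRANS, TOY_EMIT)
```

Tokens decoded as `target` are the ones that would be annotated with the attribute; the prefix and suffix states model the context immediately around it.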
  9. Impact of image information <ul><li>Image classifier </li></ul><ul><ul><li>classifies into 3 classes: Pos, Neg, Unk </li></ul></ul><ul><ul><li>before HMM annotation, each image occurrence in a document is substituted by its class </li></ul></ul><ul><ul><li>best result: 6.6% error rate for binary classification with a multi-layer perceptron (Weka) </li></ul></ul><ul><li>Features used for classification </li></ul><ul><ul><li>dimensions (estimated 2-dimensional normal distribution) </li></ul></ul><ul><ul><li>similarity (latent semantic similarity [Praks 2004]) </li></ul></ul><ul><ul><li>whether the same image repeats in the same document </li></ul></ul><ul><li>Results </li></ul><ul><ul><li>image precision increased by 19.1%, recall by 2% </li></ul></ul><ul><ul><li>improvements for other tags negligible </li></ul></ul>
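Two of the ideas on this slide lend themselves to a short sketch: the dimension feature modelled as a 2-dimensional normal distribution, and the substitution of each image by its class before HMM annotation. The code below is an assumption-laden illustration, not the classifier reported on the slide (that one is a multi-layer perceptron in Weka). The function names, the `("img", src)` token shape and the default of `Unk` for unclassified images are all hypothetical.

```python
import math


def fit_gaussian_2d(points):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def log_density(point, mean, cov):
    """Log of the bivariate normal density; usable as a dimension feature."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy ** 2  # must be positive (non-degenerate covariance)
    dx, dy = point[0] - mx, point[1] - my
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * m - math.log(2 * math.pi) - 0.5 * math.log(det)


def substitute_images(tokens, image_class):
    """Replace each ("img", src) token by IMG_Pos, IMG_Neg or IMG_Unk."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple) and tok[0] == "img":
            out.append("IMG_" + image_class.get(tok[1], "Unk"))  # assumed default
        else:
            out.append(tok)
    return out
```

In this sketch the log density of an image's dimensions would be one input among several to whatever classifier decides between the three classes.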
  10. Ontology-based instance extraction (diagram labels: Document annotated by HMM; Presentation ontology; Instance extraction algorithm; Instances (XML); Sesame RDF repository)
  11. Domain ontology; Presentation ontology
  12. Instance extraction algorithm <ul><li>Sequentially parses the annotated document </li></ul><ul><li>Adds annotated attributes to the working instance WI </li></ul><ul><li>If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent. </li></ul><ul><li>WI = empty_instance; </li></ul><ul><li>while (more_attributes) { </li></ul><ul><li>A = next_attribute; </li></ul><ul><li>if (cannot_add (WI, A)) { </li></ul><ul><li>if (consistent (WI)) { </li></ul><ul><li>store (WI); </li></ul><ul><li>} </li></ul><ul><li>WI = empty_instance; </li></ul><ul><li>} </li></ul><ul><li>add (WI, A); </li></ul><ul><li>} </li></ul>
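The pseudocode on this slide translates almost line by line into runnable Python. The sketch below makes two assumptions that the slide does not state. First, `cannot_add` and `consistent` are stand-ins for the ontology checks: here an attribute may occur at most once per instance, and an instance is consistent if it has a `name`. Second, the slide's loop never stores the last working instance, so the optional `flush_last` step is an addition and can be switched off to follow the slide literally.

```python
REQUIRED = {"name"}  # hypothetical stand-in for the ontology's consistency constraints


def cannot_add(instance, attribute):
    """Assumed constraint: each attribute name occurs at most once per instance."""
    name, _value = attribute
    return name in instance


def consistent(instance):
    """Assumed constraint: an instance needs every required attribute."""
    return REQUIRED.issubset(instance)


def extract_instances(attributes, flush_last=True):
    """Group a sequence of (name, value) annotations into product instances."""
    stored = []
    wi = {}  # working instance WI
    for attribute in attributes:
        if cannot_add(wi, attribute):
            if consistent(wi):
                stored.append(wi)
            wi = {}  # start a new empty working instance
        name, value = attribute
        wi[name] = value
    # Not in the slide's pseudocode: keep the final instance if it is consistent.
    if flush_last and consistent(wi):
        stored.append(wi)
    return stored
```

For example, the sequence name, price, name, price yields two instances, because the second name cannot be added to the first working instance.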
  13. Search interface powered by Sesame
  14. Future work <ul><li>Learn to correct annotation errors </li></ul><ul><ul><li>use document structure to detect unlabeled attributes </li></ul></ul><ul><ul><li>bootstrap from these new examples </li></ul></ul><ul><ul><li>use ontology constraints on values (types, lists, regexps) </li></ul></ul><ul><li>Population algorithm </li></ul><ul><ul><li>utilize scores for each annotated attribute </li></ul></ul><ul><ul><li>augment presentation ontology with frequencies of attribute orderings </li></ul></ul><ul><ul><li>use approximate name matching to identify instances </li></ul></ul><ul><li>Improve search interface </li></ul><ul><ul><li>approximate name matching (word and char edit distance) </li></ul></ul>
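The word and character edit distance mentioned for approximate name matching can be illustrated with the standard Levenshtein dynamic program. The slide only names the idea as future work, so this is a generic sketch and not Rainbow code; `edit_distance` is a hypothetical name. The same function covers both cases because it works on any sequence: a string gives character distance, a list of words gives word distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]
```

Calling it on `"Trek 4300 Disc".split()` and `"Trek 4300".split()` compares product names word by word, while passing the raw strings compares them character by character.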
  15. Thank you! rainbow.vse.cz