
Web Information Extraction for the DB Research Domain



A presentation describing my final project for an engineering degree at the Hebrew University of Jerusalem - a system for extracting information from web sites into instances of an XML schema, utilizing machine learning, structural analysis of documents and a divide & conquer strategy.



  1. Web Information Extraction for the DB Research Domain
     Michael Genkin, Liat Kakun
     School of Engineering and Computer Science
     Advisor: Dr. Sara Cohen
  2. Introduction
       • Wealth of information available online
           • Too much for humans to handle effectively.
           • Mostly inaccessible to computers.
       • A web information extraction project
           • Provide a complete, domain-specific system.
           • Allow structured queries on top of web information.
           • Part of a research effort on developing tools to support scientific policy management at the HUJI DB Group.
               • Advisor: Dr. Sara Cohen
               • Other groups are creating the remaining components: web crawler, UI.
  3. Introduction
       • Extract information from DB research projects’ web sites.
           • Domain specific
           • Divide & conquer
           • Structural document analysis
           • Linguistic analysis
           • Machine learning
       • The domain is encoded in an XML schema document.
           • Contains processing instructions as well as domain semantics.
       • The result is an XML-based, queryable database.
  4. Methods – Structural Analysis #1
     Transform each input document into a structurally valid, monolithic document, using industry-standard tools such as HTML Tidy and Readability.
     [Figure: a document before and after the transformation]
  5. Methods – Structural Analysis #2
       • Vertically segment each document into logical blocks.
       • Employ stack-based style analysis to identify each of the blocks.
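The slides do not spell out the style analysis itself, so the following is only a minimal sketch of the general idea. It assumes that heading tags mark the start of a new logical block, which is a simplification chosen for illustration; the real system presumably looks at richer style cues. The names `BlockSegmenter` and `segment` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption for this sketch: a heading opens a new logical block.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BlockSegmenter(HTMLParser):
    """Split a document into blocks, tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.blocks = []   # each block: {"title": str, "text": [str, ...]}
        self._new_block()  # holds any text that precedes the first heading

    def _new_block(self):
        self.blocks.append({"title": "", "text": []})

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag in HEADING_TAGS:
            self._new_block()

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        block = self.blocks[-1]
        # Text inside an open heading becomes the block title.
        if any(t in HEADING_TAGS for t in self.stack):
            block["title"] = (block["title"] + " " + text).strip()
        else:
            block["text"].append(text)


def segment(html):
    """Return a list of (title, text) pairs, one per non-empty block."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return [
        (b["title"], " ".join(b["text"]))
        for b in parser.blocks
        if b["title"] or b["text"]
    ]
```

The stack is what lets the parser decide, for any piece of text, which enclosing elements (and hence which styles) apply to it; a fuller version would push style attributes rather than bare tag names.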
  6. Methods – Classification
     Employ multiclass classification (by vector similarity) to map the logical document blocks to the appropriate schema elements.
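As a rough illustration of classification by vector similarity, the sketch below builds one bag-of-words vector per schema element from labeled blocks and assigns a new block to the element with the highest cosine similarity. The feature representation, the use of cosine similarity, and the labels in the usage example are assumptions; the slides do not say which vectors or similarity measure the system uses.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_blocks):
    """Sum the vectors of all training blocks that share a label."""
    centroids = {}
    for label, text in labeled_blocks:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Return the label whose summed vector is most similar to the block."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

A possible usage, with made-up training data:

```python
centroids = train([
    ("bibliography", "paper published in proceedings of vldb"),
    ("members", "phd student professor advisor"),
])
classify("proceedings of sigmod", centroids)  # picks "bibliography"
```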
  7. Methods – Pattern Recognition
     Mine likely candidate blocks for patterns using the PAT tree algorithm, adjusted to find a maximum-likelihood pattern.
     Pattern: .//bibliography/ul/li/*
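The sketch below does not implement the PAT tree mining step. It only shows how a pattern of the form given on this slide could be applied once it has been found, using the limited XPath support in Python's standard `xml.etree.ElementTree`. The sample document, and the assumption that a classified block is wrapped in an element named after its schema element, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pre-processed document: the block classified as a
# bibliography is wrapped in a <bibliography> element.
DOC = """
<project>
  <bibliography>
    <ul>
      <li><a href="p1.pdf">Paper one</a></li>
      <li><a href="p2.pdf">Paper two</a></li>
    </ul>
  </bibliography>
</project>
"""

PATTERN = ".//bibliography/ul/li/*"


def extract(xml_text, pattern=PATTERN):
    """Return the text of every element matched by the pattern."""
    root = ET.fromstring(xml_text.strip())
    return ["".join(el.itertext()).strip() for el in root.findall(pattern)]
```

Each match corresponds to one repeated record in the block, here one entry of the publication list.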
  8. Methods – Metadata Extraction
     Use conditional random fields (CRF) to extract additional metadata where appropriate (e.g. bibliographic lists).
  9. Results – Setting
       • 50 web pages of DB research projects from American and Israeli universities.
           • Chosen manually to represent a wide variety of web page styles.
       • All pages were pre-processed by our system (their structure analyzed), then manually tagged for classification, patterns, and metadata.
       • 20% of the dataset is randomly sampled for training purposes.
           • Repeated 5 times, and averaged.
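The sampling procedure in this setting can be sketched as a repeated random hold-out. The slide states only that 20% is sampled for training and that the process is repeated 5 times and averaged; using the remaining 80% for testing is an assumption, and `repeated_holdout` and its `evaluate` callback are hypothetical names.

```python
import random
from statistics import mean


def repeated_holdout(items, evaluate, train_fraction=0.2, repeats=5, seed=0):
    """Average a score over several random train/test splits.

    `evaluate(train, test)` is a caller-supplied function that trains on
    `train`, scores on `test`, and returns a number.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        # Assumption: everything not sampled for training is used for testing.
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return mean(scores)
```

With 50 pages and the default settings, each round trains on 10 pages and tests on the other 40.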
  10. Results – Measures
  11. Results
                               Precision   Recall
      Pattern Recognition      85%         89.7%

      Classification accuracy: 82.5%
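For reference, the measures named in these results have the following standard definitions. How the project counted a "correct" pattern match or classification is not stated in the slides, so the set-based and label-based comparisons below are assumptions.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a set of predictions against the gold set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def accuracy(predicted_labels, true_labels):
    """Fraction of items whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```

Precision penalizes extracted items that should not have been extracted, recall penalizes items that were missed, and accuracy applies when every block receives exactly one label.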
  12. Conclusions
       • This is a feasible approach for creating a web information extraction system.
       • Good results can be achieved with a relatively small sample.
       • The modular system design allows easy adaptation to additional domains.
       • Future directions:
           • Schema generation
           • Better information integration
           • Additional modules (e.g. deep linguistic analysis)
  13. Questions?