On building a search interface discovery system


Slides of my talk at RED'09 workshop


  1. 1. On building a search interface discovery system <ul><li>Denis Shestakov </li></ul><ul><li>Helsinki University of Technology, Finland </li></ul><ul><li>fname.lname at tkk dot fi </li></ul>
  2. 2. Outline <ul><li>Background: search interfaces & deep Web </li></ul><ul><li>Motivation </li></ul><ul><li>Building directory of deep web resources </li></ul><ul><li>Interface Crawler </li></ul><ul><li>Experiments & results </li></ul><ul><li>Discussion & conclusion </li></ul>
  3. 3. Background <ul><li>Search engines (e.g., Google) do not crawl and index a significant portion of the Web </li></ul><ul><li>Information in the non-indexable part of the Web cannot be found or accessed via search engines </li></ul><ul><li>An important type of web content that is poorly indexed: </li></ul><ul><ul><li>Web pages generated from parameters that users provide via search interfaces </li></ul></ul><ul><ul><li>Typically, these pages contain ‘high-quality’ structured content (e.g., product descriptions) </li></ul></ul><ul><ul><li>Search interfaces are entry points to myriads of databases on the Web </li></ul></ul><ul><li>The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web) </li></ul><ul><li>Also, see VLDB’09 papers presented yesterday: Lixto & Kosmix </li></ul>
  4. 4. Background: example AutoTrader search form (http://autotrader.com/)
  5. 5. Background: deep Web numbers & misconceptions <ul><li>Size of the deep Web: </li></ul><ul><ul><li>400 to 550 times larger than the indexable Web according to a 2001 survey, but it is not that big </li></ul></ul><ul><ul><li>Comparable with the size of the indexable Web [indirect support in a technical report by Shestakov & Salakoski] </li></ul></ul><ul><li>Content of some (well, of many) web databases is, in fact, indexable: </li></ul><ul><ul><li>Correlation with database subjects: content of books/movies/music databases (i.e., relatively ‘static’ data) is indexed well </li></ul></ul><ul><ul><li>Search engines’ crawlers do go behind web forms [see VLDB’08 work by Madhavan et al.] </li></ul></ul><ul><li>Total number of web databases: </li></ul><ul><ul><li>Survey of Apr’04 by Chang et al.: 450,000 web databases </li></ul></ul><ul><ul><li>This is an underestimate </li></ul></ul><ul><ul><li>Now, in 2009, several million databases are available online </li></ul></ul>
  6. 6. Motivation <ul><li>Several million databases are available online … </li></ul><ul><li>To access a database, a user needs to know its URL </li></ul><ul><li>But there are directories/lists of databases, right? </li></ul><ul><ul><li>The biggest, Completeplanet.com, includes 70,000 resources </li></ul></ul><ul><ul><li>Others are manually created and maintained by domain specialists, such as the Molecular Biology Database Collection with 1170 summaries of bioinformatics databases in 2009 </li></ul></ul><ul><li>Essentially, we currently have no idea about the location of most deep web resources: </li></ul><ul><ul><li>And the content of many of these databases is either not indexed or poorly indexed </li></ul></ul><ul><ul><li>I.e., undiscovered resources with unknown content </li></ul></ul><ul><li>Directories of online databases that match the scale of the deep Web are needed </li></ul>
  7. 7. Motivation <ul><li>Building such directories requires a technique for finding search interfaces </li></ul><ul><ul><li>A database on the Web is identifiable by its search interface </li></ul></ul><ul><li>For any given topic there are too many web databases with relevant content: resource discovery has to be automatic </li></ul><ul><li>One specific application: general web search </li></ul><ul><ul><li>Transactional queries (i.e., find a site where further interaction will happen) </li></ul></ul><ul><ul><li>For example, if a query suggests that a user wants to buy/sell a car, search results should contain links to pages with web forms for car search </li></ul></ul>
  8. 8. Building directory of deep web resources <ul><li>1. Visit as many pages that potentially have search interfaces as possible </li></ul><ul><ul><li>(Dozens of) billions of web pages vs. millions of databases </li></ul></ul><ul><ul><li>Visiting a page with a search interface during a ‘regular’ crawl is a rare event </li></ul></ul><ul><ul><li>It is even rarer if databases of interest belong to a particular domain </li></ul></ul><ul><ul><li>Thus, a visiting (or crawling) strategy could be very helpful </li></ul></ul>
  9. 9. Building directory of deep web resources <ul><li>2. Recognize a search interface on a web page (focus in this work) </li></ul>
  10. 10. Building directory of deep web resources <ul><li>2. Recognize a search interface on a web page (focus in this work) </li></ul><ul><ul><li>Forms have great variety in structure and vocabulary </li></ul></ul><ul><ul><li>JavaScript-rich and non-HTML forms (e.g., in Flash) have to be recognized </li></ul></ul>
  11. 11. Building directory of deep web resources <ul><li>3. Classify search interfaces (and, hence, databases) into a subject hierarchy </li></ul><ul><ul><li>One of the challenges: some interfaces belong to several domains </li></ul></ul>
  12. 12. Interface crawler <ul><li>I-Crawler is a system to automatically discover search interfaces and identify the main subject of an underlying database </li></ul><ul><ul><li>Deals with JavaScript-rich and non-HTML forms </li></ul></ul><ul><ul><li>Uses a binary domain-independent classifier for identifying searchable web forms </li></ul></ul><ul><ul><li>Divides all forms into two groups: u-forms (those with one or two visible fields) and s-forms (the rest) </li></ul></ul><ul><ul><li>U- and s-forms are processed differently: u-forms are classified using query probing [Bergholz and Chidlovskii, 2003; Gravano et al., 2003] </li></ul></ul>
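The u-form / s-form division on the Interface crawler slide can be illustrated with a minimal Python sketch. It is not the I-Crawler implementation: the class `FormFieldCounter`, the function `split_forms`, and the rule for what counts as a "visible field" (every `select`, `textarea`, and `input` except the types listed in `NON_VISIBLE_INPUT_TYPES`) are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Assumption: these input types do not count as user-visible data fields.
NON_VISIBLE_INPUT_TYPES = {"hidden", "submit", "button", "reset", "image"}


class FormFieldCounter(HTMLParser):
    """Counts visible fields for each <form> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = []       # one visible-field count per form, in document order
        self._in_form = False  # True while the parser is inside a <form> element

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
            self.counts.append(0)
        elif self._in_form and tag in ("select", "textarea"):
            self.counts[-1] += 1
        elif self._in_form and tag == "input":
            # An <input> without a type attribute defaults to a text field.
            input_type = (dict(attrs).get("type") or "text").lower()
            if input_type not in NON_VISIBLE_INPUT_TYPES:
                self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def split_forms(html):
    """Label each form: 'u-form' for one or two visible fields, 's-form' for the rest."""
    parser = FormFieldCounter()
    parser.feed(html)
    return ["u-form" if 1 <= n <= 2 else "s-form" for n in parser.counts]
```

A simple keyword box with a submit button would be labelled a u-form under these assumptions, while a multi-field form such as a car search would be labelled an s-form. Forms rendered by JavaScript or Flash are out of reach of this static sketch.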
  13. 13. Interface crawler: architecture
  14. 14. Experiments and results <ul><li>Tested the Interface Identification component </li></ul><ul><li>Datasets: </li></ul><ul><ul><li>Dataset 1: 216 searchable (HTML) web forms from the UIUC repository plus 90 searchable web forms (60 HTML forms and 30 JS-rich or non-HTML forms) and 300 non-searchable forms (270 and 30) added by us </li></ul></ul><ul><ul><li>Dataset 2: only s-forms from dataset 1 </li></ul></ul><ul><ul><li>Dataset 3: 264 searchable forms and 264 non-searchable forms (all in Russian) </li></ul></ul><ul><ul><li>Dataset 4: 90 searchable u-forms and 120 non-searchable u-forms </li></ul></ul><ul><li>Learning with two thirds of each dataset and testing on the remaining third </li></ul>
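The two-thirds / one-third protocol mentioned on the slide above can be sketched as follows. The slides do not say how the split was drawn, so the random shuffle, the fixed seed, and the function name `split_two_thirds` are assumptions for illustration.

```python
import random


def split_two_thirds(items, seed=0):
    """Shuffle a copy of items and return (training, testing) parts.

    Assumption: a plain random split; the slides do not state whether
    the original split was random or stratified by class.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # two thirds for learning
    return shuffled[:cut], shuffled[cut:]
```

Applied to the Russian dataset of 264 searchable and 264 non-searchable forms (528 in total), this gives 352 forms for learning and 176 for testing.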
  15. 15. Experiments and results
  16. 16. Experiments and results <ul><li>Used the decision tree to detect search interfaces on real web sites </li></ul><ul><li>Three groups of web sites: </li></ul><ul><ul><li>150 deep web sites (in Russian) </li></ul></ul><ul><ul><li>150 sites randomly selected from the “Recreation” category of http://www.dmoz.org </li></ul></ul><ul><ul><li>150 sites randomly selected based on IP addresses </li></ul></ul><ul><li>All sites in each group were crawled to depth 5 </li></ul>
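Crawling a site "to depth 5", as on the slide above, can be read as a depth-limited breadth-first traversal from the root page. The sketch below works on any link-lookup function rather than on the live Web. The name `crawl_to_depth`, the `get_links` callback, and the convention that the root page is at depth 0 are assumptions, not details taken from the talk.

```python
from collections import deque


def crawl_to_depth(root, get_links, max_depth=5):
    """Breadth-first traversal from root.

    Returns a dict mapping each reached URL to its link distance from
    the root, for all pages within max_depth links of the root.
    get_links(url) is a hypothetical callback returning outgoing links.
    """
    depths = {root: 0}  # assumption: the root page counts as depth 0
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in get_links(url):
            if link not in depths:
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```

In a real crawler `get_links` would fetch a page and extract same-site links, and each visited page would then be passed to the form classifier.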
  17. 17. Discussion and conclusion <ul><li>One specific use of the I-Crawler: deep web characterization (i.e., how many deep web resources exist on the Web) </li></ul><ul><ul><li>Hence, while false positives are OK, false negatives are not OK (resources are ignored) </li></ul></ul><ul><li>Root pages of deep web sites are good starting points for discovering more databases </li></ul><ul><li>JS-rich and non-HTML forms are becoming more and more popular </li></ul><ul><ul><li>Recognizing them is essential </li></ul></ul><ul><li>Nowadays more and more content owners provide APIs to their data, databases, etc. </li></ul><ul><ul><li>Need for techniques for API discovery </li></ul></ul>
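The remark on false positives and false negatives in the discussion slide above amounts to preferring recall over precision when detecting search interfaces. A minimal sketch of both measures follows. The function name `precision_recall` and the boolean label encoding (True means "searchable form") are assumptions, and the slides do not say which measures were actually reported.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for boolean labels.

    True means "searchable form". A false negative is a real search
    interface that the classifier missed, i.e. an ignored resource.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A classifier that labels every form as searchable misses nothing (recall 1.0) but loses precision, which is the acceptable side of the trade-off when the goal is counting deep web resources.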
  18. 18. Thank you! Questions?