On building a search interface discovery system

Slides of my talk at RED'09 workshop



Presentation Transcript

  • On building a search interface discovery system
    • Denis Shestakov
    • Helsinki University of Technology, Finland
    • fname.lname at tkk dot fi
  • Outline
    • Background: search interfaces & deep Web
    • Motivation
    • Building directory of deep web resources
    • Interface Crawler
    • Experiments & results
    • Discussion & conclusion
  • Background
    • Search engines (e.g., Google) do not crawl and index a significant portion of the Web
    • The information in the non-indexable part of the Web cannot be found or accessed via search engines
    • Important type of web content which is badly indexed:
      • Web pages generated based on parameters provided by users via search interfaces
      • Typically, these pages contain ‘high-quality’ structured content (e.g., product descriptions)
      • Search interfaces are entry-points to myriads of databases on the Web
    • The part of the Web ’behind’ search interfaces is known as the deep Web (or hidden Web)
    • Also, see VLDB’09 papers presented yesterday: Lixto & Kosmix
  • Background: example AutoTrader search form (http://autotrader.com/):
  • Background: deep Web numbers & misconceptions
    • Size of the deep Web:
      • 400 to 550 times larger than the indexable Web according to a 2001 survey; but it is not that big
      • Comparable with the size of the indexable Web [indirect support in tech.report by Shestakov&Salakoski]
    • Content of some (well, of many) web databases is, in fact, indexable:
      • Correlation with database subjects: content of books/movies/music databases (i.e., relatively ’static’ data) is indexed well
      • Search engines’ crawlers do go behind web forms [see VLDB’08 work by Madhavan et al.]
    • Total number of web databases:
      • Survey of Apr’04 by Chang et al.: 450,000 web databases
      • An underestimation
      • Now, in 2009, several million databases are available online
  • Motivation
    • Several million databases are available online …
    • To access a database, a user needs to know its URL
    • But there are directories/lists of databases, right?
      • The biggest, Completeplanet.com, includes 70,000 resources
      • Manually created and maintained by domain specialists, e.g., the Molecular Biology Database Collection, with 1,170 summaries of bioinformatics databases in 2009
    • Essentially, we currently have no idea about the location of most deep web resources:
      • And content of many of these databases is either not indexed or poorly indexed
      • I.e., undiscovered resources with unknown content
    • Directories of online databases corresponding to the scale of deep Web are needed
  • Motivation
    • Building such directories requires a technique for finding search interfaces
      • A database on the Web is identifiable by its search interface
    • For any given topic there are too many web databases with relevant content: resource discovery has to be automatic
    • One specific application: general web search
      • Transactional queries (i.e., find a site where further interaction will happen)
      • For example, if a query suggests that a user wants to buy/sell a car, search results should contain links to pages with web forms for car search
  • Building directory of deep web resources
    • 1. Visit as many pages that potentially have search interfaces as possible
      • (Dozens of) billions of web pages vs. millions of databases
      • Visiting a page with a search interface during a ‘regular’ crawl is a rare event
      • It is even rarer if the databases of interest belong to a particular domain
      • Thus, some visiting (or crawling) strategy could be very helpful
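
The "visiting strategy" mentioned above could be sketched as a best-first frontier: URLs are popped in order of a score estimating how likely they are to lead to a search interface. The hint keywords and scoring function below are illustrative assumptions, not the strategy actually used.

```python
import heapq

# Illustrative keywords; a real strategy would learn such evidence from data.
FORM_HINTS = ("search", "find", "advanced", "query", "lookup")

def score(url: str) -> float:
    """Higher score = more likely to lead to a search interface (assumption)."""
    u = url.lower()
    return sum(1.0 for kw in FORM_HINTS if kw in u)

def best_first_order(urls):
    """Yield URLs in descending score order (heapq is a min-heap,
    so scores are negated)."""
    heap = [(-score(u), u) for u in urls]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        yield u

frontier = [
    "http://example.com/about",
    "http://example.com/search/advanced",
    "http://example.com/news",
]
ordered = list(best_first_order(frontier))
```

With this ordering, the page whose URL suggests a search form is visited first, which is the point of the strategy: spend the crawl budget where search interfaces are most likely.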
  • Building directory of deep web resources
    • 2. Recognize a search interface on a web page (the focus of this work)
      • Forms have great variety in structure and vocabulary
      • JavaScript-rich and non-HTML forms (e.g., in Flash) have to be recognized
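
For plain HTML forms, the recognition step could be approximated by extracting simple structural features; the features, thresholds, and example forms below are illustrative assumptions (the actual I-Crawler classifier is not reproduced here), and JS-rich or Flash forms would additionally require rendering.

```python
from html.parser import HTMLParser

class FormFeatures(HTMLParser):
    """Collect simple per-form features: field counts and a 'search' hint."""
    def __init__(self):
        super().__init__()
        self.text_inputs = 0
        self.password_inputs = 0
        self.has_search_word = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            kind = (a.get("type") or "text").lower()
            if kind == "text":
                self.text_inputs += 1
            elif kind == "password":
                self.password_inputs += 1
        # 'search' in any attribute value hints at a searchable form
        for v in a.values():
            if v and "search" in v.lower():
                self.has_search_word = True

def looks_searchable(form_html: str) -> bool:
    """Toy heuristic: at least one text field, no password field,
    and a 'search' hint somewhere in the markup."""
    f = FormFeatures()
    f.feed(form_html)
    return f.text_inputs >= 1 and f.password_inputs == 0 and f.has_search_word

search_form = '<form action="/search"><input type="text" name="q"></form>'
login_form = ('<form action="/login"><input type="text" name="user">'
              '<input type="password" name="pw"></form>')
```

This correctly separates the toy search form from the toy login form, but it also shows why the problem is hard: real forms vary so much in structure and vocabulary that hand-written rules like these break down, motivating a learned classifier.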
  • Building directory of deep web resources
    • 3. Classify search interfaces (and, hence, databases) into a subject hierarchy
      • One of the challenges: some interfaces belong to several domains
  • Interface crawler
    • I-Crawler is a system that automatically discovers search interfaces and identifies the main subject of the underlying database
      • Deals with JavaScript-rich and non-HTML forms
      • Uses a binary domain-independent classifier for identifying searchable web forms
      • Divides all forms into two groups: u-forms (those with one or two visible fields) and s-forms (the rest)
      • U- and s-forms are processed differently: u-forms are classified using query probing [Bergholz and Chidlovskii, 2003; Gravano et al., 2003]
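
The u-form/s-form split described above could be sketched as a count of visible fields; the field representation and routing labels below are assumptions for illustration.

```python
def count_visible_fields(fields):
    """fields: list of dicts like {'tag': 'input', 'type': 'text'}.
    Hidden inputs do not count toward visibility (assumption)."""
    visible = 0
    for f in fields:
        if f["tag"] == "input" and f.get("type") == "hidden":
            continue
        visible += 1
    return visible

def route(fields):
    """u-forms (one or two visible fields) go to query probing;
    s-forms (the rest) go to the feature-based classifier."""
    return "u-form" if count_visible_fields(fields) <= 2 else "s-form"

simple = [{"tag": "input", "type": "text"},
          {"tag": "input", "type": "hidden"}]
complex_form = [{"tag": "input", "type": "text"},
                {"tag": "select"},
                {"tag": "input", "type": "checkbox"}]
```

The split makes sense because a one- or two-field form carries almost no vocabulary to classify on, so probing it with queries and inspecting the results is more informative than inspecting the form itself.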
  • Interface crawler: architecture
  • Experiments and results
    • Tested the Interface Identification component
    • Datasets:
      • Dataset 1: 216 searchable (HTML) web forms from the UIUC repository, plus 90 searchable web forms (60 HTML forms and 30 JS-rich or non-HTML forms) and 300 non-searchable forms (270 and 30) added by us
      • Dataset 2: only the s-forms from dataset 1
      • Dataset 3: 264 searchable forms and 264 non-searchable forms (all in Russian)
      • Dataset 4: 90 searchable u-forms and 120 non-searchable u-forms
    • Learning with two thirds of each dataset and testing on the remaining third
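
The evaluation protocol (learn on two thirds of each dataset, test on the remaining third) can be sketched as follows; the toy data and the one-rule stand-in for the decision tree are purely illustrative assumptions.

```python
import random

def split_two_thirds(items, seed=0):
    """Shuffle, then take the first 2/3 for training and the rest for testing."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = 2 * len(items) // 3
    return items[:cut], items[cut:]

# Toy dataset: (n_text_fields, is_searchable) pairs standing in for form features.
data = [(1, True), (2, True), (3, True),
        (0, False), (0, False), (1, False)] * 10

train, test = split_two_thirds(data)

def predict(n_text_fields):
    """Toy stand-in for the learned decision tree."""
    return n_text_fields >= 1

accuracy = sum(predict(n) == y for n, y in test) / len(test)
```

A real run would fit the decision tree on `train` and report accuracy on `test`; the fixed seed keeps the split reproducible.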
  • Experiments and results
    • Used the decision tree to detect search interfaces on real web sites
    • Three groups of web sites:
      • 150 deep web sites (in Russian)
      • 150 sites randomly selected from “Recreation” category of http://www.dmoz.org
      • 150 sites randomly selected based on IP addresses
    • All sites in each group were crawled to depth 5
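
The depth-5 crawl can be sketched as a breadth-first traversal with a depth cutoff; the in-memory link graph below stands in for real HTTP fetching.

```python
from collections import deque

def crawl_to_depth(start, links, max_depth=5):
    """Breadth-first crawl from `start`, following `links[page]`,
    stopping at `max_depth` hops from the root page."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not expand pages at the cutoff depth
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Hypothetical site: a chain long enough to show the cutoff in action.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"],
        "/d": ["/e"], "/e": ["/f"], "/f": ["/g"]}
pages = crawl_to_depth("/", site, max_depth=5)
```

Pages up to five hops from the root ("/f") are visited, while "/g" at depth six is not; in the experiment, each visited page would then be fed to the interface-identification step.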
  • Discussion and conclusion
    • One specific use of the I-Crawler: deep web characterization (i.e., how many deep web resources are on the Web)
      • Hence, while false positives are OK, false negatives are not OK (resources are ignored)
    • Root pages of deep web sites are good starting points for discovering more databases
    • JS-rich and non-HTML forms are becoming more and more popular
      • Recognizing them is essential
    • Nowadays more and more content owners provide APIs to their data, databases, etc.
      • Need for API-discovery techniques
  • Thank you! Questions?