On building a search interface discovery system



Slides of my talk at RED'09 workshop




    Presentation Transcript

    • On building a search interface discovery system
      • Denis Shestakov
      • Helsinki University of Technology, Finland
      • fname.lname at tkk dot fi
    • Outline
      • Background: search interfaces & deep Web
      • Motivation
      • Building directory of deep web resources
      • Interface Crawler
      • Experiments & results
      • Discussion & conclusion
    • Background
      • Search engines (e.g., Google) do not crawl and index a significant portion of the Web
      • Information in the non-indexable part of the Web cannot be found or accessed via search engines
      • Important type of web content which is badly indexed:
        • Web pages generated based on parameters provided by users via search interfaces
        • Typically, these pages contain ‘high-quality’ structured content (e.g., product descriptions)
        • Search interfaces are entry-points to myriads of databases on the Web
      • The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)
      • Also, see VLDB’09 papers presented yesterday: Lixto & Kosmix
    • Background: example AutoTrader search form (http://autotrader.com/) :
    • Background: deep Web numbers & misconceptions
      • Size of the deep Web:
        • 400 to 550 times larger than the indexable Web according to a 2001 survey; in fact, it is not that big
        • Comparable with the size of the indexable Web [indirect support in tech. report by Shestakov & Salakoski]
      • Content of some (well, of many) web databases is, in fact, indexable:
        • Correlation with database subjects: content of books/movies/music databases (i.e., relatively ‘static’ data) is indexed well
        • Search engines’ crawlers do go behind web forms [see VLDB’08 work by Madhavan et al.]
      • Total number of web databases:
        • Survey of Apr’04 by Chang et al.: 450 000 web databases
        • This was an underestimate
        • Now, in 2009, several million databases are available online
    • Motivation
      • Several million databases are available online …
      • To access a database, a user needs to know its URL
      • But there are directories/lists of databases, right?
        • The biggest, Completeplanet.com, includes 70,000 resources
        • Manually created and maintained by domain specialists, such as the Molecular Biology Database Collection with 1170 summaries of bioinformatics databases in 2009
      • Essentially, we currently have no idea about the location of most deep web resources:
        • And content of many of these databases is either not indexed or poorly indexed
        • I.e., undiscovered resources with unknown content
      • Directories of online databases corresponding to the scale of deep Web are needed
    • Motivation
      • Building such directories requires a technique for finding search interfaces
        • A database on the Web is identifiable by its search interface
      • For any given topic there are too many web databases with relevant content: resource discovery has to be automatic
      • One specific application: general web search
        • Transactional queries (i.e., find a site where further interaction will happen)
        • For example, if a query suggests that a user wants to buy/sell a car, search results should contain links to pages with web forms for car search
    • Building directory of deep web resources
      • 1. Visit as many pages that potentially have search interfaces as possible
        • (Dozens of) billions of web pages vs. millions of databases
        • Visiting a page with a search interface during a ‘regular’ crawl is a rare event
        • It is even rarer if databases of interest belong to a particular domain
        • Thus, some visiting (or crawling) strategy could be very helpful
    • Building directory of deep web resources
      • 2. Recognize search interface on a web page (focus in this work)
        • Forms have great variety in structure and vocabulary
        • JavaScript-rich and non-HTML forms (e.g., in Flash) have to be recognized
    • Building directory of deep web resources
      • 3. Classify search interfaces (and, hence, databases) into subject hierarchy
        • One of the challenges: some interfaces belong to several domains
    • Interface crawler
      • I-Crawler is a system to automatically discover search interfaces and identify the main subject of an underlying database
        • Deals with JavaScript-rich and non-HTML forms
        • Uses a binary domain-independent classifier for identifying searchable web forms
        • Divides all forms into two groups: u-forms (those with one or two visible fields) and s-forms (the rest)
        • U- and s-forms are processed differently: u-interfaces are classified using query probing [Bergholz and Chidlovskii, 2003; Gravano et al., 2003]
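
The u-form/s-form split described above can be sketched roughly as follows. This is a minimal illustration, not the I-Crawler's actual code: the class `FieldCounter`, the function `form_group`, and in particular the rule for what counts as a "visible field" (any `input` that is not hidden or a button, plus `select` and `textarea`) are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: these input types are not counted as visible fields.
NON_FIELD_INPUT_TYPES = {"hidden", "submit", "button", "reset", "image"}


class FieldCounter(HTMLParser):
    """Counts visible fields in an HTML form fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.visible_fields = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input":
            # An input without a type attribute defaults to a text field.
            input_type = (attr_map.get("type") or "text").lower()
            if input_type not in NON_FIELD_INPUT_TYPES:
                self.visible_fields += 1
        elif tag in ("select", "textarea"):
            self.visible_fields += 1


def form_group(form_html):
    """Return 'u-form' for one or two visible fields, 's-form' for the rest."""
    counter = FieldCounter()
    counter.feed(form_html)
    counter.close()
    return "u-form" if 1 <= counter.visible_fields <= 2 else "s-form"
```

A keyword box with a submit button would fall into the u-form group under these assumptions, while a form with three or more selectable fields would be an s-form.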
    • Interface crawler: architecture
    • Experiments and results
      • Tested the Interface Identification component
      • Datasets:
        • Dataset 1: 216 searchable (HTML) web forms from the UIUC repository, plus 90 searchable web forms (60 HTML forms and 30 JS-rich or non-HTML forms) and 300 non-searchable forms (270 and 30, respectively) added by us
        • Dataset 2: only s-forms from dataset 1
        • Dataset 3: 264 searchable forms and 264 non-searchable forms (all in Russian)
        • Dataset 4: 90 searchable u-forms and 120 non-searchable u-forms
      • Learning with two thirds of each dataset and testing on the remaining third
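
A two-thirds/one-third split like the one mentioned above might look like the sketch below. The slides do not say how forms were assigned to each part, so the random shuffle, the fixed `seed`, and the function name `split_two_thirds` are assumptions for illustration only.

```python
import random


def split_two_thirds(forms, seed=0):
    """Shuffle the forms and return (training part, testing part).

    Assumption: a plain random split; the original experiments may have
    split the data differently (e.g., per class).
    """
    shuffled = list(forms)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # two thirds go to training
    return shuffled[:cut], shuffled[cut:]


# Example sized like the u-form dataset: 90 searchable + 120 non-searchable.
forms = [("searchable", i) for i in range(90)] + [("non-searchable", i) for i in range(120)]
train, test = split_two_thirds(forms)
```

With 210 forms in total, this gives 140 forms for learning and 70 for testing.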
    • Experiments and results
    • Experiments and results
      • Used the decision tree to detect search interfaces on real web sites
      • Three groups of web sites:
        • 150 deep web sites (in Russian)
        • 150 sites randomly selected from “Recreation” category of http://www.dmoz.org
        • 150 sites randomly selected based on IP addresses
      • All sites in each group were crawled to depth 5
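
Crawling a site to a fixed depth can be sketched as a breadth-first traversal, as below. This is only an illustration under stated assumptions: the root page is taken to be depth 0, `get_links` is a hypothetical callback that returns the same-site links found on a page, and the in-memory `site` dictionary stands in for real page fetching.

```python
from collections import deque


def crawl_to_depth(root, get_links, max_depth=5):
    """Visit pages breadth-first, following links up to max_depth from root."""
    seen = {root}
    queue = deque([(root, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy site: a chain of pages p0 -> p1 -> ... -> p7 (hypothetical data).
site = {f"p{i}": [f"p{i + 1}"] for i in range(7)}
pages = crawl_to_depth("p0", lambda url: site.get(url, []), max_depth=5)
```

In this toy chain the crawl stops at `p5`, so pages deeper than five links from the root are never visited.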
    • Discussion and conclusion
      • One specific use of the I-Crawler: deep web characterization (i.e., estimating how many deep web resources are on the Web)
        • Hence, while false positives are OK, false negatives are not OK (resources are ignored)
      • Root pages of deep web sites are good starting points for discovering more databases
      • JS-rich and non-HTML forms are becoming more and more popular
        • Recognizing them is essential
      • Nowadays more and more content owners provide APIs to their data, databases, etc.
        • Need for API-discovery techniques
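
The point made on this slide about false positives being tolerable and false negatives not can be read as favoring recall over precision. A small sketch of how the two measures are computed is given below; the function name `precision_recall` and the toy labels are assumptions for illustration and are not results from the talk.

```python
def precision_recall(actual, predicted):
    """Compute (precision, recall) for boolean 'is searchable' labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)      # searchable, detected
    fp = sum(1 for a, p in pairs if not a and p)  # non-searchable, flagged
    fn = sum(1 for a, p in pairs if a and not p)  # searchable, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: every searchable form is found, one extra form is flagged.
actual = [True, True, True, False, False]
predicted = [True, True, True, True, False]
precision, recall = precision_recall(actual, predicted)
```

Here no searchable form is missed, so recall is 1.0, while the one false positive lowers precision to 0.75; for characterization purposes the slide suggests this is the acceptable side of the trade-off.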
    • Thank you! Questions?