Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizing, 12.06.2008

Search Interfaces on the Web:
Querying and Characterizing
Lectio Praecursoria
12.06.2008
Denis Shestakov
denis.shestakov@utu.fi
Department of Information Technology, University of Turku
Turku Centre for Computer Science

Background
• Search engines (e.g., Google) do not crawl and index a significant portion of the Web
• Information in the non-indexable part of the Web cannot be found or accessed via search engines
• An important type of web content that is badly indexed:
  • web pages generated based on parameters provided by users via search interfaces
• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
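
To make "web pages generated based on parameters" concrete, here is a minimal sketch of how filled-in form fields become the request that produces such a page. It assumes a form submitted with the GET method; the host `example.com` and the field names `make`, `model` and `max_price` are hypothetical and not taken from any real site.

```python
from urllib.parse import urlencode

def build_query_url(action_url, fields):
    """Return the URL requested when a GET form is submitted with these fields."""
    # Each field becomes a name=value pair in the query string.
    return action_url + "?" + urlencode(fields)

# Hypothetical form action and field names, for illustration only.
url = build_query_url(
    "http://example.com/search",
    {"make": "Ford", "model": "Focus", "max_price": 5000},
)
print(url)
```

A crawler that only follows hyperlinks never sees this URL unless some page links to it, which is one reason such result pages tend to stay unindexed.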

Background
• The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)
• Search interfaces are entry points to myriads of databases on the Web
• The central problem:
  • High-quality, publicly available data stored in a huge number of databases is available only via search interfaces (to access a database of interest, a user has to know the location of its search interface)
  • Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)



Example of a search interface & search results
AutoTrader search form (http://autotrader.com/)



Deep Web: numbers & misconceptions
• Number of web databases:
  • Survey in April 2004: 450 000 web databases (and this is an underestimate)
• Size of the deep Web:
  • Survey of 2001: 400 to 550 times larger than the indexable Web; but this is an overestimate
  • No other reliable estimates of the entire size exist
  • According to my own indirect assessments: comparable in size to the indexable Web
• Content of some web databases is, in fact, indexable:
  • No reliable estimates, but one can expect that about one fourth is indexed
  • Correlation with database subjects: content of books/movies/music databases (relatively ‘static’ data) is indexed well
  • But even when known to search engines, the data is often outdated



Thesis contributions:
querying search interfaces
• An approach to automating querying and retrieving information behind search interfaces
• Essential in the case of complex queries
• A form query language for formulating queries and extracting useful information from the result pages
• A prototype system for querying web databases
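
The extraction step can be illustrated with a minimal sketch that turns a result page into structured records. This is not the thesis's form query language or prototype: it assumes the results arrive as a plain HTML table with a header row, and the sample page and its column names are hypothetical.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def extract_records(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical result page, for illustration only.
RESULTS = """
<table>
<tr><th>make</th><th>model</th><th>price</th></tr>
<tr><td>Ford</td><td>Focus</td><td>4500</td></tr>
<tr><td>Ford</td><td>Fiesta</td><td>3900</td></tr>
</table>
"""
records = extract_records(RESULTS)
print(records)
```

Real result pages are rarely this regular, which is part of why a dedicated query and extraction language is useful.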



Thesis contributions:
characterization of the deep Web
• Previous surveys are based on the study of deep web resources mainly in English
• Two new methods for characterizing the deep Web
• Two surveys of one national (Russian) segment of the Web
• Dataset describing more than 200 web databases (statistically reliable)



Thesis contributions:
finding web databases
• For any given topic there are too many web databases with relevant content: discovery automation is required
• A system for finding and classifying search interfaces
• Intended for:
  • Deep Web characterization studies
  • Building directories of web databases
• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
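
As a rough illustration of what finding search interfaces involves, the sketch below scans a page for HTML forms and keeps those that look like search forms. The rule used here (at least one text field and no password field) is an assumed toy heuristic, not the classifier from the thesis, and unlike the system described above it does not handle JavaScript-rich or non-HTML forms. The sample page is hypothetical.

```python
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Record, for each <form>, its action and a few simple features."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # features of the form being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "text_inputs": 0,
                "has_password": False,
            }
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            input_type = (attrs.get("type") or "text").lower()
            if input_type in ("text", "search"):
                self._current["text_inputs"] += 1
            elif input_type == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def find_search_forms(html):
    """Return the actions of forms matching the assumed heuristic."""
    parser = FormFinder()
    parser.feed(html)
    parser.close()
    return [
        form["action"]
        for form in parser.forms
        if form["text_inputs"] >= 1 and not form["has_password"]
    ]

# Hypothetical page with a login form and a search form.
PAGE = """
<form action="/login">
  <input type="text" name="user"><input type="password" name="pw">
</form>
<form action="/cars/search">
  <input type="text" name="make">
  <select name="year"><option>2008</option></select>
</form>
"""
candidates = find_search_forms(PAGE)
print(candidates)
```

The login form is rejected because of its password field, so only the second form is reported as a candidate.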



Applications
• Web search engines:
  • Eager to improve their coverage of the Web
  • In April 2008 Google announced they were experimenting with their form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)
• Information owners and providers:
  • Typically want to disseminate their (publicly available) information
  • Interested in discovery methods, as they want their resources to be discovered and searched
• Vertical/topical search engines:
  • Find information on a specialized topic
  • Need methods to extract data from relevant resources and aggregate it



Future work
• The most promising direction: discovery of web databases
• The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web
  • Specialized directories already exist
  • Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources
  • Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)

