Search Interfaces on the Web:
Querying and Characterizing
Lectio Praecursoria
12.06.2008
Denis Shestakov
denis.shestakov@utu.fi
Department of Information Technology, University of Turku
Turku Centre for Computer Science
Background
• Search engines (e.g., Google) do not crawl and index a significant portion of the Web
• Information in the non-indexable part of the Web cannot be found or accessed via search engines
• An important type of web content that is poorly indexed:
  • web pages generated from parameters that users provide via search interfaces
• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
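To make the last two points concrete, here is a minimal sketch of how a result page is addressed by form parameters. It assumes a simple GET form; the action URL, the field names and the values are all invented for illustration. Building the URL is the easy part. What makes the task hard for a robot is knowing which values are worth submitting.

```python
from urllib.parse import urlencode

def build_query_url(action_url, fields):
    """Return the URL a browser would request when a GET form is submitted."""
    return action_url + "?" + urlencode(fields)

# Hypothetical form: the action URL and the field names are assumptions,
# not taken from any real site.
url = build_query_url(
    "http://example.com/search",
    {"make": "Volvo", "max_price": 5000, "zip": "20520"},
)
print(url)  # http://example.com/search?make=Volvo&max_price=5000&zip=20520
```

A crawler that only follows hyperlinks never constructs such a URL, so the page behind it stays unindexed unless someone links to it directly.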
Background
• The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)
• Search interfaces are entry points to myriads of databases on the Web
• The central problem:
  • High-quality, publicly available data stored in a huge number of databases is available only via search interfaces (to access a database of interest, a user has to know the location of its search interface)
  • Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)

Example of a search interface & search results
AutoTrader search form (http://autotrader.com/)

Deep Web: numbers & misconceptions
• Number of web databases:
  • Survey in April 2004: 450 000 web databases (and this is an underestimate)
• Size of the deep Web:
  • A 2001 survey put it at 400 to 550 times larger than the indexable Web, but it is not that big
  • No other reliable estimates of the entire size exist
  • According to my own indirect assessments: comparable with the size of the indexable Web
• Content of some web databases is, in fact, indexable:
  • No reliable estimates exist, but roughly one fourth can be expected to be indexed
  • Correlation with database subjects: content of books/movies/music databases (relatively ‘static’ data) is indexed well
  • But even when known to search engines, the data is often outdated

Thesis contributions:
querying search interfaces
• An approach to automating querying and retrieving information behind search interfaces
• Essential in the case of complex queries
• A form query language for formulating queries and extracting useful information from result pages
• A prototype system for querying web databases
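The extraction half of this task can be illustrated with a small sketch. It is not the thesis's form query language or prototype, only a standard-library example of turning a result page into records. The result page below is invented, and it assumes the records sit in plain table rows, which real result pages rarely do so neatly.

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None

# Invented result page (an assumption for illustration only).
RESULT_PAGE = """
<table>
  <tr><th>Model</th><th>Year</th><th>Price</th></tr>
  <tr><td>Volvo V70</td><td>2003</td><td>4900</td></tr>
  <tr><td>Volvo S40</td><td>2001</td><td>3200</td></tr>
</table>
"""

extractor = RowExtractor()
extractor.feed(RESULT_PAGE)
records = extractor.rows
print(records)
```

Once results are available as records like these, a complex query (for example, one that combines answers from several forms) can be evaluated over them instead of being repeated by hand.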

Thesis contributions:
characterization of the deep Web
• Previous surveys are based on studying deep web resources mainly in English
• Two new methods for characterizing the deep Web
• Two surveys of one national (Russian) segment of the Web
• A dataset describing more than 200 web databases (statistically reliable)

Thesis contributions:
finding web databases
• For any given topic there are too many web databases with relevant content: discovery automation is required
• A system for finding and classifying search interfaces
• Intended for:
  • Deep Web characterization studies
  • Building directories of web databases
• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
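As a rough illustration of what classifying forms involves, the sketch below separates candidate search interfaces from other forms such as login boxes. The rule is a toy heuristic chosen for this example, not the classifier used in the thesis, and the page is invented. It also covers only plain HTML forms, so it is exactly the kind of approach that misses JavaScript-rich and non-HTML forms.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Record, for each <form>, the types of the fields it contains."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field types per form
        self._current = None  # field types of the form being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # HTML treats a missing type attribute as "text".
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def looks_like_search_form(field_types):
    """Toy rule (an assumption): no password field, at least one fillable field."""
    if "password" in field_types:
        return False
    return any(t in ("text", "search", "select") for t in field_types)

# Invented page with one login form and one search form.
PAGE = """
<form action="/login"><input name="user"><input type="password" name="pw"></form>
<form action="/find"><select name="make"></select><input type="text" name="q">
<input type="submit"></form>
"""

collector = FormCollector()
collector.feed(PAGE)
labels = [looks_like_search_form(f) for f in collector.forms]
print(labels)  # [False, True]
```

A rule this simple would mislabel many real forms (newsletter sign-ups, for instance), which is one reason automatic discovery needs a trained or more carefully engineered classifier.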

Applications
• Web search engines:
  • Eager to improve their coverage of the Web
  • In April 2008 Google announced that it was experimenting with a form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)
• Information owners and providers:
  • Typically want to disseminate their (publicly available) information
  • Interested in discovery methods, as they want their resources to be discovered and searched
• Vertical/topical search engines:
  • Find information on a specialized topic
  • Need methods to extract data from relevant resources and aggregate it

Future work
• The most promising direction: discovery of web databases
• The goal: building a relatively complete (Yahoo!-like) directory of databases on the Web
  • Specialized directories already exist
  • Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources
  • Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)

