Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizing, 12.06.2008

Lectio Praecursoria on my PhD dissertation titled "Search Interfaces on the Web: Querying and Characterizing", given in the ICT building, Turku, Finland, on June 12, 2008.

Thesis contributions:
* Querying search interfaces
* Deep Web characterization
* Finding web databases

The text of the thesis is available at http://www.slideshare.net/denshe/shestakov2008-search-interfacesonthewebqueryingandcharacterizing


  1. Search Interfaces on the Web: Querying and Characterizing
     Lectio Praecursoria, 12.06.2008
     Denis Shestakov, denis.shestakov@utu.fi
     Department of Information Technology, University of Turku
     Turku Centre for Computer Science
  2. Background
     • Search engines (e.g., Google) do not crawl and index a significant portion of the Web
     • Information in the non-indexable part of the Web cannot be found or accessed via search engines
     • An important type of web content that is badly indexed:
       • Web pages generated from parameters that users provide via search interfaces
       • Filling out a search form is a hard task for any automatic agent (e.g., search engines' robots)
  3. Background
     • The part of the Web 'behind' search interfaces is known as the deep Web (or hidden Web)
     • Search interfaces are entry points to myriads of databases on the Web
     • The central problem:
       • High-quality, publicly available data stored in a huge number of databases is accessible only via search interfaces (to access a database of interest, a user has to know the location of its search interface)
       • Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)
  4. Example of a search interface & search results
     [Screenshot: AutoTrader search form, http://autotrader.com/]
  5. Deep Web: numbers & misconceptions
     • Number of web databases:
       • Survey in April 2004: 450 000 web databases (and this is an underestimate)
     • Size of the deep Web:
       • Survey of 2001: 400 to 550 times larger than the indexable Web; but it is not that big
       • No other reliable estimates of the entire size exist
       • According to my own indirect assessments: comparable with the size of the indexable Web
     • Content of some web databases is, in fact, indexable:
       • No reliable estimates, but one can expect that about one fourth is indexed
       • Correlation with database subjects: content of books/movies/music databases (relatively 'static' data) is indexed well
       • But, even if known to search engines, the data is often outdated
  6. Thesis contributions: querying search interfaces
     • Approach to automate querying and retrieving information behind search interfaces
     • Essential in case of complex queries
     • A form query language that allows formulating queries and extracting useful information from the result pages
     • A prototype system for querying web databases
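To make the idea of automated form querying concrete, here is a minimal sketch in Python. It is not the form query language or the prototype from the thesis; it only assumes a plain HTML form submitted with GET. The class `FormFieldParser`, the function `form_query_url`, the sample page, and the `example.com` URL are all hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collect the action URL and the named fields of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None      # value of the form's action attribute
        self.fields = {}        # field name -> default value
        self._in_form = False
        self._done = False      # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif self._in_form and tag in ("input", "select") and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def form_query_url(page_url, html, values):
    """Return the URL that submitting the first form with `values` would request."""
    parser = FormFieldParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no form found on the page")
    unknown = set(values) - set(parser.fields)
    if unknown:
        raise ValueError(f"fields not present in the form: {sorted(unknown)}")
    filled = {**parser.fields, **values}
    return urljoin(page_url, parser.action) + "?" + urlencode(filled)


# Hypothetical search page, loosely modelled on a car-search form.
SAMPLE_PAGE = """
<form action="/results" method="get">
  <input type="text" name="make">
  <select name="year"><option value="2008">2008</option></select>
  <input type="submit" value="Search">
</form>
"""

print(form_query_url("http://example.com/cars/", SAMPLE_PAGE,
                     {"make": "Ford", "year": "2008"}))
```

A real agent would also have to handle POST forms, hidden fields, and extraction of data from the result pages, which this sketch leaves out.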
  7. Thesis contributions: characterization of the deep Web
     • Previous surveys are based on the study of deep web resources mainly in English
     • Two new methods for characterizing the deep Web
     • Two surveys of one national (Russian) segment of the Web
     • Dataset describing more than 200 web databases (statistically reliable)
  8. Thesis contributions: finding web databases
     • For any given topic there are too many web databases with relevant content: discovery automation is required
     • A system for finding and classifying search interfaces
     • Intended for:
       • Deep Web characterization studies
       • Building directories of web databases
     • Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
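As a rough illustration of what finding search interfaces involves, the sketch below separates likely search forms from login forms on a static HTML page. The rule used here (a form with a free-text or selection field and no password field) is an assumption made for this example, not the classifier described in the thesis, and unlike the thesis system it does not handle JavaScript-rich or non-HTML forms. `FormCollector`, `looks_like_search_form`, and `find_search_forms` are hypothetical names.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Record, for every form on a page, the list of its field types."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of field types per form
        self._current = None   # field list of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif self._current is not None and tag == "input":
            # An input without a type attribute defaults to a text field.
            self._current.append((attrs.get("type") or "text").lower())
        elif self._current is not None and tag in ("select", "textarea"):
            self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_like_search_form(field_types):
    """Assumed heuristic: has a query-like field and no password field."""
    has_query_field = any(t in ("text", "search", "select") for t in field_types)
    return has_query_field and "password" not in field_types


def find_search_forms(html):
    """Return the field-type lists of forms that pass the heuristic."""
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if looks_like_search_form(f)]


# Hypothetical page with one login form and one search form.
PAGE = """
<form action="/login"><input type="text" name="user"><input type="password" name="pw"></form>
<form action="/search"><input name="q"><input type="Submit" value="Go"></form>
"""

print(find_search_forms(PAGE))
```

A heuristic this simple would misclassify many real forms (newsletter sign-ups, for example), which is one reason discovery at scale calls for a trained classifier rather than a fixed rule.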
  9. Applications
     • Web search engines:
       • Eager to improve their coverage of the Web
       • In April 2008 Google announced they were experimenting with their form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)
     • Information owners and providers:
       • Typically want to disseminate their (publicly available) information
       • Interested in discovery methods, as they want their resources to be discovered and searched
     • Vertical/topical search engines:
       • Find information on a specialized topic
       • Need methods to extract data from relevant resources and aggregate it
  10. Future work
     • The most promising direction: discovery of web databases
     • The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web
       • Specialized directories already exist
       • Several 'universal' directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources
       • Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)
