Deep Web: Databases on the Web
Turku Centre for Computer Science, Finland
I N T R O D U C T I O N
Finding information on the Web using a web search engine is one of the primary
activities of today's web users. For a majority of users, the results returned by
conventional search engines are an essentially complete set of links to all pages on the
Web relevant to their queries. However, current-day search engines do not crawl and
index a significant portion of the Web and, hence, web users relying on search engines
alone are unable to discover and access a large amount of information in the
non-indexable part of the Web. Specifically, dynamic pages generated from parameters
provided by a user via web search forms are not indexed by search engines and cannot be
found in search results.
Such search interfaces provide web users with online access to myriads of databases
on the Web. To obtain information from a web database of interest, a user issues a query
by specifying query terms in a search form and receives the query results: a set of
dynamic pages that embed the required information from the database. At the same time,
issuing a query via an arbitrary search interface is an extremely complex task for any
kind of automatic agent, including web crawlers, which, at least up to the present day, do
not even attempt to pass through web forms on a large scale.
Content provided by many web databases is often of very high quality and can be
extremely valuable to many users. For example, the PubMed database
(http://www.pubmed.gov) allows a user to search through millions of high-quality peer-
reviewed papers on biomedical research, while the AutoTrader car classifieds database at
http://autotrader.com is highly useful for anyone wishing to buy or sell a car. In general,
since each searchable database is a collection of data in a specific domain, it can often
provide more specific and detailed information that is unavailable or hard to find in the
indexable Web. The following section provides background information on the non-
indexable Web and web databases.
B A C K G R O U N D
Conventional web search engines index only a portion of the Web, called the publicly
indexable Web, which consists of publicly available web pages reachable by following
hyperlinks. The rest of the Web known as the non-indexable Web can be roughly divided
into two large parts. The first is the part consisting of protected web pages, i.e., pages that
are password-protected, marked as non-indexable by webmasters using the Robots
Exclusion Protocol, or only available when visited from certain networks (i.e., corporate
intranets). Web pages accessible via web search forms (or search interfaces) comprise the
second part of non-indexable Web known as the deep Web (Bergman, 2001) or the
hidden Web (Florescu, Levy & Mendelzon, 1998). The deep Web is not completely
unknown to search engines: some web databases can be accessed not only through web
forms but also through link-based navigational interfaces (e.g., a typical shopping site
usually allows browsing products in some subject hierarchy). Thus, a certain part of the
deep Web is, in fact, indexable, as web search engines can technically reach the content
of those databases through their browse interfaces. The separation of the Web into
indexable and non-indexable portions is summarized in Figure 1. The deep Web, shown
in gray, is formed by those public pages that are accessible via web search forms.
Figure 1. Indexable and non-indexable portions of the Web and deep Web
Figure 2. User interaction with web database
A user's interaction with a web database is depicted schematically in Figure 2. A web
user queries a database via a search interface located on a web page. First, search
conditions are specified in a form and submitted to a web server. Middleware (often
called a server-side script) running on the web server processes the user's query by
transforming it into a proper format and passing it to an underlying database. Second, the
server-side script generates a resulting web page by embedding the results returned by
the database into a page template and, finally, the web server sends the generated page
back to the user. Frequently, results do not fit on one page and, therefore, the page
returned to a user contains some sort of navigation through the other pages with results.
The resulting pages
are often termed data-rich or data-intensive (Crescenzi, Mecca & Merialdo, 2001).
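The request-response cycle described above can be sketched in a few lines of Python. The endpoint and field names below are purely illustrative, not taken from any real site; the comments mark which part of the cycle each step corresponds to.

```python
from urllib.parse import urlencode

# Hypothetical form action URL; a real one would come from the page's HTML.
SEARCH_ACTION = "http://example.com/search"

def build_query_url(action, fields):
    """Mimic what a browser does on submitting a GET-method form:
    serialize the field/value pairs into a query string."""
    return action + "?" + urlencode(fields)

# Step 1: the user's search conditions are serialized and sent to the server.
url = build_query_url(SEARCH_ACTION, {"make": "Toyota", "model": "Corolla"})
# Step 2 (server side): middleware parses these parameters, queries the
# back-end database, and embeds the results into a page template.
# Step 3: the generated page with results is sent back to the user.
```

The sketch covers only the client's part of the cycle; the middleware and template steps exist only as comments because they run on the server.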
While unstructured content dominates the publicly available Web, the deep Web is a
huge source of structured data. Web databases can be broadly classified into two types:
1) structured web databases, which provide data objects as structured, relational-like
records with attribute-value pairs, and 2) unstructured web databases, which provide data
objects as unstructured media (e.g., text, images, audio, video) (Chang et al., 2004). For example,
being a collection of research paper abstracts, the PubMed database mentioned earlier is
considered an unstructured database, while autotrader.com has a structured database
for car classifieds, which returns structured car descriptions (e.g., make = “Toyota”,
model = “Corolla”, price = $5000). According to the survey of Chang et al. (2004),
structured databases dominate in the deep Web: approximately four in five web
databases are structured. The distinction between unstructured and structured databases is
meaningful since unstructured content provided by web databases can be handled
adequately well by web search engines. Structured content is different in this respect: to
be fully utilized, it must first be extracted from data-intensive pages and then processed
in a way similar to instances of a traditional database.
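The difference can be illustrated with a small sketch: once structured records are extracted as attribute-value pairs, relational-style operations become possible that cannot be applied directly to opaque text. The records and the selection helper below are invented for illustration.

```python
def select(records, attr, pred):
    """Relational-style selection over structured records -- the kind of
    processing that becomes possible only after extraction."""
    return [r for r in records if attr in r and pred(r[attr])]

# Structured records, as a structured web database would return them.
cars = [
    {"make": "Toyota", "model": "Corolla", "price": 5000},
    {"make": "Ford", "model": "Focus", "price": 7500},
]

# An unstructured database would return opaque text instead, e.g.:
# "1998 Toyota Corolla for sale, good condition, $5000."

cheap = select(cars, "price", lambda p: p <= 5000)
```

A keyword index handles the text form adequately, but a price-range query like the one above requires the attribute-value structure.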
C H A L L E N G E S O F D E E P W E B
Deep Web Characterization
Though the term deep Web was coined in 2000 (Bergman, 2001), a long time ago by
web standards, many important characteristics of the deep Web are still unknown. For
instance, the total size of the deep Web (i.e., how much information is stored in web
databases) can only be guessed at. At the same time, the existing estimates for the total
number of deep web sites are considered reliable and consistent. It has been estimated
that 43,000-96,000 deep web sites existed in March 2000 (Bergman, 2001). A three- to
sevenfold increase over the following four years was reported in the survey of Chang et
al. (2004), which estimated the total numbers of deep web sites and web databases in
April 2004 at around 310,000 and 450,000 respectively. Moreover, even national or
domain-specific subsets of deep web resources are very large collections. For example,
in summer 2005 there were around 7,300 deep web sites in the Russian part of the Web
alone (Shestakov & Salakoski, 2007), and the bioinformatics community by itself has
been maintaining almost one thousand molecular biology databases freely available on
the Web (Galperin, 2007).
As mentioned earlier, the deep Web is partially indexed by web search engines because
the content of some web databases is accessible both via hyperlinks and via search
interfaces. According to rough estimates based on the analysis of 80 databases in eight
domains (Chang et al., 2004), around 25% of deep web data is covered by Google
(http://google.com). At the same time, most deep web data indexed by Google is out-of-
date, i.e., pages returned by Google contain stale information: only one in five indexed
pages is fresh.
Querying Web Databases
The Hidden Web Exposer (HiWE) (Raghavan & Garcia-Molina, 2001) is an example of
a deep web crawler, i.e., a system that gives a web crawler the ability to automatically
fill out search interfaces. Each time HiWE finds a web page with a search interface, it
performs the following three steps:
1. Constructs a logical representation of a search form by associating a field label
(descriptive text that helps a user to understand the semantics of a form field) to each
form field. For example, as shown in Figure 2, “Car Make” is a field label for the
upper form field.
2. Takes values from a pre-existing value set (initially provided by a user) and submits
the form by assigning the most relevant values to form fields.
3. Analyzes the response and, if it is found valid, retrieves the resulting pages and stores
them in a repository (whose pages can then be indexed to support user queries, just
like in a conventional search engine).
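The three steps above can be condensed into a toy crawl loop. All functions here are simplified stand-ins: the real HiWE uses layout analysis to find labels and ranking heuristics to choose values, while this sketch reduces "most relevant value" to an exact label lookup.

```python
def match_labels_to_values(form_labels, value_set):
    """Step 2 (simplified): pick known values for each labeled field."""
    return {label: value_set[label] for label in form_labels if label in value_set}

def crawl_form(form_labels, value_set, submit):
    """Steps 1-3: build the assignment, submit the form, keep valid results."""
    assignment = match_labels_to_values(form_labels, value_set)
    if not assignment:                  # nothing we know how to fill in
        return None
    result_page = submit(assignment)    # step 2: form submission
    if result_page:                     # step 3: keep only valid responses
        return result_page
    return None

# A user-provided value set, as in HiWE's initial configuration.
values = {"Car Make": "Toyota", "Car Model": "Corolla"}
page = crawl_form(["Car Make", "Car Model"], values,
                  submit=lambda a: f"results for {a['Car Make']}")
```

The `submit` callback stands in for the actual HTTP form submission, which keeps the loop testable without a network.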
One of the major challenges for deep web crawlers such as HiWE, as well as for any
other automatic agents that query web databases, is understanding the semantics of a
search interface. Web search forms are created and designed for human beings, not for
computer applications. Consequently, such a simple task (from a user's perspective) as
filling out a search form turns out to be a computationally hard problem. Recognition of
the field labels that describe the fields of a search interface (see Figure 2) is challenging
since there are myriads of web coding practices and, thus, no common rules regarding
the relative position, in the HTML code of a web page, of the two elements that declare a
field label and its corresponding field. HiWE introduces the Layout-based Information
Extraction Technique (LITE) (Raghavan & Garcia-Molina, 2001), in which the physical
layout of a page is used, in addition to its code, to aid extraction. Using a layout engine,
LITE identifies the pieces of text, if any, that are physically closest to a form field in the
horizontal and vertical directions. These pieces of text (potential field labels) are then
ranked based on several heuristics, and the highest-ranked candidate is taken as the field
label associated with the form field.
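A toy version of LITE's distance heuristic might look as follows. The coordinates are invented; in the real system they would come from a layout engine rendering the page, and the distance measure is only one of several ranking heuristics.

```python
import math

def nearest_label(field_pos, candidates):
    """candidates: {text: (x, y)} rendered positions of nearby text pieces.
    Return the candidate text physically closest to the form field."""
    fx, fy = field_pos
    return min(candidates,
               key=lambda t: math.hypot(candidates[t][0] - fx,
                                        candidates[t][1] - fy))

# The field at (100, 40); "Car Make" sits just to its left, while a page
# heading is far below -- so "Car Make" wins.
label = nearest_label((100, 40), {
    "Car Make": (30, 40),
    "Search our site": (100, 300),
})
```

The point of the layout-based approach is visible even in this reduction: the two text pieces may be arbitrarily far apart in the HTML source, yet their rendered positions still identify the label.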
A different approach to automatically issuing queries to textual (unstructured) databases
was presented by Callan and Connell (2001). The method, called query-based sampling,
exploits the results of an initial query to extract new query terms, which are then
submitted, and so on iteratively. The approach has been applied with some success to
siphon data hidden behind keyword-based search interfaces (Barbosa & Freire, 2004).
Ntoulas et al. (2005) proposed an adaptive algorithm to select the best keyword queries,
namely those that return a large fraction of the database content. In particular, it has been
shown that with only 83 queries it is possible to download almost 80% of all documents
stored in the PubMed database. Query selection for crawling structured web databases
was studied by Wu et al. (2006), who modeled a structured database as an attribute-value
graph. Query-based web database crawling is then transformed into a graph-traversal
activity, parallel to the traditional problem of crawling the publicly indexable Web by
following hyperlinks.
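The query-based sampling loop can be sketched as below. The `fetch` callback stands in for submitting a keyword to a real search interface, and the tiny in-memory document list replaces an actual database; the policy of taking the two most frequent unseen terms per round is a deliberate simplification of the adaptive selection studied by Ntoulas et al.

```python
from collections import Counter

def sample_database(fetch, seed, rounds=3):
    """Iteratively query, extract new terms from the results, re-query."""
    seen_docs, issued = set(), {seed}
    frontier = [seed]
    for _ in range(rounds):
        term_counts = Counter()
        for term in frontier:
            for doc in fetch(term):
                seen_docs.add(doc)
                term_counts.update(doc.lower().split())
        # The most frequent not-yet-issued terms become the next queries.
        frontier = [t for t, _ in term_counts.most_common()
                    if t not in issued][:2]
        issued.update(frontier)
    return seen_docs

# A tiny 'database' standing in for a keyword search interface.
DOCS = ["gene expression data", "gene regulation", "expression profiles"]
docs = sample_database(lambda q: [d for d in DOCS if q in d.split()], "gene")
```

Starting from the single seed "gene", the loop reaches even the document that does not contain the seed term, because "expression" is harvested from earlier results.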
Although HiWE provides a very effective approach to crawling the deep Web, it does
not extract data from the resulting pages. The DeLa (Data Extraction and Label
Assignment) system (Wang & Lochovsky, 2003) sends queries via search forms,
automatically extracts data objects from the resulting pages, fits the extracted data into a
table and assigns labels to the attributes of the data objects, i.e., to the columns of the
table. DeLa adopts HiWE to detect the field labels of search interfaces and to fill them
out. Its data extraction and label assignment approaches are based on two main
observations. First, data objects embedded in the resulting pages share a common tag
structure and are listed contiguously in the web pages. Thus, regular-expression
wrappers can be automatically generated to extract such data objects and fit them into a
table. Second, a search interface, through which web users issue their queries, gives a
sketch of (part of) the underlying database. Based on this, extracted field labels are
matched to the columns of the table, thereby annotating the attributes of the extracted
data.
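Both observations can be demonstrated with a hand-written wrapper. The HTML fragment and the pattern below are illustrative; DeLa induces such patterns automatically from the repeated tag structure rather than having them supplied.

```python
import re

# Two records sharing a common tag structure, listed contiguously.
RESULT_HTML = """
<tr><td>Toyota</td><td>Corolla</td><td>$5000</td></tr>
<tr><td>Ford</td><td>Focus</td><td>$7500</td></tr>
"""

# One row of the repeated structure; each <td> becomes a table column.
ROW_PATTERN = re.compile(
    r"<tr><td>(.*?)</td><td>(.*?)</td><td>\$(\d+)</td></tr>")

def extract_table(html, labels):
    """Fit the extracted records into a table, attaching field labels
    (taken from the search interface, as DeLa's second observation suggests)."""
    return [dict(zip(labels, m)) for m in ROW_PATTERN.findall(html)]

table = extract_table(RESULT_HTML, ["make", "model", "price"])
```

The labels "make", "model" and "price" play the role of field labels detected on the search form and matched to the table's columns.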
The HiWE system (and hence the DeLa system) works with relatively simple search
interfaces. Besides, it does not consider the navigational complexity of many deep web
sites. In particular, resulting pages may contain links to other web pages with relevant
information, and it is consequently necessary to navigate these links to evaluate their
relevance. Additionally, obtaining pages with results sometimes requires two or more
successful form submissions (as a rule, the second form is returned to refine or modify
the search conditions provided in the first). The DEQUE (Deep Web Query System)
(Shestakov, Bhowmick & Lim, 2005) addressed these challenges and proposed a web
form query language called DEQUEL, which allows a user to assign more values at a
time to a search interface's fields than is possible when the interface is filled out
manually. For example, data from relational tables, or data obtained by querying other
web databases, can be used as input values. Data extraction from resulting pages in
DEQUE is based on a modified version of the RoadRunner technique (Crescenzi,
Mecca & Merialdo, 2001).
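The multi-value idea can be sketched as follows: each combination of the supplied values yields one form submission, so a whole column of a relational table can drive a search interface. The field names and the `submit` stand-in are illustrative; this is not DEQUEL syntax, only the underlying evaluation scheme.

```python
from itertools import product

def multi_submit(field_values, submit):
    """field_values: {field: [candidate values]}. Submit the form once per
    combination of values and collect the resulting pages."""
    fields = list(field_values)
    return [submit(dict(zip(fields, combo)))
            for combo in product(*(field_values[f] for f in fields))]

# Two makes and one model give two submissions; the make list could just
# as well come from a relational table or another web database's results.
pages = multi_submit({"make": ["Toyota", "Ford"], "model": ["Corolla"]},
                     submit=lambda a: (a["make"], a["model"]))
```

Filling the interface manually would require one visit per combination; here the cartesian product is generated programmatically.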
Among research efforts devoted solely to information extraction from resulting pages,
Caverlee and Liu (2005) presented the Thor framework for sampling, locating, and
partitioning query-related content regions (called Query-Answer Pagelets, or QA-
Pagelets) from the deep Web. Thor is designed as a suite of algorithms supporting the
entire scope of a deep Web information platform. Its data extraction and preparation
subsystem supports a robust approach to the sampling, locating, and partitioning of
QA-Pagelets in four phases: 1) the first phase probes web databases to sample a set of
resulting pages that covers all structurally diverse kinds of resulting pages, such as pages
with multiple records, a single record, and no records; 2) the second phase uses a page-
clustering algorithm to segment resulting pages according to their control-flow-
dependent similarities, aiming to separate pages that contain query matches from pages
that contain none; 3) in the third phase, pages from top-ranked clusters are examined at
the subtree level to locate and rank the query-related regions of the pages; and 4) the
final phase uses a set of partitioning algorithms to separate and locate itemized objects
in each QA-Pagelet.
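The second phase's goal, separating match pages from no-match pages by structure, can be rendered in toy form by clustering sampled pages on a single structural feature, the record count. Thor's actual clustering uses much richer structural similarities; everything below is a deliberately crude stand-in.

```python
def cluster_pages(pages, count_records):
    """Group sampled result pages into 'multiple', 'single' and 'none'
    clusters by how many records they appear to contain."""
    clusters = {"multiple": [], "single": [], "none": []}
    for page in pages:
        n = count_records(page)
        key = "none" if n == 0 else "single" if n == 1 else "multiple"
        clusters[key].append(page)
    return clusters

# A record here is approximated by counting <li> tags -- purely illustrative.
clusters = cluster_pages(
    ["<li>a</li><li>b</li>", "<li>a</li>", "No results found"],
    count_records=lambda p: p.count("<li>"))
```

Pages falling into the "none" cluster would be discarded before the later, subtree-level phases.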
Searching for Web Databases
The previous subsection discussed how to query web databases, assuming that the
search interfaces to the web databases of interest have already been discovered.
Surprisingly, finding search interfaces to web databases is a challenging problem in
itself. Indeed, since several hundred thousand databases are available on the Web
(Chang et al., 2004), no one can be confident of knowing even the most relevant
databases in a very specific domain. Realizing that, people have started to create web
database collections manually, such as the Molecular Biology Database Collection
(Galperin, 2007). However, manual approaches are not practical when there are
hundreds of thousands of databases. Besides, since new databases are constantly being
added, the freshness of a manually maintained collection is highly compromised.
Barbosa and Freire (2005) proposed a new crawling strategy to automatically locate web
databases. They built a form-focused crawler that uses three classifiers: a page classifier
(which assigns pages to topics in a taxonomy), a link classifier (which identifies links
likely to lead to pages with search interfaces in one or more steps), and a form classifier.
The form classifier is a domain-independent binary classifier that uses a decision tree to
determine whether a web form is searchable or non-searchable (e.g., forms for login,
registration, leaving comments, discussion group interfaces, mailing list subscriptions,
purchases, etc.).
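A couple of hand-written decision rules convey the flavor of such a form classifier. The real classifier learns a decision tree from labeled forms; the features and thresholds below are invented for illustration.

```python
def form_features(form):
    """Extract simple, domain-independent features from a parsed form."""
    return {
        "has_password": any(f["type"] == "password" for f in form["fields"]),
        "n_text_fields": sum(f["type"] == "text" for f in form["fields"]),
        "submit_text": form.get("submit_text", "").lower(),
    }

def is_searchable(form):
    f = form_features(form)
    if f["has_password"]:                   # login/registration forms
        return False
    if "subscribe" in f["submit_text"]:     # mailing-list forms
        return False
    return f["n_text_fields"] >= 1          # needs a free-text query field

login = {"fields": [{"type": "text"}, {"type": "password"}],
         "submit_text": "Log in"}
search = {"fields": [{"type": "text"}], "submit_text": "Search"}
```

A learned tree would discover rules of roughly this shape from training data rather than having them coded by hand.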
F U T U R E T R E N D S
Typically, a search interface and the results of a search are located on two different web
pages (as depicted in Figure 2). However, owing to advances in web technologies, a web
server may send back only the query results (in a specific format) rather than a whole
generated page (Garrett, 2005). In this case, the page is populated with results on the
client side: specifically, a client-side script embedded in the page with the search form
receives the data from the web server and modifies the page content by inserting the
received data. The ever-growing use of client-side scripts poses a technical challenge
(Alvarez et al., 2004), since such scripts can modify or constrain the behavior of a search
form and, moreover, can be responsible for generating the resulting pages. Another
technical challenge is dealing with non-HTML search interfaces, such as interfaces
implemented as Java applets or in Flash (Adobe Flash). It is worth mentioning here that
all the approaches described in the previous section work with HTML forms only and do
not take client-side scripts into account.
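The client-side pattern described above can be sketched in Python standing in for the browser script. The response format and the page template are invented; the point is only that the server's payload never resembles a crawlable HTML page until the script splices it in.

```python
import json

# What the server sends back: raw results, not a generated page.
SERVER_RESPONSE = json.dumps([{"make": "Toyota", "model": "Corolla"}])

def render_results(page_template, response_text):
    """What the client-side script does: parse the payload and splice the
    rendered rows into the already-loaded page."""
    rows = "".join(f"<li>{r['make']} {r['model']}</li>"
                   for r in json.loads(response_text))
    return page_template.replace("<!--RESULTS-->", rows)

page = render_results("<ul><!--RESULTS--></ul>", SERVER_RESPONSE)
```

A crawler that fetches the raw response sees only a JSON fragment, which is why such interfaces defeat the HTML-form approaches of the previous section.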
Web search forms are user-friendly rather than program-friendly interfaces to web
databases. However, more and more web databases are beginning to provide APIs (i.e.,
program-friendly interfaces) to their content. The prevalence of APIs will solve most of
the problems related to querying databases and, at the same time, will make the
integration of deep Web data a very active area of research. For instance, one of the
challenges will be how to select, from a large pool of databases, only the few whose
content is the most appropriate for a given user's query. Nevertheless, one can expect
merely a gradual migration towards accessing web databases via APIs and, thus, at least
in the coming years, many web databases will still be accessible only through
user-friendly web forms.
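The database-selection challenge just mentioned can be made concrete with a naive sketch: given per-database term statistics (for instance, obtained by query-based sampling), rank the databases by how well they cover the query terms. The statistics and the term-frequency-sum scoring are illustrative assumptions, not an established selection algorithm.

```python
def select_databases(query_terms, db_term_stats, k=2):
    """db_term_stats: {db_name: {term: frequency}}. Return the k databases
    whose sampled content best covers the query terms."""
    def score(db):
        stats = db_term_stats[db]
        return sum(stats.get(t, 0) for t in query_terms)
    return sorted(db_term_stats, key=score, reverse=True)[:k]

# Invented summary statistics for three hypothetical databases.
stats = {
    "pubmed": {"gene": 900, "protein": 700},
    "autotrader": {"toyota": 500, "corolla": 300},
    "imdb": {"movie": 800},
}
best = select_databases(["gene", "protein"], stats, k=1)
```

With APIs in place, such a selector would decide which few of the hundreds of thousands of databases are worth querying for a given request.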
C O N C L U S I O N
The existence and continued growth of the deep Web creates new challenges for web
search engines, which are attempting to make all content on the Web easily accessible by
all users (Madhavan et al., 2006). Though much research has emerged in recent years on
querying web databases, there is still a great deal of work to be done. The deep Web will
require more effective access techniques since the traditional crawl-and-index techniques,
which have been quite successful for unstructured web pages in the publicly indexable
Web, may not be appropriate for mostly structured data in the deep Web. Thus, a new
field of research combining methods and research efforts of database and information
retrieval communities may be created.
B I B L I O G R A P H Y
Alvarez, M., Pan, A., Raposo, J., & Vina, J. (2004). Client-side deep web data extraction.
IEEE Conference on e-Commerce Technology for Dynamic E-Business (CEC-East’04).
Barbosa, L., & Freire, J. (2004). Siphoning hidden-web data through keyword-based
interfaces. Brazilian Symposium on Databases (SBBD’04), 309-321.
Barbosa, L., & Freire, J. (2005). Searching for hidden-web databases. 8th
Workshop on the Web and Databases (WebDB’05), 1-6.
Bergman, M. (2001). The deep Web: surfacing hidden value. Journal of Electronic
Publishing, 7(1).
Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM
Transactions on Information Systems (TOIS), 19(2), 97-130.
Caverlee, J., & Liu, L. (2005). QA-Pagelet: data preparation techniques for large-scale
data analysis of the deep Web. IEEE Transactions on Knowledge and Data Engineering.
Chang, K., He, B., Li, C., Patel, M., & Zhang, Z. (2004). Structured databases on the
Web: observations and implications. SIGMOD Rec., 33(3), 61-70.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: towards automatic data
extraction from large web sites. 27th International Conference on Very Large Data
Bases (VLDB’01), 109-118.
Florescu, D., Levy, A., & Mendelzon, A. (1998). Database techniques for the World-
Wide Web: a survey. SIGMOD Rec., 27(3), 59-74.
Galperin, M. (2007). The molecular biology database collection: 2007 update. Nucleic
Acids Research, 35(Database-Issue), 3-4.
Garrett, J. (2005). Ajax: a new approach to web applications. AdaptivePath. Retrieved
October 8, 2007, from http://www.adaptivepath.com/publications/essays/archives/000385.php
Madhavan, J., Halevy, A., Cohen, S., Dong, X., Jeffery, S., Ko, D., & Yu, C. (2006).
Structured data meets the Web: a few observations. IEEE Data Engineering Bulletin.
Ntoulas, A., Zerfos, P., & Cho, J. (2005). Downloading textual hidden web content
through keyword queries. 5th ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL’05).
Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. 27th International
Conference on Very Large Data Bases (VLDB’01), 129-138.
Shestakov, D., Bhowmick, S., & Lim, E.-P. (2005). DEQUE: querying the deep Web.
Data&Knowledge Engineering, 52(3), 273-311.
Shestakov, D., & Salakoski, T. (2007). On estimating the scale of national deep Web. 18th
International Conference on Database and Expert Systems Applications (DEXA’07), 780-
Wang, J., & Lochovsky, F. (2003). Data extraction and label assignment for web
databases. 12th International Conference on World Wide Web (WWW’03), 187-196.
Wu, P., Wen, J.-R., Liu, H., & Ma, W.-Y. (2006). Query selection techniques for
efficient crawling of structured web sources. 22nd International Conference on Data
Engineering (ICDE’06).
Adobe Flash, Retrieved October 8, 2007, from http://www.adobe.com/products/flash/
K E Y T E R M S
Deep Web (also hidden Web) – the part of the Web that consists of all web pages
accessible via search interfaces.
Deep web site – a web site that provides an access via search interface to one or more
back-end web databases.
Non-indexable Web (also invisible Web) – the part of the Web that is not indexed by the
general-purpose web search engines.
Publicly indexable Web – the part of the Web that is indexed by the major web search
engines.
Search interface (also search form) – a user-friendly interface located on a web site that
allows a user to issue queries to a database.
Web database – a database accessible online via at least one search interface.
Web crawler – an automated program used by search engines to collect web pages to be
indexed.