Deep Web: Databases on the Web
Denis Shestakov
Turku Centre for Computer Science, Finland
I N T R O D U C T I O N
Finding information on the Web using a web search engine is one of the primary
activities of today's web users. For the majority of users, the results returned by
conventional search engines appear to be an essentially complete set of links to all pages
on the Web relevant to their queries. However, current-day search engines do not crawl
and index a significant portion of the Web and, hence, web users who rely on search
engines alone are unable to discover and access a large amount of information in the
non-indexable part of the Web. Specifically, dynamic pages generated from parameters
provided by a user via web search forms are not indexed by search engines and cannot
be found in search results.
Such search interfaces provide web users with online access to myriads of databases on
the Web. In order to obtain information from a web database of interest, a user issues a
query by specifying query terms in a search form and receives the query results: a set of
dynamic pages that embed the required information from the database. At the same time,
issuing a query via an arbitrary search interface is an extremely complex task for any
kind of automatic agent, including web crawlers, which, at least up to the present day, do
not even attempt to pass through web forms on a large scale.
Content provided by many web databases is often of very high quality and can be
extremely valuable to many users. For example, the PubMed database
(http://www.pubmed.gov) allows a user to search through millions of high-quality peer-
reviewed papers on biomedical research, while the AutoTrader car classifieds database at
http://autotrader.com is highly useful for anyone wishing to buy or sell a car. In general,
since each searchable database is a collection of data in a specific domain, it can often
provide more specific and detailed information that is unavailable, or hard to find, in the
indexable Web. The following section provides background information on the non-
indexable Web and web databases.
B A C K G R O U N D
Conventional web search engines index only a portion of the Web, called the publicly
indexable Web, which consists of publicly available web pages reachable by following
hyperlinks. The rest of the Web, known as the non-indexable Web, can be roughly
divided into two large parts. The first consists of protected web pages, i.e., pages that are
password-protected, marked as non-indexable by webmasters using the Robots
Exclusion Protocol, or only available when visited from certain networks (e.g., corporate
intranets). Web pages accessible via web search forms (or search interfaces) comprise
the second part of the non-indexable Web, known as the deep Web (Bergman, 2001) or
the hidden Web (Florescu, Levy & Mendelzon, 1998). The deep Web is not completely
unknown to search engines: some web databases can be accessed not only through web
forms but also through link-based navigational interfaces (e.g., a typical shopping site
usually allows browsing products in some subject hierarchy). Thus, a certain part of the
deep Web is, in fact, indexable, as search engines can technically reach the content of
those databases through their browse interfaces. The separation of the Web into
indexable and non-indexable portions is summarized in Figure 1. The deep Web, shown
in gray, is formed by those public pages that are accessible via web search forms.
Figure 1. Indexable and non-indexable portions of the Web and deep Web
Figure 2. User interaction with web database
A user's interaction with a web database is depicted schematically in Figure 2. A web
user queries a database via a search interface located on a web page. First, search
conditions are specified in a form and submitted to a web server. Middleware (often
called a server-side script) running on the web server processes the user's query by
transforming it into a proper format and passing it to an underlying database. Second, the
server-side script generates a resulting web page by embedding the results returned by
the database into a page template and, finally, the web server sends the generated page
back to the user. Frequently, results do not fit on one page and, therefore, the returned
page contains some sort of navigation through the other result pages. The resulting pages
are often termed data-rich or data-intensive (Crescenzi, Mecca & Merialdo, 2001).
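To make this interaction concrete, here is a minimal sketch in Python (using the requests
and BeautifulSoup libraries) of a program submitting a query through a search form and
receiving the generated result page; the URL and field names are illustrative
assumptions, not taken from any real site.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical search form endpoint and field names (illustration only).
SEARCH_URL = "http://example.com/cars/search"

def query_web_database(make, model):
    """Submit search conditions to the server-side script and return
    the parsed, data-rich result page."""
    response = requests.post(SEARCH_URL, data={"make": make, "model": model})
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

# The middleware embeds database records into a page template; extracting
# them (and following the links to further result pages) is the hard part.
results_page = query_web_database("Toyota", "Corolla")
print(results_page.title.string if results_page.title else "(no title)")
```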
While unstructured content dominates the publicly available Web, the deep Web is a
huge source of structured data. Web databases can be broadly classified into two types:
1) structured web databases, which provide data objects as structured relational-like
records with attribute-value pairs, and 2) unstructured web databases, which provide data
objects as unstructured media (e.g., text, images, audio, video) (Chang et al., 2004). For
example, being a collection of research paper abstracts, the PubMed database mentioned
earlier is considered an unstructured database, while autotrader.com has a structured
database for car classifieds, which returns structured car descriptions (e.g., make =
“Toyota”, model = “Corolla”, price = $5000). According to the survey of Chang et al.
(2004), structured databases dominate the deep Web: approximately four in five web
databases are structured. The distinction between unstructured and structured databases
is meaningful since the unstructured content provided by databases of the latter type can
be handled adequately well by web search engines. Structured content is different in this
respect: to be fully utilized, it must first be extracted from data-intensive pages and then
processed much like the instances of a traditional database.
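The practical difference between the two types can be seen in a toy sketch: a structured
result maps naturally onto attribute-value pairs that support database-style operations
directly, while an unstructured result is an opaque text blob. The records below are
invented for illustration.

```python
# A structured result record: relational-like attribute-value pairs,
# as a car classifieds database would return.
structured_record = {"make": "Toyota", "model": "Corolla", "price": 5000}

# An unstructured result: a free-text abstract, as PubMed would return.
unstructured_record = ("We study the expression of gene X in tissue Y "
                       "and report its association with condition Z.")

# Structured records support database-style operations directly...
affordable = structured_record["price"] < 10000
# ...whereas unstructured content must first go through information
# extraction (or plain keyword search) before it can be queried.
print(affordable)
```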
C H A L L E N G E S O F D E E P W E B
Deep Web Characterization
Though the term deep Web was coined back in 2000 (Bergman, 2001), a long time ago
by the standards of web-related concepts and technologies, many important
characteristics of the deep Web are still unknown. For instance, the total size of the deep
Web (i.e., how much information is stored in web databases) can only be guessed. At the
same time, the existing estimates for the total number of deep web sites are considered
reliable and consistent. It has been estimated that 43,000 - 96,000 deep web sites existed
in March 2000 (Bergman, 2001). A three- to sevenfold increase over the following four
years was reported in the survey by Chang et al. (2004), which estimated the total
numbers of deep web sites and web databases in April 2004 at around 310,000 and
450,000 respectively. Moreover, even national- or domain-specific subsets of deep web
resources are, in fact, very large collections. For example, in summer 2005 there were
around 7,300 deep web sites on the Russian part of the Web alone (Shestakov &
Salakoski, 2007). Similarly, the bioinformatics community by itself maintains almost
one thousand molecular biology databases freely available on the Web (Galperin, 2007).
As mentioned earlier, the deep Web is partially indexed by web search engines because
the content of some web databases is accessible both via hyperlinks and via search
interfaces. According to rough estimates based on an analysis of 80 databases in eight
domains (Chang et al., 2004), around 25% of deep web data is covered by Google
(http://google.com). At the same time, most deep web data indexed by Google is out-of-
date, i.e., pages returned by Google contain out-of-date information: only one in five
indexed pages is fresh.
Querying Web Databases
The Hidden Web Exposer (HiWE) (Raghavan & Garcia-Molina, 2001) is an example of
a deep web crawler, i.e., a system that gives a web crawler the ability to automatically
fill out search interfaces. Each time HiWE finds a web page with a search interface, it
performs the following three steps (see the sketch after the list):
1. Constructs a logical representation of the search form by associating a field label
(descriptive text that helps a user understand the semantics of a form field) with each
form field. For example, as shown in Figure 2, “Car Make” is a field label for the
upper form field.
2. Takes values from a pre-existing value set (initially provided by a user) and submits
the form by assigning the most relevant values to form fields.
3. Analyzes the response and, if it is found valid, retrieves the resulting pages and stores
them in a repository (the pages in the repository can then be indexed to support user
queries, just like in a conventional search engine).
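A highly simplified sketch of this three-step loop is given below; the helper functions are
naive, hypothetical stand-ins for HiWE's actual components rather than its real
algorithms.

```python
# A toy HiWE-style crawl step; helpers are naive stand-ins, not the
# system's real label extraction, value matching, or validation logic.

def extract_field_label(field):
    # Step 1: in HiWE, LITE ranks nearby text; here a precomputed
    # label attribute serves as a placeholder.
    return field.get("label", "")

def best_match(label, value_set):
    # Step 2: pick a value whose category name matches the field label.
    for category, values in value_set.items():
        if category.lower() in label.lower():
            return values[0]
    return ""

def crawl_form(fields, value_set, submit, repository):
    labels = {f["name"]: extract_field_label(f) for f in fields}
    assignment = {name: best_match(label, value_set)
                  for name, label in labels.items()}
    pages = submit(assignment)      # step 2: submit the filled-out form
    if pages:                       # step 3: keep valid responses only
        repository.extend(pages)

# Toy run: one labeled field, a user-provided value set, a fake submitter.
repo = []
crawl_form(fields=[{"name": "make", "label": "Car Make"}],
           value_set={"car make": ["Toyota", "Honda"]},
           submit=lambda a: [f"result page for {a}"],
           repository=repo)
print(repo)
```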
One of the major challenges for deep web crawlers such as HiWE, as well as for any
other automatic agents that query web databases, is understanding the semantics of a
search interface. Web search forms are created and designed for human beings, not for
computer applications. Consequently, such a simple task (from a user's perspective) as
filling out a search form turns out to be a computationally hard problem. Recognizing
the field labels that describe the fields of a search interface (see Figure 2) is challenging
since there are myriads of web coding practices and, thus, no common rules regarding
the relative position, in the HTML code of a web page, of the two elements that declare a
field label and its corresponding field. HiWE introduces the Layout-based Information
Extraction Technique (LITE) (Raghavan & Garcia-Molina, 2001), in which the physical
layout of a page is used to aid in extraction in addition to the page's code. Using a layout
engine, LITE identifies the pieces of text, if any, that are physically closest to a form
field in the horizontal and vertical directions. These pieces of text (potential field labels)
are then ranked based on several heuristics, and the highest-ranked candidate is taken as
the field label associated with the form field.
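The geometric intuition behind LITE can be sketched as follows, assuming each
rendered element comes annotated with coordinates from a layout engine; the plain
distance scoring here is a deliberately crude stand-in for LITE's actual heuristics.

```python
import math

def closest_label(field_box, text_boxes):
    """Pick the candidate label physically closest to a form field.

    Boxes carry (x, y) centers obtained from a layout engine; plain
    Euclidean distance is a crude stand-in for LITE's ranking heuristics.
    """
    def distance(box):
        return math.hypot(box["x"] - field_box["x"], box["y"] - field_box["y"])
    return min(text_boxes, key=distance)["text"]

# Toy layout: the text rendered just left of the field wins.
field = {"x": 200, "y": 100}
candidates = [{"text": "Car Make", "x": 120, "y": 100},
              {"text": "Search our site!", "x": 200, "y": 400}]
print(closest_label(field, candidates))  # -> Car Make
```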
A different approach, to automatically issuing queries to textual (unstructured)
databases, was presented by Callan and Connell (2001). The method, called query-based
sampling, uses the results of an initial query to extract new query terms, which are then
submitted, and so on iteratively. The approach has been applied with some success to
siphon data hidden behind keyword-based search interfaces (Barbosa & Freire, 2004).
Ntoulas et al. (2005) proposed an adaptive algorithm to select the best keyword queries,
namely keyword queries that return a large fraction of the database content. In particular,
it has been shown that with only 83 queries it is possible to download almost 80% of all
documents stored in the PubMed database. Query selection for crawling structured web
databases was studied by Wu et al. (2006), who modeled a structured database as an
attribute-value graph. Query-based web database crawling then becomes a graph
traversal task, parallel to the traditional problem of crawling the publicly indexable Web
by following hyperlinks.
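The iterative core of query-based sampling can be sketched in a few lines of Python; the
term-selection policy below (issue the most frequent not-yet-seen terms) is a
simplification of the published algorithms, and the search function is assumed to be
supplied by the caller.

```python
from collections import Counter

def query_based_sampling(search, seed_term, rounds=5, per_round=3):
    """Iteratively sample a textual database: terms extracted from earlier
    results seed later queries. `search(term)` returns a list of texts."""
    documents, issued = [], {seed_term}
    frontier = [seed_term]
    for _ in range(rounds):
        term_counts = Counter()
        for term in frontier:
            for doc in search(term):
                documents.append(doc)
                term_counts.update(word.lower() for word in doc.split())
        # Issue the most frequent terms that have not been queried yet.
        frontier = [t for t, _ in term_counts.most_common()
                    if t not in issued][:per_round]
        issued.update(frontier)
        if not frontier:
            break
    return documents

# Toy usage: three 'documents' behind a keyword-only interface.
docs = ["deep web databases", "crawling the deep web", "query sampling"]
print(len(query_based_sampling(lambda t: [d for d in docs if t in d.split()],
                               seed_term="web")))
```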
Although HiWE provides a very effective approach to crawling the deep Web, it does
not extract data from the resulting pages. The DeLa (Data Extraction and Label
Assignment) system (Wang & Lochovsky, 2003) sends queries via search forms,
automatically extracts data objects from the resulting pages, fits the extracted data into a
table, and assigns labels to the attributes of the data objects, i.e., to the columns of the
table. DeLa adopts HiWE to detect the field labels of search interfaces and to fill the
interfaces out. Its data extraction and label assignment approaches are based on two
main observations. First, the data objects embedded in the resulting pages share a
common tag structure and are listed contiguously in the pages. Thus, regular expression
wrappers can be automatically generated to extract such data objects and fit them into a
table. Second, a search interface, through which web users issue their queries, gives a
sketch of (part of) the underlying database. Based on this, the extracted field labels are
matched to the columns of the table, thereby annotating the attributes of the extracted
data.
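The first observation, that records share a common tag structure and repeat contiguously,
is what makes automatic wrapper generation feasible. A toy illustration with a hand-
written regular expression follows; DeLa induces such patterns automatically rather than
taking them as input.

```python
import re

# Result-page fragment in which records repeat with a common tag structure.
html = """
<tr><td>Toyota</td><td>Corolla</td><td>$5000</td></tr>
<tr><td>Honda</td><td>Civic</td><td>$6500</td></tr>
"""

# A wrapper for this structure; a DeLa-style system would induce it.
wrapper = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td><td>\$(\d+)</td></tr>")
table = wrapper.findall(html)

# Field labels extracted from the search interface annotate the columns.
labels = ["Car Make", "Model", "Price"]
for row in table:
    print(dict(zip(labels, row)))
```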
The HiWE system (and hence the DeLa system) works with relatively simple search
interfaces. Besides, it does not consider the navigational complexity of many deep web
sites. In particular, resulting pages may contain links to other web pages with relevant
information, and it is consequently necessary to navigate these links to evaluate their
relevance. Additionally, obtaining pages with results sometimes requires two or even
more successful form submissions (as a rule, the second form is returned to refine or
modify the search conditions provided in the previous form). The DEQUE (Deep Web
Query System) (Shestakov, Bhowmick & Lim, 2005) addressed these challenges and
proposed a web form query language called DEQUEL that allows a user to assign more
values at a time to a search interface's fields than is possible when the interface is filled
out manually. For example, data from relational tables, or data obtained by querying
other web databases, can be used as input values. Data extraction from resulting pages
in DEQUE is based on a modified version of the RoadRunner technique (Crescenzi,
Mecca & Merialdo, 2001).
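The flavor of this capability can be conveyed with a sketch in Python rather than in
actual DEQUEL syntax: a column of a relational table fans out into one form submission
per value, something impractical to do by hand. The table schema and the submit
function are invented for illustration.

```python
import sqlite3

# Input values come from a relational table instead of manual typing;
# the schema and the submit function are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE makes (name TEXT)")
conn.executemany("INSERT INTO makes VALUES (?)", [("Toyota",), ("Honda",)])

def submit_form(assignment):
    return [f"result page for {assignment}"]  # stand-in for a real submission

# One statement fans out into one form submission per stored value.
all_results = []
for (make,) in conn.execute("SELECT name FROM makes"):
    all_results.extend(submit_form({"Car Make": make}))
print(all_results)
```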
Among research efforts devoted solely to information extraction from resulting pages,
Caverlee and Liu (2005) presented the Thor framework for sampling, locating, and
partitioning the query-related content regions (called Query-Answer Pagelets, or QA-
Pagelets) of the deep Web. Thor is designed as a suite of algorithms supporting the
entire scope of a deep Web information platform. Its data extraction and preparation
subsystem supports a robust approach to the sampling, locating, and partitioning of
QA-Pagelets in four phases: 1) the first phase probes web databases to sample a set of
resulting pages that covers all structurally diverse kinds of resulting pages, such as pages
with multiple records, a single record, or no records; 2) the second phase uses a page
clustering algorithm to segment resulting pages according to their control-flow
dependent similarities, aiming to separate pages that contain query matches from pages
that contain no matches; 3) in the third phase, pages from top-ranked clusters are
examined at the subtree level to locate and rank the query-related regions of the pages;
and 4) the final phase uses a set of partitioning algorithms to separate and locate
itemized objects in each QA-Pagelet.
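Viewed as a pipeline, the four phases compose naturally. The skeleton below only fixes
the interfaces between the phases; each argument is a trivial placeholder for one of
Thor's actual phase algorithms.

```python
def thor_pipeline(probe, cluster, locate, partition, database):
    """Skeleton of a Thor-style four-phase QA-Pagelet pipeline; each
    argument stands in for one of the actual phase algorithms."""
    pages = probe(database)         # phase 1: sample diverse resulting pages
    clusters = cluster(pages)       # phase 2: separate match from no-match
    regions = locate(clusters[:1])  # phase 3: rank regions in top clusters
    return partition(regions)       # phase 4: split regions into items

# Toy run with trivial phase implementations:
print(thor_pipeline(
    probe=lambda db: ["multi-record page", "no-record page"],
    cluster=lambda pages: [[p for p in pages if "no-record" not in p]],
    locate=lambda top: [r for c in top for r in c],
    partition=lambda regions: [{"item": r} for r in regions],
    database="toy-db",
))
```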
Searching for Web Databases
The previous subsection discussed how to query web databases, assuming that search
interfaces to the web databases of interest had already been discovered. Surprisingly,
finding search interfaces to web databases is a challenging problem in itself. Indeed,
since several hundred thousand databases are available on the Web (Chang et al., 2004),
one cannot be sure of knowing even the most relevant databases in a very specific
domain. Realizing this, people have started to create web database collections manually,
such as the Molecular Biology Database Collection (Galperin, 2007). However, manual
approaches are not practical, as there are hundreds of thousands of databases. Besides,
since new databases are constantly being added, the freshness of a manually maintained
collection is highly compromised.
Barbosa and Freire (2005) proposed a new crawling strategy to automatically locate web
databases. They built a form-focused crawler that uses three classifiers: a page classifier
(which classifies pages as belonging to topics in a taxonomy), a link classifier (which
identifies links likely to lead, in one or more steps, to pages with search interfaces), and
a form classifier. The form classifier is a domain-independent binary classifier that uses
a decision tree to determine whether a web form is searchable or non-searchable (e.g.,
forms for login, registration, leaving comments, discussion group interfaces, mailing list
subscriptions, purchases, etc.).
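A minimal sketch of such a binary form classifier, using scikit-learn's decision tree over
a few hand-picked form features, is shown below; the feature set and the training
examples are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Per-form features: [text fields, has password field, selects/checkboxes,
# submit-button text mentions "search"]; all values are made up.
X = [
    [1, 0, 2, 1],   # keyword search form with filters -> searchable
    [1, 1, 0, 0],   # login form -> non-searchable
    [3, 0, 5, 1],   # advanced search form -> searchable
    [2, 1, 1, 0],   # registration form -> non-searchable
]
y = [1, 0, 1, 0]    # 1 = searchable, 0 = non-searchable

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1, 0, 3, 1]]))  # an unseen form -> likely searchable
```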
F U T U R E T R E N D S
Typically, search interface and results of search are located on two different web pages
(as depicted in Figure 2). However, due to advances of web technologies, a web server
may send back only query results (in a specific format) rather than a whole generated
page with results (Garrett, 2005). In this case, a page is populated with results on the
client-side: specifically, a client-side script embedded within a page with search form
receives the data from a web server and modifies the page content by inserting the
received data. Ever-growing use of client-side scripts will pose a technical challenge
(Alvarez et al., 2004) since such scripts can modify or constrain the behavior of a search
form and, moreover, can be responsible for generation of resulting pages. Another
technical challenge is dealing with non-HTML search interfaces such as interfaces
implemented as Java applets or in Flash (Adobe Flash). It is worth to mention here that
all approaches described in the previous section work with HTML-forms only and do not
take into account client-side scripts.
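One consequence for crawlers is that the useful payload may live at a separate data
endpoint that the page's script calls. Below is a hedged sketch of a crawler emulating
that request directly; the endpoint and the JSON response format are hypothetical.

```python
import json
import urllib.request

# Hypothetical endpoint that a page's client-side script would call; the
# server returns bare results (here, JSON) instead of a full HTML page.
DATA_URL = "http://example.com/api/search?make=Toyota"

with urllib.request.urlopen(DATA_URL) as resp:
    results = json.load(resp)

# A crawler must either emulate the script's request (as above) or execute
# the script itself; parsing the static HTML alone would miss this data.
for record in results:
    print(record)
```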
Web search forms are user-friendly rather than program-friendly interfaces to web
databases. However, more and more web databases are beginning to provide APIs (i.e.,
program-friendly interfaces) to their content. The prevalence of APIs will solve most of
the problems related to querying databases and, at the same time, will make the problem
of integrating deep Web data a very active area of research. For instance, one of the
challenges to be faced is how to select, from a large pool of databases, the few whose
content is most appropriate for a given user's query. Nevertheless, one can expect merely
a gradual migration towards accessing web databases via APIs and, thus, at least in the
coming years, many web databases will still be accessible only through user-friendly
web forms.
C O N C L U S I O N
The existence and continued growth of the deep Web creates new challenges for web
search engines, which are attempting to make all content on the Web easily accessible by
all users (Madhavan et al., 2006). Though much research has emerged in recent years on
querying web databases, there is still a great deal of work to be done. The deep Web will
require more effective access techniques since the traditional crawl-and-index techniques,
which have been quite successful for unstructured web pages in the publicly indexable
Web, may not be appropriate for the mostly structured data in the deep Web. Thus, a
new field of research may emerge, combining the methods and research efforts of the
database and information retrieval communities.
B I B L I O G R A P H Y
Alvarez, M., Pan, A., Raposo, J., & Vina, J. (2004). Client-side deep web data extraction.
IEEE Conference on e-Commerce Technology for Dynamic E-Business (CEC-East’04),
158-161.
Barbosa, L., & Freire, J. (2004). Siphoning hidden-web data through keyword-based
interfaces. 19th Brazilian Symposium on Databases (SBBD’04), 309-321.
Barbosa, L., & Freire, J. (2005). Searching for hidden-web databases. 8th International
Workshop on the Web and Databases (WebDB’05), 1-6.
Bergman, M. (2001). The deep Web: surfacing hidden value. Journal of Electronic
Publishing, 7(1).
Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM
Transactions on Information Systems (TOIS), 19(2), 97-130.
Caverlee, J., & Liu, L. (2005). QA-Pagelet: data preparation techniques for large-scale
data analysis of the deep Web. IEEE Transactions on Knowledge and Data Engineering,
17(9), 1247-1262.
Chang, K., He, B., Li, C., Patel, M., & Zhang, Z. (2004). Structured databases on the
Web: observations and implications. SIGMOD Rec., 33(3), 61-70.
Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: towards automatic data
extraction from large web sites. 27th International Conference on Very Large Data
Bases (VLDB’01), 109-118.
Florescu, D., Levy, A., & Mendelzon, A. (1998). Database techniques for the World-
Wide Web: a survey. SIGMOD Rec., 27(3), 59-74.
Galperin, M. (2007). The molecular biology database collection: 2007 update. Nucleic
Acids Research, 35(Database-Issue), 3-4.
Garrett, J. (2005). Ajax: a new approach to web applications. AdaptivePath. Retrieved
October 8, 2007, from http://www.adaptivepath.com/publications/essays/archives/000385.php
Madhavan, J., Halevy, A., Cohen, S., Dong, X., Jeffery, S., Ko, D., & Yu, C. (2006).
Structured data meets the Web: a few observations. IEEE Data Engineering Bulletin,
29(4), 19-26.
Ntoulas, A., Zerfos, P., & Cho, J. (2005). Downloading textual hidden web content
through keyword queries. 5th ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL’05), 100-109.
Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. 27th International
Conference on Very Large Data Bases (VLDB’01), 129-138.
Shestakov, D., Bhowmick, S., & Lim, E.-P. (2005). DEQUE: querying the deep Web.
Data & Knowledge Engineering, 52(3), 273-311.
Shestakov, D., & Salakoski, T. (2007). On estimating the scale of national deep Web.
18th International Conference on Database and Expert Systems Applications
(DEXA’07), 780-789.
Wang, J., & Lochovsky, F. (2003). Data extraction and label assignment for web
databases. 12th International Conference on World Wide Web (WWW’03), 187-196.
Wu, P., Wen, J.-R., Liu, H., & Ma, W.-Y. (2006). Query selection techniques for
efficient crawling of structured web sources. 22nd International Conference on Data
Engineering (ICDE’06), 47.
Adobe Flash. Retrieved October 8, 2007, from http://www.adobe.com/products/flash/
K E Y T E R M S
Deep Web (also hidden Web) – the part of the Web that consists of all web pages
accessible via search interfaces.
Deep web site – a web site that provides access via a search interface to one or more
back-end web databases.
Non-indexable Web (also invisible Web) – the part of the Web that is not indexed by the
general-purpose web search engines.
Publicly indexable Web – the part of the Web that is indexed by the major web search
engines.
Search interface (also search form) – a user-friendly interface located on a web site that
allows a user to issue queries to a database.
Web database – a database accessible online via at least one search interface.
Web crawler – an automated program used by search engines to collect web pages to be
indexed.