ACCESSING THE DEEP WEB

By BIN HE, MITESH PATEL, ZHEN ZHANG, and KEVIN CHEN-CHUAN CHANG

Attempting to locate and quantify material on the Web that is hidden from typical search techniques.

COMMUNICATIONS OF THE ACM May 2007/Vol. 50, No. 5
Illustration by Peter Hoey

The Web has been rapidly “deepened” by massive databases online, and current search engines do not reach most of the data on the Internet [4]. While the surface Web has linked billions of static HTML pages, a far more significant amount of information is believed to be “hidden” in the deep Web, behind the query forms of searchable databases, as Figure 1(a) conceptually illustrates. Such information may not be accessible through static URL links because it is assembled into Web pages only as responses to queries submitted through the query interface of an underlying database. Because current search engines cannot effectively crawl databases, such data remains largely hidden from users (and is thus often referred to as the invisible or hidden Web). Using overlap analysis between pairs of search engines, the study in [1] estimated that 43,000 to 96,000 “deep Web sites” exist and informally estimated 7,500 terabytes of data, 500 times larger than the surface Web.
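The overlap analysis mentioned above is the classic capture-recapture estimate. A minimal sketch in Python; the counts in the example are illustrative, not the survey's data:

```python
def overlap_estimate(n_a, n_b, n_0):
    """Capture-recapture (overlap) estimate of a total population:
    n_a sites found by engine A, n_b by engine B, n_0 by both.
    Valid only if the two engines obtain their data randomly and
    independently; correlated coverage makes n_0 too large and the
    estimate too small, which is the underestimation discussed here."""
    if n_0 == 0:
        raise ValueError("no overlap: estimate undefined")
    return n_a * n_b / n_0

# Illustrative counts: 6,000 sites in A, 8,000 in B, 1,000 in both.
total = overlap_estimate(6000, 8000, 1000)  # -> 48000.0
```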
With its myriad databases and hidden content, this deep Web is an important yet largely unexplored frontier for information search. While we have come to understand the surface Web relatively well through various surveys [3, 7], how is the deep Web different? This article reports our survey of the deep Web, studying the scale, subject distribution, search-engine coverage, and other access characteristics of online databases.

We note that, while the study conducted in 2000 [1] established interest in this area, it focused only on the scale aspect, and its result from overlap analysis tends to underestimate (as acknowledged in [1]). In overlap analysis, the number of deep Web sites is estimated by exploiting two search engines. If we find n_a deep Web sites in the first search engine, n_b in the second, and n_0 in both, we can estimate the total number as shown in Equation 1 by assuming the two search engines randomly and independently obtain their data.

Equation 1. n = (n_a × n_b) / n_0

However, as our survey found, search engines are highly correlated in their coverage of deep Web data (see question Q5 in this survey). Such an independence assumption therefore seems rather unrealistic, in which case the result is significantly underestimated. In fact, the violation of this assumption and its consequence were also discussed in [1].

Our survey took the IP sampling approach to collect random server samples for estimating the global scale as well as facilitating subsequent analysis. During April 2004, we acquired and analyzed a random sample of Web servers by IP sampling. We randomly sampled 1,000,000 IPs (from the entire space of 2,230,124,544 valid IP addresses, after removing reserved and unused IP ranges according to [8]). For each IP, we used an HTTP client, the GNU free software wget [5], to make an HTTP connection to it and download HTML pages. We then identified and analyzed Web databases in this sample in order to extrapolate our estimates of the deep Web.

Our survey distinguishes three related notions for accessing the deep Web: site, database, and interface. A deep Web site is a Web server that provides information maintained in one or more back-end Web databases, each of which is searchable through one or more HTML forms as its query interfaces. For instance, as Figure 1(b) shows, bn.com is a deep Web site providing several Web databases (a book database and a music database, among others) accessed via multiple query interfaces (“simple search” and “advanced search”). Note that our definition of deep Web site does not account for virtual hosting, where multiple Web sites can be hosted on the same physical IP address. Since identifying all the virtual hosts within an IP address is rather difficult in practice, we do not consider such cases in our survey. Our IP sampling-based estimation is thus accurate modulo the effect of virtual hosting.

When conducting the survey, we first find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites.

First, as our survey specifically focuses on online databases, we differentiate non-query HTML forms (which do not access back-end databases) from query interfaces and exclude them. In particular, HTML forms for login, subscription, registration, polling, and message posting are not query interfaces. Similarly, we also exclude “site search,” which many Web sites now provide for searching the HTML pages on their sites: those pages are statically linked at the “surface” of the sites, not dynamically assembled from an underlying database. Note that our survey considered only unique interfaces and removed duplicates, since many Web pages contain the same query interfaces repeatedly; for example, in bn.com, the simple book search in Figure 1(b) is present on almost all pages.

Second, we survey Web databases and deep Web sites based on the discovered query interfaces. Specifically, we compute the number of Web databases by finding the sets of query interfaces (within a site) that refer to the same database. In particular, for any two query interfaces, we randomly choose five objects from one and search for them in the other. We judge that the two interfaces are searching the same database if and only if the objects from one interface can always be found in the other.

Finally, the recognition of a deep Web site is rather simple: A Web site is a deep Web site if it has at least one query interface.

RESULTS
(Q1) Where to find “entrances” to databases? To access a Web database, we must first find its entrances: the query interfaces. Where is an interface (if any) located in a site, that is, at which depth? For each query interface, we measured the depth as the minimum number of hops from the root page of the site to the interface page.(1) As this study required deep crawling of Web sites, we analyzed one-tenth of our total IP samples: a subset of 100,000 IPs. We tested each IP sample by making HTTP connections and found 281 Web servers. Exhaustively crawling these servers to depth 10, we found that 24 of them were deep Web sites, which contained a total of 129 query interfaces representing 34 Web databases.

Figure 1a. The conceptual view of the deep Web. [Illustration: query forms of sites such as Cars.com, Amazon.com, Apartments.com, Biography.com, 401carfinder.com, and 411locate.com, hidden behind the surface Web.]

Figure 1b. Site, databases, and interface. [Illustration: the deep Web site bn.com provides a book database and a music database, each accessible through both a simple search and an advanced search interface.]

(1) Such depth information is obtained by a simple revision of the wget software.
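The database-identity test described above (randomly probing five objects from one interface against another) can be sketched as follows; `objects_of` and `search` are hypothetical stand-ins for querying a live interface:

```python
import random

def same_database(objects_of, search, a, b, k=5, seed=0):
    """Judge whether query interfaces a and b (within one site) front
    the same back-end database: sample up to k objects from each
    interface and require that every sampled object from one can be
    found through the other, mirroring the survey's five-object test.
    objects_of(iface) lists objects reachable via an interface;
    search(iface, obj) reports whether obj is findable through it."""
    rng = random.Random(seed)
    for src, dst in ((a, b), (b, a)):
        pool = objects_of(src)
        sample = rng.sample(pool, min(k, len(pool)))
        if not all(search(dst, obj) for obj in sample):
            return False
    return True

# Toy usage: two interfaces over one book store vs. a disjoint store.
stores = {
    "simple":   ["b1", "b2", "b3", "b4", "b5", "b6"],
    "advanced": ["b1", "b2", "b3", "b4", "b5", "b6"],
    "music":    ["m1", "m2", "m3"],
}
objects_of = lambda i: stores[i]
search = lambda i, o: o in stores[i]
# same_database(objects_of, search, "simple", "advanced") -> True
# same_database(objects_of, search, "simple", "music")    -> False
```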
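The depth measure used in Q1 (minimum number of hops from a site's root page to an interface page) amounts to a breadth-first crawl. A minimal sketch over an in-memory link graph; `links` and `has_query_interface` are illustrative stand-ins for real crawling and form detection:

```python
from collections import deque

def interface_depths(root, links, has_query_interface):
    """Breadth-first traversal from a site's root page, recording for
    each page that carries a query interface its minimum number of
    hops from the root. BFS visits pages in nondecreasing depth, so
    the first recorded depth is the minimal one."""
    depths = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if has_query_interface(page):
            depths[page] = depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return depths

def database_depth(depths):
    """A database's depth is the minimum over all its interfaces."""
    return min(depths.values()) if depths else None
```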
We found that query interfaces tend to be located shallowly in their sites: none of the 129 query interfaces had a depth deeper than 5. To begin with, 72% (93 out of 129) of the interfaces were found within depth 3. Further, since a Web database may be accessed through multiple interfaces, we measured its depth as the minimum depth of all its interfaces: 94% (32 out of 34) of the Web databases appeared within depth 3; Figure 2(a) reports the depth distribution of the 34 Web databases. Finally, 91.6% (22 out of 24) of the deep Web sites had their databases within depth 3. (We refer to these ratios as depth-three coverage, which will guide our further larger-scale crawling in Q2.)

(Q2) What is the scale of the deep Web? We then tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web. As just identified, with the high depth-three coverage, almost all Web databases can be identified within depth 3. We thus crawled to depth 3 for these one million IPs.

The crawling found 2,256 Web servers, among which we identified 126 deep Web sites, which contained a total of 406 query interfaces representing 190 Web databases. Extrapolating from the s = 1,000,000 unique IP samples to the entire IP space of t = 2,230,124,544 IPs, and accounting for the depth-three coverage, we estimate the number of deep Web sites as shown in Equation 2, the number of Web databases as shown in Equation 3, and the number of query interfaces as shown in Equation 4 (the results are rounded to 1,000).

Equation 2. Deep Web sites: (126 / 0.916) × (t / s) ≈ 307,000
Equation 3. Web databases: (190 / 0.94) × (t / s) ≈ 450,000
Equation 4. Query interfaces: (406 / 0.72) × (t / s) ≈ 1,258,000

The second and third columns of Table 1 summarize the sampling and estimation results, respectively. We also compute the confidence interval of each estimated number at the 99% level of confidence, as the fourth column of Table 1 shows, which indicates that the scale of the deep Web is well on the order of 10^5 sites. We also observed the multiplicity of access on the deep Web: on average, each deep Web site provides 1.5 databases, and each database supports 2.8 query interfaces.

The earlier survey [1] estimated 43,000 to 96,000 deep Web sites by overlap analysis between pairs of search engines. Although [1] did not explicitly qualify what it measured as a search site, by comparison it still indicates that our estimate of the scale of the deep Web (on the order of 10^5 sites) is quite accurate. Further, the deep Web has been expanding, with a 3 to 7 times increase over the four years from 2000 to 2004.

(Q3) How “structured” is the deep Web? While information on the surface Web is mostly unstructured HTML text (and images), how is the nature of deep Web data different? We classified Web databases into two types: unstructured databases, which provide data objects as unstructured media (text, images, audio, and video); and structured databases, which provide data objects as structured “relational” records with attribute-value pairs. For instance, cnn.com has an unstructured database of news articles, while amazon.com has a structured database for books, which returns book records (for example, title = “gone with the wind,” format = “paperback,” price = $7.99).

By manually querying and inspecting the 190 Web databases sampled, we found 43 unstructured and 147 structured ones. We similarly estimate their total numbers to be 102,000 and 348,000 respectively, as summarized in Table 1. Thus, the deep Web features mostly structured data sources, with a dominating ratio of 3.4:1 of structured versus unstructured sources.

(Q4) What is the subject distribution of Web databases? Using the top-level categories of the yahoo.com directory as our taxonomy, we manually categorized the 190 sampled Web databases. Figure 2(b) shows the distribution of the 14 categories: Business & Economy (be), Computers & Internet (ci), News & Media (nm), Entertainment (en), Recreation & Sports (rs), Health (he), Government (go), Regional (rg), Society & Culture (sc), Education (ed), Arts & Humanities (ah), Science (si), Reference (re), and Others (ot).

The distribution indicates great subject diversity among Web databases, suggesting that the emergence and proliferation of Web databases span all subject domains. While there seems to be a common perception that the deep Web is driven and dominated by e-commerce (for example, for product search), our survey indicates the contrary. To contrast, we further identify the non-commerce categories in Figure 2(b) (he, go, rg, sc, ed, ah, si, re, and ot), which together occupy 51% (97 out of 190 databases), leaving only a slight minority of 49% to commerce sites (broadly defined). In comparison, the subject distribution of the surface Web, as characterized in [7], showed that commerce sites dominated with an 83% share. Thus, the trend of “deepening” emerges not only across all areas, but also relatively more significantly in the non-commerce ones.

(Q5) How do search engines cover the deep Web? Since some deep Web sources also provide “browse” directories with URL links that reach the hidden content, how effective is it to crawl and index the deep Web as search engines do for the surface Web? We thus investigated how popular search engines index data on the deep Web. In particular, we chose the three largest search engines: Google (google.com), Yahoo (yahoo.com), and MSN (msn.com).

We randomly selected 20 Web databases from the 190 in our sampling result. For each database, we first manually sampled five objects (result pages) as test data by querying the source with some random words. Then, for each object collected, we queried every search engine to test whether the page was indexed, by formulating queries specifically matching the object page. (For instance, we used distinctive phrases occurring in the object page as keywords and limited the search to only the source site.)

Figure 3 reports our findings: Google and Yahoo both indexed 32% of the deep Web objects, and MSN had the smallest coverage, at 11%. However, there was significant overlap in what they covered: the combined coverage of the three search engines increased only to 37%, indicating that they were indexing almost the same objects. In particular, as Figure 3 illustrates, Yahoo and Google overlapped substantially in the objects they indexed.

Figure 2a. Distribution of Web databases over depth (depths 0 through 10; proportion of Web databases on the vertical axis).

Figure 2b. Distribution of Web databases over subject category (the 14 categories be, ci, nm, en, rs, he, go, rg, sc, ed, ah, si, re, ot; proportion of Web databases on the vertical axis).

Figure 3. Coverage of search engines over the entire deep Web: Google.com 32%, Yahoo.com 32%, MSN.com 11%, all three combined 37%.

Table 1. Sampling and estimation of the deep Web scale.

                      Sampling Results   Total Estimate   99% Confidence Interval
  Deep Web sites             126              307,000       236,000 - 377,000
  Web databases              190              450,000       366,000 - 535,000
    - unstructured            43              102,000        62,000 - 142,000
    - structured             147              348,000       275,000 - 423,000
  Query interfaces           406            1,258,000     1,097,000 - 1,419,000
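The extrapolation in Equations 2 through 4 can be sketched as follows, together with a binomial normal-approximation confidence interval; the interval formula is our assumption about how the 99% bounds in Table 1 were computed, though it reproduces the published site interval closely:

```python
import math

S = 1_000_000        # sampled IPs
T = 2_230_124_544    # total valid IP addresses

def extrapolate(found, coverage):
    """Scale a count observed in the depth-3 crawl of S sampled IPs
    to the whole IP space, correcting for depth-three coverage (the
    fraction of items reachable within depth 3)."""
    return found / coverage * (T / S)

def confidence_99(found, coverage, z=2.576):
    """Normal-approximation 99% interval for the per-IP hit rate,
    scaled the same way (an assumed reconstruction, not the
    article's stated method)."""
    p = found / S
    half = z * math.sqrt(p * (1 - p) / S)
    return ((p - half) / coverage * T, (p + half) / coverage * T)

sites = extrapolate(126, 22 / 24)        # ~307,000 deep Web sites
databases = extrapolate(190, 32 / 34)    # ~450,000 Web databases
interfaces = extrapolate(406, 93 / 129)  # ~1.26 million interfaces
```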
on 27% of the objects within their 32% coverage: an 84% over-
lap. Moreover, MSN's coverage was entirely a sub-
set of Yahoo's, and thus a 100% overlap.
The coverage results
reveal some interesting
phenomena. On one
hand, in contrast to the
common perception,
the deep Web is proba-
bly not inherently hid-
den or invisible: the
major search engines
were able to each index one-third (32%) of the
data. On the other hand, however, the coverage
seems bounded by an intrinsic limit. Combined,
these major engines covered only marginally more
than they did individually, due to their significant
overlap. This phenomenon clearly contrasts with
the surface Web where, as [7] reports, the overlap
between engines is low, and combining them (or
metasearch) can greatly improve coverage. In this
case, for the deep Web, the fact
that 63% of the objects were not
indexed by any engine indi-
cates certain inherent barriers
to crawling and indexing the data.
Most Web databases remain
invisible, providing no link-
based access, and are thus not
indexable by current crawling
techniques; and even when
crawlable, Web databases are
rather dynamic, and thus crawl-
ing cannot keep up with their
updates.
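The coverage and overlap figures above are simple set arithmetic over a sample of deep-Web data objects. A sketch with synthetic object IDs constructed so the set sizes mirror the reported percentages (the IDs are fabricated for illustration; only the arithmetic is the article's):

```python
# A synthetic sample of 100 deep-Web objects (IDs 0-99), sized so
# the coverages match the percentages reported in the text.
universe = set(range(100))
google = set(range(0, 32))                      # 32% coverage
yahoo = set(range(0, 27)) | set(range(32, 37))  # 32%, sharing 27 objects with Google
msn = set(range(0, 11))                         # 11%, entirely inside Yahoo

def coverage(indexed: set) -> float:
    """Fraction of the sampled objects an engine has indexed."""
    return len(indexed) / len(universe)

combined = google | yahoo | msn
print(coverage(google), coverage(yahoo), coverage(msn))  # 0.32 0.32 0.11
print(coverage(combined))                                # 0.37
print(len(google & yahoo) / len(google))  # 0.84375 -- the ~84% overlap
print(msn <= yahoo)                       # True: MSN is a subset of Yahoo
```

Combining the engines barely improves on any single one (37% vs. 32%), which is exactly the intrinsic-limit phenomenon the text describes.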
(Q6) What is the coverage of
deep Web directories? Besides
traditional search engines, sev-
eral deep Web portal services
have emerged online, providing deep Web directo-
ries that classify Web databases into tax-
onomies. To measure their coverage, we surveyed
four popular deep Web directories, as summarized
in Table 2. For each directory service, we recorded
the number of databases it claimed to have
indexed (on their Web sites). As a result, com-
pleteplanet.com was the largest such directory,
with over 70,000 databases.2
As shown in Table 2,
compared to our estimate, it covered only 15.6%
of the total 450,000 Web databases. However,
other directories covered even less, in the limited
range of 0.2%–3.1%. We believe this extremely
low coverage suggests that, with their apparently
manual classification of Web databases, such direc-
tory-based indexing ser-
vices can hardly scale for
the deep Web.
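The directory coverages in Table 2 follow directly from the survey's estimate of 450,000 Web databases (Table 1): each directory's claimed count divided by that total. A short check of the arithmetic:

```python
# Coverage of each directory = databases it claims to index,
# divided by the survey's estimate of 450,000 Web databases.
ESTIMATED_DATABASES = 450_000

directories = {
    "completeplanet.com": 70_000,
    "lii.org": 14_000,
    "turbo10.com": 2_300,
    "invisible-web.net": 1_000,
}

for name, claimed in directories.items():
    pct = 100 * claimed / ESTIMATED_DATABASES
    print(f"{name}: {pct:.1f}%")
# completeplanet.com: 15.6%
# lii.org: 3.1%
# turbo10.com: 0.5%
# invisible-web.net: 0.2%
```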
CONCLUSION
For further discussion,
we summarize the find-
ings of this survey for the
deep Web in Table 3 and
make the following con-
clusions. While impor-
tant for information
search, the deep Web remains largely unexplored
and is currently neither well supported nor well
understood. The poor coverage of both its data (by
search engines) and databases (by directory ser-
vices) suggests that access to the deep Web is not
adequately supported. In seeking to better under-
stand the deep Web, we’ve determined that in some
aspects it resembles the surface Web: it is large,
fast-growing, and diverse. However, the two differ in
other respects: the deep Web is more diversely dis-
tributed, is mostly structured, and suffers from an inher-
ent limitation on crawling.
To support effective access to the deep Web,
although the crawl-and-index techniques widely
used in popular search engines have been quite suc-
cessful for the surface Web, such an access model
may not be appropriate for the deep Web. Crawl-
ing will likely encounter the limit of coverage,
which seems intrinsic because of the hidden and
dynamic nature of Web databases. Further, index-
ing the crawled data will likely face the barrier of
structural heterogeneity across the wide range of
deep Web data. The current keyword-based index-
ing (which all search engines do), while serving the
Table 2. Coverage of deep-Web directories.

                     Number of Web Databases   Coverage
completeplanet.com   70,000                    15.6%
lii.org              14,000                    3.1%
turbo10.com          2,300                     0.5%
invisible-web.net    1,000                     0.2%
Table 3. Summary of findings in our survey.

Scale: The deep Web is of a large scale: 307,000 sites, 450,000 databases, and 1,258,000 interfaces. It has been rapidly expanding, with a 3–7 times increase between 2000 and 2004.

Diversity: The deep Web is diversely distributed across all subject areas. Although e-commerce is a main driving force, the trend of “deepening” emerges not only across all areas, but also relatively more significantly in the non-commerce ones.

Structural complexity: Data sources on the deep Web are mostly structured, outnumbering unstructured sources by a ratio of 3.4, unlike the surface Web.

Depth: Web databases tend to be located shallowly in their sites; the vast majority (94%) can be found at the top three levels.

Search-engine coverage: The deep Web is not entirely “hidden” from crawling; major search engines cover about one-third of the data. However, there seems to be an intrinsic limit of coverage: search engines combined cover roughly the same data, unlike the surface Web.

Directory coverage: While some deep-Web directory services have started to index databases on the Web, their coverage is small, ranging from 0.2% to 15.6%.
2. However, we noticed that completeplanet.com also indexed “site search,” which we have excluded; thus, its coverage could be overestimated.
surface Web pages well, will miss the schematic
structure available in most Web databases. This sit-
uation is analogous to being limited to searching
for flight tickets by keywords only, not by destina-
tions, dates, and prices.
As traditional access techniques may not be
appropriate for the deep Web, it is crucial to develop
more effective techniques. We speculate that the
deep Web will likely be better served with a data-
base-centered, discover-and-forward access model. A
search engine will automatically discover databases
on the Web by crawling and indexing their query
interfaces (and not their data pages). Upon user
querying, the search engine will forward users to the
appropriate databases for the actual search of data.
Querying the databases will use their data-specific
interfaces and thus fully leverage their structures. To
use the previous analogy of searching for flight infor-
mation, we can now query flights with the desired
attributes. Several recent research projects, including
MetaQuerier [2] and WISE-Integrator [6], are
exploring this exciting direction.
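The discover-and-forward model just described can be sketched as a two-step service: index the query interfaces of Web databases (not their data pages), then route a user's structured query to the databases whose interfaces can answer it. Everything below, including the site names, attribute sets, and matching rule, is a hypothetical illustration, not the MetaQuerier or WISE-Integrator implementation:

```python
# Step 1 (discovery): a crawl finds query interfaces and records
# which attributes each database's query form exposes.
# (These sites and attribute names are invented for illustration.)
interface_index = {
    "example-flights.com": {"origin", "destination", "date", "price"},
    "example-books.com": {"title", "author", "price"},
}

def forward(query_attrs: set) -> list:
    """Step 2 (forwarding): return the databases whose query
    interface supports every attribute the user wants to search on,
    so the user can run the actual search against the source itself."""
    return sorted(site for site, attrs in interface_index.items()
                  if query_attrs <= attrs)

# The flight-ticket analogy from the text: querying by destination,
# date, and price rather than by keywords only.
print(forward({"destination", "date", "price"}))  # ['example-flights.com']
```

The design choice matters: because only the small, slowly changing query interfaces are indexed, the approach sidesteps both the coverage limit of crawling data pages and the staleness of crawled copies.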
References
1. BrightPlanet.com. The deep Web: Surfacing hidden value; brightplanet.com/resources/details/deepweb.html.
2. Chang, K.C.-C., He, B., and Zhang, Z. Toward large-scale integration: Building a MetaQuerier over databases on the Web. In Proceedings of the 2nd CIDR Conference, 2005.
3. Fetterly, D., Manasse, M., Najork, M., and Wiener, J. A large-scale study of the evolution of Web pages. In Proceedings of the 12th International World Wide Web Conference, 2003, 669–678.
4. Ghanem, T.M. and Aref, W.G. Databases deepen the Web. IEEE Computer 37, 1 (2004), 116–117.
5. GNU. wget; www.gnu.org/software/wget/wget.html.
6. He, H., Meng, W., Yu, C., and Wu, Z. WISE-Integrator: An automatic integrator of Web search interfaces for e-commerce. In Proceedings of the 29th VLDB Conference, 2003.
7. Lawrence, S. and Giles, C.L. Accessibility of information on the Web. Nature 400, 6740 (1999), 107–109.
8. O'Neill, E., Lavoie, B., and Bennett, R. Web characterization; wcp.oclc.org.
Bin He (binhe@uiuc.edu) is a research staff member at IBM
Almaden Research Center in San Jose, CA.
Mitesh Patel (miteshp@microsoft.com) is a developer at Microsoft
Corporation.
Zhen Zhang (zhang2@uiuc.edu) is a graduate research assistant in
computer science at the University of Illinois at Urbana-Champaign.
Kevin Chen-Chuan Chang (kcchang@cs.uiuc.edu) is an
assistant professor of computer science at the University of Illinois at
Urbana-Champaign.
Permission to make digital or hard copies of all or part of this work for personal or class-
room use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on
the first page. To copy otherwise, to republish, to post on servers or to redistribute to
lists, requires prior specific permission and/or a fee.
© 2007 ACM 0001-0782/07/0500 $5.00
Swing and Graphical User Interface in Java
 
Tcp sockets
Tcp socketsTcp sockets
Tcp sockets
 
block ciphers and the des
block ciphers and the desblock ciphers and the des
block ciphers and the des
 
key distribution in network security
key distribution in network securitykey distribution in network security
key distribution in network security
 
Lecture10 Signal and Systems
Lecture10 Signal and SystemsLecture10 Signal and Systems
Lecture10 Signal and Systems
 
Lecture8 Signal and Systems
Lecture8 Signal and SystemsLecture8 Signal and Systems
Lecture8 Signal and Systems
 
Lecture7 Signal and Systems
Lecture7 Signal and SystemsLecture7 Signal and Systems
Lecture7 Signal and Systems
 
Lecture6 Signal and Systems
Lecture6 Signal and SystemsLecture6 Signal and Systems
Lecture6 Signal and Systems
 
Lecture5 Signal and Systems
Lecture5 Signal and SystemsLecture5 Signal and Systems
Lecture5 Signal and Systems
 
Lecture4 Signal and Systems
Lecture4  Signal and SystemsLecture4  Signal and Systems
Lecture4 Signal and Systems
 
Lecture3 Signal and Systems
Lecture3 Signal and SystemsLecture3 Signal and Systems
Lecture3 Signal and Systems
 
Lecture2 Signal and Systems
Lecture2 Signal and SystemsLecture2 Signal and Systems
Lecture2 Signal and Systems
 
Lecture1 Intro To Signa
Lecture1 Intro To SignaLecture1 Intro To Signa
Lecture1 Intro To Signa
 
Lecture9 Signal and Systems
Lecture9 Signal and SystemsLecture9 Signal and Systems
Lecture9 Signal and Systems
 
Lecture9
Lecture9Lecture9
Lecture9
 
Classical Encryption Techniques in Network Security
Classical Encryption Techniques in Network SecurityClassical Encryption Techniques in Network Security
Classical Encryption Techniques in Network Security
 
Problems at independence
Problems at independenceProblems at independence
Problems at independence
 

Recently uploaded

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

Accessing the deep web (2007)

ACCESSING THE DEEP WEB

By BIN HE, MITESH PATEL, ZHEN ZHANG, and KEVIN CHEN-CHUAN CHANG

Attempting to locate and quantify material on the Web that is hidden from typical search techniques.

COMMUNICATIONS OF THE ACM, May 2007/Vol. 50, No. 5. Illustration by Peter Hoey.

The Web has been rapidly “deepened” by massive databases online, and current search engines do not reach most of the data on the Internet [4]. While the surface Web has linked billions of static HTML pages, a far more significant amount of information is believed to be “hidden” in the deep Web, behind the query forms of searchable databases, as Figure 1(a) conceptually illustrates. Such information may not be accessible through static URL links because it is assembled into Web pages only as responses to queries submitted through the query interface of an underlying database. Because current search engines cannot effectively crawl databases, such data remains largely hidden from users (and is thus often also referred to as the invisible or hidden Web). Using overlap analysis between pairs of search engines, the study in [1] estimated 43,000–96,000 “deep Web sites” and, informally, 7,500 terabytes of data, 500 times larger than the surface Web.
With its myriad databases and hidden content, this deep Web is an important yet largely unexplored frontier for information search. While we have understood the surface Web relatively well, with various surveys [3, 7], how is the deep Web different? This article reports our survey of the deep Web, studying the scale, subject distribution, search-engine coverage, and other access characteristics of online databases. We note that, while the study conducted in 2000 [1] established interest in this area, it focused only on the scale aspect, and its result from overlap analysis tends to underestimate (as acknowledged in [1]). In overlap analysis, the number of deep Web sites is estimated by exploiting two search engines: if we find na deep Web sites in the first search engine, nb in the second, and n0 in both, and we assume the two search engines randomly and independently obtain their data, we can estimate the total number as

Equation 1: n = (na × nb) / n0.

However, as our survey found, search engines are highly correlated in their coverage of deep Web data (see question Q5 in this survey). Such an independence assumption therefore seems rather unrealistic, in which case the result is significantly underestimated. In fact, the violation of this assumption and its consequence were also discussed in [1].

Our survey took the IP sampling approach to collect random server samples for estimating the global scale as well as facilitating subsequent analysis. During April 2004, we acquired and analyzed a random sample of Web servers by IP sampling. We randomly sampled 1,000,000 IPs (from the entire space of 2,230,124,544 valid IP addresses, after removing reserved and unused IP ranges according to [8]). For each IP, we used an HTTP client, the GNU free software wget [5], to make an HTTP connection to it and download HTML pages. We then identified and analyzed the Web databases in this sample, in order to extrapolate our estimates of the deep Web.

Our survey distinguishes three related notions for accessing the deep Web: site, database, and interface. A deep Web site is a Web server that provides information maintained in one or more back-end Web databases, each of which is searchable through one or more HTML forms as its query interfaces. For instance, as Figure 1(b) shows, bn.com is a deep Web site, providing several Web databases (a book database and a music database, among others) accessed via multiple query interfaces (“simple search” and “advanced search”). Note that our definition of deep Web site does not account for virtual hosting, where multiple Web sites can be hosted on the same physical IP address. Since identifying all the virtual hosts within an IP address is rather difficult in practice, we do not consider such cases in our survey. Our IP sampling-based estimation is thus accurate modulo the effect of virtual hosting.

When conducting the survey, we first find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites.

First, as our survey specifically focuses on online databases, we differentiate and exclude non-query HTML forms (which do not access back-end databases) from query interfaces. In particular, HTML forms for login, subscription, registration, polling, and message posting are not query interfaces. Similarly, we also exclude “site search,” which many Web sites now provide for searching the HTML pages on their sites: those pages are statically linked at the “surface” of the sites, not dynamically assembled from an underlying database. Note that our survey considered only unique interfaces and removed duplicates, since many Web pages contain the same query interface repeatedly; for example, in bn.com, the simple book search in Figure 1(b) is present on almost all pages.

Second, we survey Web databases and deep Web sites based on the discovered query interfaces. Specifically, we compute the number of Web databases by finding the sets of query interfaces (within a site) that refer to the same database. In particular, for any two query interfaces, we randomly choose five objects from one and search for them in the other. We judge that the two interfaces are searching the same database if and only if the objects from one interface can always be found in the other. Finally, the recognition of a deep Web site is rather simple: a Web site is a deep Web site if it has at least one query interface.

RESULTS

(Q1) Where to find “entrances” to databases? To access a Web database, we must first find its entrances: the query interfaces. Where is an interface (if any) located in a site, that is, at which depth? For each query interface, we measured the depth as the minimum number of hops from the root page of the site to the interface page. (Such depth information is obtained by a simple revision of the wget software.) As this study required deep crawling of Web sites, we analyzed one-tenth of our total IP samples: a subset of 100,000 IPs. We tested each IP sample by making HTTP connections and found 281 Web servers. Exhaustively crawling these servers to depth 10, we found that 24 of them are deep Web sites, which contained a total of 129 query interfaces representing 34 Web databases.

[Figure 1a. The conceptual view of the deep Web. Figure 1b. Site, databases, and interface: the deep Web site bn.com, with its book and music databases, each accessed through simple and advanced search interfaces.]
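The overlap estimate referenced as Equation 1 is the classical capture-recapture calculation. A minimal sketch (the function name is ours, not from the article):

```python
def overlap_estimate(n_a: int, n_b: int, n_0: int) -> float:
    """Capture-recapture estimate of a total population size.

    n_a: deep Web sites found via the first search engine
    n_b: sites found via the second engine
    n_0: sites found in both (the overlap)

    Under the assumption that the two engines sample sites
    independently, n_0 / n_b estimates the fraction of all sites
    the first engine covers, so total ~= n_a * n_b / n_0.
    """
    if n_0 == 0:
        raise ValueError("no overlap: estimate is undefined")
    return n_a * n_b / n_0

# Toy illustration: if two engines each find 6,000 sites and share
# 1,000 of them, independence implies ~36,000 sites in total.
print(overlap_estimate(6000, 6000, 1000))  # 36000.0
```

As the survey observes, correlated engine coverage inflates n_0 relative to the independence assumption, which is exactly why this estimator undercounts the deep Web.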
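The pairwise interface test described above (probe five random objects from one interface against the other) can be sketched as follows; `search_b` is a stand-in for issuing a query through an interface and checking the result pages, not an API from the article:

```python
import random

def same_database(objects_a, search_b, sample_size=5, seed=0):
    """Heuristic from the survey: two interfaces A and B are judged
    to query the same back-end database iff randomly chosen objects
    retrievable through A can always be found through B.

    objects_a: objects retrievable through interface A
    search_b:  callable(obj) -> bool, True if obj is found via B
    """
    rng = random.Random(seed)
    probes = rng.sample(objects_a, min(sample_size, len(objects_a)))
    return all(search_b(obj) for obj in probes)

# Toy illustration with in-memory stand-ins for two databases:
db = {"book1", "book2", "book3", "book4", "book5", "book6"}
other_db = {"cd1", "cd2"}
print(same_database(sorted(db), lambda o: o in db))        # True
print(same_database(sorted(db), lambda o: o in other_db))  # False
```

Interfaces within a site can then be grouped into databases by applying this check pairwise, which is how the survey arrives at counts of databases rather than counts of forms.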
  • 5. We found that query interfaces tend to be located shallowly within their sites: none of the 129 query interfaces was deeper than depth 5. To begin with, 72% (93 out of 129) of the interfaces were found within depth 3. Further, since a Web database may be accessed through multiple interfaces, we measured its depth as the minimum depth over all its interfaces: 94% (32 out of 34) of the Web databases appeared within depth 3; Figure 2(a) reports the depth distribution of the 34 Web databases. Finally, 91.6% (22 out of 24) of the deep Web sites had their databases within depth 3. (We refer to these ratios as depth-three coverage; they guide our larger-scale crawling in Q2.)
(Q2) What is the scale of the deep Web? We then tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web. As just identified, given the high depth-three coverage, almost all Web databases can be identified within depth 3. We thus crawled to depth 3 for the one million IPs. The crawl found 2,256 Web servers, among which we identified 126 deep Web sites, containing a total of 406 query interfaces that represent 190 Web databases. Extrapolating from the s = 1,000,000 unique IP samples to the entire IP space of t = 2,230,124,544 IPs, and accounting for depth-three coverage, we estimate the number of deep Web sites as 126 scaled by t/s and divided by the 91.6% site-level coverage, giving 307,000 (Equation 2); similarly, 450,000 Web databases (Equation 3, using the 94% database-level coverage) and 1,258,000 query interfaces (Equation 4, using the 72% interface-level coverage), with results rounded to the nearest 1,000. The second and third columns of Table 1 summarize the sampling and estimation results, respectively. We also compute the confidence interval of each estimate at the 99% level of confidence, as the fourth column of Table 1 shows; these intervals indicate that the scale of the deep Web is well on the order of 10^5 sites.
Table 1. Sampling and estimation of the deep Web scale.
                   | Sampling Results | Total Estimate | 99% Confidence Interval
Deep Web sites     | 126              | 307,000        | 236,000–377,000
Web databases      | 190              | 450,000        | 366,000–535,000
 – unstructured    | 43               | 102,000        | 62,000–142,000
 – structured      | 147              | 348,000        | 275,000–423,000
Query interfaces   | 406              | 1,258,000      | 1,097,000–1,419,000
We also observed the multiplicity of access on the deep Web: on average, each deep Web site provides 1.5 databases, and each database supports 2.8 query interfaces. The earlier survey [1] estimated 43,000 to 96,000 deep Web sites by overlap analysis between pairs of search engines. Although [1] did not explicitly qualify what it measured as a search site, the comparison still indicates that our estimate of the scale of the deep Web (on the order of 10^5 sites) is quite accurate. Further, the deep Web has been expanding, with a 3–7 times increase over the four years from 2000 to 2004.
(Q3) How "structured" is the deep Web? While information on the surface Web is mostly unstructured HTML text (and images), how different is the nature of deep Web data? We classified Web databases into two types: unstructured databases, which provide data objects as unstructured media (text, images, audio, and video); and structured databases, which provide data objects as structured "relational" records with attribute-value pairs. For instance, cnn.com has an unstructured database of news articles, while amazon.com has a structured database for books, which returns book records (for example, title = "gone with the wind," format = "paperback," price = $7.99). By manually querying and inspecting the 190 Web databases sampled, we found 43 unstructured and 147 structured databases. We similarly estimate their total numbers to be 102,000 and 348,000, respectively, as summarized in Table 1. Thus, the deep Web features mostly structured data sources, which outnumber unstructured sources by a dominating ratio of 3.4:1.
(Q4) What is the subject distribution of Web databases? Using the top-level categories of the yahoo.com directory as our taxonomy, we manually categorized the 190 sampled Web databases. Figure 2(b) shows the distribution over 14 categories: Business & Economy (be), Computers & Internet (ci), News & Media (nm), Entertainment (en), Recreation & Sports (rs), Health (he), Government (go), Regional (rg), Society & Culture (sc), Education (ed), Arts & Humanities (ah), Science (si), Reference (re), and Others (ot). The distribution indicates great subject diversity among Web databases, showing that the emergence and proliferation of Web databases span all subject domains. While there seems to be a common perception that the deep Web is driven and dominated by e-commerce (for example, product search), our survey indicates the contrary. By contrast, the non-commerce categories in Figure 2(b) (he, go, rg, sc, ed, ah, si, re, ot) together occupy 51% (97 out of 190 databases), leaving only a slight minority of 49% to commerce sites (broadly defined). In comparison, the subject distribution of the surface Web, as characterized in [7], showed commerce sites dominating with an 83% share. Thus, the trend of "deepening" emerges not only across all areas, but also relatively more significantly in the non-commerce ones.
[Figure 2a. Distribution of Web databases over depth. Figure 2b. Distribution of Web databases over subject category.]
(Q5) How do search engines cover the deep Web? Since some deep Web sources also provide "browse" directories with URL links to reach the hidden content, how effective is it to crawl and index the deep Web as search engines do for the surface Web? We investigated how popular search engines index data on the deep Web. In particular, we chose the three largest search engines: Google (google.com), Yahoo (yahoo.com), and MSN (msn.com). We randomly selected 20 Web databases from the 190 in our sampling result. For each database, we first manually sampled five objects (result pages) as test data, by querying the source with some random words. Then, for each object collected, we queried every search engine to test whether the page was indexed, by formulating queries specifically matching the object page. (For instance, we used distinctive phrases that occurred in the object page as keywords and limited the search to the source site.) Figure 3 reports our findings: Google and Yahoo each indexed 32% of the deep Web objects, and MSN had the smallest coverage at 11%. However, there was significant overlap in what they covered: the combined coverage of the three largest search engines increased only to 37%, indicating they were indexing almost the same objects.
[Figure 3. Coverage of search engines.]
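The overlap analysis in Q5 reduces to set arithmetic over which engines indexed each sampled object. The sketch below uses hypothetical object identifiers chosen only to mirror the percentages reported in Figure 3 (out of 100 objects: 32 indexed by Google, 32 by Yahoo, 27 by both, and MSN's 11 entirely inside Yahoo's coverage); the IDs themselves are made up, and making MSN's set also fall inside Google's is an artifact of this construction, not a claim from the article.

```python
# Hypothetical object IDs 0..99 standing in for the sampled result pages.
# Sets are constructed to mirror Figure 3's reported coverage.
google = set(range(0, 32))                       # 32 objects
yahoo = set(range(0, 27)) | set(range(32, 37))   # 27 shared with Google + 5 own
msn = set(range(0, 11))                          # 11 objects, a subset of Yahoo
total = 100

union = google | yahoo | msn
print(f"combined coverage: {len(union)}%")              # prints 37%
print(f"Google-Yahoo overlap: {len(google & yahoo)} of 32")  # prints 27 of 32
print(f"unindexed by any engine: {total - len(union)}%")     # prints 63%
```

The 27-of-32 overlap is the 84% figure discussed in the text, and the 63% remainder is the portion of deep Web objects reached by none of the three engines.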
  • 7. In particular, as Figure 3 illustrates, Yahoo and Google overlapped on 27% of the objects within their 32% coverage: an 84% overlap. Moreover, MSN's coverage was entirely a subset of Yahoo's: a 100% overlap. The coverage results reveal some interesting phenomena. On one hand, in contrast to the common perception, the deep Web is probably not inherently hidden or invisible: the major search engines were each able to index one-third (32%) of the data. On the other hand, the coverage seems bounded by an intrinsic limit: combined, these major engines covered only marginally more than they did individually, due to their significant overlap. This clearly contrasts with the surface Web where, as [7] reports, the overlap between engines is low, and combining them (or metasearch) can greatly improve coverage. For the deep Web, the fact that 63% of the objects were not indexed by any engine indicates inherent barriers to crawling and indexing the data. Most Web databases remain invisible, providing no link-based access, and are thus not indexable by current crawling techniques; and even when crawlable, Web databases are rather dynamic, so crawling cannot keep up with their updates.
(Q6) What is the coverage of deep Web directories? Besides traditional search engines, several deep Web portal services have emerged online, providing deep Web directories that classify Web databases in some taxonomies. To measure their coverage, we surveyed four popular deep Web directories, as summarized in Table 2. For each directory service, we recorded the number of databases it claimed to have indexed (on its Web site). Completeplanet.com was the largest such directory, with over 70,000 databases.² Yet, as Table 2 shows, compared to our estimate it covered only 15.6% of the total 450,000 Web databases. The other directories covered even less, in the limited range of 0.2%–3.1%. We believe this extremely low coverage suggests that, with their apparently manual classification of Web databases, such directory-based indexing services can hardly scale for the deep Web.
Table 2. Coverage of deep Web directories.
Directory          | Number of Web Databases | Coverage
completeplanet.com | 70,000                  | 15.6%
lii.org            | 14,000                  | 3.1%
turbo10.com        | 2,300                   | 0.5%
invisible-web.net  | 1,000                   | 0.2%
² However, we noticed that completeplanet.com also indexed "site search," which we have excluded; thus, its coverage could be overestimated.
CONCLUSION
We summarize the findings of this survey in Table 3 and draw the following conclusions. While important for information search, the deep Web remains largely unexplored and is currently neither well supported nor well understood. The poor coverage of both its data (by search engines) and its databases (by directory services) suggests that access to the deep Web is not adequately supported. In seeking to better understand the deep Web, we have determined that in some respects it resembles the surface Web: it is large, fast-growing, and diverse. In other respects, however, they differ: the deep Web is more diversely distributed, is mostly structured, and suffers from an inherent limitation on crawling.
To support effective access to the deep Web, although the crawl-and-index techniques widely used in popular search engines have been quite successful for the surface Web, such an access model may not be appropriate for the deep Web. Crawling will likely encounter a limit of coverage, which seems intrinsic because of the hidden and dynamic nature of Web databases. Further, indexing the crawled data will likely face the barrier of structural heterogeneity across the wide range of deep Web data. Current keyword-based indexing (which all search engines use), while serving surface Web pages well, will miss the schematic structure available in most Web databases. This situation is analogous to being limited to searching for flight tickets by keywords only, rather than by destinations, dates, and prices.
As traditional access techniques may not be appropriate for the deep Web, it is crucial to develop more effective ones. We speculate that the deep Web will likely be better served by a database-centered, discover-and-forward access model. A search engine would automatically discover databases on the Web by crawling and indexing their query interfaces (and not their data pages). When a user queries, the search engine would forward the user to the appropriate databases for the actual search of data. Querying the databases would use their data-specific interfaces and thus fully leverage their structures. To use the previous flight-search analogy, we could then query flights with the desired attributes. Several recent research projects, including MetaQuerier [2] and WISE-Integrator [6], are exploring this exciting direction.
Table 3. Summary of survey findings.
Aspect                 | Findings
scale                  | The deep Web is of a large scale: 307,000 sites, 450,000 databases, and 1,258,000 interfaces. It has been rapidly expanding, with a 3–7 times increase between 2000 and 2004.
diversity              | The deep Web is diversely distributed across all subject areas. Although e-commerce is a main driving force, the trend of "deepening" emerges not only across all areas, but also relatively more significantly in the non-commerce ones.
structural complexity  | Data sources on the deep Web are mostly structured, outnumbering unstructured sources by a 3.4:1 ratio, unlike the surface Web.
depth                  | Web databases tend to be located shallowly in their sites; the vast majority (94%) can be found within the top three levels.
search engine coverage | The deep Web is not entirely "hidden" from crawling; major search engines each cover about one-third of the data. However, there seems to be an intrinsic limit of coverage: the engines combined cover roughly the same data, unlike on the surface Web.
directory coverage     | While some deep Web directory services have started to index databases on the Web, their coverage is small, ranging from 0.2% to 15.6%.
References
1. BrightPlanet.com. The deep Web: Surfacing hidden value; brightplanet.com/resources/details/deepweb.html.
2. Chen-Chuan Chang, K., He, B., and Zhang, Z. Toward large-scale integration: Building a MetaQuerier over databases on the Web. In Proceedings of the 2nd CIDR Conference, 2005.
3. Fetterly, D., Manasse, M., Najork, M., and Wiener, J. A large-scale study of the evolution of Web pages. In Proceedings of the 12th International World Wide Web Conference, 2003, 669–678.
4. Ghanem, T.M. and Aref, W.G. Databases deepen the Web. IEEE Computer 37, 1 (2004), 116–117.
5. GNU. wget; www.gnu.org/software/wget/wget.html.
6. He, H., Meng, W., Yu, C., and Wu, Z. WISE-Integrator: An automatic integrator of Web search interfaces for e-commerce. In Proceedings of the 29th VLDB Conference, 2003.
7. Lawrence, S. and Giles, C.L. Accessibility of information on the Web. Nature 400, 6740 (1999), 107–109.
8. O'Neill, E., Lavoie, B., and Bennett, R. Web characterization; wcp.oclc.org.
Bin He (binhe@uiuc.edu) is a research staff member at IBM Almaden Research Center in San Jose, CA.
Mitesh Patel (miteshp@microsoft.com) is a developer at Microsoft Corporation.
Zhen Zhang (zhang2@uiuc.edu) is a graduate research assistant in computer science at the University of Illinois at Urbana-Champaign.
Kevin Chen-Chuan Chang (kcchang@cs.uiuc.edu) is an assistant professor of computer science at the University of Illinois at Urbana-Champaign.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
© 2007 ACM 0001-0782/07/0500 $5.00