Background
● Deep Web: web content behind search interfaces (an example of such an interface was shown on the slide)
● Main problem: hard to crawl, so deep web content is poorly indexed and not available for search (hidden)
● Many research problems: roughly 150-200 works address certain aspects of the challenge (e.g., see "Search interfaces on the Web: querying and characterizing", Shestakov, 2008)
● "Clearly, the science and practice of deep web crawling is in its infancy" ("Web crawling", Olston & Najork, 2010)
Background
● What is still unknown (surprisingly):
  ○ How large is the deep Web: the number of deep web resources? the amount of content in them? what portion is indexed?
● So far, only a few studies have addressed this:
  ○ Bergman, 2001: number, amount of content
  ○ Chang et al., 2004: number, coverage
  ○ Shestakov et al., 2007: number
  ○ Chinese surveys: number
  ○ ...
Background
● All approaches used so far have serious limitations
● The basic idea behind estimating the number of deep web sites:
  ○ IP address random sampling method (proposed in 1997)
  ○ Description: take the pool of all IP addresses in use (~3 billion currently), generate a random sample (~one million is enough), connect to each address; if it serves HTTP, crawl it and search for search interfaces
  ○ Obtain the number of search interfaces found in the sample and apply sampling math to get an estimate
  ○ One can restrict the study to a segment of the Web (e.g., national): then the pool consists of national IP addresses only
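A minimal sketch of the IP random sampling estimator described above. All numbers in the example are illustrative placeholders, not survey results:

```python
import random

def sample_ips(pool, sample_size, seed=42):
    """Draw a uniform random sample of IP addresses from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, sample_size)

def estimate_total(hits_in_sample, sample_size, pool_size):
    """Scale the number of search interfaces found in the sample
    up to the whole pool (simple random sampling estimator)."""
    return hits_in_sample * pool_size / sample_size

# Illustrative numbers only: 4 search interfaces found in a
# 1,000,000-IP sample drawn from the ~3 billion addresses in use.
print(estimate_total(4, 1_000_000, 3_000_000_000))  # 12000.0
```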
Virtual Hosting
● Bottleneck: virtual hosting
● When only an IP is available, the URLs to crawl look like http://X.Y.Z.W -> lots of web sites hosted on X.Y.Z.W are missed
● Examples:
  ○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
  ○ This survey: 670,000 hosts on 80,000 IP addresses
● You can't ignore it!
Host-IP cluster sampling
● What if a large list of hosts is available?
  ○ In fact, getting one is not trivial, since such a list should cover a certain web segment well
● Host random sampling can be applied (Shestakov et al., 2007)
  ○ Works, but with limitations
  ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
    ■ Hard to solve: one would need to crawl all hosts in the list (their start web pages)
● Idea: resolve all hosts to their IPs
Host-IP cluster sampling
● Resolve all hosts in the list to their IP addresses
  ○ Gives a set of host-IP pairs
● Cluster hosts (pairs) by IP:
  ○ IP1: host11, host12, host13, ...
  ○ IP2: host21, host22, host23, ...
  ○ ...
  ○ IPN: hostN1, hostN2, hostN3, ...
● Generate a random sample of IPs
● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
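The clustering steps above can be sketched as follows. The hostnames and IPs in the example mirror the slide's placeholder notation; `resolve_hosts` performs real DNS lookups and is shown only for completeness:

```python
import random
import socket
from collections import defaultdict

def resolve_hosts(hostnames):
    """Resolve each hostname to an IP address; drop hosts that fail to resolve."""
    pairs = []
    for host in hostnames:
        try:
            pairs.append((host, socket.gethostbyname(host)))
        except socket.gaierror:
            pass
    return pairs

def cluster_by_ip(host_ip_pairs):
    """Group hostnames by the IP they resolve to: IP -> list of hosts."""
    clusters = defaultdict(list)
    for host, ip in host_ip_pairs:
        clusters[ip].append(host)
    return dict(clusters)

def sample_ip_clusters(clusters, k, seed=42):
    """Draw a uniform random sample of k IPs; each sampled IP brings
    its whole host cluster into the crawl seed."""
    rng = random.Random(seed)
    return rng.sample(sorted(clusters), k)

# Placeholder host-IP pairs in the slide's notation:
pairs = [("host11", "IP1"), ("host12", "IP1"), ("host21", "IP2")]
clusters = cluster_by_ip(pairs)
# clusters == {"IP1": ["host11", "host12"], "IP2": ["host21"]}
```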
Host-IP cluster sampling
● Analyze the sampled IPs
  ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
  ○ While crawling, unknown hosts (not in the list) may be found
    ■ Crawl only those that resolve either to IP2 or to IPs that are not in the IP list (IP1, IP2, ..., IPN)
● Identify search interfaces
  ○ Filtering, machine learning, manual checks
  ○ Out of scope here (see the references in the paper)
● Apply sampling formulas (see Section 4.4 of the paper)
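The rule for hosts discovered during the crawl can be stated in a few lines (a sketch; the function name and placeholder IPs are illustrative):

```python
def should_crawl_new_host(resolved_ip, sampled_ip, known_ips):
    """A host discovered while crawling a sampled IP's cluster is crawled
    only if it resolves to the sampled IP itself, or to an IP absent from
    the known IP list (IP1, ..., IPN); otherwise it belongs to another
    listed cluster and crawling it would bias the sample."""
    return resolved_ip == sampled_ip or resolved_ip not in known_ips

known_ips = {"IP1", "IP2", "IP3"}  # the list's IPs (placeholders)
assert should_crawl_new_host("IP2", "IP2", known_ips)      # resolves to the sampled IP
assert not should_crawl_new_host("IP1", "IP2", known_ips)  # belongs to another listed cluster
assert should_crawl_new_host("IP9", "IP2", known_ips)      # IP not in the list
```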
Results
● Dataset:
  ○ ~670 thousand hostnames
  ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
  ○ Resolved to ~80 thousand unique IP addresses
  ○ 77.2% of hosts shared their IPs with at least 20 other hosts <- the scale of virtual hosting
● 1075 IPs sampled -> 6237 hosts in the initial crawl seed
  ○ Enough if one is satisfied with NUM+/-25% at 95% confidence
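As a quick sanity check of the reported precision, the relative half-width of the final estimate (14,200 +/- 3,800, quoted on the Conclusion slide) can be computed directly:

```python
estimate = 14_200    # deep web sites in the Russian Web (09/2006)
half_width = 3_800   # reported 95% confidence half-width
rel_margin = half_width / estimate
print(f"{rel_margin:.1%}")  # 26.8%, close to the +/-25% target
```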
Comparison: host-IP vs IP sampling
● Conclusion: IP random sampling (used in previous deep web characterization studies) applied to the same dataset resulted in estimates that are 3.5 times smaller than the actual numbers (obtained by host-IP)
Conclusion
● Proposed the Host-IP clustering technique
  ○ Superior to IP random sampling
● Accurately characterized a national web segment
  ○ As of 09/2006, 14,200+/-3,800 deep web sites in the Russian Web
● The estimates obtained by Chang et al. (see the reference in the paper) are underestimates
● Planning to apply Host-IP to other datasets
  ○ Main challenge: obtaining a large list of hosts that reliably covers a certain web segment
● Contact me if interested in the Host-IP pairs datasets