Sampling national deep Web


Talk given at DEXA 2011 in Toulouse, France. Full text paper is available at



  1. Sampling National Deep Web
     Denis Shestakov, fname.lname at aalto.fi
     Department of Media Technology, Aalto University
     DEXA11, Toulouse, France, 31.08.2011
  2. Outline
     ● Background
     ● Our approach: Host-IP cluster random sampling
     ● Results
     ● Conclusions
  3. Background
     ● Deep Web: web content behind search interfaces
     ● (The slide shows an example of a search interface)
     ● Main problem: hard to crawl, so the content is poorly indexed and not available for search (hidden)
     ● Many research problems: roughly 150-200 works address certain aspects of the challenge (e.g., see Search interfaces on the Web: querying and characterizing, Shestakov, 2008)
     ● "Clearly, the science and practice of deep web crawling is in its infancy" (in Web crawling, Olston & Najork, 2010)
  4. Background
     ● What is still unknown (surprisingly):
       ○ How large is the deep Web: how many deep web resources are there? How much content do they hold? What portion is indexed?
     ● So far only a few studies have addressed this:
       ○ Bergman, 2001: number, amount of content
       ○ Chang et al., 2004: number, coverage
       ○ Shestakov et al., 2007: number
       ○ Chinese surveys: number
       ○ ...
  5. Background
     ● None of the approaches used so far is satisfactory
     ● The basic idea behind estimating the number of deep web sites:
       ○ IP address random sampling method (proposed in 1997)
       ○ Description: take the pool of all IP addresses (~3 billion currently in use), generate a random sample (~one million is enough), and connect to each address; if it serves HTTP, crawl it and look for search interfaces
       ○ Count the search interfaces found in the sample and apply sampling math to get an estimate
       ○ One can restrict the method to a segment of the Web (e.g., a national one): the pool then consists of national IP addresses only
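The "sampling math" for plain IP random sampling can be sketched as a scaled-up proportion. This is a minimal illustration, not the formulas from the paper: the function name and arguments are hypothetical, and it assumes a simple random sample without replacement in which each sampled IP is recorded as either hosting a deep web site or not.

```python
import math

def estimate_from_ip_sample(pool_size, sample_size, hits, z=1.96):
    """Scale the hit rate of a simple random IP sample up to the whole pool.

    pool_size   -- number of IP addresses in the pool (assumed known)
    sample_size -- number of IP addresses actually probed
    hits        -- sampled IPs found to host a deep web site
    Returns (estimate, margin) for an approximate 95% interval when z=1.96.
    """
    p = hits / sample_size
    estimate = pool_size * p
    # Normal approximation with a finite population correction.
    fpc = (pool_size - sample_size) / (pool_size - 1)
    std_err = pool_size * math.sqrt(p * (1 - p) / sample_size * fpc)
    return estimate, z * std_err
```

Because such a sample sees only one site per IP address, the estimate counts IP addresses that host at least one deep web site, not the sites themselves.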
  6. Virtual Hosting
     ● Bottleneck: virtual hosting
     ● When only the IP is available, the URLs to crawl look like http://X.Y.Z.W, so many of the web sites hosted on X.Y.Z.W are missed
     ● Examples:
       ○ OVH (hosting company): 65,000 servers host 7,500,000
       ○ This survey: 670,000 hosts on 80,000 IP addresses
     ● You can't ignore it!
  7. Host-IP cluster sampling
     ● What if a large list of hosts is available?
       ○ In fact, such a list is not trivial to get, as it should cover a certain web segment well
     ● Host random sampling can be applied (Shestakov et al., 2007)
       ○ Works, but with limitations
       ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
         ■ Hard to solve: need to crawl all hosts in the list (their start web pages)
     ● Idea: resolve all hosts to their IPs
  8. Host-IP cluster sampling
     ● Resolve all hosts in the list to their IP addresses
       ○ A set of host-IP pairs
     ● Cluster hosts (pairs) by IP
       ○ IP1: host11, host12, host13, ...
       ○ IP2: host21, host22, host23, ...
       ○ ...
       ○ IPN: hostN1, hostN2, hostN3, ...
     ● Generate a random sample of IPs
     ● Analyze sampled IPs
       ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
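The clustering and sampling steps on this slide might look roughly like the sketch below. It assumes the hosts have already been resolved, so the input is a list of (host, IP) pairs; the function names and the `seed` parameter are illustrative and do not come from the talk.

```python
import random
from collections import defaultdict

def cluster_by_ip(host_ip_pairs):
    """Group hostnames by the IP address they resolved to."""
    clusters = defaultdict(list)
    for host, ip in host_ip_pairs:
        clusters[ip].append(host)
    return dict(clusters)

def sample_clusters(clusters, n, seed=None):
    """Draw n IPs uniformly without replacement; keep all hosts of each."""
    rng = random.Random(seed)
    # Sorting makes the draw reproducible for a given seed.
    sampled_ips = rng.sample(sorted(clusters), n)
    return {ip: clusters[ip] for ip in sampled_ips}
```

Every host in a sampled cluster goes into the crawl seed, which is how the virtual hosts behind a single IP get covered.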
  9. Host-IP cluster sampling
     ● Analyze sampled IPs
       ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
       ○ While crawling, unknown hosts (not in the list) may be found
         ■ Crawl only those that resolve either to IP2 or to an IP that is not among the list's IPs (IP1, IP2, ..., IPN)
     ● Identify search interfaces
       ○ Filtering, machine learning, manual check
       ○ Out of scope here (see ref [14] in the paper)
     ● Apply sampling formulas (see Section 4.4 of the paper)
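The rule for hosts discovered during the crawl reduces to a single membership test. The sketch below assumes the discovered host has already been resolved to an IP address; `should_crawl` and its parameters are hypothetical names.

```python
def should_crawl(discovered_ip, sampled_ip, listed_ips):
    """Decide whether a host found while crawling a sampled cluster is in scope.

    discovered_ip -- IP the newly found host resolves to
    sampled_ip    -- IP of the cluster currently being crawled (e.g., IP2)
    listed_ips    -- set of all IPs in the host-IP list (IP1, ..., IPN)
    """
    # In scope: same cluster, or an IP the list does not know about.
    # Out of scope: hosts that belong to another, unsampled listed cluster.
    return discovered_ip == sampled_ip or discovered_ip not in listed_ips
```

Passing `listed_ips` as a `set` keeps the check cheap even for tens of thousands of IP addresses.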
  10. Results
      ● Dataset:
        ○ ~670 thousand hostnames
        ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
        ○ Resolved to ~80 thousand unique IP addresses
        ○ 77.2% of hosts shared their IPs with at least 20 other hosts (the scale of virtual hosting)
      ● 1075 IPs sampled, giving 6237 hosts in the initial crawl seed
        ○ Enough if one is satisfied with NUM+/-25% at 95% confidence
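An interval of the form NUM+/-margin can be obtained from a cluster sample with the textbook one-stage cluster-sampling estimator of a total. The sketch below uses that standard estimator as a stand-in; it is not necessarily the formula in Section 4.4 of the paper, and the function and argument names are assumptions.

```python
import math
import statistics

def cluster_total_estimate(counts_per_sampled_ip, total_ips, z=1.96):
    """Estimate a population total from per-cluster counts.

    counts_per_sampled_ip -- deep web sites found in each sampled IP cluster
    total_ips             -- number of IP clusters in the whole list
    Returns (estimate, margin) for an approximate 95% interval when z=1.96.
    """
    n = len(counts_per_sampled_ip)
    mean = statistics.fmean(counts_per_sampled_ip)
    s2 = statistics.variance(counts_per_sampled_ip)  # sample variance
    estimate = total_ips * mean
    # Variance of the estimated total, with finite population correction.
    var = total_ips ** 2 * (1 - n / total_ips) * s2 / n
    return estimate, z * math.sqrt(var)
```

For example, `cluster_total_estimate([0, 0, 1, 3], total_ips=100)` scales a mean of one site per sampled IP up to an estimate of 100 sites; the margin is wide because the sample has only four clusters.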
  11. Results
  12. Comparison: host-IP vs IP sampling
      Conclusion: IP random sampling (used in previous deep web characterization studies), applied to the same dataset, resulted in estimates that are 3.5 times smaller than the actual numbers (obtained by host-IP)
  13. Conclusion
      ● Proposed the Host-IP clustering technique
        ○ Superior to IP random sampling
      ● Accurately characterized a national web segment
        ○ As of 09/2006, 14,200+/-3800 deep web sites in the Russian Web
      ● The estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
      ● Planning to apply Host-IP to other datasets
        ○ The main challenge is to obtain a large list of hosts that reliably covers a certain web segment
      ● Contact me if interested in Host-IP pairs datasets
  14. Thank you! Questions?