Sampling National Deep Web

Talk given at DEXA 2011 in Toulouse, France. Full text paper is available at http://goo.gl/oCWPkN

To cite please use: http://dx.doi.org/10.1007/978-3-642-23088-2_24

Transcript

  • 1. Sampling National Deep Web. Denis Shestakov, fname.lname at aalto.fi, Department of Media Technology, Aalto University. DEXA'11, Toulouse, France, 31.08.2011
  • 2. Outline
    ● Background
    ● Our approach: Host-IP cluster random sampling
    ● Results
    ● Conclusions
  • 3. Background
    ● Deep Web: web content behind search interfaces
    ● See the example of a search interface on the slide
    ● Main problem: hard to crawl, so the content is poorly indexed and not available for search (hidden)
    ● Many research problems: roughly 150-200 works addressing certain aspects of the challenge (e.g., see Search Interfaces on the Web: Querying and Characterizing, Shestakov, 2008)
    ● "Clearly, the science and practice of deep web crawling is in its infancy" (in Web Crawling, Olston & Najork, 2010)
  • 4. Background
    ● What is still unknown (surprisingly):
      ○ How large is the deep Web: the number of deep web resources? the amount of content in them? what portion is indexed?
    ● So far only a few studies have addressed this:
      ○ Bergman, 2001: number, amount of content
      ○ Chang et al., 2004: number, coverage
      ○ Shestakov et al., 2007: number
      ○ Chinese surveys: number
      ○ ...
  • 5. Background
    ● All approaches used so far have serious limitations
    ● The basic idea behind estimating the number of deep web sites:
      ○ IP address random sampling method (proposed in 1997)
      ○ Description: take the pool of all IP addresses in use (~3 billion currently), generate a random sample (~one million is enough), connect to each address, and if it serves HTTP, crawl it and search for search interfaces
      ○ Obtain the number of search interfaces in the sample and apply sampling math to get an estimate
      ○ One can restrict to some segment of the Web (e.g., national): then the pool consists of national IP addresses only
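A minimal sketch of that IP random sampling idea, not the implementation used in any of the cited studies: the `probe` callback (which would connect to the address, check for HTTP, crawl, and count search interfaces) is a hypothetical helper left to the caller.

```python
import random

def ip_random_sampling_estimate(ip_pool, sample_size, probe):
    """Estimate the total number of deep web sites over an IP pool.

    ip_pool     -- list of candidate IP addresses (e.g., a national allocation)
    sample_size -- how many addresses to probe
    probe       -- hypothetical function: probe(ip) -> number of search
                   interfaces found by crawling the site served at that IP
                   (0 if nothing listens on port 80)
    """
    sample = random.sample(ip_pool, sample_size)
    found = sum(probe(ip) for ip in sample)
    # Scale the count in the sample up to the whole pool.
    return found * len(ip_pool) / sample_size
```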
  • 6. Virtual Hosting
    ● Bottleneck: virtual hosting
    ● When only an IP is available, the URLs to crawl look like http://X.Y.Z.W -----> lots of web sites hosted on X.Y.Z.W are missed
    ● Examples:
      ○ OVH (hosting company): 65,000 servers host 7,500,000 web sites
      ○ This survey: 670,000 hosts on 80,000 IP addresses
    ● You can't ignore it!
  • 7. Host-IP cluster sampling
    ● What if a large list of hosts is available?
      ○ In fact, it is not trivial to get one, as such a list should cover a certain web segment well
    ● Host random sampling can be applied (Shestakov et al., 2007)
      ○ Works, but with limitations
      ○ Bottleneck: host aliasing, i.e., different hostnames lead to the same web site
        ■ Hard to solve: need to crawl all hosts in the list (their start pages)
    ● Idea: resolve all hosts to their IPs
  • 8. Host-IP cluster sampling
    ● Resolve all hosts in the list to their IP addresses
      ○ Gives a set of host-IP pairs
    ● Cluster hosts (pairs) by IP
      ○ IP1: host11, host12, host13, ...
      ○ IP2: host21, host22, host23, ...
      ○ ...
      ○ IPN: hostN1, hostN2, hostN3, ...
    ● Generate a random sample of IPs
    ● Analyze the sampled IPs (as sketched below)
      ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
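A minimal sketch of the clustering and sampling step, assuming a plain DNS lookup per hostname (unresolvable hosts are simply skipped); this is an illustration of the idea, not the code used in the survey.

```python
import random
import socket
from collections import defaultdict

def cluster_hosts_by_ip(hosts):
    """Resolve each hostname and group hostnames by the IP they resolve to."""
    clusters = defaultdict(list)
    for host in hosts:
        try:
            ip = socket.gethostbyname(host)
        except socket.gaierror:
            continue  # drop hosts that do not resolve
        clusters[ip].append(host)
    return clusters

def sample_ip_clusters(clusters, sample_size, seed=None):
    """Draw a simple random sample of IPs; each sampled IP brings all its hosts."""
    rng = random.Random(seed)
    sampled_ips = rng.sample(list(clusters), sample_size)
    return {ip: clusters[ip] for ip in sampled_ips}
```

Every host on a sampled IP then goes into the crawl seed, which is how the 1,075 sampled IPs turn into 6,237 seed hosts in the survey below.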
  • 9. Host-IP cluster sampling
    ● Analyze the sampled IPs
      ○ E.g., if IP2 is sampled, then crawl host21, host22, host23, ...
      ○ While crawling, unknown hosts (not in the list) may be found
        ■ Crawl only those that resolve either to IP2 or to IPs that are not among the list's IPs (IP1, IP2, ..., IPN); see the sketch after this list
    ● Identify search interfaces
      ○ Filtering, machine learning, manual check
      ○ Out of scope here (see ref [14] in the paper)
    ● Apply sampling formulas (see Section 4.4 of the paper)
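A minimal sketch of that decision rule for a host discovered during crawling, reusing the clusters built above; the helper name and exact handling are assumptions, not the paper's crawler logic.

```python
import socket

def should_crawl_new_host(host, sampled_ip, known_ips):
    """Decide whether a newly discovered host belongs to the current sample.

    host       -- hostname found while crawling the sites of a sampled IP
    sampled_ip -- the IP cluster currently being analyzed
    known_ips  -- set of all IPs in the host-IP list (IP1 ... IPN)

    The host is crawled only if it resolves back to the sampled IP, or to an
    IP outside the known list; hosts on other listed IPs belong to other
    clusters and would otherwise be double-counted.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return ip == sampled_ip or ip not in known_ips
```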
  • 10. Results
    ● Dataset:
      ○ ~670 thousand hostnames
      ○ Obtained from Yandex: good coverage of the Russian Web as of 2006
      ○ Resolved to ~80 thousand unique IP addresses
      ○ 77.2% of hosts shared their IP with at least 20 other hosts <-- the scale of virtual hosting
    ● 1,075 IPs sampled, giving 6,237 hosts in the initial crawl seed
      ○ Enough if one is satisfied with NUM +/- 25% at 95% confidence
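The survey's exact formulas are in Section 4.4 of the paper; purely as an illustration, a textbook simple-random-sampling-of-clusters estimate of the total, with a normal-approximation margin, would look like the sketch below. This is an assumption about the general technique, not the paper's estimator.

```python
import math
from statistics import mean, variance

def cluster_total_estimate(per_cluster_counts, num_clusters_total, z=1.96):
    """Estimate a total over all IP clusters from a simple random sample of clusters.

    per_cluster_counts -- deep web sites found in each sampled IP cluster
    num_clusters_total -- number of IP clusters in the whole list (~80,000 here)
    z                  -- normal quantile for the confidence level (1.96 ~ 95%)
    """
    n = len(per_cluster_counts)
    N = num_clusters_total
    estimate = N * mean(per_cluster_counts)
    # Standard error of the estimated total, with finite population correction.
    std_err = N * math.sqrt((1 - n / N) * variance(per_cluster_counts) / n)
    return estimate, z * std_err  # point estimate and +/- margin
```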
  • 11. Results
  • 12. Comparison: host-IP vs IP sampling
    Conclusion: IP random sampling (used in previous deep web characterization studies), applied to the same dataset, resulted in estimates that are 3.5 times smaller than the actual numbers (obtained by host-IP)
  • 13. Conclusion
    ● Proposed the Host-IP clustering technique
      ○ Superior to IP random sampling
    ● Accurately characterized a national web segment
      ○ As of 09/2006, 14,200 +/- 3,800 deep web sites in the Russian Web
    ● Estimates obtained by Chang et al. (ref [9] in the paper) are underestimates
    ● Planning to apply Host-IP to other datasets
      ○ Main challenge is to obtain a large list of hosts that reliably covers a certain web segment
    ● Contact me if interested in the Host-IP pairs datasets
  • 14. Thank you! Questions?