24. CONSTRAINTS
• Politeness: it’s easy to burden small servers
• Distributed (for any significant crawl)
• Linear Scalability: n machines = n*m pages-per-second
• Even partitioning: every machine should perform equal work
• Minimum overlap: crawl each page exactly once
Wednesday, September 14, 2011
36. Initialize:
    UrlsDone = {}
    UrlFrontier = {'google.com/index.html', ..}
Repeat
    url = UrlFrontier.getNext()
    ip = DNSlookup(url.getHostname())
    html = DownloadPage(ip, url.getPath())
    UrlsDone.insert(url)
    newUrls = parseForLinks(html)
    For each newUrl
        If not UrlsDone.contains(newUrl)
        then UrlFrontier.insert(newUrl)
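The slide's loop can be sketched as runnable Python. Everything here beyond the pseudocode (the regex-based link extractor, the injectable `fetch` function, the `max_pages` cap) is illustrative and not from the talk:

```python
from collections import deque
from urllib.parse import urljoin
import re

# Very loose href extractor; a real crawler should use an HTML parser.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the page's HTML.

    Mirrors the slide's loop: pop a URL from the frontier, download it,
    mark it done, and enqueue any links we have not seen yet.
    """
    urls_done = set()               # the slide's UrlsDone
    frontier = deque(seed_urls)     # the slide's UrlFrontier
    seen = set(seed_urls)           # avoid enqueueing duplicates
    while frontier and len(urls_done) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        urls_done.add(url)
        for link in HREF_RE.findall(html):
            new_url = urljoin(url, link)
            if new_url not in urls_done and new_url not in seen:
                seen.add(new_url)
                frontier.append(new_url)
    return urls_done
```

Injecting `fetch` keeps the loop testable against canned pages; a real fetcher would also honor robots.txt and per-host delays, which the later slides discuss.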
49. architecture overview
[Diagram: a CRAWL PLANNER feeds the URL QUEUE; the FETCHER takes URLs from the queue, downloads web data from the INTERNET, and passes the web data to STORAGE.]
57. challenges:
DNS Lookup
URLs Crawled
Politeness
URL Frontier
Queueing URLs
Extracting URLs
63. challenges:
DNS LOOKUP
65. challenges:
DNS LOOKUP
can easily be a bottleneck
• consider running your own DNS servers (djbdns, PowerDNS, etc.)
• be aware of software limitations: gethostbyaddr is synchronized, as are many “default” DNS clients
You’ll know when you need it
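One common way around a synchronized resolver is to do lookups in a thread pool with a cache in front, so requests overlap instead of serializing behind one blocking client. A standard-library-only sketch (the pool and cache sizes are arbitrary assumptions, not recommendations from the talk):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Each worker thread blocks in the resolver independently, so lookups
# overlap instead of serializing behind one synchronized client.
_pool = ThreadPoolExecutor(max_workers=32)

@lru_cache(maxsize=100_000)
def _resolve(hostname):
    # socket.gethostbyname blocks; the pool hides that latency.
    return socket.gethostbyname(hostname)

def dns_lookup_async(hostname):
    """Return a Future that resolves to the host's IPv4 address."""
    return _pool.submit(_resolve, hostname)
```

The cache also cuts load on upstream servers, since a crawler tends to hit the same hostnames repeatedly.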
69. challenges:
URLs CRAWLED
71. challenges:
URLs CRAWLED
1 machine, store in memory
NAPKIN CALCULATION
~50 bytes per URL
e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
+8 bytes for time-last-crawled
as a long, e.g. System.currentTimeMillis() -> 1314392455712
x 100 million
=~ 5.4 gigabytes
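The arithmetic behind the 5.4 GB figure, spelled out (using binary gigabytes, 2^30 bytes):

```python
URL_BYTES = 50          # average URL length (the slide's estimate)
TIMESTAMP_BYTES = 8     # a long, e.g. System.currentTimeMillis()
N_URLS = 100_000_000

total_bytes = (URL_BYTES + TIMESTAMP_BYTES) * N_URLS
gigabytes = total_bytes / 2**30
print(round(gigabytes, 1))  # → 5.4
```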
77. can we do better?
82. BLOOM FILTERS
Have we crawled: http://www.xcombinator.com?
answers either:
• yes, probably
• definitely not
85. challenges:
URLs CRAWLED
1 machine, bloom filter
NAPKIN CALCULATION
100 million URLs
1 in 100 million chance of false positive
=~ 457 megabytes
see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8
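The 457 MB figure follows from the standard Bloom-filter sizing formula m = -n·ln(p)/(ln 2)^2, which the linked calculator implements:

```python
import math

n = 100_000_000   # expected URLs
p = 1e-8          # acceptable false-positive rate

# Optimal number of bits, and hash functions per lookup, for a Bloom filter.
m_bits = -n * math.log(p) / math.log(2) ** 2
k_hashes = (m_bits / n) * math.log(2)

megabytes = m_bits / 8 / 2**20
print(round(megabytes))   # → 457
```

The trade-off is explicit: a tenfold looser false-positive rate shaves roughly 3.3 bits per URL off the filter.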
123. benefits:
~ 1/(n+1) URLs move on add/remove
virtual nodes help with skew
robust (no single point of failure)
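A minimal consistent-hash ring with virtual nodes illustrates the first benefit: when a machine joins, only the keys that land on its ring segments move, roughly 1/(n+1) of them. The class and parameter names here are made up for this sketch:

```python
import bisect
import hashlib

def _hash(key):
    # Any well-mixed hash works; md5 is just convenient and stable.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps hostnames to machines; adding a machine moves ~1/(n+1) of keys."""

    def __init__(self, machines, vnodes=100):
        self._ring = []  # sorted list of (hash point, machine)
        for machine in machines:
            self.add(machine, vnodes)

    def add(self, machine, vnodes=100):
        # Many virtual nodes per machine smooth out partition skew.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{machine}#{i}"), machine))

    def lookup(self, key):
        # First ring point clockwise of the key's hash owns the key.
        points = [p for p, _ in self._ring]  # O(n) here; cache in real use
        i = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

With 3 machines and 1000 hostnames, adding a fourth machine reassigns about a quarter of the keys; a naive `hash(url) % n` scheme would reassign nearly all of them.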
124. drawbacks:
naive solution won’t work for large sites
125. further reading:
Stoica et al., “Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications” (2001)
DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store” (SOSP 2007)
Zhao et al., “Tapestry: A Resilient Global-Scale Overlay for Service Deployment” (2004)
126. challenges:
QUEUEING URLS
177. challenges:
EXTRACTING URLS
be prepared:
use a streaming XML parser
use a library that handles bad markup
be aware that URLs aren’t ASCII
use a URL normalizer
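The last point can be sketched with only the standard library. The exact canonicalization rules below are illustrative; production normalizers typically apply more (percent-encoding normalization, dot-segment removal, IDN handling):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL so syntactic variants dedupe to one form.

    Lowercases scheme and host, drops default ports and fragments,
    and collapses an empty path to '/'.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    host = netloc.lower()
    # Strip the default port for the scheme.
    if (scheme, host.rsplit(':', 1)[-1]) in (('http', '80'), ('https', '443')):
        host = host.rsplit(':', 1)[0]
    if not path:
        path = '/'
    return urlunsplit((scheme, host, path, query, ''))
```

Without this step, `HTTP://Example.COM:80` and `http://example.com/` look like two different pages to the UrlsDone set and get crawled twice.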