Challenges in Large-Scale Web Crawling

  1. introduction to WEB CRAWLING & extraction, by Nate Murray. Wednesday, September 14, 2011
  2. WHO AM I?
  3. Nate Murray: AT&T Interactive (Yellowpages.com); TB-scale data since 2009; various crawlers since 2005
  4. what is WEB CRAWLING?
  5. definition: web crawler: a program that browses the web.
  6. definition: web extraction: transforming unstructured web data into structured data
  7. definition: web extraction: transforming semistructured web data into structured data
  8. motivation
  10. motivation: bookmark buddies (URL, Title, Users)
  13. motivation: business hours
      Day    Openness
      Mon    Closed
      Tue    11:30-14:30  17:30-22:00
      Wed    11:30-14:30  17:30-22:00
      Thur   11:30-14:30  17:30-22:00
      Fri    11:30-14:30  17:30-22:00
      Sat    12:00-14:30  17:00-22:00
      Sun    -            17:00-21:00
  16. motivation: recommend videos (Users)
  19. motivation: vertical search (Image, SKU, Name, Price, Rating)
  21. DESIRED PROPERTIES
  22. DESIRED PROPERTIES: SPEED
  23. CONSTRAINTS
      • Politeness: it’s easy to burden small servers
      • Distributed (for any significant crawl)
      • Linear Scalability: n machines = n*m pages-per-second (where one machine fetches m pages per second)
      • Even partitioning: every machine should perform equal work
      • Minimum overlap: crawl each page exactly once
  35. BASIC ALGORITHM
  36. Initialize:
          UrlsDone = {}    (empty set)
          UrlFrontier = {google.com/index.html, ..}
      Repeat:
          url = UrlFrontier.getNext()
          ip = DNSlookup(url.getHostname())
          html = DownloadPage(ip, url.getPath())
          UrlsDone.insert(url)
          newUrls = parseForLinks(html)
          For each newUrl:
              If not UrlsDone.contains(newUrl)
              then UrlFrontier.insert(newUrl)
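The basic algorithm can be sketched in Python roughly as follows. This is a minimal single-machine sketch, not the speaker's implementation: `crawl`, `parse_for_links`, and the `download_page` callback are hypothetical names, the DNS lookup is assumed to happen inside `download_page`, and link extraction uses a crude regular expression for illustration only. It also tracks queued URLs in one set so the same URL is not placed on the frontier twice, which the slide pseudocode does not do.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlsplit

# Crude href matcher; real pages need a tolerant HTML parser.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def parse_for_links(base_url, html):
    """Return absolute http(s) links found in html."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        if urlsplit(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(seeds, download_page, max_pages=100):
    """Breadth-first crawl; download_page(url) -> html is supplied by the caller."""
    urls_done = set()
    url_frontier = deque(seeds)
    queued = set(seeds)  # everything ever placed on the frontier
    while url_frontier and len(urls_done) < max_pages:
        url = url_frontier.popleft()
        html = download_page(url)  # assumed to resolve DNS and fetch the page
        urls_done.add(url)
        for new_url in parse_for_links(url, html):
            if new_url not in queued:
                queued.add(new_url)
                url_frontier.append(new_url)
    return urls_done
```

Passing the downloader in as a function keeps the loop testable without network access; a real crawler would add politeness, error handling, and persistence around it.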
  49. architecture overview (diagram): CRAWL PLANNER, URL QUEUE, FETCHER, INTERNET, and STORAGE, connected by flows labeled "URLs" and "Web Data"
  50. CHALLENGES
  51. challenges: depends on your ambitions
  52. challenges: Google’s Index Size:
      • 1998 - 26 million
      • 2005 - 8 billion
      • 2008 - 1 trillion
      http://www.nytimes.com/2005/08/15/technology/15search.html
      http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
  54. challenges: small crawls (< 10MM) are easy
  55. challenges: large crawls are interesting
  62. challenges: DNS Lookup, URLs Crawled, Politeness, URL Frontier, Queueing URLs, Extracting URLs
  63. challenges: DNS LOOKUP
  65. challenges: DNS LOOKUP can easily be a bottleneck
  66. challenges: DNS LOOKUP
      • consider running your own DNS servers
        • djbdns
        • PowerDNS
        • etc.
  67. challenges: DNS LOOKUP
      • be aware of software limitations
        • gethostbyname is synchronized
        • same with many “default” DNS clients
  68. challenges: DNS LOOKUP: you’ll know when you need it
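One common way to soften the DNS bottleneck is to cache results and run lookups on a thread pool so that one slow resolution does not stall the others. The sketch below is an illustration under assumptions, not something from the talk: `CachingResolver` and `system_lookup` are hypothetical names, the cache ignores DNS TTLs, and two threads may occasionally resolve the same host at once.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def system_lookup(hostname):
    """Resolve through the OS resolver; returns the first address as a string."""
    return socket.getaddrinfo(hostname, None)[0][4][0]


class CachingResolver:
    """Caches hostname -> IP and resolves batches in parallel (no TTL handling)."""

    def __init__(self, lookup=system_lookup, workers=32):
        self._lookup = lookup
        self._cache = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def resolve(self, hostname):
        with self._lock:
            if hostname in self._cache:
                return self._cache[hostname]
        # Look up outside the lock so different hosts resolve concurrently.
        ip = self._lookup(hostname)
        with self._lock:
            self._cache[hostname] = ip
        return ip

    def resolve_many(self, hostnames):
        """Resolve a batch, de-duplicating hostnames first."""
        unique = list(dict.fromkeys(hostnames))
        return dict(zip(unique, self._pool.map(self.resolve, unique)))
```

The lookup function is injectable so the class can be pointed at a dedicated resolver or a test double.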
  69. challenges: URLs CRAWLED
  76. challenges: URLs CRAWLED: 1 machine, store in memory
      NAPKIN CALCULATION
      ~50 bytes per URL, e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
      + 8 bytes for time-last-crawled as a long, e.g. System.currentTimeMillis() -> 1314392455712
      x 100 million URLs
      =~ 5.8 billion bytes, i.e. about 5.4 gigabytes (GiB)
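The napkin calculation can be reproduced in a few lines. The constants come from the slide; the variable names are only for illustration, and the figure covers raw payload only (a real in-memory map adds per-entry overhead on top).

```python
URL_BYTES = 50            # rough average URL length from the slide
TIMESTAMP_BYTES = 8       # time-last-crawled stored as a 64-bit long
NUM_URLS = 100_000_000    # 100 million URLs

total_bytes = (URL_BYTES + TIMESTAMP_BYTES) * NUM_URLS
total_gib = total_bytes / 2**30

print(f"{total_bytes / 1e9:.1f} GB = {total_gib:.1f} GiB")  # 5.8 GB = 5.4 GiB
```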
  77. can we do better?
  78. BLOOM FILTERS
  84. BLOOM FILTERS answer the question: is this item in the set?
      e.g. have we crawled http://www.xcombinator.com?
      A Bloom filter answers either:
      • yes, probably
      • definitely not
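A Bloom filter is small enough to sketch directly. The version below is a minimal illustration, not a production implementation: the class name and parameters are hypothetical, and it derives its hash positions from a single SHA-1 digest using double hashing, which is one common construction among several.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = h1 + i * h2 (mod num_bits).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

With this sketch, `url in crawled` returning `False` means the URL was definitely never added, while `True` means it probably was.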
  87. challenges: URLs CRAWLED: 1 machine, bloom filter
      NAPKIN CALCULATION
      100 million URLs
      1 in 100 million chance of false positive
      =~ 457 megabytes
      see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8
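The 457 megabyte figure follows from the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A short check, with `bloom_size` as a hypothetical helper name:

```python
import math


def bloom_size(n_items, fp_rate):
    """Return (bits, hash_count) for a Bloom filter sized for n_items at fp_rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = round(bits / n_items * math.log(2))
    return bits, hashes


bits, hashes = bloom_size(100_000_000, 1e-8)
print(f"{bits / 8 / 2**20:.0f} MiB, {hashes} hash functions")
```

For 100 million URLs at a 1 in 100 million false-positive rate this comes to about 457 MiB with 27 hash functions, roughly a twelfth of the raw in-memory estimate.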
  88. BLOOM FILTER
  97. BLOOM FILTER drawbacks and solutions
      • probabilistic - occasional errors: acceptable
      • estimate # of items ahead of time: not hard, see Dynamic BFs
      • can’t delete: pick a time granularity (days) and cascade the filters
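One reading of "pick granularity, cascade them" is to keep one filter per day and drop the oldest, which gives coarse deletion without deleting individual items. The sketch below is an assumption about that idea rather than the speaker's design: `RotatingSeenSet` is a hypothetical name, days are assumed to arrive in non-decreasing order, and a plain Python `set` stands in for the per-day filter (any object with `add` and `in` support, such as a Bloom filter, can be supplied through `make_filter`).

```python
from collections import OrderedDict


class RotatingSeenSet:
    """Keeps one membership filter per day and forgets days beyond max_buckets."""

    def __init__(self, make_filter=set, max_buckets=7):
        self.make_filter = make_filter
        self.max_buckets = max_buckets
        self.buckets = OrderedDict()  # day number -> filter, oldest first

    def add(self, url, day):
        if day not in self.buckets:
            self.buckets[day] = self.make_filter()
            while len(self.buckets) > self.max_buckets:
                self.buckets.popitem(last=False)  # expire the oldest day
        self.buckets[day].add(url)

    def __contains__(self, url):
        # "Cascade": a URL counts as seen if any live bucket reports it.
        return any(url in bucket for bucket in self.buckets.values())
```

Note that with Bloom filters as buckets, checking several filters raises the combined false-positive rate, so each bucket needs a correspondingly lower rate.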
  98. BLOOM FILTERS references:
      http://en.wikipedia.org/wiki/Bloom_filter
      http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
      http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
  99. challenges: POLITENESS
  100. obey robots.txt
  101. rule of thumb: wait 2 seconds between requests (w.r.t. IP)
  104. centralized politeness: single point of failure (SPOF), contention
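The two politeness rules, obeying robots.txt and spacing requests per IP, can be sketched for a single process as follows. This is a minimal sketch under assumptions: `PolitenessGate` and `allowed_by_robots` are hypothetical names, the robots.txt body is assumed to have been fetched already, and the 2-second default mirrors the rule of thumb on the slide. `RobotFileParser` is the Python standard library parser.

```python
import time
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Tracks, per IP, the earliest time the next request is allowed."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.next_allowed = {}  # ip -> earliest allowed request time

    def seconds_until_allowed(self, ip):
        return max(0.0, self.next_allowed.get(ip, 0.0) - self.clock())

    def record_fetch(self, ip):
        self.next_allowed[ip] = self.clock() + self.min_delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

This state lives in one process; sharing it across machines is exactly where the central-database, distributed-lock, or controlled-distribution options come in.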
  111. challenges: POLITENESS
       • Options:
         • central database
         • distributed locks (paxos/sigma/zookeeper)
         • controlled URL distribution
       http://en.wikipedia.org/wiki/Paxos_(computer_science)
       http://zookeeper.apache.org/
  112. challenges: URL FRONTIER
  114. idea: consistently distribute URLs based on IP
  115. modulo
       IP                SHA-1         bucket (mod 5)
       174.132.225.106   4dd14b0b...   2
       74.125.224.115    cf4b7594...   1
       157.166.255.19    0ac4d141...   4
       69.22.138.129     6c1584fa...   4
       98.139.50.166     327252c5...   3
  116. benefits: same IP always goes to same machine; simple
  117. drawbacks: susceptible to skew; can’t add / remove nodes without pain
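The modulo scheme and its main drawback can be sketched as below. The function names are hypothetical, and hashing the dotted-quad string with SHA-1 is an assumption, so the buckets it produces may not match the table on the slide. The sample addresses are synthetic.

```python
import hashlib


def bucket_for_ip(ip, num_machines):
    """Map an IP string to a machine index via SHA-1 modulo the machine count."""
    digest = hashlib.sha1(ip.encode("ascii")).hexdigest()
    return int(digest, 16) % num_machines


def fraction_moved(ips, old_count, new_count):
    """Share of IPs whose bucket changes when the machine count changes."""
    moved = sum(
        bucket_for_ip(ip, old_count) != bucket_for_ip(ip, new_count) for ip in ips
    )
    return moved / len(ips)


# Synthetic addresses from the 10.0.0.0/8 private range, for illustration only.
sample_ips = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]
print(fraction_moved(sample_ips, 5, 6))
```

Going from 5 to 6 machines reassigns most IPs (about five in six on average), which is the "pain" of adding or removing nodes under plain modulo.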
  118. consistent hashing
  119. (diagrams) source: http://michaelnielsen.org/blog/consistent-hashing/
  123. benefits: ~ 1/(n+1) of URLs move on add/remove; virtual nodes help skew; robust (no SPOF)
  124. drawbacks: naive solution won’t work for large sites
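A consistent hash ring with virtual nodes fits in a few lines. This is a minimal sketch, assuming string node names and SHA-1 as the ring hash; `HashRing` and its methods are hypothetical names, and replication and failure handling are left out.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each node owns many points ("virtual nodes") on a ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys that move are the ones the new node takes over. Keying on IP alone still sends every page of a very large site to one machine, which is the drawback noted above.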
  125. further reading:
       • Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al.
       • Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007
       • Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.
  126. challenges: QUEUEING URLS
  131. situation: a URL that is
       • not recently crawled
       • allowed by robots.txt
       • polite
  132. how do you order them? (within a single machine)
  133. hash each URL to a lane (1, 2, 3):
       http://yachtmaintenanceco.com/
       http://www.amsterdamports.nl/
       http://www.4s-dawn.com/
       http://www.embassysuiteslittlerock.com/
       http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
       http://mdgroover.iweb.bsu.edu
       http://music.imbc.com/
       http://www.robertjbradshaw.com
       http://www.kerkattenhoven.be
       http://www.escolania.org/
       http://www.musiciansdfw.org/
  155. ERLANG (look up: Erlang B / Erlang C / Engset)
  156. as many threads as possible
  157. don’t sort input URLs
  158. sorted input: consecutive URLs share a host, so the fetcher alternates fetch, wait, fetch, wait, ...
       http://abcnews.go.com/
       http://abcnews.go.com/2020/ABCNEWSSpecial/
       http://abcnews.go.com/2020/story?id=207269&amp;page=1
       http://abcnews.go.com/2020/story?id=207269&amp;page=1
       http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
       http://abcnews.go.com/International/News/story?id=203089&amp;page=1
       http://abcnews.go.com/International/Pope/
       http://abcnews.go.com/International/story?id=81417&amp;page=1
  168. no waiting! (unsorted input: consecutive URLs hit different hosts)
       http://yachtmaintenanceco.com/
       http://www.amsterdamports.nl/
       http://www.4s-dawn.com/
       http://www.embassysuiteslittlerock.com/
       http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
       http://mdgroover.iweb.bsu.edu
       http://music.imbc.com/
       http://www.robertjbradshaw.com
       http://www.kerkattenhoven.be
       http://www.escolania.org/
       http://www.musiciansdfw.org/
       http://www.ariana.org/
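One way to get the "no waiting" ordering from an arbitrary input list is to group URLs per host and take one URL from each host in turn. This is an illustrative sketch, not the queueing scheme from the talk: `interleave_by_host` is a hypothetical name, and it groups by hostname rather than by resolved IP for simplicity.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts so neighbors rarely share a host."""
    queues = OrderedDict()  # host -> queue of that host's URLs, in first-seen order
    for url in urls:
        queues.setdefault(urlsplit(url).hostname, deque()).append(url)

    ordered = []
    while queues:
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

Once only one host has URLs left, its remaining pages still come out back to back, so the politeness delay is still needed; interleaving just keeps the fetcher busy with other hosts in the meantime.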
  169. challenges: EXTRACTING URLS
  170. challenges: EXTRACTING URLS: the internet is full of garbage
  176. challenges: EXTRACTING URLS
       • enormous pages
       • terrible markup
       • ridiculous urls, e.g. ☃.net/ (“unicode snowman dot net”)
  181. challenges: EXTRACTING URLS: be prepared:
       • use a streaming XML parser
       • use a library that handles bad markup
       • be aware that URLs aren’t always ASCII
       • use a URL normalizer
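The "be prepared" advice can be sketched with the Python standard library. This is a hedged illustration, not a complete solution: `LinkExtractor` and `normalize_url` are hypothetical names, `html.parser` is an event-based HTML parser standing in for the streaming XML parser on the slide, and the normalizer covers only a few rules (lowercase scheme and host, IDNA-encode the host, drop default ports and fragments, percent-encode non-ASCII characters).

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


class LinkExtractor(HTMLParser):
    """Collects absolute href targets; tolerates unclosed tags and odd casing."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value.strip()))


def normalize_url(url):
    """Return a canonical ASCII form of url (a small subset of normalizations)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # IDNA turns non-ASCII host labels into their ASCII ("xn--") form.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = quote(parts.path, safe="/%") or "/"
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((scheme, host, path, query, ""))
```

`LinkExtractor.feed` can be called repeatedly with chunks of a page, so an enormous page never has to be held in memory as a parse tree. For heavily broken markup a dedicated tolerant parser may still be the better choice.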
  182. SOFTWARE
  185. software advice:
       • goals determine scale
       • someone else has already done it
  186. 2 second crawler:
       function wgetspider() {
         wget --html-extension --convert-links --mirror \
              --page-requisites --progress=bar --level=5 \
              --no-parent --no-verbose --no-check-certificate "$@";
       }
       $ wgetspider http://www.ischool.berkeley.edu/
  191. java crawlers:
       • Heritrix (Internet Archive): http://crawler.archive.org/
       • Nutch (Lucene): http://nutch.apache.org/
       • Bixo (Hadoop / Cascading): http://bixo.101tec.com/
  196. extraction packages:
       • mechanize: http://wwwsearch.sourceforge.net/mechanize/
       • BeautifulSoup & urllib2: http://www.crummy.com/software/BeautifulSoup/
       • Scrapy: http://scrapy.org/
  202. wrapper induction(ish)
       • Ariel: http://ariel.rubyforge.org/index.html
       • RoadRunner: http://www.dia.uniroma3.it/db/roadRunner/
       • TemplateMaker: http://code.google.com/p/templatemaker/
       • scrubyt: http://scrubyt.rubyforge.org/files/README.html
  203. QUESTIONS?
  204. FEEDBACK: nate@xcombinator.com, www.xcombinator.com, @xcombinator
