Your SlideShare is downloading. ×
Challenges in Large-Scale Web Crawling
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Challenges in Large-Scale Web Crawling

10,277
views

Published on

Published in: Technology, Business

0 Comments
29 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
10,277
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
224
Comments
0
Likes
29
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. introduction to WEB CRAWLING & extraction by Nate MurrayWednesday, September 14, 2011
  • 2. WHO AM I ?Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ?Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web.Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured dataWednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured dataWednesday, September 14, 2011
  • 8. motivationWednesday, September 14, 2011
  • 9. motivation: bookmark buddiesWednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title UsersWednesday, September 14, 2011
  • 11. motivation:Wednesday, September 14, 2011
  • 12. motivation: business hoursWednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00Wednesday, September 14, 2011
  • 14. motivation:Wednesday, September 14, 2011
  • 15. motivation: recommend videosWednesday, September 14, 2011
  • 16. motivation: recommend videos UsersWednesday, September 14, 2011
  • 17. motivation:Wednesday, September 14, 2011
  • 18. motivation: vertical searchWednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price RatingWednesday, September 14, 2011
  • 20. motivation:Wednesday, September 14, 2011
  • 21. DESIRED PROPERTIESWednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEEDWednesday, September 14, 2011
  • 23. CONSTRAINTSWednesday, September 14, 2011
  • 24. CONSTRAINTS • PolitenessWednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • DistributedWednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear ScalabilityWednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioningWednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlapWednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlapWednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly onceWednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  • 35. BASIC ALGORITHMWednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 38. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 39. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 40. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 41. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 42. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 43. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 44. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 45. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 46. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 47. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 48. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web DataWednesday, September 14, 2011
  • 50. CHALLENGESWednesday, September 14, 2011
  • 51. challenges: depends on your ambitionsWednesday, September 14, 2011
  • 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.htmlWednesday, September 14, 2011
  • 53. challenges: small crawls are easyWednesday, September 14, 2011
  • 54. challenges: < 10MM small crawls are easyWednesday, September 14, 2011
  • 55. challenges: large crawls are interestingWednesday, September 14, 2011
  • 56. challenges:Wednesday, September 14, 2011
  • 57. challenges: DNS LookupWednesday, September 14, 2011
  • 58. challenges: DNS Lookup URLs CrawledWednesday, September 14, 2011
  • 59. challenges: DNS Lookup URLs Crawled PolitenessWednesday, September 14, 2011
  • 60. challenges: DNS Lookup URLs Crawled Politeness URL FrontierWednesday, September 14, 2011
  • 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLsWednesday, September 14, 2011
  • 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLsWednesday, September 14, 2011
  • 63. challenges: DNS LOOKUPWednesday, September 14, 2011
  • 64. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 65. challenges: DNS LOOKUP can easily be a bottleneckWednesday, September 14, 2011
  • 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc.Wednesday, September 14, 2011
  • 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clientsWednesday, September 14, 2011
  • 68. challenges: DNS LOOKUP You’ll know when you need itWednesday, September 14, 2011
  • 69. challenges: URLs CRAWLEDWednesday, September 14, 2011
  • 70. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  • 71. challenges: URLs CRAWLED 1 machine, store in memoryWednesday, September 14, 2011
  • 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATIONWednesday, September 14, 2011
  • 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentationsWednesday, September 14, 2011
  • 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712Wednesday, September 14, 2011
  • 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 millionWednesday, September 14, 2011
  • 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytesWednesday, September 14, 2011
  • 77. can we do better?Wednesday, September 14, 2011
  • 78. BLOOM FILTERSWednesday, September 14, 2011
  • 79. BLOOM FILTERS answers the question: is this item in the set?Wednesday, September 14, 2011
  • 80. BLOOM FILTERS answers either:Wednesday, September 14, 2011
  • 81. BLOOM FILTERS answers either: • yes, probablyWednesday, September 14, 2011
  • 82. BLOOM FILTERS answers either: • yes, probably • definitely notWednesday, September 14, 2011
  • 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
  • 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
  • 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  • 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  • 88. BLOOM FILTERWednesday, September 14, 2011
  • 89. BLOOM FILTER drawbacksWednesday, September 14, 2011
  • 90. BLOOM FILTER drawbacks • probabilistic - occasional errorsWednesday, September 14, 2011
  • 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of timeWednesday, September 14, 2011
  • 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  • 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  • 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  • 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t deleteWednesday, September 14, 2011
  • 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days)Wednesday, September 14, 2011
  • 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade themWednesday, September 14, 2011
  • 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/Wednesday, September 14, 2011
  • 99. challenges: POLITENESSWednesday, September 14, 2011
  • 100. obey robots.txtWednesday, September 14, 2011
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip)Wednesday, September 14, 2011
  • 102. centralized politenessWednesday, September 14, 2011
  • 103. centralized politeness SPOFWednesday, September 14, 2011
  • 104. centralized politeness SPOF contentionWednesday, September 14, 2011
  • 105. challenges: POLITENESSWednesday, September 14, 2011
  • 106. challenges: POLITENESS • Options:Wednesday, September 14, 2011
  • 107. challenges: POLITENESS • Options: • central databaseWednesday, September 14, 2011
  • 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper)Wednesday, September 14, 2011
  • 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distributionWednesday, September 14, 2011
  • 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science)Wednesday, September 14, 2011
  • 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/Wednesday, September 14, 2011
  • 112. challenges: URL FRONTIERWednesday, September 14, 2011
  • 113. url frontierWednesday, September 14, 2011
  • 114. idea: consistently distribute URLs based on IPWednesday, September 14, 2011
  • 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3Wednesday, September 14, 2011
  • 116. benefits: same IP always goes to same machine simpleWednesday, September 14, 2011
  • 117. drawbacks: susceptible to skew can’t add / remove nodes without painWednesday, September 14, 2011
  • 118. consistent hashingWednesday, September 14, 2011
  • 119. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  • 120. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  • 121. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  • 122. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  • 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP)Wednesday, September 14, 2011
  • 124. drawbacks: naive solution won’t work for large sitesWednesday, September 14, 2011
  • 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.Wednesday, September 14, 2011
  • 126. challenges: QUEUEING URLSWednesday, September 14, 2011
  • 127. situation:Wednesday, September 14, 2011
  • 128. situation: URLWednesday, September 14, 2011
  • 129. situation: URL not recently crawledWednesday, September 14, 2011
  • 130. situation: URL not recently crawled allowed by robots.txtWednesday, September 14, 2011
  • 131. situation: URL not recently crawled allowed by robots.txt politeWednesday, September 14, 2011
  • 132. how to you order them? (within a single machine)Wednesday, September 14, 2011
  • 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/Wednesday, September 14, 2011
  • 134. 1 2 3Wednesday, September 14, 2011
  • 135. 1 2 3Wednesday, September 14, 2011
  • 136. 1 2 3Wednesday, September 14, 2011
  • 137. 1 2 3Wednesday, September 14, 2011
  • 138. 1 2 3Wednesday, September 14, 2011
  • 139. 1 2 3Wednesday, September 14, 2011
  • 140. 1 2 3Wednesday, September 14, 2011
  • 141. 1 2 3Wednesday, September 14, 2011
  • 142. 1 2 3Wednesday, September 14, 2011
  • 143. 1 2 3Wednesday, September 14, 2011
  • 144. 1 2 3Wednesday, September 14, 2011
  • 145. 1 2 3Wednesday, September 14, 2011
  • 146. 1 2 3Wednesday, September 14, 2011
  • 147. 1 2 3Wednesday, September 14, 2011
  • 148. 1 2 3Wednesday, September 14, 2011
  • 149. 1 2 3Wednesday, September 14, 2011
  • 150. 1 2 3Wednesday, September 14, 2011
  • 151. 1 2 3Wednesday, September 14, 2011
  • 152. 1 2 3Wednesday, September 14, 2011
  • 153. 1 2 3Wednesday, September 14, 2011
  • 154. 1 2 3Wednesday, September 14, 2011
  • 155. ERLANG lookup: erlang B / C / engsetWednesday, September 14, 2011
  • 156. as many threads as possibleWednesday, September 14, 2011
  • 157. don’t sort input URLsWednesday, September 14, 2011
  • 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 waitWednesday, September 14, 2011
  • 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  • 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  • 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  • 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  • 169. challenges: EXTRACTING URLSWednesday, September 14, 2011
  • 170. challenges: EXTRACTING URLS the internet is full of garbageWednesday, September 14, 2011
  • 171. challenges: EXTRACTING URLSWednesday, September 14, 2011
  • 172. challenges: EXTRACTING URLS enormous pagesWednesday, September 14, 2011
  • 173. challenges: EXTRACTING URLS enormous pages terrible markupWednesday, September 14, 2011
  • 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urlsWednesday, September 14, 2011
  • 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/Wednesday, September 14, 2011
  • 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net”Wednesday, September 14, 2011
  • 177. challenges: EXTRACTING URLS be prepared:Wednesday, September 14, 2011
  • 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parserWednesday, September 14, 2011
  • 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markupWednesday, September 14, 2011
  • 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCIIWednesday, September 14, 2011
  • 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizerWednesday, September 14, 2011
  • 182. SOFTWAREWednesday, September 14, 2011
  • 183. software advice:Wednesday, September 14, 2011
  • 184. software advice: • goals determine scaleWednesday, September 14, 2011
  • 185. software advice: • goals determine scale • someone else has already done itWednesday, September 14, 2011
  • 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/Wednesday, September 14, 2011
  • 187. java crawlers:Wednesday, September 14, 2011
  • 188. java crawlers: • Heritrix (Internet Archive)Wednesday, September 14, 2011
  • 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene)Wednesday, September 14, 2011
  • 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading)Wednesday, September 14, 2011
  • 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/Wednesday, September 14, 2011
  • 192. extraction packages:Wednesday, September 14, 2011
  • 193. extraction packages: • mechanizeWednesday, September 14, 2011
  • 194. extraction packages: • mechanize • BeautifulSoup & urllib2Wednesday, September 14, 2011
  • 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • ScrapyWednesday, September 14, 2011
  • 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/Wednesday, September 14, 2011
  • 197. wrapper induction(ish)Wednesday, September 14, 2011
  • 198. wrapper induction(ish) • ArielWednesday, September 14, 2011
  • 199. wrapper induction(ish) • Ariel • RoadRunnerWednesday, September 14, 2011
  • 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMakerWednesday, September 14, 2011
  • 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubytWednesday, September 14, 2011
  • 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.htmlWednesday, September 14, 2011
  • 203. QUESTIONS?Wednesday, September 14, 2011
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinatorWednesday, September 14, 2011

×