• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Challenges in Large-Scale Web Crawling
 

Challenges in Large-Scale Web Crawling

on

  • 9,465 views

 

Statistics

Views

Total Views
9,465
Views on SlideShare
7,513
Embed Views
1,952

Actions

Likes
24
Downloads
188
Comments
0

10 Embeds 1,952

http://eigenjoy.com 1465
http://www.xcombinator.com 307
http://xcombinator.local 144
http://translate.googleusercontent.com 17
http://paper.li 7
http://feeds.feedburner.com 6
http://newsblur.com 3
http://tweetedtimes.com 1
http://twitter.com 1
http://webcache.googleusercontent.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Challenges in Large-Scale Web Crawling Challenges in Large-Scale Web Crawling Presentation Transcript

    • introduction to WEB CRAWLING & extraction by Nate MurrayWednesday, September 14, 2011
    • WHO AM I ?Wednesday, September 14, 2011
    • Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005Wednesday, September 14, 2011
    • what is WEB CRAWLING ?Wednesday, September 14, 2011
    • definition: web crawler a program that browses the web.Wednesday, September 14, 2011
    • definition: web extraction transforming unstructured web data into structured dataWednesday, September 14, 2011
    • definition: web extraction transforming semistructured web data into structured dataWednesday, September 14, 2011
    • motivationWednesday, September 14, 2011
    • motivation: bookmark buddiesWednesday, September 14, 2011
    • motivation: bookmark buddies URL Title UsersWednesday, September 14, 2011
    • motivation:Wednesday, September 14, 2011
    • motivation: business hoursWednesday, September 14, 2011
    • motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00Wednesday, September 14, 2011
    • motivation:Wednesday, September 14, 2011
    • motivation: recommend videosWednesday, September 14, 2011
    • motivation: recommend videos UsersWednesday, September 14, 2011
    • motivation:Wednesday, September 14, 2011
    • motivation: vertical searchWednesday, September 14, 2011
    • motivation: vertical search Image SKU Name Price RatingWednesday, September 14, 2011
    • motivation:Wednesday, September 14, 2011
    • DESIRED PROPERTIESWednesday, September 14, 2011
    • DESIRED PROPERTIES SPEEDWednesday, September 14, 2011
    • CONSTRAINTSWednesday, September 14, 2011
    • CONSTRAINTS • PolitenessWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • DistributedWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear ScalabilityWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioningWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
    • CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlapWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlapWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly onceWednesday, September 14, 2011
    • CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
    • BASIC ALGORITHMWednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web DataWednesday, September 14, 2011
    • CHALLENGESWednesday, September 14, 2011
    • challenges: depends on your ambitionsWednesday, September 14, 2011
    • challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.htmlWednesday, September 14, 2011
    • challenges: small crawls are easyWednesday, September 14, 2011
    • challenges: < 10MM small crawls are easyWednesday, September 14, 2011
    • challenges: large crawls are interestingWednesday, September 14, 2011
    • challenges:Wednesday, September 14, 2011
    • challenges: DNS LookupWednesday, September 14, 2011
    • challenges: DNS Lookup URLs CrawledWednesday, September 14, 2011
    • challenges: DNS Lookup URLs Crawled PolitenessWednesday, September 14, 2011
    • challenges: DNS Lookup URLs Crawled Politeness URL FrontierWednesday, September 14, 2011
    • challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLsWednesday, September 14, 2011
    • challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLsWednesday, September 14, 2011
    • challenges: DNS LOOKUPWednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • challenges: DNS LOOKUP can easily be a bottleneckWednesday, September 14, 2011
    • challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc.Wednesday, September 14, 2011
    • challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clientsWednesday, September 14, 2011
    • challenges: DNS LOOKUP You’ll know when you need itWednesday, September 14, 2011
    • challenges: URLs CRAWLEDWednesday, September 14, 2011
    • Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memoryWednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATIONWednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentationsWednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712Wednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 millionWednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytesWednesday, September 14, 2011
    • can we do better?Wednesday, September 14, 2011
    • BLOOM FILTERSWednesday, September 14, 2011
    • BLOOM FILTERS answers the question: is this item in the set?Wednesday, September 14, 2011
    • BLOOM FILTERS answers either:Wednesday, September 14, 2011
    • BLOOM FILTERS answers either: • yes, probablyWednesday, September 14, 2011
    • BLOOM FILTERS answers either: • yes, probably • definitely notWednesday, September 14, 2011
    • BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
    • BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
    • challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
    • BLOOM FILTERWednesday, September 14, 2011
    • BLOOM FILTER drawbacksWednesday, September 14, 2011
    • BLOOM FILTER drawbacks • probabilistic - occasional errorsWednesday, September 14, 2011
    • BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of timeWednesday, September 14, 2011
    • BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
    • BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
    • BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
    • BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t deleteWednesday, September 14, 2011
    • BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days)Wednesday, September 14, 2011
    • BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade themWednesday, September 14, 2011
    • BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/Wednesday, September 14, 2011
    • challenges: POLITENESSWednesday, September 14, 2011
    • obey robots.txtWednesday, September 14, 2011
    • rule of thumb: wait 2 seconds (w.r.t. ip)Wednesday, September 14, 2011
    • centralized politenessWednesday, September 14, 2011
    • centralized politeness SPOFWednesday, September 14, 2011
    • centralized politeness SPOF contentionWednesday, September 14, 2011
    • challenges: POLITENESSWednesday, September 14, 2011
    • challenges: POLITENESS • Options:Wednesday, September 14, 2011
    • challenges: POLITENESS • Options: • central databaseWednesday, September 14, 2011
    • challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper)Wednesday, September 14, 2011
    • challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distributionWednesday, September 14, 2011
    • challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science)Wednesday, September 14, 2011
    • challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/Wednesday, September 14, 2011
    • challenges: URL FRONTIERWednesday, September 14, 2011
    • url frontierWednesday, September 14, 2011
    • idea: consistently distribute URLs based on IPWednesday, September 14, 2011
    • modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3Wednesday, September 14, 2011
    • benefits: same IP always goes to same machine simpleWednesday, September 14, 2011
    • drawbacks: susceptible to skew can’t add / remove nodes without painWednesday, September 14, 2011
    • consistent hashingWednesday, September 14, 2011
    • source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
    • source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
    • source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
    • source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
    • benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP)Wednesday, September 14, 2011
    • drawbacks: naive solution won’t work for large sitesWednesday, September 14, 2011
    • further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.Wednesday, September 14, 2011
    • challenges: QUEUEING URLSWednesday, September 14, 2011
    • situation:Wednesday, September 14, 2011
    • situation: URLWednesday, September 14, 2011
    • situation: URL not recently crawledWednesday, September 14, 2011
    • situation: URL not recently crawled allowed by robots.txtWednesday, September 14, 2011
    • situation: URL not recently crawled allowed by robots.txt politeWednesday, September 14, 2011
    • how to you order them? (within a single machine)Wednesday, September 14, 2011
    • hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • 1 2 3Wednesday, September 14, 2011
    • ERLANG lookup: erlang B / C / engsetWednesday, September 14, 2011
    • as many threads as possibleWednesday, September 14, 2011
    • don’t sort input URLsWednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 waitWednesday, September 14, 2011
    • http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
    • http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
    • http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
    • no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
    • challenges: EXTRACTING URLSWednesday, September 14, 2011
    • challenges: EXTRACTING URLS the internet is full of garbageWednesday, September 14, 2011
    • challenges: EXTRACTING URLSWednesday, September 14, 2011
    • challenges: EXTRACTING URLS enormous pagesWednesday, September 14, 2011
    • challenges: EXTRACTING URLS enormous pages terrible markupWednesday, September 14, 2011
    • challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urlsWednesday, September 14, 2011
    • challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/Wednesday, September 14, 2011
    • challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net”Wednesday, September 14, 2011
    • challenges: EXTRACTING URLS be prepared:Wednesday, September 14, 2011
    • challenges: EXTRACTING URLS be prepared: use a streaming XML parserWednesday, September 14, 2011
    • challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markupWednesday, September 14, 2011
    • challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCIIWednesday, September 14, 2011
    • challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizerWednesday, September 14, 2011
    • SOFTWAREWednesday, September 14, 2011
    • software advice:Wednesday, September 14, 2011
    • software advice: • goals determine scaleWednesday, September 14, 2011
    • software advice: • goals determine scale • someone else has already done itWednesday, September 14, 2011
    • 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/Wednesday, September 14, 2011
    • java crawlers:Wednesday, September 14, 2011
    • java crawlers: • Heritrix (Internet Archive)Wednesday, September 14, 2011
    • java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene)Wednesday, September 14, 2011
    • java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading)Wednesday, September 14, 2011
    • java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/Wednesday, September 14, 2011
    • extraction packages:Wednesday, September 14, 2011
    • extraction packages: • mechanizeWednesday, September 14, 2011
    • extraction packages: • mechanize • BeautifulSoup & urllib2Wednesday, September 14, 2011
    • extraction packages: • mechanize • BeautifulSoup & urllib2 • ScrapyWednesday, September 14, 2011
    • extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/Wednesday, September 14, 2011
    • wrapper induction(ish)Wednesday, September 14, 2011
    • wrapper induction(ish) • ArielWednesday, September 14, 2011
    • wrapper induction(ish) • Ariel • RoadRunnerWednesday, September 14, 2011
    • wrapper induction(ish) • Ariel • RoadRunner • TemplateMakerWednesday, September 14, 2011
    • wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubytWednesday, September 14, 2011
    • wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.htmlWednesday, September 14, 2011
    • QUESTIONS?Wednesday, September 14, 2011
    • FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinatorWednesday, September 14, 2011