SlideShare a Scribd company logo
1 of 204
Download to read offline
introduction to

                        WEB CRAWLING
                            & extraction                 by Nate Murray




Wednesday, September 14, 2011
WHO AM I ?



Wednesday, September 14, 2011
Nate Murray

                        AT&T Interactive (Yellowpages.com)

                                 TB-scale data since 2009

                                Various crawlers since 2005




Wednesday, September 14, 2011
what is

                        WEB CRAWLING ?



Wednesday, September 14, 2011
definition:
                 web crawler
                            a program that browses the web.




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming unstructured web data into
                                structured data




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming semistructured web data into
                                structured data




Wednesday, September 14, 2011
motivation




Wednesday, September 14, 2011
motivation: bookmark buddies




Wednesday, September 14, 2011
motivation: bookmark buddies

                                URL Title
                                 Users




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               business hours




Wednesday, September 14, 2011
motivation:               business hours


                                Day        Openness
                                Mon         Closed
                                Tue    11:30-14:30 17:30-22:00


                                Wed    11:30-14:30 17:30-22:00


                                Thur   11:30-14:30 17:30-22:00


                                 Fri   11:30-14:30 17:30-22:00


                                 Sat   12:00-14:30 17:00-22:00


                                Sun        -       17:00-21:00




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation: recommend videos




Wednesday, September 14, 2011
motivation: recommend videos




                                Users
Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               vertical search




Wednesday, September 14, 2011
motivation:               vertical search

                  Image
                   SKU
                  Name
                   Price
                  Rating




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
DESIRED PROPERTIES



Wednesday, September 14, 2011
DESIRED PROPERTIES



                                SPEED

Wednesday, September 14, 2011
CONSTRAINTS




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness                it’s easy to burden

         • Distributed
                                         small servers


              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed               (for any significant
                                            crawl)
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability      n machines =
                                     n*m pages-per-second
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning    every machine should
                                      perform equal work

              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap      crawl each page
                                       exactly once




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
BASIC ALGORITHM



Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
architecture overview

         CRAWL
                                             FETCHER                   INTERNET
                                      URLs                  Web Data

        PLANNER
                          URL
                         QUEUE
                                                 Web Data




                                             STORAGE
                                Web Data




Wednesday, September 14, 2011
CHALLENGES



Wednesday, September 14, 2011
challenges:




                                depends on your ambitions




Wednesday, September 14, 2011
challenges:


                                Google’s Index Size:

                                 1998 - 26 million
                                 2005 - 8 billion
                                 2008 - 1 trillion




    http://www.nytimes.com/2005/08/15/technology/15search.html
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html




Wednesday, September 14, 2011
challenges:




                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                     < 10MM


                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                large crawls are interesting




Wednesday, September 14, 2011
challenges:




Wednesday, September 14, 2011
challenges:




                                DNS Lookup




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness
                                URL Frontier




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                URLs Crawled
                                  Politeness
                                 URL Frontier
                                Queueing URLs




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                 URLs Crawled
                                   Politeness
                                 URL Frontier
                                Queueing URLs
                                Extracting URLs




Wednesday, September 14, 2011
challenges:

                                DNS LOOKUP




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                can easily be a bottleneck




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • consider running your own DNS servers
             • djbdns
             • PowerDNS
             • etc.




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • be aware of software limitations
                • gethostbyaddr is synchronized
                • same with many “default” DNS clients




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                You’ll know when you need it




Wednesday, September 14, 2011
challenges:

                                URLs CRAWLED




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million
                  =~ 5.4 gigabytes


Wednesday, September 14, 2011
can we do better?




Wednesday, September 14, 2011
BLOOM FILTERS



Wednesday, September 14, 2011
BLOOM FILTERS



                                     answers the question:

                                is this item in the set?


Wednesday, September 14, 2011
BLOOM FILTERS



                                answers either:




Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably

Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, bloom filter



                100 million URLs

                      1 in 100 million chance
                                                of false positive




                                      see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive




                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive



                  =~ 457 megabytes
                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
BLOOM FILTER




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                            solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)

Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)
                             • cascade them
Wednesday, September 14, 2011
BLOOM FILTERS
                                         references:

       http://en.wikipedia.org/wiki/Bloom_filter
       http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
       http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/




Wednesday, September 14, 2011
challenges:

                                POLITENESS




Wednesday, September 14, 2011
obey robots.txt




Wednesday, September 14, 2011
rule of thumb:

                                wait 2 seconds (w.r.t. ip)




Wednesday, September 14, 2011
centralized politeness




Wednesday, September 14, 2011
centralized politeness




                                SPOF


Wednesday, September 14, 2011
centralized politeness




                                  SPOF
                                contention

Wednesday, September 14, 2011
challenges:
      POLITENESS




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)


Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)
    http://zookeeper.apache.org/

Wednesday, September 14, 2011
challenges:

                                URL FRONTIER




Wednesday, September 14, 2011
url frontier




Wednesday, September 14, 2011
idea:

                          consistently distribute URLs based on IP




Wednesday, September 14, 2011
modulo
                                IP      SHA-1       bucket (mod 5)
                  174.132.225.106     4dd14b0b...         2

                   74.125.224.115     cf4b7594...         1

                   157.166.255.19     0ac4d141...         4

                    69.22.138.129     6c1584fa...         4

                    98.139.50.166     327252c5...         3




Wednesday, September 14, 2011
benefits:



                                same IP always goes to same machine
                                              simple




Wednesday, September 14, 2011
drawbacks:



                                         susceptible to skew
                                can’t add / remove nodes without pain




Wednesday, September 14, 2011
consistent hashing



Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
benefits:



                                ~ 1/(n+1) URLs move on add/remove
                                      virtual nodes help skew
                                          robust (no SOP)




Wednesday, September 14, 2011
drawbacks:



                           naive solution won’t work for large sites




Wednesday, September 14, 2011
further reading:



           Chord: A Scalable Peer-to-Peer Lookup Protocol for
           Internet Applications (2001) Stoica et al.

           Dynamo: Amazon’s Highly Available Key-value Store, SOSP
           2007

           Tapestry: A Resilient Global-Scale Overlay for Service
           Deployment (2004) Zhao et al.




Wednesday, September 14, 2011
challenges:

                                QUEUEING URLS




Wednesday, September 14, 2011
situation:




Wednesday, September 14, 2011
situation:
                                   URL




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt
                                   polite




Wednesday, September 14, 2011
how to you order them?

                                (within a single machine)




Wednesday, September 14, 2011
hash each lane:




                1                         2                               3
                   http://yachtmaintenanceco.com/
                   http://www.amsterdamports.nl/
                   http://www.4s-dawn.com/
                   http://www.embassysuiteslittlerock.com/
                   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
                   http://mdgroover.iweb.bsu.edu
                   http://music.imbc.com/
                   http://www.robertjbradshaw.com
                   http://www.kerkattenhoven.be
                   http://www.escolania.org/
                   http://www.musiciansdfw.org/
Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
ERLANG




                                lookup: erlang B / C / engset

Wednesday, September 14, 2011
as many threads as possible




Wednesday, September 14, 2011
don’t sort input URLs




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1           wait
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/

                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
                                                                   wait
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395

                                                                   fetch
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1
                                                                   wait




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
no waiting!
   http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
challenges:

                                EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS


                                the internet is full of garbage




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls


                                     .net/

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                     enormous pages

                                     terrible markup

                                      ridiculous urls


                                           .net/
                                “unicode snowman dot net”

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                be prepared:




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII

                                  use a URL normalizer


Wednesday, September 14, 2011
SOFTWARE



Wednesday, September 14, 2011
software advice:




Wednesday, September 14, 2011
software advice:


               •       goals determine scale




Wednesday, September 14, 2011
software advice:


               •       goals determine scale

               •       someone else has already done it




Wednesday, September 14, 2011
2 second crawler:

    function wgetspider() {
      wget --html-extension --convert-links --mirror 
        --page-requisites --progress=bar --level=5   
        --no-parent --no-verbose 
        --no-check-certificate "$@";
    }

    $ wgetspider http://www.ischool.berkeley.edu/




Wednesday, September 14, 2011
java crawlers:




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)


    http://crawler.archive.org/
    http://nutch.apache.org/
    http://bixo.101tec.com/

Wednesday, September 14, 2011
extraction packages:




Wednesday, September 14, 2011
extraction packages:

               •       mechanize




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy


    http://wwwsearch.sourceforge.net/mechanize/
    http://www.crummy.com/software/BeautifulSoup/
    http://scrapy.org/

Wednesday, September 14, 2011
wrapper induction(ish)




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt

    http://ariel.rubyforge.org/index.html
    http://www.dia.uniroma3.it/db/roadRunner/
    http://code.google.com/p/templatemaker/
    http://scrubyt.rubyforge.org/files/README.html

Wednesday, September 14, 2011
QUESTIONS?



Wednesday, September 14, 2011
FEEDBACK:


             nate@xcombinator.com
                                www.xcombinator.com

                                   @xcombinator
Wednesday, September 14, 2011

More Related Content

What's hot

Topic 1 Introduction to web analytics
Topic  1   Introduction to web analytics Topic  1   Introduction to web analytics
Topic 1 Introduction to web analytics Jigsaw Academy
 
SEO, SMO ppt by sushen jamwal
SEO, SMO ppt by sushen jamwalSEO, SMO ppt by sushen jamwal
SEO, SMO ppt by sushen jamwalSushen Jamwal
 
DIGITAL MARKETING
DIGITAL MARKETINGDIGITAL MARKETING
DIGITAL MARKETINGDeep Banik
 
An Introduction to Web Analytics
An Introduction to Web AnalyticsAn Introduction to Web Analytics
An Introduction to Web Analyticsiexpertsforum
 
IT8005_EC_Unit_II_Building_ECommerce
IT8005_EC_Unit_II_Building_ECommerceIT8005_EC_Unit_II_Building_ECommerce
IT8005_EC_Unit_II_Building_ECommercePalani Kumar
 
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...Aleyda Solís
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Miningsathish sak
 
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDX
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDXThe most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDX
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDXAleyda Solís
 
Seo digital marketing
Seo digital marketingSeo digital marketing
Seo digital marketingShourya Puri
 
Introduction to Google Search Console
Introduction to Google Search ConsoleIntroduction to Google Search Console
Introduction to Google Search ConsoleRiley Haas
 
Seo Methodology & Process
Seo Methodology & ProcessSeo Methodology & Process
Seo Methodology & ProcessManindra Singh
 
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...Aleyda Solís
 
Google Ads Presentation
Google Ads Presentation Google Ads Presentation
Google Ads Presentation Ruchika
 
Crawling and Indexing
Crawling and IndexingCrawling and Indexing
Crawling and IndexingHimani Tyagi
 
Winning SEO Using Schema Markup and Structured Data
Winning SEO Using Schema Markup and Structured DataWinning SEO Using Schema Markup and Structured Data
Winning SEO Using Schema Markup and Structured DataMarc Trimble
 

What's hot (20)

Topic 1 Introduction to web analytics
Topic  1   Introduction to web analytics Topic  1   Introduction to web analytics
Topic 1 Introduction to web analytics
 
SEO, SMO ppt by sushen jamwal
SEO, SMO ppt by sushen jamwalSEO, SMO ppt by sushen jamwal
SEO, SMO ppt by sushen jamwal
 
Seo
SeoSeo
Seo
 
Web content mining
Web content miningWeb content mining
Web content mining
 
DIGITAL MARKETING
DIGITAL MARKETINGDIGITAL MARKETING
DIGITAL MARKETING
 
An Introduction to Web Analytics
An Introduction to Web AnalyticsAn Introduction to Web Analytics
An Introduction to Web Analytics
 
IT8005_EC_Unit_II_Building_ECommerce
IT8005_EC_Unit_II_Building_ECommerceIT8005_EC_Unit_II_Building_ECommerce
IT8005_EC_Unit_II_Building_ECommerce
 
basic Seo ppt
basic Seo pptbasic Seo ppt
basic Seo ppt
 
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...
E-Commerce SEO Horror Stories : How to tackle the most common issues 
at scal...
 
Web Search and Mining
Web Search and MiningWeb Search and Mining
Web Search and Mining
 
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDX
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDXThe most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDX
The most Damaging SEO Mistakes & Issues in 2021 and How to Avoid Them #EngagePDX
 
Seo digital marketing
Seo digital marketingSeo digital marketing
Seo digital marketing
 
Introduction to Google Search Console
Introduction to Google Search ConsoleIntroduction to Google Search Console
Introduction to Google Search Console
 
Seo Methodology & Process
Seo Methodology & ProcessSeo Methodology & Process
Seo Methodology & Process
 
Foxtail Website Audit
Foxtail Website AuditFoxtail Website Audit
Foxtail Website Audit
 
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...
SEO Reporting to Impress: How to Successfully Report your SEO Efforts & Resul...
 
Google Ads Presentation
Google Ads Presentation Google Ads Presentation
Google Ads Presentation
 
Crawling and Indexing
Crawling and IndexingCrawling and Indexing
Crawling and Indexing
 
Winning SEO Using Schema Markup and Structured Data
Winning SEO Using Schema Markup and Structured DataWinning SEO Using Schema Markup and Structured Data
Winning SEO Using Schema Markup and Structured Data
 
Basic Search Engine Optimization
Basic Search Engine OptimizationBasic Search Engine Optimization
Basic Search Engine Optimization
 

Viewers also liked

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirmeahmetdemirelli
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityRenato Lucindo
 

Viewers also liked (10)

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirme
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 

Recently uploaded

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Recently uploaded (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Challenges in Large-Scale Web Crawling

  • 1. introduction to WEB CRAWLING & extraction by Nate Murray Wednesday, September 14, 2011
  • 2. WHO AM I ? Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005 Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ? Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web. Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured data Wednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured data Wednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title Users Wednesday, September 14, 2011
  • 12. motivation: business hours Wednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00 Wednesday, September 14, 2011
  • 16. motivation: recommend videos Users Wednesday, September 14, 2011
  • 18. motivation: vertical search Wednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price Rating Wednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEED Wednesday, September 14, 2011
  • 24. CONSTRAINTS • Politeness Wednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • Distributed Wednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear Scalability Wednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning Wednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlap Wednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly once Wednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 38. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 39. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 40. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 41. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 42. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 43. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 44. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 45. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 46. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 47. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 48. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web Data Wednesday, September 14, 2011
  • 51. challenges: depends on your ambitions Wednesday, September 14, 2011
  • 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Wednesday, September 14, 2011
  • 53. challenges: small crawls are easy Wednesday, September 14, 2011
  • 54. challenges: < 10MM small crawls are easy Wednesday, September 14, 2011
  • 55. challenges: large crawls are interesting Wednesday, September 14, 2011
  • 57. challenges: DNS Lookup Wednesday, September 14, 2011
  • 58. challenges: DNS Lookup URLs Crawled Wednesday, September 14, 2011
  • 59. challenges: DNS Lookup URLs Crawled Politeness Wednesday, September 14, 2011
  • 60. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Wednesday, September 14, 2011
  • 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Wednesday, September 14, 2011
  • 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLs Wednesday, September 14, 2011
  • 63. challenges: DNS LOOKUP Wednesday, September 14, 2011
  • 64. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 65. challenges: DNS LOOKUP can easily be a bottleneck Wednesday, September 14, 2011
  • 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc. Wednesday, September 14, 2011
  • 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clients Wednesday, September 14, 2011
  • 68. challenges: DNS LOOKUP You’ll know when you need it Wednesday, September 14, 2011
  • 69. challenges: URLs CRAWLED Wednesday, September 14, 2011
  • 70. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 71. challenges: URLs CRAWLED 1 machine, store in memory Wednesday, September 14, 2011
  • 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION Wednesday, September 14, 2011
  • 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations Wednesday, September 14, 2011
  • 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 Wednesday, September 14, 2011
  • 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million Wednesday, September 14, 2011
  • 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytes Wednesday, September 14, 2011
  • 77. can we do better? Wednesday, September 14, 2011
  • 79. BLOOM FILTERS answers the question: is this item in the set? Wednesday, September 14, 2011
  • 80. BLOOM FILTERS answers either: Wednesday, September 14, 2011
  • 81. BLOOM FILTERS answers either: • yes, probably Wednesday, September 14, 2011
  • 82. BLOOM FILTERS answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 89. BLOOM FILTER drawbacks Wednesday, September 14, 2011
  • 90. BLOOM FILTER drawbacks • probabilistic - occasional errors Wednesday, September 14, 2011
  • 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time Wednesday, September 14, 2011
  • 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete Wednesday, September 14, 2011
  • 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) Wednesday, September 14, 2011
  • 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade them Wednesday, September 14, 2011
  • 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ Wednesday, September 14, 2011
  • 99. challenges: POLITENESS Wednesday, September 14, 2011
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip) Wednesday, September 14, 2011
  • 103. centralized politeness SPOF Wednesday, September 14, 2011
  • 104. centralized politeness SPOF contention Wednesday, September 14, 2011
  • 105. challenges: POLITENESS Wednesday, September 14, 2011
  • 106. challenges: POLITENESS • Options: Wednesday, September 14, 2011
  • 107. challenges: POLITENESS • Options: • central database Wednesday, September 14, 2011
  • 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) Wednesday, September 14, 2011
  • 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution Wednesday, September 14, 2011
  • 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) Wednesday, September 14, 2011
  • 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/ Wednesday, September 14, 2011
  • 112. challenges: URL FRONTIER Wednesday, September 14, 2011
  • 114. idea: consistently distribute URLs based on IP Wednesday, September 14, 2011
  • 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3 Wednesday, September 14, 2011
  • 116. benefits: same IP always goes to same machine simple Wednesday, September 14, 2011
  • 117. drawbacks: susceptible to skew can’t add / remove nodes without pain Wednesday, September 14, 2011
  • 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP) Wednesday, September 14, 2011
  • 124. drawbacks: naive solution won’t work for large sites Wednesday, September 14, 2011
  • 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al. Wednesday, September 14, 2011
  • 126. challenges: QUEUEING URLS Wednesday, September 14, 2011
  • 128. situation: URL Wednesday, September 14, 2011
  • 129. situation: URL not recently crawled Wednesday, September 14, 2011
  • 130. situation: URL not recently crawled allowed by robots.txt Wednesday, September 14, 2011
  • 131. situation: URL not recently crawled allowed by robots.txt polite Wednesday, September 14, 2011
  • 132. how to you order them? (within a single machine) Wednesday, September 14, 2011
  • 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ Wednesday, September 14, 2011
  • 134. 1 2 3 Wednesday, September 14, 2011
  • 135. 1 2 3 Wednesday, September 14, 2011
  • 136. 1 2 3 Wednesday, September 14, 2011
  • 137. 1 2 3 Wednesday, September 14, 2011
  • 138. 1 2 3 Wednesday, September 14, 2011
  • 139. 1 2 3 Wednesday, September 14, 2011
  • 140. 1 2 3 Wednesday, September 14, 2011
  • 141. 1 2 3 Wednesday, September 14, 2011
  • 142. 1 2 3 Wednesday, September 14, 2011
  • 143. 1 2 3 Wednesday, September 14, 2011
  • 144. 1 2 3 Wednesday, September 14, 2011
  • 145. 1 2 3 Wednesday, September 14, 2011
  • 146. 1 2 3 Wednesday, September 14, 2011
  • 147. 1 2 3 Wednesday, September 14, 2011
  • 148. 1 2 3 Wednesday, September 14, 2011
  • 149. 1 2 3 Wednesday, September 14, 2011
  • 150. 1 2 3 Wednesday, September 14, 2011
  • 151. 1 2 3 Wednesday, September 14, 2011
  • 152. 1 2 3 Wednesday, September 14, 2011
  • 153. 1 2 3 Wednesday, September 14, 2011
  • 154. 1 2 3 Wednesday, September 14, 2011
  • 155. ERLANG lookup: erlang B / C / engset Wednesday, September 14, 2011
  • 156. as many threads as possible Wednesday, September 14, 2011
  • 157. don’t sort input URLs Wednesday, September 14, 2011
  • 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 wait Wednesday, September 14, 2011
  • 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 169. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 170. challenges: EXTRACTING URLS the internet is full of garbage Wednesday, September 14, 2011
  • 171. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 172. challenges: EXTRACTING URLS enormous pages Wednesday, September 14, 2011
  • 173. challenges: EXTRACTING URLS enormous pages terrible markup Wednesday, September 14, 2011
  • 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls Wednesday, September 14, 2011
  • 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ Wednesday, September 14, 2011
  • 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net” Wednesday, September 14, 2011
  • 177. challenges: EXTRACTING URLS be prepared: Wednesday, September 14, 2011
  • 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parser Wednesday, September 14, 2011
  • 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup Wednesday, September 14, 2011
  • 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII Wednesday, September 14, 2011
  • 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizer Wednesday, September 14, 2011
  • 184. software advice: • goals determine scale Wednesday, September 14, 2011
  • 185. software advice: • goals determine scale • someone else has already done it Wednesday, September 14, 2011
  • 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/ Wednesday, September 14, 2011
  • 188. java crawlers: • Heritrix (Internet Archive) Wednesday, September 14, 2011
  • 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) Wednesday, September 14, 2011
  • 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) Wednesday, September 14, 2011
  • 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/ Wednesday, September 14, 2011
  • 193. extraction packages: • mechanize Wednesday, September 14, 2011
  • 194. extraction packages: • mechanize • BeautifulSoup & urllib2 Wednesday, September 14, 2011
  • 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy Wednesday, September 14, 2011
  • 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/ Wednesday, September 14, 2011
  • 198. wrapper induction(ish) • Ariel Wednesday, September 14, 2011
  • 199. wrapper induction(ish) • Ariel • RoadRunner Wednesday, September 14, 2011
  • 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker Wednesday, September 14, 2011
  • 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt Wednesday, September 14, 2011
  • 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.html Wednesday, September 14, 2011
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinator Wednesday, September 14, 2011