SlideShare a Scribd company logo
1 of 204
Download to read offline
introduction to

                        WEB CRAWLING
                            & extraction                 by Nate Murray




Wednesday, September 14, 2011
WHO AM I ?



Wednesday, September 14, 2011
Nate Murray

                        AT&T Interactive (Yellowpages.com)

                                 TB-scale data since 2009

                                Various crawlers since 2005




Wednesday, September 14, 2011
what is

                        WEB CRAWLING ?



Wednesday, September 14, 2011
definition:
                 web crawler
                            a program that browses the web.




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming unstructured web data into
                                structured data




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming semistructured web data into
                                structured data




Wednesday, September 14, 2011
motivation




Wednesday, September 14, 2011
motivation: bookmark buddies




Wednesday, September 14, 2011
motivation: bookmark buddies

                                URL Title
                                 Users




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               business hours




Wednesday, September 14, 2011
motivation:               business hours


                                Day        Openness
                                Mon         Closed
                                Tue    11:30-14:30 17:30-22:00


                                Wed    11:30-14:30 17:30-22:00


                                Thur   11:30-14:30 17:30-22:00


                                 Fri   11:30-14:30 17:30-22:00


                                 Sat   12:00-14:30 17:00-22:00


                                Sun        -       17:00-21:00




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation: recommend videos




Wednesday, September 14, 2011
motivation: recommend videos




                                Users
Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               vertical search




Wednesday, September 14, 2011
motivation:               vertical search

                  Image
                   SKU
                  Name
                   Price
                  Rating




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
DESIRED PROPERTIES



Wednesday, September 14, 2011
DESIRED PROPERTIES



                                SPEED

Wednesday, September 14, 2011
CONSTRAINTS




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness                it’s easy to burden

         • Distributed
                                         small servers


              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed               (for any significant
                                            crawl)
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability      n machines =
                                     n*m pages-per-second
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning    every machine should
                                      perform equal work

              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap      crawl each page
                                       exactly once




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
BASIC ALGORITHM



Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
architecture overview

         CRAWL
                                             FETCHER                   INTERNET
                                      URLs                  Web Data

        PLANNER
                          URL
                         QUEUE
                                                 Web Data




                                             STORAGE
                                Web Data




Wednesday, September 14, 2011
CHALLENGES



Wednesday, September 14, 2011
challenges:




                                depends on your ambitions




Wednesday, September 14, 2011
challenges:


                                Google’s Index Size:

                                 1998 - 26 million
                                 2005 - 8 billion
                                 2008 - 1 trillion




    http://www.nytimes.com/2005/08/15/technology/15search.html
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html




Wednesday, September 14, 2011
challenges:




                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                     < 10MM


                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                large crawls are interesting




Wednesday, September 14, 2011
challenges:




Wednesday, September 14, 2011
challenges:




                                DNS Lookup




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness
                                URL Frontier




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                URLs Crawled
                                  Politeness
                                 URL Frontier
                                Queueing URLs




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                 URLs Crawled
                                   Politeness
                                 URL Frontier
                                Queueing URLs
                                Extracting URLs




Wednesday, September 14, 2011
challenges:

                                DNS LOOKUP




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                can easily be a bottleneck




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • consider running your own DNS servers
             • djbdns
             • PowerDNS
             • etc.




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • be aware of software limitations
                • gethostbyaddr is synchronized
                • same with many “default” DNS clients




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                You’ll know when you need it




Wednesday, September 14, 2011
challenges:

                                URLs CRAWLED




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million
                  =~ 5.4 gigabytes


Wednesday, September 14, 2011
can we do better?




Wednesday, September 14, 2011
BLOOM FILTERS



Wednesday, September 14, 2011
BLOOM FILTERS



                                     answers the question:

                                is this item in the set?


Wednesday, September 14, 2011
BLOOM FILTERS



                                answers either:




Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably

Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, bloom filter



                100 million URLs

                      1 in 100 million chance
                                                of false positive




                                      see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive




                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive



                  =~ 457 megabytes
                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
BLOOM FILTER




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                            solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)

Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)
                             • cascade them
Wednesday, September 14, 2011
BLOOM FILTERS
                                         references:

       http://en.wikipedia.org/wiki/Bloom_filter
       http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
       http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/




Wednesday, September 14, 2011
challenges:

                                POLITENESS




Wednesday, September 14, 2011
obey robots.txt




Wednesday, September 14, 2011
rule of thumb:

                                wait 2 seconds (w.r.t. ip)




Wednesday, September 14, 2011
centralized politeness




Wednesday, September 14, 2011
centralized politeness




                                SPOF


Wednesday, September 14, 2011
centralized politeness




                                  SPOF
                                contention

Wednesday, September 14, 2011
challenges:
      POLITENESS




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)


Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)
    http://zookeeper.apache.org/

Wednesday, September 14, 2011
challenges:

                                URL FRONTIER




Wednesday, September 14, 2011
url frontier




Wednesday, September 14, 2011
idea:

                          consistently distribute URLs based on IP




Wednesday, September 14, 2011
modulo
                                IP      SHA-1       bucket (mod 5)
                  174.132.225.106     4dd14b0b...         2

                   74.125.224.115     cf4b7594...         1

                   157.166.255.19     0ac4d141...         4

                    69.22.138.129     6c1584fa...         4

                    98.139.50.166     327252c5...         3




Wednesday, September 14, 2011
benefits:



                                same IP always goes to same machine
                                              simple




Wednesday, September 14, 2011
drawbacks:



                                         susceptible to skew
                                can’t add / remove nodes without pain




Wednesday, September 14, 2011
consistent hashing



Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
benefits:



                                ~ 1/(n+1) URLs move on add/remove
                                      virtual nodes help skew
                                          robust (no SOP)




Wednesday, September 14, 2011
drawbacks:



                           naive solution won’t work for large sites




Wednesday, September 14, 2011
further reading:



           Chord: A Scalable Peer-to-Peer Lookup Protocol for
           Internet Applications (2001) Stoica et al.

           Dynamo: Amazon’s Highly Available Key-value Store, SOSP
           2007

           Tapestry: A Resilient Global-Scale Overlay for Service
           Deployment (2004) Zhao et al.




Wednesday, September 14, 2011
challenges:

                                QUEUEING URLS




Wednesday, September 14, 2011
situation:




Wednesday, September 14, 2011
situation:
                                   URL




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt
                                   polite




Wednesday, September 14, 2011
how to you order them?

                                (within a single machine)




Wednesday, September 14, 2011
hash each lane:




                1                         2                               3
                   http://yachtmaintenanceco.com/
                   http://www.amsterdamports.nl/
                   http://www.4s-dawn.com/
                   http://www.embassysuiteslittlerock.com/
                   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
                   http://mdgroover.iweb.bsu.edu
                   http://music.imbc.com/
                   http://www.robertjbradshaw.com
                   http://www.kerkattenhoven.be
                   http://www.escolania.org/
                   http://www.musiciansdfw.org/
Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
ERLANG




                                lookup: erlang B / C / engset

Wednesday, September 14, 2011
as many threads as possible




Wednesday, September 14, 2011
don’t sort input URLs




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1           wait
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/

                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
                                                                   wait
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395

                                                                   fetch
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1
                                                                   wait




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
no waiting!
   http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
challenges:

                                EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS


                                the internet is full of garbage




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls


                                     .net/

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                     enormous pages

                                     terrible markup

                                      ridiculous urls


                                           .net/
                                “unicode snowman dot net”

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                be prepared:




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII

                                  use a URL normalizer


Wednesday, September 14, 2011
SOFTWARE



Wednesday, September 14, 2011
software advice:




Wednesday, September 14, 2011
software advice:


               •       goals determine scale




Wednesday, September 14, 2011
software advice:


               •       goals determine scale

               •       someone else has already done it




Wednesday, September 14, 2011
2 second crawler:

    function wgetspider() {
      wget --html-extension --convert-links --mirror 
        --page-requisites --progress=bar --level=5   
        --no-parent --no-verbose 
        --no-check-certificate "$@";
    }

    $ wgetspider http://www.ischool.berkeley.edu/




Wednesday, September 14, 2011
java crawlers:




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)


    http://crawler.archive.org/
    http://nutch.apache.org/
    http://bixo.101tec.com/

Wednesday, September 14, 2011
extraction packages:




Wednesday, September 14, 2011
extraction packages:

               •       mechanize




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy


    http://wwwsearch.sourceforge.net/mechanize/
    http://www.crummy.com/software/BeautifulSoup/
    http://scrapy.org/

Wednesday, September 14, 2011
wrapper induction(ish)




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt

    http://ariel.rubyforge.org/index.html
    http://www.dia.uniroma3.it/db/roadRunner/
    http://code.google.com/p/templatemaker/
    http://scrubyt.rubyforge.org/files/README.html

Wednesday, September 14, 2011
QUESTIONS?



Wednesday, September 14, 2011
FEEDBACK:


             nate@xcombinator.com
                                www.xcombinator.com

                                   @xcombinator
Wednesday, September 14, 2011

More Related Content

What's hot

SEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksSEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksChakra Systems
 
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...Aleyda Solís
 
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...Himani Kankaria
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction ServicePromptCloud
 
Semantic seo and the evolution of queries
Semantic seo and the evolution of queriesSemantic seo and the evolution of queries
Semantic seo and the evolution of queriesBill Slawski
 
Discovering knowledge using web structure mining
Discovering knowledge using web structure miningDiscovering knowledge using web structure mining
Discovering knowledge using web structure miningAtul Khanna
 
Seo goals & objectives 2 quarter 2012
Seo goals & objectives 2 quarter 2012Seo goals & objectives 2 quarter 2012
Seo goals & objectives 2 quarter 2012EBY3081
 
How To EAT Links.pptx
How To EAT Links.pptxHow To EAT Links.pptx
How To EAT Links.pptxDixon Jones
 
Data Visualization: A Marketing Superpower - Clark Boyd
Data Visualization: A Marketing Superpower - Clark BoydData Visualization: A Marketing Superpower - Clark Boyd
Data Visualization: A Marketing Superpower - Clark BoydClark Boyd
 
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022Lucia Lecesne
 
A crash course into SEO and what moves the needle with scalable processes
A crash course into SEO and what moves the needle with scalable processesA crash course into SEO and what moves the needle with scalable processes
A crash course into SEO and what moves the needle with scalable processespatrickstox
 
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...James Brockbank
 
Why Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody HardWhy Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody HardAhrefs
 
Efficient focused web crawling approach
Efficient focused web crawling approachEfficient focused web crawling approach
Efficient focused web crawling approachSyed Islam
 
SEO Reporting for Success at #FOS22
SEO Reporting for Success at #FOS22SEO Reporting for Success at #FOS22
SEO Reporting for Success at #FOS22Aleyda Solís
 
How to E-A-T SEO
How to E-A-T SEOHow to E-A-T SEO
How to E-A-T SEOSean Si
 
How to leverage indexation tracking to monitor issues and improve performance
How to leverage indexation tracking to monitor issues and improve performanceHow to leverage indexation tracking to monitor issues and improve performance
How to leverage indexation tracking to monitor issues and improve performanceSimon Lesser
 

What's hot (20)

SEO proposal
SEO proposalSEO proposal
SEO proposal
 
Seo plan
Seo planSeo plan
Seo plan
 
SEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksSEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO Ranks
 
SEO Proposal eMarket Agency
SEO Proposal eMarket AgencySEO Proposal eMarket Agency
SEO Proposal eMarket Agency
 
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...
SEO Low hanging Fruit: Identifying SEO Opportunities to Achieve Results Fast ...
 
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...
How to Create an Image ALT Optimization Strategy - Himani Kankaria at SEO Mas...
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
 
Semantic seo and the evolution of queries
Semantic seo and the evolution of queriesSemantic seo and the evolution of queries
Semantic seo and the evolution of queries
 
Discovering knowledge using web structure mining
Discovering knowledge using web structure miningDiscovering knowledge using web structure mining
Discovering knowledge using web structure mining
 
Seo goals & objectives 2 quarter 2012
Seo goals & objectives 2 quarter 2012Seo goals & objectives 2 quarter 2012
Seo goals & objectives 2 quarter 2012
 
How To EAT Links.pptx
How To EAT Links.pptxHow To EAT Links.pptx
How To EAT Links.pptx
 
Data Visualization: A Marketing Superpower - Clark Boyd
Data Visualization: A Marketing Superpower - Clark BoydData Visualization: A Marketing Superpower - Clark Boyd
Data Visualization: A Marketing Superpower - Clark Boyd
 
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022
Switching domain 3 months before an IPO - Lucia Lecesne - BrightonSEO April 2022
 
A crash course into SEO and what moves the needle with scalable processes
A crash course into SEO and what moves the needle with scalable processesA crash course into SEO and what moves the needle with scalable processes
A crash course into SEO and what moves the needle with scalable processes
 
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...
The ‘traditional approach’ to SEO is broken - how to prioritise your efforts ...
 
Why Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody HardWhy Scaling (Great) Content Is So Bloody Hard
Why Scaling (Great) Content Is So Bloody Hard
 
Efficient focused web crawling approach
Efficient focused web crawling approachEfficient focused web crawling approach
Efficient focused web crawling approach
 
SEO Reporting for Success at #FOS22
SEO Reporting for Success at #FOS22SEO Reporting for Success at #FOS22
SEO Reporting for Success at #FOS22
 
How to E-A-T SEO
How to E-A-T SEOHow to E-A-T SEO
How to E-A-T SEO
 
How to leverage indexation tracking to monitor issues and improve performance
How to leverage indexation tracking to monitor issues and improve performanceHow to leverage indexation tracking to monitor issues and improve performance
How to leverage indexation tracking to monitor issues and improve performance
 

Viewers also liked

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirmeahmetdemirelli
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityRenato Lucindo
 

Viewers also liked (10)

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirme
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 

Recently uploaded

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Challenges in Large-Scale Web Crawling

  • 1. introduction to WEB CRAWLING & extraction by Nate Murray Wednesday, September 14, 2011
  • 2. WHO AM I ? Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005 Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ? Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web. Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured data Wednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured data Wednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title Users Wednesday, September 14, 2011
  • 12. motivation: business hours Wednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00 Wednesday, September 14, 2011
  • 16. motivation: recommend videos Users Wednesday, September 14, 2011
  • 18. motivation: vertical search Wednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price Rating Wednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEED Wednesday, September 14, 2011
  • 24. CONSTRAINTS • Politeness Wednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • Distributed Wednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear Scalability Wednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning Wednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlap Wednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly once Wednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 38. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 39. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 40. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 41. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 42. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 43. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 44. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 45. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 46. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 47. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 48. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web Data Wednesday, September 14, 2011
  • 51. challenges: depends on your ambitions Wednesday, September 14, 2011
  • 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Wednesday, September 14, 2011
  • 53. challenges: small crawls are easy Wednesday, September 14, 2011
  • 54. challenges: < 10MM small crawls are easy Wednesday, September 14, 2011
  • 55. challenges: large crawls are interesting Wednesday, September 14, 2011
  • 57. challenges: DNS Lookup Wednesday, September 14, 2011
  • 58. challenges: DNS Lookup URLs Crawled Wednesday, September 14, 2011
  • 59. challenges: DNS Lookup URLs Crawled Politeness Wednesday, September 14, 2011
  • 60. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Wednesday, September 14, 2011
  • 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Wednesday, September 14, 2011
  • 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLs Wednesday, September 14, 2011
  • 63. challenges: DNS LOOKUP Wednesday, September 14, 2011
  • 64. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 65. challenges: DNS LOOKUP can easily be a bottleneck Wednesday, September 14, 2011
  • 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc. Wednesday, September 14, 2011
  • 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clients Wednesday, September 14, 2011
  • 68. challenges: DNS LOOKUP You’ll know when you need it Wednesday, September 14, 2011
  • 69. challenges: URLs CRAWLED Wednesday, September 14, 2011
  • 70. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 71. challenges: URLs CRAWLED 1 machine, store in memory Wednesday, September 14, 2011
  • 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION Wednesday, September 14, 2011
  • 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations Wednesday, September 14, 2011
  • 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 Wednesday, September 14, 2011
  • 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million Wednesday, September 14, 2011
  • 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytes Wednesday, September 14, 2011
  • 77. can we do better? Wednesday, September 14, 2011
  • 79. BLOOM FILTERS answers the question: is this item in the set? Wednesday, September 14, 2011
  • 80. BLOOM FILTERS answers either: Wednesday, September 14, 2011
  • 81. BLOOM FILTERS answers either: • yes, probably Wednesday, September 14, 2011
  • 82. BLOOM FILTERS answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 89. BLOOM FILTER drawbacks Wednesday, September 14, 2011
  • 90. BLOOM FILTER drawbacks • probabilistic - occasional errors Wednesday, September 14, 2011
  • 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time Wednesday, September 14, 2011
  • 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete Wednesday, September 14, 2011
  • 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) Wednesday, September 14, 2011
  • 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade them Wednesday, September 14, 2011
  • 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ Wednesday, September 14, 2011
  • 99. challenges: POLITENESS Wednesday, September 14, 2011
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip) Wednesday, September 14, 2011
  • 103. centralized politeness SPOF Wednesday, September 14, 2011
  • 104. centralized politeness SPOF contention Wednesday, September 14, 2011
  • 105. challenges: POLITENESS Wednesday, September 14, 2011
  • 106. challenges: POLITENESS • Options: Wednesday, September 14, 2011
  • 107. challenges: POLITENESS • Options: • central database Wednesday, September 14, 2011
  • 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) Wednesday, September 14, 2011
  • 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution Wednesday, September 14, 2011
  • 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) Wednesday, September 14, 2011
  • 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/ Wednesday, September 14, 2011
  • 112. challenges: URL FRONTIER Wednesday, September 14, 2011
  • 114. idea: consistently distribute URLs based on IP Wednesday, September 14, 2011
  • 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3 Wednesday, September 14, 2011
  • 116. benefits: same IP always goes to same machine simple Wednesday, September 14, 2011
  • 117. drawbacks: susceptible to skew can’t add / remove nodes without pain Wednesday, September 14, 2011
  • 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP) Wednesday, September 14, 2011
  • 124. drawbacks: naive solution won’t work for large sites Wednesday, September 14, 2011
  • 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al. Wednesday, September 14, 2011
  • 126. challenges: QUEUEING URLS Wednesday, September 14, 2011
  • 128. situation: URL Wednesday, September 14, 2011
  • 129. situation: URL not recently crawled Wednesday, September 14, 2011
  • 130. situation: URL not recently crawled allowed by robots.txt Wednesday, September 14, 2011
  • 131. situation: URL not recently crawled allowed by robots.txt polite Wednesday, September 14, 2011
  • 132. how to you order them? (within a single machine) Wednesday, September 14, 2011
  • 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ Wednesday, September 14, 2011
  • 134. 1 2 3 Wednesday, September 14, 2011
  • 135. 1 2 3 Wednesday, September 14, 2011
  • 136. 1 2 3 Wednesday, September 14, 2011
  • 137. 1 2 3 Wednesday, September 14, 2011
  • 138. 1 2 3 Wednesday, September 14, 2011
  • 139. 1 2 3 Wednesday, September 14, 2011
  • 140. 1 2 3 Wednesday, September 14, 2011
  • 141. 1 2 3 Wednesday, September 14, 2011
  • 142. 1 2 3 Wednesday, September 14, 2011
  • 143. 1 2 3 Wednesday, September 14, 2011
  • 144. 1 2 3 Wednesday, September 14, 2011
  • 145. 1 2 3 Wednesday, September 14, 2011
  • 146. 1 2 3 Wednesday, September 14, 2011
  • 147. 1 2 3 Wednesday, September 14, 2011
  • 148. 1 2 3 Wednesday, September 14, 2011
  • 149. 1 2 3 Wednesday, September 14, 2011
  • 150. 1 2 3 Wednesday, September 14, 2011
  • 151. 1 2 3 Wednesday, September 14, 2011
  • 152. 1 2 3 Wednesday, September 14, 2011
  • 153. 1 2 3 Wednesday, September 14, 2011
  • 154. 1 2 3 Wednesday, September 14, 2011
  • 155. ERLANG lookup: erlang B / C / engset Wednesday, September 14, 2011
  • 156. as many threads as possible Wednesday, September 14, 2011
  • 157. don’t sort input URLs Wednesday, September 14, 2011
  • 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 wait Wednesday, September 14, 2011
  • 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 169. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 170. challenges: EXTRACTING URLS the internet is full of garbage Wednesday, September 14, 2011
  • 171. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 172. challenges: EXTRACTING URLS enormous pages Wednesday, September 14, 2011
  • 173. challenges: EXTRACTING URLS enormous pages terrible markup Wednesday, September 14, 2011
  • 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls Wednesday, September 14, 2011
  • 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ Wednesday, September 14, 2011
  • 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net” Wednesday, September 14, 2011
  • 177. challenges: EXTRACTING URLS be prepared: Wednesday, September 14, 2011
  • 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parser Wednesday, September 14, 2011
  • 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup Wednesday, September 14, 2011
  • 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII Wednesday, September 14, 2011
  • 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizer Wednesday, September 14, 2011
  • 184. software advice: • goals determine scale Wednesday, September 14, 2011
  • 185. software advice: • goals determine scale • someone else has already done it Wednesday, September 14, 2011
  • 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/ Wednesday, September 14, 2011
  • 188. java crawlers: • Heritrix (Internet Archive) Wednesday, September 14, 2011
  • 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) Wednesday, September 14, 2011
  • 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) Wednesday, September 14, 2011
  • 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/ Wednesday, September 14, 2011
  • 193. extraction packages: • mechanize Wednesday, September 14, 2011
  • 194. extraction packages: • mechanize • BeautifulSoup & urllib2 Wednesday, September 14, 2011
  • 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy Wednesday, September 14, 2011
  • 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/ Wednesday, September 14, 2011
  • 198. wrapper induction(ish) • Ariel Wednesday, September 14, 2011
  • 199. wrapper induction(ish) • Ariel • RoadRunner Wednesday, September 14, 2011
  • 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker Wednesday, September 14, 2011
  • 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt Wednesday, September 14, 2011
  • 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.html Wednesday, September 14, 2011
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinator Wednesday, September 14, 2011