SlideShare a Scribd company logo
introduction to

                        WEB CRAWLING
                            & extraction                 by Nate Murray




Wednesday, September 14, 2011
WHO AM I ?



Wednesday, September 14, 2011
Nate Murray

                        AT&T Interactive (Yellowpages.com)

                                 TB-scale data since 2009

                                Various crawlers since 2005




Wednesday, September 14, 2011
what is

                        WEB CRAWLING ?



Wednesday, September 14, 2011
definition:
                 web crawler
                            a program that browses the web.




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming unstructured web data into
                                structured data




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming semistructured web data into
                                structured data




Wednesday, September 14, 2011
motivation




Wednesday, September 14, 2011
motivation: bookmark buddies




Wednesday, September 14, 2011
motivation: bookmark buddies

                                URL Title
                                 Users




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               business hours




Wednesday, September 14, 2011
motivation:               business hours


                                Day        Openness
                                Mon         Closed
                                Tue    11:30-14:30 17:30-22:00


                                Wed    11:30-14:30 17:30-22:00


                                Thur   11:30-14:30 17:30-22:00


                                 Fri   11:30-14:30 17:30-22:00


                                 Sat   12:00-14:30 17:00-22:00


                                Sun        -       17:00-21:00




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation: recommend videos




Wednesday, September 14, 2011
motivation: recommend videos




                                Users
Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               vertical search




Wednesday, September 14, 2011
motivation:               vertical search

                  Image
                   SKU
                  Name
                   Price
                  Rating




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
DESIRED PROPERTIES



Wednesday, September 14, 2011
DESIRED PROPERTIES



                                SPEED

Wednesday, September 14, 2011
CONSTRAINTS




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness                it’s easy to burden

         • Distributed
                                         small servers


              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed               (for any significant
                                            crawl)
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability      n machines =
                                     n*m pages-per-second
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning    every machine should
                                      perform equal work

              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap      crawl each page
                                       exactly once




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
BASIC ALGORITHM



Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
architecture overview

         CRAWL
                                             FETCHER                   INTERNET
                                      URLs                  Web Data

        PLANNER
                          URL
                         QUEUE
                                                 Web Data




                                             STORAGE
                                Web Data




Wednesday, September 14, 2011
CHALLENGES



Wednesday, September 14, 2011
challenges:




                                depends on your ambitions




Wednesday, September 14, 2011
challenges:


                                Google’s Index Size:

                                 1998 - 26 million
                                 2005 - 8 billion
                                 2008 - 1 trillion




    http://www.nytimes.com/2005/08/15/technology/15search.html
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html




Wednesday, September 14, 2011
challenges:




                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                     < 10MM


                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                large crawls are interesting




Wednesday, September 14, 2011
challenges:




Wednesday, September 14, 2011
challenges:




                                DNS Lookup




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness
                                URL Frontier




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                URLs Crawled
                                  Politeness
                                 URL Frontier
                                Queueing URLs




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                 URLs Crawled
                                   Politeness
                                 URL Frontier
                                Queueing URLs
                                Extracting URLs




Wednesday, September 14, 2011
challenges:

                                DNS LOOKUP




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                can easily be a bottleneck




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • consider running your own DNS servers
             • djbdns
             • PowerDNS
             • etc.




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • be aware of software limitations
                • gethostbyaddr is synchronized
                • same with many “default” DNS clients




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                You’ll know when you need it




Wednesday, September 14, 2011
challenges:

                                URLs CRAWLED




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million
                  =~ 5.4 gigabytes


Wednesday, September 14, 2011
can we do better?




Wednesday, September 14, 2011
BLOOM FILTERS



Wednesday, September 14, 2011
BLOOM FILTERS



                                     answers the question:

                                is this item in the set?


Wednesday, September 14, 2011
BLOOM FILTERS



                                answers either:




Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably

Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, bloom filter



                100 million URLs

                      1 in 100 million chance
                                                of false positive




                                      see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive




                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive



                  =~ 457 megabytes
                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
BLOOM FILTER




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                            solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)

Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)
                             • cascade them
Wednesday, September 14, 2011
BLOOM FILTERS
                                         references:

       http://en.wikipedia.org/wiki/Bloom_filter
       http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
       http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/




Wednesday, September 14, 2011
challenges:

                                POLITENESS




Wednesday, September 14, 2011
obey robots.txt




Wednesday, September 14, 2011
rule of thumb:

                                wait 2 seconds (w.r.t. ip)




Wednesday, September 14, 2011
centralized politeness




Wednesday, September 14, 2011
centralized politeness




                                SPOF


Wednesday, September 14, 2011
centralized politeness




                                  SPOF
                                contention

Wednesday, September 14, 2011
challenges:
      POLITENESS




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)


Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)
    http://zookeeper.apache.org/

Wednesday, September 14, 2011
challenges:

                                URL FRONTIER




Wednesday, September 14, 2011
url frontier




Wednesday, September 14, 2011
idea:

                          consistently distribute URLs based on IP




Wednesday, September 14, 2011
modulo
                                IP      SHA-1       bucket (mod 5)
                  174.132.225.106     4dd14b0b...         2

                   74.125.224.115     cf4b7594...         1

                   157.166.255.19     0ac4d141...         4

                    69.22.138.129     6c1584fa...         4

                    98.139.50.166     327252c5...         3




Wednesday, September 14, 2011
benefits:



                                same IP always goes to same machine
                                              simple




Wednesday, September 14, 2011
drawbacks:



                                         susceptible to skew
                                can’t add / remove nodes without pain




Wednesday, September 14, 2011
consistent hashing



Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
benefits:



                                ~ 1/(n+1) URLs move on add/remove
                                      virtual nodes help skew
                                          robust (no SOP)




Wednesday, September 14, 2011
drawbacks:



                           naive solution won’t work for large sites




Wednesday, September 14, 2011
further reading:



           Chord: A Scalable Peer-to-Peer Lookup Protocol for
           Internet Applications (2001) Stoica et al.

           Dynamo: Amazon’s Highly Available Key-value Store, SOSP
           2007

           Tapestry: A Resilient Global-Scale Overlay for Service
           Deployment (2004) Zhao et al.




Wednesday, September 14, 2011
challenges:

                                QUEUEING URLS




Wednesday, September 14, 2011
situation:




Wednesday, September 14, 2011
situation:
                                   URL




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt
                                   polite




Wednesday, September 14, 2011
how to you order them?

                                (within a single machine)




Wednesday, September 14, 2011
hash each lane:




                1                         2                               3
                   http://yachtmaintenanceco.com/
                   http://www.amsterdamports.nl/
                   http://www.4s-dawn.com/
                   http://www.embassysuiteslittlerock.com/
                   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
                   http://mdgroover.iweb.bsu.edu
                   http://music.imbc.com/
                   http://www.robertjbradshaw.com
                   http://www.kerkattenhoven.be
                   http://www.escolania.org/
                   http://www.musiciansdfw.org/
Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
ERLANG




                                lookup: erlang B / C / engset

Wednesday, September 14, 2011
as many threads as possible




Wednesday, September 14, 2011
don’t sort input URLs




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1           wait
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/

                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
                                                                   wait
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395

                                                                   fetch
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1
                                                                   wait




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
no waiting!
   http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
challenges:

                                EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS


                                the internet is full of garbage




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls


                                     .net/

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                     enormous pages

                                     terrible markup

                                      ridiculous urls


                                           .net/
                                “unicode snowman dot net”

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                be prepared:




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII

                                  use a URL normalizer


Wednesday, September 14, 2011
SOFTWARE



Wednesday, September 14, 2011
software advice:




Wednesday, September 14, 2011
software advice:


               •       goals determine scale




Wednesday, September 14, 2011
software advice:


               •       goals determine scale

               •       someone else has already done it




Wednesday, September 14, 2011
2 second crawler:

    function wgetspider() {
      wget --html-extension --convert-links --mirror 
        --page-requisites --progress=bar --level=5   
        --no-parent --no-verbose 
        --no-check-certificate "$@";
    }

    $ wgetspider http://www.ischool.berkeley.edu/




Wednesday, September 14, 2011
java crawlers:




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)


    http://crawler.archive.org/
    http://nutch.apache.org/
    http://bixo.101tec.com/

Wednesday, September 14, 2011
extraction packages:




Wednesday, September 14, 2011
extraction packages:

               •       mechanize




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy


    http://wwwsearch.sourceforge.net/mechanize/
    http://www.crummy.com/software/BeautifulSoup/
    http://scrapy.org/

Wednesday, September 14, 2011
wrapper induction(ish)




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt

    http://ariel.rubyforge.org/index.html
    http://www.dia.uniroma3.it/db/roadRunner/
    http://code.google.com/p/templatemaker/
    http://scrubyt.rubyforge.org/files/README.html

Wednesday, September 14, 2011
QUESTIONS?



Wednesday, September 14, 2011
FEEDBACK:


             nate@xcombinator.com
                                www.xcombinator.com

                                   @xcombinator
Wednesday, September 14, 2011

More Related Content

What's hot

kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
CData Software Japan
 
Hacking GA4 for SEO - Brighton SEO - Apr 2023
Hacking GA4 for SEO - Brighton SEO - Apr 2023Hacking GA4 for SEO - Brighton SEO - Apr 2023
Hacking GA4 for SEO - Brighton SEO - Apr 2023
Nitesh Sharoff
 
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
James Brockbank
 
ArcGIS JavaScript API (build a web layer-based map application with html5 and...
ArcGIS JavaScript API (build a web layer-based map application with html5 and...ArcGIS JavaScript API (build a web layer-based map application with html5 and...
ArcGIS JavaScript API (build a web layer-based map application with html5 and...
Stefano Marchisio
 
How to Generate Content that Builds it's Own Links
How to Generate Content that Builds it's Own LinksHow to Generate Content that Builds it's Own Links
How to Generate Content that Builds it's Own Links
LizGration
 
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdfBrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
🇺🇲 🇬🇧 Kara Thurkettle
 
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise WebsitesBrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
Manick Bhan
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
DATAVERSITY
 
JoomlaDay Conference_September 2023 PDF.pdf
JoomlaDay Conference_September 2023 PDF.pdfJoomlaDay Conference_September 2023 PDF.pdf
JoomlaDay Conference_September 2023 PDF.pdf
Oliver Brett
 
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
Daniel Smullen
 
How not to fail at your international expansion
How not to fail at your international expansionHow not to fail at your international expansion
How not to fail at your international expansion
Lidia Infante
 
Technip Energies Italy: Planning is a graph matter
Technip Energies Italy: Planning is a graph matterTechnip Energies Italy: Planning is a graph matter
Technip Energies Italy: Planning is a graph matter
Neo4j
 
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
LazarinaStoyanova
 
Martin McGarry - SEO strategy c/o England manager Gareth Southgate
Martin McGarry - SEO strategy c/o England manager Gareth SouthgateMartin McGarry - SEO strategy c/o England manager Gareth Southgate
Martin McGarry - SEO strategy c/o England manager Gareth Southgate
Martin McGarry
 
Kleecks - AI-Martech as a game changer-DEF.pdf
Kleecks - AI-Martech as a game changer-DEF.pdfKleecks - AI-Martech as a game changer-DEF.pdf
Kleecks - AI-Martech as a game changer-DEF.pdf
Kleecks
 
Trucks on a Graph: How JB Hunt Uses Neo4j
Trucks on a Graph: How JB Hunt Uses Neo4jTrucks on a Graph: How JB Hunt Uses Neo4j
Trucks on a Graph: How JB Hunt Uses Neo4j
Neo4j
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
Neo4j
 
Entity Seo Mastery
Entity Seo MasteryEntity Seo Mastery
Entity Seo Mastery
Dixon Jones
 
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastConThe Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
Jason Barnard
 

What's hot (20)

kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
kintoneのデータをSQLで操作!? ~Drivers/Gateway/Sync 徹底比較~
 
Hacking GA4 for SEO - Brighton SEO - Apr 2023
Hacking GA4 for SEO - Brighton SEO - Apr 2023Hacking GA4 for SEO - Brighton SEO - Apr 2023
Hacking GA4 for SEO - Brighton SEO - Apr 2023
 
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
Brand Mentions vs Links: Making PR Coverage Work Harder for SEO.
 
ArcGIS JavaScript API (build a web layer-based map application with html5 and...
ArcGIS JavaScript API (build a web layer-based map application with html5 and...ArcGIS JavaScript API (build a web layer-based map application with html5 and...
ArcGIS JavaScript API (build a web layer-based map application with html5 and...
 
How to Generate Content that Builds it's Own Links
How to Generate Content that Builds it's Own LinksHow to Generate Content that Builds it's Own Links
How to Generate Content that Builds it's Own Links
 
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdfBrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
BrightonSEO April 2022 - Kara Thurkettle - Search in the Metaverse.pdf
 
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise WebsitesBrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
BrightonSEO - Master Crawl Budget Optimization for Enterprise Websites
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
JoomlaDay Conference_September 2023 PDF.pdf
JoomlaDay Conference_September 2023 PDF.pdfJoomlaDay Conference_September 2023 PDF.pdf
JoomlaDay Conference_September 2023 PDF.pdf
 
Graphic Deesigning Course content
Graphic Deesigning Course content Graphic Deesigning Course content
Graphic Deesigning Course content
 
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
News SEO: Why we’ve de commissioned AMP - Brighton SEO September 2021
 
How not to fail at your international expansion
How not to fail at your international expansionHow not to fail at your international expansion
How not to fail at your international expansion
 
Technip Energies Italy: Planning is a graph matter
Technip Energies Italy: Planning is a graph matterTechnip Energies Italy: Planning is a graph matter
Technip Energies Italy: Planning is a graph matter
 
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
Building a Search Intent-Driven Website Architecture (SEO Mastery Summit 2022...
 
Martin McGarry - SEO strategy c/o England manager Gareth Southgate
Martin McGarry - SEO strategy c/o England manager Gareth SouthgateMartin McGarry - SEO strategy c/o England manager Gareth Southgate
Martin McGarry - SEO strategy c/o England manager Gareth Southgate
 
Kleecks - AI-Martech as a game changer-DEF.pdf
Kleecks - AI-Martech as a game changer-DEF.pdfKleecks - AI-Martech as a game changer-DEF.pdf
Kleecks - AI-Martech as a game changer-DEF.pdf
 
Trucks on a Graph: How JB Hunt Uses Neo4j
Trucks on a Graph: How JB Hunt Uses Neo4jTrucks on a Graph: How JB Hunt Uses Neo4j
Trucks on a Graph: How JB Hunt Uses Neo4j
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
 
Entity Seo Mastery
Entity Seo MasteryEntity Seo Mastery
Entity Seo Mastery
 
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastConThe Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
The Knowledge Graph. What is it? How does it work? How do you get in? @YoastCon
 

Viewers also liked

Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
Julien Nioche
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirme
ahmetdemirelli
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
Denis Shestakov
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
Enes Caglar
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
Renato Lucindo
 

Viewers also liked (9)

Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirme
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Challenges in Large-Scale Web Crawling

  • 1. introduction to WEB CRAWLING & extraction by Nate Murray Wednesday, September 14, 2011
  • 2. WHO AM I ? Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005 Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ? Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web. Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured data Wednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured data Wednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title Users Wednesday, September 14, 2011
  • 12. motivation: business hours Wednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00 Wednesday, September 14, 2011
  • 16. motivation: recommend videos Users Wednesday, September 14, 2011
  • 18. motivation: vertical search Wednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price Rating Wednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEED Wednesday, September 14, 2011
  • 24. CONSTRAINTS • Politeness Wednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • Distributed Wednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear Scalability Wednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning Wednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlap Wednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly once Wednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 38. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 39. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 40. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 41. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 42. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 43. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 44. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 45. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 46. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 47. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 48. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web Data Wednesday, September 14, 2011
  • 51. challenges: depends on your ambitions Wednesday, September 14, 2011
  • 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Wednesday, September 14, 2011
  • 53. challenges: small crawls are easy Wednesday, September 14, 2011
  • 54. challenges: < 10MM small crawls are easy Wednesday, September 14, 2011
  • 55. challenges: large crawls are interesting Wednesday, September 14, 2011
  • 57. challenges: DNS Lookup Wednesday, September 14, 2011
  • 58. challenges: DNS Lookup URLs Crawled Wednesday, September 14, 2011
  • 59. challenges: DNS Lookup URLs Crawled Politeness Wednesday, September 14, 2011
  • 60. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Wednesday, September 14, 2011
  • 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Wednesday, September 14, 2011
  • 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLs Wednesday, September 14, 2011
  • 63. challenges: DNS LOOKUP Wednesday, September 14, 2011
  • 64. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 65. challenges: DNS LOOKUP can easily be a bottleneck Wednesday, September 14, 2011
  • 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc. Wednesday, September 14, 2011
  • 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clients Wednesday, September 14, 2011
  • 68. challenges: DNS LOOKUP You’ll know when you need it Wednesday, September 14, 2011
  • 69. challenges: URLs CRAWLED Wednesday, September 14, 2011
  • 70. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 71. challenges: URLs CRAWLED 1 machine, store in memory Wednesday, September 14, 2011
  • 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION Wednesday, September 14, 2011
  • 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations Wednesday, September 14, 2011
  • 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 Wednesday, September 14, 2011
  • 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million Wednesday, September 14, 2011
  • 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytes Wednesday, September 14, 2011
  • 77. can we do better? Wednesday, September 14, 2011
  • 79. BLOOM FILTERS answers the question: is this item in the set? Wednesday, September 14, 2011
  • 80. BLOOM FILTERS answers either: Wednesday, September 14, 2011
  • 81. BLOOM FILTERS answers either: • yes, probably Wednesday, September 14, 2011
  • 82. BLOOM FILTERS answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 89. BLOOM FILTER drawbacks Wednesday, September 14, 2011
  • 90. BLOOM FILTER drawbacks • probabilistic - occasional errors Wednesday, September 14, 2011
  • 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time Wednesday, September 14, 2011
  • 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete Wednesday, September 14, 2011
  • 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) Wednesday, September 14, 2011
  • 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade them Wednesday, September 14, 2011
  • 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ Wednesday, September 14, 2011
  • 99. challenges: POLITENESS Wednesday, September 14, 2011
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip) Wednesday, September 14, 2011
  • 103. centralized politeness SPOF Wednesday, September 14, 2011
  • 104. centralized politeness SPOF contention Wednesday, September 14, 2011
  • 105. challenges: POLITENESS Wednesday, September 14, 2011
  • 106. challenges: POLITENESS • Options: Wednesday, September 14, 2011
  • 107. challenges: POLITENESS • Options: • central database Wednesday, September 14, 2011
  • 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) Wednesday, September 14, 2011
  • 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution Wednesday, September 14, 2011
  • 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) Wednesday, September 14, 2011
  • 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/ Wednesday, September 14, 2011
  • 112. challenges: URL FRONTIER Wednesday, September 14, 2011
  • 114. idea: consistently distribute URLs based on IP Wednesday, September 14, 2011
  • 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3 Wednesday, September 14, 2011
  • 116. benefits: same IP always goes to same machine simple Wednesday, September 14, 2011
  • 117. drawbacks: susceptible to skew can’t add / remove nodes without pain Wednesday, September 14, 2011
  • 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP) Wednesday, September 14, 2011
  • 124. drawbacks: naive solution won’t work for large sites Wednesday, September 14, 2011
  • 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al. Wednesday, September 14, 2011
  • 126. challenges: QUEUEING URLS Wednesday, September 14, 2011
  • 128. situation: URL Wednesday, September 14, 2011
  • 129. situation: URL not recently crawled Wednesday, September 14, 2011
  • 130. situation: URL not recently crawled allowed by robots.txt Wednesday, September 14, 2011
  • 131. situation: URL not recently crawled allowed by robots.txt polite Wednesday, September 14, 2011
  • 132. how to you order them? (within a single machine) Wednesday, September 14, 2011
  • 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ Wednesday, September 14, 2011
  • 134. 1 2 3 Wednesday, September 14, 2011
  • 135. 1 2 3 Wednesday, September 14, 2011
  • 136. 1 2 3 Wednesday, September 14, 2011
  • 137. 1 2 3 Wednesday, September 14, 2011
  • 138. 1 2 3 Wednesday, September 14, 2011
  • 139. 1 2 3 Wednesday, September 14, 2011
  • 140. 1 2 3 Wednesday, September 14, 2011
  • 141. 1 2 3 Wednesday, September 14, 2011
  • 142. 1 2 3 Wednesday, September 14, 2011
  • 143. 1 2 3 Wednesday, September 14, 2011
  • 144. 1 2 3 Wednesday, September 14, 2011
  • 145. 1 2 3 Wednesday, September 14, 2011
  • 146. 1 2 3 Wednesday, September 14, 2011
  • 147. 1 2 3 Wednesday, September 14, 2011
  • 148. 1 2 3 Wednesday, September 14, 2011
  • 149. 1 2 3 Wednesday, September 14, 2011
  • 150. 1 2 3 Wednesday, September 14, 2011
  • 151. 1 2 3 Wednesday, September 14, 2011
  • 152. 1 2 3 Wednesday, September 14, 2011
  • 153. 1 2 3 Wednesday, September 14, 2011
  • 154. 1 2 3 Wednesday, September 14, 2011
  • 155. ERLANG lookup: erlang B / C / engset Wednesday, September 14, 2011
  • 156. as many threads as possible Wednesday, September 14, 2011
  • 157. don’t sort input URLs Wednesday, September 14, 2011
  • 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 wait Wednesday, September 14, 2011
  • 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 169. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 170. challenges: EXTRACTING URLS the internet is full of garbage Wednesday, September 14, 2011
  • 171. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 172. challenges: EXTRACTING URLS enormous pages Wednesday, September 14, 2011
  • 173. challenges: EXTRACTING URLS enormous pages terrible markup Wednesday, September 14, 2011
  • 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls Wednesday, September 14, 2011
  • 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ Wednesday, September 14, 2011
  • 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net” Wednesday, September 14, 2011
  • 177. challenges: EXTRACTING URLS be prepared: Wednesday, September 14, 2011
  • 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parser Wednesday, September 14, 2011
  • 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup Wednesday, September 14, 2011
  • 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII Wednesday, September 14, 2011
  • 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizer Wednesday, September 14, 2011
  • 184. software advice: • goals determine scale Wednesday, September 14, 2011
  • 185. software advice: • goals determine scale • someone else has already done it Wednesday, September 14, 2011
  • 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/ Wednesday, September 14, 2011
  • 188. java crawlers: • Heritrix (Internet Archive) Wednesday, September 14, 2011
  • 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) Wednesday, September 14, 2011
  • 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) Wednesday, September 14, 2011
  • 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/ Wednesday, September 14, 2011
  • 193. extraction packages: • mechanize Wednesday, September 14, 2011
  • 194. extraction packages: • mechanize • BeautifulSoup & urllib2 Wednesday, September 14, 2011
  • 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy Wednesday, September 14, 2011
  • 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/ Wednesday, September 14, 2011
  • 198. wrapper induction(ish) • Ariel Wednesday, September 14, 2011
  • 199. wrapper induction(ish) • Ariel • RoadRunner Wednesday, September 14, 2011
  • 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker Wednesday, September 14, 2011
  • 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt Wednesday, September 14, 2011
  • 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.html Wednesday, September 14, 2011
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinator Wednesday, September 14, 2011