introduction to

                        WEB CRAWLING
                            & extraction                 by Nate Murray




Wednesday, September 14, 2011
WHO AM I ?

Nate Murray

    AT&T Interactive (Yellowpages.com)
    TB-scale data since 2009
    Various crawlers since 2005

what is WEB CRAWLING ?

definition:
    web crawler
        a program that browses the web.

definition:
    web extraction
        transforming unstructured (more precisely, semistructured)
        web data into structured data

motivation

motivation: bookmark buddies
    URL, Title, Users

motivation: business hours

    Day     Openness
    Mon     Closed
    Tue     11:30-14:30   17:30-22:00
    Wed     11:30-14:30   17:30-22:00
    Thur    11:30-14:30   17:30-22:00
    Fri     11:30-14:30   17:30-22:00
    Sat     12:00-14:30   17:00-22:00
    Sun     -             17:00-21:00

motivation: recommend videos
    Users

motivation: vertical search
    Image, SKU, Name, Price, Rating

DESIRED PROPERTIES

    SPEED

CONSTRAINTS

    • Politeness: it’s easy to burden small servers
    • Distributed (for any significant crawl)
        • Linear Scalability: if one machine fetches m pages per second,
          n machines should fetch n*m pages per second
        • Even partitioning: every machine should perform equal work
        • Minimum overlap: crawl each page exactly once

BASIC ALGORITHM

Initialize:
    UrlsDone = {}   (empty set)
    UrlFrontier = {'google.com/index.html', ..}
Repeat
    url = UrlFrontier.getNext()
    ip = DNSlookup(url.getHostname())
    html = DownloadPage(ip, url.getPath())
    UrlsDone.insert(url)
    newUrls = parseForLinks(html)
    For each newUrl
      If not UrlsDone.contains(newUrl)
      then UrlFrontier.insert(newUrl)
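
The loop above can be sketched in Python roughly as follows. This is a minimal single-machine sketch, not the author's implementation: `crawl`, `parse_for_links`, `LinkParser` and the `max_pages` cap are names assumed for illustration, and the DNS lookup plus download steps are folded into a caller-supplied `download_page(url)` function so the loop can be exercised without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse_for_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, link))[0] for link in parser.links]


def crawl(seeds, download_page, max_pages=100):
    """Breadth-first crawl; download_page(url) -> html is supplied by the caller."""
    urls_done = set()
    frontier = deque(seeds)
    queued = set(seeds)  # assumption: also avoid queueing one URL twice
    while frontier and len(urls_done) < max_pages:
        url = frontier.popleft()
        html = download_page(url)
        urls_done.add(url)
        for new_url in parse_for_links(url, html):
            if new_url not in urls_done and new_url not in queued:
                queued.add(new_url)
                frontier.append(new_url)
    return urls_done
```

A real crawler would pass a `download_page` that performs the HTTP request; politeness, robots.txt and error handling are deliberately left out here.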




architecture overview

    CRAWL PLANNER --> URL QUEUE --(URLs)--> FETCHER <--(Web Data)--> INTERNET
    FETCHER --(Web Data)--> STORAGE --(Web Data)--> CRAWL PLANNER

CHALLENGES

challenges:
    depends on your ambitions

challenges:
    Google’s Index Size:

        1998 - 26 million pages
        2005 - 8 billion pages
        2008 - 1 trillion unique URLs seen (not all of them indexed)

    http://www.nytimes.com/2005/08/15/technology/15search.html
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

challenges:
    small crawls are easy (< 10MM)
    large crawls are interesting

challenges:
    DNS Lookup
    URLs Crawled
    Politeness
    URL Frontier
    Queueing URLs
    Extracting URLs

challenges:
      DNS LOOKUP

    in the basic algorithm:
        ip = DNSlookup(url.getHostname())

    can easily be a bottleneck

    • consider running your own DNS servers
        • djbdns
        • PowerDNS
        • etc.

    • be aware of software limitations
        • gethostbyname is synchronized (one outstanding lookup at a time)
        • same with many “default” DNS clients

    You’ll know when you need it

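
One common way to keep a blocking resolver from stalling the fetchers is to de-duplicate hostnames and resolve them from a thread pool. The sketch below assumes that approach; `resolve_all` and its parameters are hypothetical names, and whether lookups truly overlap depends on the platform's resolver.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_all(hostnames, resolver=socket.gethostbyname, max_workers=32):
    """Resolve each unique hostname once, in parallel.

    Returns {hostname: ip_string, or None if the lookup failed}.
    """
    unique = list(dict.fromkeys(hostnames))  # de-duplicate, keep first-seen order

    def safe_resolve(host):
        try:
            return resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(unique, pool.map(safe_resolve, unique)))
```

The `resolver` argument is injectable so a caching layer or a client for a local DNS server can be swapped in without touching the callers.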
challenges:
      URLs CRAWLED

    in the basic algorithm:
        UrlsDone.insert(url)
        If not UrlsDone.contains(newUrl)

    1 machine, store in memory

    NAPKIN CALCULATION
           ~50 bytes per URL
                 e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
        +    8 bytes for time-last-crawled
                 as long e.g. System.currentTimeMillis() -> 1314392455712
        x  100 million URLs
        =~ 5.4 gigabytes (58 bytes x 10^8 = 5.8 x 10^9 bytes, about 5.4 GiB)

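
The napkin calculation can be checked in a few lines. The constants are the slide's rough figures; the variable names are only illustrative, and per-object overhead of a real in-memory map is ignored.

```python
BYTES_PER_URL = 50         # rough average URL length, from the slide
BYTES_PER_TIMESTAMP = 8    # one 64-bit time-last-crawled value
NUM_URLS = 100_000_000

total_bytes = (BYTES_PER_URL + BYTES_PER_TIMESTAMP) * NUM_URLS
print(f"{total_bytes / 10**9:.1f} GB")   # decimal gigabytes
print(f"{total_bytes / 2**30:.1f} GiB")  # binary gigabytes
```

The 5.4 figure on the slide corresponds to binary gigabytes; in decimal units the same total is 5.8 GB.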
can we do better?

BLOOM FILTERS

    answers the question:
        is this item in the set?

    Have we crawled: http://www.xcombinator.com?

    answers either:
        • yes, probably
        • definitely not

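
A minimal Bloom filter sketch, to make the "yes, probably / definitely not" behaviour concrete. The class name, the use of SHA-1 and the double-hashing trick for deriving k bit positions are assumptions made for illustration; production libraries make different choices.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "yes, probably"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Usage: after `seen.add(url)`, the check `url in seen` is always true for that URL; for a URL never added it is false unless all of its bit positions happen to have been set by other URLs.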
challenges:
      URLs CRAWLED

    1 machine, bloom filter

    NAPKIN CALCULATION
        100 million URLs
        1 in 100 million chance of false positive
        =~ 457 megabytes (about 3.8 x 10^9 bits, i.e. 457 MiB)

        see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8

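
The 457 figure follows from the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A short sketch, with `bloom_size` as a hypothetical helper name:

```python
import math


def bloom_size(n, p):
    """Return (bits, hash_count) for n items at false-positive rate p."""
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hashes = round(bits / n * math.log(2))
    return bits, hashes


bits, hashes = bloom_size(100_000_000, 1e-8)
mib = bits / 8 / 2**20
print(f"{bits} bits, about {mib:.0f} MiB, {hashes} hash functions")
```

That is roughly 38 bits per URL, compared with about 58 bytes per URL for the exact in-memory table.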
BLOOM FILTER

    drawbacks                                solutions
    • probabilistic - occasional errors      • acceptable
    • estimate # of items ahead of time      • not hard, see Dynamic BFs
    • can’t delete                           • pick granularity (days)
                                             • cascade them

BLOOM FILTERS
                                         references:

       http://en.wikipedia.org/wiki/Bloom_filter
       http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
       http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/




challenges:
      POLITENESS

    obey robots.txt

    rule of thumb:
        wait 2 seconds between requests to the same IP

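
Both rules can be sketched on a single machine as follows. `PolitenessGate` and `allowed_by_robots` are hypothetical names; the robots.txt check uses Python's standard `urllib.robotparser`, and the caller is assumed to pass in the current time and the already-downloaded robots.txt text.

```python
import urllib.robotparser


class PolitenessGate:
    """Tracks the earliest time each IP may be fetched again."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.next_allowed = {}  # ip -> earliest allowed fetch time

    def seconds_to_wait(self, ip, now):
        return max(0.0, self.next_allowed.get(ip, 0.0) - now)

    def record_fetch(self, ip, now):
        self.next_allowed[ip] = now + self.delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a site's robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A fetcher would call `seconds_to_wait(ip, time.monotonic())` before each request and `record_fetch` after it. Keeping this state consistent across many machines is the harder problem.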
centralized politeness

    SPOF (single point of failure)
    contention

challenges:
      POLITENESS

    • Options:
        • central database
        • distributed locks (paxos/sigma/zookeeper)
        • controlled URL distribution

    http://en.wikipedia.org/wiki/Paxos_(computer_science)
    http://zookeeper.apache.org/

challenges:
      URL FRONTIER

idea:
    consistently distribute URLs based on IP

modulo

    IP                 SHA-1         bucket (mod 5)
    174.132.225.106    4dd14b0b...   2
    74.125.224.115     cf4b7594...   1
    157.166.255.19     0ac4d141...   4
    69.22.138.129      6c1584fa...   4
    98.139.50.166      327252c5...   3

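
A sketch of the modulo scheme. How the slide derived its hashes and bucket numbers is not stated (which string form of the IP, which part of the digest), so `bucket_for_ip` is an illustrative assumption and its output need not match the table above.

```python
import hashlib


def bucket_for_ip(ip, num_machines):
    """Map an IP address string to a machine index in [0, num_machines)."""
    digest = hashlib.sha1(ip.encode("ascii")).digest()
    return int.from_bytes(digest, "big") % num_machines


for ip in ["174.132.225.106", "74.125.224.115", "157.166.255.19"]:
    print(ip, bucket_for_ip(ip, 5))
```

Because the bucket depends on `num_machines`, changing the machine count reassigns most IPs.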
benefits:
    same IP always goes to same machine
    simple

drawbacks:
    susceptible to skew
    can’t add / remove nodes without pain

consistent hashing

    [figures: consistent hashing diagrams]
    source: http://michaelnielsen.org/blog/consistent-hashing/

benefits:
    ~ 1/(n+1) URLs move on add/remove
    virtual nodes help skew
    robust (no SPOF)

drawbacks:
    naive solution won’t work for large sites

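
A minimal hash ring with virtual nodes, to illustrate the benefits listed above. `HashRing`, the choice of SHA-1 and the default of 100 virtual nodes per machine are assumptions for this sketch, not the deck's implementation.

```python
import bisect
import hashlib


def _hash(key):
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each machine is placed at many positions to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Usage: `HashRing(["m1", "m2", "m3"]).lookup("74.125.224.115")` returns the machine responsible for that IP. Adding a machine only takes keys away from existing machines and never shuffles keys between them.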
further reading:



           Chord: A Scalable Peer-to-Peer Lookup Protocol for
           Internet Applications (2001) Stoica et al.

           Dynamo: Amazon’s Highly Available Key-value Store, SOSP
           2007

           Tapestry: A Resilient Global-Scale Overlay for Service
           Deployment (2004) Zhao et al.




challenges:
      QUEUEING URLS

situation:
    URL
    not recently crawled
    allowed by robots.txt
    polite

how do you order them?
    (within a single machine)

hash each lane:




                1                         2                               3
                   http://yachtmaintenanceco.com/
                   http://www.amsterdamports.nl/
                   http://www.4s-dawn.com/
                   http://www.embassysuiteslittlerock.com/
                   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
                   http://mdgroover.iweb.bsu.edu
                   http://music.imbc.com/
                   http://www.robertjbradshaw.com
                   http://www.kerkattenhoven.be
                   http://www.escolania.org/
                   http://www.musiciansdfw.org/


ERLANG

    look up: Erlang B / Erlang C / Engset formulas (queueing theory)

as many threads as possible

don’t sort input URLs

sorted input, all on one host:

   http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&page=1
   http://abcnews.go.com/2020/story?id=207269&page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?id=203089&page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&page=1

   fetch -> wait -> fetch -> wait -> fetch -> wait

no waiting!
   http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/
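
The contrast between the single-host list (fetch, wait, fetch, wait) and the mixed-host list (no waiting) can be sketched as a per-host frontier. This is only an illustration under stated assumptions: the class name PoliteFrontier, the schedule helper, and the 2-second delay are made up for this sketch and do not come from the talk, and fetches are treated as instantaneous on a simulated clock.

```python
import heapq
import itertools
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hands out URLs so each host is contacted at most once per `delay` seconds."""

    def __init__(self, delay=2.0):
        self.delay = delay                 # assumed per-host gap, in seconds
        self.queues = defaultdict(deque)   # host -> URLs still to fetch
        self.next_ok = {}                  # host -> earliest time of its next fetch
        self.ready = []                    # heap of (allowed_at, seq, host)
        self.seq = itertools.count()       # tie-breaker: keeps insertion order

    def __len__(self):
        return len(self.ready)             # number of hosts with pending URLs

    def add(self, url):
        host = urlsplit(url).hostname or ""
        if not self.queues[host]:
            # Host has nothing pending, so it needs a fresh slot in the heap.
            slot = (self.next_ok.get(host, 0.0), next(self.seq), host)
            heapq.heappush(self.ready, slot)
        self.queues[host].append(url)

    def pop(self, now):
        """Return (seconds_to_wait, url) for the next fetch, given the time `now`."""
        allowed_at, _, host = heapq.heappop(self.ready)
        wait = max(0.0, allowed_at - now)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + wait + self.delay
        if self.queues[host]:
            slot = (self.next_ok[host], next(self.seq), host)
            heapq.heappush(self.ready, slot)
        return wait, url


def schedule(urls, delay=2.0):
    """Simulate a crawl on a virtual clock; return [(fetch_time, url), ...]."""
    frontier = PoliteFrontier(delay)
    for url in urls:
        frontier.add(url)
    clock, plan = 0.0, []
    while frontier:
        wait, url = frontier.pop(clock)
        clock += wait                      # a real crawler would sleep here
        plan.append((clock, url))
    return plan


if __name__ == "__main__":
    same_host = ["http://abcnews.go.com/International/Pope/",
                 "http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395"]
    many_hosts = ["http://www.escolania.org/", "http://www.ariana.org/"]
    for urls in (same_host, many_hosts):
        for when, url in schedule(urls):
            print(f"t={when:4.1f}s  {url}")
```

With one host every fetch after the first has to wait out the delay; with one URL per host nothing waits, which is the reason to interleave hosts in the frontier.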




challenges:

      EXTRACTING URLS


                                the internet is full of garbage




challenges:

      EXTRACTING URLS
                                     enormous pages

                                     terrible markup

                                      ridiculous urls


                                          ☃.net/
                                “unicode snowman dot net”
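
A small sketch of what a host like the snowman domain means for a crawler: the non-ASCII hostname has to be converted to its ASCII (Punycode) form before DNS lookup. The helper name to_ascii_url is an assumption for illustration. It uses Python's built-in "idna" codec (IDNA 2003 rules), drops any user:password part, and does not touch non-ASCII characters in the path or query.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return `url` with its hostname converted to ASCII (Punycode) form."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The built-in "idna" codec may raise UnicodeError on malformed hostnames.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{ascii_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_ascii_url("http://☃.net/"))
```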

challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handles bad markup

                            be aware that URLs aren’t ASCII

                                  use a URL normalizer
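
The four points above can be sketched with the Python standard library alone. Note the substitution: instead of a streaming XML parser this uses html.parser.HTMLParser, an event-based HTML parser that accepts input in chunks and tolerates unclosed tags. The names normalize, LinkExtractor and extract_links, and the particular normalization rules (lowercase scheme and host, drop default ports and fragments), are assumptions for illustration, not a complete normalizer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # hostname is already lowercased; the idna codec makes it ASCII.
    # Real-world input can raise UnicodeError or ValueError here.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # fragment dropped


class LinkExtractor(HTMLParser):
    """Collects normalized http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value.strip())
                if urlsplit(absolute).scheme in DEFAULT_PORTS:
                    self.links.append(normalize(absolute))


def extract_links(chunks, base_url):
    """Feed HTML piece by piece, so an enormous page never sits in one string."""
    parser = LinkExtractor(base_url)
    for chunk in chunks:
        parser.feed(chunk)     # feed() buffers incomplete tags between chunks
    parser.close()
    return parser.links
```

A call such as extract_links(response_chunks, page_url) returns absolute, normalized links; mailto: and other non-HTTP schemes are skipped.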


SOFTWARE



software advice:


               •       goals determine scale

               •       someone else has already done it




2 second crawler:

    # mirror a site to local disk, following links up to 5 levels deep
    function wgetspider() {
      wget --html-extension --convert-links --mirror \
        --page-requisites --progress=bar --level=5 \
        --no-parent --no-verbose \
        --no-check-certificate "$@";
    }

    $ wgetspider http://www.ischool.berkeley.edu/
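
For comparison, roughly what the --level and --no-parent options ask for can be sketched in Python as a depth-limited, breadth-first crawl that stays below the start directory. This is not how wget is implemented. The function crawl and its fetch_links parameter are hypothetical: fetch_links(url) stands for whatever downloads a page and returns the absolute URLs it links to.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, fetch_links, max_depth=5):
    """Breadth-first crawl limited to `max_depth` hops and to the start directory."""
    start = urlsplit(start_url)
    # Directory part of the start URL; links outside it are out of scope.
    prefix = start.path.rsplit("/", 1)[0] + "/"

    def in_scope(url):
        parts = urlsplit(url)
        return parts.netloc == start.netloc and parts.path.startswith(prefix)

    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                       # do not follow links any deeper
        for link in fetch_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing the download step in as a function keeps the traversal logic testable without touching the network.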




java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)


    http://crawler.archive.org/
    http://nutch.apache.org/
    http://bixo.101tec.com/

extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy


    http://wwwsearch.sourceforge.net/mechanize/
    http://www.crummy.com/software/BeautifulSoup/
    http://scrapy.org/

wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt

    http://ariel.rubyforge.org/index.html
    http://www.dia.uniroma3.it/db/roadRunner/
    http://code.google.com/p/templatemaker/
    http://scrubyt.rubyforge.org/files/README.html

QUESTIONS?



FEEDBACK:


             nate@xcombinator.com
                                www.xcombinator.com

                                   @xcombinator
Introduction to web crawling & extraction

  • 1. introduction to WEB CRAWLING & extraction by Nate Murray Wednesday, September 14, 2011
  • 2. WHO AM I ? Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005 Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ? Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web. Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured data Wednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured data Wednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title Users Wednesday, September 14, 2011
  • 12. motivation: business hours Wednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00 Wednesday, September 14, 2011
  • 16. motivation: recommend videos Users Wednesday, September 14, 2011
  • 18. motivation: vertical search Wednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price Rating Wednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEED Wednesday, September 14, 2011
  • 24. CONSTRAINTS • Politeness Wednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • Distributed Wednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear Scalability Wednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning Wednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlap Wednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly once Wednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
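A minimal Python sketch of the crawl loop in the pseudocode above. It is an illustration, not the speaker's implementation: the names `crawl`, `parse_for_links` and `FAKE_WEB` are made up here, the download step is passed in as a function so the loop can run without a network, and a `queued` set is added so a URL is not put on the frontier twice while it is still waiting to be fetched.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags; tolerant of sloppy markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_for_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def crawl(seeds, download_page, max_pages=100):
    """download_page(url) -> html string; injected so the loop is testable."""
    urls_done = set()
    frontier = deque(seeds)
    queued = set(seeds)  # everything ever placed on the frontier
    while frontier and len(urls_done) < max_pages:
        url = frontier.popleft()
        html = download_page(url)
        urls_done.add(url)
        for new_url in parse_for_links(url, html):
            if new_url not in queued:
                queued.add(new_url)
                frontier.append(new_url)
    return urls_done


# A three-page in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "<p>no links</p>",
}
pages = crawl(["http://a.example/"], lambda url: FAKE_WEB.get(url, ""))
print(sorted(pages))
```

DNS lookup, politeness and robots.txt are left out on purpose; the later slides cover why each of them gets hard at scale.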
  • 49. architecture overview (diagram): CRAWL PLANNER, URL QUEUE, FETCHER, INTERNET, STORAGE; flows labeled "URLs" and "Web Data"
  • 51. challenges: depends on your ambitions
  • 52. challenges: Google’s Index Size:
        1998 - 26 million
        2005 - 8 billion
        2008 - 1 trillion (unique URLs seen)
        http://www.nytimes.com/2005/08/15/technology/15search.html
        http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
  • 54. challenges: small crawls (< 10MM) are easy
  • 55. challenges: large crawls are interesting
  • 62. challenges:
        • DNS Lookup
        • URLs Crawled
        • Politeness
        • URL Frontier
        • Queueing URLs
        • Extracting URLs
  • 63. challenges: DNS LOOKUP
  • 65. challenges: DNS LOOKUP can easily be a bottleneck
  • 66. challenges: DNS LOOKUP
        • consider running your own DNS servers
            • djbdns
            • PowerDNS
            • etc.
  • 67. challenges: DNS LOOKUP
        • be aware of software limitations
            • gethostbyaddr is synchronized (one lookup at a time)
            • same with many “default” DNS clients
  • 68. challenges: DNS LOOKUP You’ll know when you need it
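One common way around a blocking resolver is to cache answers and run the lookups on a thread pool. The sketch below assumes that approach; it is not from the talk. `CachingResolver` and `fake_lookup` are hypothetical names, and the lookup function is injectable so the example runs without touching real DNS. The default path uses the standard `socket.getaddrinfo`.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class CachingResolver:
    def __init__(self, lookup=None, workers=32):
        # lookup(hostname) -> IP string; defaults to the system resolver.
        self._lookup = lookup or self._system_lookup
        self._cache = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)

    @staticmethod
    def _system_lookup(hostname):
        infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        return infos[0][4][0]  # first address of the first result

    def _safe(self, hostname):
        try:
            return self._lookup(hostname)
        except OSError:
            return None  # failures are cached as None in this sketch

    def resolve_many(self, urls):
        hosts = {urlsplit(url).hostname for url in urls}
        hosts.discard(None)
        missing = [host for host in hosts if host not in self._cache]
        # Only uncached hosts are looked up, and those run in parallel.
        for host, ip in zip(missing, self._pool.map(self._safe, missing)):
            self._cache[host] = ip
        return {host: self._cache[host] for host in hosts}


calls = []


def fake_lookup(host):
    calls.append(host)
    return "10.0.0." + str(len(host))


resolver = CachingResolver(lookup=fake_lookup, workers=4)
first = resolver.resolve_many(
    ["http://a.example/x", "http://a.example/y", "http://bb.example/"]
)
second = resolver.resolve_many(["http://a.example/z"])  # served from cache
print(first, second, calls)
```

A real crawler would also expire cache entries (DNS records have a TTL); that is omitted here.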
  • 69. challenges: URLs CRAWLED
  • 76. challenges: URLs CRAWLED 1 machine, store in memory
        NAPKIN CALCULATION
        ~50 bytes per URL
            e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations
        +8 bytes for time-last-crawled as long
            e.g. System.currentTimeMillis() -> 1314392455712
        x 100 million
        =~ 5.8 billion bytes (about 5.4 GiB)
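The napkin calculation, spelled out. The 50-byte and 8-byte figures are the slide's estimates; the variable names are only for illustration, and per-object overhead of a real in-memory set is ignored.

```python
URL_BYTES = 50           # rough average URL length, from the slide
TIMESTAMP_BYTES = 8      # one 64-bit time-last-crawled value
NUM_URLS = 100_000_000

total_bytes = (URL_BYTES + TIMESTAMP_BYTES) * NUM_URLS
gibibytes = total_bytes / 2**30

print(total_bytes)           # 5800000000
print(round(gibibytes, 1))   # 5.4
```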
  • 77. can we do better?
  • 84. BLOOM FILTERS answer the question: is this item in the set?
        Have we crawled: http://www.xcombinator.com?
        answers either:
        • yes, probably
        • definitely not
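A small Bloom filter sketch showing where the two answers come from. The class and its parameters are illustrative, not taken from the talk. It derives the k bit positions from one SHA-1 digest with double hashing, which is one common choice among several.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = h1 + i * h2 (mod num_bits).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Any unset bit means the item was never added: "definitely not".
        # All bits set may be a coincidence of other items: "yes, probably".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=10_000, num_hashes=7)
seen.add("http://www.xcombinator.com")
print("http://www.xcombinator.com" in seen)      # True ("yes, probably")
print("http://example.org/never-added" in seen)  # expected False, barring a rare collision
```

The filter stores only bits, never the URLs themselves, which is where the memory saving comes from.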
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter
        NAPKIN CALCULATION
        100 million URLs
        1 in 100 million chance of false positive
        =~ 479 million bytes (about 457 MiB)
        see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8
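The 457 figure can be reproduced with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name below is made up for this sketch.

```python
import math


def bloom_size(n_items, false_positive_rate):
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = math.ceil(
        -n_items * math.log(false_positive_rate) / math.log(2) ** 2
    )
    hashes = round(bits / n_items * math.log(2))
    return bits, hashes


bits, hashes = bloom_size(100_000_000, 1e-8)
mebibytes = bits / 8 / 2**20
print(int(mebibytes), hashes)  # 457 27
```

That is roughly 38 bits per URL instead of about 58 bytes, at the cost of the drawbacks on the following slides.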
  • 97. BLOOM FILTER
        drawbacks                                solutions
        • probabilistic - occasional errors      • acceptable
        • estimate # of items ahead of time      • not hard, see Dynamic BFs
        • can’t delete                           • pick granularity (days)
                                                 • cascade them
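One reading of "pick granularity (days)" and "cascade them" is to keep one filter per day, check them in sequence, and forget a whole day at a time instead of deleting single items. The sketch below follows that reading. It is an interpretation, not the speaker's code, and `TinyBloom` and `RollingSeen` are hypothetical names. It assumes days arrive in increasing order.

```python
import hashlib
from collections import OrderedDict


class TinyBloom:
    """Deliberately small Bloom filter, just enough for the demo."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item)
        )


class RollingSeen:
    """One filter per day; the oldest day is dropped as a unit."""

    def __init__(self, keep_days=7):
        self.keep_days = keep_days
        self.by_day = OrderedDict()  # day number -> TinyBloom

    def add(self, url, day):
        if day not in self.by_day:
            self.by_day[day] = TinyBloom()
            while len(self.by_day) > self.keep_days:
                self.by_day.popitem(last=False)  # forget the oldest day
        self.by_day[day].add(url)

    def seen_recently(self, url):
        return any(url in bloom for bloom in self.by_day.values())


seen = RollingSeen(keep_days=2)
seen.add("http://a.example/", day=1)
seen.add("http://b.example/", day=2)
was_seen = seen.seen_recently("http://a.example/")
seen.add("http://c.example/", day=3)  # day 1 falls out of the window
still_seen = seen.seen_recently("http://a.example/")
print(was_seen, still_seen)
```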
  • 98. BLOOM FILTERS references:
        http://en.wikipedia.org/wiki/Bloom_filter
        http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
        http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
  • 99. challenges: POLITENESS
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip)
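On a single machine the rule of thumb can be enforced with a table of "earliest next fetch" times keyed by IP. A sketch under that assumption; `PolitenessGate` is a made-up name, and the clock is injectable so the example does not need to sleep.

```python
import time


class PolitenessGate:
    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.next_allowed = {}  # ip -> earliest time of the next fetch

    def seconds_until_allowed(self, ip):
        ready_at = self.next_allowed.get(ip, float("-inf"))
        return max(0.0, ready_at - self.clock())

    def try_acquire(self, ip):
        """Return True and reserve the IP if it may be fetched right now."""
        now = self.clock()
        if now < self.next_allowed.get(ip, float("-inf")):
            return False
        self.next_allowed[ip] = now + self.min_delay
        return True


fake_now = [100.0]
gate = PolitenessGate(min_delay=2.0, clock=lambda: fake_now[0])
first = gate.try_acquire("203.0.113.7")    # True: never fetched before
second = gate.try_acquire("203.0.113.7")   # False: still inside the 2 s window
other = gate.try_acquire("198.51.100.9")   # True: a different IP
fake_now[0] += 2.0
third = gate.try_acquire("203.0.113.7")    # True: the window has passed
print(first, second, other, third)
```

This only works while one process owns the table; sharing it across machines is the problem the next slides address.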
  • 104. centralized politeness: SPOF (single point of failure), contention
  • 111. challenges: POLITENESS
        • Options:
            • central database
            • distributed locks (paxos/sigma/zookeeper)
            • controlled URL distribution
        http://en.wikipedia.org/wiki/Paxos_(computer_science)
        http://zookeeper.apache.org/
  • 112. challenges: URL FRONTIER
  • 114. idea: consistently distribute URLs based on IP
  • 115. modulo
        IP                 SHA-1          bucket (mod 5)
        174.132.225.106    4dd14b0b...    2
        74.125.224.115     cf4b7594...    1
        157.166.255.19     0ac4d141...    4
        69.22.138.129      6c1584fa...    4
        98.139.50.166      327252c5...    3
  • 116. benefits:
        • same IP always goes to same machine
        • simple
  • 117. drawbacks:
        • susceptible to skew
        • can’t add / remove nodes without pain
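A sketch of the modulo scheme and of why adding a node hurts. It hashes the dotted-quad string of the IP; the exact input the slide hashed is not stated, so the buckets here are not expected to match the table. `bucket_for` and the synthetic IP list are illustrative.

```python
import hashlib


def bucket_for(ip, num_machines):
    """Map an IP to a machine: SHA-1 of the IP string, modulo machine count."""
    digest = hashlib.sha1(ip.encode("ascii")).hexdigest()
    return int(digest, 16) % num_machines


# Synthetic IPs, only to measure how assignments change from 5 to 6 machines.
ips = [f"10.{i // 256}.{i % 256}.1" for i in range(2000)]
moved = sum(bucket_for(ip, 5) != bucket_for(ip, 6) for ip in ips)
fraction_moved = moved / len(ips)
print(round(fraction_moved, 2))
```

For a uniform hash, a key keeps its bucket only when `h % 5 == h % 6`, which happens for about 1 in 6 keys, so roughly five sixths of all IPs change machines when a sixth node is added.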
  • 123. benefits (consistent hashing):
        • ~ 1/(n+1) URLs move on add/remove
        • virtual nodes help skew
        • robust (no SPOF)
  • 124. drawbacks: naive solution won’t work for large sites
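The benefits above are those of a consistent hash ring with virtual nodes. A compact sketch under that assumption, with made-up names (`HashRing`, `crawler-N`); it is not the implementation from the talk.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per machine, to smooth skew
        self._points = []      # sorted positions on the ring
        self._owner = {}       # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ips = [f"10.{i // 256}.{i % 256}.1" for i in range(2000)]
ring = HashRing([f"crawler-{i}" for i in range(5)])
before = {ip: ring.node_for(ip) for ip in ips}
ring.add("crawler-5")
after = {ip: ring.node_for(ip) for ip in ips}
moved = [ip for ip in ips if before[ip] != after[ip]]
print(len(moved) / len(ips))  # close to 1/6, not the ~5/6 of plain modulo
```

Every key that moves lands on the new node; nothing is shuffled between the existing five.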
  • 125. further reading:
        Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al.
        Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007, DeCandia et al.
        Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.
  • 126. challenges: QUEUEING URLS
  • 131. situation: URL
        • not recently crawled
        • allowed by robots.txt
        • polite
  • 132. how do you order them? (within a single machine)
  • 133. hash each URL into a lane (1, 2, 3):
        http://yachtmaintenanceco.com/
        http://www.amsterdamports.nl/
        http://www.4s-dawn.com/
        http://www.embassysuiteslittlerock.com/
        http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
        http://mdgroover.iweb.bsu.edu
        http://music.imbc.com/
        http://www.robertjbradshaw.com
        http://www.kerkattenhoven.be
        http://www.escolania.org/
        http://www.musiciansdfw.org/
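The lane slides are an animation that did not survive extraction. A plausible reading is sketched below: hash each URL's hostname to pick one of three lanes, so that all URLs for a host queue up behind each other in the same lane. Hashing by hostname (rather than by full URL or by IP) is an assumption, and `lane_for` and `fill_lanes` are made-up names.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


def lane_for(url, num_lanes):
    """Pick a lane from the URL's hostname, so one host maps to one lane."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_lanes


def fill_lanes(urls, num_lanes=3):
    lanes = [deque() for _ in range(num_lanes)]
    for url in urls:
        lanes[lane_for(url, num_lanes)].append(url)
    return lanes


urls = [
    "http://yachtmaintenanceco.com/",
    "http://www.amsterdamports.nl/",
    "http://music.imbc.com/",
    "http://music.imbc.com/charts",   # hypothetical second URL on one host
]
lanes = fill_lanes(urls)
for number, lane in enumerate(lanes, start=1):
    print(number, list(lane))
```

With one worker per lane, the politeness delay for a host only ever stalls that host's lane.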
  • 155. ERLANG: look up Erlang B / Erlang C / Engset (queueing formulas)
  • 156. as many threads as possible
  • 157. don’t sort input URLs
  • 165. sorted input, every URL on the same host: fetch, wait, fetch, wait, ...
        http://abcnews.go.com/
        http://abcnews.go.com/2020/ABCNEWSSpecial/
        http://abcnews.go.com/2020/story?id=207269&page=1
        http://abcnews.go.com/2020/story?id=207269&page=1
        http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
        http://abcnews.go.com/International/News/story?id=203089&page=1
        http://abcnews.go.com/International/Pope/
        http://abcnews.go.com/International/story?id=81417&page=1
  • 168. no waiting!
        http://yachtmaintenanceco.com/
        http://www.amsterdamports.nl/
        http://www.4s-dawn.com/
        http://www.embassysuiteslittlerock.com/
        http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
        http://mdgroover.iweb.bsu.edu
        http://music.imbc.com/
        http://www.robertjbradshaw.com
        http://www.kerkattenhoven.be
        http://www.escolania.org/
        http://www.musiciansdfw.org/
        http://www.ariana.org/
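A toy simulation of why sorted input is slow for a single fetch thread. The model is an assumption made for illustration: every fetch takes 0.5 s, each host must rest 2 s between fetches, and the hostnames are invented.

```python
from urllib.parse import urlsplit


def simulated_finish_time(urls, delay=2.0, fetch_seconds=0.5):
    """Time for one thread to fetch urls in order, honoring a per-host delay."""
    now = 0.0
    next_allowed = {}  # host -> earliest time of the next fetch
    for url in urls:
        host = urlsplit(url).hostname
        now = max(now, next_allowed.get(host, 0.0))  # wait if host not ready
        now += fetch_seconds
        next_allowed[host] = now + delay
    return now


same_host = [f"http://news.example/story?id={i}" for i in range(4)]
mixed_hosts = [f"http://site{i}.example/" for i in range(4)]

print(simulated_finish_time(same_host))    # 8.0
print(simulated_finish_time(mixed_hosts))  # 2.0
```

Sorting groups URLs by host, so the thread spends most of its time waiting out the politeness delay; an unsorted or shuffled list rarely hits the same host twice in a row.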
  • 169. challenges: EXTRACTING URLS
  • 170. challenges: EXTRACTING URLS the internet is full of garbage
  • 176. challenges: EXTRACTING URLS
        • enormous pages
        • terrible markup
        • ridiculous urls, e.g. ☃.net/ (“unicode snowman dot net”)
  • 181. challenges: EXTRACTING URLS be prepared:
        • use a streaming XML parser
        • use a library that handles bad markup
        • be aware that URLs aren’t always ASCII
        • use a URL normalizer
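A sketch of what a small URL normalizer might do, using only the Python standard library. The rule set is an assumption chosen for illustration (lowercase scheme and host, IDNA-encode the host, drop default ports and fragments, use "/" for an empty path); real normalizers apply more rules. `normalize_url` is a made-up name.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url, base=None):
    """Return a canonical form of url, resolving it against base if given.

    Limitations of this sketch: userinfo is dropped, IPv6 literals and
    percent-encoding differences are not handled.
    """
    if base:
        url = urljoin(base, url)  # also resolves "../" segments
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""   # .hostname is already lowercased
    # Non-ASCII hostnames become their ASCII ("xn--") form.
    host = host.encode("idna").decode("ascii")
    if parts.port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTP://Example.COM:80/a/b#frag"))  # http://example.com/a/b
print(normalize_url("http://example.com"))              # http://example.com/
print(normalize_url("../c?x=1", base="http://example.com/a/b/"))
print(normalize_url("http://\u2603.net/"))               # unicode snowman host
```

Normalizing before the "have we crawled this?" check keeps trivially different spellings of one URL from being fetched twice.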
  • 185. software advice:
        • goals determine scale
        • someone else has already done it
  • 186. 2 second crawler:
        function wgetspider() {
          wget --html-extension --convert-links --mirror \
               --page-requisites --progress=bar --level=5 \
               --no-parent --no-verbose --no-check-certificate "$@";
        }
        $ wgetspider http://www.ischool.berkeley.edu/
  • 191. java crawlers:
        • Heritrix (Internet Archive) http://crawler.archive.org/
        • Nutch (Lucene) http://nutch.apache.org/
        • Bixo (Hadoop / Cascading) http://bixo.101tec.com/
  • 196. extraction packages:
        • mechanize http://wwwsearch.sourceforge.net/mechanize/
        • BeautifulSoup & urllib2 http://www.crummy.com/software/BeautifulSoup/
        • Scrapy http://scrapy.org/
  • 202. wrapper induction(ish)
        • Ariel http://ariel.rubyforge.org/index.html
        • RoadRunner http://www.dia.uniroma3.it/db/roadRunner/
        • TemplateMaker http://code.google.com/p/templatemaker/
        • scrubyt http://scrubyt.rubyforge.org/files/README.html
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinator