SlideShare a Scribd company logo
1 of 204
Download to read offline
introduction to

                        WEB CRAWLING
                            & extraction                 by Nate Murray




Wednesday, September 14, 2011
WHO AM I ?



Wednesday, September 14, 2011
Nate Murray

                        AT&T Interactive (Yellowpages.com)

                                 TB-scale data since 2009

                                Various crawlers since 2005




Wednesday, September 14, 2011
what is

                        WEB CRAWLING ?



Wednesday, September 14, 2011
definition:
                 web crawler
                            a program that browses the web.




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming unstructured web data into
                                structured data




Wednesday, September 14, 2011
definition:
                 web extraction
                    transforming semistructured web data into
                                structured data




Wednesday, September 14, 2011
motivation




Wednesday, September 14, 2011
motivation: bookmark buddies




Wednesday, September 14, 2011
motivation: bookmark buddies

                                URL Title
                                 Users




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               business hours




Wednesday, September 14, 2011
motivation:               business hours


                                Day        Openness
                                Mon         Closed
                                Tue    11:30-14:30 17:30-22:00


                                Wed    11:30-14:30 17:30-22:00


                                Thur   11:30-14:30 17:30-22:00


                                 Fri   11:30-14:30 17:30-22:00


                                 Sat   12:00-14:30 17:00-22:00


                                Sun        -       17:00-21:00




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation: recommend videos




Wednesday, September 14, 2011
motivation: recommend videos




                                Users
Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
motivation:               vertical search




Wednesday, September 14, 2011
motivation:               vertical search

                  Image
                   SKU
                  Name
                   Price
                  Rating




Wednesday, September 14, 2011
motivation:




Wednesday, September 14, 2011
DESIRED PROPERTIES



Wednesday, September 14, 2011
DESIRED PROPERTIES



                                SPEED

Wednesday, September 14, 2011
CONSTRAINTS




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness                it’s easy to burden

         • Distributed
                                         small servers


              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed               (for any significant
                                            crawl)
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability      n machines =
                                     n*m pages-per-second
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning    every machine should
                                      perform equal work

              • Minimum overlap



Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap      crawl each page
                                       exactly once




Wednesday, September 14, 2011
CONSTRAINTS

         • Politeness
         • Distributed
              • Linear Scalability
              • Even partitioning
              • Minimum overlap



Wednesday, September 14, 2011
BASIC ALGORITHM



Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
architecture overview

         CRAWL
                                             FETCHER                   INTERNET
                                      URLs                  Web Data

        PLANNER
                          URL
                         QUEUE
                                                 Web Data




                                             STORAGE
                                Web Data




Wednesday, September 14, 2011
CHALLENGES



Wednesday, September 14, 2011
challenges:




                                depends on your ambitions




Wednesday, September 14, 2011
challenges:


                                Google’s Index Size:

                                 1998 - 26 million
                                 2005 - 8 billion
                                 2008 - 1 trillion




    http://www.nytimes.com/2005/08/15/technology/15search.html
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html




Wednesday, September 14, 2011
challenges:




                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                     < 10MM


                                small crawls are easy




Wednesday, September 14, 2011
challenges:




                                large crawls are interesting




Wednesday, September 14, 2011
challenges:




Wednesday, September 14, 2011
challenges:




                                DNS Lookup




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness




Wednesday, September 14, 2011
challenges:




                                DNS Lookup
                                URLs Crawled
                                 Politeness
                                URL Frontier




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                URLs Crawled
                                  Politeness
                                 URL Frontier
                                Queueing URLs




Wednesday, September 14, 2011
challenges:




                                 DNS Lookup
                                 URLs Crawled
                                   Politeness
                                 URL Frontier
                                Queueing URLs
                                Extracting URLs




Wednesday, September 14, 2011
challenges:

                                DNS LOOKUP




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                can easily be a bottleneck




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • consider running your own DNS servers
             • djbdns
             • PowerDNS
             • etc.




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP

           • be aware of software limitations
                • gethostbyaddr is synchronized
                • same with many “default” DNS clients




Wednesday, September 14, 2011
challenges:
      DNS LOOKUP


                                You’ll know when you need it




Wednesday, September 14, 2011
challenges:

                                URLs CRAWLED




Wednesday, September 14, 2011
Initialize:
             UrlsDone = null
             UrlFrontier = {'google.com/index.html', ..}
         Repeat
             url = UrlFrontier.getNext()
             ip = DNSlookup(url.getHostname())
             html = DownloadPage(ip, url.getPath())
             UrlsDone.insert(url)
             newUrls = parseForLinks(html)
             For each newUrl
               If not UrlsDone.contains(newUrl)
               then UrlsTodo.insert(newUrl)




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, store in memory

                                NAPKIN CALCULATION
                ~50 bytes per URL
                                  e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations


                   +8 bytes for time-last-crawled
                                    as long e.g. System.currentTimeMillis() -> 1314392455712


                   x        100 million
                  =~ 5.4 gigabytes


Wednesday, September 14, 2011
can we do better?




Wednesday, September 14, 2011
BLOOM FILTERS



Wednesday, September 14, 2011
BLOOM FILTERS



                                     answers the question:

                                is this item in the set?


Wednesday, September 14, 2011
BLOOM FILTERS



                                answers either:




Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably

Wednesday, September 14, 2011
BLOOM FILTERS



                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
BLOOM FILTERS


                Have we crawled: http://www.xcombinator.com?

                                    answers either:

                                • yes, probably
                                • definitely not
Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                1 machine, bloom filter



                100 million URLs

                      1 in 100 million chance
                                                of false positive




                                      see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive




                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
challenges:
      URLs CRAWLED
                                 1 machine, bloom filter

                                NAPKIN CALCULATION
                100 million URLs

                      1 in 100 million chance
                                                 of false positive



                  =~ 457 megabytes
                                       see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8




Wednesday, September 14, 2011
BLOOM FILTER




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time




Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks


   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors

   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                            solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time

   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete


Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)

Wednesday, September 14, 2011
BLOOM FILTER
                       drawbacks
                                         solutions
   • probabilistic - occasional errors
                             • acceptable
   • estimate # of items ahead of time
                             • not hard, see Dynamic BFs
   • can’t delete
                             • pick granularity (days)
                             • cascade them
Wednesday, September 14, 2011
BLOOM FILTERS
                                         references:

       http://en.wikipedia.org/wiki/Bloom_filter
       http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
       http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/




Wednesday, September 14, 2011
challenges:

                                POLITENESS




Wednesday, September 14, 2011
obey robots.txt




Wednesday, September 14, 2011
rule of thumb:

                                wait 2 seconds (w.r.t. ip)




Wednesday, September 14, 2011
centralized politeness




Wednesday, September 14, 2011
centralized politeness




                                SPOF


Wednesday, September 14, 2011
centralized politeness




                                  SPOF
                                contention

Wednesday, September 14, 2011
challenges:
      POLITENESS




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution




Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)


Wednesday, September 14, 2011
challenges:
      POLITENESS

                • Options:
                  • central database
                  • distributed locks (paxos/sigma/zookeeper)
                  • controlled URL distribution



    http://en.wikipedia.org/wiki/Paxos_(computer_science)
    http://zookeeper.apache.org/

Wednesday, September 14, 2011
challenges:

                                URL FRONTIER




Wednesday, September 14, 2011
url frontier




Wednesday, September 14, 2011
idea:

                          consistently distribute URLs based on IP




Wednesday, September 14, 2011
modulo
                                IP      SHA-1       bucket (mod 5)
                  174.132.225.106     4dd14b0b...         2

                   74.125.224.115     cf4b7594...         1

                   157.166.255.19     0ac4d141...         4

                    69.22.138.129     6c1584fa...         4

                    98.139.50.166     327252c5...         3




Wednesday, September 14, 2011
benefits:



                                same IP always goes to same machine
                                              simple




Wednesday, September 14, 2011
drawbacks:



                                         susceptible to skew
                                can’t add / remove nodes without pain




Wednesday, September 14, 2011
consistent hashing



Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/

Wednesday, September 14, 2011
benefits:



                                ~ 1/(n+1) URLs move on add/remove
                                      virtual nodes help skew
                                          robust (no SOP)




Wednesday, September 14, 2011
drawbacks:



                           naive solution won’t work for large sites




Wednesday, September 14, 2011
further reading:



           Chord: A Scalable Peer-to-Peer Lookup Protocol for
           Internet Applications (2001) Stoica et al.

           Dynamo: Amazon’s Highly Available Key-value Store, SOSP
           2007

           Tapestry: A Resilient Global-Scale Overlay for Service
           Deployment (2004) Zhao et al.




Wednesday, September 14, 2011
challenges:

                                QUEUEING URLS




Wednesday, September 14, 2011
situation:




Wednesday, September 14, 2011
situation:
                                   URL




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt




Wednesday, September 14, 2011
situation:
                                   URL
                                   not recently crawled
                                   allowed by robots.txt
                                   polite




Wednesday, September 14, 2011
how to you order them?

                                (within a single machine)




Wednesday, September 14, 2011
hash each lane:




                1                         2                               3
                   http://yachtmaintenanceco.com/
                   http://www.amsterdamports.nl/
                   http://www.4s-dawn.com/
                   http://www.embassysuiteslittlerock.com/
                   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
                   http://mdgroover.iweb.bsu.edu
                   http://music.imbc.com/
                   http://www.robertjbradshaw.com
                   http://www.kerkattenhoven.be
                   http://www.escolania.org/
                   http://www.musiciansdfw.org/
Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
1           2   3


Wednesday, September 14, 2011
ERLANG




                                lookup: erlang B / C / engset

Wednesday, September 14, 2011
as many threads as possible




Wednesday, September 14, 2011
don’t sort input URLs




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1           wait
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/

                                                                   fetch
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
                                                                   wait
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395

                                                                   fetch
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1
                                                                   wait




Wednesday, September 14, 2011
http://abcnews.go.com/
   http://abcnews.go.com/2020/ABCNEWSSpecial/
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/2020/story?id=207269&amp;page=1
   http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395
   http://abcnews.go.com/International/News/story?
   id=203089&amp;page=1
   http://abcnews.go.com/International/Pope/
   http://abcnews.go.com/International/story?id=81417&amp;page=1




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
no waiting!
   http://yachtmaintenanceco.com/
   http://www.amsterdamports.nl/
   http://www.4s-dawn.com/
   http://www.embassysuiteslittlerock.com/
   http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm
   http://mdgroover.iweb.bsu.edu
   http://music.imbc.com/
   http://www.robertjbradshaw.com
   http://www.kerkattenhoven.be
   http://www.escolania.org/
   http://www.musiciansdfw.org/
   http://www.ariana.org/




Wednesday, September 14, 2011
challenges:

                                EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS


                                the internet is full of garbage




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                enormous pages

                                terrible markup

                                ridiculous urls


                                     .net/

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                     enormous pages

                                     terrible markup

                                      ridiculous urls


                                           .net/
                                “unicode snowman dot net”

Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                be prepared:




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII




Wednesday, September 14, 2011
challenges:

      EXTRACTING URLS
                                       be prepared:


                                use a streaming XML parser

                       use a library that handle’s bad markup

                            be aware that URLs aren’t ASCII

                                  use a URL normalizer


Wednesday, September 14, 2011
SOFTWARE



Wednesday, September 14, 2011
software advice:




Wednesday, September 14, 2011
software advice:


               •       goals determine scale




Wednesday, September 14, 2011
software advice:


               •       goals determine scale

               •       someone else has already done it




Wednesday, September 14, 2011
2 second crawler:

    function wgetspider() {
      wget --html-extension --convert-links --mirror 
        --page-requisites --progress=bar --level=5   
        --no-parent --no-verbose 
        --no-check-certificate "$@";
    }

    $ wgetspider http://www.ischool.berkeley.edu/




Wednesday, September 14, 2011
java crawlers:




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)




Wednesday, September 14, 2011
java crawlers:

               •       Heritrix (Internet Archive)

               •       Nutch (Lucene)

               •       Bixo (Hadoop / Cascading)


    http://crawler.archive.org/
    http://nutch.apache.org/
    http://bixo.101tec.com/

Wednesday, September 14, 2011
extraction packages:




Wednesday, September 14, 2011
extraction packages:

               •       mechanize




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy




Wednesday, September 14, 2011
extraction packages:

               •       mechanize

               •       BeautifulSoup & urllib2

               •       Scrapy


    http://wwwsearch.sourceforge.net/mechanize/
    http://www.crummy.com/software/BeautifulSoup/
    http://scrapy.org/

Wednesday, September 14, 2011
wrapper induction(ish)




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt




Wednesday, September 14, 2011
wrapper induction(ish)
               •       Ariel

               •       RoadRunner

               •       TemplateMaker

               •       scrubyt

    http://ariel.rubyforge.org/index.html
    http://www.dia.uniroma3.it/db/roadRunner/
    http://code.google.com/p/templatemaker/
    http://scrubyt.rubyforge.org/files/README.html

Wednesday, September 14, 2011
QUESTIONS?



Wednesday, September 14, 2011
FEEDBACK:


             nate@xcombinator.com
                                www.xcombinator.com

                                   @xcombinator
Wednesday, September 14, 2011

More Related Content

What's hot

Top 5 Benefits of SEO
Top 5 Benefits of SEOTop 5 Benefits of SEO
Top 5 Benefits of SEOMedia-Mosaic
 
How to build effective SEO teams from scratch - SMXL (Milan) 2021
How to build effective SEO teams from scratch - SMXL (Milan) 2021How to build effective SEO teams from scratch - SMXL (Milan) 2021
How to build effective SEO teams from scratch - SMXL (Milan) 2021SeoProfy Presentations
 
Seo strategy guide 2019
Seo strategy guide 2019Seo strategy guide 2019
Seo strategy guide 2019Sanjay Patwal
 
Search enginge optimization (seo) proposal
Search enginge optimization (seo) proposalSearch enginge optimization (seo) proposal
Search enginge optimization (seo) proposalVikas Kashyap
 
SEO proposal - free template - SEOmonitor.com
SEO proposal - free template - SEOmonitor.comSEO proposal - free template - SEOmonitor.com
SEO proposal - free template - SEOmonitor.comMadalina-Carla Mihai
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?Abhishek Mitra
 
SEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksSEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksChakra Systems
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering Bill Slawski
 
Victor Karpenko (SeoProfy) CMSEO 2019 Presentation
Victor Karpenko (SeoProfy) CMSEO 2019 PresentationVictor Karpenko (SeoProfy) CMSEO 2019 Presentation
Victor Karpenko (SeoProfy) CMSEO 2019 PresentationSeoProfy Presentations
 
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)Gianna Brachetti-Truskawa 🐙
 
What is SEO? - Basic SEO Guide for Beginners.pptx
What is SEO? - Basic SEO Guide for Beginners.pptxWhat is SEO? - Basic SEO Guide for Beginners.pptx
What is SEO? - Basic SEO Guide for Beginners.pptxGeromme Talampas
 
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO Processes
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO ProcessesThriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO Processes
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO ProcessesAleyda Solís
 
How to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommerceHow to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommercePierreOlivierDanhaiv1
 
Improving Crawling and Indexing using Real-Time Log File Insights
Improving Crawling and Indexing using Real-Time Log File InsightsImproving Crawling and Indexing using Real-Time Log File Insights
Improving Crawling and Indexing using Real-Time Log File InsightsSteven van Vessum
 

What's hot (20)

Top 5 Benefits of SEO
Top 5 Benefits of SEOTop 5 Benefits of SEO
Top 5 Benefits of SEO
 
How to build effective SEO teams from scratch - SMXL (Milan) 2021
How to build effective SEO teams from scratch - SMXL (Milan) 2021How to build effective SEO teams from scratch - SMXL (Milan) 2021
How to build effective SEO teams from scratch - SMXL (Milan) 2021
 
Seo strategy guide 2019
Seo strategy guide 2019Seo strategy guide 2019
Seo strategy guide 2019
 
Search enginge optimization (seo) proposal
Search enginge optimization (seo) proposalSearch enginge optimization (seo) proposal
Search enginge optimization (seo) proposal
 
SEO proposal - free template - SEOmonitor.com
SEO proposal - free template - SEOmonitor.comSEO proposal - free template - SEOmonitor.com
SEO proposal - free template - SEOmonitor.com
 
What is a Robot txt file?
What is a Robot txt file?What is a Robot txt file?
What is a Robot txt file?
 
SEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO RanksSEO Search Engine Optimization Proposal - Organic SEO Ranks
SEO Search Engine Optimization Proposal - Organic SEO Ranks
 
New Seo Proposal
New Seo ProposalNew Seo Proposal
New Seo Proposal
 
Seo Marketing Plan Ppt
Seo Marketing Plan PptSeo Marketing Plan Ppt
Seo Marketing Plan Ppt
 
SEO
SEO SEO
SEO
 
Slawski New Approaches for Structured Data:Evolution of Question Answering
Slawski   New Approaches for Structured Data:Evolution of Question Answering Slawski   New Approaches for Structured Data:Evolution of Question Answering
Slawski New Approaches for Structured Data:Evolution of Question Answering
 
Victor Karpenko (SeoProfy) CMSEO 2019 Presentation
Victor Karpenko (SeoProfy) CMSEO 2019 PresentationVictor Karpenko (SeoProfy) CMSEO 2019 Presentation
Victor Karpenko (SeoProfy) CMSEO 2019 Presentation
 
SEO Strategy Guide [2019]
 SEO Strategy Guide [2019] SEO Strategy Guide [2019]
SEO Strategy Guide [2019]
 
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
TECHNICAL SEO QA - SHINING A LIGHT ON INVISIBLE WORK (BrightonSEO April 2022)
 
What is SEO? - Basic SEO Guide for Beginners.pptx
What is SEO? - Basic SEO Guide for Beginners.pptxWhat is SEO? - Basic SEO Guide for Beginners.pptx
What is SEO? - Basic SEO Guide for Beginners.pptx
 
SEO Marketing
SEO Marketing SEO Marketing
SEO Marketing
 
Sample SEO Proposal
Sample SEO ProposalSample SEO Proposal
Sample SEO Proposal
 
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO Processes
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO ProcessesThriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO Processes
Thriving as an SEO Specialist: Frameworks & Tips to Manage Complex SEO Processes
 
How to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommerceHow to automate a long tail SEO strategy for ecommerce
How to automate a long tail SEO strategy for ecommerce
 
Improving Crawling and Indexing using Real-Time Log File Insights
Improving Crawling and Indexing using Real-Time Log File InsightsImproving Crawling and Indexing using Real-Time Log File Insights
Improving Crawling and Indexing using Real-Time Log File Insights
 

Viewers also liked

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawlingDenis Shestakov
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirmeahmetdemirelli
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityRenato Lucindo
 

Viewers also liked (10)

Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Java Web Uygulama Geliştirme
Java Web Uygulama GeliştirmeJava Web Uygulama Geliştirme
Java Web Uygulama Geliştirme
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Web Crawlers
Web CrawlersWeb Crawlers
Web Crawlers
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Distributed Systems: scalability and high availability
Distributed Systems: scalability and high availabilityDistributed Systems: scalability and high availability
Distributed Systems: scalability and high availability
 

Recently uploaded

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Introduction to web crawling & extraction

  • 1. introduction to WEB CRAWLING & extraction by Nate Murray Wednesday, September 14, 2011
  • 2. WHO AM I ? Wednesday, September 14, 2011
  • 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005 Wednesday, September 14, 2011
  • 4. what is WEB CRAWLING ? Wednesday, September 14, 2011
  • 5. definition: web crawler a program that browses the web. Wednesday, September 14, 2011
  • 6. definition: web extraction transforming unstructured web data into structured data Wednesday, September 14, 2011
  • 7. definition: web extraction transforming semistructured web data into structured data Wednesday, September 14, 2011
  • 10. motivation: bookmark buddies URL Title Users Wednesday, September 14, 2011
  • 12. motivation: business hours Wednesday, September 14, 2011
  • 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00 Wednesday, September 14, 2011
  • 16. motivation: recommend videos Users Wednesday, September 14, 2011
  • 18. motivation: vertical search Wednesday, September 14, 2011
  • 19. motivation: vertical search Image SKU Name Price Rating Wednesday, September 14, 2011
  • 22. DESIRED PROPERTIES SPEED Wednesday, September 14, 2011
  • 24. CONSTRAINTS • Politeness Wednesday, September 14, 2011
  • 25. CONSTRAINTS • Politeness • Distributed Wednesday, September 14, 2011
  • 26. CONSTRAINTS • Politeness • Distributed • Linear Scalability Wednesday, September 14, 2011
  • 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning Wednesday, September 14, 2011
  • 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlap Wednesday, September 14, 2011
  • 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly once Wednesday, September 14, 2011
  • 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap Wednesday, September 14, 2011
  • 36. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 37. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 38. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 39. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 40. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 41. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 42. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 43. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 44. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 45. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 46. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 47. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 48. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web Data Wednesday, September 14, 2011
  • 51. challenges: depends on your ambitions Wednesday, September 14, 2011
  • 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html Wednesday, September 14, 2011
  • 53. challenges: small crawls are easy Wednesday, September 14, 2011
  • 54. challenges: < 10MM small crawls are easy Wednesday, September 14, 2011
  • 55. challenges: large crawls are interesting Wednesday, September 14, 2011
  • 57. challenges: DNS Lookup Wednesday, September 14, 2011
  • 58. challenges: DNS Lookup URLs Crawled Wednesday, September 14, 2011
  • 59. challenges: DNS Lookup URLs Crawled Politeness Wednesday, September 14, 2011
  • 60. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Wednesday, September 14, 2011
  • 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Wednesday, September 14, 2011
  • 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLs Wednesday, September 14, 2011
  • 63. challenges: DNS LOOKUP Wednesday, September 14, 2011
  • 64. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 65. challenges: DNS LOOKUP can easily be a bottleneck Wednesday, September 14, 2011
  • 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc. Wednesday, September 14, 2011
  • 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clients Wednesday, September 14, 2011
  • 68. challenges: DNS LOOKUP You’ll know when you need it Wednesday, September 14, 2011
  • 69. challenges: URLs CRAWLED Wednesday, September 14, 2011
  • 70. Initialize:     UrlsDone = null     UrlFrontier = {'google.com/index.html', ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl) Wednesday, September 14, 2011
  • 71. challenges: URLs CRAWLED 1 machine, store in memory Wednesday, September 14, 2011
  • 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION Wednesday, September 14, 2011
  • 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations Wednesday, September 14, 2011
  • 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 Wednesday, September 14, 2011
  • 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million Wednesday, September 14, 2011
  • 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytes Wednesday, September 14, 2011
  • 77. can we do better? Wednesday, September 14, 2011
  • 79. BLOOM FILTERS answers the question: is this item in the set? Wednesday, September 14, 2011
  • 80. BLOOM FILTERS answers either: Wednesday, September 14, 2011
  • 81. BLOOM FILTERS answers either: • yes, probably Wednesday, September 14, 2011
  • 82. BLOOM FILTERS answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely not Wednesday, September 14, 2011
  • 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8 Wednesday, September 14, 2011
  • 89. BLOOM FILTER drawbacks Wednesday, September 14, 2011
  • 90. BLOOM FILTER drawbacks • probabilistic - occasional errors Wednesday, September 14, 2011
  • 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time Wednesday, September 14, 2011
  • 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t delete Wednesday, September 14, 2011
  • 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete Wednesday, September 14, 2011
  • 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) Wednesday, September 14, 2011
  • 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade them Wednesday, September 14, 2011
  • 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ Wednesday, September 14, 2011
  • 99. challenges: POLITENESS Wednesday, September 14, 2011
  • 101. rule of thumb: wait 2 seconds (w.r.t. ip) Wednesday, September 14, 2011
  • 103. centralized politeness SPOF Wednesday, September 14, 2011
  • 104. centralized politeness SPOF contention Wednesday, September 14, 2011
  • 105. challenges: POLITENESS Wednesday, September 14, 2011
  • 106. challenges: POLITENESS • Options: Wednesday, September 14, 2011
  • 107. challenges: POLITENESS • Options: • central database Wednesday, September 14, 2011
  • 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) Wednesday, September 14, 2011
  • 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution Wednesday, September 14, 2011
  • 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) Wednesday, September 14, 2011
  • 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/ Wednesday, September 14, 2011
  • 112. challenges: URL FRONTIER Wednesday, September 14, 2011
  • 114. idea: consistently distribute URLs based on IP Wednesday, September 14, 2011
  • 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3 Wednesday, September 14, 2011
  • 116. benefits: same IP always goes to same machine simple Wednesday, September 14, 2011
  • 117. drawbacks: susceptible to skew can’t add / remove nodes without pain Wednesday, September 14, 2011
  • 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP) Wednesday, September 14, 2011
  • 124. drawbacks: naive solution won’t work for large sites Wednesday, September 14, 2011
  • 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al. Wednesday, September 14, 2011
  • 126. challenges: QUEUEING URLS Wednesday, September 14, 2011
  • 128. situation: URL Wednesday, September 14, 2011
  • 129. situation: URL not recently crawled Wednesday, September 14, 2011
  • 130. situation: URL not recently crawled allowed by robots.txt Wednesday, September 14, 2011
  • 131. situation: URL not recently crawled allowed by robots.txt polite Wednesday, September 14, 2011
  • 132. how to you order them? (within a single machine) Wednesday, September 14, 2011
  • 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ Wednesday, September 14, 2011
  • 134. 1 2 3 Wednesday, September 14, 2011
  • 135. 1 2 3 Wednesday, September 14, 2011
  • 136. 1 2 3 Wednesday, September 14, 2011
  • 137. 1 2 3 Wednesday, September 14, 2011
  • 138. 1 2 3 Wednesday, September 14, 2011
  • 139. 1 2 3 Wednesday, September 14, 2011
  • 140. 1 2 3 Wednesday, September 14, 2011
  • 141. 1 2 3 Wednesday, September 14, 2011
  • 142. 1 2 3 Wednesday, September 14, 2011
  • 143. 1 2 3 Wednesday, September 14, 2011
  • 144. 1 2 3 Wednesday, September 14, 2011
  • 145. 1 2 3 Wednesday, September 14, 2011
  • 146. 1 2 3 Wednesday, September 14, 2011
  • 147. 1 2 3 Wednesday, September 14, 2011
  • 148. 1 2 3 Wednesday, September 14, 2011
  • 149. 1 2 3 Wednesday, September 14, 2011
  • 150. 1 2 3 Wednesday, September 14, 2011
  • 151. 1 2 3 Wednesday, September 14, 2011
  • 152. 1 2 3 Wednesday, September 14, 2011
  • 153. 1 2 3 Wednesday, September 14, 2011
  • 154. 1 2 3 Wednesday, September 14, 2011
  • 155. ERLANG lookup: erlang B / C / engset Wednesday, September 14, 2011
  • 156. as many threads as possible Wednesday, September 14, 2011
  • 157. don’t sort input URLs Wednesday, September 14, 2011
  • 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 wait Wednesday, September 14, 2011
  • 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 Wednesday, September 14, 2011
  • 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/ Wednesday, September 14, 2011
  • 169. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 170. challenges: EXTRACTING URLS the internet is full of garbage Wednesday, September 14, 2011
  • 171. challenges: EXTRACTING URLS Wednesday, September 14, 2011
  • 172. challenges: EXTRACTING URLS enormous pages Wednesday, September 14, 2011
  • 173. challenges: EXTRACTING URLS enormous pages terrible markup Wednesday, September 14, 2011
  • 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls Wednesday, September 14, 2011
  • 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ Wednesday, September 14, 2011
  • 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net” Wednesday, September 14, 2011
  • 177. challenges: EXTRACTING URLS be prepared: Wednesday, September 14, 2011
  • 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parser Wednesday, September 14, 2011
  • 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup Wednesday, September 14, 2011
  • 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII Wednesday, September 14, 2011
  • 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizer Wednesday, September 14, 2011
  • 184. software advice: • goals determine scale Wednesday, September 14, 2011
  • 185. software advice: • goals determine scale • someone else has already done it Wednesday, September 14, 2011
  • 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/ Wednesday, September 14, 2011
  • 188. java crawlers: • Heritrix (Internet Archive) Wednesday, September 14, 2011
  • 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) Wednesday, September 14, 2011
  • 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) Wednesday, September 14, 2011
  • 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/ Wednesday, September 14, 2011
  • 193. extraction packages: • mechanize Wednesday, September 14, 2011
  • 194. extraction packages: • mechanize • BeautifulSoup & urllib2 Wednesday, September 14, 2011
  • 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy Wednesday, September 14, 2011
  • 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/ Wednesday, September 14, 2011
  • 198. wrapper induction(ish) • Ariel Wednesday, September 14, 2011
  • 199. wrapper induction(ish) • Ariel • RoadRunner Wednesday, September 14, 2011
  • 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker Wednesday, September 14, 2011
  • 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt Wednesday, September 14, 2011
  • 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.html Wednesday, September 14, 2011
  • 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinator Wednesday, September 14, 2011