0
introduction to                        WEB CRAWLING                            & extraction                 by Nate Murray...
WHO AM I ?Wednesday, September 14, 2011
Nate Murray                        AT&T Interactive (Yellowpages.com)                                 TB-scale data since ...
what is                        WEB CRAWLING ?Wednesday, September 14, 2011
definition:                 web crawler                            a program that browses the web.Wednesday, September 14, ...
definition:                 web extraction                    transforming unstructured web data into                      ...
definition:                 web extraction                    transforming semistructured web data into                    ...
motivationWednesday, September 14, 2011
motivation: bookmark buddiesWednesday, September 14, 2011
motivation: bookmark buddies                                URL Title                                 UsersWednesday, Sept...
motivation:Wednesday, September 14, 2011
motivation:               business hoursWednesday, September 14, 2011
motivation:               business hours                                Day        Openness                               ...
motivation:Wednesday, September 14, 2011
motivation: recommend videosWednesday, September 14, 2011
motivation: recommend videos                                UsersWednesday, September 14, 2011
motivation:Wednesday, September 14, 2011
motivation:               vertical searchWednesday, September 14, 2011
motivation:               vertical search                  Image                   SKU                  Name              ...
motivation:Wednesday, September 14, 2011
DESIRED PROPERTIESWednesday, September 14, 2011
DESIRED PROPERTIES                                SPEEDWednesday, September 14, 2011
CONSTRAINTSWednesday, September 14, 2011
CONSTRAINTS         • PolitenessWednesday, September 14, 2011
CONSTRAINTS         • Politeness         • DistributedWednesday, September 14, 2011
CONSTRAINTS         • Politeness         • Distributed              • Linear ScalabilityWednesday, September 14, 2011
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability              • Even partitioningW...
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability              • Even partitioning ...
CONSTRAINTS         • Politeness                it’s easy to burden         • Distributed                                 ...
CONSTRAINTS         • Politeness         • Distributed               (for any significant                                  ...
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability      n machines =                ...
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability              • Even partitioning ...
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability              • Even partitioning ...
CONSTRAINTS         • Politeness         • Distributed              • Linear Scalability              • Even partitioning ...
BASIC ALGORITHMWednesday, September 14, 2011
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
architecture overview         CRAWL                                             FETCHER                   INTERNET        ...
CHALLENGESWednesday, September 14, 2011
challenges:                                depends on your ambitionsWednesday, September 14, 2011
challenges:                                Google’s Index Size:                                 1998 - 26 million         ...
challenges:                                small crawls are easyWednesday, September 14, 2011
challenges:                                     < 10MM                                small crawls are easyWednesday, Sept...
challenges:                                large crawls are interestingWednesday, September 14, 2011
challenges:Wednesday, September 14, 2011
challenges:                                DNS LookupWednesday, September 14, 2011
challenges:                                DNS Lookup                                URLs CrawledWednesday, September 14, ...
challenges:                                DNS Lookup                                URLs Crawled                         ...
challenges:                                DNS Lookup                                URLs Crawled                         ...
challenges:                                 DNS Lookup                                URLs Crawled                        ...
challenges:                                 DNS Lookup                                 URLs Crawled                       ...
challenges:                                DNS LOOKUPWednesday, September 14, 2011
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
challenges:      DNS LOOKUP                                can easily be a bottleneckWednesday, September 14, 2011
challenges:      DNS LOOKUP           • consider running your own DNS servers             • djbdns             • PowerDNS ...
challenges:      DNS LOOKUP           • be aware of software limitations                • gethostbyaddr is synchronized   ...
challenges:      DNS LOOKUP                                You’ll know when you need itWednesday, September 14, 2011
challenges:                                URLs CRAWLEDWednesday, September 14, 2011
Initialize:             UrlsDone = null             UrlFrontier = {google.com/index.html, ..}         Repeat             u...
challenges:      URLs CRAWLED                                1 machine, store in memoryWednesday, September 14, 2011
challenges:      URLs CRAWLED                                1 machine, store in memory                                NAP...
challenges:      URLs CRAWLED                                1 machine, store in memory                                NAP...
challenges:      URLs CRAWLED                                1 machine, store in memory                                NAP...
challenges:      URLs CRAWLED                                1 machine, store in memory                                NAP...
challenges:      URLs CRAWLED                                1 machine, store in memory                                NAP...
can we do better?Wednesday, September 14, 2011
BLOOM FILTERSWednesday, September 14, 2011
BLOOM FILTERS                                     answers the question:                                is this item in the...
BLOOM FILTERS                                answers either:Wednesday, September 14, 2011
BLOOM FILTERS                                    answers either:                                • yes, probablyWednesday, ...
BLOOM FILTERS                                    answers either:                                • yes, probably           ...
BLOOM FILTERS                Have we crawled: http://www.xcombinator.com?                                    answers eithe...
BLOOM FILTERS                Have we crawled: http://www.xcombinator.com?                                    answers eithe...
challenges:      URLs CRAWLED                                1 machine, bloom filter                100 million URLs       ...
challenges:      URLs CRAWLED                                 1 machine, bloom filter                                NAPKIN...
challenges:      URLs CRAWLED                                 1 machine, bloom filter                                NAPKIN...
BLOOM FILTERWednesday, September 14, 2011
BLOOM FILTER                       drawbacksWednesday, September 14, 2011
BLOOM FILTER                       drawbacks   • probabilistic - occasional errorsWednesday, September 14, 2011
BLOOM FILTER                       drawbacks   • probabilistic - occasional errors   • estimate # of items ahead of timeWe...
BLOOM FILTER                       drawbacks   • probabilistic - occasional errors   • estimate # of items ahead of time  ...
BLOOM FILTER                       drawbacks                                         solutions   • probabilistic - occasio...
BLOOM FILTER                       drawbacks                                            solutions   • probabilistic - occa...
BLOOM FILTER                       drawbacks                                         solutions   • probabilistic - occasio...
BLOOM FILTER                       drawbacks                                         solutions   • probabilistic - occasio...
BLOOM FILTER                       drawbacks                                         solutions   • probabilistic - occasio...
BLOOM FILTERS                                         references:       http://en.wikipedia.org/wiki/Bloom_filter       ht...
challenges:                                POLITENESSWednesday, September 14, 2011
obey robots.txtWednesday, September 14, 2011
rule of thumb:                                wait 2 seconds (w.r.t. ip)Wednesday, September 14, 2011
centralized politenessWednesday, September 14, 2011
centralized politeness                                SPOFWednesday, September 14, 2011
centralized politeness                                  SPOF                                contentionWednesday, September...
challenges:      POLITENESSWednesday, September 14, 2011
challenges:      POLITENESS                • Options:Wednesday, September 14, 2011
challenges:      POLITENESS                • Options:                  • central databaseWednesday, September 14, 2011
challenges:      POLITENESS                • Options:                  • central database                  • distributed l...
challenges:      POLITENESS                • Options:                  • central database                  • distributed l...
challenges:      POLITENESS                • Options:                  • central database                  • distributed l...
challenges:      POLITENESS                • Options:                  • central database                  • distributed l...
challenges:                                URL FRONTIERWednesday, September 14, 2011
url frontierWednesday, September 14, 2011
idea:                          consistently distribute URLs based on IPWednesday, September 14, 2011
modulo                                IP      SHA-1       bucket (mod 5)                  174.132.225.106     4dd14b0b... ...
benefits:                                same IP always goes to same machine                                              s...
drawbacks:                                         susceptible to skew                                can’t add / remove n...
consistent hashingWednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
benefits:                                ~ 1/(n+1) URLs move on add/remove                                      virtual nod...
drawbacks:                           naive solution won’t work for large sitesWednesday, September 14, 2011
further reading:           Chord: A Scalable Peer-to-Peer Lookup Protocol for           Internet Applications (2001) Stoic...
challenges:                                QUEUEING URLSWednesday, September 14, 2011
situation:Wednesday, September 14, 2011
situation:                                   URLWednesday, September 14, 2011
situation:                                   URL                                   not recently crawledWednesday, Septembe...
situation:                                   URL                                   not recently crawled                   ...
situation:                                   URL                                   not recently crawled                   ...
how to you order them?                                (within a single machine)Wednesday, September 14, 2011
hash each lane:                1                         2                               3                   http://yachtm...
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
1           2   3Wednesday, September 14, 2011
ERLANG                                lookup: erlang B / C / engsetWednesday, September 14, 2011
as many threads as possibleWednesday, September 14, 2011
don’t sort input URLsWednesday, September 14, 2011
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/                                                       ...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/                                                       ...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://abcnews.go.com/   http://abcnews.go.com/2020/ABCNEWSSpecial/   http://abcnews.go.com/2020/story?id=207269&amp;page=...
http://yachtmaintenanceco.com/   http://www.amsterdamports.nl/   http://www.4s-dawn.com/   http://www.embassysuiteslittler...
http://yachtmaintenanceco.com/   http://www.amsterdamports.nl/   http://www.4s-dawn.com/   http://www.embassysuiteslittler...
no waiting!   http://yachtmaintenanceco.com/   http://www.amsterdamports.nl/   http://www.4s-dawn.com/   http://www.embass...
challenges:                                EXTRACTING URLSWednesday, September 14, 2011
challenges:      EXTRACTING URLS                                the internet is full of garbageWednesday, September 14, 2011
challenges:      EXTRACTING URLSWednesday, September 14, 2011
challenges:      EXTRACTING URLS                                enormous pagesWednesday, September 14, 2011
challenges:      EXTRACTING URLS                                enormous pages                                terrible mar...
challenges:      EXTRACTING URLS                                enormous pages                                terrible mar...
challenges:      EXTRACTING URLS                                enormous pages                                terrible mar...
challenges:      EXTRACTING URLS                                     enormous pages                                     te...
challenges:      EXTRACTING URLS                                be prepared:Wednesday, September 14, 2011
challenges:      EXTRACTING URLS                                       be prepared:                                use a s...
challenges:      EXTRACTING URLS                                       be prepared:                                use a s...
challenges:      EXTRACTING URLS                                       be prepared:                                use a s...
challenges:      EXTRACTING URLS                                       be prepared:                                use a s...
SOFTWAREWednesday, September 14, 2011
software advice:Wednesday, September 14, 2011
software advice:               •       goals determine scaleWednesday, September 14, 2011
software advice:               •       goals determine scale               •       someone else has already done itWednesd...
2 second crawler:    function wgetspider() {      wget --html-extension --convert-links --mirror         --page-requisites...
java crawlers:Wednesday, September 14, 2011
java crawlers:               •       Heritrix (Internet Archive)Wednesday, September 14, 2011
java crawlers:               •       Heritrix (Internet Archive)               •       Nutch (Lucene)Wednesday, September ...
java crawlers:               •       Heritrix (Internet Archive)               •       Nutch (Lucene)               •     ...
java crawlers:               •       Heritrix (Internet Archive)               •       Nutch (Lucene)               •     ...
extraction packages:Wednesday, September 14, 2011
extraction packages:               •       mechanizeWednesday, September 14, 2011
extraction packages:               •       mechanize               •       BeautifulSoup & urllib2Wednesday, September 14,...
extraction packages:               •       mechanize               •       BeautifulSoup & urllib2               •       S...
extraction packages:               •       mechanize               •       BeautifulSoup & urllib2               •       S...
wrapper induction(ish)Wednesday, September 14, 2011
wrapper induction(ish)               •       ArielWednesday, September 14, 2011
wrapper induction(ish)               •       Ariel               •       RoadRunnerWednesday, September 14, 2011
wrapper induction(ish)               •       Ariel               •       RoadRunner               •       TemplateMakerWed...
wrapper induction(ish)               •       Ariel               •       RoadRunner               •       TemplateMaker   ...
wrapper induction(ish)               •       Ariel               •       RoadRunner               •       TemplateMaker   ...
QUESTIONS?Wednesday, September 14, 2011
FEEDBACK:             nate@xcombinator.com                                www.xcombinator.com                             ...
Upcoming SlideShare
Loading in...5
×

Challenges in Large-Scale Web Crawling

10,545

Published on

Published in: Technology, Business
0 Comments
29 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
10,545
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
231
Comments
0
Likes
29
Embeds 0
No embeds

No notes for slide

Transcript of "Challenges in Large-Scale Web Crawling"

  1. 1. introduction to WEB CRAWLING & extraction by Nate MurrayWednesday, September 14, 2011
  2. 2. WHO AM I ?Wednesday, September 14, 2011
  3. 3. Nate Murray AT&T Interactive (Yellowpages.com) TB-scale data since 2009 Various crawlers since 2005Wednesday, September 14, 2011
  4. 4. what is WEB CRAWLING ?Wednesday, September 14, 2011
  5. 5. definition: web crawler a program that browses the web.Wednesday, September 14, 2011
  6. 6. definition: web extraction transforming unstructured web data into structured dataWednesday, September 14, 2011
  7. 7. definition: web extraction transforming semistructured web data into structured dataWednesday, September 14, 2011
  8. 8. motivationWednesday, September 14, 2011
  9. 9. motivation: bookmark buddiesWednesday, September 14, 2011
  10. 10. motivation: bookmark buddies URL Title UsersWednesday, September 14, 2011
  11. 11. motivation:Wednesday, September 14, 2011
  12. 12. motivation: business hoursWednesday, September 14, 2011
  13. 13. motivation: business hours Day Openness Mon Closed Tue 11:30-14:30 17:30-22:00 Wed 11:30-14:30 17:30-22:00 Thur 11:30-14:30 17:30-22:00 Fri 11:30-14:30 17:30-22:00 Sat 12:00-14:30 17:00-22:00 Sun - 17:00-21:00Wednesday, September 14, 2011
  14. 14. motivation:Wednesday, September 14, 2011
  15. 15. motivation: recommend videosWednesday, September 14, 2011
  16. 16. motivation: recommend videos UsersWednesday, September 14, 2011
  17. 17. motivation:Wednesday, September 14, 2011
  18. 18. motivation: vertical searchWednesday, September 14, 2011
  19. 19. motivation: vertical search Image SKU Name Price RatingWednesday, September 14, 2011
  20. 20. motivation:Wednesday, September 14, 2011
  21. 21. DESIRED PROPERTIESWednesday, September 14, 2011
  22. 22. DESIRED PROPERTIES SPEEDWednesday, September 14, 2011
  23. 23. CONSTRAINTSWednesday, September 14, 2011
  24. 24. CONSTRAINTS • PolitenessWednesday, September 14, 2011
  25. 25. CONSTRAINTS • Politeness • DistributedWednesday, September 14, 2011
  26. 26. CONSTRAINTS • Politeness • Distributed • Linear ScalabilityWednesday, September 14, 2011
  27. 27. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioningWednesday, September 14, 2011
  28. 28. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  29. 29. CONSTRAINTS • Politeness it’s easy to burden • Distributed small servers • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  30. 30. CONSTRAINTS • Politeness • Distributed (for any significant crawl) • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  31. 31. CONSTRAINTS • Politeness • Distributed • Linear Scalability n machines = n*m pages-per-second • Even partitioning • Minimum overlapWednesday, September 14, 2011
  32. 32. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning every machine should perform equal work • Minimum overlapWednesday, September 14, 2011
  33. 33. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlap crawl each page exactly onceWednesday, September 14, 2011
  34. 34. CONSTRAINTS • Politeness • Distributed • Linear Scalability • Even partitioning • Minimum overlapWednesday, September 14, 2011
  35. 35. BASIC ALGORITHMWednesday, September 14, 2011
  36. 36. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  37. 37. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  38. 38. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  39. 39. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  40. 40. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  41. 41. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  42. 42. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  43. 43. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  44. 44. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  45. 45. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  46. 46. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  47. 47. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  48. 48. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  49. 49. architecture overview CRAWL FETCHER INTERNET URLs Web Data PLANNER URL QUEUE Web Data STORAGE Web DataWednesday, September 14, 2011
  50. 50. CHALLENGESWednesday, September 14, 2011
  51. 51. challenges: depends on your ambitionsWednesday, September 14, 2011
  52. 52. challenges: Google’s Index Size: 1998 - 26 million 2005 - 8 billion 2008 - 1 trillion http://www.nytimes.com/2005/08/15/technology/15search.html http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.htmlWednesday, September 14, 2011
  53. 53. challenges: small crawls are easyWednesday, September 14, 2011
  54. 54. challenges: < 10MM small crawls are easyWednesday, September 14, 2011
  55. 55. challenges: large crawls are interestingWednesday, September 14, 2011
  56. 56. challenges:Wednesday, September 14, 2011
  57. 57. challenges: DNS LookupWednesday, September 14, 2011
  58. 58. challenges: DNS Lookup URLs CrawledWednesday, September 14, 2011
  59. 59. challenges: DNS Lookup URLs Crawled PolitenessWednesday, September 14, 2011
  60. 60. challenges: DNS Lookup URLs Crawled Politeness URL FrontierWednesday, September 14, 2011
  61. 61. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLsWednesday, September 14, 2011
  62. 62. challenges: DNS Lookup URLs Crawled Politeness URL Frontier Queueing URLs Extracting URLsWednesday, September 14, 2011
  63. 63. challenges: DNS LOOKUPWednesday, September 14, 2011
  64. 64. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  65. 65. challenges: DNS LOOKUP can easily be a bottleneckWednesday, September 14, 2011
  66. 66. challenges: DNS LOOKUP • consider running your own DNS servers • djbdns • PowerDNS • etc.Wednesday, September 14, 2011
  67. 67. challenges: DNS LOOKUP • be aware of software limitations • gethostbyaddr is synchronized • same with many “default” DNS clientsWednesday, September 14, 2011
  68. 68. challenges: DNS LOOKUP You’ll know when you need itWednesday, September 14, 2011
  69. 69. challenges: URLs CRAWLEDWednesday, September 14, 2011
  70. 70. Initialize:     UrlsDone = null     UrlFrontier = {google.com/index.html, ..} Repeat     url = UrlFrontier.getNext()     ip = DNSlookup(url.getHostname())     html = DownloadPage(ip, url.getPath())     UrlsDone.insert(url)     newUrls = parseForLinks(html)     For each newUrl       If not UrlsDone.contains(newUrl)       then UrlsTodo.insert(newUrl)Wednesday, September 14, 2011
  71. 71. challenges: URLs CRAWLED 1 machine, store in memoryWednesday, September 14, 2011
  72. 72. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATIONWednesday, September 14, 2011
  73. 73. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentationsWednesday, September 14, 2011
  74. 74. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712Wednesday, September 14, 2011
  75. 75. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 millionWednesday, September 14, 2011
  76. 76. challenges: URLs CRAWLED 1 machine, store in memory NAPKIN CALCULATION ~50 bytes per URL e.g. http://wiki.apache.org/cassandra/ArticlesAndPresentations +8 bytes for time-last-crawled as long e.g. System.currentTimeMillis() -> 1314392455712 x 100 million =~ 5.4 gigabytesWednesday, September 14, 2011
  77. 77. can we do better?Wednesday, September 14, 2011
  78. 78. BLOOM FILTERSWednesday, September 14, 2011
  79. 79. BLOOM FILTERS answers the question: is this item in the set?Wednesday, September 14, 2011
  80. 80. BLOOM FILTERS answers either:Wednesday, September 14, 2011
  81. 81. BLOOM FILTERS answers either: • yes, probablyWednesday, September 14, 2011
  82. 82. BLOOM FILTERS answers either: • yes, probably • definitely notWednesday, September 14, 2011
  83. 83. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
  84. 84. BLOOM FILTERS Have we crawled: http://www.xcombinator.com? answers either: • yes, probably • definitely notWednesday, September 14, 2011
  85. 85. challenges: URLs CRAWLED 1 machine, bloom filter 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  86. 86. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  87. 87. challenges: URLs CRAWLED 1 machine, bloom filter NAPKIN CALCULATION 100 million URLs 1 in 100 million chance of false positive =~ 457 megabytes see: http://hur.st/bloomfilter?n=100000000&p=1.0E-8Wednesday, September 14, 2011
  88. 88. BLOOM FILTERWednesday, September 14, 2011
  89. 89. BLOOM FILTER drawbacksWednesday, September 14, 2011
  90. 90. BLOOM FILTER drawbacks • probabilistic - occasional errorsWednesday, September 14, 2011
  91. 91. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of timeWednesday, September 14, 2011
  92. 92. BLOOM FILTER drawbacks • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  93. 93. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  94. 94. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • can’t deleteWednesday, September 14, 2011
  95. 95. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t deleteWednesday, September 14, 2011
  96. 96. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days)Wednesday, September 14, 2011
  97. 97. BLOOM FILTER drawbacks solutions • probabilistic - occasional errors • acceptable • estimate # of items ahead of time • not hard, see Dynamic BFs • can’t delete • pick granularity (days) • cascade themWednesday, September 14, 2011
  98. 98. BLOOM FILTERS references: http://en.wikipedia.org/wiki/Bloom_filter http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/Wednesday, September 14, 2011
  99. 99. challenges: POLITENESSWednesday, September 14, 2011
  100. 100. obey robots.txtWednesday, September 14, 2011
  101. 101. rule of thumb: wait 2 seconds (w.r.t. ip)Wednesday, September 14, 2011
  102. 102. centralized politenessWednesday, September 14, 2011
  103. 103. centralized politeness SPOFWednesday, September 14, 2011
  104. 104. centralized politeness SPOF contentionWednesday, September 14, 2011
  105. 105. challenges: POLITENESSWednesday, September 14, 2011
  106. 106. challenges: POLITENESS • Options:Wednesday, September 14, 2011
  107. 107. challenges: POLITENESS • Options: • central databaseWednesday, September 14, 2011
  108. 108. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper)Wednesday, September 14, 2011
  109. 109. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distributionWednesday, September 14, 2011
  110. 110. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science)Wednesday, September 14, 2011
  111. 111. challenges: POLITENESS • Options: • central database • distributed locks (paxos/sigma/zookeeper) • controlled URL distribution http://en.wikipedia.org/wiki/Paxos_(computer_science) http://zookeeper.apache.org/Wednesday, September 14, 2011
  112. 112. challenges: URL FRONTIERWednesday, September 14, 2011
  113. 113. url frontierWednesday, September 14, 2011
  114. 114. idea: consistently distribute URLs based on IPWednesday, September 14, 2011
  115. 115. modulo IP SHA-1 bucket (mod 5) 174.132.225.106 4dd14b0b... 2 74.125.224.115 cf4b7594... 1 157.166.255.19 0ac4d141... 4 69.22.138.129 6c1584fa... 4 98.139.50.166 327252c5... 3Wednesday, September 14, 2011
  116. 116. benefits: same IP always goes to same machine simpleWednesday, September 14, 2011
  117. 117. drawbacks: susceptible to skew can’t add / remove nodes without painWednesday, September 14, 2011
  118. 118. consistent hashingWednesday, September 14, 2011
  119. 119. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  120. 120. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  121. 121. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  122. 122. source: http://michaelnielsen.org/blog/consistent-hashing/Wednesday, September 14, 2011
  123. 123. benefits: ~ 1/(n+1) URLs move on add/remove virtual nodes help skew robust (no SOP)Wednesday, September 14, 2011
  124. 124. drawbacks: naive solution won’t work for large sitesWednesday, September 14, 2011
  125. 125. further reading: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications (2001) Stoica et al. Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 Tapestry: A Resilient Global-Scale Overlay for Service Deployment (2004) Zhao et al.Wednesday, September 14, 2011
  126. 126. challenges: QUEUEING URLSWednesday, September 14, 2011
  127. 127. situation:Wednesday, September 14, 2011
  128. 128. situation: URLWednesday, September 14, 2011
  129. 129. situation: URL not recently crawledWednesday, September 14, 2011
  130. 130. situation: URL not recently crawled allowed by robots.txtWednesday, September 14, 2011
  131. 131. situation: URL not recently crawled allowed by robots.txt politeWednesday, September 14, 2011
  132. 132. how to you order them? (within a single machine)Wednesday, September 14, 2011
  133. 133. hash each lane: 1 2 3 http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/Wednesday, September 14, 2011
  134. 134. 1 2 3Wednesday, September 14, 2011
  135. 135. 1 2 3Wednesday, September 14, 2011
  136. 136. 1 2 3Wednesday, September 14, 2011
  137. 137. 1 2 3Wednesday, September 14, 2011
  138. 138. 1 2 3Wednesday, September 14, 2011
  139. 139. 1 2 3Wednesday, September 14, 2011
  140. 140. 1 2 3Wednesday, September 14, 2011
  141. 141. 1 2 3Wednesday, September 14, 2011
  142. 142. 1 2 3Wednesday, September 14, 2011
  143. 143. 1 2 3Wednesday, September 14, 2011
  144. 144. 1 2 3Wednesday, September 14, 2011
  145. 145. 1 2 3Wednesday, September 14, 2011
  146. 146. 1 2 3Wednesday, September 14, 2011
  147. 147. 1 2 3Wednesday, September 14, 2011
  148. 148. 1 2 3Wednesday, September 14, 2011
  149. 149. 1 2 3Wednesday, September 14, 2011
  150. 150. 1 2 3Wednesday, September 14, 2011
  151. 151. 1 2 3Wednesday, September 14, 2011
  152. 152. 1 2 3Wednesday, September 14, 2011
  153. 153. 1 2 3Wednesday, September 14, 2011
  154. 154. 1 2 3Wednesday, September 14, 2011
  155. 155. ERLANG lookup: erlang B / C / engsetWednesday, September 14, 2011
  156. 156. as many threads as possibleWednesday, September 14, 2011
  157. 157. don’t sort input URLsWednesday, September 14, 2011
  158. 158. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  159. 159. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  160. 160. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 wait http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  161. 161. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ fetch http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  162. 162. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? wait id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  163. 163. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 fetch http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  164. 164. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1 waitWednesday, September 14, 2011
  165. 165. http://abcnews.go.com/ http://abcnews.go.com/2020/ABCNEWSSpecial/ http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/2020/story?id=207269&amp;page=1 http://abcnews.go.com/GMA/JoelSiegel/story?id=1734395 http://abcnews.go.com/International/News/story? id=203089&amp;page=1 http://abcnews.go.com/International/Pope/ http://abcnews.go.com/International/story?id=81417&amp;page=1Wednesday, September 14, 2011
  166. 166. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  167. 167. http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  168. 168. no waiting! http://yachtmaintenanceco.com/ http://www.amsterdamports.nl/ http://www.4s-dawn.com/ http://www.embassysuiteslittlerock.com/ http://members.tripod.com/airfields_freeman/NM/Airfields_NM_NW.htm http://mdgroover.iweb.bsu.edu http://music.imbc.com/ http://www.robertjbradshaw.com http://www.kerkattenhoven.be http://www.escolania.org/ http://www.musiciansdfw.org/ http://www.ariana.org/Wednesday, September 14, 2011
  169. 169. challenges: EXTRACTING URLSWednesday, September 14, 2011
  170. 170. challenges: EXTRACTING URLS the internet is full of garbageWednesday, September 14, 2011
  171. 171. challenges: EXTRACTING URLSWednesday, September 14, 2011
  172. 172. challenges: EXTRACTING URLS enormous pagesWednesday, September 14, 2011
  173. 173. challenges: EXTRACTING URLS enormous pages terrible markupWednesday, September 14, 2011
  174. 174. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urlsWednesday, September 14, 2011
  175. 175. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/Wednesday, September 14, 2011
  176. 176. challenges: EXTRACTING URLS enormous pages terrible markup ridiculous urls .net/ “unicode snowman dot net”Wednesday, September 14, 2011
  177. 177. challenges: EXTRACTING URLS be prepared:Wednesday, September 14, 2011
  178. 178. challenges: EXTRACTING URLS be prepared: use a streaming XML parserWednesday, September 14, 2011
  179. 179. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markupWednesday, September 14, 2011
  180. 180. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCIIWednesday, September 14, 2011
  181. 181. challenges: EXTRACTING URLS be prepared: use a streaming XML parser use a library that handle’s bad markup be aware that URLs aren’t ASCII use a URL normalizerWednesday, September 14, 2011
  182. 182. SOFTWAREWednesday, September 14, 2011
  183. 183. software advice:Wednesday, September 14, 2011
  184. 184. software advice: • goals determine scaleWednesday, September 14, 2011
  185. 185. software advice: • goals determine scale • someone else has already done itWednesday, September 14, 2011
  186. 186. 2 second crawler: function wgetspider() { wget --html-extension --convert-links --mirror --page-requisites --progress=bar --level=5 --no-parent --no-verbose --no-check-certificate "$@"; } $ wgetspider http://www.ischool.berkeley.edu/Wednesday, September 14, 2011
  187. 187. java crawlers:Wednesday, September 14, 2011
  188. 188. java crawlers: • Heritrix (Internet Archive)Wednesday, September 14, 2011
  189. 189. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene)Wednesday, September 14, 2011
  190. 190. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading)Wednesday, September 14, 2011
  191. 191. java crawlers: • Heritrix (Internet Archive) • Nutch (Lucene) • Bixo (Hadoop / Cascading) http://crawler.archive.org/ http://nutch.apache.org/ http://bixo.101tec.com/Wednesday, September 14, 2011
  192. 192. extraction packages:Wednesday, September 14, 2011
  193. 193. extraction packages: • mechanizeWednesday, September 14, 2011
  194. 194. extraction packages: • mechanize • BeautifulSoup & urllib2Wednesday, September 14, 2011
  195. 195. extraction packages: • mechanize • BeautifulSoup & urllib2 • ScrapyWednesday, September 14, 2011
  196. 196. extraction packages: • mechanize • BeautifulSoup & urllib2 • Scrapy http://wwwsearch.sourceforge.net/mechanize/ http://www.crummy.com/software/BeautifulSoup/ http://scrapy.org/Wednesday, September 14, 2011
  197. 197. wrapper induction(ish)Wednesday, September 14, 2011
  198. 198. wrapper induction(ish) • ArielWednesday, September 14, 2011
  199. 199. wrapper induction(ish) • Ariel • RoadRunnerWednesday, September 14, 2011
  200. 200. wrapper induction(ish) • Ariel • RoadRunner • TemplateMakerWednesday, September 14, 2011
  201. 201. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubytWednesday, September 14, 2011
  202. 202. wrapper induction(ish) • Ariel • RoadRunner • TemplateMaker • scrubyt http://ariel.rubyforge.org/index.html http://www.dia.uniroma3.it/db/roadRunner/ http://code.google.com/p/templatemaker/ http://scrubyt.rubyforge.org/files/README.htmlWednesday, September 14, 2011
  203. 203. QUESTIONS?Wednesday, September 14, 2011
  204. 204. FEEDBACK: nate@xcombinator.com www.xcombinator.com @xcombinatorWednesday, September 14, 2011
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×