Design and Implementation of a High-Performance Distributed Web Crawler


    Vladislav Shkapenyuk*   Torsten Suel
                CIS Department
             Polytechnic University
              Brooklyn, NY 11201


* Currently at AT&T Research Labs, Florham Park
Overview:
 1. Introduction
 2. PolyBot System Architecture
 3. Data Structures and Performance
 4. Experimental Results
 5. Discussion and Open Problems
1. Introduction
Web Crawler: (also called spider or robot)
     tool for data acquisition in search engines
     large engines need high-performance crawlers
     need to parallelize the crawling task
     PolyBot: a parallel/distributed web crawler
     cluster vs. wide-area distributed
Basic structure of a search engine:

  [diagram: the Crawler fetches pages to disks; indexing builds the Index; the search site (Search.com) answers a query such as “computer” by index lookup]
Crawler
   fetches pages from the web
   starts at a set of “seed pages”
   parses fetched pages for hyperlinks
   then follows those links (e.g., BFS)
   variations:
- recrawling
- focused crawling
- random walks
Breadth-First Crawl:
  Basic idea:
- start at a set of known URLs
- explore in “concentric circles” around these URLs

  [diagram: concentric rings of start pages, distance-one pages, distance-two pages]

  used by broad web search engines
  balances load between servers
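A minimal sketch of such a breadth-first crawl, assuming a toy extract_links() helper and omitting robot exclusion and courtesy delays (both covered on later slides); this illustrates the idea, not PolyBot's actual implementation:

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def extract_links(base_url, html):
    # crude href extraction; PolyBot uses the pcre library for parsing
    return [urljoin(base_url, m) for m in re.findall(r'href="([^"]+)"', html)]

def bfs_crawl(seeds, max_pages=100):
    seen = set(seeds)          # in-memory "URL seen" set (see Section 3)
    frontier = deque(seeds)    # FIFO queue yields the concentric-circle order
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue           # skip unreachable pages and network errors
        fetched += 1
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

bfs_crawl(["http://example.com/"])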
Crawling Strategy and Download Rate:
   crawling strategy: “What page to download next?”
   download rate: “How many pages per second?”
   different scenarios require different strategies
   lots of recent work on crawling strategy
   little published work on optimizing download rate
(main exception: Mercator from DEC/Compaq/HP?)
   somewhat separate issues
   building a slow crawler is (fairly) easy ...
Basic System Architecture

  [architecture diagram; the application component determines the crawling strategy]
System Requirements:
   flexibility (different crawling strategies)
   scalability (sustainable high performance at low cost)
   robustness
(odd server content/behavior, crashes)
   crawling etiquette and speed control
(robot exclusion, 30 second intervals, domain level
throttling, speed control for other users)
   manageable and reconfigurable
(interface for statistics and control, system setup)
Details: (lots of ‘em)
    robot exclusion
 - robots.txt file and meta tags
 - robot exclusion adds overhead (see the sketch after this list)
    handling filetypes
 (exclude some extensions, and use MIME types)
    URL extensions and CGI scripts
 (to strip or not to strip? Ignore?)
    frames, imagemaps
    black holes (robot traps)
 (limit maximum depth of a site)
    different names for same site?
 (could check IP address, but no perfect solution)
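As an aside on robot exclusion, Python's standard urllib.robotparser expresses the same check the manager performs (PolyBot implements it itself by fetching and parsing robots.txt files); the URL and agent name here are illustrative:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()     # the extra per-host request that makes robot exclusion add overhead
if rp.can_fetch("PolyBot", "http://example.com/some/page.html"):
    print("allowed to download")
else:
    print("disallowed by robots.txt")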
Crawling courtesy
    minimize load on crawled server
    no more than one outstanding request per site
    better: wait 30 seconds between accesses to a site
 (this number is not fixed)
    problems:
 - one server may have many sites
 - one site may be on many servers
 - 3 years to crawl a 3-million-page site!
    give contact info for large crawls
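A toy version of this courtesy rule, assuming a single process and keying on the hostname (which, per the problems above, only approximates a “site”); PolyBot's manager enforces the rule with its on-disk host queues:

import time
from urllib.parse import urlsplit

WAIT = 30.0               # seconds between accesses to one site (not fixed)
last_access = {}          # host -> time of the most recent request

def ready(url):
    """True if the url's host may be contacted again."""
    host = urlsplit(url).hostname
    last = last_access.get(host)
    return last is None or time.monotonic() - last >= WAIT

def mark_fetched(url):
    last_access[urlsplit(url).hostname] = time.monotonic()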
Contributions:
    distributed architecture based on collection of services
- separation of concerns
- efficient interfaces
    I/O efficient techniques for URL handling
- lazy URL-seen structure
- manager data structures
    scheduling policies
- manager scheduling and shuffling
    resulting system limited by network and parsing performance
    detailed description and how-to (limited experiments)
2. PolyBot System Architecture
Structure:
  separation of crawling strategy and basic system
  collection of scalable distributed services
(DNS, downloading, scheduling, strategy)
  for clusters and wide-area distributed crawling
  optimized per-node performance
  no random disk accesses (no per-page access)
Basic Architecture (ctd):
     application issues requests to manager
     manager does DNS and robot exclusion
     manager schedules URL on a downloader
     downloader gets file and puts it on disk
     application is notified of new files
     application parses new files for hyperlinks
     application sends data to storage component
(indexing done later)
System components:
   downloader: optimized HTTP client written in Python
(everything else in C++)
   DNS resolver: uses asynchronous DNS library
   manager uses Berkeley DB and STL for external and
internal data structures
   manager does robot exclusion by generating requests
to downloaders and parsing files
   application does parsing and handling of URLs
(has this page already been downloaded?)
Scaling the system:
   small system on previous pages:
3-5 workstations and 250-400 pages/sec peak
   can scale up by adding downloaders and DNS resolvers
   at 400-600 pages/sec, application becomes bottleneck
   at 8 downloaders, manager becomes bottleneck
   need to replicate application and manager
   hash-based technique (Internet Archive crawler)
partitions URLs and hosts among application parts (see the sketch below)
   data transfer batched and via file system (NFS)
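A sketch of that hash-based partitioning, assuming num_nodes application/manager pairs; md5 merely stands in for whatever hash function the Archive crawler used:

import hashlib
from urllib.parse import urlsplit

def node_for(url, num_nodes):
    # hash on the hostname so all URLs of one site land on the same node,
    # keeping that site's robot-exclusion state in one place
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# both URLs of a site map to the same application/manager pair
assert node_for("http://www.poly.edu/a", 4) == node_for("http://www.poly.edu/b", 4)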
Scaling up:
  [diagram: a 20-machine configuration, estimated at 1500 pages/s (depends on crawl strategy); URLs are hashed to nodes based on site, because of robot exclusion]
3. Data Structures and Techniques
Crawling Application
   parsing using the pcre library
   NFS eventually becomes the bottleneck
   URL-seen problem:
- need to check if file has been parsed or downloaded before
- after 20 million pages, we have “seen” over 100 million URLs
- each URL is 50 to 75 bytes on average
   Options: compress URLs in main memory, or use disk
- prefix+Huffman coding (DEC, JY01) or Bloom filter (Archive; toy version below)
- disk access with caching (Mercator)
- we use lazy/bulk operations on disk
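For comparison, a toy Bloom filter like the Archive's in-memory alternative; the sizes here are illustrative, and false positives mean some unseen URLs would be wrongly skipped:

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 24, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

bf = BloomFilter()
bf.add("http://example.com/")
assert "http://example.com/" in bf    # no false negatives; false positives possible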
Implementation of URL-seen check:
- while less than a few million URLs seen, keep in main memory
- then write URLs to file in alphabetic, prefix-compressed order
- collect new URLs in memory and periodically perform a bulk
check by merging new URLs into the file on disk (sketched below)
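A sketch of that bulk check, assuming the on-disk file already exists and holds one sorted URL per line; prefix compression is left out to keep it short:

import os

def bulk_check(seen_file, new_urls):
    """Merge sorted new_urls into seen_file; return the URLs not seen before."""
    batch = sorted(set(new_urls))
    truly_new = []
    i = 0
    with open(seen_file) as old, open(seen_file + ".tmp", "w") as out:
        for line in old:
            url = line.rstrip("\n")
            while i < len(batch) and batch[i] < url:
                truly_new.append(batch[i])      # absent from disk: genuinely new
                out.write(batch[i] + "\n")
                i += 1
            if i < len(batch) and batch[i] == url:
                i += 1                          # already on disk: a duplicate
            out.write(url + "\n")
        for url in batch[i:]:                   # batch entries past the last disk URL
            truly_new.append(url)
            out.write(url + "\n")
    os.replace(seen_file + ".tmp", seen_file)   # one sequential pass, no random seeks
    return truly_new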
   When is a newly parsed URL downloaded?
   Reordering the request stream
- want to space out requests to the same subdomain
- needed due to load on small domains and due to security tools
- sort URLs with hostname reversed (e.g., com.amazon.www),
and then “unshuffle” the stream, giving provable load balance (sketch below)
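A sketch of reversed-hostname sorting plus a simple stride-based “unshuffle”; the stride k and helper names are illustrative, while PolyBot applies the same idea to its stream of request files:

from urllib.parse import urlsplit

def reversed_host_key(url):
    parts = urlsplit(url)
    host = ".".join(reversed((parts.hostname or "").split(".")))
    return (host, parts.path)

def reorder(urls, k=4):
    s = sorted(urls, key=reversed_host_key)   # URLs of one site become adjacent
    # take every k-th element, k times: spreads each site across the stream
    return [s[i] for start in range(k) for i in range(start, len(s), k)]

urls = ["http://www.amazon.com/p%d" % i for i in range(4)] + \
       ["http://www.poly.edu/p%d" % i for i in range(4)]
print(reorder(urls))    # amazon and poly requests now alternate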
Crawling Manager
  large stream of incoming URL request files
  goal: schedule URLs roughly in the order that they
come, while observing time-out rule (30 seconds)
and maintaining high speed
  must do DNS and robot excl. “right before” download
  keep requests on disk as long as possible!
- otherwise, structures grow too large after few million pages
(performance killer)
Manager Data Structures:

  [diagram of the manager’s internal queues and host structures]

  when to insert new URLs into internal structures?
URL Loading Policy
  read a new request file from disk whenever less
than x hosts in ready queue
  choose x > speed * timeout (e.g., 100 pages/s * 30 s; worked example below)
  # of current host data structures is
x + speed * timeout + n_down + n_transit
which is usually < 2x
  nice behavior for BDB caching policy
  performs reordering only when necessary!
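Plugging the slide's example numbers into that bound; the values of n_down and n_transit are illustrative, not measured:

speed, timeout = 100, 30        # pages/s and courtesy interval in seconds
x = 5000                        # chosen so that x > speed * timeout = 3000
n_down, n_transit = 50, 20      # illustrative counts of active/in-transit hosts
current_hosts = x + speed * timeout + n_down + n_transit
print(current_hosts, "<", 2 * x)    # 8070 < 10000: stays on the order of 2x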
4. Experimental Results
   crawl of 120 million pages over 19 days
161 million HTTP requests
16 million robots.txt requests
138 million successful non-robots requests
17 million HTTP errors (401, 403, 404, etc.)
121 million pages retrieved
   slow during day, fast at night
   peak about 300 pages/s over T3
   many downtimes due to attacks, crashes, revisions
   “slow tail” of requests at the end (4 days)
   lots of things happen
Experimental Results ctd.

  [chart: bytes in / bytes out / frames out on the Poly T3 connection over 24 hours of 5/28/01 (courtesy of AppliedTheory)]
Experimental Results ctd.
   sustaining performance:
- will find out when data structures hit disk
- I/O-efficiency vital
   speed control tricky
- vary number of connections based on feedback
- also upper bound on connections
- complicated interactions in system
- not clear what we should want
  other configuration: 140 pages/sec sustained
on 2 Ultra10 with 60GB EIDE and 1GB/768MB
  similar for Linux on Intel
More Detailed Evaluation (to be done)
   Problems
- cannot get commercial crawlers
- need simulation system to find system bottlenecks
- often not much of a tradeoff (get it right!)
   Example: manager data structures
- with our loading policy, manager can feed several
downloaders
- naïve policy: disk access per page
   parallel communication overhead
- low for limited number of nodes (URL exchange)
- wide-area distributed: where do you want the data?
- more relevant for highly distributed systems
5. Discussion and Open Problems
Related work
    Mercator (Heydon/Najork from DEC/Compaq)
- used in AltaVista
- centralized system (2-CPU Alpha with RAID disks)
- URL-seen test by fast disk access and caching
- one thread per HTTP connection
- completely in Java, with pluggable components
    Atrax: very recent distributed extension to Mercator
- combines several Mercators
- URL hashing, and off-line URL check (as we do)
Related work (ctd.)
    early Internet Archive crawler (circa 1996)
- uses hashing to partition URLs between crawlers
- Bloom filter for “URL seen” structure
    early Google crawler (1998)
    P2P crawlers (grub.org and others)
    Cho/Garcia-Molina (WWW 2002)
- study of overhead/quality tradeoff in parallel crawlers
- difference: we scale services separately, and focus on
single-node performance
- in our experience, parallel overhead low
Open Problems:
    Measuring and tuning peak performance
- need simulation environment
- eventually reduces to parsing and network
- to be improved: space, fault-tolerance (Xactions?)
    Highly Distributed crawling
- highly distributed (e.g., grub.org)? (maybe)
- hybrid? (different services)
- few high-performance sites? (several universities)
    Recrawling and focused crawling strategies
- what strategies?
- how to express?
- how to implement?
