
Building a Scalable Web Crawler with Hadoop




Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open and accessible Web-Scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.




Usage Rights

© All Rights Reserved


    Building a Scalable Web Crawler with Hadoop: Presentation Transcript

    • CommonCrawl: Building an open Web-Scale crawl using Hadoop.
      Ahad Rana, Architect / Engineer at CommonCrawl
    • Who is CommonCrawl ?
      A 501(c)3 non-profit “dedicated to building, maintaining and making widely available a comprehensive crawl of the Internet for the purpose of enabling a new wave of innovation, education and research.”
      Funded through a grant by Gil Elbaz, former Googler and founder of Applied Semantics, and current CEO of Factual Inc.
      Board members include Carl Malamud and Nova Spivack.
    • Motivations Behind CommonCrawl
      Internet is a massively disruptive force.
      Exponential advances in computing capacity, storage and bandwidth are creating constant flux and disequilibrium in the IT domain.
      Cloud computing makes large scale, on-demand computing affordable for even the smallest startup.
      Hadoop provides the technology stack that enables us to crunch massive amounts of data.
      Having the ability to “Map-Reduce the Internet” opens up lots of new opportunities for disruptive innovation and we would like to reduce the cost of doing this by an order of magnitude, at least.
      The trend among webmasters of white-listing only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
    • Our Strategy
      Crawl broadly and frequently across all TLDs.
      Prioritize the crawl based on simplified criteria (rank and freshness).
      Upload the crawl corpus to S3.
      Make our S3 bucket widely accessible to as many users as possible.
      Build support libraries to facilitate access to the S3 data via Hadoop.
      Focus on doing a few things really well.
      Listen to customers and open up more metadata and services as needed.
      We are not a comprehensive crawl, and may never be one.
    • Some Numbers
      URLs in Crawl DB – 14 billion
      URLs with inverse link graph – 1.6 billion
      URLs with content in S3 – 2.5 billion
      Recent crawled documents – 500 million
      Uploaded documents after deduping – 300 million
      Newly discovered URLs – 1.9 billion
      # of Vertices in Page Rank graph (recent calculation) – 3.5 billion
      # of Edges in Page Rank graph (recent calculation) – 17 billion
    • Current System Design
      Batch oriented crawl list generation.
      High volume crawling via independent crawlers.
      Crawlers dump data into HDFS.
      Map-Reduce jobs parse, extract metadata from crawled documents in bulk independently of crawlers.
      Periodically, we ‘checkpoint’ the crawl, which involves, among other things:
      Post processing of crawled documents (deduping etc.)
      ARC file generation
      Link graph updates
      Crawl database updates.
      Crawl list regeneration.
    • Our Cluster Config
      Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / Database servers.
      Each Hadoop node has 6 x 1.5 TB drives and dual quad-core Xeons with 24 or 32 GB of RAM.
      9 Map Tasks per node, avg 4 Reducers per node, BLOCK compression using LZO.
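      A configuration along those lines might look roughly like the following Hadoop 0.20-era mapred-site.xml fragment; the property names are from that era's Hadoop (the LZO codec comes from the separate hadoop-lzo package), and the values here simply mirror the slide's description rather than CommonCrawl's actual config:

```xml
<!-- Illustrative mapred-site.xml fragment matching the slide's description -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>9</value>   <!-- 9 map slots per node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>   <!-- ~4 reduce slots per node -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value> <!-- from hadoop-lzo -->
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value> <!-- BLOCK compression for SequenceFile outputs -->
</property>
```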
    • Crawler Design Overview
    • Crawler Design Details
      Java codebase.
      Asynchronous IO model using custom NIO based HTTP stack.
      Lots of worker threads that synchronize with main thread via Asynchronous message queues.
      Can sustain a crawl rate of ~250 URLs per second.
      Up to 500 active HTTP connections at any one time.
      Currently, no document parsing in crawler process.
      We currently run 8 crawlers and crawl on average ~100 million URLs per day, when crawling.
      During post processing phase, on average we process 800 million documents.
      After deduping, we package and upload on average approximately 500 million documents to S3.
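      The threading model described above (many worker threads handing results back to a coordinating main thread through asynchronous message queues) can be sketched independently of the actual NIO HTTP stack. The class and record names below are hypothetical, not CommonCrawl's API; a real worker would drive asynchronous fetches instead of fabricating results:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the worker-thread / message-queue pattern: workers
// "fetch" URLs and post result messages; the main thread is the only
// consumer of the result queue and the only mutator of crawl state.
public class CrawlerSketch {
    // Hypothetical result message; the real crawler's messages are richer.
    record FetchResult(String url, int httpStatus) {}

    public static List<FetchResult> crawl(List<String> urls, int workers)
            throws InterruptedException {
        BlockingQueue<String> work = new LinkedBlockingQueue<>(urls);
        BlockingQueue<FetchResult> results = new LinkedBlockingQueue<>();
        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                String url;
                while ((url = work.poll()) != null) {
                    // A real worker would issue an async HTTP fetch here.
                    results.add(new FetchResult(url, 200));
                }
            });
            t.start();
            pool.add(t);
        }
        for (Thread t : pool) t.join();
        // Main thread drains the queue after (or, in reality, while)
        // workers run; no shared crawl state is touched by workers.
        List<FetchResult> out = new ArrayList<>();
        results.drainTo(out);
        return out;
    }
}
```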
    • Crawl Database
      Primary Keys are 128 bit URL fingerprints, consisting of 64 bit domain fingerprint, and 64 bit URL fingerprint (Rabin-Hash).
      Keys are distributed via a modulo operation on the URL portion of the fingerprint only.
      Currently, we run 4 reducers per node, and one node is down, so we have 92 unique shards.
      Keys in each shard are sorted by Domain FP, then URL FP.
      We like the 64 bit domain id, since it is a generated key, but it is wasteful.
      We may move to a 32 bit root domain id / 32 bit domain id + 64 bit URL fingerprint key scheme in the future, and then sort by root domain, domain, and then URL fingerprint per shard.
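      A minimal sketch of that key scheme, using FNV-64 as a stand-in for the Rabin hash the slides mention:

```java
// Sketch of the 128-bit key scheme: a 64-bit domain fingerprint plus a
// 64-bit URL fingerprint, sharded by the URL portion only. FNV-64 is a
// stand-in here for the Rabin hash CommonCrawl actually uses.
public class UrlKey {
    static long fnv64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    final long domainFP; // high 64 bits of the 128-bit key
    final long urlFP;    // low 64 bits

    UrlKey(String domain, String url) {
        this.domainFP = fnv64(domain);
        this.urlFP = fnv64(url);
    }

    // Shard by the URL fingerprint only (modulo), so a single domain's
    // URLs spread across shards; within each shard, records are then
    // sorted by domain FP first, keeping a domain's URLs adjacent.
    int shard(int numShards) {
        return (int) Long.remainderUnsigned(urlFP, numShards);
    }
}
```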
    • Crawl Database – Continued
      • Values in the Crawl Database consist of extensible Metadata structures.
      • We currently use our own DDL and compiler for generating structures (vs. using Thrift/ProtoBuffers/Avro).
      • Avro / ProtoBufs were not available when we started, and we added lots of Hadoop friendly stuff to our version (multipart [key] attributes lead to auto WritableComparable derived classes, with built-in Raw Comparator support etc.).
      • Our compiler also generates RPC stubs, with Google ProtoBuf style message passing semantics (Message w/ optional Struct In, optional Struct Out) instead of Thrift style semantics (Method with multiple arguments and a return type).
      • We prefer the former because it is better attuned to our preference towards the asynchronous style of RPC programming.
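      The point of the auto-generated raw comparators is that sorting and merging never need to deserialize the keys. Without depending on Hadoop's RawComparator interface, the core idea can be sketched like this: serialize the (domainFP, urlFP) pair as big-endian longs, then compare the bytes directly (which orders keys as unsigned 128-bit values):

```java
import java.nio.ByteBuffer;

// Sketch of a raw (byte-level) key comparator: keys serialized as two
// big-endian longs can be ordered by an unsigned byte-wise comparison,
// so sort and merge never deserialize them. Hadoop's RawComparator works
// on the same principle; this version is framework-free for illustration.
public class RawKeyComparator {
    static byte[] serialize(long domainFP, long urlFP) {
        return ByteBuffer.allocate(16).putLong(domainFP).putLong(urlFP).array();
    }

    // Compare two serialized 16-byte keys byte-by-byte, unsigned.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 16; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;
    }
}
```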
    • Map-Reduce Pipeline – Parse/Dedupe/Arc Generation
      Phase 1
      Phase 2
    • Map-Reduce Pipeline – Link Graph Construction
      Link Graph Construction
      Inverse Link Graph Construction
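      Conceptually, inverse link graph construction is a group-by on the destination side of each edge; the shuffle in the real MapReduce job does at scale what the in-memory map does here. A hypothetical sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of inverse link graph construction: turn outlink records
// (src -> dst) into inlink records (dst -> [srcs]). In the real pipeline
// the MapReduce shuffle performs this grouping across billions of edges.
public class InverseLinkGraph {
    record Edge(String src, String dst) {}

    static Map<String, List<String>> invert(List<Edge> edges) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Edge e : edges) {
            inlinks.computeIfAbsent(e.dst(), k -> new ArrayList<>()).add(e.src());
        }
        return inlinks;
    }
}
```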
    • Map-Reduce Pipeline – PageRank Edge Graph Construction
    • Page Rank Process
      Distribution Phase
      Generate Page Rank Values
      Calculation Phase
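      The distribute/calculate split above follows the standard iterative PageRank formulation: in the distribution phase each vertex sends rank/outdegree along its edges, and in the calculation phase each vertex recomputes its rank from the contributions it received. A tiny in-memory sketch of one round (the 0.85 damping factor is the conventional choice, not a value from the slides):

```java
import java.util.List;

// Sketch of one PageRank "distribute then calculate" round, mirroring the
// two phases the pipeline runs as separate passes over the edge graph.
public class PageRankSketch {
    record Edge(int src, int dst) {}

    static double[] iterate(double[] rank, List<Edge> edges, double damping) {
        int n = rank.length;
        // Distribution phase: each vertex emits rank/outdegree to targets.
        int[] outDegree = new int[n];
        for (Edge e : edges) outDegree[e.src()]++;
        double[] received = new double[n];
        for (Edge e : edges) received[e.dst()] += rank[e.src()] / outDegree[e.src()];
        // Calculation phase: fold contributions into the damped formula.
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
            next[v] = (1 - damping) / n + damping * received[v];
        }
        return next;
    }
}
```

      (This toy version ignores dangling vertices; a production job has to redistribute their lost rank mass.)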
    • The Need For a Smarter Merge
      Pipelining nature of HDFS means each Reducer writes its output to local disk first, then to (replication level – 1) other nodes.
      If intermediate record sets are already sorted, having to run an Identity Mapper / Shuffle / Merge-Sort phase just to join two sorted record sets is very expensive.
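      The cheaper alternative amounts to a direct merge: when both inputs are already sorted on the same key, a single linear pass combines them with no shuffle or re-sort. A sketch over plain key lists (assumed sorted ascending):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of merging two record sets that are already sorted on the same
// key: one linear pass, no identity-map / shuffle / merge-sort job.
public class SortedMerge {
    static List<Long> mergeKeys(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.get(i) <= b.get(j)) out.add(a.get(i++));
            else out.add(b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```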
    • Our Solution: