Building a Scalable Web Crawler with Hadoop

Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open, and accessible Web-Scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.

Building a Scalable Web Crawler with Hadoop Presentation Transcript

  • 1. CommonCrawl: Building an open Web-Scale crawl using Hadoop.
    Ahad Rana
    Architect / Engineer at CommonCrawl
    ahad@commoncrawl.org
  • 2. Who is CommonCrawl?
    A 501(c)(3) non-profit “dedicated to building, maintaining and making widely available a comprehensive crawl of the Internet for the purpose of enabling a new wave of innovation, education and research.”
    Funded through a grant by Gil Elbaz, former Googler and founder of Applied Semantics, and current CEO of Factual Inc.
    Board members include Carl Malamud and Nova Spivack.
  • 3. Motivations Behind CommonCrawl
    Internet is a massively disruptive force.
    Exponential advances in computing capacity, storage and bandwidth are creating constant flux and disequilibrium in the IT domain.
    Cloud computing makes large scale, on-demand computing affordable for even the smallest startup.
    Hadoop provides the technology stack that enables us to crunch massive amounts of data.
    Having the ability to “Map-Reduce the Internet” opens up lots of new opportunities for disruptive innovation and we would like to reduce the cost of doing this by an order of magnitude, at least.
    The trend among Webmasters of white-listing only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
  • 4. Our Strategy
    Crawl broadly and frequently across all TLDs.
    Prioritize the crawl based on simplified criteria (rank and freshness).
    Upload the crawl corpus to S3.
    Make our S3 bucket widely accessible to as many users as possible.
    Build support libraries to facilitate access to the S3 data via Hadoop.
    Focus on doing a few things really well.
    Listen to customers and open up more metadata and services as needed.
    We are not a comprehensive crawl, and may never be.
  • 5. Some Numbers
    URLs in Crawl DB – 14 billion
    URLs with inverse link graph – 1.6 billion
    URLs with content in S3 – 2.5 billion
    Recent crawled documents – 500 million
    Uploaded documents after deduping – 300 million
    Newly discovered URLs – 1.9 billion
    # of Vertices in Page Rank graph (recent calculation) – 3.5 billion
    # of Edges in Page Rank graph (recent calculation) – 17 billion
  • 6. Current System Design
    Batch oriented crawl list generation.
    High volume crawling via independent crawlers.
    Crawlers dump data into HDFS.
    Map-Reduce jobs parse, extract metadata from crawled documents in bulk independently of crawlers.
    Periodically, we ‘checkpoint’ the crawl, which involves, among other things:
    Post processing of crawled documents (deduping etc.)
    ARC file generation
    Link graph updates
    Crawl database updates.
    Crawl list regeneration.
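    The checkpoint steps above amount to a chain of Map-Reduce jobs. A minimal, hypothetical sketch of such a driver (step and class names are illustrative only, not CommonCrawl's actual code):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical checkpoint driver: each step is its own Map-Reduce job over
    // the crawler output in HDFS, with one step's output feeding the next.
    public class CheckpointDriver {
      public static void main(String[] args) throws Exception {
        Path segment = new Path(args[0]);      // raw crawler output in HDFS
        Path checkpoint = new Path(args[1]);   // checkpoint output root

        Path deduped = runStep("dedupe", segment, new Path(checkpoint, "deduped"));
        runStep("arc-generation", deduped, new Path(checkpoint, "arc"));
        Path linkGraph = runStep("link-graph-update", deduped, new Path(checkpoint, "linkgraph"));
        Path crawlDb = runStep("crawldb-update", linkGraph, new Path(checkpoint, "crawldb"));
        runStep("crawl-list-generation", crawlDb, new Path(checkpoint, "crawllists"));
      }

      static Path runStep(String name, Path in, Path out) throws Exception {
        JobConf job = new JobConf(CheckpointDriver.class);
        job.setJobName(name);
        // the Mapper / Reducer classes for each step are omitted in this sketch
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        JobClient.runJob(job);                 // blocks until the step completes
        return out;
      }
    }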
  • 7. Our Cluster Config
    Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / Database servers.
    Each Hadoop node has 6 x 1.5 TB drives and dual quad-core Xeons with 24 or 32 GB of RAM.
    9 Map Tasks per node, avg 4 Reducers per node, BLOCK compression using LZO (job-side settings sketched below).
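    A rough sketch of the job-side half of those settings (the per-node map/reduce slot counts live in each TaskTracker's mapred-site.xml; the LZO codec class named here comes from the separate hadoop-lzo library):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    // Hypothetical per-job settings reflecting the numbers above: ~4 reducers
    // per live Hadoop node, with BLOCK-compressed LZO SequenceFile output.
    public class JobSettings {
      public static JobConf configure(JobConf conf) {
        conf.setNumReduceTasks(92);  // 4 reducers x 23 live nodes (cf. the 92 shards mentioned later)
        FileOutputFormat.setCompressOutput(conf, true);
        conf.set("mapred.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");
        SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);
        return conf;
      }
    }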
  • 8. Crawler Design Overview
  • 9. Crawler Design Details
    Java codebase.
    Asynchronous IO model using custom NIO based HTTP stack.
    Lots of worker threads that synchronize with the main thread via asynchronous message queues.
    Can sustain a crawl rate of ~250 URLs per second.
    Up to 500 active HTTP connections at any one time.
    Currently, no document parsing in crawler process.
    We currently run 8 crawlers and crawl on average ~100 million URLs per day, when crawling.
    During the post-processing phase, we process on average 800 million documents.
    After deduping, we package and upload on average approximately 500 million documents to S3.
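    A skeletal Java sketch of that asynchronous model (names and structure are illustrative only, not the actual crawler code): one NIO selector loop drives all HTTP connections, while worker threads hand work back to the loop through a message queue.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical fetcher event loop: a single selector handles hundreds of
    // non-blocking HTTP connections; worker threads never touch sockets and
    // instead post messages that are drained on the event-loop thread.
    public class FetcherLoop implements Runnable {

      private final Selector selector;
      private final ConcurrentLinkedQueue<Runnable> messageQueue =
          new ConcurrentLinkedQueue<Runnable>();

      public FetcherLoop() throws IOException {
        this.selector = Selector.open();
      }

      // called from any worker thread; the work runs later on the loop thread
      public void post(Runnable message) {
        messageQueue.add(message);
        selector.wakeup();
      }

      // called on the loop thread to start a non-blocking connect
      public void connect(String host, int port) throws IOException {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.connect(new InetSocketAddress(host, port));
        channel.register(selector, SelectionKey.OP_CONNECT);
      }

      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            selector.select(1000);
            Runnable message;
            while ((message = messageQueue.poll()) != null) {
              message.run();                    // drain worker-thread messages
            }
            for (SelectionKey key : selector.selectedKeys()) {
              // finishConnect(), write the HTTP request, read the response...
            }
            selector.selectedKeys().clear();
          }
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }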
  • 10. Crawl Database
    Primary Keys are 128 bit URL fingerprints, consisting of 64 bit domain fingerprint, and 64 bit URL fingerprint (Rabin-Hash).
    Keys are distributed via a modulo operation on the URL portion of the fingerprint only.
    Currently, we run 4 reducers per node, and there is one node down, so we have 92 unique shards.
    Keys in each shard are sorted by Domain FP, then URL FP.
    We like the 64 bit domain id, since it is a generated key, but it is wasteful.
    We may move to a 32 bit root domain id / 32 bit domain id + 64 bit URL fingerprint key scheme in the future, and then sort by root domain, domain, and then URL FP per shard.
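    As a sketch of this key scheme (hypothetical class names; the real structures are generated, as the next slides describe): a 128 bit composite key that sorts by domain FP then URL FP, with a Partitioner that shards on the URL half only.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical 128 bit key: 64 bit domain fingerprint + 64 bit URL fingerprint.
    public class URLFPKey implements WritableComparable<URLFPKey> {
      long domainFP;  // Rabin hash of the domain
      long urlFP;     // Rabin hash of the full URL

      public void write(DataOutput out) throws IOException {
        out.writeLong(domainFP);
        out.writeLong(urlFP);
      }

      public void readFields(DataInput in) throws IOException {
        domainFP = in.readLong();
        urlFP = in.readLong();
      }

      // sort order inside a shard: domain fingerprint first, then URL fingerprint
      public int compareTo(URLFPKey other) {
        if (domainFP != other.domainFP) {
          return domainFP < other.domainFP ? -1 : 1;
        }
        if (urlFP != other.urlFP) {
          return urlFP < other.urlFP ? -1 : 1;
        }
        return 0;
      }

      // shard assignment uses only the URL half of the fingerprint, so a large
      // domain's URLs are spread evenly across the shards
      public static class URLFPPartitioner implements Partitioner<URLFPKey, Object> {
        public void configure(JobConf job) { }
        public int getPartition(URLFPKey key, Object value, int numPartitions) {
          return (int) ((key.urlFP & Long.MAX_VALUE) % numPartitions);  // e.g. 92 shards
        }
      }
    }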
  • 11. Crawl Database – Continued
    • Values in the Crawl Database consist of extensible Metadata structures.
    • 12. We currently use our own DDL and compiler for generating structures (vs. using Thrift/ProtoBuffers/Avro).
    • 13. Avro / ProtoBufs were not available when we started, and we added lots of Hadoop-friendly stuff to our version (multipart [key] attributes lead to auto-generated WritableComparable derived classes, with built-in RawComparator support, etc.).
    • 14. Our compiler also generates RPC stubs, with Google ProtoBuf style message passing semantics (Message w/ optional Struct In, optional Struct Out) instead of Thrift style semantics (Method with multiple arguments and a return type).
    • 15. We prefer the former because it is better attuned to our preference towards the asynchronous style of RPC programming.
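    To illustrate what built-in RawComparator support buys (a hypothetical example reusing the URLFPKey sketch above, not generated output): the shuffle sort can order records by comparing serialized key bytes directly, with no object deserialization.

    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical raw comparator for the 16-byte URLFPKey: compares the
    // serialized bytes (8 bytes domain FP, then 8 bytes URL FP) in place.
    public class URLFPKeyRawComparator extends WritableComparator {

      public URLFPKeyRawComparator() {
        super(URLFPKey.class);
      }

      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        long domain1 = readLong(b1, s1);
        long domain2 = readLong(b2, s2);
        if (domain1 != domain2) {
          return domain1 < domain2 ? -1 : 1;
        }
        long url1 = readLong(b1, s1 + 8);
        long url2 = readLong(b2, s2 + 8);
        if (url1 != url2) {
          return url1 < url2 ? -1 : 1;
        }
        return 0;
      }

      static {
        // register so the sort picks up the raw comparator for this key class
        WritableComparator.define(URLFPKey.class, new URLFPKeyRawComparator());
      }
    }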
  • Map-Reduce Pipeline – Parse/Dedupe/Arc Generation
    (diagram: Phase 1 and Phase 2 of the pipeline)
  • 16. Map-Reduce Pipeline – Link Graph Construction
    (diagrams: Link Graph Construction and Inverse Link Graph Construction)
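    A generic sketch of the inverse link graph step (illustrative only, not the actual job): the map phase flips every (source, target) edge so the shuffle groups edges by target, and the reduce phase collects each page's inlinks.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class InverseLinkGraph {

      // each input record is one edge: (source fingerprint, target fingerprint)
      public static class FlipEdges extends MapReduceBase
          implements Mapper<LongWritable, LongWritable, LongWritable, LongWritable> {
        public void map(LongWritable source, LongWritable target,
                        OutputCollector<LongWritable, LongWritable> out, Reporter reporter)
            throws IOException {
          out.collect(target, source);  // invert the edge direction
        }
      }

      // for each target, the iterator now yields every source that links to it
      public static class CollectInlinks extends MapReduceBase
          implements Reducer<LongWritable, LongWritable, LongWritable, Text> {
        public void reduce(LongWritable target, Iterator<LongWritable> sources,
                           OutputCollector<LongWritable, Text> out, Reporter reporter)
            throws IOException {
          StringBuilder inlinks = new StringBuilder();
          while (sources.hasNext()) {
            if (inlinks.length() > 0) {
              inlinks.append(',');
            }
            inlinks.append(sources.next().get());
          }
          out.collect(target, new Text(inlinks.toString()));
        }
      }
    }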
  • 17. Map-Reduce Pipeline – PageRank Edge Graph Construction
  • 18. Page Rank Process
    (diagram: Distribution Phase, Generate Page Rank Values, Calculation Phase)
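    A generic sketch of one such iteration (illustrative only; the real jobs run over the structures and edge graph described earlier): a distribution map phase spreads each vertex's rank over its out-edges, and a calculation reduce phase sums the contributions and applies damping.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PageRankIteration {

      // per-vertex state: current rank plus the outgoing edge list
      public static class RankAndEdges implements Writable {
        double rank;
        long[] outEdges;

        public void write(DataOutput out) throws IOException {
          out.writeDouble(rank);
          out.writeInt(outEdges.length);
          for (long edge : outEdges) {
            out.writeLong(edge);
          }
        }

        public void readFields(DataInput in) throws IOException {
          rank = in.readDouble();
          outEdges = new long[in.readInt()];
          for (int i = 0; i < outEdges.length; i++) {
            outEdges[i] = in.readLong();
          }
        }
      }

      // distribution phase: spread this vertex's rank evenly over its out-edges
      public static class Distribute extends MapReduceBase
          implements Mapper<LongWritable, RankAndEdges, LongWritable, DoubleWritable> {
        public void map(LongWritable vertex, RankAndEdges state,
                        OutputCollector<LongWritable, DoubleWritable> out, Reporter reporter)
            throws IOException {
          if (state.outEdges.length == 0) {
            return;  // dangling-node handling omitted in this sketch
          }
          double share = state.rank / state.outEdges.length;
          for (long target : state.outEdges) {
            out.collect(new LongWritable(target), new DoubleWritable(share));
          }
        }
      }

      // calculation phase: sum incoming contributions and apply the damping factor;
      // the new ranks are then re-joined with the edge lists for the next iteration
      public static class Calculate extends MapReduceBase
          implements Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
        private static final double DAMPING = 0.85;

        public void reduce(LongWritable vertex, Iterator<DoubleWritable> contributions,
                           OutputCollector<LongWritable, DoubleWritable> out, Reporter reporter)
            throws IOException {
          double sum = 0.0;
          while (contributions.hasNext()) {
            sum += contributions.next().get();
          }
          out.collect(vertex, new DoubleWritable((1.0 - DAMPING) + DAMPING * sum));
        }
      }
    }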
  • 19. The Need For a Smarter Merge
    The pipelining nature of HDFS means each Reducer writes its output to local disk first, then to (replication level - 1) other nodes.
    If intermediate record sets are already sorted, having to run an Identity Mapper / Shuffle / Merge Sort phase just to join two sorted record sets is very expensive (see the sketch below).
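    One generic way around this, shown purely as an illustration of the technique (not necessarily the solution on the next slide), is Hadoop's map-side join: if the record sets are already sorted and identically partitioned, CompositeInputFormat merges them inside the mappers with no shuffle at all.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    // Illustrative map-side join of two pre-sorted, identically partitioned
    // record sets; each map() call receives the join key and a TupleWritable
    // holding the matching records from both inputs.
    public class SortedMergeJoinJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortedMergeJoinJob.class);
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path(args[0]), new Path(args[1])));
        conf.setNumReduceTasks(0);  // the merge happens map-side; no shuffle
        // the Mapper class consuming (key, TupleWritable) pairs is omitted here
        JobClient.runJob(conf);
      }
    }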
  • 20. Our Solution: