Common Crawl presentation


Published in: Education
  1. CommonCrawl: Building an open Web-Scale crawl using Hadoop.
     Ahad Rana, Architect / Engineer at CommonCrawl
  2. Who is CommonCrawl?
     • A 501(c)(3) non-profit “dedicated to building, maintaining and making widely available a comprehensive crawl of the Internet for the purpose of enabling a new wave of innovation, education and research.”
     • Funded through a grant by Gil Elbaz, former Googler, founder of Applied Semantics, and current CEO of Factual Inc.
     • Board members include Carl Malamud and Nova Spivack.
  3. Motivations Behind CommonCrawl
     • The Internet is a massively disruptive force.
     • Exponential advances in computing capacity, storage and bandwidth are creating constant flux and disequilibrium in the IT domain.
     • Cloud computing makes large-scale, on-demand computing affordable for even the smallest startup.
     • Hadoop provides the technology stack that enables us to crunch massive amounts of data.
     • Having the ability to “Map-Reduce the Internet” opens up lots of new opportunities for disruptive innovation, and we would like to reduce the cost of doing this by at least an order of magnitude.
     • The trend among webmasters of whitelisting only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
  4. Our Strategy
     • Crawl broadly and frequently across all TLDs.
     • Prioritize the crawl based on simplified criteria (rank and freshness).
     • Upload the crawl corpus to S3.
     • Make our S3 bucket widely accessible to as many users as possible.
     • Build support libraries to facilitate access to the S3 data via Hadoop.
     • Focus on doing a few things really well.
     • Listen to customers and open up more metadata and services as needed.
     • We are not a comprehensive crawl, and may never be.
  5. Some Numbers
     • URLs in Crawl DB – 14 billion
     • URLs with inverse link graph – 1.6 billion
     • URLs with content in S3 – 2.5 billion
     • Recent crawled documents – 500 million
     • Uploaded documents after deduping – 300 million
     • Newly discovered URLs – 1.9 billion
     • # of vertices in Page Rank graph (recent calculation) – 3.5 billion
     • # of edges in Page Rank graph (recent calculation) – 17 billion
  6. Current System Design
     • Batch-oriented crawl list generation.
     • High-volume crawling via independent crawlers.
     • Crawlers dump data into HDFS.
     • Map-Reduce jobs parse and extract metadata from crawled documents in bulk, independently of the crawlers.
     • Periodically, we ‘checkpoint’ the crawl, which involves, among other things:
       – Post-processing of crawled documents (deduping etc.)
       – ARC file generation
       – Link graph updates
       – Crawl database updates
       – Crawl list regeneration
  7. Our Cluster Config
     • Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / Database servers.
     • Each Hadoop node has 6 x 1.5 TB drives and dual quad-core Xeons with 24 or 32 GB of RAM.
     • 9 Map Tasks per node, avg 4 Reducers per node, BLOCK compression using LZO.
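The per-node figures on this slide can be rolled up into cluster totals. The following is only a back-of-the-envelope sketch: the function and constant names are made up for illustration, the disk figure is raw capacity (the slides do not state the HDFS replication factor), and the reduce-slot count uses the slide's average of 4 per node.

```python
# Figures taken from the slide; names are illustrative, not from CommonCrawl's code.
HADOOP_NODES = 24
DRIVES_PER_NODE = 6
DRIVE_TB = 1.5
MAP_TASKS_PER_NODE = 9
AVG_REDUCERS_PER_NODE = 4


def cluster_totals(nodes=HADOOP_NODES):
    """Return raw disk capacity and task-slot totals for `nodes` Hadoop nodes."""
    return {
        "raw_disk_tb": nodes * DRIVES_PER_NODE * DRIVE_TB,  # before any HDFS replication
        "map_slots": nodes * MAP_TASKS_PER_NODE,
        "reduce_slots": nodes * AVG_REDUCERS_PER_NODE,
    }


if __name__ == "__main__":
    print(cluster_totals())    # all 24 nodes up
    print(cluster_totals(23))  # one node down
```

With one node down, the reduce-slot total comes to 23 x 4 = 92, which matches the shard count quoted on the Crawl Database slide.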
  8. Crawler Design Overview
  9. Crawler Design Details
     • Java codebase.
     • Asynchronous IO model using a custom NIO-based HTTP stack.
     • Lots of worker threads that synchronize with the main thread via asynchronous message queues.
     • Can sustain a crawl rate of ~250 URLs per second.
     • Up to 500 active HTTP connections at any one time.
     • Currently, no document parsing in the crawler process.
     • We currently run 8 crawlers and crawl on average ~100 million URLs per day, when crawling.
     • During the post-processing phase, we process on average 800 million documents.
     • After deduping, we package and upload on average approximately 500 million documents to S3.
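The throughput numbers above can be cross-checked with simple arithmetic. This sketch assumes the ~250 URLs per second figure is per crawler; the slide does not say so explicitly, but a cluster-wide 250 URLs/s would cap out at about 21.6 million URLs per day, below the reported ~100 million. All names are illustrative.

```python
CRAWLERS = 8
PEAK_URLS_PER_SEC = 250           # assumed to be per crawler (see note above)
SECONDS_PER_DAY = 86_400
REPORTED_DAILY_AVG = 100_000_000  # "~100 million URLs per day" from the slide


def peak_daily_urls(crawlers=CRAWLERS, rate=PEAK_URLS_PER_SEC):
    """Upper bound on URLs fetched per day if every crawler ran at peak rate."""
    return crawlers * rate * SECONDS_PER_DAY


if __name__ == "__main__":
    peak = peak_daily_urls()
    print(f"peak: {peak:,} URLs/day")
    print(f"reported average is {REPORTED_DAILY_AVG / peak:.0%} of peak")
```

Under that assumption the reported daily average sits at a bit under 60% of the theoretical peak, which is plausible once politeness delays, slow hosts and failed fetches are accounted for.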
  10. Crawl Database
     • Primary keys are 128-bit URL fingerprints, consisting of a 64-bit domain fingerprint and a 64-bit URL fingerprint (Rabin hash).
     • Keys are distributed via a modulo operation on the URL portion of the fingerprint only.
     • Currently, we run 4 reducers per node, and there is one node down, so we have 92 unique shards.
     • Keys in each shard are sorted by domain FP, then URL FP.
     • We like the 64-bit domain id, since it is a generated key, but it is wasteful.
     • We may move to a 32-bit root domain id / 32-bit domain id + 64-bit URL fingerprint key scheme in the future, and then sort by root domain, domain, and then FP per shard.
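The key scheme described on this slide can be sketched as follows. This is a minimal illustration, not CommonCrawl's code: `fp64` uses BLAKE2b as a stand-in because Python's standard library has no Rabin hash, the choice of the URL's hostname as the "domain" is an assumption, and all function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit

NUM_SHARDS = 92          # shard count quoted on the slide
MASK64 = (1 << 64) - 1

Key = Tuple[int, int]    # (domain fingerprint, URL fingerprint)


def fp64(text: str) -> int:
    """64-bit fingerprint. Stand-in hash (BLAKE2b), NOT the Rabin hash from the slides."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def crawl_key(url: str) -> Key:
    """Build the two-part key; using the hostname as the domain is an assumption."""
    host = urlsplit(url).hostname or ""
    return fp64(host), fp64(url)


def pack_key(key: Key) -> int:
    """Pack both halves into one 128-bit integer, domain FP in the high bits."""
    domain_fp, url_fp = key
    return (domain_fp << 64) | url_fp


def shard_for(key: Key, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the URL portion of the fingerprint only."""
    return key[1] % num_shards


def shard_and_sort(urls: Iterable[str], num_shards: int = NUM_SHARDS) -> Dict[int, List[Key]]:
    """Distribute keys to shards, then sort each shard by domain FP, then URL FP."""
    shards: Dict[int, List[Key]] = {}
    for url in urls:
        key = crawl_key(url)
        shards.setdefault(shard_for(key, num_shards), []).append(key)
    for keys in shards.values():
        keys.sort()  # tuple ordering gives (domain FP, URL FP) order
    return shards
```

Sharding on the URL half spreads a large domain's pages across all shards, while the sort order keeps each domain's keys contiguous within any one shard.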
  11. Crawl Database – Continued
     • Values in the Crawl Database consist of extensible metadata structures.
     • We currently use our own DDL and compiler for generating structures (vs. using Thrift/ProtoBuffers/Avro).
     • Avro / ProtoBufs were not available when we started, and we added lots of Hadoop-friendly stuff to our version (multipart [key] attributes lead to auto WritableComparable derived classes, with built-in Raw Comparator support etc.).
     • Our compiler also generates RPC stubs, with Google ProtoBuf style message passing semantics (Message w/ optional Struct In, optional Struct Out) instead of Thrift style semantics (Method with multiple arguments and a return type).
     • We prefer the former because it is better attuned to our preference towards the asynchronous style of RPC programming.
  12. Map-Reduce Pipeline – Parse/Dedupe/Arc Generation (diagram: Phase 1, Phase 2)
  13. Map-Reduce Pipeline – Link Graph Construction (diagram: Link Graph Construction, Inverse Link Graph Construction)
  14. Map-Reduce Pipeline – PageRank Edge Graph Construction
  15. Page Rank Process (diagram: Distribution Phase, Calculation Phase, Generate Page Rank Values)
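The slide names a distribution phase and a calculation phase but the diagram's details are not in this transcript. As a rough illustration of how those two phases usually map onto the textbook PageRank iteration, here is a small in-memory sketch. It is not CommonCrawl's implementation: the damping factor of 0.85, the uniform handling of dangling vertices, the iteration count and all names are assumptions, and every link target is assumed to appear as a key in `graph`.

```python
from collections import defaultdict


def distribute(graph, ranks):
    """Distribution phase: each vertex sends rank / out_degree to every out-link."""
    contributions = []
    for src, targets in graph.items():
        if targets:
            share = ranks[src] / len(targets)
            contributions.extend((dst, share) for dst in targets)
    return contributions


def calculate(graph, ranks, contributions, damping=0.85):
    """Calculation phase: sum incoming shares per vertex and apply damping."""
    n = len(graph)
    incoming = defaultdict(float)
    for dst, share in contributions:
        incoming[dst] += share
    # Rank held by vertices with no out-links is spread evenly (an assumption).
    dangling = sum(ranks[v] for v, targets in graph.items() if not targets)
    return {
        v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
        for v in graph
    }


def pagerank(graph, iterations=20):
    """Run a fixed number of distribute/calculate rounds from a uniform start."""
    ranks = {v: 1.0 / len(graph) for v in graph}
    for _ in range(iterations):
        ranks = calculate(graph, ranks, distribute(graph, ranks))
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    print(pagerank(toy_graph))
```

In a Map-Reduce setting the distribution step corresponds naturally to a map over the edge graph and the calculation step to a reduce keyed by target vertex, though the slides do not confirm that this is how the phases are split.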
  16. The Need For a Smarter Merge
     • The pipelining nature of HDFS means each Reducer writes its output to local disk first, then to (replication level – 1) other nodes.
     • If intermediate record sets are already sorted, running an Identity Mapper/Shuffle/Merge Sort phase just to join two sorted record sets is very expensive.
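To make the cost argument concrete: when two record sets are already sorted by the same key, they can be joined in a single streaming pass with no re-sort. The sketch below shows that general merge-join technique on in-memory lists. It is not the authors' solution, which this transcript does not describe, and the function name and sample records are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_join(*sorted_inputs):
    """Yield (key, [values...]) from several streams of (key, value) pairs.

    Every input must already be sorted by key; nothing is re-sorted here.
    """
    merged = heapq.merge(*sorted_inputs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == "__main__":
    # Hypothetical records keyed by URL fingerprint.
    crawl_db = [(1, "meta-1"), (3, "meta-3")]
    inlinks = [(1, "inlink-a"), (2, "inlink-b"), (3, "inlink-c")]
    for key, values in merge_join(crawl_db, inlinks):
        print(key, values)
```

Each input is read once and only one record per input is held in memory at a time, which is the saving over pushing already-sorted data back through a full shuffle and merge sort.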
  17. Our Solution: