CommonCrawl: Building an Open Web-Scale Crawl Using Hadoop. Ahad Rana, Architect / Engineer at CommonCrawl. email@example.com
Who is CommonCrawl? A 501(c)(3) non-profit “dedicated to building, maintaining and making widely available a comprehensive crawl of the Internet for the purpose of enabling a new wave of innovation, education and research.” Funded through a grant by Gil Elbaz, former Googler, founder of Applied Semantics, and current CEO of Factual Inc. Board members include Carl Malamud and Nova Spivack.
Motivations Behind CommonCrawl The Internet is a massively disruptive force. Exponential advances in computing capacity, storage and bandwidth are creating constant flux and disequilibrium in the IT domain. Cloud computing makes large-scale, on-demand computing affordable for even the smallest startup. Hadoop provides the technology stack that enables us to crunch massive amounts of data. Having the ability to “Map-Reduce the Internet” opens up lots of new opportunities for disruptive innovation, and we would like to reduce the cost of doing this by at least an order of magnitude. The trend among webmasters of whitelisting only the major search engines puts the future of the Open Web at risk and stifles future search innovation and evolution.
Our Strategy Crawl broadly and frequently across all TLDs. Prioritize the crawl based on simplified criteria (rank and freshness). Upload the crawl corpus to S3. Make our S3 bucket widely accessible to as many users as possible. Build support libraries to facilitate access to the S3 data via Hadoop. Focus on doing a few things really well. Listen to customers and open up more metadata and services as needed. We are not a comprehensive crawl, and may never be.
Some Numbers URLs in Crawl DB – 14 billion. URLs with inverse link graph – 1.6 billion. URLs with content in S3 – 2.5 billion. Recently crawled documents – 500 million. Uploaded documents after deduping – 300 million. Newly discovered URLs – 1.9 billion. Vertices in PageRank graph (recent calculation) – 3.5 billion. Edges in PageRank graph (recent calculation) – 17 billion.
Current System Design Batch-oriented crawl list generation. High-volume crawling via independent crawlers. Crawlers dump data into HDFS. Map-Reduce jobs parse and extract metadata from crawled documents in bulk, independently of the crawlers. Periodically, we ‘checkpoint’ the crawl, which involves, among other things: post-processing of crawled documents (deduping etc.), ARC file generation, link graph updates, crawl database updates, and crawl list regeneration.
Our Cluster Config Modest internal cluster consisting of 24 Hadoop nodes, 4 crawler nodes, and 2 NameNode / Database servers. Each Hadoop node has 6 x 1.5 TB drives and dual quad-core Xeons with 24 or 32 GB of RAM. 9 Map Tasks per node, avg 4 Reducers per node, BLOCK compression using LZO.
Crawler Design Overview
Crawler Design Details Java codebase. Asynchronous IO model using a custom NIO-based HTTP stack. Lots of worker threads that synchronize with the main thread via asynchronous message queues. Can sustain a crawl rate of ~250 URLs per second, with up to 500 active HTTP connections at any one time. Currently, no document parsing happens in the crawler process. We currently run 8 crawlers and crawl on average ~100 million URLs per day, when crawling. During the post-processing phase, we process on average 800 million documents. After deduping, we package and upload on average approximately 500 million documents to S3.
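The threading model described above can be sketched roughly as follows. This is a hypothetical, minimal illustration (the class and method names are not from the CommonCrawl codebase): worker threads hand completed fetches back to a single main thread through an asynchronous message queue, so only the main thread mutates crawl state.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the crawler's main-thread message loop.
// Worker threads enqueue completed fetch results; the main thread
// drains the queue and is the only thread that touches crawl state.
public class CrawlerQueueSketch {
    public static int drain(BlockingQueue<String> completions, int expected) {
        int processed = 0;
        try {
            while (processed < expected) {
                completions.take();  // block until a worker posts a message
                processed++;         // crawl-state update stays single-threaded
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop draining if interrupted
        }
        return processed;
    }
}
```

The design choice here is the usual one for high-connection-count crawlers: NIO handles thousands of sockets without a thread per connection, while the queue confines shared-state mutation to one thread and avoids fine-grained locking.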
Crawl Database Primary keys are 128-bit URL fingerprints, consisting of a 64-bit domain fingerprint and a 64-bit URL fingerprint (Rabin hash). Keys are distributed via a modulo operation on the URL portion of the fingerprint only. Currently, we run 4 reducers per node, and one node is down, so we have 92 unique shards. Keys in each shard are sorted by domain FP, then URL FP. We like the 64-bit domain id, since it is a generated key, but it is wasteful. We may move to a 32-bit root domain id / 32-bit domain id + 64-bit URL fingerprint key scheme in the future, and then sort by root domain, domain, and then FP per shard.
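The sharding rule above (modulo on the URL half of the 128-bit key only) can be sketched as follows. This is an illustrative reconstruction, not the actual CommonCrawl code; the shard count of 92 comes from the slide (23 live nodes × 4 reducers).

```java
// Illustrative sketch of CommonCrawl-style shard assignment for a
// 128-bit key made of a 64-bit domain fingerprint plus a 64-bit URL
// fingerprint. Only the URL portion feeds the modulo, so the shard
// for a URL is independent of how domains are fingerprinted.
public class ShardAssigner {
    public static final int NUM_SHARDS = 92; // per the slides: 4 reducers/node, one node down

    public static int shardForKey(long domainFP, long urlFP) {
        // Mask off the sign bit so the modulo never yields a negative shard.
        return (int) ((urlFP & Long.MAX_VALUE) % NUM_SHARDS);
    }
}
```

Because keys within a shard are then sorted by domain FP first, all URLs of a domain that land in the same shard are stored contiguously, which keeps per-domain scans cheap.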
Crawl Database – Continued
Values in the Crawl Database consist of extensible Metadata structures.
We currently use our own DDL and compiler for generating structures (vs. using Thrift/ProtoBuffers/Avro).
Avro / ProtoBufs were not available when we started, and we added lots of Hadoop friendly stuff to our version (multipart [key] attributes lead to auto WritableComparable derived classes, with built-in Raw Comparator support etc.).
Our compiler also generates RPC stubs, with Google ProtoBuf style message passing semantics (Message w/ optional Struct In, optional Struct Out) instead of Thrift style semantics (Method with multiple arguments and a return type).
We prefer the former because it is better suited to our preference for an asynchronous style of RPC programming.
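The “built-in Raw Comparator support” mentioned above is the key Hadoop-friendly feature: sorting serialized keys without deserializing them. A minimal sketch of that idea for the 128-bit crawl key (8-byte domain fingerprint followed by 8-byte URL fingerprint) might look like this; it mirrors the contract of Hadoop's RawComparator, though the class here is hypothetical and compares fingerprints as signed longs for simplicity.

```java
import java.nio.ByteBuffer;

// Sketch of a raw comparator over a serialized 128-bit key:
// bytes [0,8) hold the domain fingerprint, bytes [8,16) the URL
// fingerprint. Comparison works directly on the serialized bytes,
// so the shuffle's merge sort never has to materialize key objects.
public class RawKeyComparator {
    public static int compare(byte[] a, int aOff, byte[] b, int bOff) {
        // Sort by domain fingerprint first, then by URL fingerprint,
        // matching the per-shard sort order described in the slides.
        int cmp = Long.compare(readLong(a, aOff), readLong(b, bOff));
        return cmp != 0 ? cmp : Long.compare(readLong(a, aOff + 8), readLong(b, bOff + 8));
    }

    private static long readLong(byte[] buf, int off) {
        return ByteBuffer.wrap(buf, off, 8).getLong();
    }
}
```

Generating this comparator automatically from the DDL's multipart key attributes is what saves per-record object allocation during the sort-heavy checkpoint jobs.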
Map-Reduce Pipeline – Link Graph Construction: link graph construction, followed by inverse link graph construction.
Map-Reduce Pipeline – PageRank Edge Graph Construction
PageRank Process: a distribution phase and a calculation phase that together generate the PageRank values.
The Need For a Smarter Merge The pipelined nature of HDFS writes means each reducer writes its output to local disk first, then to (replication level − 1) other nodes. If the intermediate record sets are already sorted, having to run an Identity Mapper / shuffle / merge-sort phase just to join two sorted record sets is very expensive.
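The smarter alternative implied here is a streaming merge join: since both record sets are already sorted on the same key, a single sequential pass can align them, with no shuffle and no re-sort. A minimal sketch over bare keys (real inputs would be key/value records read from HDFS; the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a streaming merge join over two key sets that are
// already sorted ascending. One linear pass replaces the
// identity-map / shuffle / merge-sort job described in the slide.
public class SortedMergeJoin {
    public static List<Long> join(long[] leftKeys, long[] rightKeys) {
        List<Long> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < leftKeys.length && j < rightKeys.length) {
            if (leftKeys[i] < rightKeys[j]) i++;        // advance the smaller side
            else if (leftKeys[i] > rightKeys[j]) j++;
            else { matches.add(leftKeys[i]); i++; j++; } // keys align: emit a join
        }
        return matches;
    }
}
```

The win is I/O, not asymptotics: both passes are O(n + m) reads either way, but the merge join skips writing (and replicating) an entire intermediate dataset back to HDFS.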