Introduction to Common Crawl
March 2013 slides about Common Crawl. Original presentation was to the Working with Open Data class at the UC Berkeley School of Information http://www.ischool.berkeley.edu/courses/i290t-wod

Presentation Transcript

  • Introduction to Common Crawl. Dave Lester, March 21, 2013
  • Video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4
  • What is Common Crawl?
      • A non-profit organization providing an open repository of web crawl data that anyone can access and analyze
      • Data is currently shared as a public dataset on Amazon S3
  • Why Open Data?
      • It’s difficult to crawl the web at scale
      • Provides a shared resource so researchers can compare results and recreate experiments
  • 2012 Corpus Stats
      • Total web documents: 3.8 billion
      • Total uncompressed content size: 100+ TB
      • Domains: 61 million
      • PDFs: 92.2 million
      • Word documents: 6.6 million
      • Excel documents: 1.3 million
  • Other Data Sources
      • Blekko, a “spam-free search engine”; their metadata includes:
      • Rank on a linear scale, and a 0-10 web rank
      • True/false for Blekko’s webspam algorithm thinking this domain or page is spam
      • True/false for Blekko’s pr0n detection algorithm
  • What is Crawled?
      • Check out the new URL search tool: http://commoncrawl.org/url-search-tool/ (try entering ischool.berkeley.edu)
      • The first five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS credit!
  • How is Data Crawled?
      • Customized crawler (it’s open source!)
      • Some basic PageRank included; lots of time spent optimizing this and filtering spam
      • See Apache Nutch as an alternative web-scale crawler
      • Future datasets may include other crawl sources
  • Common Crawl Uses
  • Analyze References to Facebook
      • Of ~1.3 billion URLs:
      • 22% of web pages contain Facebook URLs
      • 8% of web pages implement Open Graph tags
      • Among ~500 million hardcoded links to Facebook, only 3.5 million are unique
      • These are primarily for simple social integrations
  • References to FB Pages
      • /merriamwebster: 676,071 (0.14%)
      • /kevjumba: 651,389 (0.14%)
      • /placeformusic: 618,963 (0.13%)
      • /lyricskeeper: 517,999 (0.11%)
      • /kayak: 465,179 (0.10%)
      • /twitter: 281,882 (0.06%)
  • Analyze JavaScript Libraries on the Web
      1. jQuery (82.64%)
      2. Prototype (6.06%)
      3. Mootools (4.83%)
      4. Ext (3.47%)
      5. YUI (1.78%)
      6. Modernizr (0.59%)
      7. Dojo (0.21%)
      8. Ember (0.14%)
      9. Underscore (0.11%)
      10. Backbone (0.09%)
  • Library Co-occurrence
  • Web Data Commons
      • A sub-corpus of Common Crawl data
      • Includes RDFa, hCalendar, hCard, Geo Microdata, hResume, and XFN
      • Built using the 2009/2010 corpus
  • Traitor: Associating Concepts (http://www.youtube.com/watch?v=c7Y149RnQjw)
  • Associated Costs?
      • Complete data set: ~$1,300.00
      • Facebook link analysis: $434.61
      • Searchable index of the data set: $100
      • “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
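As a back-of-the-envelope check, the two quoted figures above (the $434.61 analysis cost and the ~$0.018 average spot rate per c1.medium hour) imply roughly how many instance-hours that job consumed. A quick sketch of the arithmetic:

```python
# Rough cost arithmetic using the figures quoted on the slide.
SPOT_RATE_PER_HOUR = 0.018   # average $ per c1.medium instance-hour (quoted)
FB_ANALYSIS_COST = 434.61    # total $ for the Facebook link analysis (quoted)

instance_hours = FB_ANALYSIS_COST / SPOT_RATE_PER_HOUR
print(f"~{instance_hours:,.0f} instance-hours")  # ~24,145 instance-hours
```

In other words, spot pricing made tens of thousands of instance-hours affordable at a few hundred dollars.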
  • Give it a Try
  • ARC Files
      • Files contain the full HTTP response and payload for all pages crawled
      • Format designed by the Internet Archive
      • ARC files are a series of concatenated GZIP documents
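Because ARC files are concatenated gzip members, Python's `gzip` module (which reads multi-member streams back-to-back) can walk the records. A minimal, simplified sketch; it assumes the common ARC v1 layout where each record starts with a one-line header whose last field is the payload length, and skips the full header-field parsing a real reader would do:

```python
import gzip

def iter_arc_records(path):
    """Yield (header_line, payload_bytes) from a .arc.gz file.

    ARC files are concatenated gzip members; gzip.open transparently
    reads them as one continuous stream. Each record begins with a
    one-line header (URL, IP, date, MIME type, length); the last
    field is the payload length in bytes. Simplified sketch only.
    """
    with gzip.open(path, "rb") as f:
        while True:
            header = f.readline()
            if not header:
                break                      # end of stream
            if not header.strip():
                continue                   # blank line between records
            length = int(header.split()[-1])
            payload = f.read(length)
            yield header.decode("utf-8", "replace").strip(), payload
```

Reading record-by-record like this avoids decompressing a whole 100 MB ARC file into memory at once.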
  • Text-Only Files
      • Saved as sequence files, consisting of binary key/value pairs (used extensively in MapReduce as input/output formats)
      • On average 20% the size of the raw content
      • Located in the segment directories, with file names of the form "textData-nnnnn". For example: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
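Normally you would let Hadoop deserialize these sequence files for you, but a quick local sanity check is easy: Hadoop SequenceFiles begin with the 3-byte magic `SEQ` followed by a one-byte format version. A minimal sketch (this only identifies the container format, it is not a full reader):

```python
def is_sequence_file(path):
    """Cheap sanity check for a downloaded textData-nnnnn file.

    Hadoop SequenceFiles start with the magic bytes b'SEQ' plus a
    one-byte format version. A full reader would also parse the
    key/value class names, compression codec, and sync markers --
    in practice you hand that to Hadoop or a SequenceFile library.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
    return len(magic) == 4 and magic[:3] == b"SEQ"
```

Handy after an `s3cmd get` to confirm the download is the binary sequence file you expected rather than, say, an HTML error page.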
  • Metadata Files
      • For each URL, metadata files contain status information, the HTTP response code, and the names and offsets of the ARC files where the raw content can be found
      • Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags)
      • Records in the metadata files are in the same order and have the same file numbers as the text-only content
      • Saved as sequence files
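The value side of a metadata record is JSON-like, so once extracted it is easy to work with. The slides don't give the exact schema, so every field name below is a hypothetical illustration of the kind of record described (status, HTTP code, ARC file name and offset):

```python
import json

# Hypothetical metadata record value. Field names here are
# illustrative only, not the exact Common Crawl schema; the slides
# say each record carries status info, the HTTP response code, and
# the ARC file name and offset where the raw content lives.
record_value = json.dumps({
    "disposition": "SUCCESS",                 # hypothetical field
    "http_result": 200,                       # hypothetical field
    "arc_file": "1341690169105_00.arc.gz",    # hypothetical name
    "arc_offset": 48213,                      # hypothetical offset
    "title": "Example Page",                  # hypothetical field
})

meta = json.loads(record_value)
if meta["http_result"] == 200:
    # With the ARC file name and byte offset, you can range-request
    # just the record you need from S3 instead of downloading and
    # scanning the entire ARC file.
    print(meta["arc_file"], meta["arc_offset"])
```

This is what makes the metadata files useful as an index: filter on them cheaply, then fetch raw content selectively.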
  • Browsing Data
      • You can use s3cmd on your local machine
      • Install using pip: ‘pip install s3cmd’
      • Configure: ‘s3cmd --configure’ (requires AWS keys)
      • Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
  • Common Crawl AMI
      • An Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
      • Amazon AMI ID: "ami-07339a6e"
  • Running Example MR Jobs Using the AMI
      • Usage: ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
      • Example: bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
      • Look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java
  • Code Samples to Try
      • http://github.com/commoncrawl/
      • Pete Warden’s Ruby example: http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
  • Helpful Resources
      • Developer documentation: https://commoncrawl.atlassian.net/
      • Developer discussion list: https://groups.google.com/group/common-crawl
  • Questions?
      • @davelester
      • dave@davelester.org
      • www.davelester.org