Introduction to Common Crawl

Slides about Common Crawl from March 2013, originally presented to the Working with Open Data class at the UC Berkeley School of Information: http://www.ischool.berkeley.edu/courses/i290t-wod


Transcript

  • 1. Introduction to Common Crawl. Dave Lester, March 21, 2013
  • 2. Video intro: https://www.youtube.com/watch?v=ozX4GvUWDm4
  • 3. What is Common Crawl?
    • A non-profit org providing an open repository of web crawl data to be accessed and analyzed by anyone
    • Data is currently shared as a public dataset on Amazon S3
  • 4. Why Open Data?
    • It’s difficult to crawl the web at scale
    • Provides a shared resource for researchers to compare results and recreate experiments
  • 5. 2012 Corpus Stats
    • Total # of web documents: 3.8 billion
    • Total uncompressed content size: 100 TB+
    • # of domains: 61 million
    • # of PDFs: 92.2 million
    • # of Word docs: 6.6 million
    • # of Excel docs: 1.3 million
  • 6. Other Data Sources
    • Blekko - “spam-free search engine”
    • Their metadata includes:
      • Rank on a linear scale, and 0-10 web rank
      • True/false for Blekko’s webspam algorithm thinking this domain or page is spam
      • True/false for Blekko’s pr0n detection algorithm
  • 7. What is Crawled?
    • Check out the new URL search tool: http://commoncrawl.org/url-search-tool/ (try entering ischool.berkeley.edu)
    • First five people to share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS credit! (A starting sketch follows below.)
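    A minimal starting point for that challenge, sketched in Python. It assumes the URL Search tool exports one JSON object per line; since the exact schema is not shown in these slides, the script only reports which keys each record carries.

        # url_search_peek.py -- hypothetical helper, not part of the official examples.
        # Assumes one JSON object per line in the exported file; prints the keys found
        # in each record so you can see which fields are available to build on.
        import json
        import sys

        def main(path):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    print(sorted(record.keys()))

        if __name__ == "__main__":
            main(sys.argv[1])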
  • 8. How is Data Crawled?
    • Customized crawler (it’s open source!)
    • Some basic page rank included; lots of time spent optimizing this and filtering spam
    • See Apache Nutch as an alternative web-scale crawler
    • Future datasets may include other crawl sources
  • 9. Common Crawl Uses
  • 10. Analyze References to Facebook
    • Of ~1.3 billion URLs:
      • 22% of web pages contain Facebook URLs
      • 8% of web pages implement Open Graph tags
    • Among ~500 million hardcoded links to Facebook, only 3.5 million are unique
    • These are primarily for simple social integrations
  • 11. References to FB Pages
    • /merriamwebster 676071 (0.14%)
    • /kevjumba 651389 (0.14%)
    • /placeformusic 618963 (0.13%)
    • /lyricskeeper 517999 (0.11%)
    • /kayak 465179 (0.10%)
    • /twitter 281882 (0.06%)
  • 12. Analyze JavaScript Libraries on the Web
    1. jQuery (82.64%)
    2. Prototype (6.06%)
    3. Mootools (4.83%)
    4. Ext (3.47%)
    5. YUI (1.78%)
    6. Modernizr (0.59%)
    7. Dojo (0.21%)
    8. Ember (0.14%)
    9. Underscore (0.11%)
    10. Backbone (0.09%)
  • 13. Library Co-occurrence
  • 14. Web Data Commons
    • A sub-corpus of Common Crawl data
    • Includes RDFa, hCalendar, hCard, Geo Microdata, hResume, XFN
    • Built using the 2009/2010 corpus
  • 15. (image-only slide; no text captured)
  • 16. Traitor: Associating Concepts. http://www.youtube.com/watch?v=c7Y149RnQjw
  • 17. Associated Costs?
    • Complete data set: ~$1300.00
    • Facebook link analysis: $434.61
    • Searchable index of data set: $100
    • “average per-hour cost for a High-CPU Medium Instance (c1.medium) was about $.018, just under one tenth of the on-demand rate”
  • 18. Give it a Try
  • 19. ARC Files
    • Files contain the full HTTP response and payload for all pages crawled
    • Format designed by the Internet Archive
    • ARC files are a series of concatenated GZIP documents (see the sketch below)
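    A minimal sketch of reading those records locally in Python, assuming an .arc.gz file has already been downloaded (for example with s3cmd get) and follows the Internet Archive ARC v1 layout, where each record starts with a space-separated header line whose last field is the payload length in bytes:

        # read_arc.py -- illustrative sketch, not the official Common Crawl tooling.
        # Python 3's gzip module reads concatenated GZIP members transparently, so the
        # file can be treated as one decompressed stream of ARC records.
        import gzip

        def arc_records(path):
            with gzip.open(path, "rb") as stream:
                while True:
                    header = stream.readline()
                    if not header:
                        break                      # end of file
                    header = header.strip()
                    if not header:
                        continue                   # blank separator between records
                    fields = header.split(b" ")
                    url = fields[0].decode("utf-8", "replace")
                    length = int(fields[-1])       # payload size from the header line
                    payload = stream.read(length)  # raw HTTP response headers + body
                    yield url, payload

        if __name__ == "__main__":
            for url, payload in arc_records("example.arc.gz"):
                print(url, len(payload))

    In the ARC format, the first record in each file is a version block (a filedesc:// entry) rather than a crawled page; the loop above yields it like any other record.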
  • 20. Text-Only Files
    • Saved as sequence files, consisting of binary key/value pairs (used extensively in MapReduce as input/output formats)
    • On average, 20% of the size of the raw content
    • Located in the segment directories, with a file name of "textData-nnnnn". For example: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
  • 21. Metadata Files
    • For each URL, metadata files contain status information, the HTTP response code, and the file names and offsets of the ARC files where the raw content can be found
    • Also contain the HTML title, HTML meta tags, RSS/Atom information, and all anchors/hyperlinks from HTML documents (including all fields on the link tags)
    • Records in the metadata files are in the same order and have the same file numbers as the text-only content
    • Saved as sequence files (a mapper sketch over these records follows below)
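    The Java example referenced on a later slide, ExampleMetadataDomainPageCount, counts pages per domain from these metadata records. A rough Python equivalent can be sketched as a Hadoop Streaming mapper, assuming the sequence file is exposed to streaming as tab-separated text with the page URL as the key; the JSON metadata value is not needed for a simple domain count.

        # domain_count_mapper.py -- hypothetical Hadoop Streaming mapper, not the
        # project's Java example. Assumes each input line is "URL<TAB>metadata JSON".
        import sys
        from urllib.parse import urlparse

        for line in sys.stdin:
            url = line.split("\t", 1)[0]
            domain = urlparse(url).netloc
            if domain:
                # Emit "domain<TAB>1"; a reducer sums the counts per domain.
                print(domain + "\t1")

    A matching reducer only needs to sum the emitted 1s for each domain key.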
  • 22. Browsing Data
    • You can use s3cmd on your local machine
    • Install using pip: ‘pip install s3cmd’
    • Configure: ‘s3cmd --configure’ (requires AWS keys)
    • Demo: s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
  • 23. Common Crawl AMI
    • Amazon Machine Image loaded with Common Crawl example programs, a development Hadoop instance, and scripts to submit jobs to Amazon Elastic MapReduce
    • Amazon AMI ID: "ami-07339a6e"
  • 24. Running Example MR Jobs Using the AMI
    • ccRunExample [ LocalHadoop | AmazonEMR ] [ ExampleName ] ( S3Bucket )
    • bin/ccRunExample LocalHadoop ExampleMetadataDomainPageCount aws-publicdatasets/common-crawl/parse-output/segment/1341690167474/
    • Look at the code: nano src/java/org/commoncrawl/examples/ExampleMetadataDomainPageCount.java (a sketch for summarizing the job output follows below)
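    After a local run, the per-domain counts can be summarized with a short script. This is a hypothetical post-processing step, assuming the job writes plain-text part-* files containing "domain<TAB>count" lines into an output directory; the exact output path and format depend on how the example is configured.

        # top_domains.py -- hypothetical helper for inspecting the example's output.
        # Usage: python top_domains.py /path/to/job/output
        import glob
        import sys

        counts = {}
        for path in glob.glob(sys.argv[1] + "/part-*"):
            with open(path) as f:
                for line in f:
                    domain, _, count = line.rstrip("\n").partition("\t")
                    if count:
                        counts[domain] = counts.get(domain, 0) + int(count)

        # Print the 20 domains with the most pages.
        for domain, count in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
            print("%10d  %s" % (count, domain))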
  • 25. Code Samples to Try
    • http://github.com/commoncrawl/
    • Pete Warden’s Ruby example: http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html
  • 26. Helpful Resources
    • Developer documentation: https://commoncrawl.atlassian.net/
    • Developer discussion list: https://groups.google.com/group/common-crawl
  • 27. Questions?
    • @davelester
    • dave@davelester.org
    • www.davelester.org