Common Crawl: An Open Repository of Web Data


Talk given by Lisa Green of the Common Crawl Foundation at the Hadoop User Group UK meetup in London on 10 October 2012

Published in: Technology


  1. 1. London HUG. Common Crawl: An Open Repository of Web Data. Lisa Green, 10 October 2012
  2. 2. Photo license: Public Domain Origin:
  3. 3. Photo license: CC-BY-SA Origin:
  4. 4. Image license: CC-BY Origin:
  5. 5. Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data. Image license: CC-BY. Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  6. 6. Gratis • Proprietary • Libre • Commercial
  7. 7. Progress • Insight • Analysis • Data
  8. 8. Gil Elbaz
  9. 9. Common Crawl Data • ~8 billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
  10. 10. ARC Files - Raw Content. Metadata: • Status information • HTTP response code • File names & offsets of ARC files • HTML title • HTML meta tags • RSS/Atom information • All anchors/hyperlinks. Text Files - Text Only
  11. 11. Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26%
  12. 12. • 22% of web pages contain Facebook URLs • 8% of web pages implement Open Graph tags
  13. 13. http://wikientities.appspot.com A corpus of anchortext-WikipediaConcept-Count from the Common Crawl dataset, to benefit research on WSD, NLP and IR. Explicit Topic Modeling: given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia. Given a Wikipedia concept, it can tell which terms people most commonly use to describe that concept.
  14. 14. Mapping French websites related to Open Data
  15. 15. Other Use Examples • Apache Giraph Testing • MapLight • TinEye • Factual • Sentiment Analysis Projects
  16. 16. In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
  17. 17. Thank You. London HUG. Lisa Green, @commoncrawl, @boudicca
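
The anchortext-WikipediaConcept-Count corpus on slide 13 can be pictured with a small sketch. The slides do not show how the corpus was actually built, so everything here is an assumption for illustration: the input is taken to be (anchor text, href) pairs such as those listed under "All anchors/hyperlinks" in the metadata, and the function names `wikipedia_concept`, `count_anchor_concepts` and `terms_for_concept` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse, unquote


def wikipedia_concept(href):
    """Return the article title of a Wikipedia /wiki/ link, or None otherwise."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        return None
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None
    title = unquote(parsed.path[len(prefix):])
    return title or None


def count_anchor_concepts(links):
    """Count (normalized anchor text, concept) pairs over (anchor_text, href) pairs."""
    counts = Counter()
    for anchor_text, href in links:
        concept = wikipedia_concept(href)
        # Lowercase and collapse whitespace so "NYC" and "nyc " count together.
        text = " ".join(anchor_text.lower().split())
        if concept and text:
            counts[(text, concept)] += 1
    return counts


def terms_for_concept(counts, concept, n=3):
    """Return the n anchor texts most often used to link to a concept."""
    per_term = Counter()
    for (text, linked_concept), k in counts.items():
        if linked_concept == concept:
            per_term[text] += k
    return per_term.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample links, not taken from the crawl.
    sample = [
        ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
        ("nyc", "http://en.wikipedia.org/wiki/New_York_City"),
        ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
        ("home", "http://example.com/"),
    ]
    print(terms_for_concept(count_anchor_concepts(sample), "New_York_City"))
```

Looking up counts by concept gives the "most common terms for a concept" direction described on the slide; indexing the same counts by anchor text instead would give candidate concepts for a phrase found in a sentence.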