London HUG

  1. London HUG • Common Crawl: An Open Repository of Web Data • Lisa Green • 10 October 2012 • What Does The Data World Mean to Society? • Lisa Green • 1 October 2012
  2. Photo license: Public Domain. Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
  3. Photo license: CC-BY-SA. Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
  4. Image license: CC-BY. Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  5. Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data. Image license: CC-BY. Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  6. Gratis • Proprietary • Libre • Commercial
  7. Progress • Insight • Analysis • Data
  8. Gil Elbaz
  9. Common Crawl Data • ~8 billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
  10. ARC Files - Raw Content. Metadata • Status information • HTTP response code • File names & offsets of ARC files • HTML title • HTML meta tags • RSS/Atom information • All anchors/hyperlinks. Text Files - Text Only. http://commoncrawl.org/get-started
  11. Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26%. http://webdatacommons.org
  12. 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
  13. http://wikientities.appspot.com A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts. Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
  14. Mapping French websites related to Open Data
  15. Other Use Examples • Apache Giraph Testing • Maplight • Tineye • Factual • Sentiment Analysis Projects
  16. In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
  17. Thank You • Lisa Green • lisa@commoncrawl.org • www.commoncrawl.org • @commoncrawl • London HUG • What Does The Data World Mean to Society? • Lisa Green @boudicca • 1 October 2012
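
Slide 13 describes a corpus of anchortext-WikipediaConcept-Count built from the crawl's hyperlink data. The idea can be illustrated with a minimal Python sketch. This is not the wikientities project's actual code: the sample links, the function names wikipedia_concept and count_anchor_concepts, and the restriction to en.wikipedia.org are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse, unquote

# Hypothetical sample of (anchor text, link target) pairs. Real input would
# come from the anchor/hyperlink information in the crawl metadata.
SAMPLE_LINKS = [
    ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("London", "http://en.wikipedia.org/wiki/London"),
    ("home page", "http://example.com/"),
]


def wikipedia_concept(url):
    """Return the page title for an English Wikipedia article URL, else None."""
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org":
        return None
    if not parsed.path.startswith("/wiki/"):
        return None
    title = unquote(parsed.path[len("/wiki/"):])
    return title or None


def count_anchor_concepts(links):
    """Map each Wikipedia concept to a Counter of the anchor texts pointing at it."""
    counts = defaultdict(Counter)
    for text, url in links:
        concept = wikipedia_concept(url)
        text = text.strip().lower()  # assumed normalization: trim and lowercase
        if concept and text:
            counts[concept][text] += 1
    return counts


counts = count_anchor_concepts(SAMPLE_LINKS)
# Most common terms people use to describe one concept.
print(counts["New_York_City"].most_common(2))
```

Looking up a concept in the result gives the terms most often used to link to it, which is the second use described on the slide. Going the other way (from a phrase in a sentence to candidate concepts) would need the same counts indexed by anchor text.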
