Common Crawl: An Open Repository of Web Data

Talk given by Lisa Green from the Common Crawl Foundation at the Hadoop User Group UK meetup on 10 October in London

Presentation Transcript

  • London HUG • Common Crawl: An Open Repository of Web Data • Lisa Green • 10 October 2012
  • Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
  • Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
  • Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
  • Still Nascent
      • Even cheaper storage
      • Even cheaper compute
      • Education
      • Open Data
    Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
  • Gratis • Proprietary • Libre • Commercial
  • Progress • Insight • Analysis • Data
  • Gil Elbaz
  • Common Crawl Data
      • ~8 billion web pages
      • ~120 TB
      • 2008-2012
      • ARC files, JSON metadata, text files
      • Available to anyone
  • ARC Files - Raw Content
    Metadata
      • Status information
      • HTTP response code
      • File names & offsets of ARC files
      • HTML title
      • HTML meta tags
      • RSS/Atom information
      • All anchors/hyperlinks
    Text Files - Text Only
    http://commoncrawl.org/get-started
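
For readers who have not handled ARC files, the following is a minimal sketch of parsing a single record header. It assumes the ARC version 1 layout published by the Internet Archive, in which each record starts with one space-separated line holding the URL, IP address, 14-digit archive date, content type, and body length in bytes. The sample line, the `ArcRecordHeader` type, and the `parse_arc_header` function are illustrative and do not come from the talk.

```python
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str  # assumed format: YYYYMMDDhhmmss
    content_type: str
    length: int        # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC v1 record header line (assumed five-field layout)."""
    # Split from the right so a URL containing spaces stays in one piece.
    parts = line.strip().rsplit(" ", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


# Hypothetical header line, made up for illustration.
header = parse_arc_header(
    "http://example.com/ 192.0.2.1 20120101000000 text/html 1234"
)
print(header.url, header.content_type, header.length)
```

The `length` field tells a reader how many bytes of raw content follow the header, which is what lets the JSON metadata point into an ARC file by file name and offset.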
  • Change between 2010 and 2012
      • URLs with embedded data +6%
      • Microdata +14%
      • RDFa +26%
    http://webdatacommons.org
  • 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
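
The slides do not say how these two percentages were measured. As a rough sketch of the kind of per-page check involved, the code below flags a page as using Open Graph if it has a `<meta>` tag whose `property` starts with `og:`, and as containing a Facebook URL if any `href` or `src` attribute mentions `facebook.com`. Both rules are simplifying assumptions, and the `SocialSignalParser` class and sample pages are hypothetical.

```python
from html.parser import HTMLParser


class SocialSignalParser(HTMLParser):
    """Records whether a page has Open Graph tags or Facebook URLs."""

    def __init__(self):
        super().__init__()
        self.has_open_graph = False
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Open Graph tags look like <meta property="og:title" content="...">.
        if tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        # Crude substring match; a real study would parse the host name.
        for name in ("href", "src"):
            if "facebook.com" in (attr.get(name) or ""):
                self.has_facebook_url = True


def social_signals(html: str) -> tuple:
    """Return (has_open_graph, has_facebook_url) for one HTML document."""
    parser = SocialSignalParser()
    parser.feed(html)
    return parser.has_open_graph, parser.has_facebook_url


# Made-up sample pages for illustration.
tagged_page = (
    '<html><head><meta property="og:title" content="Demo"></head>'
    '<body><a href="http://www.facebook.com/demo">Like us</a></body></html>'
)
plain_page = "<html><body><a href='http://example.com/'>Home</a></body></html>"

print(social_signals(tagged_page))
print(social_signals(plain_page))
```

Running such a check over every page in a crawl and dividing by the page count would give figures comparable in kind to the ones on the slide.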
  • http://wikientities.appspot.com
    A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR.
      • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
      • Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
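
The talk does not describe how the wikientities corpus was built. One plausible way to derive anchor text, concept, and count triples from crawled hyperlinks is sketched below, assuming the input is a list of (anchor text, href) pairs and that a Wikipedia concept is identified by the page name after `/wiki/`. The function names and sample links are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import unquote, urlparse


def wikipedia_concept(href: str):
    """Return the Wikipedia page name a link points to, or None."""
    parsed = urlparse(href)
    if not parsed.netloc.endswith("wikipedia.org"):
        return None
    if not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]) or None


def count_anchor_concepts(links):
    """Map each concept to a Counter of the anchor texts used for it."""
    counts = defaultdict(Counter)
    for anchor_text, href in links:
        concept = wikipedia_concept(href)
        if concept:
            # Normalize case and whitespace so "NYC" and "nyc" are merged.
            counts[concept][anchor_text.strip().lower()] += 1
    return counts


# Made-up links for illustration.
links = [
    ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("nyc", "http://en.wikipedia.org/wiki/New_York_City"),
    ("home", "http://example.com/"),
]
counts = count_anchor_concepts(links)
print(counts["New_York_City"].most_common())
```

Reading the table by concept gives the most common terms people use for a page; inverting it by anchor text gives candidate concepts for a phrase found in a sentence.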
  • Mapping French websites related to Open Data
  • Other Use Examples
      • Apache Giraph Testing
      • Maplight
      • Tineye
      • Factual
      • Sentiment Analysis Projects
  • In Development
      • N-gram and Link Graph Extracts
      • Pig Reader
      • More Frequent Full Crawls
      • Focused Subset Crawls at High Frequency
      • Open Educational Resources
  • Thank You • London HUG • Lisa Green • lisa@commoncrawl.org • www.commoncrawl.org • @commoncrawl • @boudicca