London HUG
 

    London HUG Presentation Transcript

    • London HUG • Common Crawl: An Open Repository of Web Data • What Does the Data World Mean to Society? • Lisa Green • 1 October 2012 / 10 October 2012
    • Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
    • Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
    • Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
    • Still Nascent • Even cheaper storage • Even cheaper compute • Education • Open Data. Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
    • Gratis • Proprietary • Libre • Commercial
    • Progress • Insight • Analysis • Data
    • Gil Elbaz
    • Common Crawl Data • ~8 billion web pages • ~120 TB • 2008-2012 • ARC files, JSON metadata, text files • Available to anyone
    • ARC Files - Raw Content • Metadata: status information, HTTP response code, file names & offsets of ARC files, HTML title, HTML meta tags, RSS/Atom information, all anchors/hyperlinks • Text Files - Text Only • http://commoncrawl.org/get-started
    • Change between 2010 and 2012 • URLs with embedded data +6% • Microdata +14% • RDFa +26% • http://webdatacommons.org
    • 22% of Web pages contain Facebook URLs • 8% of Web pages implement Open Graph tags
    • http://wikientities.appspot.com • A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR. • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts. • Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
    • Mapping French websites related to Open Data
    • Other Use Examples • Apache Giraph Testing • Maplight • Tineye • Factual • Sentiment Analysis Projects
    • In Development • N-gram and Link Graph Extracts • Pig Reader • More Frequent Full Crawls • Focused Subset Crawls at High Frequency • Open Educational Resources
    • Thank You • Lisa Green • lisa@commoncrawl.org • www.commoncrawl.org • @commoncrawl • @boudicca • London HUG: What Does the Data World Mean to Society? • 1 October 2012
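
The slides mention that the metadata includes all anchors/hyperlinks, and that one project built an anchortext-WikipediaConcept-Count corpus from the crawl. As a rough illustration only, the sketch below shows how such counts might be built from raw HTML with Python's standard library. It is not Common Crawl's or the wikientities project's code. The names `AnchorExtractor`, `extract_anchors`, `wikipedia_concept`, and `count_anchor_concepts` are hypothetical, and treating only `en.wikipedia.org/wiki/...` links as concepts is an assumption made for the example.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class AnchorExtractor(HTMLParser):
    """Hypothetical helper: collects (href, anchor text) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return every (href, anchor text) pair found in one HTML document."""
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.anchors


def wikipedia_concept(href):
    """Article name for an English Wikipedia link, else None.

    Assumption: only en.wikipedia.org/wiki/... links count as concepts.
    """
    parsed = urlparse(href)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):])


def count_anchor_concepts(pages):
    """Map each concept to a Counter of the (lowercased) anchor texts used for it."""
    counts = defaultdict(Counter)
    for html in pages:
        for href, text in extract_anchors(html):
            concept = wikipedia_concept(href)
            text = text.lower()
            if concept and text:
                counts[concept][text] += 1
    return counts
```

With counts built this way, something like `count_anchor_concepts(pages)["London"].most_common(3)` would list the anchor texts most often used to link to the London article, which is the "most common terms people use to describe the concept" idea from the slide. A real run over the crawl would read pages from the ARC or metadata files rather than an in-memory list.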