Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• ARC files, JSON metadata, text files
• Available to anyone
ARC Files - Raw Content
JSON Metadata Files
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
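The raw-content records in the ARC files above follow a simple layout: a one-line header (`url ip date content-type length`) followed by `length` bytes of payload. A minimal sketch of iterating over an uncompressed ARC v1 file (the path and field handling here are illustrative assumptions, not part of the original slides):

```python
# Sketch: iterate over records in an uncompressed ARC v1 file.
# Each record starts with a header line of the form
#   <url> <ip> <YYYYMMDDhhmmss> <content-type> <length>
# followed by <length> bytes of payload and a separator newline.
def iter_arc_records(path):
    with open(path, "rb") as f:
        while True:
            header = f.readline()
            if not header:
                break  # end of file
            parts = header.decode("utf-8", "replace").split()
            if len(parts) != 5:
                continue  # skip blank separator lines
            url, ip, date, content_type, length = parts
            payload = f.read(int(length))
            f.readline()  # consume the newline between records
            yield {"url": url, "date": date,
                   "content_type": content_type, "payload": payload}
```

Real Common Crawl ARC files are gzip-compressed, so in practice you would wrap this in a gzip reader or use an existing ARC/WARC library.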
Text Files - Text Only
Changes between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
A corpus of anchor-text–Wikipedia-concept–count triples
derived from the Common Crawl dataset, to benefit research on
word sense disambiguation (WSD), natural language processing (NLP),
and information retrieval (IR).
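One way to build such triples is to extract hyperlinks pointing at Wikipedia articles and count each (anchor text, concept) pair. A rough sketch under that assumption, using a simple regex over page HTML (a real pipeline would use a proper HTML parser and handle URL decoding):

```python
import re
from collections import Counter

# Matches <a href="https://en.wikipedia.org/wiki/CONCEPT">anchor text</a>.
# Illustrative only: ignores redirects, URL escaping, and nested markup.
LINK = re.compile(
    r'<a[^>]+href="https?://en\.wikipedia\.org/wiki/([^"#?]+)"[^>]*>([^<]+)</a>',
    re.IGNORECASE)

def anchor_concept_counts(pages):
    """Count (anchor text, Wikipedia concept) pairs across HTML pages."""
    counts = Counter()
    for html in pages:
        for concept, anchor in LINK.findall(html):
            counts[(anchor.strip().lower(), concept)] += 1
    return counts
```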
Explicit Topic Modeling
• Given a sentence, it can help identify entities (person, location,
organization) in the sentence and map them onto Wikipedia concepts.
• Given a concept (represented as a Wikipedia page), it can tell what
are the most common terms people use to describe the concept.
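The second direction, from concept to common terms, amounts to grouping the corpus triples by concept and ranking anchor terms by count. A toy sketch with made-up illustrative triples (the numbers are not from the corpus):

```python
from collections import Counter, defaultdict

# Illustrative (anchor term, concept, count) triples, not real corpus data.
triples = [
    ("big apple", "New_York_City", 120),
    ("new york", "New_York_City", 950),
    ("nyc", "New_York_City", 400),
    ("paris", "Paris", 800),
]

# Group counts by concept so lookups are a dictionary access.
by_concept = defaultdict(Counter)
for term, concept, count in triples:
    by_concept[concept][term] += count

def common_terms(concept, n=3):
    """Return the n anchor terms most often used for a concept."""
    return [term for term, _ in by_concept[concept].most_common(n)]
```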