An introductory deck on the Web of Data for my team: a primer on the basics of the Semantic Web and Linked Open Data, followed by DBpedia, the Linked Data Integration Framework (LDIF), the Common Crawl database, and Web Data Commons.

THIS IS A MIXED DECK WITH SLIDES FROM PROF. DR. CHRIS BIZER, OLIVER GRISEL, SOREN AUER AND JESSE WANG
THE WEB OF DATA

AGENDA
• Introduction to the Web of (Open Semantic) Data
• Linked Open Data and 5-star Data Principles
• DBpedia – Query Wikipedia as a database
• Linked Data Integration Framework
• Common Crawl Database
• Web Data Commons
• Summary

11/7/11
“To a computer, then, the web is a flat, boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

“This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

“Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/

THE WEB OF DATA - HOW?
• RDF / Triple Stores / SPARQL
  • Graph stores with dynamic schemas
  • Strong interoperability
• JSON-LD
  • Upgrade your JSON with scoped vocabularies
  • Web / Mobile / JS developer friendly
• RDFa + schema.org & rNews
  • Publish annotation in structured markup
  • Vocabulary understood by Search Engines

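To make "graph stores with dynamic schemas" concrete, here is a minimal sketch of the triple idea behind RDF and SPARQL, written in plain Python rather than against a real triple store. The `ex:` names, the sample data and the `match` helper are illustrative assumptions, not part of any RDF library.

```python
# Illustrative only: a toy in-memory "triple store" (a set of tuples).
triples = {
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Paris", "ex:country", "ex:France"),
}


def match(store, s=None, p=None, o=None):
    """Return sorted triples matching the pattern; None is a wildcard,
    playing the role of a variable in a SPARQL triple pattern."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Dynamic schema": adding a new property needs no table migration.
triples.add(("ex:Berlin", "ex:population", "3500000"))

# Roughly: SELECT ?s WHERE { ?s rdf:type ex:City }
cities = [s for s, _, _ in match(triples, p="rdf:type", o="ex:City")]
```

A real deployment would use an RDF library or a SPARQL endpoint; the point here is only that every fact has the same subject, predicate, object shape.
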
THE WEB OF DATA - WHAT?
• Linked Open Data
  • Started with DBpedia – Wikipedia as database
  • As of September 2011, the LOD cloud has nearly 300 datasets
• Web Data Commons
  • Based on Common Crawl Database
  • LOD + OpenGraph + Schema.org
• Knowledge-Bases?
  • Can we be a valuable contributor?

LINKED DATA PARADIGM
• Use URIs as names for things.
• Use HTTP URIs so that people can look up those names.
• When someone looks up a URI, provide useful information.
• Include links to other URIs, so that they can discover more things.

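"Looking up a name" usually means an HTTP GET with content negotiation. The sketch below prepares such a request with the standard library only. The helper names are hypothetical, the DBpedia URI is just an example, and the request is built but not sent, so the snippet runs without network access.

```python
import urllib.request


def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def look_up(uri):
    """Dereference the URI. Needs network access, so it is not called here."""
    with urllib.request.urlopen(build_lookup_request(uri), timeout=10) as resp:
        return resp.read().decode("utf-8")


request = build_lookup_request("http://dbpedia.org/resource/Berlin")
```

Calling `look_up(...)` on a Linked Data URI should return a description of the thing, including links to other URIs that can be followed the same way.
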
5 ★ OPEN DATA
Tim Berners-Lee, inventor of the Web and Linked Data initiator, suggested a 5 star deployment scheme for Open Data. Here, we give examples for each step of the stars and explain costs and benefits that come along with it.
http://5stardata.info/

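As a rough illustration of the scheme, the sketch below scores a dataset description against the five cumulative levels as summarized on 5stardata.info. The dictionary keys and the `star_rating` helper are assumptions made up for this example.

```python
# The five levels, lowest first. Keys are hypothetical flags for this sketch.
STARS = [
    ("on_web_open_license", "available on the Web (any format) under an open license"),
    ("structured", "available as structured, machine-readable data"),
    ("non_proprietary_format", "published in a non-proprietary open format"),
    ("uses_uris", "uses URIs to denote things"),
    ("links_to_other_data", "links to other data to provide context"),
]


def star_rating(dataset):
    """Count stars cumulatively: a level only counts if all lower levels hold."""
    stars = 0
    for key, _description in STARS:
        if not dataset.get(key, False):
            break
        stars += 1
    return stars


scanned_pdf = {"on_web_open_license": True}
csv_file = {"on_web_open_license": True, "structured": True, "non_proprietary_format": True}
linked_data = {key: True for key, _ in STARS}
```

Here `star_rating(scanned_pdf)` is 1, `star_rating(csv_file)` is 3 and `star_rating(linked_data)` is 5.
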
DBPEDIA
Joint project to
• create a huge, multi-lingual knowledge base
• by extracting structured information from Wikipedia
• make the knowledge base available on the Web as Linked Data under an open license

WE HELPED DBPEDIA (3.5, 2010.4)
• Extraction framework completely rewritten
• Mapping language redesigned
• Hosted on a wiki http://mappings.dbpedia.org
• A lot more things extracted
• …
[Chart: Total Triples, DBPEDIA 3.4 vs. DBPEDIA 3.5]

DBPEDIA 3.8 (NOW)
• Structured Information in Wikipedia
  • infoboxes
  • geo-coordinates
  • categorization of articles
  • inter-language links
  • links to images and external webpages
  • titles and abstracts
  • tables and lists
• Currently 111 localized editions

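Querying Wikipedia "as a database" comes down to sending SPARQL to the DBpedia endpoint. The following is a minimal sketch using only the standard library. The query itself and the helper name are illustrative assumptions, and the request is only constructed here, not sent, since the public endpoint may be unavailable or rate limited.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # public endpoint; availability not guaranteed

# Illustrative query: the English label of the resource for Berlin.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_sparql_request(endpoint, query):
    """Encode the query as the 'query' URL parameter (SPARQL protocol, GET)
    and ask for JSON results via the Accept header."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` and decoding the body with `json.loads` would yield the result bindings, assuming the endpoint is reachable.
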
LEARNING LINKAGE RULES USING GENETIC PROGRAMMING
• Based on existing reference links
• GenLink learns
  • comparisons
  • aggregations
  • transformations
  • weights
• Instead of subtree crossover, we use a set of custom crossover operators
  • Aggregation Crossover
  • Transformation Crossover

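To show what the four learned ingredients look like together, here is a hand-written toy linkage rule in Python. It is not GenLink's rule representation and nothing in it is learned: the property names, weights, threshold and the use of `difflib` as the comparison are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def lower(value):
    """Transformation: normalise case and surrounding whitespace."""
    return value.lower().strip()


def similarity(a, b):
    """Comparison: string similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# One entry per comparison: (property, transformation, comparison, weight).
RULE = [
    ("title", lower, similarity, 3.0),
    ("author", lower, similarity, 1.0),
]


def score(rule, rec_a, rec_b):
    """Aggregation: weighted average of all comparison scores."""
    total_weight = sum(weight for *_rest, weight in rule)
    weighted = sum(
        weight * compare(transform(rec_a[prop]), transform(rec_b[prop]))
        for prop, transform, compare, weight in rule
    )
    return weighted / total_weight


def is_link(rule, rec_a, rec_b, threshold=0.8):
    """Two records are linked if the aggregated score reaches the threshold."""
    return score(rule, rec_a, rec_b) >= threshold


a = {"title": "Learning Expressive Linkage Rules", "author": "R. Isele"}
b = {"title": "learning expressive linkage rules ", "author": "R. Isele"}
c = {"title": "A Survey of Crawlers", "author": "J. Doe"}
```

With these records, `is_link(RULE, a, b)` holds and `is_link(RULE, a, c)` does not. A learner in the spirit of the slide would search over the choice of transformations, comparisons, aggregation and weights instead of fixing them by hand.
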
RESULTS FOR THE CORA EVALUATION DATA SET
• Citations to research papers from the Cora research paper search engine
• Attributes: Title, Author, Venue, Date of publication
• Reference Links: 1600
• GenLink achieved an F-measure of 96.6% against the validation set.
• Carvalho et al. report an F-measure of 91.0% against the validation set (last line).

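For readers new to the metric, the sketch below computes an F-measure from link counts, assuming the usual balanced F1 definition (harmonic mean of precision and recall). The counts in the example are made up and are not the Cora results.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) of a set of generated links against reference links."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct links, 10 wrong links, 30 missed links.
example = f_measure(true_positives=90, false_positives=10, false_negatives=30)
```

In this example precision is 0.9 and recall is 0.75, so the F-measure lands between the two, closer to the lower value.
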
LEARNED RULE
Robert Isele and Christian Bizer: Learning Expressive Linkage Rules using Genetic Programming. PVLDB 5(11):1638-1649, 2012

ACTIVE LEARNING OF LINKAGE RULES
• Query Strategy: Select the link candidate for which the linkage rules in the current population disagree the most.

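The query strategy can be sketched in a few lines. The disagreement measure used here (size of the minority vote) is one simple choice and an assumption of this sketch; the rules are stand-in threshold functions, not learned linkage rules.

```python
def disagreement(rules, candidate):
    """Fraction of rules in the minority: 0.0 if all agree, 0.5 for an even split."""
    votes = [rule(candidate) for rule in rules]
    yes = sum(votes)
    return min(yes, len(votes) - yes) / len(votes)


def select_query(rules, candidates):
    """Pick the link candidate on which the current population disagrees most."""
    return max(candidates, key=lambda c: disagreement(rules, c))


# Stand-in population: each "rule" accepts a candidate above its own threshold.
population = [lambda c, t=t: c["sim"] >= t for t in (0.3, 0.5, 0.7, 0.9)]

candidates = [
    {"id": "pair-1", "sim": 0.10},  # every rule rejects
    {"id": "pair-2", "sim": 0.60},  # rules are split two against two
    {"id": "pair-3", "sim": 0.95},  # every rule accepts
]

to_label = select_query(population, candidates)
```

Here `to_label` is `pair-2`: asking the user about it removes the most uncertainty, while the other two pairs would teach the population nothing new.
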
STRUCTURED DATA ON THE WEB
WE HAVE THE TOOLS NOW

HTML-EMBEDDED STRUCTURED DATA ON THE WEB
More and more websites semantically mark up the content of their HTML pages.
• Microformats
• Microdata
• RDFa

MICROFORMATS
• Microformat effort dates back to 2003
• Small set of fixed formats
  • hCard: people, companies, organizations, and places
  • XFN: relationships between people
  • hCalendar: calendaring and events
  • hListing: small-ads; classifieds
  • hReview: reviews of products, businesses, events
• Shortcoming of Microformats
  • cannot represent arbitrary kinds of data
• Indexed by Google and Yahoo since 2009

RDFA
• serialization format for embedding RDF data into HTML pages
• proposed in 2004, W3C Recommendation in 2008
• can be used together with any vocabulary
• can assign URIs as global primary keys to entities

OPEN GRAPH PROTOCOL
• allows site owners to determine how entities are described in Facebook
• relies on RDFa for encoding data in HTML pages
• available since April 2010

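In practice, Open Graph data sits in `<meta property="og:..." content="...">` tags in the page head. The sketch below collects those tags with the standard library HTML parser. The sample page is made up, and the parser ignores everything a full RDFa processor would handle (prefixes, nesting, arrays of values).

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]


# Made-up sample page for illustration.
PAGE = """
<html><head>
<meta property="og:title" content="The Rock">
<meta property="og:type" content="video.movie">
<meta name="description" content="not Open Graph">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(PAGE)
og = parser.properties
```

After parsing, `og` holds the two `og:` properties and skips the plain `description` tag.
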
MICRODATA
• alternative technique for embedding structured data
• proposed in 2009 by WHATWG as part of HTML5 work
• tries to be simpler than RDFa (5 new attributes instead of 8)
• W3C currently tries to reconcile the two alternative proposals

SCHEMA.ORG
• asks site owners to embed data to enrich search results
• 200+ Types: Event, Organization, Person, Place, Product, Review
• Encoding: Microdata or alternatively RDFa

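As an illustration of the Microdata encoding, the sketch below reads a single flat schema.org item out of an HTML fragment using the `itemscope`, `itemtype` and `itemprop` attributes. It is a deliberately simplified reader and not the full Microdata extraction algorithm: nested items, `itemref`, and attribute-valued properties such as `href` or `content` are ignored. The sample markup is made up.

```python
from html.parser import HTMLParser


class FlatMicrodataParser(HTMLParser):
    """Read one flat item: its type and the text of each itemprop element."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # itemprop name of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.properties[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


# Made-up sample markup for illustration.
FRAGMENT = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Engineer</span>
</div>
"""

parser = FlatMicrodataParser()
parser.feed(FRAGMENT)
person = parser.properties
```

The extracted pairs map directly onto triples: the item is the subject, each `itemprop` is a predicate in the schema.org vocabulary, and the element text is the object.
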
USAGE OF SCHEMA.ORG DATA @ GOOGLE
• Answers to fact queries
• Data snippets within search results
• Data tables within search results

THE COMMON CRAWL CORPORA
• Provides two web corpora on Amazon S3
  • 2009/2010 Corpus: 2.5 billion HTML pages
  • June 2012 Corpus: 3.0 billion HTML pages
• The June 2012 Corpus
  • unique HTML pages: 3,005,629,093
  • pay-level-domains (PLDs): 40.6 million
  • size of the corpus in compressed form: 48 terabytes
• Crawler uses PageRank to decide which pages to retrieve
  • snapshot of the popular part of the Web
  • number of pages per site varies widely
    • youtube.com: 93.1 million pages
    • 37.5 million PLDs with fewer than 100 pages

WEB DATA COMMONS
• WebDataCommons.org Project
  • extracts all Microformat, Microdata, RDFa data from the Common Crawl
  • provides the extracted data for free download
• Two extraction runs
  • 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
  • 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
• Joint project of

THE WDC EXTRACTION FRAMEWORK
• 700,000 input files queued in SQS
• EC2 workers take tasks from SQS
• Workers read and write S3 buckets
[Diagram: S3, SQS, EC2; CC input files 42, 43, ... and WDC result files R42, R43, ...]
• Workers: 100 spot instances of type c1.xlarge (7 GB RAM, 8 cores)
• 5,600 machine hours
• 398 US$

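The queue-and-workers pattern on this slide can be sketched locally with the standard library. Everything here is a stand-in: a `queue.Queue` plays the role of SQS, two dictionaries play the role of the S3 input and output buckets, threads play the role of EC2 workers, and the "extraction" is a placeholder. No AWS API is used.

```python
import queue
import threading

# Stand-ins for the S3 input bucket, the S3 output bucket and the SQS queue.
input_bucket = {f"cc/file-{i}": f"<html>page {i}</html>" for i in range(20)}
output_bucket = {}
tasks = queue.Queue()
for key in input_bucket:
    tasks.put(key)

lock = threading.Lock()


def worker():
    """Take tasks until the queue is empty; read input, write a result."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return
        result = len(input_bucket[key])  # placeholder for the real extraction
        with lock:
            output_bucket[key.replace("cc/", "wdc/")] = result
        tasks.task_done()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Back-of-the-envelope cost from the slide's figures.
cost_per_machine_hour = 398 / 5600  # about 0.07 US$ per machine hour
```

Because workers only share the queue and the buckets, adding more workers shortens the wall-clock time without changing the code, which is what makes cheap spot instances a good fit.
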
WEBSITES CONTAINING STRUCTURED DATA (CC 2012)
• 2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%).
• 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%).

POPULARITY OF WEBSITES CONTAINING STRUCTURED DATA
Grouped by Alexa Website Popularity Rank (site rank based on amount of page views)

WEB COMMON DATA GLOBAL DATA SPACE
PRESENT FUTURE

TAKEAWAYS
• Linked Open Data is a great vision
• The LOD cloud contains lots of data that we CAN consume
• The Common Crawl database lowers the bar for web-scale R&D
• Web Data Commons is a good-quality semantic dataset
• Web Data Commons offers opportunities for easy access to large amounts of semantic data

CHALLENGES
• LOD is still sparse or at least spotty
• LOD is mostly brittle (few built-in statistics)
• The global data space has only just started forming
• Data integration requires effort, and the results may contain errors
• Sophisticated Natural Language Processing work is required to get data analyzed and utilized

THANK YOU!
CREDITS: CHRIS BIZER, OLIVER GRISEL, SOREN AUER