The Web of data and web data commons


An introductory deck on the Web of Data for my team, covering Semantic Web basics and a Linked Open Data primer, then DBpedia, the Linked Data Integration Framework (LDIF), the Common Crawl Database, and Web Data Commons.

Published in: Education, Technology

  1. 1. THE WEB OF DATA. This is a mixed deck with slides from Prof. Dr. Chris Bizer, Oliver Grisel, Soren Auer and Jesse Wang.
  2. 2. AGENDA: Introduction to the Web of (Open Semantic) Data; Linked Open Data and 5-star Data Principles; DBpedia – Query Wikipedia as a database; Linked Data Integration Framework; Common Crawl Database; Web Data Commons; Summary
  3. 3. 11/7/11 “To a computer, then, the web is a flat, boring world devoid of meaning.” Tim Berners-Lee
  4. 4. “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them.” Tim Berners-Lee
  5. 5. “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” Tim Berners-Lee
  6. 6. THE WEB OF DATA - HOW? • RDF / Triple Stores / SPARQL: graph stores with dynamic schemas; strong interoperability • JSON-LD: upgrade your JSON with scoped vocabularies; Web / Mobile / JS developer friendly • RDFa + & rNews: publish annotation in structured markup; vocabulary understood by search engines
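To make "upgrade your JSON with scoped vocabularies" concrete, here is a minimal sketch of the idea behind a JSON-LD `@context`: short JSON keys are mapped to full vocabulary IRIs. The document, the person and the helper `expand_keys` are made up for illustration, and the helper covers only flat key mapping, not the full JSON-LD expansion algorithm.

```python
import json

# A plain JSON document plus a JSON-LD style "@context" that scopes each
# short key to a vocabulary IRI (values here are illustrative only).
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": "http://schema.org/url",
    },
    "name": "Example Person",
    "homepage": "http://example.org/",
}


def expand_keys(document: dict) -> dict:
    """Replace each top-level key by the IRI its context assigns to it.

    Hypothetical helper: handles flat documents only and leaves keys
    without a context entry unchanged.
    """
    context = document.get("@context", {})
    return {
        context.get(key, key): value
        for key, value in document.items()
        if key != "@context"
    }


expanded = expand_keys(doc)
print(json.dumps(expanded, indent=2))
```

The point of the sketch is that the original JSON stays usable by ordinary JSON consumers, while the context gives every key a globally unambiguous meaning.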
  7. 7. THE WEB OF DATA - WHAT? • Linked Open Data: started with DBpedia (Wikipedia as a database); as of 2011.09, the LOD cloud has nearly 300 datasets • Web Data Commons: based on the Common Crawl Database; LOD + OpenGraph + Schema.org • Knowledge bases? Can we be a valuable contributor?
  8. 8. LINKED DATA PARADIGM • Use URIs as names for things. • Use HTTP URIs so that people can look up those names. • When someone looks up a URI, provide useful information. • Include links to other URIs, so that they can discover more things.
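The "look up those names" step can be sketched as an HTTP request that asks for a machine-readable representation of a thing's URI. The snippet below only builds the request and does not send it; the example URI, the `text/turtle` media type and the helper names are assumptions chosen for illustration.

```python
from urllib.parse import urlparse
from urllib.request import Request


def is_http_uri(uri: str) -> bool:
    """Principle 2: names should be HTTP(S) URIs so they can be looked up."""
    return urlparse(uri).scheme in ("http", "https")


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Principle 3: ask the server for useful, machine-readable data.

    The Accept header is how a client signals which representation it
    wants; the media type used here is an assumption.
    """
    if not is_http_uri(uri):
        raise ValueError(f"not an HTTP URI: {uri}")
    return Request(uri, headers={"Accept": media_type})


request = build_lookup_request("http://dbpedia.org/resource/Berlin")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup; the data returned would then contain further URIs to follow, which is the fourth principle.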
  9. 9. 5 ★ OPEN DATA: Tim Berners-Lee, inventor of the Web and Linked Data initiator, suggested a 5-star deployment scheme for Open Data. Here, we give examples for each step of the stars and explain the costs and benefits that come along with it.
  11. 11. DBPEDIA: Joint project to • create a huge, multilingual knowledge base • by extracting structured information from Wikipedia • make the knowledge base available on the Web as Linked Data under an open license
  12. 12. WE HELPED DBPEDIA (3.5, 2010.4) • Extraction framework completely rewritten • Mapping language redesigned • Hosted on a wiki • A lot more things extracted • … [Bar chart: total triples, DBpedia 3.4 vs. DBpedia 3.5; axis 0 to 1200]
  13. 13. [Figures labeled 2007, 2008, 2009 and 2010]
  14. 14. [Figure labeled 2011.09]
  15. 15. DBPEDIA 3.8 (NOW) • Structured information in Wikipedia: infoboxes; geo-coordinates; categorization of articles; inter-language links; links to images and external web pages; titles and abstracts; tables and lists • Currently 111 localized editions
  16. 16. [Table and bar chart: Instances, Statements and Distinct Properties per category; chart axis 0 to 30,000,000]
      Category           Instances    Statements   Distinct Properties
      Person               871,630    18,323,794             6,195,234
      Artist               100,793     3,723,440               998,616
      Actor                 25,340     1,070,066               247,690
      Musical Artist        46,364     2,069,152               550,225
      Athlete              217,067     6,373,136             1,853,233
      Politician            41,126     1,407,548               454,209
      Place                643,260    24,698,893             8,026,305
      Building              65,355     1,058,610               530,010
      Airport               11,675       352,377               138,944
      Bridge                 3,425        66,968                34,470
      Skyscraper                68         3,091                   719
      Populated Place      424,291    20,565,679             6,212,991
      River                 26,892       681,782               208,146
      Organisation         206,670     4,940,190             2,029,620
      Band                  29,101     1,126,744               298,743
      Company               48,989     1,048,251               445,758
      Educ.Institution      43,250       958,257               493,792
      Work                 360,808     9,649,228             3,566,511
      Book                  44,339     1,111,960               408,724
      Film                  75,067     2,663,487               787,129
      Musical Work         160,383     4,116,625             1,635,655
      Album                122,729     3,400,942             1,224,746
      Single                42,393     1,226,636               534,023
      Software              28,930       731,138               242,411
      Television Show       24,784       565,136               282,594
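"Query Wikipedia as a database" means sending SPARQL to DBpedia. As a rough sketch, the snippet below builds (but does not send) a request URL that counts instances of one class. It assumes the public endpoint at `http://dbpedia.org/sparql`, the class IRI `http://dbpedia.org/ontology/Person`, and a `format` parameter for choosing the result serialization; `build_query_url` is a hypothetical helper, and a live count will differ from the figures on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed public SPARQL endpoint for DBpedia.
ENDPOINT = "http://dbpedia.org/sparql"

# Count all resources typed as Person in the DBpedia ontology.
QUERY = """
SELECT (COUNT(?person) AS ?n)
WHERE { ?person a <http://dbpedia.org/ontology/Person> }
"""


def build_query_url(endpoint: str, query: str,
                    result_format: str = "application/sparql-results+json") -> str:
    """Encode a SPARQL query as an HTTP GET URL.

    `query` is the standard SPARQL protocol parameter; `format` is an
    assumption about what this particular endpoint accepts.
    """
    params = urlencode({"query": query.strip(), "format": result_format})
    return f"{endpoint}?{params}"


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

Fetching `url` with any HTTP client would return the count as a SPARQL result set; swapping the class IRI gives the other rows of the table.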
  18. 18. CONSUMING LINKED DATA • Browsers: LOD Cloud, Tabulator, Disco, Linked Open Data Explorer, Marbles, ObjectViewer • Search engines: Sindice, LOD Cache (Virtuoso by OpenLink Software), SWSE - DERI, VisiNav, Falcon, Swoogle
  19. 19. LDIF – LINKED DATA INTEGRATION FRAMEWORK • Single Machine / Hadoop version • tested with 3.6 billion RDF quads
  21. 21. LEARNING LINKAGE RULES USING GENETIC PROGRAMMING • based on existing reference links, GenLink learns: comparisons, aggregations, transformations, weights • instead of subtree crossover, we use a set of custom crossover operators (shown: Aggregation Crossover, Transformation Crossover)
  22. 22. RESULTS FOR THE CORA EVALUATION DATA SET • Citations to research papers from the Cora research paper search engine • Attributes: Title, Author, Venue, Date of publication • Reference links: 1600 • GenLink achieved an F-measure of 96.6% against the validation set. • Carvalho et al. report an F-measure of 91.0% against the validation set (last line).
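For readers unfamiliar with the metric, the F-measure combines precision and recall of the generated links into one score. The sketch below uses the standard definition; the precision and recall values in the example are made up, since the slide reports only the final F-measure.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    With beta = 1 this is the usual F1 score:
    F1 = 2 * P * R / (P + R).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical values, not taken from the Cora evaluation.
example_score = f_measure(precision=0.9, recall=0.8)
print(f"F1 = {example_score:.3f}")
```

Because it is a harmonic mean, the score is pulled toward the weaker of the two values, so a linkage rule cannot score well by being precise but incomplete, or complete but noisy.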
  23. 23. LEARNED RULE. Robert Isele and Christian Bizer: Learning Expressive Linkage Rules using Genetic Programming. PVLDB 5(11):1638-1649, 2012
  24. 24. ACTIVE LEARNING OF LINKAGE RULES • Query strategy: select the link candidate for which the linkage rules in the current population disagree the most.
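The query strategy above can be sketched as follows. This is a toy illustration, not the authors' implementation: the three "rules" are hand-written stand-ins for a learned population, and disagreement is measured as the size of the minority vote, which is one plausible choice among several.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]          # a link candidate: two entity labels
Rule = Callable[[Pair], bool]   # a linkage rule votes "same entity?"


def disagreement(pair: Pair, rules: Sequence[Rule]) -> float:
    """Fraction of rules in the minority vote (0 = unanimous, 0.5 = split)."""
    yes_votes = sum(1 for rule in rules if rule(pair))
    return min(yes_votes, len(rules) - yes_votes) / len(rules)


def select_query(candidates: Sequence[Pair], rules: Sequence[Rule]) -> Pair:
    """Pick the candidate the current population disagrees on the most."""
    return max(candidates, key=lambda pair: disagreement(pair, rules))


# Hypothetical population of three very simple rules.
population = [
    lambda p: p[0] == p[1],                  # exact match
    lambda p: p[0].lower() == p[1].lower(),  # case-insensitive match
    lambda p: len(p[0]) == len(p[1]),        # same length
]

candidates = [("Berlin", "Berlin"), ("Berlin", "berlin"), ("Berlin", "Paris")]
query = select_query(candidates, population)
print(query)
```

Asking a human to label the most contested candidate is what makes each confirmation informative: unanimous candidates would not change which rules survive.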
  26. 26. HTML-EMBEDDED STRUCTURED DATA ON THE WEB: More and more websites semantically mark up the content of their HTML pages. • Microformats • Microdata • RDFa
  27. 27. MICROFORMATS • Microformat effort dates back to 2003 • Small set of fixed formats: hCard (people, companies, organizations, and places); XFN (relationships between people); hCalendar (calendaring and events); hListing (small ads, classifieds); hReview (reviews of products, businesses, events) • Shortcoming of Microformats: cannot represent arbitrary kinds of data • indexed by Google and Yahoo since 2009
  28. 28. RDFA • serialization format for embedding RDF data into HTML pages • proposed in 2004, W3C Recommendation in 2008 • can be used together with any vocabulary • can assign URIs as global primary keys to entities
  29. 29. OPEN GRAPH PROTOCOL • allows site owners to determine how entities are described in Facebook • relies on RDFa for encoding data in HTML pages • available since April 2010
  30. 30. MICRODATA • alternative technique for embedding structured data • proposed in 2009 by WHATWG as part of HTML5 work • tries to be simpler than RDFa (5 new attributes instead of 8) • W3C currently tries to reconcile the two alternative proposals
  31. 31. SCHEMA.ORG • asks site owners to embed data to enrich search results • 200+ types, including Event, Organization, Person, Place, Product, Review • Encoding: Microdata or alternatively RDFa
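To show what Microdata with a schema.org type looks like and how a consumer might read it, here is a minimal sketch using only the standard library. The HTML snippet and the person in it are invented, and the parser handles just one flat item with text-valued properties; real Microdata processing (nested items, `itemref`, URL-valued properties) needs a proper library.

```python
from html.parser import HTMLParser

# Invented example page fragment marked up with Microdata.
HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Data Engineer</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collect the itemtype and text-valued itemprops of a single flat item."""

    def __init__(self) -> None:
        super().__init__()
        self.item_type = None
        self.props = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes and self.item_type is None:
            self.item_type = attributes.get("itemtype")
        if "itemprop" in attributes:
            self._current_prop = attributes["itemprop"]

    def handle_data(self, data):
        # The first non-blank text after an itemprop tag is taken as its value.
        if self._current_prop and data.strip():
            self.props[self._current_prop] = data.strip()
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


parser = MicrodataSketch()
parser.feed(HTML)
print(parser.item_type, parser.props)
```

The same facts could be encoded in RDFa with different attribute names; the extracted result (a typed entity with named properties) is what search engines consume either way.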
  32. 32. USAGE OF SCHEMA.ORG DATA @ GOOGLE • Answers to fact queries • Data snippets within search results • Data tables within search results
  33. 33. THE COMMON CRAWL CORPORA • Provides two web corpora on Amazon S3: 2009/2010 Corpus (2.5 billion HTML pages); June 2012 Corpus (3.0 billion HTML pages) • The June 2012 Corpus: unique HTML pages: 3,005,629,093; pay-level domains (PLDs): 40.6 million; size of the corpus in compressed form: 48 terabytes • Crawler uses PageRank to decide which pages to retrieve: snapshot of the popular part of the Web • Number of pages per site varies widely: • 93.1 million pages • 37.5 million PLDs with fewer than 100 pages
  34. 34. LOOKUP INDEX
  35. 35. WEB DATA COMMONS • Project: extracts all Microformat, Microdata, RDFa data from the Common Crawl; provides the extracted data for free download • Two extraction runs: 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples; 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples • Joint project of
  36. 36. THE WDC EXTRACTION FRAMEWORK • 700,000 input files queued in SQS • EC2 workers take tasks from SQS • Workers read and write S3 buckets • [Diagram: SQS queue, EC2 workers, S3 buckets for CC input and WDC output] • Workers: 100 spot instances of type c1.xlarge (7 GB RAM, 8 cores) • 5600 machine hours • 398 US$
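The queue-and-workers pattern on this slide can be sketched locally. In the toy version below, an in-memory `queue.Queue` stands in for SQS, threads stand in for EC2 workers, and plain dictionaries stand in for the S3 input and output buckets; `extract` is a placeholder that only counts `itemscope` occurrences and is not the WDC extractor. The last lines redo the slide's cost arithmetic.

```python
import queue
import threading


def extract(html: str) -> int:
    """Placeholder extraction step: count Microdata items in a page."""
    return html.count("itemscope")


def run_workers(input_bucket: dict, n_workers: int = 4) -> dict:
    """Drain a task queue with several workers and collect their results."""
    tasks = queue.Queue()
    for key in input_bucket:          # one task per input file
        tasks.put(key)

    output_bucket = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                key = tasks.get_nowait()   # take the next task, if any
            except queue.Empty:
                return                     # queue drained: worker stops
            result = extract(input_bucket[key])
            with lock:                     # write result to the "bucket"
                output_bucket[key] = result

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output_bucket


pages = {
    "file-1": "<div itemscope></div><div itemscope></div>",
    "file-2": "<p>no structured data</p>",
    "file-3": "<div itemscope></div>",
}
results = run_workers(pages)

# Cost figures from the slide: 5600 machine hours on 100 instances for 398 US$.
hours_per_instance = 5600 / 100
cost_per_machine_hour = 398 / 5600
print(results, hours_per_instance, round(cost_per_machine_hour, 3))
```

Because workers pull tasks rather than being assigned them, a slow or lost worker only delays its current file, which suits interruptible spot instances.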
  37. 37. WEBSITES CONTAINING STRUCTURED DATA (CC 2012) • 2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%) • 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%).
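The two percentages can be checked from the rounded counts on the slides. Recomputing from 2.29 million and 40.6 million gives about 5.64% rather than 5.65%; presumably the slide's figure was derived from unrounded counts, which are not given here.

```python
def share(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Rounded counts as stated on the slides.
pld_share = share(2.29e6, 40.6e6)            # websites (PLDs) with structured data
page_share = share(369e6, 3_005_629_093)     # pages with structured data

print(f"PLDs: {pld_share:.2f}%  pages: {page_share:.1f}%")
```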
  38. 38. POPULARITY OF WEBSITES CONTAINING STRUCTURED DATA • Grouped by Alexa website popularity rank (site rank based on amount of page views)
  41. 41. RDFA TOPICS (CC 2012) • Top classes • Topics: CMS and blog metadata; product data; ratings; navigational metadata; company listings
  42. 42. MICRODATA TOPICS (CC 2012) • Top classes • Topics: CMS and blog metadata; navigational metadata; products and offers; business listings; ratings; places; events • datavoc = Google's Rich Snippet Vocabulary
  43. 43. CLASS / PROPERTY DISTRIBUTION • A small set of classes / properties is used. • Heterogeneity on schema level is easy to overcome.
  44. 44. MICROFORMATS • Top classes • Topics: persons; organisations; events; listings and reviews; recipes
  46. 46. SHOPS BY PRODUCT CATEGORY • Classifier trained for 9 product categories on descriptions from Amazon. • Examined 9000 English-language shops.
  47. 47. LOOKING DEEPER INTO JOB POSTINGS • Microdata, 2012 • hiringOrganization: 40% String, 60 %
  49. 49. TAKEAWAYS • Linked Open Data is a great vision • The LOD cloud contains lots of data that we CAN consume • The Common Crawl database lowers the bar for web-scale R&D • Web Data Commons is a good-quality semantic dataset • Web Data Commons offers opportunities for easy access to large amounts of semantic data
  50. 50. CHALLENGES • LOD is still sparse or at least spotty • LOD is mostly brittle (not many statistics built in) • A global data space has only just started forming • Data integration requires effort and may introduce errors • Sophisticated Natural Language Processing work is required to get the data analyzed and utilized