The Web of data and web data commons

Description

An introduction deck for the Web of Data to my team, including basic semantic web, Linked Open Data, primer, and then DBpedia, Linked Data Integration Framework (LDIF), Common Crawl Database, Web Data Commons.

Transcript

  1. 1. THE WEB OF DATA This is a mixed deck with slides from Prof. Dr. Chris Bizer, Oliver Grisel, Soren Auer and Jesse Wang
  2. 2. AGENDA • Introduction to the Web of (Open Semantic) Data • Linked Open Data and 5-star Data Principles • DBpedia – Query Wikipedia as a database • Linked Data Integration Framework • Common Crawl Database • Web Data Commons • Summary
  3. 3. 11/7/11 “To a computer, then, the web is a flat, boring world devoid of meaning” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  4. 4. “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  5. 5. “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  6. 6. THE WEB OF DATA - HOW? • RDF / Triple Stores / SPARQL: graph stores with dynamic schemas; strong interoperability • JSON-LD: upgrade your JSON with scoped vocabularies; Web / Mobile / JS developer friendly • RDFa + schema.org & rNews: publish annotation in structured markup; vocabulary understood by search engines
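
The "upgrade your JSON with scoped vocabularies" point on the slide above can be illustrated with a minimal sketch. It assumes a made-up record and a made-up `http://example.org/` identifier; the `@context` maps the plain keys to schema.org property URIs, which is the general JSON-LD idea rather than anything specific to this deck.

```python
import json

# Plain JSON as an application might already produce it (hypothetical record).
plain = {"name": "Alice", "homepage": "http://example.org/alice/"}

# The context scopes each short key to a vocabulary URI (schema.org here).
context = {
    "name": "http://schema.org/name",
    "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
}

# The "upgraded" document: same keys and values, plus @context and @id.
linked = {
    "@context": context,
    "@id": "http://example.org/people/alice",  # hypothetical identifier
    **plain,
}

print(json.dumps(linked, indent=2))
```

Existing consumers that only read `name` and `homepage` keep working, which is why the slide calls this approach developer friendly.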
  7. 7. THE WEB OF DATA - WHAT? • Linked Open Data: started with DBpedia (Wikipedia as a database); as of 2011.09, the LOD cloud has nearly 300 datasets • Web Data Commons: based on the Common Crawl Database; LOD + OpenGraph + Schema.org • Knowledge bases? Can we be a valuable contributor?
  8. 8. LINKED DATA PARADIGM • Use URIs as names for things. • Use HTTP URIs so that people can look up those names. • When someone looks up a URI, provide useful information. • Include links to other URIs, so that they can discover more things.
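
"Looking up" an HTTP URI usually means a GET request with content negotiation. The sketch below only builds such a request and does not send it; the DBpedia resource URI and the `text/turtle` media type are illustrative assumptions, and `build_lookup` is a hypothetical helper name.

```python
from urllib.request import Request


def build_lookup(uri: str, media_type: str = "text/turtle") -> Request:
    """Build (but do not send) a GET request that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type})


# Illustrative URI; any HTTP URI that names a thing would do.
request = build_lookup("http://dbpedia.org/resource/Berlin")
print(request.get_method(), request.full_url, request.get_header("Accept"))

# Sending it would be: urllib.request.urlopen(request).read()
# (left out here so that the sketch runs without network access).
```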
  9. 9. 5 ★ OPEN DATA Tim Berners-Lee, inventor of the Web and Linked Data initiator, suggested a 5 star deployment scheme for Open Data. Here, we give examples for each step of the stars and explain costs and benefits that come along with it. http://5stardata.info/
  10. 10. AND IT STARTS WITH…
  11. 11. DBPEDIA Joint project to • create a huge, multi-lingual knowledge base • by extracting structured information from Wikipedia • make the knowledge base available on the Web as Linked Data under an open license
  12. 12. WE HELPED DBPEDIA (3.5, 2010.4) • Extraction framework completely rewritten • Mapping language redesigned • Hosted on a wiki http://mappings.dbpedia.org • A lot more things extracted • … [Bar chart: Total Triples, DBpedia 3.4 vs. DBpedia 3.5]
  13. 13. 2007 2008 2009 2010
  14. 14. 2011.09
  15. 15. DBPEDIA 3.8 (NOW) • Structured Information in Wikipedia • infoboxes • geo-coordinates • categorization of articles • inter-language links • links to images and external webpages • titles and abstracts • tables and lists • Currently 111 localized editions
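
"Query Wikipedia as a database" in practice means sending SPARQL to a DBpedia endpoint. The following is a minimal sketch that builds the HTTP request without sending it; the endpoint URL, the `dbo:Airport` class and the helper name `build_sparql_request` are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public endpoint

QUERY = (
    "PREFIX dbo: <http://dbpedia.org/ontology/> "
    "SELECT ?airport WHERE { ?airport a dbo:Airport } LIMIT 5"
)


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Encode the query as the standard 'query' parameter of a GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# Sending it would be: urllib.request.urlopen(request).read()
```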
  16. 16.
      Category            Instances    Statements   Distinct Properties
      Person                871,630    18,323,794             6,195,234
      Artist                100,793     3,723,440               998,616
      Actor                  25,340     1,070,066               247,690
      Musical Artist         46,364     2,069,152               550,225
      Athlete               217,067     6,373,136             1,853,233
      Politician             41,126     1,407,548               454,209
      Place                 643,260    24,698,893             8,026,305
      Building               65,355     1,058,610               530,010
      Airport                11,675       352,377               138,944
      Bridge                  3,425        66,968                34,470
      Skyscraper                 68         3,091                   719
      Populated Place       424,291    20,565,679             6,212,991
      River                  26,892       681,782               208,146
      Organisation          206,670     4,940,190             2,029,620
      Band                   29,101     1,126,744               298,743
      Company                48,989     1,048,251               445,758
      Educ.Institution       43,250       958,257               493,792
      Work                  360,808     9,649,228             3,566,511
      Book                   44,339     1,111,960               408,724
      Film                   75,067     2,663,487               787,129
      Musical Work          160,383     4,116,625             1,635,655
      Album                 122,729     3,400,942             1,224,746
      Single                 42,393     1,226,636               534,023
      Software               28,930       731,138               242,411
      Television Show        24,784       565,136               282,594
      [Bar chart: Instances, Statements and Distinct Properties per category]
  17. 17. CROSS-LANGUAGE OVERLAP
  18. 18. CONSUMING LINKED DATA Browsers • LOD Cloud http://datahub.io • Tabulator • Disco • Linked Open Data Explorer • Marbles • ObjectViewer Search Engines • Sameas.org • Sindice • Sig.ma • LOD Cache (Virtuoso by OpenLink Software) • SWSE - DERI • VisiNav • Falcon • Swoogle
  19. 19. LDIF – LINKED DATA INTEGRATION FRAMEWORK • Single Machine / Hadoop Version • tested with 3.6 billion RDF quads
  20. 20. A SILK LINKAGE RULE
  21. 21. LEARNING LINKAGE RULES USING GENETIC PROGRAMMING • based on existing reference links • GenLink learns • comparisons • aggregations • transformations • weights • instead of subtree crossover, we use a set of custom crossover operators [Figures: Aggregation Crossover, Transformation Crossover]
  22. 22. RESULTS FOR THE CORA EVALUATION DATA SET • Citations to research papers from the Cora research paper search engine • Attributes: Title, Author, Venue, Date of publication • Reference Links: 1600 • GenLink achieved an F-measure of 96.6% against the validation set. • Carvalho et al. report an F-measure of 91.0% against the validation set (last line).
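
For readers unfamiliar with the metric, the F-measure quoted above is commonly the harmonic mean of precision and recall (F1). The sketch below assumes that definition; the counts in the example are made up and are not the Cora results.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 score: harmonic mean of precision and recall over generated links."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct links, 2 wrong links, 2 missed links.
score = f_measure(true_pos=8, false_pos=2, false_neg=2)
print(f"{score:.3f}")
```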
  23. 23. LEARNED RULE Robert Isele and Christian Bizer: Learning Expressive Linkage Rules using Genetic Programming. PVLDB 5(11):1638-1649, 2012
  24. 24. ACTIVE LEARNING OF LINKAGE RULES • Query Strategy: Select the link candidate for which the linkage rules in the current population disagree the most.
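
The query strategy above can be sketched as a vote among the current rules: the candidate whose votes are closest to an even split is shown to the user next. This is a simplified illustration, not the method from the cited paper; the toy rules, the candidate pairs and the names `disagreement` and `select_query` are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, str]
Rule = Callable[[Candidate], bool]


def disagreement(candidate: Candidate, rules: Sequence[Rule]) -> float:
    """1.0 when the rules split evenly, 0.0 when they are unanimous."""
    share = sum(1 for rule in rules if rule(candidate)) / len(rules)
    return 1.0 - abs(2.0 * share - 1.0)


def select_query(candidates: Sequence[Candidate], rules: Sequence[Rule]) -> Candidate:
    """Pick the link candidate on which the rule population disagrees the most."""
    return max(candidates, key=lambda c: disagreement(c, rules))


# Toy population of four "linkage rules" (stand-ins for learned rules).
rules = [
    lambda c: c[0] == c[1],
    lambda c: c[0].casefold() == c[1].casefold(),
    lambda c: c[0][:1].casefold() == c[1][:1].casefold(),
    lambda c: len(c[0]) == len(c[1]),
]
candidates = [("Cora", "Cora"), ("Cora", "cora"), ("Cora", "Core"), ("Cora", "Zebra")]

chosen = select_query(candidates, rules)
print(chosen)
```

Unanimous candidates carry little information for the learner, so labelling effort goes to the pair where a user's answer eliminates the most rules.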
  25. 25. STRUCTURED DATA ON THE WEB WE HAVE THE TOOLS NOW
  26. 26. HTML-EMBEDDED STRUCTURED DATA ON THE WEB More and more websites semantically mark up the content of their HTML pages. • Microformats • Microdata • RDFa
  27. 27. MICROFORMATS • Microformat effort dates back to 2003 • Small set of fixed formats • hCard : people, companies, organizations, and places • XFN : relationships between people • hCalendar : calendaring and events • hListing : small-ads; classifieds • hReview : reviews of products, businesses, events • Shortcoming of Microformats • cannot represent arbitrary kinds of data • indexed by Google and Yahoo since 2009
  28. 28. RDFA • serialization format for embedding RDF data into HTML pages • proposed in 2004, W3C Recommendation in 2008 • can be used together with any vocabulary • can assign URIs as global primary keys to entities
  29. 29. OPEN GRAPH PROTOCOL • allows site owners to determine how entities are described in Facebook • relies on RDFa for encoding data in HTML pages • available since April 2010
  30. 30. MICRODATA • alternative technique for embedding structured data • proposed in 2009 by WHATWG as part of HTML5 work • tries to be simpler than RDFa (5 new attributes instead of 8) • W3C currently tries to reconcile the two alternative proposals
  31. 31. SCHEMA.ORG • asks site owners to embed data to enrich search results • 200+ Types: Event, Organization, Person, Place, Product, Review • Encoding: Microdata or alternatively RDFa
  32. 32. USAGE OF SCHEMA.ORG DATA @ GOOGLE • Answers to fact queries • Data snippets within search results • Data tables within search results
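
To make the Microdata encoding concrete, here is a minimal sketch that pulls `itemtype` and `itemprop` values out of an HTML snippet with the standard-library parser. It assumes flat, non-nested items with text values only; the sample markup and the class name `MicrodataSketch` are hypothetical, and a real extractor would need to handle nesting, `itemref`, and attribute-valued properties.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collects (itemtype, itemprop, text) triples for flat items only."""

    def __init__(self) -> None:
        super().__init__()
        self.current_type = None
        self.current_prop = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            # A new item starts; remember its declared type, if any.
            self.current_type = attributes.get("itemtype")
        if "itemprop" in attributes:
            self.current_prop = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            self.found.append((self.current_type, self.current_prop, text))
            self.current_prop = None


SAMPLE = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane Doe</span> '
    '<span itemprop="jobTitle">Editor</span>'
    "</div>"
)

parser = MicrodataSketch()
parser.feed(SAMPLE)
print(parser.found)
```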
  33. 33. THE COMMON CRAWL CORPORA • Provides two web corpora on Amazon S3 • 2009/2010 Corpus: 2.5 billion HTML pages • June 2012 Corpus: 3.0 billion HTML pages • The June 2012 Corpus • unique HTML pages: 3,005,629,093 • pay-level-domains (PLDs): 40.6 million • size of the corpus in compressed form: 48 terabytes • Crawler uses PageRank to decide which pages to retrieve → snapshot of the popular part of the Web • number of pages per site varies widely • youtube.com: 93.1 million pages • 37.5 million PLDs with fewer than 100 pages
  34. 34. LOOKUP INDEX
  35. 35. WEB DATA COMMONS • WebDataCommons.org Project • extracts all Microformat, Microdata, RDFa data from the Common Crawl • provides the extracted data for free download • Two extraction runs • 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples • 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples • Joint project of
  36. 36. THE WDC EXTRACTION FRAMEWORK • 700,000 input files queued in SQS • EC2 workers take tasks from SQS • Workers read and write S3 buckets [Diagram: S3 (CC, WDC), SQS, EC2 workers] • 100 spot instances of type c1.xlarge (7 GB RAM, 8 cores) • 5,600 machine hours • 398 US$
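
The queue-and-workers pattern on this slide can be sketched locally. The example below is not the WDC code: an in-memory `queue.Queue` stands in for SQS, threads stand in for EC2 workers, a dict stands in for the output bucket, and `run_workers` and `fake_extract` are hypothetical names. The last lines derive the average cost per machine hour from the two figures on the slide.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_workers(input_files: Iterable[str], num_workers: int,
                extract: Callable[[str], str]) -> Dict[str, str]:
    """Each worker repeatedly takes one file name from the queue and processes it."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for name in input_files:
        tasks.put(name)

    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                name = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker shuts down
            output = extract(name)
            with lock:
                results[name] = output

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_extract(name: str) -> str:
    """Stand-in for the real extraction step."""
    return name.upper()


files = [f"segment-{i}" for i in range(20)]
results = run_workers(files, num_workers=4, extract=fake_extract)

# Average cost implied by the slide: 398 US$ for 5,600 machine hours.
cost_per_machine_hour = 398 / 5600
print(len(results), round(cost_per_machine_hour, 3))
```

Because workers only pull tasks and never coordinate with each other, spot instances can disappear mid-run without affecting the others, which fits the low cost reported on the slide.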
  37. 37. WEBSITES CONTAINING STRUCTURED DATA (CC 2012) • 2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%). • 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%).
  38. 38. POPULARITY OF WEBSITES CONTAINING STRUCTURED DATA • Grouped by Alexa Website Popularity Rank (site rank based on amount of page views)
  39. 39. BREAKDOWN BY ENCODING FORMAT (CC 2012)
  40. 40. DISTRIBUTION BY TOP LEVEL DOMAIN
  41. 41. RDFA TOPICS (CC 2012) • Top Classes: • Topics • CMS and Blog metadata • Product data • Ratings • Navigational metadata • Company listings
  42. 42. MICRODATA TOPICS (CC 2012) • Top Classes: • Topics • CMS and Blog metadata • Navigational metadata • Products and offers • Business listings • Ratings • Places • Events (datavoc = Google's Rich Snippet Vocabulary)
  43. 43. CLASS / PROPERTY DISTRIBUTION Only a small set of classes / properties is used. Heterogeneity on the schema level is easy to overcome.
  44. 44. MICROFORMATS • Top Classes: • Topics • Persons • Organisations • Events • Listings and Reviews • Recipes
  45. 45. LOOKING DEEPER INTO THE E-COMMERCE DATA • Microdata, 2012
  46. 46. SHOPS BY PRODUCT CATEGORY • Classifier trained for 9 product categories on descriptions from Amazon. • Examined 9000 English-language shops.
  47. 47. LOOKING DEEPER INTO JOB POSTINGS • Microdata, 2012 • Schema.org hiringOrganization: 40% String, 60% Object
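
A consumer of this data has to cope with both shapes of `hiringOrganization`. One possible normalisation, sketched under the assumption that values arrive as either a plain string or a dict with a `name` key, is shown below; the function name and the sample values are hypothetical.

```python
from typing import Dict, Union


def normalize_hiring_organization(value: Union[str, dict]) -> Dict[str, str]:
    """Return a uniform object whether the page used a string or a nested item."""
    if isinstance(value, str):
        return {"@type": "Organization", "name": value.strip()}
    if isinstance(value, dict):
        return {
            "@type": str(value.get("@type", "Organization")),
            "name": str(value.get("name", "")).strip(),
        }
    raise TypeError(f"unsupported hiringOrganization value: {value!r}")


as_string = normalize_hiring_organization("  Example Corp ")
as_object = normalize_hiring_organization({"@type": "Organization", "name": "Example Corp"})
print(as_string == as_object)
```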
  48. 48. WEB COMMON DATA → GLOBAL DATA SPACE PRESENT → FUTURE
  49. 49. TAKEAWAYS • Linked Open Data is a great vision • The LOD cloud contains lots of data that we CAN consume • The Common Crawl database lowers the bar for web-scale R&D • Web Data Commons is a good-quality semantic dataset • Web Data Commons offers easy access to large amounts of semantic data
  50. 50. CHALLENGES • LOD is still sparse or at least spotty • LOD is mostly brittle (not many statistics built in) • The global data space has just started forming • Data integration requires effort and may introduce errors • Sophisticated Natural Language Processing work is required to get data analyzed and utilized
  51. 51. THANK YOU! CREDITS: CHRIS BIZER, OLIVER GRISEL, SOREN AUER
