
Evolving the Web into a Global Database - Advances and Applications.

Invited talk (ceremonial lecture at the award ceremony of the Carl-Adam-Petri-Preis), KIT, Karlsruhe, January 2014.

Published in: Internet, Education

  1. Bizer: Evolving the Web into a global Database – Advances and Applications, 30.1.2014, Slide 1
     Prof. Dr. Christian Bizer
     Evolving the Web into a Global Database: Advances and Applications
  2. Data and Web Science Group @ University of Mannheim
     − 3 Professors
       • Prof. Dr. Heiner Stuckenschmidt
       • Prof. Dr. Simone Paolo Ponzetto
       • Prof. Dr. Christian Bizer
     − 5 Post-Doctoral Researchers
     − 18 PhD Students
     − http://dws.informatik.uni-mannheim.de/
     1. Research methods for integrating and mining large amounts of heterogeneous information within enterprise and open Web contexts.
     2. Empirically analyze the content and structure of the Web.
  3. Querying the classic Web (diagram labels: DB, HTML)
  4. Long-standing Goal
     Query the Web like a single, global database
  5. 2001 Article: The Semantic Web
     Envisions three things to happen:
     1. People publish data in structured form in addition to HTML pages on the Web.
     2. Common vocabularies / ontologies are used to represent data.
     3. People implement cool applications that do smart things with the available data.
     Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.
  6. 13 Years Later
     There are 1.3 million publications about the Semantic Web on Google Scholar, but
     1. Do people publish structured data on the Web?
     2. Do people agree on common vocabularies / ontologies?
     3. What are the cool applications that exploit the data?
  7. Outline
     1. Linked Data
     2. HTML-embedded Data
     3. The Role of Wikipedia
     4. Conclusions
  8. 1. Linked Data
     Set of best practices for publishing structured data on the Web in the form of a single global data graph
     • by using RDF to publish structured data on the Web
     • by setting links between data items within different data sources.
     (diagram: data sources A to E, each publishing RDF, connected by RDF links)
  9. Global Identifiers and Links as Integration Hints
     − publishing Identity Links on the Web
       <http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
           owl:sameAs
           <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
     − publishing Vocabulary Links on the Web
       <http://xmlns.com/foaf/0.1/Person>
           owl:equivalentClass
           <http://dbpedia.org/ontology/Person> .
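A consumer can use identity links like the one on slide 9 to treat several URIs as one entity. The following is a minimal sketch, assuming the `owl:sameAs` links have already been parsed into pairs of URIs. The `SameAsIndex` class and the `http://example.org/...` URI are hypothetical illustrations, not part of the talk.

```python
class SameAsIndex:
    """Groups URIs that are connected by owl:sameAs links (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own representative.
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited URI directly at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add_link(self, a, b):
        """Record that a owl:sameAs b."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a, b):
        """True if a and b are connected by a chain of sameAs links."""
        return self._find(a) == self._find(b)


FU = "http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4"
DBLP = "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"
OTHER = "http://example.org/people/cbizer"  # hypothetical third identifier

index = SameAsIndex()
index.add_link(FU, DBLP)      # the identity link from the slide
index.add_link(DBLP, OTHER)   # assumed additional link, for illustration only
print(index.same_entity(FU, OTHER))
```

Union-find is used here because `owl:sameAs` is transitive, so links published by different parties chain together without any of them listing every identifier.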
  10. Effort Distribution between Publisher and Consumer
      − Publishers or third parties provide identity/vocabulary links
      − Consumer mines missing identity/vocabulary links
  11. W3C Linking Open Data Project
      − Grassroots community effort started in 2007 to
        • publish existing open license datasets as Linked Data on the Web
        • interlink things between different data sources
        • maintain a data set catalog on the CKAN DataHub
  12. LOD Datasets on the Web: September 2011
      295 data sets
      31.6 billion RDF triples
  13. Newer statistics
      − LODstats (University of Leipzig, 2014): 928 data sets
      − LDspider Crawl (University of Mannheim, 2013): 850 data sets
      Distribution by Topical Domain (September 2011)

      Domain          Data Sets          Triples    Percent      RDF Links    Percent
      Media                  25    1,841,852,061     5.82 %     50,440,705    10.01 %
      Geographic             31    6,145,532,484    19.43 %     35,812,328     7.11 %
      Government             49   13,315,009,400    42.09 %     19,343,519     3.84 %
      Library                87    2,950,720,693     9.33 %    139,925,218    27.76 %
      Cross-domain           41    4,184,635,715    13.23 %     63,183,065    12.54 %
      Life sciences          41    3,036,336,004     9.60 %    191,844,090    38.06 %
      User content           20      134,127,413     0.42 %      3,449,143     0.68 %
      SUM                   295   31,634,213,770               503,998,829
  14. Ontological Agreement
      − Out of the 295 data sources
        • 102 (35%) only use terms from common vocabularies
        • 105 (36%) only use proprietary terms
        • 88 (29%) mix common and proprietary terms
      − Popular Vocabularies

      Vocabulary        # Data Sets
      Dublin Core       92 (31.19 %)
      FOAF              81 (27.46 %)
      SKOS              58 (19.66 %)
      GEO               25 (8.47 %)
      AKT               17 (5.76 %)
      BIBO              14 (4.75 %)
      Music Ontology    13 (4.41 %)
      SIOC              10 (3.39 %)
  17. Uptake in the Government Domain
      − Goals
        • Make data available to the public and other government agencies
        • Ease data integration by providing unique identifiers and by setting links
      − W3C Government Linked Data Working Group
  18. Uptake in the Libraries Community
      − Institutions publishing Linked Data
        • Library of Congress (subject headings)
        • German National Library (PND dataset and subject headings)
        • Swedish National Library (Libris catalog)
        • Hungarian National Library (OPAC and Digital Library)
        • Europeana Digital Library (4 million artifacts)
      − Goals:
        1. Integrate library catalogs on a global scale
        2. Interconnect resources between repositories (by topic, by location, by historical period, by ...)
  19. Industry Uptake
      − Media Industry
        • British Broadcasting Corporation
        • New York Times
        • Wolters Kluwer
        • Springer
      − Pharmaceutical Industry
        • Johnson & Johnson
        • Eli Lilly and Company
        • AstraZeneca
      − IT Industry
        • IBM
  20. 2. HTML-embedded Data
      Websites semantically mark up the content of their HTML pages using:
      • Microformats
      • Microdata
      • RDFa
  21. Schema.org
      − site owners have been asked since 2011 to mark up data to enrich search results
      − 200+ Types: Event, Organization, Person, Place, Product, Review
      − Encoding: Microdata or alternatively RDFa
  22. Open Graph Protocol
      − allows site owners to determine how entities are described in Facebook
      − relies on RDFa for encoding data in HTML pages
      − available since April 2010
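Open Graph data is carried in `<meta>` elements whose `property` attribute starts with `og:`. The following is a minimal sketch of reading such properties with Python's standard `html.parser`. The `OpenGraphParser` class and the sample page are hypothetical illustrations and do not reflect how Facebook processes the markup.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects og:* properties from <meta property="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        # Keep only Open Graph properties that carry a content value.
        if prop.startswith("og:") and attr.get("content") is not None:
            self.properties[prop] = attr["content"]


# Made-up sample page, for illustration only.
PAGE = """<html><head>
<meta property="og:title" content="Example Movie" />
<meta property="og:type" content="video.movie" />
<meta name="description" content="not an Open Graph property" />
</head><body></body></html>"""

og_parser = OpenGraphParser()
og_parser.feed(PAGE)
print(og_parser.properties)
```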
  23. The Common Crawl
  24. The WebDataCommons.org Project
      − extracts all Microformat, Microdata, RDFa data from the Common Crawl
      − analyzes and provides the extracted data for download
      − Two extraction runs
        • 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
        • 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
      − used 100 machines on Amazon EC2
        • approx. 3000 machine hours (spot instances of type c1.xlarge) → 550 EUR
      − Joint effort in the context of an EU project
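To give a feel for what extracting Microdata from an HTML page involves, here is a minimal sketch using only Python's standard library. It is not the WebDataCommons extractor: the `MicrodataParser` class is hypothetical, handles only flat items with text or `content` values, and ignores nested items. The sample markup is made up.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects flat Microdata items (no nested items, simple values only)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._prop = None   # name of the property whose text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "itemscope" in attr:
            # A new item starts; itemtype names its class.
            self.items.append({"type": attr.get("itemtype"), "properties": {}})
        name = attr.get("itemprop")
        if name and self.items:
            if attr.get("content") is not None:
                # Value is given directly, e.g. on a <meta> element.
                self.items[-1]["properties"][name] = attr["content"]
            else:
                # Value is the element's text; collect it until the next end tag.
                self._prop, self._text = name, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            value = "".join(self._text).strip()
            self.items[-1]["properties"][self._prop] = value
            self._prop = None


# Made-up sample markup, for illustration only.
PAGE = """<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple MacBook Air 11-in</span>
  <meta itemprop="brand" content="Apple">
</div>"""

md_parser = MicrodataParser()
md_parser.feed(PAGE)
print(md_parser.items)
```

A production extractor would also need to resolve nested items, `itemref`, and URL-valued properties, which this sketch deliberately leaves out.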
  25. Websites providing Structured Data (2012)
      − 2.29 million websites (PLDs) out of 40 million provide Microformat, Microdata or RDFa data (5.65%)
      − 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%)
      − Google, October 2013: 15% of all websites provide structured data.
  26. Breakdown by Encoding Format and Site Popularity
      Grouped by Alexa Website Popularity Rank (rank based on number of page views)
  27. RDFa Topics (CC 2012)
      − Top Classes: (chart; og = Facebook's Open Graph Protocol)
      − Topics
        • CMS and Blog metadata
        • Product data
        • Ratings/Reviews
        • Company listings
  28. Microdata Topics (CC 2012)
      − Top Classes: (chart; schema = Schema.org, datavoc = Google's Rich Snippet Vocabulary)
      − Topics
        • CMS and Blog metadata
        • Navigational metadata
        • Products and offers
        • Business listings
        • Ratings
        • Places
        • Events
  29. Class / Property Distribution
      − A small set of classes / properties is used.
      − Strong focus on Schema.org and Facebook vocabularies
  30. Looking Deeper into the E-Commerce Data
      Microdata (2012)
      Example Names:
      • AppleMacBook Air MC968/A 11.6-Inch Laptop
      • Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7
      Example Description:
      • Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …
  31. Usage of Schema.org Data @ Google
      Rich snippets within search results
  32. Usage of Open Graph Protocol Data @ Facebook
      − allows site owners to determine how entities are described in Facebook
  33. Valuable Resource for Comparison Shopping Sites
      − We analyzed 1.9 million product offers from 9200 shops
      − We trained a classifier for 9 product categories on product descriptions from Amazon.
  34. Identity Resolution for Electronic Products
      − We trained a parser for product descriptions on offers for electronic products from Amazon.
      − We used the Silk framework for identity resolution.
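The two example names from slide 30 show why identity resolution on raw product names is hard. As a minimal sketch, the code below scores them with plain token-set Jaccard similarity. This is a swapped-in baseline for illustration, not the Silk configuration or the trained parser used in the talk, and the function names are hypothetical.

```python
import re


def tokens(name):
    """Lowercase a product name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two product names (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Example names taken from slide 30.
NAME_1 = "AppleMacBook Air MC968/A 11.6-Inch Laptop"
NAME_2 = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7"

score = jaccard(NAME_1, NAME_2)
print(f"Jaccard similarity: {score:.3f}")
```

The two names share only the tokens `air` and `11`, so the score stays low even though both offers describe an 11-inch MacBook Air. That gap is what motivates first parsing descriptions into attributes such as brand, model, and screen size before comparing offers.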
  35. Linked Data vs. HTML-embedded Data

      LOD Cloud                                | Microdata, Microformats, RDFa
      < 1000 sources                           | millions of sources
      covers wider range of specific topics    | focused on search engines and Facebook
      contains more complex data structures    | very simple and shallow data structures
      partial ontology agreement               | strong ontology agreement
      data integration eased by RDF links      | data integration requires NLP techniques
  36. 3. The Role of Wikipedia
      (diagram labels: Title, Description, Cross-Language Links, Geo-Coordinates, Images, Infoboxes)
  37. Extracting Knowledge from Wikipedia
  38. The DBpedia Knowledge Base - Version 3.9
      − describes 4.00 million things, out of which 3.22 million are classified in a consistent ontology using 529 classes and 2217 different properties
        • 832,000 persons
        • 639,000 places
        • 209,000 organizations
        • 116,000 music albums
      − Altogether 2.46 billion pieces of information (RDF triples)
        • 24,000,000 links to external web pages
        • 27,200,000 external links into other RDF datasets
      − DBpedia Internationalization
        • provide data from 119 Wikipedia language editions for download
        • for 24 popular languages we provide cleaned infobox data
  40. Applications of Google's Knowledge Graph
      1. Answer fact queries: “birthdate michael douglas”
      2. Compare things: “compare eiffel tower vs empire state building”
  41. Applications of Google's Knowledge Graph
      3. Enrich search results with infoboxes and lists
         • Infoboxes might also contain Microdata/RDFa data, e.g. concerts of a band
      4. Rank search results using the new Hummingbird ranking algorithm
  42. DBpedia as Background Knowledge for Data Mining
      − Which factors correlate with unemployment in France?
  43. Unemployment Table with additional Attributes
  44. RapidMiner Linked Open Data Extension
      Allows you to
      1. link local table to DBpedia and other LOD data sources
      2. extend local table with additional attributes
      3. mine extended tables using all RapidMiner features
  45. Finding Correlations
      − Use additional attributes to find interesting correlations
      − Example correlation for unemployment in France:
        • African islands, Islands in the Indian Ocean, Outermost regions of the EU (positive)
        • Population growth (positive)
        • Disposable income (negative)
        • Energy consumption (negative)
        • Fast food restaurants (positive)
        • Hospital beds/inhabitants (negative)
        • Police stations (positive)
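The link, extend, and mine workflow from slides 44 and 45 can be sketched in a few lines. This is not the RapidMiner extension itself: the region names, all numeric values, and the `extra_attribute` lookup are made up for illustration and do not reproduce the results of the talk. The sketch only shows how an attribute joined in from an external source can be correlated with a local column.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / (dev_x * dev_y)


# Step 1 (link): local table keyed by region. All values are hypothetical.
local_table = [
    {"region": "Region A", "unemployment": 8.0},
    {"region": "Region B", "unemployment": 10.0},
    {"region": "Region C", "unemployment": 12.0},
    {"region": "Region D", "unemployment": 14.0},
]

# Step 2 (extend): attribute assumed to come from a LOD source such as DBpedia.
extra_attribute = {  # hypothetical disposable income per region
    "Region A": 22000,
    "Region B": 20000,
    "Region C": 19000,
    "Region D": 16000,
}
extended = [dict(row, income=extra_attribute[row["region"]]) for row in local_table]

# Step 3 (mine): correlate the added attribute with the local column.
r = pearson(
    [row["unemployment"] for row in extended],
    [row["income"] for row in extended],
)
print(f"correlation between unemployment and income: {r:.2f}")
```

With these made-up values the coefficient is negative, matching the sign reported for disposable income on the slide. The magnitude carries no meaning.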
  46. Conclusions
      1. Publication of Structured Data
         • There is more data than most people from research and industry like
         • Exciting test-bed for data profiling and data integration techniques
         • Not even the research focus has moved to the integration of 1000s of sources
      2. Ontological Agreement
         • Application-pull helps (Google et al.)
         • But data source-specific attributes are also important (e.g. in life science or statistics domain)
      3. Applications
         • the big players are moving
         • there is a lot of experimentation in industry, but many efforts are still in the prototype stage
  47. Thanks
      Mannheim Linked Open Data Meetup
      − Free beer and food
      − Talks by Springer, Wolters Kluwer, Semantic Web Company, LOD2 project participants, DWS group members
      − Sunday, February 23, 2014, 6:30 PM
      − http://www.meetup.com/OpenKnowledgeFoundation/Mannheim-DE/1092882/