Keynote talk at the 18th International Conference on Business Information Systems, 24-26 June 2015, Poznań, Poland
URL:
http://bis.kie.ue.poznan.pl/bis2015/keynote-speakers/
Abstract:
Motivated by Google, Yahoo!, Microsoft, and Facebook, hundreds of thousands of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. In parallel, the adoption of Linked Data technologies by government agencies, libraries, and scientific institutions has risen considerably. In his talk, Christian Bizer will give an overview of the content profile of the resulting Web of Data. He will showcase applications that exploit the Web of Data and will discuss the challenges of integrating and cleansing data from thousands of independent Web data sources.
The Graph Structure of the Web - Aggregated by Pay-Level Domain (oli-unima)
The document summarizes research on analyzing the structure of the 2012 web graph when aggregated by pay-level domain (PLD) rather than by individual pages. Some key findings include: the indegree distribution follows a power law but the outdegree distribution does not; the bow-tie structure is unbalanced with a large OUT component compared to previous studies; approximately 42% of domains are connected by paths and the average path length is 4.27 hops; and high connectivity depends more on links to hubs than on hubs themselves. Analysis of topic-specific subgraphs and the public suffix graph show varying patterns of internal and external links.
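To make the aggregation step concrete, here is a minimal sketch (in Python, not the study's actual pipeline) of collapsing page-level links into a pay-level-domain graph; the tiny suffix set stands in for a full Public Suffix List lookup.

```python
# Minimal sketch: collapsing a page-level link graph into a
# pay-level-domain (PLD) graph. A real implementation would use the
# full Public Suffix List; the suffix set here is illustrative.
from urllib.parse import urlparse

PUBLIC_SUFFIXES = {"com", "org", "co.uk", "de", "pl"}  # toy subset

def pay_level_domain(url: str) -> str:
    """Return the domain one label below the public suffix."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return host

def aggregate_to_pld_graph(page_edges):
    """Collapse (source_url, target_url) pairs into unique PLD edges,
    dropping self-loops that arise from intra-domain links."""
    pld_edges = set()
    for src, dst in page_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s and d and s != d:
            pld_edges.add((s, d))
    return pld_edges

edges = [("http://news.example.com/a", "http://example.co.uk/b"),
         ("http://example.co.uk/b", "http://blog.example.com/c")]
print(aggregate_to_pld_graph(edges))
```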
Search Joins with the Web - ICDT2014 Invited Lecture (Chris Bizer)
The talk will discuss the concept of Search Joins. A Search Join is a join operation which extends a local table with additional attributes drawn from the large corpus of structured data that is published on the Web in various formats. The challenge for Search Joins is to decide which Web tables to join with the local table in order to deliver high-quality results. Search Joins are useful in various application scenarios. For example, they allow a local table about cities to be extended with an attribute containing the average temperature of each city for manual inspection. They also allow tables to be extended with large sets of additional attributes as a basis for data mining, for instance to identify factors that might explain why the inhabitants of one city claim to be happier than the inhabitants of another.
In the talk, Christian Bizer will draw a theoretical framework for Search Joins and will highlight how recent developments in the context of Linked Data, RDFa and Microdata publishing, public data repositories as well as crowd-sourcing integration knowledge contribute to the feasibility of Search Joins in an increasing number of topical domains.
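As a rough illustration of the operation itself, the following is a minimal sketch of a search join over a toy corpus; the table layout, attribute names, and naive label normalization are illustrative assumptions, not the engine described in the talk.

```python
# Simplified sketch of a search join: extend a local table about cities
# with a "temperature" attribute found in a corpus of web tables.
# Real systems add table search, schema matching, and data fusion.

local_table = [{"city": "Berlin"}, {"city": "Poznan"}]

# Toy corpus: each web table maps an entity label to attribute values.
web_tables = [
    {"attribute": "avg_temperature", "rows": {"berlin": 9.5, "poznan": 8.4}},
    {"attribute": "population", "rows": {"berlin": 3_700_000}},
]

def search_join(table, key, wanted_attribute):
    """Join the local table with every web table offering the attribute."""
    for row in table:
        label = row[key].strip().lower()       # naive entity normalization
        for wt in web_tables:
            if wt["attribute"] == wanted_attribute and label in wt["rows"]:
                row[wanted_attribute] = wt["rows"][label]
    return table

print(search_join(local_table, "city", "avg_temperature"))
```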
Extending Tables with Data from over a Million Websites (Chris Bizer)
The slideset describes the Mannheim Search Join Engine and was used to present our submission to the Semantic Web Challenge 2014.
More information about the Semantic Web Challenge:
http://challenge.semanticweb.org/
Paper about the Mannheim Search Join Engine:
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Lehmberg-Ritze-Ristoski-Eckert-Paulheim-Bizer-TableExtension-SemanticWebChallenge-ISWC2014-Paper.pdf
Abstract:
This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, as well as millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with the headquarters of each company. Or imagine you are a film lover and want to extend a table describing films with attributes like director, genre, and release date of each film. The Mannheim SearchJoin Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the SearchJoin Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table and their content is consolidated using schema matching and data fusion methods. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim SearchJoin Engine achieves a coverage close to 100% and a precision of around 90% within different application scenarios.
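The consolidation step mentioned in the abstract can be illustrated with a minimal, hedged sketch of voting-based data fusion; real engines combine several fusion strategies and more elaborate provenance tracking.

```python
# Sketch of a simple data fusion step: when several web sources propose
# conflicting values for the same cell, pick the most frequent value and
# keep the supporting sources as provenance. Voting is only one of many
# possible fusion strategies.
from collections import Counter

def fuse(candidates):
    """candidates: list of (value, source) pairs for one table cell."""
    counts = Counter(value for value, _ in candidates)
    winner, _ = counts.most_common(1)[0]
    provenance = [src for value, src in candidates if value == winner]
    return winner, provenance

cell = [("Cupertino", "siteA.com"), ("Cupertino", "siteB.org"),
        ("San Jose", "siteC.net")]
print(fuse(cell))   # ('Cupertino', ['siteA.com', 'siteB.org'])
```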
Adoption of the Linked Data Best Practices in Different Topical Domains (Chris Bizer)
Slides from the presentation of the following paper:
Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 245-260, Riva del Garda, Italy, October 2014.
Paper URL:
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/SchmachtenbergBizerPaulheim-AdoptionOfLinkedDataBestPractices.pdf
Abstract:
The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying with a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among other findings, we see that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Challenges (Chris Bizer)
http://iswc2016.semanticweb.org/pages/program/keynote-bizer.html
Semantic Web technologies, such as Linked Data and Schema.org, are used by a significant number of websites to support the automated processing of their content. In the talk, I will contrast the original vision of the Semantic Web with empirical findings about the adoption of Semantic Web technologies on the Web. The analysis will show areas in which data providers behave as envisioned by the Semantic Web community but will also reveal areas in which real-world adoption patterns strongly deviate. Afterwards, I will discuss the challenges that result from the current adoption situation. To address these challenges, I will exemplify entity reconciliation, vocabulary matching, and data quality assessment techniques which exploit all semantic clues that are provided while being tolerant to noise and lazy data providers.
DBpedia - An Interlinking Hub in the Web of Data (Chris Bizer)
Gives an overview of the DBpedia project and the role of DBpedia in the Web of Data, and outlines the next steps for the DBpedia project as well as ideas for using DBpedia data within the BBC.
Graph Structure in the Web - Revisited. WWW2014 Web Science Track (Chris Bizer)
The document discusses research that revisits the graph structure of the web using a new large crawl from Common Crawl. It finds that the web has become more dense and connected over time, with the largest strongly connected component growing significantly. While previous research found power laws for in- and out-degrees, this data does not fit power laws and instead has heavy-tailed distributions. The shape of the bow-tie structure also depends on the specific crawl used. The authors provide the new crawl data and analysis to enable further research on the evolving structure of the web graph.
Mining the Web of Linked Data with RapidMiner (Heiko Paulheim)
Lots of data from different domains is published as Linked Open Data. While there are quite a few browsers for that data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this challenge entry, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining platform RapidMiner and offers operators for accessing Linked Open Data in RapidMiner, allowing it to be used in sophisticated data analysis workflows without the need to know SPARQL or RDF. As an example, we show how statistical data on scientific publications, published as an RDF data cube, can be linked to further datasets and analyzed using additional background knowledge from various LOD datasets.
This document discusses how linking open data can make data more valuable and useful. It recommends following semantic web and linked data practices like publishing data using RDF, linking entities to related datasets, and maintaining and improving links over time. Linking data allows queries across datasets, facilitates data integration, and enables new applications by connecting related information. The key is to link data in a way that answers questions and benefits both data publishers and users, and to iteratively enhance link quality and coverage.
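A minimal sketch of this publishing pattern, assuming the rdflib Python library and a hypothetical http://data.example.org/ namespace: describe an entity in RDF and link it to a related dataset with owl:sameAs.

```python
# Minimal sketch of the linking pattern described above, using rdflib:
# describe an entity in RDF and connect it to an external dataset.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://data.example.org/")   # hypothetical namespace

g = Graph()
g.bind("owl", OWL)
city = EX["city/poznan"]
g.add((city, RDFS.label, Literal("Poznań", lang="pl")))
# Link to the corresponding entity in an external dataset (DBpedia).
g.add((city, OWL.sameAs, URIRef("http://dbpedia.org/resource/Poznań")))

print(g.serialize(format="turtle"))
```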
Registration / Certification Interoperability Architecture (overlay peer-review) (Herbert Van de Sompel)
This document discusses an architecture for interoperability between registration and certification functions in scholarly communication. It provides historical context on decoupling these functions and standards that could enable interoperability, such as Linked Data Notifications (LDN), ActivityStreams 2.0, and web linking. An example flow is described where a preprint is registered, an overlay reviewer is notified and decides to review it, and the outcome is later linked back to the original preprint registration. Overall technologies now exist to build an interoperable system where registration, certification and other functions can be fulfilled independently through standardized communication.
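A minimal sketch of the notification step of such a flow, assuming Python's requests library and placeholder URLs: discover the resource's LDN inbox from its Link header and POST an ActivityStreams 2.0 notification as JSON-LD.

```python
# Sketch of the Linked Data Notifications (LDN) flow mentioned above:
# discover a resource's inbox via its Link header, then POST an
# ActivityStreams 2.0 notification as JSON-LD. URLs are placeholders.
import requests

resource = "https://repository.example.org/preprint/123"  # hypothetical

head = requests.head(resource)
inbox = head.links["http://www.w3.org/ns/ldp#inbox"]["url"]

notification = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Announce",
    "object": "https://overlay-journal.example.org/review/456",
    "target": resource,
}
requests.post(inbox, json=notification,
              headers={"Content-Type": "application/ld+json"})
```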
The Semantic Web – A Vision Come True, or Giving Up the Great Plan? (Martin Hepp)
The document discusses the current state and future of the Semantic Web and linked data initiatives. It notes several successes such as the Linked Open Data cloud and schemas like Schema.org and GoodRelations. However, it argues that the original vision of the Semantic Web, which aimed to allow computers to help process information by applying structured data standards at web scale, has not fully been realized. Schemas like Schema.org focus more on information extraction than direct data consumption. The document calls for challenging assumptions through empirical analysis rather than ideological debates.
Presentation about reference rot given at the Complexity Science Hub in Vienna, November 2021.
Links to web resources frequently break (link rot), and linked content can change at unpredictable rates (content drift). These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information.
This presentation will report on research that assessed the extent of these problems for links to web resources in scholarly literature, by using three vast corpora of publications and a range of public web archives. It will also describe the Robust Link approach that offers a proactive, uniform, and machine-actionable way to combat link rot and content drift. Finally, it will introduce the Robustify web service and API that was devised to generate links that remain functional over time, paying special attention to challenges related to deploying infrastructure that is required to be long lasting.
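A minimal sketch of the Robust Link idea, assuming Python's requests library and the Memento Time Travel timegate service; the attributes follow the published Robust Links convention, but this is an illustration, not the Robustify service itself.

```python
# Sketch of the Robust Link idea: pair the original URI with a snapshot
# (memento) and the datetime of linking, so the reference can still be
# resolved after link rot or content drift.
import requests

def robust_link(uri: str, linked_at: str) -> str:
    """Return an HTML anchor carrying Robust Links attributes.
    linked_at: ISO date such as '2021-11-15'."""
    timegate = f"http://timetravel.mementoweb.org/timegate/{uri}"
    r = requests.get(timegate, headers={
        "Accept-Datetime": "Mon, 15 Nov 2021 00:00:00 GMT"})
    snapshot = r.url   # the memento the timegate redirected to
    return (f'<a href="{uri}" data-versionurl="{snapshot}" '
            f'data-versiondate="{linked_at}">{uri}</a>')

print(robust_link("https://example.com/report", "2021-11-15"))
```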
This slide deck provides an overview of proposals to use HTTP Links as a means to address some long standing problems related to scholarly resources on the web.
This document discusses linked data life cycles, including modeling, publishing, discovery, integration, and use cases. It describes key concepts like dataspaces, DSSPs, linked data principles, and the linked open data cloud. Challenges with linked data include schema mapping, write-enablement, authentication, and dataset dynamics as data sources change over time.
The paper describes the work being conducted in the Cross-institutional Authority Collaboration (Institutionenübergreifende Integration von Normdaten, IN2N) project. This pilot project, executed in cooperation with the German National Library and the German Film Institute, aims to establish new collaboration models to improve cross-domain authority maintenance. The paper outlines applied strategies for providing a shared infrastructure as well as workflows for exchanging data about persons; interface enhancements permitting the exploitation of innovative web approaches; and cross-institutional data search and representation solutions. Furthermore, we discuss specific boundary conditions, such as disparities in the level of data granularity, for an interoperable cataloguing environment.
The document discusses open data for open government and the benefits of publishing government data in a semantic, linked, and open format on the web. It provides examples of open data initiatives in the US, UK, and other countries that have led to the development of many applications by third parties using publicly available government data. The speaker advocates that governments publish not just documents but the underlying data to allow others to build new sites and applications to make use of the information.
Presentation for a workshop about persistent identifiers organized by the Royal Library of The Netherlands and DANS. Highlights the non-trivial commitments required of all parties involved in persistent identifier systems to actually keep links based on persistent identifiers ... err ... persistent.
[Databeers] 06/05/2014 - Boris Villazon: "Data Integration - A Linked Data ap..." (Data Beers)
This document discusses using linked data approaches for data integration. It introduces linked data as a way to publish and connect disparate data sources using common identifiers and semantic web standards like URIs and RDF. This allows data to be queried and exploited as a single global database. Examples are given of applying linked data for integrating enterprise data sources and for publishing geospatial data from Ecuador using semantic representations. The benefits of linked data for data integration are that it enables querying across data silos and consuming data without complex transformations by using the graph-based RDF data model.
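As a small illustration of querying linked data during integration, here is a sketch that runs a SPARQL query against the public DBpedia endpoint, assuming the SPARQLWrapper Python library.

```python
# Sketch of querying a linked dataset as part of an integration task:
# fetch the population of entities identified by shared URIs.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?city ?population WHERE {
      ?city a dbo:City ;
            dbo:populationTotal ?population .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```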
Linked Data: turning the web into a context graph (Leigh Dodds)
A presentation I gave at Strataconf 2012. I reviewed the concepts of Linked Data and argued that while the approach has come from the semantic web community, there are interesting parallels with efforts from Facebook and Schema.org. Linked Data provides a way for us to create resolvable identifiers + discover useful data by just using the web infrastructure more effectively.
The document introduces the concept of Linked Data and discusses how it can be used to publish structured data on the web by connecting data from different sources. It explains the principles of Linked Data, including using HTTP URIs to identify things, providing useful information when URIs are dereferenced, and including links to other URIs to enable discovery of related data. Examples of existing Linked Data datasets and applications that consume Linked Data are also presented.
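A minimal sketch of the dereferencing principle, assuming the requests and rdflib Python libraries: fetch machine-readable data about a thing via content negotiation and parse the returned Turtle.

```python
# The Linked Data principles in practice: dereference an HTTP URI with
# content negotiation to obtain machine-readable data about the
# identified thing, then parse it with rdflib.
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/Linked_data"
resp = requests.get(uri, headers={"Accept": "text/turtle"},
                    allow_redirects=True)
g = Graph()
g.parse(data=resp.text, format="turtle")
print(f"{len(g)} triples describing {uri}")
```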
Heuristics for Fixing Common Errors in Deployed schema.org Microdata (Robert Meusel)
Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic that data providers will provide clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.
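In the spirit of the heuristics the paper describes (though not its exact rules), here is a sketch of one consumer-side repair: normalizing the many malformed variants of schema.org type URLs found in deployed Microdata.

```python
# Sketch of a consumer-side repair heuristic: normalize misspelled
# variants of schema.org type URLs before further processing.
import re

CANONICAL_TYPES = {"product", "offer", "localbusiness", "review"}

def normalize_type(item_type: str) -> str | None:
    """Map e.g. 'https://www.schema.org/Product/' to
    'http://schema.org/Product'; return None if unrepairable."""
    m = re.match(r"https?://(www\.)?schema\.org/([^/#?]+)/?$",
                 item_type.strip(), re.IGNORECASE)
    if not m:
        return None
    name = m.group(2)
    if name.lower() not in CANONICAL_TYPES:
        return None
    return "http://schema.org/" + name[0].upper() + name[1:].lower()

print(normalize_type("HTTPS://www.Schema.org/product/"))
# -> http://schema.org/Product
```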
Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.
Data.dcs: Converting Legacy Data into Linked Data (Matthew Rowe)
This document discusses converting legacy data from the Department of Computer Science (DCS) at the University of Sheffield into linked data. It describes extracting data from websites and publications databases, converting it to RDF triples, resolving duplicate entities, and linking the data to external datasets like DBLP. The goal is to make DCS data about people, publications, and research groups machine-readable and queryable while integrating it into the larger web of linked open data.
Linked Open Data projects aim to extend the web of documents to a web of linked data by adding semantics through standards like RDF and ontologies. The Linked Open Data cloud has grown significantly since 2007 and contains billions of RDF triples and links between data sources. Projects like LOD2 build on this by developing technologies and linking more open datasets to enable new applications. For Linked Data to achieve its full potential, openness and allowing free access and reuse is important, though it does mean losing some control over data usage.
This document discusses data-driven smart governance and describes how governments can utilize data, information, and intelligence through interaction, integration, and influence. It provides examples of how open data, data standards, semantic technologies, machine learning, and public-private partnerships can help power more data-driven decision making and transparent, responsive government services.
This document introduces Linked Data and provides an overview of its key concepts and benefits. It discusses how Linked Data builds on existing web standards by linking structured data across websites on the web. It also outlines practical steps for publishing Linked Data, such as identifying data to publish, assigning unique URLs, and linking data to existing datasets. The goal of Linked Data is to extend the web into a global data space by creating a decentralized "Web of Data."
1. The document provides guidance for academics on using social media for professional purposes. It outlines different types of social media platforms and how they can be used.
2. Academics are encouraged to develop their online presence through blogging, sharing content and case studies, and connecting with other professionals. Organizations can support staff by providing training and acting as role models.
3. Individuals should understand how to create a relevant online profile and take advantage of opportunities for self-determined learning through social media. They can make good use of social media by sharing their own work as well as the achievements of others.
Digital archives contain vast amounts of information stored as binary digits. The amount of digital data being created is growing exponentially and is estimated to exceed 500 quadrillion files. Digital archives can preserve important records and make information widely accessible online, helping to promote accountability, justice, and bearing witness. However, digital archives also pose challenges around long-term preservation due to their dependence on particular file formats and storage media.
Using social media as academics for learning, teaching and research (Sue Beckingham)
Using social media in higher education for teaching, academic professional development, research, student guidance, peer support, student professional development, recruitment and university communication.
For everybody who gets tired of questions like “when is the Semantic Web actually going to happen”, or any other suggestion that the Semantic Web programme is “only vision, no progress”.
Social media for research and knowledge sharing (Hasnain Zafar)
Slides for my pre-conference talk/workshop on Social Media for research at the National Public Health Conference 2013, 11-13 November 2013, Concorde Hotel, Shah Alam, Selangor, Malaysia.
The Digital Academic: Social and Other Digital Media for Academics (Deborah Lupton)
A presentation used in workshops to teach academics about how to use social media and other digital media for professional purposes. Includes discussion of Academia.edu, LinkedIn, blogs, Twitter, Facebook, institutional e-repositories, Storify, SlideShare, Pinterest and more.
Skills Development Through Authentic Assessment (Alan Cann)
"Authentic assessment" is relevant to real world outcomes and engaging for students. Much of the treadmill activity of conventional assessment (essays and exams) has little to do with what goes on in the workplace. Faced with the task of developing a "research skills" module for 300 biological sciences students, I attempted to apply the principles of authentic assessment. The practical problems in achieving this with a large number of students involve the staffing demands of this approach, and there are problems with applying performance-based outcomes to large groups of students. Team-based learning enhances student engagement and represents a shift from a teacher-based strategy to a student-centred approach.
Networking and the importance of a professional online presence (Sue Beckingham)
This document discusses the importance of developing a professional online presence and networking. It provides tips for optimizing one's professional identity on various social media platforms like LinkedIn, Twitter, blogs and online portfolios. This includes connecting with others in your field, showcasing your work, engaging with relevant content and organizations, and ensuring your online profiles highlight your skills, interests and story. The document stresses that your online networks and voice are important for standing out, gaining opportunities and being found by potential employers.
Social media for researchers - maximizing your personal impact (Alan Cann)
This document provides an overview of how researchers can use social media to maximize their personal impact. It discusses how social media can enhance the academic research cycle by enabling more effective collaboration, opportunities to forge new connections, receiving feedback, and more rapidly disseminating work. While social media presents some criticisms like privacy issues and a loss of authority, the document encourages researchers to participate and build good networks as a way to make an impact beyond traditional citations.
This document discusses the use of social media tools for researchers. It outlines several essential competencies for researchers, including knowledge base, professional development, communication and dissemination, and professional conduct. It then examines how specific social media platforms like Twitter, blogs, Mendeley, and ResearchGate can help researchers in each of these areas. The document provides tips for successful use of social media but also notes potential pitfalls to avoid, such as privacy and blurring of personal and professional boundaries. Useful links for further information are also included.
Using social media for learning and teaching #Bett2017 #ALiSOnline (Sue Beckingham)
This session explores how social media can be used to connect, communicate, curate, collaborate and create to enhance the learning experience both within and outside of the classroom. Learning activities and social media spaces will be shared to demonstrate how learners can develop digital capabilities and establish digital wellbeing.
http://alis-online.com/sessions/sioe-jan17/2016/12/2/social-media
1. The document discusses how graph theory concepts like degree centrality, betweenness centrality, and center were applied to analyze football matches from the 2010 FIFA World Cup semi-finals and final.
2. By constructing graphs based on pass networks, the researchers were able to predict that Spain would defeat the Netherlands in the final, based on Spain having a more well-balanced and interconnected passing strategy compared to the Netherlands' more predictable attack.
3. Spain's graph showed low, evenly distributed betweenness scores, indicating a well-balanced passing game, while the Netherlands' scores were more concentrated, showing dependency on a few key players. The analysis confirmed Spain played a clever game relying on many short passes between a network of midfielders (see the sketch after this list).
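A minimal sketch of such a pass-network analysis using the networkx Python library; the players and pass counts below are illustrative, not the actual World Cup data.

```python
# Sketch of a pass-network analysis: model players as nodes and
# completed passes as weighted directed edges, then compute betweenness
# centrality. Evenly spread scores suggest a balanced passing game;
# concentrated scores suggest dependency on a few players.
import networkx as nx

passes = [("Xavi", "Iniesta", 12), ("Iniesta", "Xavi", 9),
          ("Busquets", "Xavi", 11), ("Xavi", "Busquets", 8),
          ("Iniesta", "Villa", 5)]   # illustrative counts, not real data

G = nx.DiGraph()
for passer, receiver, count in passes:
    # Invert counts so that frequent passing = short "distance".
    G.add_edge(passer, receiver, weight=1.0 / count)

centrality = nx.betweenness_centrality(G, weight="weight")
for player, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{player:10s} {score:.3f}")
```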
Best Practice for Social Media in Teaching & Learning Contexts, slides accompanying a presentation by Nicola Osborne, EDINA Digital Education Manager, for Abertay University (Dundee). The hashtag for this event was #AbTLEJan2017.
The document provides an introduction to graph theory concepts. It defines graphs as pairs of vertices and edges, and introduces basic graph terminology like order, size, adjacency, and isomorphism. Graphs can be represented geometrically by drawing vertices as points and edges as lines between them. Both simple graphs and multigraphs are discussed.
The role and importance of social media in science (Jari Laru)
Presentation on the role and importance of social media in science, given in the course 920001J - Introduction to Doctoral Training (1 ECTS credit), UNIOGS, University of Oulu, Finland.
Using social media to build your academic career (lisbk)
Slides for a talk on "Using social media to build your academic career" given by Brian Kelly, Innovation Advocate at Cetis, University of Bolton, on 11 September 2014 at a symposium on "How to Build an Academic Career" in the Maria Baers Auditorium, Brussels, Belgium.
See http://ukwebfocus.wordpress.com/events/using-social-media-to-build-your-academic-career/
and
http://ukwebfocus.wordpress.com/2014/09/10/using-social-media-to-build-your-academic-career/
TFF2015, Christian Bizer, Uni Mannheim: "Schema.org-Annotationen in Webseiten" (TourismFastForward)
Professor Christian Bizer researches technical and empirical questions concerning the evolution of the World Wide Web into a global data space. Within the WebDataCommons project, he investigates the global adoption of web page annotation standards such as Schema.org. Within the DBpedia project, he works on deriving comprehensive multi-domain knowledge bases from Wikipedia and the World Wide Web.
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the Web (Chris Bizer)
Slideset of the keynote talk given by Christian Bizer at the Language Data and Knowledge (LDK 2019) conference in Leipzig, Germany
Abstract of the talk:
Millions of websites have started to annotate data describing products, local businesses, events, jobs, places, recipes, and reviews within their HTML pages using the schema.org vocabulary. These annotations are widely used by search engines to render rich snippets within search results. Surprisingly, the annotations are hardly used by the research community. In the talk, Christian Bizer investigates the potential of schema.org annotations for being used as training data for tasks such as entity matching, information extraction, and sentiment analysis. Web pages that offer semantic annotations often also contain additional structured data in the form of HTML tables. In the second part of the talk, Christian Bizer discusses the interplay of semantic annotations and web tables for information extraction as well as the general potential of relational HTML tables for complementing knowledge bases such as DBpedia, focusing on the discovery of formerly unknown long tail entities as well as the extraction of n-ary relations.
Link to the conference website:
http://2019.ldk-conf.org/invited-speakers/
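As a small illustration of harvesting the annotations described in the abstract above, here is a sketch that extracts schema.org Microdata from an HTML snippet, assuming the extruct Python library; the property-access pattern mirrors extruct's microdata output but should be treated as an assumption.

```python
# Sketch of harvesting schema.org annotations as training data: pull a
# product name and review text from a page's Microdata markup.
import extruct

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Noise-Cancelling Headphones</span>
  <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    <span itemprop="reviewBody">Great sound, weak battery.</span>
  </div>
</div>"""

data = extruct.extract(html, syntaxes=["microdata"])
for item in data["microdata"]:
    if item["type"] == "http://schema.org/Product":
        props = item["properties"]
        print("product:", props["name"])
        print("review :", props["review"]["properties"]["reviewBody"])
```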
Integrating Product Data from the Semantic Web using Deep Learning Techniques (Chris Bizer)
The adoption of schema.org annotations on the Web has sharply increased over the last years with hundreds of thousands of websites annotating information about products, events, local businesses, reviews, and job postings within their pages. In the talk, Christian Bizer will discuss the integration of schema.org product data from large numbers of websites for the use cases of building product knowledge graphs as well as comparing product prices across e-shops. The key challenge for this integration is to determine which webpages describe the same product. Christian Bizer will demonstrate how this challenge can be handled by deriving a large pool of training data from schema.org annotations and using this data to train transformer-based product matchers. He will discuss how the matchers exploit the richness of the training data available for widely sold head products using multi-task learning but can also excel on matching long-tail products using contrastive pre-training as well as cross-language learning.
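A minimal sketch of the training-data derivation idea described above: offers that share a product identifier are treated as matching pairs; the records and identifier field are illustrative assumptions.

```python
# Sketch of deriving labeled training pairs from schema.org product
# annotations: offers from different e-shops that share a product
# identifier (e.g. a GTIN) are treated as matches, offers with
# different identifiers as non-matches.
from itertools import combinations

offers = [  # (shop, title, gtin) -- illustrative records
    ("shopA", "Acme Headphones X100 black", "0401234567890"),
    ("shopB", "ACME X-100 wireless headphones", "0401234567890"),
    ("shopC", "Acme Speaker S20", "0409876543210"),
]

positives, negatives = [], []
for a, b in combinations(offers, 2):
    pair = (a[1], b[1])
    (positives if a[2] == b[2] else negatives).append(pair)

print(len(positives), "matches,", len(negatives), "non-matches")
# These pairs can then be used to fine-tune a transformer-based matcher.
```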
Big Data – Marketing Challenge or Opportunity? (edynamic)
This document discusses big data and how companies can leverage customer data and digital platforms to improve customer engagement. It defines big data as high-volume, high-velocity and high-variety information that requires new technologies to capture, curate, manage and process within a viable time frame. The document provides examples of social, mobile and cloud computing driving big data, and outlines key principles for using big data strategically to understand customers and drive authentic engagement across channels.
This document discusses emerging trends in web technologies, such as Web 2.0, and their potential applications and benefits for enterprises. It outlines how tools like blogs, wikis, user-generated content, and social networking can help businesses engage customers, streamline operations, and gain insights from collective intelligence and user data. Examples are given of companies that have successfully adopted these new approaches to marketing, customer service, management, and collaboration.
Matthew Brown discusses semantic search engine optimization techniques. He defines semantic SEO as using semantic web technologies to send detailed page content meanings to search engines in a way computers can process. Brown recommends starting with Schema.org and Open Graph vocabularies and provides links to resources on structured data types, markup troubleshooting, and semantic web statistics. He also lists people involved in both SEO and semantic web fields.
Digital Transformation of Civil Engineering and Construction (pdemian)
This document summarizes a presentation on the digital transformation of civil engineering and construction. It discusses drivers for digital transformation like client demands for more information and improved productivity. It also discusses the potential for a national digital twin and recent research projects. These include a BIM search engine called 3DIR, identifying national capabilities needed for information management, and applications of augmented and virtual reality. The presentation concludes that the UK is a world leader in areas like mandating BIM use and is in an exciting time for digital transformation in the built environment sector.
This document provides an overview and agenda for a presentation on product data management using Neo4j graph databases. The presentation will include an introduction to graph databases and Neo4j by Bruno Ungermann from Neo4j, followed by a discussion of using graph databases for product data management by Dr. Andreas Weber from semantic PDM. Examples will be provided of graph models and how they can be used for various domains including logistics, manufacturing, and customer relationships. Attendees will have an opportunity to ask questions and discuss use cases.
A possible future role of schema.org for business reporting (sopekmir)
The presentation demonstrates a vision for the “reporting extension” that could enhance the processes related to business reporting and the role it could have for the SBR vision.
Digital dealer conference automotive guerrilla marketing using social media v2 (Ralph Paglia)
The document discusses using a Web 2.0 social marketing strategy for car dealerships by leveraging OEM content and customer data across multiple connected Web 2.0 sites to generate sales opportunities at little to no cost. It recommends using platforms like blogs, social networks, and review sites to engage prospects and previous customers through user-generated and OEM-supplied content.
Leverage various Web 2.0 sites and user-generated content along with OEM-supplied content to create a low-cost guerrilla marketing strategy for car dealerships. Use social networks, blogs, photos, videos and more across multiple sites to engage customers and generate sales opportunities. Reach a critical mass of content and users across connected Web 2.0 platforms to drive traffic and prospects to the dealership without spending more on advertising.
Digital dealer6 web20-guerrillamarketing-v2 (Ralph Paglia)
Leverage various Web 2.0 sites and user-generated content along with OEM-supplied content to create a low-cost guerrilla marketing strategy for car dealerships. Use social networks, blogs, photos, videos and more across multiple sites to engage customers and generate sales opportunities. Reach a critical mass of content and users across connected Web 2.0 platforms to drive traffic and prospects to the dealership without spending more money.
Automotive guerrilla marketing for car dealers using social media and web 2 0 (Ralph Paglia)
The document discusses using a Web 2.0 social marketing strategy for car dealerships by leveraging OEM-supplied content and customer lists on various Web 2.0 sites to generate sales opportunities at little to no cost. It recommends using multiple email lists and search engine optimization across interconnected Web 2.0 sites to engage prospects and previous customers. Attendees will receive access to online resources to gain a competitive advantage in their local markets.
Automotive guerrilla marketing for car dealers using social media and web 2 0 (Social Media Marketing)
Leverage various Web 2.0 sites and user-generated content along with OEM-supplied content to create a low-cost guerrilla marketing strategy for car dealerships. Use social networks, blogs, photos, videos and more across multiple sites to engage customers and generate sales opportunities. Reach a critical mass of content and users across connected Web 2.0 platforms to drive traffic and prospects to the dealership without spending more money.
Acs Presentation Thinking Outside Of Inbox V2 (Johnny Teoh)
The document discusses the concept of Web 2.0 and how it enables new ways of collaborating and sharing information online. It provides examples of how corporations are leveraging Web 2.0 tools like blogs, wikis and social networking to boost collaboration, share knowledge, and engage with customers. The document also outlines the author's daily activities using various Web 2.0 technologies like blogs, wikis and social networks as part of his job at IBM.
Building for success on the capable web - t3imd 2020 (Andrey Lipattsev)
This document discusses building for success on the capable web. It emphasizes acquiring users, engaging them through the conversion funnel, and retaining them. It highlights the importance of core web vitals like LCP, CLS and FID for performance. It also discusses monitoring success through Google Site Kit and bringing new content formats like Stories to the web. The overall message is on optimizing the user experience at each stage of the funnel through performance, engagement and content.
Knowledge Graph Implementation into Drupal Content Management System (CMS) for the UN Climate Technology Centre and Network (CTCN) (Martin Kaltenböck)
Slides of presentation of Martin Kaltenböck (Managing Partner Semantic Web Company, SWC https://www.semantic-web.com) at the Taxonomy Boot Camp London 2017 on 17th of October 2017 with the title: Knowledge Graph Implementation into Drupal Content Management System (CMS) for the UN Climate Technology Centre and Network (CTCN)
The document discusses how Orbitz Worldwide uses Hadoop and big data to drive web analytics. It faces challenges with processing massive amounts of log data from millions of searches. Orbitz implemented a Hadoop infrastructure to provide long-term storage, access for developers and analysts, and rapid deployment of reporting applications. This allows Orbitz to aggregate data, run analysis jobs like traffic source mapping in minutes rather than hours, and generate over 25 million records per month. The implementation helps Orbitz shift analytics from innovation to mainstream use across business units.
You’ve probably seen some cool data visualizations, and perhaps daydreamed of the day your organization’s data would be as easily computable. You’re not alone, many enterprises are falling behind the Big Data bandwagon and don’t all have clear direction to optimize their use of enterprise data as part of their cloud strategy.
Data is the “lifeblood of your business. It contains your organization’s history. And it’s trying to tell you something.” So, isn’t it time we pay attention?
Similar to Evolving the Web into a Global Dataspace – Advances and Applications
GPT-4 versus BERT: Which Foundation Model is better for Web Data Integration? – Chris Bizer
The Web contains vast amounts of structured data in the form of HTML tables, schema.org annotations, as well as datasets accessible via data repositories. The automated integration of data from larger numbers of Web data sources is a long-standing research challenge as the integration requires dealing with several tricky tasks such as schema matching, entity matching, and data indexing for retrieval. Most state-of-the-art methods for these tasks rely on variants of the BERT transformer model fine-tuned using significant amounts of task-specific training data. In the talk, Christian Bizer will critically review BERT-based data integration methods and question their robustness concerning out-of-distribution entities. He will compare the performance of BERT-based methods with results of GPT-4-based data integration methods and will argue that GPT-4-based methods are more training data efficient and more robust concerning unseen entities.
Using the Semantic Web as Training Data for Product Matching – Chris Bizer
Talk at the OpenKG Forum co-located with JIST2019 about using schema.org annotations from the Web for training product matchers.
See also:
http://webdatacommons.org/largescaleproductcorpus/v2/
http://jist2019.openkg.cn/index.php/openkgasia-forum/
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web – Chris Bizer
Current research on knowledge graph completion focuses on employing graph embeddings for the task of link prediction. But knowledge graph completion is more than link prediction and tasks such as adding formerly unknown long-tail entities to the graph, extending the schema of the graph with additional properties, and completing and updating numeric values are equally important tasks. In the talk, Christian Bizer will review recent results on using data from large numbers of independent websites to accomplish these tasks. He will focus on two types of web content – relational HTML tables and semantic annotations within HTML pages – and will discuss the potential of these types of content for set completion, schema extension, and fact checking, as well as their utility as training data for matching textual entity descriptions.
Data Search and Search Joins (Universität Heidelberg 2015) – Chris Bizer
The amount of structured data that is published on the Web has increased sharply in recent years. The deluge of available data calls for new search techniques which support users in finding and integrating data from large numbers of data sources. In his talk, Christian Bizer will give an overview of the different types of data search that have been proposed so far: entity search, table search, and constrained and unconstrained search joins. As an example of a system from the last category, he will introduce the Mannheim Search Join Engine, which executes unconstrained search joins over different types of Web data including Linked Data, Microdata, Web tables and Wikipedia tables.
Exploring the Application Potential of Relational Web Tables – Chris Bizer
The Web contains large amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities. Relational web tables cover a wide range of topics and there is a growing body of research investigating the utility of web table data for applications such as complementing cross-domain knowledge bases, extending arbitrary tables with additional attributes, and translating data values.
Until recently, most of the research on web tables originated from the large search engine companies, as they were the only ones with access to large web crawls and thus able to extract web table corpora from them. This changed in 2012, when the University of Mannheim, followed in 2014 by the Dresden University of Technology, started to extract web table corpora from the Common Crawl, a large public web corpus.
In the talk, I will introduce the 2015 version of the Web Data Commons - Web Table Corpus. Afterward, I will give an overview of the efforts my group is currently conducting to explore the application potential of relational web tables. These efforts include profiling the content of web tables by matching them to cross-domain knowledge bases such as DBpedia, fusing web table data in order to complement cross-domain knowledge bases, and performing SearchJoins between a local table and a web table corpus in order to extend the local table with additional attributes.
Build applications with generative AI on Google Cloud – Márton Kodok
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to execute prompts in text and chat, handle multimodal use cases with image prompts, fine-tune and distill models to improve knowledge domains, and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps following current industry trends.
Discovering Digital Process Twins for What-if Analysis: a Process Mining Approach – Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be combined to discover high-fidelity digital twins of end-to-end processes from event data.
Evolving the Web into a Global Dataspace – Advances and Applications
1. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 1
Prof. Dr. Christian Bizer
Evolving the Web into a Global
Dataspace
- Advances and Applications -
18th International Conference on Business Information Systems (BIS 2015)
2. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 2
Hello
Professor Christian Bizer
University of Mannheim
Research Topics
−Web Technologies
−Web Data Profiling
−Web Data Integration
−Web Mining
3. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 3
Data and Web Science Group @ University of Mannheim
− 6 Professors
• Heiner Stuckenschmidt
• Rainer Gemulla
• Christian Bizer
• Simone Ponzetto
• Heiko Paulheim
• Johanna Völker
− 25 researchers and PhD students
− http://dws.informatik.uni-mannheim.de/
1. Research methods for integrating and mining large
amounts of heterogeneous information from the Web.
2. Empirically analyze the content and structure of the Web.
4. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 4
Querying the Classic Web
(Diagram: DB → HTML)
5. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 5
Long-Standing Goal
Query the Web like a single, global database
6. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 6
2001 Article: The Semantic Web
Envisions three things to happen:
1. people publish data in structured form in addition to HTML pages on the Web
2. common vocabularies / ontologies are used to represent data
3. people implement cool applications that do smart things with the available data
Tim Berners-Lee, James Hendler and Ora Lassila:
The Semantic Web. Scientific American, May 2001.
7. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 7
14 Years Later
There are 1.5 million publications about the
Semantic Web on Google Scholar, but
1. Do people publish structured data on the Web?
2. Do people agree on common vocabularies / ontologies?
3. What are the cool applications that do smart things
with the data?
8. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 8
Outline
1. Semantic Annotations in HTML Pages
2. Linked Data
3. Knowledge Graphs
4. Conclusions
9. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 9
1. Semantic Annotations in HTML Pages
Simple idea: Help machines to understand
Web content by marking up data in HTML
pages.
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
    <span itemprop="addressCountry">Austria</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars - based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
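To make the effect of such markup concrete, here is a minimal sketch (not part of the slides) that parses the hotel annotation with the Python extruct library; the library choice is an assumption, and any Microdata parser would do.

# Minimal sketch: parsing the Microdata above with the Python "extruct"
# library (assumed installed via `pip install extruct`).
import json
import extruct

html = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars - based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
"""

# extract() returns a dict keyed by syntax name; we only request Microdata.
data = extruct.extract(html, syntaxes=["microdata"])
print(json.dumps(data["microdata"], indent=2))
# Prints a list with one item of type http://schema.org/Hotel whose
# "properties" dict holds the name and the nested aggregate rating.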
10. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 10
Semantic Annotation Formats
Microformats
− date back to 2003
− small set of fixed formats
RDFa
− W3C Recommendation in 2008
− can represent any type of data
Microdata
− proposed in 2009
− tries to be simpler than RDFa
11. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 11
Open Graph Protocol
− allows site owners to determine how
entities are displayed in Facebook
− relies on RDFa for marking up data in HTML pages
− available since April 2010
12. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 12
Schema.org
− asks site owners since 2011 to annotate data for enriching search results
− 675 Types: Event, Place, Local Business, Product, Review, Person
− Encoding: Microdata, RDFa, JSON-LD
13. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 13
Usage of Schema.org Data @ Google
Rich snippets
within
search results
14. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 14
Event Data in Google Applications
https://developers.google.com/structured-data/
15. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 15
Flight Offers in Google Search Results
Annotated
webpages
directly below
Google Flights
results
16. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 16
Rich-Snippets Get More User Attention
Source: www.looktracker.com
Potential business
incentive.
17. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 17
Motivation for Semantic Annotations
− Study by searchmetrics.com in 2013: tens of thousands of search keywords
− Type of rich-snippet displayed by Google:
Source: http://www.searchmetrics.com/de/knowledge-base/schema/
Google displays Rich-Snippets for 40% of all queries.
18. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 18
The Common Crawl
19. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 19
The Web Data Commons Project
− extracts all Microformat, Microdata, RDFa data
from the Common Crawl
− analyzes and provides the extracted data for download
− four extraction runs so far
• 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
• 2013 CC Corpus: 2.2 billion HTML pages → 17.2 billion RDF triples
• 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
• 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
− uses 100 machines on Amazon EC2
• approx. 3,000 machine hours (spot instances of type c3.xlarge) for about 550 Euro
− http://www.webdatacommons.org/structureddata/
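The Web Data Commons extractions are distributed as N-Quads files, with the page a triple was extracted from as the fourth element. Below is a sketch of how one might stream such a file and count schema.org classes; the file name is illustrative, and the actual file listings are on the project page.

# Sketch: streaming one (hypothetical) WDC N-Quads file and counting
# the most frequent rdf:type objects, i.e. the most popular classes.
import gzip
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
counts = Counter()

with gzip.open("wdc-microdata-sample.nq.gz", "rt", encoding="utf-8",
               errors="ignore") as f:
    for line in f:
        # N-Quads layout: subject predicate object graph .
        parts = line.split(" ", 3)
        if len(parts) == 4 and parts[1] == RDF_TYPE:
            counts[parts[2]] += 1

for cls, n in counts.most_common(10):
    print(n, cls)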
20. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 20
Overall Adoption 2014
620 million HTML pages out of the 2 billion pages
provide semantic annotations (30%).
2.72 million pay-level-domains (PLDs) out of the
15.68 million pay-level-domains covered by the
crawl provide annotations (17%).
Google, 2014*:
5 million websites provide Schema.org data.
* Guha in LDOW2014 Keynote
21. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 21
Number of PLDs providing Semantic Annotations
22. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 22
Most Popular Classes
RDFa
Microdata
24. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 24
Adoption by E-Commerce Websites
Distribution by Top-Level Domain:
TLD #PLDs
com 38,344
co.uk 3,605
net 1,813
de 1,333
pl 1,273
com.br 1,194
ru 1,165
com.au 1,062
nl 1,002

Alexa Top-15 Shopping Sites (checked for schema:Product):
Amazon.com
Ebay.com
NetFlix.com
Amazon.co.uk
Walmart.com
etsy.com
Ikea.com
Bestbuy.com
Homedepot.com
Target.com
Groupon.com
Newegg.com
Lowes.com
Macys.com
Nordstrom.com

Adoption by Top-15: 60%
25. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 25
Properties used to Describe Products
Top 15 Properties #PLDs %
schema:Product/name 78,292 87 %
schema:Product/image 59,445 66 %
schema:Product/description 58,228 65 %
schema:Product/offers 57,633 64 %
schema:Offer/price 54,290 61 %
schema:Offer/availability 36,789 41 %
schema:Offer/priceCurrency 30,610 34 %
schema:Product/url 23,723 26 %
schema:Product/aggregateRating 21,166 24 %
schema:AggregateRating/ratingValue 20,513 23 %
schema:AggregateRating/reviewCount 14,930 17 %
schema:Product/manufacturer 10,150 11 %
schema:Product/brand 9,739 11 %
schema:Product/productID 9,221 10 %
schema:Product/sku 7,955 9 %
26. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 26
Adoption by Travel Websites
Top 15 Travel Websites (checked for schema:Hotel / any class):
Booking.com (uses Data-Vocabulary.org)
TripAdvisor
Expedia
Agoda
Hotels.com
Kayak
Priceline
Travelocity
Orbitz
ChoiceHotels
HolidayCheck
InterContinental Hotels Group
Marriott International
Global Hyatt Corp.

Adoption: 73%
27. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 27
Properties used to Describe Hotels
Top 10 Properties #PLDs %
schema:Hotel/name 4,173 88.35 %
schema:Hotel/address 3,311 70.10 %
schema:Hotel/telephone 2,488 52.68 %
schema:PostalAddress/streetAddress 2,362 50.01 %
schema:PostalAddress/addressLocality 2,231 47.24 %
schema:Hotel/url 2,102 44.51 %
schema:PostalAddress/postalCode 2,096 44.38 %
schema:AggregateRating/ratingValue 1,952 41.33 %
schema:Hotel/aggregateRating 1,866 39.51 %
schema:AggregateRating/bestRating 1,697 35.93 %
28. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 28
Adoption by Job Websites
Distribution by Top-Level Domain:
TLD #PLDs
jobs 908
com 828
org 263
co.uk 194
net 40
nl 38
ca 33
de 32

Top-10 Employment Sites (checked for schema:JobPosting):
Indeed.com
Monster.com
Careerbuilder.com
Snagajob.com
Jobsdb.com
Jobsearch.about.com
Jobs.net
Internships.com
Jobs.aol.com
Quintcareers.com

Adoption by Top-10: 70%
29. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 29
Properties used to Describe Job Postings
Top 10 Properties #PLDs %
JobPosting/title 2588 91.16 %
JobPosting/hiringOrganization 1412 49.74 %
JobPosting/description 1192 41.99 %
JobPosting/jobLocation 1062 37.41 %
Organization/name 862 30.36 %
JobPosting/datePosted 793 27.93 %
Place/address 471 16.59 %
JobPosting/baseSalary 227 8.00 %
JobPosting/industry 209 7.36 %
JobPosting/educationRequirements 145 5.11 %
30. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 30
Class / Property Distribution
Only a small set of
classes / properties
is used.
Strong focus on
Schema.org and
Facebook vocabularies.
schema.org
675 classes
965 properties
31. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 31
Opportunity 1: Search Engine Optimization
Get richer visibility in search results and potentially more clicks.
32. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 32
Opportunity 2: Change Push to Pull Communication
− Current situation:
• Information providers need to
push data into multiple channels
• multiple search engines
• multiple domain-specific portals
− Web approach:
• You maintain a website
• All interested parties
crawl your data
33. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 33
Opportunity 3: Applications beyond Rich-Snippets
− E-Commerce
• Rich source of product data, offers, and reviews
• Opportunity to build global product catalogs
• Opportunity to mine product and rating data on global-scale
− Tourism
• Additional data for tourism applications: Nearby local businesses, nearby
landmarks, nearby hospitals, nearby events
• Search engines as new competitors put pressure on large booking portals?
− Recruitment
• Increased market transparency
• Search engines as new competitors put pressure on job portals that charge
per posting?
− Highly up-to-date data
• as the original data providers know about changes first
34. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 34
Main Challenge: Data Integration and Cleansing
The schema is standardized, but
1. entity names differ
2. the schema is rather shallow, and only a small number of properties is widely used
3. data quality varies, as the data is created by experts and laypeople alike
35. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 35
Looking Deeper into the E-Commerce Data

Property #PLDs %
schema:Product/name 78,292 87%
schema:Product/description 58,228 65%
schema:Product/manufacturer 10,150 11%
schema:Product/brand 9,739 11%
schema:Product/productID 9,221 10%

1. The structure of the data is rather shallow
• Product features are encoded in titles and descriptions
• Example product name: “Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB”
• Example product description: “Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …”
• Product IDs are provided by only 10% of the websites
• Categorization information is provided by only 2% of the websites.
36. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 36
Categorization of Product Offers
− We analyzed 1.9 million product offers from 9,200 shops
− We trained a bag-of-words classifier for 9 product categories on product descriptions from Amazon.
Source: Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering
Microdata Markup. In: 4th Workshop on Data Extraction and Object Search (DEOS2014) @ WWW2014
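A rough sketch of the kind of bag-of-words categorizer described above, using scikit-learn; the training snippets, the category labels, and the choice of Naive Bayes are illustrative assumptions, not the exact setup of the cited paper.

# Illustrative bag-of-words product categorizer (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in training data; the study used Amazon product descriptions
# labeled with 9 categories.
descriptions = [
    "Faster flash storage, Intel Core i5, 64 GB solid state drive and USB 3.0",
    "Stainless steel 12-cup programmable coffee maker",
]
categories = ["Computers", "Home & Kitchen"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(descriptions, categories)

print(model.predict(["Apple MacBook Air 11-in, Intel Core i5 1.60GHz"]))
# -> ['Computers']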
37. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 37
Identity Resolution for Electronic Products
− We trained feature extractors for product descriptions on offers for
electronic products from Amazon.
− We used the Silk framework for identity resolution.
Precision = 85%
38. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 38
Starting Points for Further Improvements
− Identity Resolution
• Exploit product identifiers to learn better product recognizers
• 10% of the websites (9,221 PLDs) use s:Product/productID
• 1% of the websites (935 PLDs) use s:Product/gtin13
− Categorization of Products
• Exploit categorization information provided by a subset of the websites
• 1.5% of the websites (1,497 PLDs) use s:Offer/category
• 0.5% of the websites (460 PLDs) use s:WebPage/breadcrumb
• Challenge: integration of ~2,000 product taxonomies
Example breadcrumbs:
Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables > Dining Tables
Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jerseys > over $60
39. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 39
Conclusion: Semantic Annotations in HTML Pages
1. Wide-spread adoption of semantic annotations
• motivated by major search engines
2. Strong ontology agreement driven by data consumers
• Schema.org, Open Graph Protocol
3. Main application: Rich-snippets
4. Endless data pool for
• Commercial applications
• product and travel data integration and mining
• up-to-date listings of local businesses
• job search engines that increase market transparency
• Research
• large-scale data integration and mining
• information extraction (using annotations as distant supervision*)
* Foley et al.: Learning to Extract Local Events from the Web. SIGIR 2015
40. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 40
Download and Play with the Data
− http://www.webdatacommons.org/structureddata/
− Only the tip of the iceberg, as each website is only partly crawled.
41. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 41
2. Linked Data
(Diagram: data sources A–E publishing RDF, connected by RDF links)

Set of best practices for publishing structured data on the Web in the form of a single global data graph:
• by using RDF to publish structured data directly on the Web
• by setting links between data items within different data sources.
42. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 42
Links as Integration Hints
Publishing identity links on the Web:
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
owl:sameAs
<http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .

Publishing vocabulary links on the Web:
<http://xmlns.com/foaf/0.1/Person>
owl:equivalentClass
<http://dbpedia.org/ontology/Person> .
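For illustration, a small sketch (assuming the Python rdflib package) that loads the identity link from above and iterates over the owl:sameAs statements:

# Sketch: reading the identity link above with rdflib (assumed installed).
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse(data="""
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4>
  <http://www.w3.org/2002/07/owl#sameAs>
  <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
""", format="turtle")

# Each owl:sameAs pair is a hint that two URIs denote the same entity,
# so their descriptions can be merged during integration.
for s, o in g.subject_objects(OWL.sameAs):
    print(s, "denotes the same entity as", o)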
43. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 43
Effort Distribution between Publisher and Consumer
Effort can be distributed along a spectrum: publishers or third parties provide identity/vocabulary links, or consumers mine the missing identity/vocabulary links themselves.
44. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 44
LOD Datasets on the Web: April 2014
Growth (excluding the new Social Networking category): 94%
Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in
Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).
45. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 45
Uptake in the Government Domain
− Various efforts by public sector
institutions world-wide
− Forerunners
• UK government
• US government
− Types of data published
• statistical data
• environmental data
• budget and election data
− Goals
• Make data available to the public and
other government agencies
• Ease data integration by using standards,
providing unique identifiers and by setting links
46. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 46
Uptake in the Libraries Community
− Institutions publishing Linked Data
• Library of Congress (subject headings)
• German National Library (PND dataset and subject headings)
• Swedish National Library (Libris - catalog)
• Hungarian National Library (OPAC and digital library)
• Europeana Digital Library (4 million artifacts)
• Springer (metadata about conference proceedings)
− Goals:
1. Interconnect resources between repositories
(by topic, by location, by historical period, by ...)
2. Integrate library catalogs on a global scale
47. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 47
Uptake in the Life Science Domain
− Goals:
1. Connect life science datasets in order to support
• biological knowledge discovery
• drug discovery
2. Reuse results of previous integration efforts
48. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 48
Uptake in the Linguistic Research Community
http://linguistic-lod.org/llod-cloud
http://www.lider-project.eu/
49. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 49
Ontological Agreement
− Strong agreement on some vocabularies
− Proprietary vocabularies are used in
addition to common ones,
as data is often very specific
Widely-Used Vocabularies
Proprietary Vocabularies
50. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 50
RDF Links
− Some datasets put a lot of effort into linking
− Many datasets only link to a small number of
other datasets or do not set RDF links at all
Datasets with Top In-Degrees / Out-Degrees per Category
51. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 51
RDF Links in the LOD Cloud: August 2014
52. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 52
53. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 53
54. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 54
Linked Data as Background Knowledge for Data Mining
Which factors correlate with unemployment in France?
55. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 55
Unemployment Table with Additional Attributes
56. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 56
RapidMiner Linked Open Data Extension
Allows you to
1. link a local table to LOD data sources
2. extend the local table with additional attributes
3. mine the extended tables using all RapidMiner features
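Conceptually, the extension step boils down to fetching one extra attribute per row from a Linked Data source. A hedged Python sketch of that idea follows (the RapidMiner extension does this inside RapidMiner; pandas, SPARQLWrapper, the chosen DBpedia property, and the sample values are assumptions for illustration):

# Conceptual sketch of table extension against DBpedia
# (pandas and SPARQLWrapper assumed installed).
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative local table; the unemployment figures are made up.
table = pd.DataFrame({"region": ["Alsace", "Brittany"],
                      "unemployment": [9.1, 8.2]})

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def population(region_name):
    # Look the region up by its English label; whether a given label and
    # the dbo:populationTotal property match depends on the dataset.
    sparql.setQuery(f"""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?pop WHERE {{
          ?r rdfs:label "{region_name}"@en ;
             dbo:populationTotal ?pop .
        }} LIMIT 1""")
    rows = sparql.query().convert()["results"]["bindings"]
    return int(rows[0]["pop"]["value"]) if rows else None

table["population"] = table["region"].map(population)
print(table)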
57. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 57
Finding Correlations
− Use additional attributes to find interesting correlations
− Example correlation for unemployment in France:
• African islands, islands in the Indian Ocean,
outermost regions of the EU (positive)
• Population growth (positive)
• Energy consumption (negative)
• Hospital beds/inhabitants
(negative)
• Fast food restaurants (positive)
• Police stations (positive)
Source: Petar Ristoski, Christian Bizer, and Heiko Paulheim: Mining the Web of Linked
Data with RapidMiner. Semantic Web Challenge, Winner of the Open Track, 2014.
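Once the table is extended, finding such correlations is ordinary data mining; for example, a quick look with pandas (illustrative, continuing the assumed extended table from the sketch above; all values made up):

# Illustrative: correlating added attributes with unemployment
# (pandas assumed; values made up).
import pandas as pd

extended = pd.DataFrame({
    "unemployment":      [9.1, 8.2, 12.5, 7.4],
    "population_growth": [0.3, 0.1, 0.9, 0.0],
    "hospital_beds":     [6.2, 6.8, 4.1, 7.0],
})

# Pearson correlation of each added attribute with the target column.
print(extended.corr()["unemployment"].drop("unemployment"))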
58. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 58
Commercial Applications: Content Management at BBC
− Interconnect content management systems of different TV and radio stations.
− Similar efforts to connect content repositories at Elsevier and Springer.
Source: http://www.w3.org/2001/sw/sweo/public/UseCases/BBC
59. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 59
− IBM Rational uses Linked Data
technologies to connect data
from different
• software development tools
• software lifecycle tools
− Goals:
1. Make data independent
of concrete tool (IBM or third party)
2. Allow services (reporting, discovery)
to access data from all tools
3. Distributed data space as an
alternative to central repository or
integration hub / bus
Commercial Applications: Application Integration at IBM
Source: http://www.w3.org/2001/sw/sweo/public/UseCases/IBM
60. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 60
Conclusion: Linked Data vs. HTML-embedded Data
Linked Data | Microdata, Microformats, RDFa
~1,000 sources | millions of sources
covers a wider range of specific topics | focused on search engines and Facebook
more complex data structures | very simple and shallow data structures
partial ontology agreement | strong ontology agreement
data integration eased by RDF links | data integration often requires NLP techniques
various application prototypes, some industrial uptake | strong application pull by search engines
61. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 61
3. Knowledge Graphs
Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.
− Google Knowledge Graph
• development started in 2012, builds on Freebase
• 570 million objects described by over 18 billion facts (2012)
• 1,500 classes, 35,000 properties
− Microsoft Satori Knowledge Base
• revealed to the public in mid-2013
− Yahoo Knowledge Graph
• revealed to the public in early 2014
− Knowledge graphs employ RDF-style graph data models
62. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 62
Data Sources used to Build Knowledge Graphs
1. Wikipedia
• infoboxes, category system, information extraction from text
2. Open-license sources
• e.g. CIA World Factbook, MusicBrainz, …
3. Commercial third-party data
• e.g. IMDB, company listings, …
4. schema.org annotations in web pages
• e.g. contact information for companies
• e.g. logos of companies
Lots of effort is spent on data integration and manual data curation.
63. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 63
Application of the Google Knowledge Graph
− Enrich search results with knowledge cards and lists
− Goal: Fulfil information need without having users navigate to other
websites
64. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 64
Application of the Microsoft Knowledge Graph
65. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 65
Applications of the Google Knowledge Graph
1. Answer fact queries: “birthdate michael douglas”
2. Compare things: “compare eiffel tower vs empire state building”
66. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 66
Google Now Smart Cards
− Direct answers are especially important in the mobile context
− Google Now displays direct answers for 19.45% of the queries
(Source: Stone Temple Consulting, 2015)
− Medical facts are reviewed by an average of 11.1 doctors
(Source: Google)
67. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 67
New SEO Topic: How to influence Knowledge Graphs?
Source: http://searchengineland.com/leveraging-wikidata-gain-google-knowledge-graph-result-219706
68. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 68
Behind-the-Scenes Applications
− Google
• uses its knowledge graph to identify entities in web pages (entity linking)
• the Hummingbird ranking algorithm (deployed in 2013) uses the knowledge graph as background knowledge for ranking search results
− Yahoo
• uses its knowledge graph to “support applications across the company: Web Search, Content Understanding, Recommendation, Personalization, Advertisement”*
− Data Integration
• becomes matching data sources against knowledge graphs
as intermediate schemata.
Various tasks become easier if you know all entities in the world.
*Source: Nicolas Torzec, Yahoo 2014
69. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 69
Public Knowledge Graphs
70. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 70
The DBpedia Knowledge Base - Version 2014
− Describes 4.58 million things, out of which
4.22 million are classified in a consistent ontology
using 685 classes and 2,679 different properties
• 1,445,000 persons
• 735,000 places
• 241,000 organizations
• 123,000 music albums
− Altogether 3 billion pieces of information (RDF triples)
• 580 million were extracted from the English edition of Wikipedia
• 29,000,000 links to external web pages
• 50,000,000 external links into other RDF datasets
− DBpedia Internationalization
• provides data from 125 Wikipedia language editions for download
• For 28 popular languages DBpedia provides cleaned infobox data
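DBpedia is queryable via its public SPARQL endpoint. As a small illustration (SPARQLWrapper assumed installed; the query shape is illustrative), here is how one might list people born in Mannheim:

# Illustration: querying the public DBpedia SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
      ?person dbo:birthPlace dbr:Mannheim ;
              rdfs:label ?name .
      FILTER (lang(?name) = "en")
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])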
71. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 71
DBpedia @ BIS2015
1. Thursday, 10:00
The Past, Present & Future of DBpedia
Keynote by Dimitris Kontokostas
2. Thursday, 10:45
4th DBpedia Community Meeting
Room 2
72. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 72
Google Knowledge Vault
− Research project to build a knowledge base
using facts extracted from 1 billion web pages
1. Web text (TXT): Entity linking,
relationship extraction
2. HTML trees (DOM): Wrapper induction
3. HTML tables (TBL): Relational tables
4. Semantic Annotations (ANO): schema.org, OGP
− Employs probabilistic model for data fusion
− Results: 1.6 billion facts
• 271 million with confidence >90%
• 90 million not in Freebase
Source: Luna Dong, Evgeniy Gabrilovich, et al.: Knowledge Vault:
A Web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
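Knowledge Vault assigns each extracted fact a fused probability. The paper uses learned fusion models; as a simple stand-in, the noisy-OR combination below shows how independent extractor confidences can be merged into one score:

# Noisy-OR fusion of per-extractor confidences (a simple stand-in for
# the learned fusion models used in the Knowledge Vault paper).
def noisy_or(confidences):
    """Probability that a fact is true, assuming independent extractors."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= 1.0 - c
    return 1.0 - p_all_wrong

# A fact found by the TXT, DOM, and TBL extractors with these confidences:
print(round(noisy_or([0.6, 0.5, 0.3]), 2))  # -> 0.86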
73. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 73
Data Sources for Public Research in this Space
1. Common Crawl
• ~2 billion HTML pages
• updated every couple of months
2. WebDataCommons HTML Tables Corpus
• 147 million relational web tables
• selected out of the 11 billion tables contained in the Common Crawl
• http://webdatacommons.org/webtables/
3. WebDataCommons Microdata and RDFa Corpora
• 20.4 billion RDF triples
• http://www.webdatacommons.org/structureddata/
4. Billion Triples Challenge Dataset 2014
• 4 billion RDF triples crawled from Linked Data sources
• http://km.aifb.kit.edu/projects/btc-2014/
74. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 74
Conclusion: 2001 Article - The Semantic Web
Envisions three things to happen:
1. people publish data in structured form in addition to HTML pages on the Web
2. common vocabularies / ontologies are used to represent data
3. people implement cool applications that do smart things with the available data
Tim Berners-Lee, James Hendler and Ora Lassila:
The Semantic Web. Scientific American, May 2001.
75. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 75
4. Conclusions
1. Publication of Structured Data
• there is more data available than most people from research and industry realize
• especially schema.org annotations are currently gaining traction
• exciting test bed for research on data profiling and data integration techniques
2. Ontological Agreement
• exists due to application pull (Google, Facebook)
• but data-source-specific attributes are also important (e.g. in the life science or government statistics domains)
3. Applications
• the big players are moving (Rich-Snippets, Knowledge Graphs)
• there is a lot of further application potential in the available data
• experimentation in industry, but many efforts are still in the prototype stage
76. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 76
Thanks
− References
• Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa
and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC2014).
• Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014).
• Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering
Microdata Markup. 4th Workshop on Data Extraction and Object Search (DEOS2014).
− Detailed statistics on RDFa, Microdata and Microformats adoption
• http://www.webdatacommons.org/structureddata/
− Detailed statistics on Linked Data adoption
• http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
Editor's Notes
Since 2012
High agreement on vocabulary
Biggest datasets in their category (288 million product descriptions, 42 million reviews)
http://www.alexa.com/topsites/category/Top/Shopping
Amazon Instant Video: yes, with JSON-LD
Potential reason: HR databases are not structured
Hotels: 60% of bookings via websites, commission 20%
Tricky legal questions involved
Precision (Electronics) = 93%
Precision (Appeal) = 88%
Google: 300 people
Microsoft: 120 people
Yahoo: 30 people