Hello, My name is AhmadAssaf from SAP research France, and I am about to present the work I did with my colleague Aline Senart on Data Quality Principles in the Semantic Web. I would like to thank the organizing committee that gave me the opportunity to remotely present this paper as some unforeseen circumstances prevented me from attending the conference in person.
I will start my presentation by discussing the problems around data quality in the semantic web, then I will list some of the relevant work in that area. Afterwards, I will discuss some related context to our problem and then present our proposal as a set of questions and examples that we gathered throughout our experience with Data Integration in SAP. In the end, I will conclude and discuss some of the future work we planned.
With the rise of semantic web in recent years data became widely available on the web and the development and application of ontologies have been gaining big momentum in a range of application domains. As the Linked Open Data cloud diagram shows, The Web of Linked Data is growing rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. One of the core concepts of semantic web is to that anyone today has the ability to publish and retrieve information to be consumed or integrated into their applications, this resulted in new challenges such as havingheterogeneous sources with diverse data qualities For example; some values may be out of date, incomplete or incorrect or we can have some data sources that mayprovide conflicting values for the same properties.On the other hand, this huge amount of information is of great value especially for organizations and companies that can enrich their internal datasets with data from sources like social media feeds, sensor or governmental data … etc. Analyzing this new type of data within the context of existing enterprise data should bring new or more accurate business insights and allow better recognition of sales and market opportunities. But How easy it is actually to find, trust and integrate the USEFUL information ?!
So, how to properly integrate external data with internal ones? A certain level of quality is clearly necessary. Data Quality involves data management, modeling, analysis storage and presentation since it ensures that the data is fit to be combined and used to infer better business decisions. Quality is important in every data driven application; but it is a subjective measure, as the saying goes “beauty is in the eye of the beholder”. Quality is mainly realized when the data is used, it can be defined as fit for purpose of use !! But data can have multiple uses, and we want a generic framework to assess its quality. Several studies have found out that most data quality problems are in fact “data misinterpretations” or problems in the data semantics, for example if I am querying for a revenue per year value, different sources might have different presentations for the term “year”, it can be a fiscal or calendar year or even the last 12 months. By now we identified that there is a need for new Data quality principles in terms of measures and classes.
Now we will talk a little bit about some of the related work; several projects have proposed solutions to identify good data sources simplifying greatly the task of finding and consuming high-quality data, for example some have applied linkage ranking by analyzing incoming and outgoing links to sort out resources, however, these approaches ignore the fact that real-world entity can evolve over time and can fall short for temporal data. For example, a person can change her phone number and address and so facts that describe the same real-world entity at different times can contain different values.Sieve is another example of a framework that tried to express quality assessment and fusion methods. A very interesting and closely related attempt was made by MediaWikias they proposed a quality Criteria for Linked Data sources. Though this classification is good, some criteria on the quality of the used ontologies and the links between data and ontology concepts are missing.
Now, we will talk a little bit about some related context. as you know The main goal behind using Linked Data is to easily enable knowledge sharing and publishing. The basic assumption is that the usefulness of Linked data will increase if it is more interlinked with other data; Tim Berners-Lee defined 4 keys principles for publishing Linked Data, they are :Make the data available on the web: assign URIs to identify things.Make the data machine readable: use HTTP URIs so that looking up these names is easy.Use publishing standards: when the lookup is done provide useful information using standards like RDF.Link your data: include links to other resources to enable users to discover more things.
Building on the previous principles and based on our experience with powerful data integration software to extract, transform, and load data from applications, databases and other data sources, we have derived five principles for data quality in the Semantic Web Quality of data source: This principle is related to the availability of the data and the credibility of the data source.Quality of raw data: This principle is mainly related to the absence of duplicates, entry mistakes, and noise in the data.Quality of the semantic conversion: This principle is related to the transformation of raw data into rich data by using vocabularies.Quality of the linking process: This principle is related to the quality of links between two datasets.Global quality: This principle is cross-cutting the other principles and covers the source, raw data, semantic conversion, reasoning and links quality.
In the rest of the presentation, I will describe each of these principles. Lets start with the quality of data sources, they span mainly across 5 different attributes which are Accessibility, Authority & Sustainability,License, Trustworthiness & verifiability and Performance
So lets say that an organization wants to embed some external data, what are the challenges that it can face during the process with regards to the data source quality. ?At First, Will it be easy for the company to find relevant useful information, lets say that the company have found several resources that can serve its needs, can it simply trust all of them ?! What if the information needed is not available in any of the well known repositories like Freebase or DBpedia ? Another aspect that matters a lot to companies is how simple it is to directly use the data in terms of its license, is it friendly to use or a commercial license has to be purchased ?! This actually can be cumbersome as legal checks have to made to ensure good compliance with the company`s policies. Now, lets assume that the company managed to find a good, relevant trust worthy resource, will it be easy to access its data ? For example, does the data source provide access points for the data, whether they are SPARQL online endpoints or via a REST API or even someDownloadable dumps. Are the protocols and methods properly defined, for example if I am accessing the source via a REST API, is it well documented ?! In addition to all of this, the performance and throughput of the data source is another aspect especially when dealing with it directly and not through a downloadable dump. For example, how many requests can the API server daily, for example Freebase allows 100,000 queries per user a day, with a maximum of 10 queries a second. All of these points are vital to ensure that a data source can be qualified as a good quality one.================================================A de-referenceable Uniform Resource Identifier or de-referenceable URI is a resource retrieval mechanism that uses any of the internet protocols (e.g. HTTP) to obtain a copy or representation of the resource it identifies.
Now we will come to the Quality of raw data: This principle is mainly related to the absence of duplicates, entry mistakes, and noise in the data, it contains 7 main attributes that are: Accuracy, Comprehensibility, Completeness, Typing, Provenance, Versatility and Traceability
Raw data accuracy can be achieved by ensuring that nodes refer to factually and lexically correct information, this can be achieved by checking if the data is described using accurate non duplicated labels. In addition to this, raw data should be clean and up to date and formatted in a consistent way (i.e., are the dates all formatted yyyy/mm/dd)? Tools such as Google Refine  or Data Wrangler  provide already a good answer to these issues by allowing the cleaning of complex data sets.Moreover, data`s consistency should be ensured by validating the underlying vocabulary and syntax of the document with other resources, for example checks should be made to guarantee that there is no definition of entities as being members of disjoint classes, confirm valid usage of inverse-functional properties, there is no redefinition of existing properties, ensure usage of homogeneous data types and that there is no stating of inconsistent values for properties. For example, is the population of Europe the same as the sum of the population of the European countries?
To continue with the other attributes, data is comprehensible by providing human-readable labeling of classes, properties and entities by providing an rdfs:label and human-readable description of them by providing an rdfs:comment. Moreover, by giving indications of metadata about a dataset,one or more exemplary URIs,an exemplary SPARQL query, and of some of the vocabularies used. In addition to comprehensibility, completeness is a measure to check if there is all the data needed to represent information about a real world entity, Moreover, is the data related or linked to this set complete as well, e.g., all European countries, all French cities, all street addresses, all postal codes and so on ?!Having data properly typed as concepts from vocabularies allow users to step further into analyzing the data and reach more informed business decisions, so typing the data is an important quality factor of linked data.Moreover, provenance in the Semantic Web is considered as one of the most important indicators of "quality." Data sets can be used or rejected depending on the availability of sufficient and/or relevant metadata attached, there are lots of important that can be asked on the data available, for example: Where did the data originate from? Why was the data received? Who is the data describing? Who is the author of this data … and more other questions
Ensuring data versatility is important to ensure a good quality wide spread data. For example, provision of the data in various formats,provision of the data in various languages,human and machine -readable indications of a SPARQL-endpoint.To handle various representations, a data source can enable content negotiation. Thereby a consumer can specify accepted formats and languages by adding a corresponding accept-header to an HTTP request.Moreover, Raw data and its attached information like queries, formulas should be also traceable especially if it is a result of a merge process.
Now we will talk about the Quality of the semantic conversion: This principle is related to the transformation of raw data into rich data by using vocabularies and consists of three attributes that are: Correctness, granularity and consistency
Semantic conversion is the process of transforming “normal” raw data into “rich” data, so we have as our input a tabular data for example, and we want to have an RDF representation of that data using some vocabulary. The use of high quality vocabularies and the efficiency of data discovery process are major factors in increasing the quality of data. Moreover, one of the most important aspect that affects the quality of the semantic conversion is the quality and suitability of its data model with the intended usage, amodel that does not fit your usage could lead to computationally expensive workarounds.Data structure should be properly modeled and applied correctly to all the concepts, so generally any two URI references connected by owl:sameAs should be the same thing. In addition to that, the model should capture enough information to present all the expected data. Finally, the conversion should be done correctly and consistently, an example taken from Dbpedia is that a resource can state different minimal temperatures of a planet ,the respective property being defined as having Kelvin values. These values result from an erroneous conversion of the Kelvin and Celsius values extracted from wikipedia.
Now we will talk about the quality of the linking process, This principle is related to the quality of links between two datasets and consists of three main attributes that are: connectedness, isomorphism and directionality
connectedness is a measure to ensure that the merge or combination happened at the correct points, Frameworks like Silk  ease the linking process but don’t tackle per se the quality of the links that are generated. The quality depends on the link generation configuration. The quality is however improved if your data is linked to some reference dataset. Many Linked Data applications have come to rely on owl:sameAs for linking datasets. However, the current semantics for owl:sameAs assert that identity entails isomorphism, or that if a=b, then all statements of a and b are shared by both. This becomes problematic when dealing with provenance, context, and imperfect representations, all of which are endemic issues in Linked Data. Ensuring that the combined sets are modeled in a compatible way and reconciled is an important factor. In addition to this, the graph representation of the resources after the linkage should be consistent in what we call the directionality property.
Now we will talk about some global quality measures that cross cut all the previous classes ( the data source, raw data, links … etc. ). An important factor that should be ensured is the timeliness of that class, for example ensuring that the data source contain that latest up to date raw data, and the inbound and outbound to this data are also updated. In the source specification, is the update and validation frequency stated so that updates are made automatically ? For example, in an astronomical data source, the discovery of an additional moon in the solar system doesn't necessarily influence the correctness of the present data but failing to update changed data however does, (e. g. the redefinition of Pluto now being a dwarf planet resulting in the nullification of the fact of Pluto being a planet).Another aspect is the history and changes record, in order to keep track of who changed the data and when.Finally, the freshness of each class is another important aspect, for example the ability to replicate the remote repository into local triple stores ( via a local dump) and maintain the timeliness of the replica in all of its aspects. Depending on the amount of changed data, it may be important to have data dumps with deltas to previous major releases. For example monthly dump of full dataset and daily deltas. For smaller change sets the use of feeds to announce changes can be very fitting.
To Conclude, we have identified the data quality in the semantic web is of increasing importance, hence we have proposed five main classes of data quality principles to be applied in the semantic context. For each class, we listed the specific criteria that represent the quality of a data source on the Web. We believe that following these principles will lead to higher quality Semantic Web, which will result in better data usage and mash-ups thus more informed decisions. We have applied these principles in several scenarios and use cases in some of the systems in SAP, due to the lack of space we did not present our preliminary results in this paper; however we plan to apply them in a broader set of use cases and scenarios in order to have better evaluation. We would also like to provide tools that can formally prove that these principles are met.
Data quality principles in the semantic web
Data Quality Principles in the Semantic WebAhmad Assaf and Aline SenartSAP Research, Real-Time Intelligence Program, SAP Research France SAS1st International Workshop on Data Quality Management and Semantic Technologies(DQMST)September 21, 2012 Public