This document proposes using web data provenance for automated quality assessment. It defines provenance as information about the origin and processing of data. The goal is to develop methods to automatically assess quality criteria like timeliness. It outlines a general provenance-based assessment approach involving generating a provenance graph, annotating it with impact values representing how provenance elements influence quality, and calculating a quality score with an assessment function. As an example, it shows how the approach could be applied to assess the timeliness of sensor measurements based on their provenance.
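The assessment approach described above can be sketched concretely. In this minimal sketch the graph shape, the impact values, and the linear decay function are illustrative assumptions, not the paper's actual model: each provenance element carries a creation time and an impact weight, and the assessment function returns an impact-weighted average of per-element timeliness scores.

```python
from datetime import datetime

# Hypothetical provenance graph for one sensor measurement: each element
# records when it was created and an impact value in [0, 1] describing how
# strongly it influences the quality of the final data item.
provenance = [
    {"element": "sensor_reading",   "created": "2024-01-01T12:00:00", "impact": 0.6},
    {"element": "aggregation_step", "created": "2024-01-01T12:05:00", "impact": 0.3},
    {"element": "unit_conversion",  "created": "2024-01-01T12:06:00", "impact": 0.1},
]

def timeliness(created_iso, now, max_age_s=3600.0):
    """Linear decay from 1 (fresh) to 0 (older than max_age_s)."""
    age = (now - datetime.fromisoformat(created_iso)).total_seconds()
    return max(0.0, 1.0 - age / max_age_s)

def assess(provenance, now):
    """Assessment function: impact-weighted average of timeliness scores."""
    total = sum(e["impact"] for e in provenance)
    return sum(e["impact"] * timeliness(e["created"], now) for e in provenance) / total

now = datetime.fromisoformat("2024-01-01T12:30:00")
score = assess(provenance, now)  # quality score between 0 and 1
```

Other quality criteria would slot in by swapping the per-element scoring function while keeping the same annotated-graph structure.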
The goal of this paper is to explore executive perceptions and opinions about real-time data applications and operations. We interviewed more than forty Key Innovation Leaders cited as the most innovative thinkers in the world of analytics and documented distinctive case studies that have clearly optimized business intelligence.
Introduction to Data Governance
Seminar hosted by Embarcadero Technologies, where Christopher Bradley presented a session on Data Governance.
Drivers for Data Governance & Benefits
Data Governance Framework
Organization & Structures
Roles & responsibilities
Policies & Processes
Programme & Implementation
Reporting & Assurance
The enterprise marketer's playbook: Building an integrated data strategy.
An integrated data strategy can help any business see customer journeys more clearly ― and then give customers more relevant ads and experiences that get results. So why doesn't everyone have such a strategy? We look at what sets the marketing leaders apart.
Let marketing data be your guide
If you've ever felt too swamped by data to find the customer insights you need, you're not alone. But there's a new and better approach to gaining deeper audience insights: building an integrated data strategy.
Read this report to learn how:
86% of senior executives agree that eliminating organizational silos is critical to expanding the use of data and analytics in decision-making.
75% of marketers agree that lack of education and training on data and analytics is the biggest barrier to more business decisions being made based on data insights.
Leading marketers are 59% more likely to use digital analytics to optimize the user experience in real time.
Augmented Analytics and Automation in the Age of the Data Scientist (WhereScape)
At DAMA Day NYC, WhereScape's CTO Neil Barton spoke about the automation of data infrastructure as a necessary component to effectively enable the citizen data scientist and augmented analytics.
Neil also discussed how AI/ML can be used to recommend data ingestion pipelines and models in either supervised or unsupervised paradigms.
From Information to Insight: Data Storytelling for Organizations (Thinking Machines)
What kind of stories are best told with data? How do you take raw numbers and turn them into an engaging, meaningful story? Thinking Machines' content strategist Pia Faustino delivered this presentation on the data storytelling process at the "Humans + Machines: Using Artificial Intelligence to Power Your People" conference on February 19, 2016 in Bonifacio Global City, Taguig, Philippines.
User analysis is the process by which we track how users engage and interact with our digital product (software or mobile application) in an attempt to derive business insights for marketing, product & development teams.
These insights are then used by teams across the business to launch a new marketing campaign, decide on features to build for an app, track the success of the app by measuring user engagement, and improve the experience altogether while helping the business grow.
You are working with the product team of Instagram and the product manager has asked you to provide insights on the questions asked by the management team.
1. Find the 5 oldest users of Instagram from the database provided
2. Find the users who have never posted a single photo on Instagram
3. Identify the winner of the contest and provide their details to the team
4. Identify and suggest the top 5 most commonly used hashtags on the platform
5. What day of the week do most users register on? Provide insights on when to schedule an ad campaign
6. Provide the average number of posts per user on Instagram, i.e. the total number of photos divided by the total number of users
7. Provide data on users (bots) who have liked every single photo on the site (since no normal user would be able to do this)
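Questions like these map directly onto SQL. A minimal sketch against a toy SQLite schema follows; the table and column names (`users`, `photos`, `created_at`) are assumptions for illustration, not the actual exercise dataset:

```python
import sqlite3

# Build a tiny in-memory stand-in for the Instagram-style database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, username TEXT, created_at TEXT);
CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
INSERT INTO users VALUES
  (1, 'alice', '2016-01-01'), (2, 'bob', '2016-03-01'), (3, 'cara', '2017-05-01');
INSERT INTO photos VALUES (1, 1), (2, 1), (3, 3);
""")

# Q1: oldest users = earliest registration dates.
oldest = con.execute(
    "SELECT username FROM users ORDER BY created_at LIMIT 5").fetchall()

# Q2: users who have never posted a photo (anti-join via LEFT JOIN).
inactive = con.execute("""
    SELECT u.username FROM users u
    LEFT JOIN photos p ON p.user_id = u.id
    WHERE p.id IS NULL""").fetchall()

# Q6: average posts per user = total photos / total users.
avg_posts = con.execute(
    "SELECT CAST(COUNT(*) AS REAL) / (SELECT COUNT(*) FROM users) FROM photos"
).fetchone()[0]
```

The same LEFT JOIN / anti-join pattern generalizes to the bot-detection question by comparing each user's like count against the total photo count.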
Baking analytics into the culture of an organization is not always easy because it doesn't come intuitively to humans. This presentation, given at the Kumpul co-working space in Sanur, Bali, shares my team's experience in building a data-driven culture at TradeGecko.
Data Engineering and the Data Science Lifecycle (Adam Doyle)
Everyone wants to be a data scientist. Data modeling is the hottest thing since Tickle Me Elmo. But data scientists don’t work alone. They rely on data engineers to help with data acquisition and data shaping before their model can be developed. They rely on data engineers to deploy their model into production. Once the model is in production, the data engineer’s job isn’t done. The model must be monitored to make sure that it retains its predictive power. And when the model slips, the data engineer and the data scientist need to work together to correct it through retraining or remodeling.
How to Build & Sustain a Data Governance Operating Model (DATUM LLC)
Learn how to execute a data governance strategy through creation of a successful business case and operating model.
Originally presented to an audience of 400+ at the Master Data Management & Data Governance Summit.
Visit www.datumstrategy.com for more!
Slides from the impulse talk "Data Strategy & Governance" at BI or DIE LEVEL UP 2022.
Recording of the talk: https://www.youtube.com/watch?v=705DfyfF5-M
Accenture Strategy surveyed 1,252 business leaders from diverse industries across the world to better understand the degree to which companies are capturing ecosystem opportunities.
Big Data Management: What's New, What's Different, and What You Need To Know (SnapLogic)
This presentation is from a recorded webinar with 451 Research analyst and thought leader Matt Aslett for a discussion about the growing importance of the right data management best practices and techniques for delivering on the promise of big data in the enterprise. Matt reviews the big data landscape, how the data lake complements and competes with the data warehouse, and key takeaways as you move from big data test and development environments to production. You can watch the webinar here: http://bit.ly/25ShiQu
Master Data Management – Aligning Data, Process, and Governance (DATAVERSITY)
Master Data Management (MDM) provides organizations with an accurate and comprehensive view of their business-critical data such as customers, products, vendors, and more. While mastering these key data areas can be a complex task, the value of doing so can be tremendous – from real-time operational integration to data warehousing and analytic reporting. This webinar will provide practical strategies for gaining value from your MDM initiative, while at the same time assuring a solid architectural and governance foundation that will ensure long-term, enterprise-wide success.
Data modelling is considered a staple in the world of data management. The skill of the data modeler and their knowledge of the business plays a large role in successful Enterprise Information Management across many organizations. Data modeling requires formal accountability, attention to metadata and getting the business heavily involved in data requirement development. These are all traits of solid Data Governance programs.
Join Bob Seiner and a special guest modeler extraordinaire in this month’s installment of Real-World Data Governance to discuss data modeling as a form of data governance. Learn how to use the skillfulness of the data modeler to advance data-as-an-asset and governance agendas while conveying the importance and value of both disciplines.
In this webinar Bob and a special guest will talk about:
•Data Modeling as Art or Science
•Role of Data Modeler in a Governance Program
•Data Modeler Skills as Governance Skills
•Modeling and Governance Best Practices
•Leveraging the Model as a Governance Artifact
It’s been three years since the General Data Protection Regulation shook up how organizations manage data security and privacy, ushering in a new focus on Data Governance. But what is the state of Data Governance today?
How has it evolved? What’s its role now? Building on prior research, erwin by Quest and ESG have partnered on a new study about what’s driving the practice of Data Governance, program maturity and current challenges. It also examines the connections to data operations and data protection, which is interesting given the fact that improving data security is now the No. 1 driver of Data Governance, according to this year’s survey respondents.
So please join us for this webinar to learn about the:
Other primary drivers for enterprise Data Governance programs
Most common bottlenecks to program maturity and sustainability
Advantages of aligning Data Governance with the other data disciplines
In a post-COVID world, data has the power to be even more transformative, and 84% of business and technology professionals say it represents the best opportunity to develop a competitive advantage during the next 12 to 24 months. Let’s make sure your organization has the intelligence it needs about both data and data systems to empower stakeholders in the front and back office to do what they need to do.
Data Modelling 101 half day workshop presented by Chris Bradley at the Enterprise Data and Business Intelligence conference London on November 3rd 2014.
Chris Bradley is a leading independent information strategist.
Contact chris.bradley@dmadvisors.co.uk
Here's a starting template for anyone presenting a data science topic to elementary school students. It shows how fun the field is and how strong the job market for these skills is, and includes hyperlinks to various examples of interesting interactive visualizations.
Data governance with Unity Catalog Presentation (Knoldus Inc.)
Databricks Unity Catalog is the industry’s first unified governance solution for data and AI on the lakehouse. With Unity Catalog, organizations can seamlessly govern their structured and unstructured data, machine learning models, notebooks, dashboards and files on any cloud or platform. Data scientists, analysts and engineers can use Unity Catalog to securely discover, access and collaborate on trusted data and AI assets, leveraging AI to boost productivity and unlock the full potential of the lakehouse environment. This session will cover the potential of Unity Catalog to achieve a flexible and scalable governance implementation without sacrificing the ability to manage and share data effectively.
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi... (DATAVERSITY)
Developing a Data Strategy for your organization can seem like a daunting task. The opportunity in getting it right can be significant, however, as data drives many of the key initiatives in today’s marketplace from digital transformation, to marketing, to customer centricity, population health, and more. This webinar will help de-mystify data strategy and data architecture and will provide concrete, practical ways to get started.
Organizations have been collecting, storing, and accessing data from the beginning of computerization. Insights gained from analyzing the data enable them to identify new opportunities, improve core processes, enable continuous learning and differentiation, remain competitive, and thrive in an increasingly challenging business environment.
The well-established data architecture, consisting of a data warehouse, fed from multiple operational data stores, and fronted by BI tools, has served most organizations well. However, over the last two decades, with the explosion of internet-scale data, and the advent of new approaches to data and computational processing, this tried-and-true data architecture has come under strain, and has created both challenges and opportunities for organizations.
In this green paper, we will discuss modern approaches to data architecture that have evolved to address these challenges and provide a framework for companies to build a data architecture and better adapt to increasing demands of the modern business environment. This discussion of data architecture will be tied to the Data Maturity Journey introduced in EQengineered’s June 2021 green paper on Data Modernization.
Edelman's Social Intelligence Command Center (SICC), by Edelman Digital
Edelman's Social Intelligence Command Center, or "SICC", is the firm's proprietary system combining real-time analytics with insights, content development, and engagement strategies and tactics. SICCs are a combination of people (staffing), process, and a variety of tech platforms converged in collaborative physical spaces. This presentation outlines how the space operates in combination with Edelman's distinct SICC approach. For more information please reach out to david(dot)armano(at)edelman(dot)com.
Provenance Analysis and RDF Query Processing: W3C PROV for Data Quality and T... (satyasanket)
The tutorial will be of interest to: (a) academic researchers who are incorporating provenance metadata in their research for data quality; (b) developers working on scalable platforms for emerging domain applications, such as IoT, LOD, and healthcare and life sciences. In addition to its meaningful breadth, the tutorial will present key technical topics that have seen significant research. These include provenance modeling, querying and indexing techniques for W3C RDF datasets for provenance querying, and building complex provenance-enabled healthcare informatics platforms. The tutorial will cover the W3C PROV specifications, which are being used to integrate provenance in information systems, including the PROV Data Model (PROV-DM), PROV Ontology (PROV-O), and the PROV constraints.
Data Usability Assessment for Remote Sensing Data: Accuracy of Interactive Data Quality Interpretation (Beniamino Murgante)
Erik Borg, Bernd Fichtelmann - German Aerospace Center, German Remote Sensing Data Center
Hartmut Asche - Department of Geography, University of Potsdam
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment (Umair ul Hassan)
Crowdsourcing has emerged as a powerful paradigm for quality assessment and improvement of Linked Data. A major challenge of employing crowdsourcing for quality assessment in Linked Data is the cold-start problem: how to estimate the reliability of crowd workers and assign the most reliable workers to tasks? We address this challenge by proposing a novel approach for generating test questions from DBpedia based on the topics associated with quality assessment tasks. These test questions are used to estimate the reliability of new workers. Subsequently, tasks are dynamically assigned to reliable workers to help improve the accuracy of collected responses. Our proposed approach, ACRyLIQ, is evaluated using workers hired from Amazon Mechanical Turk on two real-world Linked Data datasets. We validate the proposed approach in terms of accuracy and compare it against the baseline approach of reliability estimation using gold-standard tasks. The results demonstrate that our proposed approach achieves high accuracy without using gold-standard tasks.
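The cold-start idea can be illustrated with a deliberately simplified sketch. The scoring rule, the threshold, and all names below are illustrative assumptions, not ACRyLIQ's actual assignment algorithm: new workers are scored on generated test questions, and real tasks go only to workers whose reliability estimate clears a bar.

```python
def reliability(answers, gold):
    """Fraction of test questions the worker answered correctly."""
    correct = sum(1 for q, a in answers.items() if gold.get(q) == a)
    return correct / len(gold)

def assign(workers, threshold=0.8):
    """Keep only workers whose reliability estimate clears the threshold."""
    return [w for w, r in workers.items() if r >= threshold]

# Test questions generated from a knowledge base, with known answers.
gold = {"q1": "A", "q2": "B", "q3": "A"}
workers = {
    "w1": reliability({"q1": "A", "q2": "B", "q3": "A"}, gold),  # all correct
    "w2": reliability({"q1": "A", "q2": "C", "q3": "B"}, gold),  # one correct
}
trusted = assign(workers)
```

The paper's contribution is in generating those test questions automatically from DBpedia topics matched to the task, rather than hand-curating gold-standard tasks.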
"Methodology for Assessment of Linked Data Quality: A Framework" at Workshop on Linked Data Quality
Paper: https://dl.dropboxusercontent.com/u/2265375/LDQ/ldq2014_submission_3.pdf
Linked Data Quality assessment applied and integrated to the Linked Data generation and publication workflow. Presented at the Data Quality tutorial, satellite event at SEMANTICS2016.
Assessing and Refining Mappings to RDF to Improve Dataset Quality (andimou)
RDF dataset quality assessment is currently performed primarily after data is published. However, there is neither a systematic way to incorporate its results into the dataset nor the assessment into the publishing workflow. Adjustments are manually (but rarely) applied. Nevertheless, the root of the violations, which often derives from the mappings that specify how the RDF dataset will be generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for RDF datasets stemming originally from (semi-)structured data (e.g., CSV, XML, JSON). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the RDF dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as DBpedia, or newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an RDF dataset in the observed cases.
METHODS, MATHEMATICAL MODELS, DATA QUALITY ASSESSMENT AND RESULT INTERPRETATION: SOLUTIONS DEVELOPED IN THE IFEDH FRAMEWORK (HTAi Bilbao 2012)
G. Zauner
dwh Simulation Services
Vienna , Austria
A brief introduction to Data Quality rule development and implementation, covering:
- What are Data Quality Rules?
- Examples of Data Quality Rules
- The benefits of rules
- How can I create my own rules?
- What alternate approaches are there to building my own rules?
The presentation also includes a very brief overview of our Data Quality Rule services. For more information on this please contact us.
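As a concrete illustration of what such rules look like in practice, data quality rules are often expressed as named predicates over records. The rule names and record fields below are hypothetical examples, not part of any product or the presentation itself:

```python
import re

# Illustrative data quality rules as (name, predicate) pairs.
RULES = [
    ("email_format", lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "")) is not None),
    ("age_in_range", lambda r: isinstance(r.get("age"), int)
        and 0 <= r["age"] <= 130),
    ("name_present", lambda r: bool(r.get("name", "").strip())),
]

def validate(record):
    """Return the names of the rules a record violates."""
    return [name for name, check in RULES if not check(record)]

good = {"name": "Ada", "email": "ada@example.com", "age": 36}
bad  = {"name": "",    "email": "not-an-email",    "age": 200}
violations = validate(bad)  # flags all three rules
```

Keeping rules as data (a list of named checks) rather than scattered `if` statements is what makes them easy to report on, version, and extend.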
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... (Mark Wilkinson)
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3.js.
See http://provoviz.org
ATAGTR2017 Bee-Hive approach for Big Data Testing [End to End Continuous Test...Agile Testing Alliance
The presentation on Bee-Hive approach for Big Data Testing [End to End Continuous Test Automation solution for Big Data] was done during #ATAGTR2017, one of the largest global testing conference. All copyright belongs to the author.
Author and presenter : Usharani Subramanian
Supply chain analytics solutions combine technology and human effort to compare and highlight opportunities in supply chain functions. They leverage enterprise applications, web technologies, and data warehouses to locate patterns among transactional, demographic and behavioral data
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesAmit Sheth
Keynote/Invited Talk
IFIP TC-11 First Working Conference on
Keynote/Invited Talk at the IFIP TC-11 First Working Conference on
Integrity and Internal Control in Information Systems
Zurich, Switzerland, December 4-5, 1997
Open Data, by definition, provides the chance to re-shape and publish heterogeneous pieces and fragments of information which are open, namely anyone is free to use, reuse, and redistribute it. In order for users to fully benefit this idea, Open Data Systems of tomorrow must provide high quality data, relying on real time and ubiquitous services, along with a deep integration with mobile and smart devices and infrastructures.
In this session, we present a syntheses of Whitehall proposal addressed a this vision: is addressed at building Open Data in a fully-fledged Big Data infrastructure, realized using graph based and NoSQL technologies. This idea is shaped in a cultural heritage scenario, where data in envisaged at valorizing one of the main assets of Italy: cultural heritage.
"At the toolbar (menu, whatever) associated with a document there is a button marked "Oh, yeah?". You press it when you lose that feeling of trust. It says to the Web, 'so how do I know I can trust this information?'. The software then goes directly or indirectly back to metainformation about the document, which suggests a number of reasons."
Tim Berners-Lee, W3C Chair, Web Design Issues, September 1997
Provenance is focused on the description and understanding of where and how data is produced, the actors involved in the production of such data, and the processes by which the data was manipulated and transformed until it arrived to the collection from which it is being accessed. Provenance aims at providing the ability to trace the sources of data, enabling the exploration not just of the relationships between datasets, but also of their authors and affiliations, with the goal of preserving data ownership and establishing a notion of trust based on authenticity and reliability.
The Future Internet poses important challenges for provenance, derived from complex and rich scenarios characterized by the presence of large amounts of data stemming from heterogeneous sources like user communities, services, and things. Such challenges span across technical but also socioeconomic dimensions. The former includes aspects like vocabularies for representing provenance, interoperability and scalability issues, and means to produce, acquire, and reason with provenance in order to provide measures of trust and information quality. However, it is probably in the socieconomic dimension where more significant efforts need to be made as to addressing issues like the role of provenance in the overall picture of the Future Internet, entry barriers preventing the generation of provenance-aware internet content, means required to incentivate the production of such content, and ways to prevent provenance forgery.
In this talk, we provide and overview on provenance and the above mentioned challenges and introduce ongoing work in order to address trust issues from the provenance perspective in the Future Internet. We also link provenance to other relevant aspects for trust discussed in the session, like security, legal frameworks, and economics.
HCLT Brochure: E-Discovery and Document Review SolutionsHCL Technologies
http://www.hcltech.com/search/apachesolr_search/business-services~ More on Business Services
With the number of litigations expected to increase due to the economy, corporations and law firms are increasingly concerned with cost effective high-quality electronic d`iscovery (“e-discovery”) solutions. With 70% of the total cost of a litigation attributed to the document review fees, corporations and law firms must select innovative document review solutions to stay in budget. Simple Solutions’ e-Discovery and Document Review Services provides corporations and law firms with high quality, cost-effective document review services that gives them the cost certainty needed to stay in budget.
e-Discovery companies are leveraging cloud computing and deployment of Software as a Service (SaaS) platforms with focus on back office services to improve legal compliance service levels.
Download our e-Discovery and Document Review Solutions Brochure to understand how HCL focuses on creating efficient and cost effective document review solutions by marrying e-discovery.
Future of test automation tools & infrastructureAnand Bagmar
After being in the IT field for 15+ years of which 11+ years in the software test field, I am sharing my view of the trend in the industry in terms of UI advancements, and, I would like to present a new generation of test automation framework - UDD - UI Driven Development.
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
Mike Gualtieri, Principal Analyst, Forrester Research, presents at the Big Analytics Roadshow, 2012 in New York City on December 12, 2012
Presentation title: Evaluating Big Data Predictive Analytics Platforms
Abstract: Great. You have Big Data. Now what? You have to analyze it to find game-changing predictive models that you can use to make smart decisions, reduce risk, or deliver breakthrough customer experiences. Big Data Predictive Analytics solutions are software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources. In this session, Forrester Principal Analyst Mike Gualtieri will discuss the key criteria you should use to evaluate Big Data Predictive Analytics platforms to meet your specific needs.
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
Using Web Data Provenance for Quality Assessment
1. Using Web Data Provenance for Quality Assessment
Olaf Hartig*  Jun Zhao˚
*Humboldt-Universität zu Berlin  ˚University of Oxford
2. Information Quality (IQ)
● Common definition: fitness for use of information
● Multidimensional concept
Category* Criteria / Dimensions
Intrinsic Accuracy, Believability, Objectivity, ...
Contextual Completeness, Relevance, Timeliness, ...
Representational Conciseness, Understandability, ...
Accessibility Availability, Security, ...
*Classification by Wang and Strong, 1996
● IQ criteria not independent of each other
● Relevancy of criteria determined by task and preferences
Olaf Hartig - Using Web Data Provenance for Quality Assessment 2
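The Wang and Strong classification above can be captured as a simple lookup, e.g. so that a task can select the criteria relevant to it (a minimal sketch; the lowercase names are illustrative, not a fixed vocabulary):

```python
# IQ categories and example criteria, following the Wang and Strong (1996)
# classification shown above.
IQ_CATEGORIES = {
    "intrinsic": ["accuracy", "believability", "objectivity"],
    "contextual": ["completeness", "relevance", "timeliness"],
    "representational": ["conciseness", "understandability"],
    "accessibility": ["availability", "security"],
}

def category_of(criterion):
    """Return the category a given IQ criterion belongs to."""
    for category, criteria in IQ_CATEGORIES.items():
        if criterion in criteria:
            return category
    raise KeyError(criterion)
```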
3. IQ Assessment
● Assigning numerical values (IQ scores) to IQ criteria
● It is difficult!
● Precision vs. Practicality
● Manual methods: Questionnaires
● Semi-automatic methods: Rating-based, Reputation-based
4. Automated IQ Assessment
● Literature only outlines ideas for automatic methods
● Content analysis
● Comparison (e.g. outlier detection)
● Application of information retrieval methods
● Analysis of results from data cleansing
● Sampling techniques
● Context analysis
● Analysis of metadata
● Utilization of domain knowledge
5. Our Goal: Methods to automatically assess IQ criteria of Web data
Primary means: Provenance of the assessed data
6. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
7. Existing Provenance Research
● Main research areas: (scientific) workflows, DBMSs
● General focus: data creation
8. Provenance of Web Data
9. Provenance of Web Data
Web data provenance comprises two dimensions: Data Creation • Data Access
10. Model of Web Data Provenance
● Provenance graph describes provenance of a data item
● Nodes: provenance elements – pieces of provenance info
● Edges: relate provenance elements to each other
● Subgraphs for related data items possible
11. Model of Web Data Provenance
● Provenance model defines:
● Types of provenance elements: Actors, Executions, Artifacts
● Relationships between them
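The model can be sketched as a small typed graph: nodes are provenance elements of one of the three types, and edges are named relationships between them (a hypothetical minimal encoding for illustration, not the authors' implementation):

```python
from dataclasses import dataclass, field

# The three element types distinguished by the provenance model.
ACTOR, EXECUTION, ARTIFACT = "Actor", "Execution", "Artifact"

@dataclass
class ProvenanceGraph:
    """Nodes are provenance elements; edges are named relationships."""
    nodes: dict = field(default_factory=dict)   # element id -> element type
    edges: list = field(default_factory=list)   # (source, relation, target)

    def add(self, node_id, element_type):
        self.nodes[node_id] = element_type

    def relate(self, source, relation, target):
        self.edges.append((source, relation, target))

g = ProvenanceGraph()
g.add("msr", ARTIFACT)      # a data item
g.add("cExc", EXECUTION)    # the execution that created it
g.add("sensor1", ACTOR)     # the actor that performed the execution
g.relate("msr", "created by", "cExc")
g.relate("cExc", "performed by", "sensor1")
```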
12. Data Access Dimension
[Diagram: a Document contains the Data Item and is retrieved by a Data Access execution, which a (non-human) Data Accessor performs at a given Execution Time. The Data Access accesses a (non-human) Data Providing Service, which a Service Provider controls and a (human) Data Publisher uses; the publisher stands in relation to the provided information resource.]
13. Data Access Dimension cont.
[Diagram: an Integrity Verification of the accessed (verified) Artifact yields a Verification Result; a Signature Verification, modeled as {incomplete}, involves a Signer and a Signature Method and relates to the signed data.]
14. Data Creation Dimension
[Diagram: a Data Creation execution, performed at a given Execution Time and possibly guided by Creation Guidelines, transforms Source Data into the created Data Item, which can be part of an encompassing Data Item. A Data Creator (human or non-human) is responsible for the creation; its subtypes, modeled as {complete, disjoint}, are Data Creating Device (e.g. sensor), Data Creating Service (e.g. software agent), and Data Creating Entity (e.g. person, group, organization). Provenance information can be attached to these elements.]
15. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
16. A General Approach
● Blueprint for actual assessment methods that
● address a specific scenario
● focus on a specific IQ criterion
● Provenance elements have an influence on IQ
● Impact values represent these influences
● Assessment is affected by knowing about the influences
● Calculation of the IQ score with an assessment function that combines all impact values
17. General Assessment Procedure
Step 1 – Generate a provenance graph for the data item
Step 2 – Annotate the provenance graph with impact values
Step 3 – Execute the assessment function
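The three steps above can be sketched as a pipeline (hypothetical function names; the graph generator, annotator, and assessment function are supplied per scenario and per IQ criterion):

```python
def assess(data_item, generate_graph, annotate, assessment_function):
    """Provenance-based IQ assessment in three steps."""
    graph = generate_graph(data_item)          # Step 1: provenance graph
    impact_values = annotate(graph)            # Step 2: impact values
    return assessment_function(impact_values)  # Step 3: IQ score

# Toy usage: the "graph" is the item itself, impact values are read off it,
# and the assessment function averages them.
score = assess(
    {"impacts": [0.5, 1.0]},
    generate_graph=lambda item: item,
    annotate=lambda graph: graph["impacts"],
    assessment_function=lambda values: sum(values) / len(values),
)
```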
18. Outline
1. Web Data Provenance
2. General Assessment Approach
3. Development of Assessment Methods
19. Designing Assessment Methods
● Developing the general approach into an actual method
● Fundamental design question:
For which IQ criterion do we want to apply the method?
20. Designing Assessment Methods
● Developing the general approach into an actual method
● Fundamental design question:
For which IQ criterion do we want to apply the method?
● Timeliness: degree to which the data item is up-to-date with respect to the task at hand
● Representation* as an absolute measure in [0,1]
● 1 – meeting the most strict timeliness standards
● 0 – unacceptable
*Following Ballou et al., 1998
21. 1 Generate the Provenance Graph
What types of provenance elements are necessary?
What level of detail (i.e. granularity) is necessary?
Where and how do we get provenance information?
● Two complementary options:
● Recording
● Analyzing metadata
22. 1 Generate the Provenance Graph
Example:
● Sensors (e.g. sensor1) take a measurement (e.g. msr) every hour
● All msr stored in a Web-accessible storage device (store)
● Our system (sys) accesses them for further processing
● sys assesses the timeliness of all msr
23. 1 Generate the Provenance Graph
Example:
● Sensors (e.g. sensor1) take a measurement (e.g. msr) every hour
● All msr stored in a Web-accessible storage device (store)
● Our system (sys) accesses them for further processing
● sys assesses the timeliness of all msr
[Graph: msr (type: Data Item) created by cExc (type: Data Creation, Execution Time: 10:00), which was performed by sensor1 (type: Data Creator); msr is contained by doc (type: Document), retrieved by aExc (type: Data Access, Execution Time: 10:13), which was performed by sys (type: Data Accessor) and accessed store (type: Data Providing Service).]
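The example graph can be written down as a set of typed, annotated nodes and labelled edges (a sketch; the identifiers follow the slide):

```python
from datetime import time

# Provenance elements of the sensor example, keyed by identifier.
nodes = {
    "msr":     {"type": "Data Item"},
    "cExc":    {"type": "Data Creation", "execution_time": time(10, 0)},
    "sensor1": {"type": "Data Creator"},
    "doc":     {"type": "Document"},
    "store":   {"type": "Data Providing Service"},
    "aExc":    {"type": "Data Access", "execution_time": time(10, 13)},
    "sys":     {"type": "Data Accessor"},
}

# Labelled edges relating the elements, as in the graph above.
edges = [
    ("msr",  "created by",   "cExc"),
    ("cExc", "performed by", "sensor1"),
    ("msr",  "contained by", "doc"),
    ("doc",  "retrieved by", "aExc"),
    ("aExc", "accessed",     "store"),
    ("aExc", "performed by", "sys"),
]
```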
24. 2 Annotation with Impact Values
How might each provenance element influence the IQ criterion?
● Systematically analyze each type of provenance elements
What kind of impact values are necessary?
How do we represent the influences by impact values?
● Impact values not necessarily numerical
● Depends on the assessment function in step 3
How do we determine impact values?
25. Determining Impact Values
● From the provenance information
● From user input
● Configuration options
● Rating-based, Reputation-based
● By content analysis
● Comparison (e.g. outlier detection)
● Adoption of information retrieval methods
● Adoption of data cleansing techniques
● By context analysis
● Further metadata
● Domain knowledge
26. 2 Annotation with Impact Values
How might each provenance element influence the IQ criterion?
Data Creation Dimension:
Prov. Element Type     Impact Values
Data Creation          creation time, weights
Creation Guidelines    –
(Source) Data Item     expiry time
Data Creator           –
27. 2 Annotation with Impact Values
[The example graph from slide 23 again, shown together with the impact-value table for the data creation dimension; no impact values attached yet.]
28. 2 Annotation with Impact Values
[The example graph, now annotated with the impact value creation time: 10:00 at the Data Creation element cExc; the impact-value table is repeated.]
29. 2 Annotation with Impact Values
[The example graph, now annotated with creation time: 10:00 and expiry time: 11:00; the impact-value table is repeated.]
30. 3 Assessment Function
How do we represent the IQ criterion by an IQ score?
What does the assessment function look like?
● Develop the function together with the impact values
● Take incompleteness into consideration
● Provenance graphs could be fragmentary
● Annotations could be missing
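One way to take incompleteness into consideration is to build defaults into the assessment function. This sketch assumes a hypothetical policy in which a missing expiry-time annotation falls back to a task-specific default lifetime (the one-hour default is purely illustrative):

```python
from datetime import datetime, timedelta

# Assumed fallback lifetime for items whose expiry annotation is missing;
# a real method would make this a task-specific configuration option.
DEFAULT_LIFETIME = timedelta(hours=1)

def timeliness(creation_time, assessment_time, expiry_time=None):
    """Timeliness score in [0, 1], tolerating a missing expiry annotation."""
    if expiry_time is None:                      # annotation may be missing
        expiry_time = creation_time + DEFAULT_LIFETIME
    age = (assessment_time - creation_time).total_seconds()
    lifetime = (expiry_time - creation_time).total_seconds()
    return max(0.0, min(1.0, 1.0 - age / lifetime))

# Expiry missing: fall back to the one-hour default lifetime.
t = timeliness(datetime(2011, 6, 1, 10, 0), datetime(2011, 6, 1, 10, 15))
```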
31. Step 3 – Assessment Function
32. Step 3 – Assessment Function
[The fully annotated example graph (creation time: 10:00, expiry time: 11:00, access Execution Time: 10:13), to which the assessment function is applied.]
34. Step 3 – Assessment Function
t(msr) = 1 – (10:15 – 10:00) / (11:00 – 10:00)
       = 1 – 0.25 h / 1 h
       = 0.75
[The annotated example graph again, with the timeliness score computed from its impact values.]
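The calculation above can be reproduced directly, using the times from the example: created at 10:00, expires at 11:00, assessed at 10:15 (the calendar date below is arbitrary):

```python
from datetime import datetime

def timeliness(creation, expiry, assessed):
    """t = 1 - age / lifetime, as in the calculation above."""
    age = (assessed - creation).total_seconds()
    lifetime = (expiry - creation).total_seconds()
    return 1.0 - age / lifetime

# Times from the example slide.
t_msr = timeliness(
    creation=datetime(2011, 6, 1, 10, 0),
    expiry=datetime(2011, 6, 1, 11, 0),
    assessed=datetime(2011, 6, 1, 10, 15),
)
```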
35. Conclusion
● Web Data Provenance (data creation + data access)
● General approach for provenance-based IQ assessment
● Impact values: influence of provenance elements on IQ
● Design decisions for actual assessment methods
● Application to timeliness (more in the paper)
● Future work:
● How do we deal with incompleteness?
● Application of the approach to other IQ criteria
36. These slides have been created by
Olaf Hartig
http://olafhartig.de
This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)
Attribution:
● http://www.flickr.com/photos/rrrrred/3809362767/
● http://www.hasslefreeclipart.com