This document describes a quality assurance workflow authoring tool for citizen science and crowd-sourced data. The tool aims to integrate authoritative and crowd-sourced data by bringing together a structured, standards-based institutional approach with a citizen-focused, timely crowd-sourced approach. The tool uses a BPMN-based workflow to chain OGC Web Processing Services for quality control processes. This allows stakeholders to design customizable QA workflows by selecting from a repository of generic quality control processes.
COBWEB: A quality assurance workflow authoring tool for citizen science and crowdsourced data
1. A Quality Assurance Workflow Authoring Tool for citizen science and crowd-sourced data
Didier Leibovici,
Julian Rosser, Mike Jackson and the COBWEB project
Nottingham Geospatial Institute
University of Nottingham, UK
2. • Aim is to bring together a precise, structured, top-down and formal standards-based institutional approach with the low-cost, relevant, rich and timely citizen-focussed approach of the crowd, which nonetheless has shortcomings in completeness, precision, interoperability and often minimal direction.
• Not straightforward - the two perspectives of what constitutes useful, QA'd, fit-for-use data are very different.
Research Objective - to integrate (with QA) authoritative and crowd-sourced data
3. Crowd Sourcing vs Authoritative Government Data - a clash of paradigms and market dynamics:
• Non-systematic, incomplete coverage vs systematic, comprehensive coverage
• Near real-time, ongoing data collection allowing trend analysis vs historic, snapshot map data
• Free, uncalibrated data, often high-resolution and up-to-the-minute vs quality-assured, expensive data
• Unstructured, mass consumer-driven metadata and mash-ups vs structured, defined metadata, often in rigid ontologies
• Unconstrained capture and distribution from ubiquitous mobile devices vs controlled licensing, access policies and digital rights
• Simple consumer-driven web services for data collection and processing vs complex institutional survey and GIS applications
Jackson, M.J., Rahemtulla, H. and Morley, J. (2010). "The Synergistic Use of Authenticated and Crowd-Sourced Data for Emergency Response", Proc. 2nd Int. Workshop on Validation of Geo-Information Products for Crisis Management (VALgEO), 11-13 October 2010, Ispra, Italy, pp. 91-99. http://globesec.jrc.ec.europa.eu/workshops/valgeo-2010/proceedings
5. Aspects of Quality
When considering the use of crowd-sourced GI data we need to quality assure it from:
1. A Spatial (geometric) perspective
2. A Thematic (domain attribution) perspective
3. A Temporal (time-related attribution) perspective
And in terms of data quality "Elements" we have to consider:
• Completeness - by area, by class
• Consistency - e.g. topological, semantic, temporal
• Accuracy - relative, absolute
• Usability - fitness for purpose for a particular application or requirement
6. Solution adopted (i)
• "Internal" quality metrics <completeness, positional accuracy, consistency, etc.> defined by ISO 19157
• "External" consumer quality metrics <fitness for purpose> based on GeoViQua [www.geoviqua.org]
• Stakeholder model QA <data collector's judgement, trust, reliability> [Meek et al 2014]
7. Metadata on Data Quality: three models
• ISO 19157 (producer model), where DQ_Scope will be "feature":
DQ_Usability
DQ_Completeness: DQ_CompletenessCommission, DQ_CompletenessOmission
DQ_ThematicAccuracy: DQ_ThematicClassificationCorrectness, DQ_NonQuantitativeAttributeAccuracy, DQ_QuantitativeAttributeAccuracy
DQ_LogicalConsistency: DQ_ConceptualConsistency, DQ_DomainConsistency, DQ_FormatConsistency, DQ_TopologicalConsistency
DQ_TemporalAccuracy: DQ_AccuracyOfATimeMeasurement, DQ_TemporalConsistency, DQ_TemporalValidity
DQ_PositionalAccuracy: DQ_AbsoluteExternalPositionalAccuracy, DQ_GriddedDataPositionalAccuracy, DQ_RelativeInternalPositionalAccuracy
• Simplified GeoViQua model (consumer model), where DQ_Scope will be "external data":
GVQ_PositiveFeedback, GVQ_NegativeFeedback
• COBWEB Stakeholder Quality Model, where DQ_Scope will be "volunteer":
CSQ_Vagueness, CSQ_Ambiguity, CSQ_Judgement, CSQ_Reliability, CSQ_Validity, CSQ_Trust, CSQ_NoContribution
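The element names above come from the three models on this slide; each model qualifies a different DQ_Scope. A minimal sketch of how a mixed bag of quality elements could be bundled per scope (the prefix-to-scope mapping follows the slide; the record layout itself is illustrative, not the COBWEB schema):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class QualityElement:
    name: str    # e.g. "DQ_Usability", "GVQ_PositiveFeedback", "CSQ_Trust"
    value: float # illustrative score in [0, 1]

# Prefix -> DQ_Scope mapping taken from the slide: ISO 19157 elements
# qualify the feature, GeoViQua feedback the external data, and the
# COBWEB stakeholder model the volunteer.
SCOPE_BY_PREFIX = {"DQ_": "feature", "GVQ_": "external data", "CSQ_": "volunteer"}

def scope_of(element: QualityElement) -> str:
    """Derive the DQ_Scope of an element from its model prefix."""
    for prefix, scope in SCOPE_BY_PREFIX.items():
        if element.name.startswith(prefix):
            return scope
    raise ValueError(f"unknown quality model for {element.name}")

def group_by_scope(elements):
    """Bundle a mixed list of quality elements into per-scope metadata."""
    grouped = defaultdict(list)
    for el in elements:
        grouped[scope_of(el)].append(el)
    return dict(grouped)
```

This mirrors how one observation can carry metadata about the feature, the external data used, and the volunteer at the same time.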
8. Solution adopted (ii)
• OGC WPS standard, which allows access to a repository of processes and services from compliant clients
• A key aspect of the standard is the provision to chain disparate processes and services to form a reusable workflow
• Use of BPMN rather than BPEL for the workflow engine - it excels in modelling processes visually, allowing non-domain experts to communicate and mutually understand their models
• Configurable workflows - stakeholders are able to design a solution to fit their use case from a generic set of WPS processes
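The chaining idea above can be sketched in a few lines: each generic QC is one task, and the workflow is just an ordered composition of them. This is a toy in-memory sketch (the two QC functions and their thresholds are invented for illustration; in COBWEB each task is a remote WPS process orchestrated by BPMN):

```python
from typing import Callable, Dict, List

# An observation's quality metadata: element name -> value in [0, 1].
Quality = Dict[str, float]
# A generic QC process: takes (observation, quality) and returns updated quality.
QCProcess = Callable[[dict, Quality], Quality]

def run_workflow(observation: dict, steps: List[QCProcess]) -> Quality:
    """Chain QC processes in order, each updating the quality metadata,
    mirroring how the BPMN workflow chains WPS tasks."""
    quality: Quality = {}
    for step in steps:
        quality = step(observation, quality)
    return quality

# Two toy QCs standing in for WPS processes in the repository:
def position_check(obs, q):
    q = dict(q)
    q["DQ_PositionalAccuracy"] = 1.0 if obs.get("gps_error_m", 99) < 10 else 0.3
    return q

def photo_check(obs, q):
    q = dict(q)
    q["DQ_Usability"] = 0.9 if obs.get("has_photo") else 0.4
    return q
```

A stakeholder "configures" a workflow simply by choosing and ordering the steps list, e.g. `run_workflow(obs, [position_check, photo_check])`.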
9. Solution adopted (iii)
• GitHub used for the code repository and open-source evolution of the solution
• Built on open-source implementations of WPS and client libraries (52°North); the BPMN implementation is jBPM, maintained by JBoss; WPS runs on Apache Tomcat; jBPM is deployed on JBoss WildFly
• Full details in "A BPMN solution for chaining OGC services to quality assure location-based crowd-sourced data", Meek, Jackson, Leibovici (2015), submitted to Computers and Geosciences
Mike Jackson, 4-5 Nov., 2015, China
10. The COBWEB QAQC: the 7+ pillars of Quality Controls (QC)
The 7 pillars of QC and the "7+" cross-pillar QC
11.
12. QAQC: workflow of QCs as WPS
• Workflow authoring tool (QAwAT) - BPMN encoding
• Composition support (QAwOnt) - SKOS encoding
• Repository of QCs as WPS (QAwWPS)
14. Qualifying the Observations, the Volunteers and the Authoritative data
Quality elements generated and evolving
QC examples - example of a QA workflow
Design and composition using a graphical tool
15. QC examples - QAQC workflow Authoring Tool (QAwAT)
Design and composition in Eclipse
Design and composition in the jBPM web editor
16. Some results on the Japanese knotweed co-design
[Figure: density plot of DQ_ClassificationCorrectness and DQ_Usability before QA - all values at 0.5, plus an artificial s.d. of 0.0001]
18. Rosser J, Pourabdollah A, Brackin R, Jackson MJ, Leibovici DG (2016) Full Meta Objects for Flexible Geoprocessing Workflows: profiling WPS or BPMN? 19th AGILE Conference, 14-17 June 2016, Helsinki, Finland
Leibovici DG, Williams J, Rosser JF, Hodges C, Scott D, Chapman C, Higgins C, Jackson MJ (2016) The COBWEB Quality Assurance System in Practice: Example for an Invasive Species Study. ECSA Conference, 19-21 May 2016, Berlin, Germany
Meek S, Jackson MJ, Leibovici DG (2016) A BPMN solution for chaining OGC services to quality assure location-based crowdsourced data. Computers & Geosciences, 87: 76-83
Leibovici DG, Meek S, Rosser J, Jackson MJ (2015) DQ in the citizen science project COBWEB: extending the standards. Data Quality DWG, OGC/TC Nottingham, September 2015, UK
Leibovici DG, Evans B, Hodges C, Wiemann S, Meek S, Rosser J, Jackson MJ (2015) On Data Quality Assurance and Conflation Entanglement in Crowdsourcing for Environmental Studies. ISSDQ 2015 - The 9th International Symposium on Spatial Data Quality, 29-30 September 2015, La Grande Motte, France
Meek S, Jackson MJ, Leibovici DG (2014) A flexible framework for assessing the quality of crowdsourced data. AGILE Conference, 3-6 June 2014, Castellon, Spain
Leibovici DG, Jackson MJ (2013) Copula metadata est. AGILE Conference, 14-17 May 2013, Leuven, Belgium
Leibovici DG, Pourabdollah A, Jackson MJ (2013) Which Spatial Data Quality can be meta-propagated? Journal of Spatial Sciences, 58(1): 3-14
Leibovici DG, Pourabdollah A, Jackson M (2011) Meta-propagation of Uncertainties for Scientific Workflow Management in Interoperable Spatial Data Infrastructures. EGU 2011, European Geosciences Union General Assembly, Vienna, Austria, April 2011
Pawlowicz S, Leibovici DG, Haines-Young R, Saull R, Jackson M (2011) Dynamical Surveying Adjustments for Crowd-sourced Data Observations. EnviroInfo 2011, Ispra, Italy
Leibovici DG, Pourabdollah A (2010) Workflow Uncertainty using a Metamodel Framework and Metadata for Data and Processes. OGC TC/PC Meetings, 20-24 September 2010, Toulouse, France
Jackson MJ, Rahemtulla H, Morley J (2010) The synergistic use of authenticated and crowd-sourced data for emergency response. International Workshop on Validation of Geo-Information Products for Crisis Management (VALgEO), Ispra, Italy, pp 91-99
19. Quality Assurance workflow Authoring Tool
(QAwAT)
Didier G. Leibovici,
Julian Rosser, Mike Jackson and the COBWEB project
Nottingham Geospatial Institute
University of Nottingham, UK
Email: firstname.secondname@nottingham.ac.uk
Thank you!
Editor's Notes
1/ QAwAT is the quality assurance tool designed by the University of Nottingham within the FP7 project COBWEB.
It is used for citizen science by the stakeholder who designed the survey campaign using the data capture tool.
2/ Quality assurance via the QAQC processing comes during or after the data capture of observations from each volunteering citizen, and aims at producing metadata on data quality for the captured observation.
4/ In addition to the ISO standard for quality (the producer model), the QAQC uses a quality model to qualify the volunteers (the stakeholder model), as well as a very simplified version of the consumer model designed by GeoViQua.
The stakeholder quality model evaluates or calibrates the properties of the volunteer, seen as a sensor, using dimensions related to accuracy, consistency and trust.
3/ The COBWEB platform and the Quality Assurance tool are based on interoperability standards for geoprocessing, workflow and quality information encoding
5/ Each Quality Control produces or updates a number of quality elements from the three models, and is seen as a single task within the workflow evaluating the quality from different types of controls: the pillars.
The whole QA workflow will be composed of a series of QCs belonging to these 7 pillars.
The "7+" pillar deals with security and privacy when necessary for a specific QC that would otherwise belong to one of the 7 pillars.
6/ The categorisation of the QCs into 7 pillars is to help in the development of geoprocesses and in the composition of the workflow.
The 7 pillars represent the top of an ontology of the QCs for VGI, citizen science and crowdsourcing.
A similar QC can exist in different pillars, but the quality elements generated, or the rules used to assign their values, will differ due to the semantics of the pillars.
7/ Any code wrapped up into a WPS is registered as part of a particular pillar, and the workflow web editor is the tool used to compose the workflow.
In its algorithm, each QC has a processing or geoprocessing part followed by a logic-reasoning part that depends on the results of the first part and on the semantics attached to it and to the pillar description.
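That two-phase structure of a QC (a measurement, then a reasoning step interpreting it) can be sketched as follows; the distance measure, thresholds and verdict labels here are invented for illustration:

```python
from typing import Tuple

def qc_task(observation: dict) -> Tuple[dict, str]:
    """A QC in two phases, as described in the note: a (geo)processing
    part that computes a measure, then a logic-reasoning part that
    interprets it according to the pillar's semantics."""
    # Phase 1: (geo)processing - here a toy 1-D distance from the
    # observation to its intended target position.
    distance_m = abs(observation["x"] - observation["target_x"])
    # Phase 2: reasoning - map the measure to a quality statement.
    if distance_m < 10:
        verdict = "DQ_TopologicalConsistency: high"
    elif distance_m < 50:
        verdict = "DQ_TopologicalConsistency: medium"
    else:
        verdict = "DQ_TopologicalConsistency: low"
    return observation, verdict
```

Separating the two phases is what lets the same geoprocess be reused in different pillars with different reasoning rules.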
8/ From the list of QCs in each pillar, any tool to compose a BPMN workflow can be used. The conceptual aspect of the workflow, captured by the graphical analytics of the BPMN, can be annotated and shared among the different stakeholders. QAwAT is used to compose and execute the workflow using the workflow engine linked to the WPS.
9/ As an example, we have here a Quality Control in pillar 1 helping to assess the position of the aimed point when taking a photo with a smartphone.
Besides reporting the observation point as being the line-of-sight (LoS) point, one can also, from the distance to that point and the position of the observer, estimate a Topological Consistency, checking that the volunteer can properly identify what he/she is observing.
From the uncertainties of the phone's GPS, of the DEM and of the bearing parameters, the LoS point, the distance to it and its uncertainty can be computed; then the probability, under a Normal distribution, that this distance is less than a stakeholder-given threshold for a reasonable observation distance.
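The final probability in that note is a straightforward Normal CDF evaluation. A minimal sketch, assuming the computed LoS distance and its standard deviation are given (the function and parameter names are illustrative, not COBWEB's):

```python
import math

def prob_within_threshold(dist_m: float, sigma_m: float, threshold_m: float) -> float:
    """P(D <= threshold) when the observer-to-LoS-point distance D is
    modelled as Normal(dist_m, sigma_m); usable as a [0, 1] quality value.
    Uses the closed form of the Normal CDF via the error function."""
    z = (threshold_m - dist_m) / sigma_m
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For instance, a 40 m estimated distance with 15 m uncertainty against a 50 m threshold gives a probability of roughly 0.75; when the estimated distance equals the threshold the probability is exactly 0.5.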
10/ We used the jBPM suite, which has a workflow engine and an editor, either within Eclipse or as a web editor.
The workflow engine has been modified to accept WPS services as tasks.
11/ Using the online editor, one can drag and drop the tasks; then the input URLs and output names are filled in.
The QA workflow can be run from the web interface or later from a WPS interface once the whole workflow is wrapped into a process of the WPS.
12/ The QAQC starts by postulating 50% quality uncertainty for Classification Correctness (for example)
13/ The quality elements evolve through the workflow as the different QCs update their values.
The results therefore depend on the choice of QCs and the set of parameters used in that workflow.
The BPMN of the workflow is stored as a metaquality element, thereby encoding the provenance of the metadata values on data quality.
This is a qualifying step, not a validating or verifying step: it indicates how uncertain we are about each observation, considering the rules put in the QA workflow.
Low and high quality values are less uncertain - here, of being a true Japanese knotweed (JKW) observation or not.
After verification using the ground truth, even though the tendency is that the QAQC helps to identify correct observations, you can see that some high quality values were given to wrong JKW observations and vice versa.
Quite a number of the citizens' data still have uncertainty attached, i.e. still around 50%.
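The notion of a quality value starting at 50% and being pushed away from 0.5 by successive QCs can be sketched with a simple odds-style update (an illustrative combination rule, not the documented COBWEB one):

```python
def update_quality(prior: float, evidence: float) -> float:
    """Combine the current quality value with a QC's evidence in (0, 1)
    using a Bayesian-style odds update: values above 0.5 support the
    observation, values below 0.5 count against it."""
    num = prior * evidence
    return num / (num + (1.0 - prior) * (1.0 - evidence))

# Start at 50% uncertainty for DQ_ThematicClassificationCorrectness,
# then let two supporting QCs update it:
q = 0.5
for ev in (0.8, 0.7):
    q = update_quality(q, ev)
```

An observation whose QCs all return evidence near 0.5 stays near 0.5, which is exactly the "still around 50% uncertainty" situation described above.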
14/ ... and here are the results for Snowdonia National Park and the initial survey for Japanese knotweed performed during May-July 2015, with some of the data sources used in the pillars as well.
In summary, the observations close to managed lands or woods but not to the EO-identified areas at risk of having JKW have a lower quality (higher uncertainty).
15/ Some of the research on data quality and workflows at the Nottingham Geospatial Institute