Data Quality
Standards and Application to Open Data
February 21, 2018 – Brunel University, UK
Marco Torchiano
marco.torchiano@polito.it
Version 1.1.0
© Marco Torchiano, 2018
About me
 Marco Torchiano
 Associate Professor, Politecnico di Torino
 Senior Member IEEE
 Faculty Fellow – Nexa Center for Internet
and Society
 Member UNI CT504–Software Engineering
 Contacts:
– mailto:marco.torchiano@polito.it
– http://softeng.polito.it/torchiano/
– Twitter: @mtorchiano
Current Research Interests
 Mobile UI Automated Testing
 PhD student working on fragility
 (Open)Data Quality
 PhD student working on KB quality
 Software Energy Consumption
 Several collaborations
 Also: MDD, Survey methodology, code
obfuscation, SE education, …
Acknowledgments
 Antonio Vetrò
 The counterpart for
this line of research
 Many other people
 L.Canova, R.Iemma, F.Iuliano, F.Morando,
C.Orozco Minotas, G.Procaccianti,
R.Rashid
OPEN DATA QUALITY
Open Coesione
 portal about the fulfilment of
investments using the 2007-2013
European Cohesion funds
 Interactive Interface
 Downloadable .csv datasets
 ~100 billion Euros are being tracked,
~100K projects
 http://www.opencoesione.gov.it/
Errors in data
* extraction, transformation, and loading
Accuracy
» Refer always to raw data
» If not possible, estimate accuracy on analysis (e.g., about 5% in the example above)
Missing data
Outliers
» Outliers can point to interesting facts
» … or to something which deserves a second look
Value
pcvc = percentage of cells with correct value
ISO DATA QUALITY
STANDARDS
ISO - SQuaRE
Family of standards
 2500x Quality Management
 2501x Quality Model
 2502x Quality Measurement
 2503x Quality Requirements
 2504x Quality Evaluation
ISO SQuaRE
 Internal Quality
 Values, formats, relation
 External Quality
 Technological environment
 Quality in Use
 Context of use of the data user
ISO 25012
Data Quality Model
Roles
 Data Quality evaluator
 Data Producer
 Data Acquirer
 Data User
Data evaluator
 Defines/adapts a quality model
 Evaluate and act
 Data correction
 Technological adjustments
 Organizational measures
Model structure
 Characteristic
 Main aspects, e.g., usability
 Sub-Characteristic (optional)
 A detailed aspect of a characteristic, e.g.
Understandability
 Metric
 A set of rules to assign and interpret a
(numerical) evaluation to a specific (sub)-
characteristic
Characteristics
 Accuracy
 Completeness
 Consistency
 Credibility
 Currentness
 Accessibility
 Compliance
 Confidentiality
 Efficiency
 Precision
 Traceability
 Understandability
 Availability
 Portability
 Recoverability
Characteristics
 Accuracy
 Correspondence between data and reality
(syntactic and semantic)
 Completeness
 Computer: presence of all necessary
values
 User: how much the data is able to satisfy
the needs
 Consistency
 Absence of contradictions in the data
Characteristics
 Credibility
 The extent to which data are regarded as
true and credible by users
 Currentness
 the extent to which data is up-to-date
 Accessibility
 The capability of data to be accessed,
particularly by people who need
supporting technology or special
configuration because of some disability
Characteristics
 Regulatory compliance
 The capability of data to adhere to standards,
conventions or regulations in force and similar
rules relating to data quality
 Confidentiality
 The capability of the data to be accessed and
interpreted only by authorized users
 Efficiency
 The capability of data to be processed (accessed,
acquired, updated, etc) and to provide
appropriate levels of performance using the
appropriate amounts and types of resources
under stated conditions
Characteristics
 Precision
 Capability of the value assigned to an
attribute to provide the degree of
information needed in a stated context of
use
 Traceability
 Presence of attributes providing an audit trail
of access and changes made to data
 Understandability
 The extent to which data can be read and
interpreted by users
Characteristics
 Availability
 The capability of data to be always
retrievable.
 Recoverability
 The capability to preserve a specified level of
operations and its physical and logical
integrity, even in the event of failure
 Portability
 The capability of data to be moved to
another platform preserving quality
Perspectives
 Inherent: facts (data)
 Accuracy, Completeness, Consistency, Credibility, Currentness
 Both inherent and system dependent
 Accessibility and Understandability (HCI support)
 Compliance, Confidentiality, Efficiency, Precision, Traceability
 System dependent: artefacts (D+Hw+Sw+Sys)
 Availability, Portability, Recoverability
ISO 25024
Measurement of Data Quality
Relationships among standards
 Quality models: ISO/IEC 25010 (System and Software Product Quality), ISO/IEC 25012 (Data Quality)
 composed of quality characteristics, in turn composed of quality sub-characteristics
 Quality measures: ISO/IEC 25022, 25023, 25024
 composed of quality measure elements (QME), combined through a measurement function
 Quality measure elements (QME): ISO/IEC 25021
 defined by a measurement method applied to a property to quantify of a target entity
Source: ISO/IEC 25024
Data Life Cycle: examples
 Data design
 Data collection
 Data integration
 External data acquisition
 Data processing
 Presentation
 Other use
 Data store
 Delete
Source: ISO/IEC 25024
Data design: target entities
 Architecture
 Contextual schema
 Data models (conceptual, logical,
physical)
 Data dictionary
 Document
Data design: properties
 Attribute
 Element
 Information
 Metadata
 Vocabulary
Other stages: target entities
 Data file
 DBMS
 RDBMS
 Form
 Presentation device
Properties
 Data format
 Data item
 Data value
 Information item
 Information item content
 Data record
Metrics definition
a) ID: abbreviated code of the quality characteristic + (I/D) + serial number;
b) Name: QM name related to data;
c) Description
d) Measurement function: formula showing how the QMEs
are combined to produce the QM;
e) DLC, Target entities, Properties: DLC includes stages of
the DLC where the data QMEs are applicable, target
entities and properties of target entities;
f) Note: in the note, additional information such as an
acceptable range of values, reference to other standards,
explanations or interpretation or criteria, measurement
method used to obtain the
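The ID rule in item a) can be illustrated with a small parser. This is only a sketch: the three-letter code, the reading of I/D as inherent / system dependent (as in the Perspectives slide), and the names `MeasureId` and `parse_measure_id` are assumptions made for illustration, not part of the standard's text.

```python
import re
from dataclasses import dataclass

# Assumed layout, read off the IDs used in these slides (e.g. "Acc-I-1"):
# <three-letter characteristic code>-<I or D>-<serial number>.
_ID_PATTERN = re.compile(r"([A-Z][a-z]{2})-([ID])-(\d+)")


@dataclass(frozen=True)
class MeasureId:
    characteristic: str  # abbreviated characteristic code, e.g. "Acc"
    perspective: str     # "I" (inherent) or "D" (system dependent), assumed meaning
    serial: int          # serial number within the characteristic


def parse_measure_id(text: str) -> MeasureId:
    """Split a quality-measure ID such as 'Tra-D-2' into its three parts."""
    match = _ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a quality-measure ID: {text!r}")
    code, perspective, serial = match.groups()
    return MeasureId(code, perspective, int(serial))


print(parse_measure_id("Acc-I-1"))
print(parse_measure_id("Tra-D-2"))
```

Adapted IDs such as "~Cur-I-1" (an approximation of a standard measure) are deliberately rejected by this sketch; a real tool would need its own convention for them.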
ACCURACY (Acc-I-1)
Copyright: ISO/IEC 25024
CASE STUDIES
Open Government Data
Open Government Data
OD: open data, data that can be
 Used
 Reused
 Redistributed
 By anyone and with any goal
G: government data, produced or commissioned
by a government or by an institutional
entity controlled by the government
http://opengovernmentdata.org
Why OGD ?
 Transparency
 Social and commercial value
 Participation
Case 1: Open Coesione
 Published data
 Structured
 Open data format
OpenCoesione
Statistical data from municipalities
 Residents
 Weddings
 Commercial activities
Datasets analyzed
Orchestrated disclosure
● Open Coesione
● portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
● 85 billion Euros are being tracked, 850K projects
Decentralized disclosure
● Municipalities: Torino, Roma, Milano, Firenze, Bologna
● Residents dataset: available for all 5 municipalities
● Weddings dataset: available for 3 of the 5 municipalities
● Business Activities dataset: available for 3 of the 5 municipalities
Open Coesione
Measures
Characteristic      Description                                       ISO name
Completeness        Percentage of complete cells                      Com-I-1 (cell)
                    Percentage of complete rows                       Com-I-1 (row)
Accuracy            Percentage of syntactically accurate cells        Acc-I-1
Traceability        Track of creation                                 Tra-D-2 (c)
                    Track of update                                   Tra-D-2 (u)
Currentness         Percentage of current rows                        Cur-I-2
                    Delay in publication                              ~Cur-I-1
Compliance          e-GMS compliance                                  Cmp-D-1
                    Five stars open data                              Cmp-D-1
Understandability   Percentage of columns with metadata               Und-I-3
                    Percentage of columns in comprehensible format    Und-I-4
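The first three measures in the table (complete cells, complete rows, syntactically accurate cells) can be sketched on a tiny CSV. The sample rows, the column validators and the choice to skip empty cells when computing accuracy are assumptions for illustration; they are not taken from the Open Coesione datasets or from the study.

```python
import csv
import io
import re
from datetime import datetime

# Invented sample data, used only to exercise the functions below.
SAMPLE = """project_id,funding_eur,start_date
P001,1200.50,2010-03-01
P002,,2011-07-15
P003,abc,2012-13-40
"""


def is_complete(cell):
    return cell.strip() != ""


def cell_completeness(rows):
    """Share of non-empty cells (in the spirit of Com-I-1, cell level)."""
    cells = [cell for row in rows for cell in row]
    return sum(is_complete(c) for c in cells) / len(cells)


def row_completeness(rows):
    """Share of rows with no empty cell (in the spirit of Com-I-1, row level)."""
    return sum(all(is_complete(c) for c in row) for row in rows) / len(rows)


def syntactic_accuracy(rows, validators):
    """Share of non-empty cells accepted by their column validator (Acc-I-1 style)."""
    checked = correct = 0
    for row in rows:
        for cell, is_valid in zip(row, validators):
            if is_complete(cell):  # assumption: empty cells count against completeness only
                checked += 1
                correct += is_valid(cell)
    return correct / checked


def is_project_id(text):
    return re.fullmatch(r"P\d{3}", text) is not None


def is_amount(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def is_iso_date(text):
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


ROWS = list(csv.reader(io.StringIO(SAMPLE)))[1:]  # drop the header row
VALIDATORS = [is_project_id, is_amount, is_iso_date]

print("complete cells:", round(cell_completeness(ROWS), 3))
print("complete rows: ", round(row_completeness(ROWS), 3))
print("accurate cells:", round(syntactic_accuracy(ROWS, VALIDATORS), 3))
```

Keeping the validators per column makes the domain rules explicit, which is where most of the adaptation effort goes when the measures are applied to a new dataset.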
e-GMS: UK e-Government Metadata Standard
1. Accessibility (mandatory if applicable)
2. Addressee (optional)
3. Aggregation (optional)
4. Audience (optional)
5. Contributor (optional)
6. Coverage (recommended)
7. Creator (mandatory)
8. Date (mandatory)
9. Description (optional)
10. Digital signature (optional)
11. Disposal (optional)
12. Format (optional)
13. Identifier (mandatory if applicable)
14. Language (recommended)
15. Location (optional)
16. Mandate (optional)
17. Preservation (optional)
18. Publisher (mandatory if applicable)
19. Relation (optional)
20. Rights (optional)
21. Source (optional)
22. Status (optional)
23. Subject (mandatory)
24. Title (mandatory)
25. Type (optional)
https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
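One possible way to turn the list above into a compliance score is to count how many of the required elements a dataset's metadata actually fills in. This is a sketch under assumptions: the scoring rule, the function name `egms_compliance` and the sample record are invented for illustration, and the slides do not say how Cmp-D-1 was computed in the study.

```python
# Element names and obligation levels are copied from the list above.
MANDATORY = ("Creator", "Date", "Subject", "Title")
MANDATORY_IF_APPLICABLE = ("Accessibility", "Identifier", "Publisher")


def egms_compliance(metadata, applicable=()):
    """Share of required e-GMS elements that have a non-empty value.

    `applicable` names the 'mandatory if applicable' elements that apply
    to this dataset; deciding that is left to the evaluator.
    """
    required = list(MANDATORY)
    required += [name for name in MANDATORY_IF_APPLICABLE if name in applicable]
    present = [name for name in required if str(metadata.get(name, "")).strip()]
    return len(present) / len(required)


# Hypothetical metadata record, not taken from any real portal.
record = {
    "Title": "Residents by district",
    "Creator": "Example municipality",
    "Date": "",
    "Format": "csv",
}

score = egms_compliance(record, applicable=("Publisher",))
print(f"e-GMS compliance: {score:.2f}")
```

Optional and recommended elements are ignored here; weighting them would be a different, equally defensible, adaptation of the measure.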
Results – Open Coesione
[Bar chart, scale 0.00 to 1.00, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1 (two bars), Und-I-3, Und-I-4]
 Null/zero values: domain uncertain
 Track of updates missing
 Missing metadata
 Data not linked
Results – Municipality data
[Bar chart, scale 0 to 1, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1 (two bars), Und-I-3, Und-I-4]
 Discrepancies of values with domain
 No info on updates
 Missing metadata
Findings
 Disclosure strategy implies different data quality
 Centralized vs. decentralized
 Traceability is generally lacking
 Proposal: use software configuration management tools
 Metadata is often missing or incomplete
Case 2: Public Contracts
 Published data
 Structured
 Open format
Data on public contracts pursuant to Art. 37 of the
Transparency Decree, plus ANAC prescriptions
Public contracts
 Transparency Decree (Legislative Decree 14 March 2013, n. 33)
 Public contracts (Art. 37 and Art. 9)
 Open Data Publication
 XML Standard Format (ANAC)
 Selected administrations: Italian universities
Data Structure
XML
METADATA
DATA
LOTS
PARTICIPANTS
WINNER
<lotto>
  <cig>4421574E47</cig>
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <oggetto>
    Procedura di cottimo fiduciario per affidamento servizio di manutenzione e
    assistenza di primo livello stazioni self-service
  </oggetto>
  <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
Quality Evaluation Framework
Intrinsic dimensions
Dimension       Measure
Accuracy        Percentage of elements with correct values
Completeness    Percentage of complete elements
                Percentage of complete aggregate elements
Domain-dependent dimensions
Dimension       Measure
Consistency     Percentage of lots that meet the intrarelational and interrelational integrity constraints
Duplication     Number of duplicates
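The consistency measure can be sketched as the share of lots that satisfy a given integrity constraint. The two constraints below (amount paid not above the awarded amount, winner among the participants) follow the ones discussed for this case study; the `<lotti>` wrapper element, the second lot and all function names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Two lots using the element names of the ANAC format shown earlier.
# The wrapper <lotti> and the second (inconsistent) lot are invented.
SAMPLE = """<lotti>
  <lotto>
    <cig>4421574E47</cig>
    <partecipanti>
      <partecipante><codiceFiscale>06267040019</codiceFiscale></partecipante>
    </partecipanti>
    <aggiudicatari>
      <aggiudicatario><codiceFiscale>06267040019</codiceFiscale></aggiudicatario>
    </aggiudicatari>
    <importoAggiudicazione>7500.00</importoAggiudicazione>
    <importoSommeLiquidate>7500.00</importoSommeLiquidate>
  </lotto>
  <lotto>
    <cig>0000000000</cig>
    <partecipanti/>
    <aggiudicatari>
      <aggiudicatario><codiceFiscale>01234567890</codiceFiscale></aggiudicatario>
    </aggiudicatari>
    <importoAggiudicazione>1000.00</importoAggiudicazione>
    <importoSommeLiquidate>1200.00</importoSommeLiquidate>
  </lotto>
</lotti>"""


def fiscal_codes(lot, path):
    """Collect the non-empty fiscal codes found under `path`."""
    return {e.text.strip() for e in lot.findall(path) if e.text and e.text.strip()}


def paid_le_awarded(lot):
    """Intrarelational constraint: amount paid <= awarded amount.

    Assumes both amounts are present and numeric, as in the sample.
    """
    awarded = Decimal(lot.findtext("importoAggiudicazione"))
    paid = Decimal(lot.findtext("importoSommeLiquidate"))
    return paid <= awarded


def winner_is_participant(lot):
    """Interrelational constraint: every winner is also listed as a participant."""
    participants = fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale")
    winners = fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale")
    return bool(winners) and winners <= participants


def share_satisfying(lots, constraint):
    """Fraction of lots for which the constraint holds."""
    return sum(constraint(lot) for lot in lots) / len(lots)


LOTS = ET.fromstring(SAMPLE).findall("lotto")
print("paid <= awarded:      ", share_satisfying(LOTS, paid_le_awarded))
print("winner is participant:", share_satisfying(LOTS, winner_is_participant))
```

Amounts are parsed with `Decimal` rather than `float` so that comparisons between monetary values are exact.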
Identification of datasets
 First 25 universities in the 2014 overall ranking published by the
newspaper Il Sole 24 Ore.
 Only 12 universities provide summary tables in XML format.
Total number of assessed lots: 123702
Average number of published lots per university: 10308.5
 The remaining 13 universities either do not provide summary tables
or provide them in a format other than XML.
CIG
The University of Torino publishes summary tables with 100% cig
completeness, that is, every lot has the cig element, but about 32%
of the values are out of domain.
[Bar chart, scale 0.60 to 1.00: accuracy and completeness of the cig element for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
Unique Tender Identifier
A lot of “00000000000”.
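A domain check of this kind can be sketched as a pattern test on each cig value. The rule used here (exactly 10 uppercase alphanumeric characters, with all-zero placeholders counted as out of domain) is an assumption made for illustration; the actual domain is defined by the ANAC specification, which should be consulted before reusing this.

```python
import re

# Assumed domain: 10 characters, digits or uppercase letters (e.g. "4421574E47").
CIG_PATTERN = re.compile(r"[0-9A-Z]{10}")


def cig_in_domain(value):
    """Return True if the value looks like a real tender identifier."""
    value = value.strip()
    if value and set(value) == {"0"}:
        return False  # assumption: all-zero placeholders are out of domain
    return CIG_PATTERN.fullmatch(value) is not None


def cig_accuracy(values):
    """Share of non-empty cig values that fall inside the assumed domain."""
    present = [v for v in values if v.strip()]
    return sum(cig_in_domain(v) for v in present) / len(present)


# Hypothetical values: one valid identifier, two placeholders, one malformed code.
sample = ["4421574E47", "0000000000", "00000000000", "4421574e4"]
print("cig accuracy:", cig_accuracy(sample))
```

Completeness and accuracy are kept separate on purpose: a placeholder value makes the element complete while leaving it inaccurate, which is the situation described for the University of Torino.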
Choice of contracting party (sceltaContraente)
The element is present for each lot but it is always empty.
[Bar chart, scale 0.00 to 1.00: accuracy and completeness of the sceltaContraente element for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
Fiscal Code
All the lots published by the University of Milano have a
winner but no information about the participants.
[Bar chart, scale 0.90 to 1.00: accuracy and completeness of the codiceFiscale element for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
Amount paid vs. awarded amount
In 14% of lots the amount paid is greater than the awarded amount.
[Bar chart, scale 0.80 to 1.00: share of lots where the amount paid is less than or equal to the awarded amount, for UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
Final considerations
 ISO standard provides several
predefined measures
 Must be adapted to the case at hand
 Can be aggregated in different ways
 Possibility to define new measures
 ISO standard is intended for
structured data
 What about semantic knowledge bases?
References
 ISO/IEC 25012:2008, Software engineering - Software product
Quality Requirements and Evaluation (SQuaRE) - Data quality model
 ISO/IEC 25024:2015, Software engineering - Software product
Quality Requirements and Evaluation (SQuaRE) - Measurement of
data quality
 Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco
Minotas, Camilo; Iemma, Raimondo; Morando, Federico. “Open
Data Quality Measurement Framework: Definition and
Application to Open Government Data”. Government
Information Quarterly, Vol. 33, pp. 325-337, ISSN 0740-624X
 Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca.
“Preserving the Benefits of Open Government Data by
Measuring and Improving Their Quality: An Empirical Study”. In
IEEE 41st Annual Computer Software and Applications
Conference (COMPSAC 2017)


Editor's Notes

  • #53 Transparency isn’t just about access, it is also about sharing and reuse — often, to understand material it needs to be analyzed and visualized and this requires that the material be open so that it can be freely used and reused.
  • #73 To assess quality we consider two kinds of dimensions: intrinsic dimensions, which do not depend on the type of data, and domain-dependent dimensions. As intrinsic dimensions we evaluate accuracy, computed as the percentage of elements with correct values, and completeness, computed as the percentage of complete elements and the percentage of complete aggregate elements. An element is considered incorrect or incomplete if it does not meet the specification of its domain or the number of occurrences specified in the XML schema. For the domain-dependent dimensions we evaluate consistency by defining a set of integrity constraints that strictly depend on the public contracts domain: for example, the amount paid must be less than or equal to the award amount, and if a public contract has no successful tenderer the amount paid must be equal to zero.
  • #74 To conduct the evaluation we selected the first 25 universities in the 2014 general ranking published by the newspaper Il Sole 24 Ore. Only 12 of them provide summary tables in XML format, for a total of 123702 assessed lots. The remaining 13 universities either do not provide summary tables or provide them in a format other than XML.
  • #75 Accuracy and completeness were computed for all elements, but we show only the most interesting ones, together with some of the integrity constraints defined to assess consistency. The cig is the unique identifier of a lot. The University of Torino has 100% completeness on the cig: the element is present in all analysed lots, but in 32% of cases its value is out of domain.
  • #76 The sceltaContraente element is one of the most important because it specifies the procedure used to select the contractor, and the authorities can use it to detect illegal awards of contracts. High accuracy and completeness improve the transparency of contracts. For the University of Milano the completeness of sceltaContraente is 100% but the percentage of correct elements is 0: the element is present in all the lots provided by the university, but its value is always empty.
  • #77 The codiceFiscale is the unique identifier of a participant. An interesting aspect is that the University of Milano is not classified, because none of the summary tables it provides contains information about the participants.
  • #78 This result is highlighted by two interrelational constraints, "lot has participant" and "successful tenderer is participant". The first computes the percentage of lots that have a successful tenderer and at least one participant, while the second computes the percentage of cells in which the successful tenderer of a lot is a participant in the same lot. In both cases the percentage for the University of Milano is zero because the analysed files contain no information on participants. For the University of Milano-Bicocca the percentage for the "lot has participant" constraint is slightly higher than for "successful tenderer is participant": although some lots have participants, the successful tenderer is not one of them.
  • #79 The first intrarelational constraint computes the percentage of lots in which the amount paid is less than or equal to the award amount. For the University of Bologna, 14% of lots have an amount paid greater than the award amount, which shows that more public money was spent than was awarded. The successfulTenderer_amountPaid constraint computes the percentage of cells in which there is no information about the successful tenderer but the amount paid is different from zero. For 40% of the lots of the University of Bologna there is no information about the successful tenderer, yet money was paid out, so the money cannot be tracked: it is not known who received it.