Data Quality - Standards and Application to Open Data

Data Quality
Standards and Application to Open Data
February 21, 2018 – Brunel University, UK
Marco Torchiano
marco.torchiano@polito.it
Version 1.1.0
© Marco Torchiano, 2018
About me
 Marco Torchiano
 Associate Professor, Politecnico di Torino
 Senior Member IEEE
 Faculty Fellow – Nexa Center for Internet
and Society
 Member UNI CT504–Software Engineering
 Contacts:
– mailto:marco.torchiano@polito.it
– http://softeng.polito.it/torchiano/
– Twitter: @mtorchiano
Current Research Interests
 Mobile UI Automated Testing
 PhD student working on fragility
 (Open)Data Quality
 PhD student working on KB quality
 Software Energy Consumption
 Several collaborations
 Also: MDD, Survey methodology, code
obfuscation, SE education, …
Acknowledgments
 Antonio Vetrò
 The counterpart for
this line of research
 Many other people
 L.Canova, R.Iemma, F.Iuliano, F.Morando,
C.Orozco Minotas, G.Procaccianti,
R.Rashid
OPEN DATA QUALITY
Open Coesione
 portal about the fulfilment of
investments using the 2007-2013
European Cohesion funds
 Interactive Interface
 Downloadable .csv datasets
 ~100 billion Euros are being tracked,
~100K projects
 http://www.opencoesione.gov.it/
Errors in data
[Figure not recoverable; callout: 43 !]
* ETL: extraction, transformation, and loading
Accuracy
» Always refer to raw data
» If that is not possible, estimate accuracy during the analysis (e.g., about 5% in the example above)
Missing data
Outliers
» Outliers can point to interesting facts
» … or to something which deserves a second look
Value
pcvc = percentage of cells with correct value
ISO DATA QUALITY
STANDARDS
19
ISO - SQuaRE
Family of standards
 2500x Quality Management
 2501x Quality Model
 2502x Quality Measurement
 2503x Quality Requirements
 2504x Quality Evaluation
ISO SQuaRE
 Internal Quality
 Values, formats, relation
 External Quality
 Technological environment
 Quality in Use
 Context of use of the data user
ISO 25012
Data Quality Model
Roles
 Data Quality evaluator
 Data Producer
 Data Acquirer
 Data User
Data evaluator
 Defines/adapts a quality model
 Evaluate and act
 Data correction
 Technological adjustments
 Organizational measures
Model structure
 Characteristic
 Main aspects, e.g., usability
 Sub-Characteristic (optional)
 A detailed aspect of a characteristic, e.g.
Understandability
 Metric
 A set of rules to assign and interpret a
(numerical) evaluation to a specific (sub)-
characteristic
Characteristics
 Accuracy
 Completeness
 Consistency
 Credibility
 Currentness
 Accessibility
 Compliance
 Confidentiality
 Efficiency
 Precision
 Traceability
 Understandability
 Availability
 Portability
 Recoverability
Characteristics
 Accuracy
 Correspondence between data and reality
(syntactic and semantic)
 Completeness
 Computer: presence of all necessary
values
 User: how much the data is able to satisfy
the needs
 Consistency
 Absence of contradictions in the data
Characteristics
 Credibility
 The extent to which data are regarded as
true and credible by users
 Currentness
 the extent to which data is up-to-date
 Accessibility
 The capability of data to be accessed,
particularly by people who need
supporting technology or special
configuration because of some disability
Characteristics
 Regulatory compliance
 The capability of data to adhere to standards,
conventions or regulations in force and similar
rules relating to data quality
 Confidentiality
 The capability of the data to be accessed and
interpreted only by authorized users
 Efficiency
 The capability of data to be processed (accessed,
acquired, updated, etc) and to provide
appropriate levels of performance using the
appropriate amounts and types of resources
under stated conditions
Characteristics
 Precision
 Capability of the value assigned to an
attribute to provide the degree of
information needed in a stated context of
use
 Traceability
 Presence of attributes providing an audit trail
of access and changes made to data
 Understandability
 The extent to which data can be read and
interpreted by users
Characteristics
 Availability
 The capability of data to be always
retrievable.
 Recoverability
 The capability to preserve a specified level of
operations and its physical and logical
integrity, even in the event of failure
 Portability
 The capability of data to be moved to
another platform preserving quality
Perspectives
 Inherent: Facts (Data)
 Accuracy, Completeness, Consistency, Credibility, Currentness
 Inherent and System Dependent
 Accessibility and Understandability (HCI Support)
 Compliance, Confidentiality, Efficiency, Precision, Traceability
 System Dependent: Artefacts (D+Hw+Sw+Sys)
 Availability, Portability, Recoverability
ISO 25024
Measurement of Data Quality
Relationships among standards
[Diagram, source: ISO/IEC 25024]
 ISO/IEC 25010 (System and Software Product Quality) and ISO/IEC 25012 (Data Quality): composed of Quality characteristics, in turn composed of Quality sub-characteristics
 ISO/IEC 25022, 25023, 25024: Quality Measures, each defined by a Measurement function and composed of Quality Measure Elements (QME)
 ISO/IEC 25021: Quality Measure Elements, each obtained through a Measurement method from a Property to quantify of a Target Entity
Data Life Cycle: examples
 Data design
 Data collection
 External data acquisition
 Data integration
 Data processing
 Presentation
 Other use
 Data store
 Delete
Source: ISO/IEC 25024
Data design: target entities
 Architecture
 Contextual schema
 Data models (conceptual, logical,
physical)
 Data dictionary
 Document
Data design: properties
 Attribute
 Element
 Information
 Metadata
 Vocabulary
Other stages: target entities
 Data file
 DBMS
 RDBMS
 Form
 Presentation device
Properties
 Data format
 Data item
 Data value
 Information item
 Information item content
 Data record
Metrics definition
a) ID: abbreviated code of the quality characteristic + (I/D) + serial number;
b) Name: QM name related to data;
c) Description;
d) Measurement function: formula showing how the QMEs are combined to produce the QM;
e) DLC, Target entities, Properties: the stages of the DLC where the data QMEs are applicable, the target entities, and the properties of the target entities;
f) Note: additional information such as an acceptable range of values, reference to other standards, explanations or interpretation or criteria, measurement method used to obtain the
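As an illustration of items a) to f), the sketch below models a measure definition as a small Python record and splits an ID such as Acc-I-1 into its three parts. The class name, the field names and the regular expression are assumptions made for this example and are not taken from ISO/IEC 25024; the sample name and description reuse the wording of the Measures slide.

```python
import re
from dataclasses import dataclass, field

# Assumed ID layout: <characteristic code>-<I or D>-<serial number>, e.g. "Acc-I-1".
MEASURE_ID = re.compile(r"^(?P<code>[A-Za-z]{3})-(?P<kind>[ID])-(?P<serial>\d+)$")


@dataclass
class QualityMeasure:
    """Hypothetical container for fields a) to f) of a measure definition."""
    id: str
    name: str
    description: str
    measurement_function: str
    dlc_stages: list = field(default_factory=list)
    target_entities: list = field(default_factory=list)
    properties: list = field(default_factory=list)
    note: str = ""

    def id_parts(self):
        """Split the ID into (characteristic code, I/D flag, serial number)."""
        match = MEASURE_ID.match(self.id)
        if match is None:
            raise ValueError(f"unexpected measure ID: {self.id!r}")
        return match.group("code"), match.group("kind"), int(match.group("serial"))


# Example record; the measurement function text is illustrative only.
acc = QualityMeasure(
    id="Acc-I-1",
    name="Accuracy (syntactic)",
    description="Percentage of syntactically accurate cells",
    measurement_function="accurate cells / checked cells",
)
print(acc.id_parts())
```

Keeping the ID as a plain string and parsing it on demand keeps the record close to how the measures are cited on the slides (Acc-I-1, Tra-D-2, and so on).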
ACCURACY (Acc-I-1)
Copyright: ISO/IEC 25024
CASE STUDIES
Open Government Data
Open Government Data
OD: open data, data that can be
 Used
 Reused
 Redistributed
 By anyone and with any goal
G: Government produced or commissioned
by a government or an institutional
entity controlled by the government
http://opengovernmentdata.org
Why OGD ?
 Transparency
 Social and commercial value
 Participation
Case 1: Open Coesione
 Published data
 Structured
 Open data format
OpenCoesione
Statistical data from municipalities
 Residents
 Weddings
 Commercial activities
Datasets analyzed
Orchestrated disclosure
 Open Coesione
 portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
 85 billion Euros are being tracked, 850K projects
Decentralized disclosure
 Municipalities: Torino, Roma, Milano, Firenze, Bologna
 Residents: published by all five municipalities
 Weddings: published by three of the five
 Business Activities: published by three of the five
Measures
Characteristic     Description                                       ISO name
Completeness       Percentage of complete cells                      Com-I-1 (cell)
                   Percentage of complete rows                       Com-I-1 (row)
Accuracy           Percentage of syntactically accurate cells        Acc-I-1
Traceability       Track of creation                                 Tra-D-2 (c)
                   Track of update                                   Tra-D-2 (u)
Currentness        Percentage of current rows                        Cur-I-2
                   Delay in publication                              ~Cur-I-1
Compliance         e-GMS compliance                                  Cmp-D-1
                   Five stars open data                              Cmp-D-1
Understandability  Percentage of columns with metadata               Und-I-3
                   Percentage of columns in comprehensible format    Und-I-4
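A minimal sketch of how the two completeness measures and the syntactic accuracy measure could be computed on a tabular dataset. The toy table, the column names and the per-column regular expressions are invented for illustration; counting accuracy only over non-empty cells is also an assumption of this example, not a rule stated on the slides.

```python
import re

# Toy table; column names, values and validation rules are invented.
rows = [
    {"project_id": "P001", "funding": "1500.00", "start_date": "2012-03-01"},
    {"project_id": "P002", "funding": "",        "start_date": "2013-13-40"},
    {"project_id": "P003", "funding": "abc",     "start_date": "2011-07-15"},
    {"project_id": "",     "funding": "980.50",  "start_date": ""},
]

# Assumed syntactic rules: one regular expression per column.
RULES = {
    "project_id": re.compile(r"^P\d{3}$"),
    "funding": re.compile(r"^\d+(\.\d{1,2})?$"),
    "start_date": re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"),
}


def is_complete(value):
    """A cell counts as complete when it holds a non-blank value."""
    return value is not None and str(value).strip() != ""


def completeness_cells(table):
    """Share of complete cells, in the spirit of Com-I-1 (cell)."""
    cells = [v for row in table for v in row.values()]
    return sum(is_complete(v) for v in cells) / len(cells)


def completeness_rows(table):
    """Share of rows with no empty cell, in the spirit of Com-I-1 (row)."""
    return sum(all(is_complete(v) for v in row.values()) for row in table) / len(table)


def syntactic_accuracy(table, rules):
    """Share of non-empty cells matching the rule of their column (Acc-I-1 style)."""
    checked = [(col, v) for row in table for col, v in row.items() if is_complete(v)]
    accurate = sum(bool(rules[col].match(str(v))) for col, v in checked)
    return accurate / len(checked)


print(completeness_cells(rows), completeness_rows(rows), syntactic_accuracy(rows, RULES))
```

On this toy table, 9 of 12 cells and 2 of 4 rows are complete, and 7 of the 9 non-empty cells pass their syntactic rule.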
e-GMS
1. Accessibility (mandatory if appl)
2. Addressee (optional)
3. Aggregation (optional)
4. Audience (optional)
5. Contributor (optional)
6. Coverage (recommended)
7. Creator (mandatory)
8. Date (mandatory)
9. Description (optional)
10. Digital signature (optional)
11. Disposal (optional)
12. Format (optional)
13. Identifier (mandatory if appl)
14. Language (recommended)
15. Location (optional)
16. Mandate (optional)
17. Preservation (optional)
18. Publisher (mandatory if appl)
19. Relation (optional)
20. Rights (optional)
21. Source (optional)
22. Status (optional)
23. Subject (mandatory)
24. Title (mandatory)
25. Type (optional)
UK - e-Government Metadata Standard
https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
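The obligation levels in the list above can be turned into a simple compliance score. The sketch below is one possible reading: the element names come from the list, while the scoring rule (share of required elements with a non-blank value), the function name and the sample record are assumptions for illustration, not the definition of Cmp-D-1.

```python
# Element names taken from the e-GMS list; the scoring rule itself is an assumption.
MANDATORY = {"Creator", "Date", "Subject", "Title"}
MANDATORY_IF_APPLICABLE = {"Accessibility", "Identifier", "Publisher"}


def egms_compliance(metadata, applicable=frozenset()):
    """Return (score, missing) for one dataset's metadata record.

    `applicable` names the "mandatory if applicable" elements that apply
    to this dataset; deciding that is left to the evaluator.
    """
    required = MANDATORY | (MANDATORY_IF_APPLICABLE & set(applicable))
    present = {key for key, value in metadata.items() if str(value).strip()}
    missing = sorted(required - present)
    return (len(required) - len(missing)) / len(required), missing


# Hypothetical metadata record with an empty Date and no Subject.
record = {
    "Title": "Residents by district",
    "Creator": "City of Example",
    "Date": "",
    "Format": "CSV",
}
score, missing = egms_compliance(record, applicable={"Identifier"})
print(score, missing)
```

Here five elements are required (the four mandatory ones plus Identifier) and two are present, so the score is 0.4 and the missing elements are Date, Identifier and Subject.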
Results – Open Coesione
[Bar chart, scale 0 to 1, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1 (two bars), Und-I-3, Und-I-4]
 Null/zero values: domain uncertain
 Track updates missing
 Missing metadata
 Data not linked
Results – Municipality data
[Bar chart, scale 0 to 1, same measures as above]
 Discrepancies of values with domain
 No info on updates
 Missing metadata
Findings
 Disclosure strategy implies different
data quality
 Centralized vs. Decentralized
 Traceability is generally lacking
 Proposals to use Sw Conf Mgmt tools
 Metadata is often missing or
incomplete
Case 2: Public Contracts
 Published data
 Structured
 Open format
Data on public contracts ex Art. 37 of the Transparency Decree + ANAC prescriptions
Public contracts
 Transparency Decree (14 March 2013, n. 33)
 Public contracts (Art. 37 & Art. 9)
 Open Data Publication
 XML Standard Format (ANAC)
 Selected administrations: Italian
universities
Data Structure
XML
METADATA
DATA
LOTS
PARTICIPANTS
WINNER
<lotto>
  <cig>4421574E47</cig>
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <oggetto>
    Procedura di cottimo fiduciario per affidamento servizio di manutenzione e
    assistenza di primo livello stazioni self-service
  </oggetto>
  <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
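A lot like the one above can be loaded with Python's standard xml.etree.ElementTree module before any measure is computed. The sketch below is only an illustration: the element names come from the example, while the function names and the dictionary layout are assumptions. Note that the ampersand in the company name has to be written as &amp;amp; for the document to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Shortened copy of the example lot; "&" must be escaped as "&amp;".
LOTTO_XML = """
<lotto>
  <cig>4421574E47</cig>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
"""


def fiscal_codes(lot, path):
    """Collect the non-empty fiscal codes found under the given path."""
    return [e.text.strip() for e in lot.findall(path) if e.text]


def parse_lot(xml_text):
    """Turn one <lotto> element into a plain dictionary (hypothetical layout).

    This sketch assumes all amount and date elements are present and valid;
    real data would need explicit handling of missing or malformed values.
    """
    lot = ET.fromstring(xml_text.strip())
    return {
        "cig": (lot.findtext("cig") or "").strip(),
        "participants": fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale"),
        "winners": fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale"),
        "awarded": float(lot.findtext("importoAggiudicazione")),
        "paid": float(lot.findtext("importoSommeLiquidate")),
        "start": date.fromisoformat(lot.findtext("tempiCompletamento/dataInizio")),
        "end": date.fromisoformat(lot.findtext("tempiCompletamento/dataUltimazione")),
    }


lot = parse_lot(LOTTO_XML)
print(lot["cig"], lot["winners"], lot["paid"] <= lot["awarded"])
```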
Quality Evaluation Framework
Intrinsic dimensions
 Accuracy: percentage of elements with correct values
 Completeness: percentage of complete elements; percentage of complete aggregate elements
Domain-dependent dimensions
 Consistency: percentage of lots that meet the intrarelational and interrelational integrity constraints
 Duplication: number of duplicates
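The two domain-dependent dimensions can be sketched as follows. The constraints checked here (winners are among the participants, amount paid does not exceed the amount awarded, start date is not after end date) are examples of intrarelational constraints chosen for illustration; they are assumptions, not the full constraint set used in the study, and interrelational constraints are not covered. The sample lots are invented.

```python
from collections import Counter
from datetime import date

# Hypothetical lots, already parsed into dictionaries; all values are invented.
lots = [
    {"cig": "A000000001", "participants": ["111"], "winners": ["111"],
     "awarded": 100.0, "paid": 100.0, "start": date(2014, 1, 1), "end": date(2014, 6, 30)},
    {"cig": "A000000002", "participants": ["111", "222"], "winners": ["333"],  # winner not a participant
     "awarded": 200.0, "paid": 150.0, "start": date(2014, 2, 1), "end": date(2014, 3, 1)},
    {"cig": "A000000003", "participants": ["444"], "winners": ["444"],
     "awarded": 50.0, "paid": 80.0, "start": date(2014, 5, 1), "end": date(2014, 9, 1)},  # paid > awarded
    {"cig": "A000000001", "participants": ["111"], "winners": ["111"],  # same cig as the first lot
     "awarded": 100.0, "paid": 100.0, "start": date(2014, 1, 1), "end": date(2014, 6, 30)},
]


def is_consistent(lot):
    """Check the assumed constraints that hold within a single lot."""
    return (
        set(lot["winners"]) <= set(lot["participants"])
        and lot["paid"] <= lot["awarded"]
        and lot["start"] <= lot["end"]
    )


def consistency(all_lots):
    """Share of lots that satisfy every constraint."""
    return sum(is_consistent(lot) for lot in all_lots) / len(all_lots)


def duplicates(all_lots, key="cig"):
    """Number of extra occurrences of a key value that appears more than once."""
    counts = Counter(lot[key] for lot in all_lots)
    return sum(n - 1 for n in counts.values() if n > 1)


print(consistency(lots), duplicates(lots))
```

Two of the four sample lots pass all checks, and one lot repeats an existing cig.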
Identification of datasets
 The first 25 universities in the 2014 overall ranking published by the newspaper Il Sole 24 Ore.
 Only 12 universities provide summary tables in XML format.
Total number of assessed lots: 123,702
Average number of published lots: 10,308.5
 The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
CIG
The University of Torino publishes summary tables with 100% cig completeness, that is, 100% of the lots have the cig element, but about 32% of the values are out of domain.
[Bar chart "Unique Tender Identifier", axis from 0.60 to 1.00, two bars per university (UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm). Extracted values: 1, 0.94, 0.9999, 0.999, 0.67, 1, 0.99, 1, 0.998, 0.997, 0.9998, 0.99, followed by twelve values of 1.]
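The gap between completeness and accuracy for the cig element can be illustrated with a small domain check. The sketch assumes that a valid CIG is a string of ten digits or uppercase letters and that an all-zero string is a placeholder rather than a real identifier; both rules, the function names and the sample values are assumptions for this example, not the checks used in the study.

```python
import re

# Assumed domain: ten characters, digits or uppercase letters.
CIG_PATTERN = re.compile(r"^[0-9A-Z]{10}$")


def cig_completeness(values):
    """Share of lots where the cig element is present (None means absent)."""
    return sum(v is not None for v in values) / len(values)


def cig_in_domain(value):
    """True when the value matches the assumed format and is not all zeros."""
    return bool(value) and bool(CIG_PATTERN.match(value)) and set(value) != {"0"}


def cig_accuracy(values):
    """Share of lots whose cig value lies in the assumed domain."""
    return sum(cig_in_domain(v) for v in values) / len(values)


# Invented sample: every lot carries the element, but half the values are unusable.
cigs = ["4421574E47", "0000000000", "", "Z1A0B2C3D4", "N/A", "5512345ABC"]
print(cig_completeness(cigs), cig_accuracy(cigs))
```

In this sample the element is always present, so completeness is 1.0, while only three of the six values are in the assumed domain, giving an accuracy of 0.5.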
Many values are "00000000000".
The element is present for each lot but it is always empty.
Choice of contracting party
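[Bar chart, axis from 0.00 to 1.00, two bars per university (UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm). Extracted values: 0.9999, 0.998, 0.9999, 0, 1, 1, 1, 0.9991, 1, 1, 1, 1, followed by 1, 1, 1, 1, 1, 1, 1, 0.999, 1, 1, 1, 1.]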
All the lots published by the University of Milano have a winner but no information about the participants.
Fiscal Code
[Bar chart, axis from 0.90 to 1.00, bars per university (UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm). Extracted values: 1, 0.97, 0.99, 1, 1, 1, 1, 1, 1, 1, 1, 1, followed by 1, 1, 1, 1, 0.974, 1, 1, 0.996, 1, 0.951.]
In 14% of lots the amount paid is greater than the awarded amount.
Amount paid vs. Total paid
[Bar chart "Paid less or equal to Awarded", axis from 0.80 to 1.00, one bar per university (UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm). Extracted values: 0.87, 0.97, 0.96, 0.9999, 0.998, 0.99, 0.999, 0.93, 0.995, 0.9999, 0.98, 0.98.]
Final considerations
 ISO standard provides several
predefined measures
 Must be adapted to the case at hand
 Can be aggregated in different ways
 Possibility to define new measures
 ISO standard is intended for
structured data
 What about semantic knowledge bases?
References
 ISO/IEC 25012:2008, Software engineering — Software
product Quality Requirements and Evaluation (SQuaRE) —
Data quality model
 ISO 25024:2015, Software engineering — Software product
Quality Requirements and Evaluation (SQuaRE) —
Measurement of data quality
 Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico. “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”, Government Information Quarterly, Vol. 33, pp. 325-337, ISSN: 0740-624X
 Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca. “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”, in IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
1 of 62

Recommended

Data Governance by
Data GovernanceData Governance
Data GovernanceBoris Otto
3K views37 slides
Data Quality Best Practices by
Data Quality Best PracticesData Quality Best Practices
Data Quality Best PracticesDATAVERSITY
1.9K views32 slides
Data quality architecture by
Data quality architectureData quality architecture
Data quality architectureanicewick
9.9K views17 slides
Data Governance by
Data GovernanceData Governance
Data GovernanceSambaSoup
9.8K views12 slides
Data quality overview by
Data quality overviewData quality overview
Data quality overviewAlex Meadows
5.9K views11 slides
Data Quality & Data Governance by
Data Quality & Data GovernanceData Quality & Data Governance
Data Quality & Data GovernanceTuba Yaman Him
1.9K views21 slides

More Related Content

What's hot

Data Modeling Fundamentals by
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
1.3K views46 slides
Data Quality by
Data QualityData Quality
Data Qualityjerdeb
740 views46 slides
Data Quality Strategies by
Data Quality StrategiesData Quality Strategies
Data Quality StrategiesDATAVERSITY
4.9K views40 slides
Data Quality Presentation by
Data Quality PresentationData Quality Presentation
Data Quality PresentationStephen McCarthy
9.1K views11 slides
Data Governance Workshop by
Data Governance WorkshopData Governance Workshop
Data Governance WorkshopCCG
441 views51 slides
The data quality challenge by
The data quality challengeThe data quality challenge
The data quality challengeLenia Miltiadous
1.1K views29 slides

What's hot(20)

Data Modeling Fundamentals by DATAVERSITY
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
DATAVERSITY1.3K views
Data Quality by jerdeb
Data QualityData Quality
Data Quality
jerdeb740 views
Data Quality Strategies by DATAVERSITY
Data Quality StrategiesData Quality Strategies
Data Quality Strategies
DATAVERSITY4.9K views
Data Governance Workshop by CCG
Data Governance WorkshopData Governance Workshop
Data Governance Workshop
CCG441 views
Best Practices in Metadata Management by DATAVERSITY
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
DATAVERSITY1.6K views
Data Architecture, Solution Architecture, Platform Architecture — What’s the ... by DATAVERSITY
Data Architecture, Solution Architecture, Platform Architecture — What’s the ...Data Architecture, Solution Architecture, Platform Architecture — What’s the ...
Data Architecture, Solution Architecture, Platform Architecture — What’s the ...
DATAVERSITY1.3K views
DAS Slides: Building a Data Strategy – Practical Steps for Aligning with Busi... by DATAVERSITY
DAS Slides: Building a Data Strategy – Practical Steps for Aligning with Busi...DAS Slides: Building a Data Strategy – Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy – Practical Steps for Aligning with Busi...
DATAVERSITY2K views
Data quality and data profiling by Shailja Khurana
Data quality and data profilingData quality and data profiling
Data quality and data profiling
Shailja Khurana44.9K views
Best Practices in Metadata Management by DATAVERSITY
Best Practices in Metadata ManagementBest Practices in Metadata Management
Best Practices in Metadata Management
DATAVERSITY629 views
Enterprise Data Architect Job Description by Lars E Martinsson
Enterprise Data Architect Job DescriptionEnterprise Data Architect Job Description
Enterprise Data Architect Job Description
Lars E Martinsson22.7K views
Brief introduction to data visualization by Zach Gemignani
Brief introduction to data visualizationBrief introduction to data visualization
Brief introduction to data visualization
Zach Gemignani16K views
Improving Data Literacy Around Data Architecture by DATAVERSITY
Improving Data Literacy Around Data ArchitectureImproving Data Literacy Around Data Architecture
Improving Data Literacy Around Data Architecture
DATAVERSITY973 views
Business requirements gathering for bi by Corey Dayhuff
Business requirements gathering for biBusiness requirements gathering for bi
Business requirements gathering for bi
Corey Dayhuff2.1K views
Data Architecture Strategies: Data Architecture for Digital Transformation by DATAVERSITY
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
DATAVERSITY1.6K views
Introduction to Data Governance by John Bao Vuu
Introduction to Data GovernanceIntroduction to Data Governance
Introduction to Data Governance
John Bao Vuu854 views

Similar to Data Quality - Standards and Application to Open Data

Thesis Defense MBI by
Thesis Defense MBIThesis Defense MBI
Thesis Defense MBIJuan Hernandez
730 views22 slides
ENVRIPLUS Data for Science Theme by
ENVRIPLUS Data for Science ThemeENVRIPLUS Data for Science Theme
ENVRIPLUS Data for Science ThemeEUDAT
143 views19 slides
Profiling Linked Open Data by
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open DataBlerina Spahiu
529 views28 slides
Enhancing educational data quality in heterogeneous learning contexts using p... by
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Alex Rayón Jerez
3.5K views106 slides
Sensor metadata management with SWM (SMWCon fall 2013) by
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)jwnoteboom
669 views19 slides
Metadata Quality Assurance by
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality AssurancePéter Király
1.6K views26 slides

Similar to Data Quality - Standards and Application to Open Data(20)

ENVRIPLUS Data for Science Theme by EUDAT
ENVRIPLUS Data for Science ThemeENVRIPLUS Data for Science Theme
ENVRIPLUS Data for Science Theme
EUDAT143 views
Enhancing educational data quality in heterogeneous learning contexts using p... by Alex Rayón Jerez
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
Alex Rayón Jerez3.5K views
Sensor metadata management with SWM (SMWCon fall 2013) by jwnoteboom
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)
jwnoteboom669 views
Metadata Quality Assurance by Péter Király
Metadata Quality AssuranceMetadata Quality Assurance
Metadata Quality Assurance
Péter Király1.6K views
Lecture 1 Introduction to Computer Networks by Darwish Ahmad
Lecture 1 Introduction to Computer NetworksLecture 1 Introduction to Computer Networks
Lecture 1 Introduction to Computer Networks
Darwish Ahmad64 views
Rinascimento Digitale - A Digital Renaissance by John Newton
Rinascimento Digitale - A Digital RenaissanceRinascimento Digitale - A Digital Renaissance
Rinascimento Digitale - A Digital Renaissance
John Newton951 views
Metadata quality Assurance Framework at QQML2016 - short by Péter Király
Metadata quality Assurance Framework at QQML2016 - shortMetadata quality Assurance Framework at QQML2016 - short
Metadata quality Assurance Framework at QQML2016 - short
Péter Király708 views
Service oriented space-infrastructures_brown_university_2014_lisi by Marco Lisi
Service oriented space-infrastructures_brown_university_2014_lisiService oriented space-infrastructures_brown_university_2014_lisi
Service oriented space-infrastructures_brown_university_2014_lisi
Marco Lisi292 views
Analysis of data quality and information quality problems in digital manufact... by Mary Montoya
Analysis of data quality and information quality problems in digital manufact...Analysis of data quality and information quality problems in digital manufact...
Analysis of data quality and information quality problems in digital manufact...
Mary Montoya3 views
The Internet of Things: What's next? by PayamBarnaghi
The Internet of Things: What's next? The Internet of Things: What's next?
The Internet of Things: What's next?
PayamBarnaghi1.2K views
Dynamic Semantics for the Internet of Things by PayamBarnaghi
Dynamic Semantics for the Internet of Things Dynamic Semantics for the Internet of Things
Dynamic Semantics for the Internet of Things
PayamBarnaghi1.9K views
PERICLES workshop (London 15 October 2015) - Digital Ecosystem Model by PERICLES_FP7
PERICLES workshop (London 15 October 2015) - Digital Ecosystem ModelPERICLES workshop (London 15 October 2015) - Digital Ecosystem Model
PERICLES workshop (London 15 October 2015) - Digital Ecosystem Model
PERICLES_FP7427 views
A Mathematical Model for Evaluation of Data Analytics Implementation Alternat... by Jānis Grabis
A Mathematical Model for Evaluation of Data Analytics Implementation Alternat...A Mathematical Model for Evaluation of Data Analytics Implementation Alternat...
A Mathematical Model for Evaluation of Data Analytics Implementation Alternat...
Jānis Grabis213 views
Automating Data Science over a Human Genomics Knowledge Base by Vaticle
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
Vaticle712 views

More from Marco Torchiano

Testing the UI of Mobile Applications by
Testing the UI of Mobile ApplicationsTesting the UI of Mobile Applications
Testing the UI of Mobile ApplicationsMarco Torchiano
337 views118 slides
Software Engineering II Course at Politecnico di Torino by
Software Engineering II Course at Politecnico di TorinoSoftware Engineering II Course at Politecnico di Torino
Software Engineering II Course at Politecnico di TorinoMarco Torchiano
187 views14 slides
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing tools by
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing toolsEspresso vs. EyeAutomate: comparing two generations of Android GUI testing tools
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing toolsMarco Torchiano
240 views30 slides
Research Activities: past, present, and future. by
Research Activities: past, present, and future.Research Activities: past, present, and future.
Research Activities: past, present, and future.Marco Torchiano
185 views21 slides
Data Quality - Standards e Applicazioni by
Data Quality - Standards e ApplicazioniData Quality - Standards e Applicazioni
Data Quality - Standards e ApplicazioniMarco Torchiano
626 views32 slides
Data Visualization by
Data VisualizationData Visualization
Data VisualizationMarco Torchiano
908 views107 slides

More from Marco Torchiano(14)

Testing the UI of Mobile Applications by Marco Torchiano
Testing the UI of Mobile ApplicationsTesting the UI of Mobile Applications
Testing the UI of Mobile Applications
Marco Torchiano337 views
Software Engineering II Course at Politecnico di Torino by Marco Torchiano
Software Engineering II Course at Politecnico di TorinoSoftware Engineering II Course at Politecnico di Torino
Software Engineering II Course at Politecnico di Torino
Marco Torchiano187 views
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing tools by Marco Torchiano
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing toolsEspresso vs. EyeAutomate: comparing two generations of Android GUI testing tools
Espresso vs. EyeAutomate: comparing two generations of Android GUI testing tools
Marco Torchiano240 views
Research Activities: past, present, and future. by Marco Torchiano
Research Activities: past, present, and future.Research Activities: past, present, and future.
Research Activities: past, present, and future.
Marco Torchiano185 views
Data Quality - Standards e Applicazioni by Marco Torchiano
Data Quality - Standards e ApplicazioniData Quality - Standards e Applicazioni
Data Quality - Standards e Applicazioni
Marco Torchiano626 views
Riflessioni su Riforma Costituzionale "Renzi-Boschi" by Marco Torchiano
Riflessioni su Riforma Costituzionale "Renzi-Boschi"Riflessioni su Riforma Costituzionale "Renzi-Boschi"
Riflessioni su Riforma Costituzionale "Renzi-Boschi"
Marco Torchiano205 views
Relevance, Benefits, and Barriers of Software Modelling and Model Driven Tech... by Marco Torchiano
Relevance, Benefits, and Barriers of Software Modelling and Model Driven Tech...Relevance, Benefits, and Barriers of Software Modelling and Model Driven Tech...
Relevance, Benefits, and Barriers of Software Modelling and Model Driven Tech...
Marco Torchiano392 views
Energy Consumption Analysis
 of Image Encoding and Decoding Algorithms by Marco Torchiano
Energy Consumption Analysis
 of Image Encoding and Decoding AlgorithmsEnergy Consumption Analysis
 of Image Encoding and Decoding Algorithms
Energy Consumption Analysis
 of Image Encoding and Decoding Algorithms
Marco Torchiano475 views
Relevance, Benefits, and Problems of Software Modelling and Model-Driven Tech... by Marco Torchiano
Relevance, Benefits, and Problems of Software Modelling and Model-Driven Tech...Relevance, Benefits, and Problems of Software Modelling and Model-Driven Tech...
Relevance, Benefits, and Problems of Software Modelling and Model-Driven Tech...
Marco Torchiano611 views
A Model-Based Approach to Language Integration by Marco Torchiano
A Model-Based Approach to Language Integration A Model-Based Approach to Language Integration
A Model-Based Approach to Language Integration
Marco Torchiano532 views
On the computation of Truck Factor by Marco Torchiano
On the computation of Truck FactorOn the computation of Truck Factor
On the computation of Truck Factor
Marco Torchiano444 views
Language Interaction and Quality Issues: An Exploratory Study by Marco Torchiano
Language Interaction and Quality Issues: An Exploratory StudyLanguage Interaction and Quality Issues: An Exploratory Study
Language Interaction and Quality Issues: An Exploratory Study
Marco Torchiano928 views
The impact of process maturity on defect density by Marco Torchiano
The impact of process maturity on defect densityThe impact of process maturity on defect density
The impact of process maturity on defect density
Marco Torchiano1.3K views

Recently uploaded

AIMS-EREA.pdf by
AIMS-EREA.pdfAIMS-EREA.pdf
AIMS-EREA.pdfSudarson Roy Pratihar
8 views18 slides
Best Home Security Systems.pptx by
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptxmogalang
11 views16 slides
shivam tiwari.pptx by
shivam tiwari.pptxshivam tiwari.pptx
shivam tiwari.pptxAanyaMishra4
9 views14 slides
Infomatica-MDM.pptx by
Infomatica-MDM.pptxInfomatica-MDM.pptx
Infomatica-MDM.pptxKapil Rangwani
13 views16 slides
Running PostgreSQL in a Kubernetes cluster: CloudNativePG by
Running PostgreSQL in a Kubernetes cluster: CloudNativePGRunning PostgreSQL in a Kubernetes cluster: CloudNativePG
Running PostgreSQL in a Kubernetes cluster: CloudNativePGNick Ivanov
10 views29 slides
GDG Community Day 2023 - Interpretable ML in production by
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionSARADINDU SENGUPTA
7 views19 slides

Recently uploaded(20)

Best Home Security Systems.pptx by mogalang
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptx
mogalang11 views
Running PostgreSQL in a Kubernetes cluster: CloudNativePG by Nick Ivanov

Data Quality - Standards and Application to Open Data

  • 1. Data Quality: Standards and Application to Open Data. February 21, 2018 – Brunel University, UK. Marco Torchiano, marco.torchiano@polito.it. Version 1.1.0. © Marco Torchiano, 2018
  • 2. About me: Marco Torchiano; Associate Professor, Politecnico di Torino; Senior Member IEEE; Faculty Fellow, Nexa Center for Internet and Society; Member UNI CT504 (Software Engineering). Contacts: mailto:marco.torchiano@polito.it; http://softeng.polito.it/torchiano/; Twitter: @mtorchiano
  • 3. Current Research Interests: Mobile UI Automated Testing (PhD student working on fragility); (Open) Data Quality (PhD student working on KB quality); Software Energy Consumption (several collaborations). Also: MDD, survey methodology, code obfuscation, SE education, …
  • 4. Acknowledgments: Antonio Vetrò (the counterpart for this line of research); many other people: L. Canova, R. Iemma, F. Iuliano, F. Morando, C. Orozco Minotas, G. Procaccianti, R. Rashid
  • 6. Open Coesione: portal about the fulfilment of investments using the 2007-2013 European Cohesion funds; interactive interface; downloadable .csv datasets; ~100 billion Euros are being tracked, ~100K projects; http://www.opencoesione.gov.it/
  • 9. 43 ! * extraction, transformation, and loading
  • 11. Always refer to raw data; if that is not possible, estimate the accuracy of the analysis (e.g., about 5% in the example above). 43 !
  • 14. Outliers: outliers can point to interesting facts
  • 15. Outliers: … or to something which deserves a second look
  • 16. Value. pcvc = percentage of cells with correct value
  • 18. ISO - SQuaRE, family of standards: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation
  • 19. ISO SQuaRE: Internal Quality (values, formats, relation); External Quality (technological environment); Quality in Use (context of use of the data user)
  • 20. ISO 25012 Data Quality Model. [SQuaRE diagram: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation]
  • 21. Roles: Data Quality evaluator; Data Producer; Data Acquirer; Data User
  • 22. Data evaluator: defines/adapts a quality model; evaluates and acts (data correction; technological adjustments; organizational measures)
  • 23. Model structure. Characteristic: main aspects, e.g., usability. Sub-characteristic (optional): a detailed aspect of a characteristic, e.g., understandability. Metric: a set of rules to assign and interpret a (numerical) evaluation to a specific (sub-)characteristic
  • 24. Characteristics: Accuracy; Completeness; Consistency; Credibility; Currentness; Accessibility; Compliance; Confidentiality; Efficiency; Precision; Traceability; Understandability; Availability; Portability; Recoverability
  • 25. Characteristics. Accuracy: correspondence between data and reality (syntactic and semantic). Completeness: for the computer, presence of all necessary values; for the user, how much the data is able to satisfy the needs. Consistency: absence of contradictions in the data
  • 26. Characteristics. Credibility: the extent to which data are regarded as true and credible by users. Currentness: the extent to which data is up-to-date. Accessibility: the capability of data to be accessed, particularly by people who need supporting technology or special configuration because of some disability
  • 27. Characteristics. Regulatory compliance: the capability of data to adhere to standards, conventions or regulations in force and similar rules relating to data quality. Confidentiality: the capability of the data to be accessed and interpreted only by authorized users. Efficiency: the capability of data to be processed (accessed, acquired, updated, etc.) and to provide appropriate levels of performance using the appropriate amounts and types of resources under stated conditions
  • 28. Characteristics. Precision: capability of the value assigned to an attribute to provide the degree of information needed in a stated context of use. Traceability: presence of attributes providing an audit trail of access and changes made to data. Understandability: the extent to which data can be read and interpreted by users
  • 29. Characteristics. Availability: the capability of data to be always retrievable. Recoverability: the capability to preserve a specified level of operations and its physical and logical integrity, even in the event of failure. Portability: the capability of data to be moved to another platform preserving quality
  • 31. ISO 25024 Measurement of Data Quality. [SQuaRE diagram: 2500x Quality Management; 2501x Quality Model; 2502x Quality Measurement; 2503x Quality Requirements; 2504x Quality Evaluation]
  • 32. Relationships among standards (diagram): the quality models (ISO/IEC 25010 System and Software Product Quality; ISO/IEC 25012 Data Quality) are composed of quality characteristics and quality sub-characteristics; quality measures (ISO/IEC 25022, 25023, 25024) are defined by a measurement function composed of quality measure elements (QME, ISO/IEC 25021); further diagram labels: measurement method, property to quantify, target entity. Source: ISO/IEC 25024
  • 33. Data Life Cycle, examples: data design; data collection; external data acquisition; data integration; data processing; presentation; other use; data store; delete. Source: ISO/IEC 25024
  • 34. Data design, target entities: architecture; contextual schema; data models (conceptual, logical, physical); data dictionary; document
  • 35. Data design, properties: attribute; element; information; metadata; vocabulary
  • 36. Other stages, target entities: data file; DBMS; RDBMS; form; presentation device
  • 37. Properties: data format; data item; data value; information item; information item content; data record
  • 38. Metrics definition: a) ID: abbreviated code of the quality characteristic + (I/D) + serial number; b) Name: QM name related to data; c) Description; d) Measurement function: formula showing how the QMEs are combined to produce the QM; e) DLC, target entities, properties: the stages of the data life cycle (DLC) where the data QMEs are applicable, the target entities, and the properties of the target entities; f) Note: additional information such as an acceptable range of values, reference to other standards, explanations or interpretation or criteria, measurement method used to obtain the
  • 41. Open Government Data. OD: open data, data that can be used, reused, and redistributed by anyone and with any goal. G: government, data produced or commissioned by a government or an institutional entity controlled by the government. http://opengovernmentdata.org
  • 42. Why OGD? Transparency; social and commercial value; participation
  • 43. Case 1: Open Coesione. Published data: structured, open data format. OpenCoesione. Statistical data from municipalities: residents; weddings; commercial activities
  • 44. Datasets analyzed. Orchestrated disclosure: Open Coesione, portal about the fulfilment of investments using the 2007-2013 European Cohesion funds; 85 billion Euros are being tracked, 850K projects. Decentralized disclosure, table with columns Dataset, Torino, Roma, Milano, Firenze, Bologna: Residents X X X X X; Weddings X X X; Business Activities X X X
  • 46. Measures
        Characteristic    | Description                                    | ISO name
        Completeness      | Percentage of complete cells                   | Com-I-1 (cell)
                          | Percentage of complete rows                    | Com-I-1 (row)
        Accuracy          | Percentage of syntactically accurate cells     | Acc-I-1
        Traceability      | Track of creation                              | Tra-D-2 (c)
                          | Track of update                                | Tra-D-2 (u)
        Currentness       | Percentage of current rows                     | Cur-I-2
                          | Delay in publication                           | ~Cur-I-1
        Compliance        | e-GMS compliance                               | Cmp-D-1
                          | Five stars open data                           | Cmp-D-1
        Understandability | Percentage of columns with metadata            | Und-I-3
                          | Percentage of columns in comprehensible format | Und-I-4
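The two completeness measures in the table above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the sample table is made up, and treating the empty string and "NA" as missing values is an assumption that would need adapting to each dataset.

```python
import csv
import io

# Made-up sample table (not real data); "" and "NA" count as missing (assumption).
SAMPLE = """dataset,year,value
A,2012,10
B,,20
C,2013,NA
D,2014,40
"""

MISSING = frozenset({"", "NA"})


def completeness(rows, missing=MISSING):
    """Return (share of complete cells, share of complete rows), both in [0, 1]."""
    flags = [[cell.strip() not in missing for cell in row] for row in rows]
    total_cells = sum(len(row) for row in flags)
    if total_cells == 0:
        raise ValueError("the table has no cells")
    cell_ratio = sum(sum(row) for row in flags) / total_cells
    row_ratio = sum(all(row) for row in flags) / len(flags)
    return cell_ratio, row_ratio


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row
cell_ratio, row_ratio = completeness(reader)
print(f"Com-I-1 (cell) = {cell_ratio:.2f}, Com-I-1 (row) = {row_ratio:.2f}")
```

A row counts as complete only when every one of its cells is filled, so the row-level measure can never exceed the cell-level one.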
  • 47. e-GMS (UK e-Government Metadata Standard): 1. Accessibility (mandatory if applicable); 2. Addressee (optional); 3. Aggregation (optional); 4. Audience (optional); 5. Contributor (optional); 6. Coverage (recommended); 7. Creator (mandatory); 8. Date (mandatory); 9. Description (optional); 10. Digital signature (optional); 11. Disposal (optional); 12. Format (optional); 13. Identifier (mandatory if applicable); 14. Language (recommended); 15. Location (optional); 16. Mandate (optional); 17. Preservation (optional); 18. Publisher (mandatory if applicable); 19. Relation (optional); 20. Rights (optional); 21. Source (optional); 22. Status (optional); 23. Subject (mandatory); 24. Title (mandatory); 25. Type (optional). https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
  • 48. Results – Open Coesione. [Bar chart, scale 0 to 1, of the measures Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4] Annotations: null/zero values, domain uncertain; track of updates missing; missing metadata; data not linked
  • 49. Results – Municipality data. [Bar chart, scale 0 to 1, of the measures Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4] Annotations: discrepancies of values with domain; no info on updates; missing metadata
  • 50. Findings: the disclosure strategy (centralized vs. decentralized) implies different data quality; traceability is generally lacking (proposal: use software configuration management tools); metadata is often missing or incomplete
  • 51. Case 2: Public Contracts. Published data: structured, open format. Data on public contracts ex Art. 37 of the Transparency Decree + ANAC prescriptions
  • 52. Public contracts: Transparency Decree (14 March 2013, n. 33); public contracts (Art. 37 & Art. 9); open data publication; XML standard format (ANAC); selected administrations: Italian universities
  • 54.
        <lotto>
          <cig>4421574E47</cig>
          <strutturaProponente>
            <codiceFiscaleProp>00518460019</codiceFiscaleProp>
            <denominazione>Politecnico di Torino</denominazione>
          </strutturaProponente>
          <oggetto>Procedura di cottimo fiduciario per affidamento servizio di manutenzione e assistenza di primo livello stazioni self-service</oggetto>
          <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
          <partecipanti>
            <partecipante>
              <codiceFiscale>06267040019</codiceFiscale>
              <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
            </partecipante>
          </partecipanti>
          <aggiudicatari>
            <aggiudicatario>
              <codiceFiscale>06267040019</codiceFiscale>
              <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
            </aggiudicatario>
          </aggiudicatari>
          <importoAggiudicazione>7500.00</importoAggiudicazione>
          <tempiCompletamento>
            <dataInizio>2014-09-01</dataInizio>
            <dataUltimazione>2014-11-30</dataUltimazione>
          </tempiCompletamento>
          <importoSommeLiquidate>7500.00</importoSommeLiquidate>
        </lotto>
  • 55. Quality Evaluation Framework
        Intrinsic dimensions
          Accuracy     | Percentage of elements with correct values
          Completeness | Percentage of complete elements; percentage of complete aggregate elements
        Domain-dependent dimensions
          Consistency  | Percentage of lots that meet the intrarelational and interrelational integrity constraints
          Duplication  | Number of duplicates
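The two intrinsic measures can be sketched on lots shaped like the one in slide 54. This is only an illustration under stated assumptions: the `<data>` root element and the sample lots are made up, "correct" is reduced to "non-empty" (the real domain check would follow the ANAC schema), and accuracy is taken over the elements that are present, which is one of several possible denominators.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the element names come from the slide-54 example.
SAMPLE = """<data>
  <lotto>
    <cig>4421574E47</cig>
    <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  </lotto>
  <lotto>
    <cig>0000000000</cig>
    <sceltaContraente></sceltaContraente>
  </lotto>
  <lotto>
    <sceltaContraente></sceltaContraente>
  </lotto>
</data>"""


def non_empty(text):
    """Placeholder domain check (assumption): any non-empty value is accepted."""
    return bool(text)


def element_quality(lots, tag, in_domain):
    """Return (completeness, accuracy) of the child element `tag` over all lots.

    completeness: share of lots in which the element is present
    accuracy:     share of present elements whose value passes `in_domain`
                  (None when the element never occurs)
    """
    found = [lot.find(tag) for lot in lots]
    present = [el for el in found if el is not None]
    completeness = len(present) / len(lots)
    if not present:
        return completeness, None
    correct = sum(in_domain((el.text or "").strip()) for el in present)
    return completeness, correct / len(present)


lots = ET.fromstring(SAMPLE).findall("lotto")
scelta = element_quality(lots, "sceltaContraente", non_empty)
cig = element_quality(lots, "cig", non_empty)
print("sceltaContraente:", scelta)
print("cig:", cig)
```

An element that is always present but always empty, as in the sample's second and third lots, scores high on completeness and low on accuracy, which is why the two measures are reported separately.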
  • 56. Identification of datasets: first 25 universities of the overall ranking for 2014 provided by the newspaper Il Sole 24 Ore. Only 12 universities provide summary tables in XML format. Total number of assessed lots: 123702. Average number of published lots: 10308.5. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
  • 57. CIG (Unique Tender Identifier): the University of Torino publishes summary tables that have 100% cig completeness, that is, 100% of lots have the cig element, but about 32% of them are out of domain (a lot of “00000000000”). [Bar chart, scale 0.60 to 1.00] Accuracy: UniBo 1; PoliMi 0.94; PoliTo 0.9999; UniMi 0.999; UniTo 0.67; UniVe 1; UniUpo 0.99; UniFe 1; UniMib 0.998; UniPv 0.997; UniSa 0.9998; UnivPm 0.99. Completeness: 1 for all twelve universities.
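A domain check for the cig element might look like the sketch below. The rule used here is an assumption made for illustration, not the official ANAC specification: a value is accepted when it consists of exactly 10 upper-case alphanumeric characters (the shape of the slide-54 example `4421574E47`) and is not a run of zeros. The function name `cig_in_domain` is hypothetical.

```python
import re

# Assumed shape of a valid CIG: 10 upper-case alphanumeric characters.
# This mirrors the slide-54 example and is NOT taken from the ANAC specification.
CIG_PATTERN = re.compile(r"[0-9A-Z]{10}")


def cig_in_domain(value):
    """Return True when `value` looks like a plausible CIG under the assumed rule."""
    value = value.strip()
    if not CIG_PATTERN.fullmatch(value):
        return False
    # Placeholder identifiers made only of zeros are treated as out of domain.
    return set(value) != {"0"}


values = ["4421574E47", "00000000000", "0000000000", ""]
accuracy = sum(cig_in_domain(v) for v in values) / len(values)
print(f"share of in-domain cig values: {accuracy:.2f}")
```

The all-zeros placeholder mentioned on the slide is rejected either because of its length or because it carries no identifying information, while the element still counts as present for completeness.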
  • 58. Choice of contracting party (sceltaContraente): the element is present for each lot but it is always empty. [Bar chart, scale 0.00 to 1.00] Accuracy: UniBo 0.9999; PoliMi 0.998; PoliTo 0.9999; UniMi 0; UniTo 1; UniVe 1; UniUpo 1; UniFe 0.9991; UniMib 1; UniPv 1; UniSa 1; UnivPm 1. Completeness: UniFe 0.999; 1 for the other eleven universities.
  • 59. Fiscal Code: all the lots published by the University of Milano have a winner but no information about the participants. [Bar chart, scale 0.90 to 1.00; universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm] Plotted values as extracted: 1 0.97 0.99 1 1 1 1 1 1 1 1 1 1 1 1 1 0.974 1 1 0.996 1 0.951
  • 60. Amount paid vs. Total paid: in 14% of lots the amount paid is greater than the awarded amount. [Bar chart, scale 0.80 to 1.00, "Paid less or equal to Awarded"] UniBo 0.87; PoliMi 0.97; PoliTo 0.96; UniMi 0.9999; UniTo 0.998; UniVe 0.99; UniUpo 0.999; UniFe 0.93; UniMib 0.995; UniPv 0.9999; UniSa 0.98; UnivPm 0.98
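The "paid less or equal to awarded" constraint can be sketched on lots shaped like the slide-54 example. The `<data>` root, the sample amounts, and the choice to skip lots where either amount is missing or unparsable are assumptions made for this illustration.

```python
from decimal import Decimal, InvalidOperation
import xml.etree.ElementTree as ET

# Made-up amounts; only the element names come from the slide-54 example.
SAMPLE = """<data>
  <lotto><importoAggiudicazione>7500.00</importoAggiudicazione><importoSommeLiquidate>7500.00</importoSommeLiquidate></lotto>
  <lotto><importoAggiudicazione>1000.00</importoAggiudicazione><importoSommeLiquidate>1200.00</importoSommeLiquidate></lotto>
  <lotto><importoAggiudicazione>500.00</importoAggiudicazione><importoSommeLiquidate>0.00</importoSommeLiquidate></lotto>
  <lotto><importoAggiudicazione>300.00</importoAggiudicazione></lotto>
</data>"""


def amount(lot, tag):
    """Parse the child element `tag` as a Decimal, or return None if absent/invalid."""
    text = lot.findtext(tag)
    try:
        return Decimal(text.strip())
    except (AttributeError, InvalidOperation):
        return None


def paid_within_award_ratio(lots):
    """Share of checkable lots whose paid amount does not exceed the awarded amount."""
    checked = satisfied = 0
    for lot in lots:
        awarded = amount(lot, "importoAggiudicazione")
        paid = amount(lot, "importoSommeLiquidate")
        if awarded is None or paid is None:
            continue  # assumption: lots with a missing amount are not checkable
        checked += 1
        satisfied += paid <= awarded
    return satisfied / checked if checked else None


lots = ET.fromstring(SAMPLE).findall("lotto")
ratio = paid_within_award_ratio(lots)
print(f"paid <= awarded in {ratio:.0%} of checkable lots")
```

`Decimal` is used instead of `float` so that amounts such as 7500.00 compare exactly; whether uncheckable lots should instead count as violations is a design choice the slides do not settle.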
  • 61. Final considerations: the ISO standard provides several predefined measures, which must be adapted to the case at hand and can be aggregated in different ways; it is possible to define new measures; the ISO standard is intended for structured data. What about semantic knowledge bases?
  • 62. References: ISO/IEC 25012:2008, Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model. ISO 25024:2015, Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Measurement of data quality. Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico, “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”, Government Information Quarterly, Vol. 33, pp. 325-337, ISSN 0740-624X. Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca, “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”, in IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)

Editor's Notes

  1. Transparency isn’t just about access; it is also about sharing and reuse. Often, to understand material it needs to be analyzed and visualized, and this requires that the material be open so that it can be freely used and reused.
  2. To assess quality we consider two kinds of dimensions: intrinsic dimensions, which do not depend on the type of data, and domain-dependent dimensions. As intrinsic dimensions we evaluate Accuracy, computed as the percentage of elements with correct values, and Completeness, computed as the percentage of complete elements and the percentage of complete aggregate elements. An element is considered incorrect or incomplete if it does not meet the specification of its domain or the number of occurrences specified in the XML schema. For the domain-dependent dimensions we evaluate Consistency by defining a set of integrity constraints that strictly depend on the public contracts domain, for example that the amount paid must be less than or equal to the award amount, or that if a public contract does not have a successful tenderer the amount paid must be equal to zero.
  3. To conduct the evaluation we selected the first 25 universities in the 2014 general ranking provided by the newspaper Il Sole 24 Ore. Only 12 of them provide summary tables in XML format, for a total of 123702 assessed lots. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
  4. Accuracy and completeness were computed for all elements, but we show only the most interesting ones; likewise, we show only some of the integrity constraints defined to assess consistency. The cig is the unique identifier of a lot. The University of Torino has 100% completeness on the cig: the cig element is present in all analysed lots, but in 32% of cases it is out of domain.
  5. The sceltaContraente is one of the most important elements because it specifies the procedure for the selection of the contractor, and it can be used by the authorities to detect illegal awards of contracts. High accuracy and completeness improve the transparency of contracts. The completeness of the sceltaContraente element for the University of Milano is 100%, but the percentage of correct elements is 0: the element is present in all the lots provided by the University of Milano, but its value is always empty.
  6. The codiceFiscale is the unique identifier of the participants. An interesting aspect is that the University of Milano is not classified, because none of the summary tables provided by the University contain information about the participants.
  7. This result is highlighted by the "lot has participant" and "successful tenderer is participant" interrelational constraints. The first computes the percentage of lots which have a successful tenderer and have at least one participant, while the second computes the percentage of cells in which the successful tenderer of a lot is a participant for the same lot. In both cases the percentage for the University of Milano is zero, because there is no information on participants in the analysed files. For the University of Milano-Bicocca the percentage for the "lot has participant" constraint is slightly higher than for "successful tenderer is participant"; this means that although some lots have participants, the successful tenderer is not one of them.
  8. The first intrarelational constraint computes the percentage of lots in which the amount paid is less than or equal to the award amount. We can see that 14% of the lots of the University of Bologna have an amount paid greater than the award amount; this shows that more public money than requested is spent. The successfulTenderer_amountPaid constraint computes the percentage of cells in which there is no information about the successful tenderer but the amount paid is different from zero. For 40% of the lots of the University of Bologna there is no information about the successful tenderer but an amount of money is distributed, so it is not possible to track the money, that is, it is not known who receives it.
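The two interrelational constraints described in note 7 can be sketched as follows. The denominators are an assumption, since the notes do not spell them out: both shares are computed over the lots that have at least one successful tenderer. The `<data>` root and the placeholder fiscal codes (`A1`, `B2`, …) are made up; only the element names come from the slide-54 example.

```python
import xml.etree.ElementTree as ET

# Made-up lots with placeholder fiscal codes (not real identifiers).
SAMPLE = """<data>
  <lotto>
    <partecipanti>
      <partecipante><codiceFiscale>A1</codiceFiscale></partecipante>
      <partecipante><codiceFiscale>B2</codiceFiscale></partecipante>
    </partecipanti>
    <aggiudicatari><aggiudicatario><codiceFiscale>A1</codiceFiscale></aggiudicatario></aggiudicatari>
  </lotto>
  <lotto>
    <partecipanti><partecipante><codiceFiscale>D4</codiceFiscale></partecipante></partecipanti>
    <aggiudicatari><aggiudicatario><codiceFiscale>C3</codiceFiscale></aggiudicatario></aggiudicatari>
  </lotto>
  <lotto>
    <aggiudicatari><aggiudicatario><codiceFiscale>E5</codiceFiscale></aggiudicatario></aggiudicatari>
  </lotto>
  <lotto>
    <partecipanti><partecipante><codiceFiscale>F6</codiceFiscale></partecipante></partecipanti>
  </lotto>
</data>"""


def fiscal_codes(lot, path):
    """Collect the non-empty codiceFiscale values found under `path`."""
    return {(el.text or "").strip() for el in lot.findall(path)} - {""}


def participant_constraints(lots):
    """Return (lot has participant, successful tenderer is participant) as shares.

    Both shares are taken over lots with at least one successful tenderer
    (an assumption); (None, None) is returned when there are no such lots.
    """
    with_winner = has_participant = winner_is_participant = 0
    for lot in lots:
        winners = fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale")
        participants = fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale")
        if not winners:
            continue
        with_winner += 1
        has_participant += bool(participants)
        winner_is_participant += winners <= participants  # subset test
    if with_winner == 0:
        return None, None
    return has_participant / with_winner, winner_is_participant / with_winner


lots = ET.fromstring(SAMPLE).findall("lotto")
lot_has_participant, tenderer_is_participant = participant_constraints(lots)
print(lot_has_participant, tenderer_is_participant)
```

In the sample, the second lot has a participant but its successful tenderer is not among them, so the first share comes out higher than the second, which is the pattern note 7 reports for Milano-Bicocca.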