This is the presentation part of an M.Sc. thesis in Software Engineering at Friedrich Schiller University Jena entitled “Dataset quality visualization in BEXIS2”. In this thesis, a visual overview of data quality was prototypically implemented as a new feature for the BEXIS2 Data Management System to make studying dataset quality straightforward.
Dataset quality visualization in BEXIS2
1. Dataset Quality Visualization in BEXIS2
MASTER THESIS
Nafiseh Navabpour
March 2021
Supervisors
Prof. Dr. Birgitta König-Ries, Roman Gerlach, Sirko Schindler
7. Summary of problem statement
• Data is continuously produced
• Data is stored at different quality levels in data portals
• Data consumers search data portals to review dataset quality
• Reviewing dataset quality is time-consuming
=> Dataset quality overview
8. The path in solving the problem
• Interview with BEXIS2 users
• Literature review
• Looking at other data portals
9. Interview with BEXIS2 users
Upload data
Quality check
• Excel sheet
• TXT
• CSV
• Image
• Comprehensible metadata
• Well-defined dataset
• Clearly defined variables
• Accurate metadata
• Correct information
• Complete data
• Ready for analysis
• Ready for reuse
10. Literature review: Data quality dimensions
Intrinsic DQ: Accuracy, Objectivity, Believability, Reputation
Contextual DQ: Value-added, Relevancy, Timeliness, Completeness, Amount of data
Representational DQ: Interpretability, Ease of understanding, Consistency, Conciseness
Accessibility DQ: Accessibility, Access security
Wang and Strong 1996.
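The grouping of dimensions into four categories can be held in a plain lookup table. The following is only an illustrative sketch: the names `DQ_CATEGORIES` and `category_of` are hypothetical and belong neither to BEXIS2 nor to the cited paper; the labels are copied from the slide.

```python
from typing import Dict, List, Optional

# Categories and dimensions as listed on the slide (labels copied verbatim).
DQ_CATEGORIES: Dict[str, List[str]] = {
    "Intrinsic DQ": ["Accuracy", "Objectivity", "Believability", "Reputation"],
    "Contextual DQ": ["Value-added", "Relevancy", "Timeliness",
                      "Completeness", "Amount of data"],
    "Representational DQ": ["Interpretability", "Ease of understanding",
                            "Consistency", "Conciseness"],
    "Accessibility DQ": ["Accessibility", "Access security"],
}


def category_of(dimension: str) -> Optional[str]:
    """Return the category a dimension belongs to, or None if it is unknown.

    Hypothetical helper; matching ignores case and surrounding whitespace.
    """
    wanted = dimension.strip().lower()
    for category, dimensions in DQ_CATEGORIES.items():
        if wanted in (d.lower() for d in dimensions):
            return category
    return None
```

For instance, `category_of("timeliness")` yields the contextual category, while a label that is not on the slide yields `None`.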
11. Literature review: Dataset summarization
• Data origin
• The way of specifying the time
• The way of specifying the location
• Dataset completion
• Anything else that makes data clearer
• A short description
• Dataset format and size
• Data headers
• Data types and data values
Koesten et al. 2020.
12. Dataset quality components
Quality components | Quality attributes
Description length | Accuracy, Completeness
Dataset format and size | Accessibility
A list of variables | Relevancy, Completeness, Understandability
Data types, data distribution | Completeness, Amount of data
A list of files, file extensions | Relevancy, Understandability, Accessibility
Dataset contributors | Believability
Metadata/data completeness | Completeness
Dataset security level | Accessibility, Security
Shared elements | Reputation
Comparison | Value-added
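A few of these components can be computed directly from a tabular dataset. The sketch below is not the BEXIS2 implementation: the function names are hypothetical, and the concrete measures (word count for description length, share of non-empty cells for completeness, a rough number/text guess for data types) are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def description_length(description: str) -> int:
    """Assumed measure: number of whitespace-separated words."""
    return len(description.split())


def completeness(rows: List[Row], variables: List[str]) -> float:
    """Assumed measure: fraction of cells holding a non-empty value."""
    total = len(rows) * len(variables)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for name in variables
        if row.get(name) not in (None, "")
    )
    return filled / total


def infer_type(values: List[Optional[str]]) -> str:
    """Rough guess: 'number' if every non-empty value parses as a float."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "empty"
    try:
        for value in present:
            float(value)
    except ValueError:
        return "text"
    return "number"


def quality_overview(description: str, rows: List[Row],
                     variables: List[str]) -> dict:
    """Collect the illustrative components into one dictionary."""
    return {
        "description_length": description_length(description),
        "variables": variables,
        "data_types": {
            name: infer_type([row.get(name) for row in rows])
            for name in variables
        },
        "completeness": completeness(rows, variables),
    }


# Made-up sample data, for illustration only.
rows = [
    {"plot": "A1", "height_cm": "12.5"},
    {"plot": "A2", "height_cm": ""},
    {"plot": "A3", "height_cm": "9"},
    {"plot": "", "height_cm": "11.2"},
]
overview = quality_overview("Plant heights measured per plot",
                            rows, ["plot", "height_cm"])
```

With this sample, two of the eight cells are empty, so the completeness value is 0.75.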
13. Literature review: Dataset quality visualization pipeline
• Data import
• Data preparation
• Mapping
• Data manipulation
• Rendering
Qin et al. 2020.
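The pipeline stages can be sketched as a chain of small functions that ends in a text bar chart. This is a minimal illustration only. It is an assumption that the stages run in the order in which the slide lists them, and what each stage does here (dropping empty values, scaling to bar lengths, sorting) is invented for the example. All function names and the sample data are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple


def import_data(text: str) -> List[Dict[str, str]]:
    """Data import: parse CSV text into a list of rows."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows: List[Dict[str, str]], column: str) -> List[Tuple[str, float]]:
    """Data preparation: drop rows with a missing value, convert to float."""
    return [(row["name"], float(row[column])) for row in rows if row[column]]


def map_to_bars(pairs: List[Tuple[str, float]],
                width: int = 20) -> List[Tuple[str, int]]:
    """Mapping: turn each value into a bar length (a visual attribute)."""
    if not pairs:
        return []
    largest = max(value for _, value in pairs)
    if largest <= 0:
        return [(name, 0) for name, _ in pairs]
    return [(name, round(width * value / largest)) for name, value in pairs]


def manipulate(bars: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Data manipulation: sort the bars, longest first."""
    return sorted(bars, key=lambda bar: bar[1], reverse=True)


def render(bars: List[Tuple[str, int]]) -> str:
    """Rendering: draw one line of '#' characters per bar."""
    return "\n".join(f"{name:<8}{'#' * length}" for name, length in bars)


# Made-up sample input: dataset names and their row counts.
source = "name,rows\nsoil,40\nbirds,\nplants,100\n"
bars = manipulate(map_to_bars(prepare(import_data(source), "rows")))
chart = render(bars)
```

Keeping each stage as its own function means a stage can be replaced, for example with a different renderer, without touching the others.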
14. Looking at other data portals
https://data.cityofnewyork.us
https://www.kaggle.com
25. Positive and negative points
+ Short text
+ Similar shapes
+ Color palette
+ Legend and tooltips
+ Interactive elements
- Loading time
A prototypical solution
27. Thank you for your attention!
Jena, 23.03.2021
Nafiseh Navabpour
Nafiseh.Navabpour@uni-jena.de
28. References
• ATTO – Amazon Tall Tower Observatory, Feb. 2021. [Online]. Available: https://www.attoproject.org/.
• Covid-19 world vaccination progress, Feb. 2021. [Online]. Available: https://www.kaggle.com/gpreda/covid-world-vaccination-progress.
• L. Koesten, E. Simperl, T. Blount, E. Kacprzak, and J. Tennison, “Everything you always wanted to know about a dataset: Studies in data summarisation,” International Journal of Human-Computer Studies, vol. 135, p. 102367, 2020. DOI: 10.1016/j.ijhcs.2019.10.004.
• N. Navabpour, Dataset quality visualization in BEXIS2, version 1.0.0, Mar. 2021. DOI: 10.5281/zenodo.4485845.
• Covid-19 free meals locations, Feb. 2021. [Online]. Available: https://data.cityofnewyork.us/Education/COVID-19-Free-Meals-Locations/sp4a-vevi.
• X. Qin, Y. Luo, N. Tang, and G. Li, “Making data visualization more efficient and effective: A survey,” The VLDB Journal, vol. 29, no. 1, pp. 93–117, 2020. DOI: 10.1007/s00778-019-00588-3.
• R. Y. Wang and D. M. Strong, “Beyond accuracy: What data quality means to data consumers,” Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. DOI: 10.1080/07421222.1996.11518099.