Data Quality for AI or AI for Data Quality: advances in Data Quality Management for the success and sustainability of emerging technologies, business and society
“Data is the new oil” is only partly true: according to Forbes, data is more than oil, while according to Ataccama, “Manual Data Quality Doesn’t Cut It in 2023”. This was the main driver behind my guest lecture entitled “Data Quality for AI or AI for Data Quality: advances in Data Quality Management for the success and sustainability of emerging technologies, business and society”, in which we discussed the role of artificial intelligence in data quality management and the role of data quality for AI, concluding that it is not a question of “data quality for AI” OR “AI for data quality”, but rather of AND.
We also looked at the current market offer for AI-driven data quality management, the pros and cons of these solutions, the prerequisites to take into account when using them (e.g., metadata and its quality for those that derive DQ rules from metadata analysis), and how a potentially more promising solution could be built.
We also looked at the data quality specificities to consider depending on the artifact – a data object (dataset) whose owner is known or unknown (open data), an Information System, a Data Warehouse, a Data Lake, a Data Lakehouse, a Data Mesh – where, when and how does DQ take place in them? What are the current trends? And are these indeed trends, or rather hype?
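The metadata-based DQ rule derivation mentioned above can be sketched in a few lines. This is a minimal, hypothetical illustration (the profiling fields and rule set are my own, not taken from any specific tool): profile a column, then turn the profile into checkable completeness and range rules.

```python
# Hypothetical sketch of metadata-driven DQ rule derivation.
# Profiling fields and rules are illustrative assumptions.

def profile_column(values):
    """Profile a column: infer type, null rate, and value range."""
    non_null = [v for v in values if v is not None]
    inferred_type = type(non_null[0]).__name__ if non_null else "unknown"
    profile = {
        "type": inferred_type,
        "null_rate": 1 - len(non_null) / len(values),
    }
    if inferred_type in ("int", "float"):
        profile["min"], profile["max"] = min(non_null), max(non_null)
    return profile

def derive_rules(profile):
    """Turn a profile into checkable DQ rules (completeness, range)."""
    rules = []
    if profile["null_rate"] == 0:
        rules.append(lambda v: v is not None)               # completeness
    if profile["type"] in ("int", "float"):
        lo, hi = profile["min"], profile["max"]
        rules.append(lambda v: v is None or lo <= v <= hi)  # range
    return rules

ages = [34, 29, 41, 37, 25]                  # known-good reference column
rules = derive_rules(profile_column(ages))
# Check incoming values against the derived rules:
violations = [v for v in [33, 150, None] if not all(r(v) for r in rules)]
print(violations)  # → [150, None]
```

Real tools derive richer rules (patterns, referential integrity, distributions), but the principle is the same: the quality of the metadata bounds the quality of the derived rules.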
Data Quality as a prerequisite for your business success: when should I start ...Anastasija Nikiforova
These are slides for my talk "Data Quality as a prerequisite for your business success: when should I start taking care of it?", which I delivered as an invited keynote for the HackCodeX Forum that gathered international experts to share their experience and knowledge on emerging technologies and areas such as Artificial Intelligence, Security, Data Quality, Quantum Computing, Sustainability, Open Data, Privacy, etc.
Public data ecosystems in and for smart cities: how to make open / Big / smar...Anastasija Nikiforova
This is a set of slides used as part of my keynote "Public data ecosystems in and for smart cities: how to make open / Big / smart / geo data ecosystems value-adding for SDG-compliant Smart Living and Society 5.0" delivered at the 5th International Conference on Advanced Research Methods and Analytics (CARMA 2023) -> https://carmaconf2023.wordpress.com/keynote-speakers/. read more here -> https://anastasijanikiforova.com/2023/06/30/keynote-at-the-5th-international-conference-on-advanced-research-methods-and-analytics-carma-2023/
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERSAnastasija Nikiforova
"OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS" is a set of slides prepared for the Guest Lecture I delivered to the students of the University of South-Eastern Norway (USN) in October 2021.
Preprint-WCMRI,IFERP,Singapore,28 October 2022.pdfChristo Ananth
Call for Papers- Special Session: World Conference on Multidisciplinary Research and Innovation (WCMRI-22), (Session 1: Information and Communication Technology), Singapore
Christo Ananth
Professor, Samarkand State University, Uzbekistan
Towards High-Value Datasets determination for data-driven development: a syst...Anastasija Nikiforova
Slides for the talk delivered as part of the EGOV-CeDEM-ePart 2023 (EGOV2023) conference, which examined how HVD determination has been reflected in the literature over the years and what these studies have found to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks, via a Systematic Literature Review.
Read the paper here -> https://link.springer.com/chapter/10.1007/978-3-031-41138-0_14
Smart Data for Behavioural Change: Towards Energy Efficient BuildingsAnna Fensel
“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” - this statement of Tim Berners-Lee has gained even more relevance since the start of this century.
Humanity is rapidly developing and persistently experiencing local and global challenges, such as global warming/climate change and imbalances in demand and supply, among many others. Mastering most (if not all) of them requires a behaviour change. Behavioural change is difficult to achieve per se, and it is important that technology – as a major enabler – has a positive rather than a negative impact here.
Further, the dramatic growth of data volumes (Big Data, Internet of Things) and the data’s increased power and impact on people's daily lives call for new types, practices and policies of behaviour with data.
These factors made the role of semantic technology even more crucial: in terms of providing a well-defined meaning, and eventually delivering Smart Data for a functional and fair data value chain.
Addressing behavioural change with Smart Data, I discuss potential ICT solutions in the domain of energy-efficient buildings. In particular, our completed OpenFridge experiment will be presented: the design and development of an Internet of Things data system with semantic and data analytics enablers for building new services on top of typical home appliance data – in particular, refrigerators. The system has been evaluated in real-life end-user pilots.
In conclusion, I give an overview of our related ongoing work, namely in the areas of the impact of Big Data on society and related research roadmapping (linking to sociology), personalized energy-efficiency data management services in buildings (linking to psychology), and semantic data licensing (linking to law).
Supervised Multi Attribute Gene Manipulation For Cancerpaperpublications3
Abstract: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.
They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery.
Drought is possibly the most complex and least understood of natural hazards. The effects of drought accumulate slowly and linger for years. It is estimated that 380 million people, 38% of the world’s rural poor, live in the arid and semi-arid tropics (SAT). Of those who are vulnerable to drought, more than 90% are either smallholder farmers or landless laborers. The Committee on Science and Technology for the United Nations Convention to Combat Desertification, in its fifth session last year, issued a note on strategies for communicating relevant information on combating the effects of drought.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Carole Goble
Presented at Digital Life 2018, Bergen, March 2018. In the Trust and Accountability session.
In recent years we have seen a change in expectations for the management and availability of all the outcomes of research (models, data, SOPs, software, etc.) and for greater transparency and reproducibility in the method of research. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for stewardship [1] have proved to be an effective rallying cry for community groups and for policy makers.
The FAIRDOM Initiative (FAIR Data Models Operations, http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards and sensitivity to asset sharing and credit anxiety. Our aim is a FAIR Research Commons that blends together the doing of research with the communication of research. The Platform has been installed by over 30 labs/projects and our public, centrally hosted FAIRDOMHub [2] supports the outcomes of 90+ projects. We are proud to support projects in Norway’s Digital Life programme.
2018 is our 10th anniversary. Over the past decade we learned a lot about trust between researchers, between researchers and platform developers and curators and between both these groups and funders. We have experienced the Tragedy of the Commons but also seen shifts in attitudes.
In this talk we will use our experiences in FAIRDOM to explore the political, economic, social and technical practicalities of Trust.
[1] Wilkinson et al (2016) The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
[2] Wolstencroft, et al (2016) FAIRDOMHub: a repository and collaboration environment for sharing systems biology research Nucleic Acids Research, 45(D1): D404-D407. DOI: 10.1093/nar/gkw1032
European Data Science Academy - Enabling Data Driven Digital EuropePersontyle
The ‘Age of Data’ continues to thrive, with data being produced across all industries at a phenomenal rate, introducing numerous challenges regarding the collection, storage and analysis of this data. To address this problem, the European Data Science Academy (EDSA) will establish a virtuous learning production cycle for Data Science.
To learn more about the project visit: http://edsa-project.eu/
Building on iMarine for fostering Innovation, Decision making, Governance and...Blue BRIDGE
BlueBRIDGE - Building Research environments fostering Innovation, Decision making, Governance and Education - is funded under H2020 and provides data services to scientists, researchers and data managers delivering a solid foundation for informed advice to competent authorities. A complete set of web-based data and computational resources will enable them to address key challenges related to the Blue Growth long term strategy with a strong focus on sustainable growth. BlueBRIDGE services will be built on top of the iMarine infrastructure (www.i-marine.eu) in order to capitalize on the previous investments made by the European Commission and as a first step towards their sustainability after the end of the project. www.bluebridge-vres.eu | @BlueBridgeVREs
Artificial Intelligence for open data or open data for artificial intelligence?Anastasija Nikiforova
This is a presentation used to deliver an invited talk for Babu Banarasi Das University (BBDU, Department of Computer Science and Engineering) Development Program «Artificial Intelligence for Sustainable Development» organized by AI Research Centre, Department of Computer Science & Engineering, ShodhGuru Research Labs, Soft Computing Research Society, IEEE UP Section, Computational Intelligence Society Chapter in 2022. Read more here -> https://anastasijanikiforova.com/2022/09/24/ai-for-open-data-or-open-data-for-ai-an-invited-talk-for-bbdu-development-program-artificial-intelligence-for-sustainable-development%f0%9f%8e%a4/
Overlooked aspects of data governance: workflow framework for enterprise data...Anastasija Nikiforova
This presentation is a supplementary material for the article "Overlooked aspects of data governance: workflow framework for enterprise data deduplication" (Azeroual, Nikiforova, Shei) presented at The International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS2023).
Abstract of the paper: Data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), product management, manufacturing, and marketing are used, duplicates, e.g., multiple entries for the same customer or product in a database or information system, occur. There can be several reasons for this, but the result of non-unique or duplicate records is a degraded data quality. This ultimately leads to poorer, inefficient, and inaccurate data-driven decisions. For this reason, in this paper, we develop a conceptual data governance framework for effective and efficient management of duplicate data, and improvement of data accuracy and consistency in large data ecosystems. We present methods and recommendations for companies to deal with duplicate data in a meaningful way.
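The deduplication problem the abstract describes can be illustrated with a tiny detection sketch. The paper proposes a governance workflow framework, not this code; the record fields, normalization choices, and similarity threshold below are illustrative assumptions.

```python
# Illustrative near-duplicate detection: normalize records, then flag
# pairs whose normalized keys are highly similar. Not from the paper.
from difflib import SequenceMatcher

def normalize(record):
    """Canonicalize name + city so trivial spelling/spacing variants collide."""
    return " ".join(record["name"].lower().split()) + "|" + record["city"].lower()

def find_duplicates(records, threshold=0.9):
    """Return index pairs of records whose normalized keys are near-identical."""
    keys = [normalize(r) for r in records]
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs

customers = [
    {"name": "Anna Schmidt", "city": "Berlin"},
    {"name": "Anna  Schmidt", "city": "berlin"},   # same customer, messy entry
    {"name": "John Doe", "city": "Tartu"},
]
print(find_duplicates(customers))  # → [(0, 1)]
```

A governance framework like the one in the paper decides what happens next: whether flagged pairs are merged automatically, quarantined, or routed to a data steward for review.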
Media, information and the promise of new technologies in Knowledge Transfer ...maudelfin
Framework for understanding quantum computing use cases from a multidisciplin...Anastasija Nikiforova
This presentation is a supplementary material for the article "Framework for understanding quantum computing use cases from a multidisciplinary perspective and future research directions" (Ukpabi, D.C., Karjaluoto, H., Botticher, A., Nikiforova, A., Petrescu, D.I., Schindler, P., Valtenbergs, V., Lehmann, L., & Yakaryılmaz, A.), available at https://arxiv.org/ftp/arxiv/papers/2212/2212.13909.pdf. The presentation, however, was delivered at QWorld Quantum Science Days 2023 | May 29-31.
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Anastasija Nikiforova
This presentation was delivered as part of the Data Science Seminar titled “When, Why and How? The Importance of Business Intelligence“ organized by the Institute of Computer Science (University of Tartu) in cooperation with Swedbank.
In this presentation I talked about:
*“Data warehouse vs. data lake – what are they and what is the difference between them?” (structured vs unstructured, static vs dynamic (real-time data), schema-on-write vs schema on-read, ETL vs ELT) with further elaboration on What are their goals and purposes? What is their target audience? What are their pros and cons?
*“Is the Data warehouse the only data repository suitable for BI?” – no, (today) data lakes can also be suitable. Even more, both are considered the key to “a single version of the truth”. Although, if descriptive BI is the only purpose, it might still be better to stay within a data warehouse. But if you want to have predictive BI or use your data for ML (or do not have a specific idea of how you want to use the data, but want to be able to explore your data effectively and efficiently), then a data warehouse might not be the best option.
*“So, the data lake will save my resources a lot, because I do not have to worry about how to store /allocate the data – just put it in one storage and voila?!” – no, in this case your data lake will turn into a data swamp! And you are forgetting about the data quality you should (must!) be thinking of!
*“But how do you prevent the data lake from becoming a data swamp?” – in short and simple terms, proper data governance & metadata management is the answer (but it is not as easy as it sounds – do not forget about your data engineer and be friendly with them [always… literally always :D]), and also think about the culture in your organization.
*“So, the use of a data warehouse is the key to high-quality data?” – no, it is not! Having ETL does not guarantee the quality of your data (transform & load is not data quality management). Think about data quality regardless of the repository!
*“Are data warehouses and data lakes the only options to consider or are we missing something?“– true! Data lakehouse!
*“If a data lakehouse is a combination of the benefits of a data warehouse and a data lake, is it a silver bullet?“ – no, it is not! It is another option (relatively immature) to consider that may be the best fit for you, but not a panacea. Dealing with data is (still) not easy…
In addition, in this talk I also briefly introduced the ongoing research into integrating the data lake as a data repository with data wrangling, seeking increased data quality in IS. In short, this is somewhat like an improved data lakehouse, where we emphasize the need for data governance and data wrangling to be integrated to really get the benefits that data lakehouses promise (although we still call it a data lake, since the data lakehouse is not a sufficiently mature concept and has differing definitions).
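The talk's point that quality must be managed regardless of the repository can be illustrated with a small schema-on-read sketch (purely illustrative; the field names and rules are my own): raw records pulled from a "lake" are validated at read time instead of being assumed clean because some pipeline loaded them.

```python
# Illustrative schema-on-read validation: apply type and format rules when
# raw lake records are consumed; quarantine failures instead of dropping them.
import json
import re

RAW_LAKE = [  # raw JSON strings as they might land in a data lake
    '{"id": 1, "email": "a@example.com", "amount": "19.90"}',
    '{"id": 2, "email": "not-an-email", "amount": "oops"}',
]

def read_with_schema(raw_rows):
    """Apply a schema and quality rules at read time (schema-on-read)."""
    valid, rejected = [], []
    for raw in raw_rows:
        row = json.loads(raw)
        try:
            row["amount"] = float(row["amount"])                  # type rule
            if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"]):
                raise ValueError("bad email")                     # format rule
            valid.append(row)
        except ValueError:
            rejected.append(row)  # quarantine for stewardship, don't silently drop
    return valid, rejected

valid, rejected = read_with_schema(RAW_LAKE)
print(len(valid), len(rejected))  # → 1 1
```

The same checks would be needed behind an ETL pipeline feeding a warehouse; only the point in the flow where they run changes.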
Putting FAIR Principles in the Context of Research Information: FAIRness for ...Anastasija Nikiforova
This presentation is a supplementary material for "Putting FAIR Principles in the Context of Research Information: FAIRness for CRIS and CRIS for FAIRness" (Otmane Azeroual, Joachim Schopfel, Janne Polonen, and Anastasija Nikiforova), a paper presented at the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), where it received the Best Paper Award. In this presentation we raise a discussion on this topic, showing that the improvement of FAIRness is a dual or bidirectional process, where CRIS promotes and contributes to the FAIRness of data and infrastructures, and FAIR principles push for further improvement in the underlying CRIS data model and format, positively affecting the sustainability of these systems and underlying artifacts. CRIS are beneficial for FAIR, and FAIR is beneficial for CRIS.
See the text here -> https://www.scitepress.org/Link.aspx?doi=10.5220/0011548700003335
Cite as -> Azeroual, O.; Schöpfel, J.; Pölönen, J. and Nikiforova, A. (2022). Putting FAIR Principles in the Context of Research Information: FAIRness for CRIS and CRIS for FAIRness. In Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KMIS, ISBN 978-989-758-614-9; ISSN 2184-3228, pages 63-71. DOI: 10.5220/0011548700003335
Open data hackathon as a tool for increased engagement of Generation Z: to h...Anastasija Nikiforova
This is presentation for the paper "Open data hackathon as a tool for increased engagement of Generation Z: to hack or not to hack?" presented at EGETC2022.
A hackathon is known as a form of civic innovation in which participants representing citizens can point out existing problems or social needs and propose a solution. Given the high social, technical, and economic potential of open government data (OGD), the concept of open data hackathons is becoming popular around the world. The concept has become popular in Latvia with annual hackathons organised for a specific cluster of citizens – Generation Z. This study presents the latest findings on the role of open data hackathons and the benefits that they can bring to society, participants, and government. First, a systematic literature review is carried out to establish a knowledge base. Then, empirical research on 4 case studies of open data hackathons for Generation Z participants held between 2018 and 2021 in Latvia is conducted to understand which ideas dominated and what the main results of these events were for the OGD initiative. It demonstrates that, despite the widespread belief that young people are indifferent to current societal and natural problems, the ideas developed correspond to the current situation and are aimed at solving it, revealing aspects for improvement in the provision of data, infrastructure, culture, and government-related areas.
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...Anastasija Nikiforova
This is the presentation for our ongoing study "Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Innovation Resistance Theory" (Anastasija Nikiforova, Anneke Zuiderwijk) presented at ICEGOV2022 conference – 15th International Conference on Theory and Practice of Electronic Governance (nominated to the Best Paper Awards).
In short, the study aims to develop an Open Government Data-adapted Innovation Resistance Theory model to empirically identify predictors affecting public agencies’ resistance to openly sharing government data. Here we want to understand:
💡what are the functional and behavioural factors that facilitate or hamper the opening of government data by public organizations?
💡does IRT provide a new and more complete insight compared to the more traditional UTAUT and TAM? IRT has not yet been applied in this domain, so we are checking whether it should be considered, or whether the models we are so familiar with remain the best ones.
💡and additionally – did the COVID-19 pandemic have an [obvious/significant] effect on public agencies in terms of their readiness or resistance to openly share government data?
Based on a review of the literature on both IRT research and barriers associated with open data sharing by public agencies, we developed an initial version of the model. Once the model is refined in a qualitative study (interviews with public agencies), we will validate it to study the resistance of public authorities to openly sharing government data in a quantitative study.
Read the paper and cite as -> Nikiforova A., Zuiderwijk A. (2022) Barriers to openly sharing government data: towards an open data-adapted innovation resistance theory, In 15th International Conference on Theory and Practice of Electronic Governance (ICEGOV 2022). Association for Computing Machinery, New York, NY, USA, 215–220, https://doi.org/10.1145/3560107.3560143 – best paper award nominee
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRISAnastasija Nikiforova
This presentation is a supplementary material for the "Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS" presented at the 15th International Conference on Current Research Information Systems (CRIS2022) – Linking Research Information across Data Spaces. It provides insight into the ongoing study of combining a data lake as a data repository with data wrangling, seeking increased data quality in CRIS systems; the proposed approach is, however, domain-agnostic and can be used beyond CRIS.
Read the article here -> Azeroual, O., Schöpfel, J., Ivanovic, D., & Nikiforova, A. (2022, May). Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS. In CRIS2022: 15th International Conference on Current Research Information Systems --> https://hal.archives-ouvertes.fr/hal-03694519/
The role of open data in the development of sustainable smart cities and smar...Anastasija Nikiforova
This presentation is a supplementary material for the guest lecture "The role of open data in the development of sustainable smart cities and smart society" I delivered for the Federal University of Technology – Paraná (Universidade Tecnológica Federal do Paraná (UTFPR)) (Brazil, May 2022).
Data security as a top priority in the digital world: preserve data value by ...Anastasija Nikiforova
Today, in the age of information and Industry 4.0, billions of data sources, including but not limited to interconnected devices (sensors, monitoring devices) forming Cyber-Physical Systems (CPS) and the Internet of Things (IoT) ecosystem, continuously generate, collect, process, and exchange data. With the rapid increase in the number of devices and information systems in use, the amount of data is growing, and with the digitization and variety of data continuously produced and processed with reference to Big Data, so is its value. As a result, the risk of security breaches and data leaks is growing too. The value of data, however, depends on several factors, of which data quality and data security – the latter affecting data quality if the data are accessed and corrupted – are the most vital. Data serve as the basis for decision-making and as input for models, forecasts, simulations etc., which can be of high strategic and commercial / business value. This has become even more relevant during the COVID-19 pandemic, which, in addition to affecting the health, lives, and lifestyle of billions of citizens globally and making everyday life even more digitized, has had a significant impact on business, especially given the challenges companies have faced in maintaining business continuity in the so-called "new normal". However, in addition to the cybersecurity threats caused by changes directly related to the pandemic and its consequences, many previously known threats have become even more desirable targets for intruders and hackers. Every year millions of personal records become available online. Moreover, the popularity of IoT search engines (IoTSE) has lowered the complexity of searching for connected devices on the internet: the widespread availability of step-by-step guides on how to use an IoT search engine to find and, if insufficiently protected, gain access to webcams, routers, databases and other artifacts makes this easy even for novices.
Recent research demonstrated that weak data – and in particular database – protection is one of the key security threats. Various measures can be taken to address the issue. The aim of the study to which this presentation refers is to examine whether "traditional" vulnerability registries provide a sufficiently comprehensive view of DBMS security, or whether they should be complemented by intensive and dynamic inspection by DBMS holders using Internet of Things Search Engines, moving towards a sustainable and resilient digitized environment. The paper brings attention to this problem and makes the reader think about data security before looking for and introducing more advanced security and protection mechanisms, which, in the absence of the above, may bring no value.
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...Anastasija Nikiforova
This presentation is devoted to the "IoTSE-based Open Database Vulnerability inspection in three Baltic Countries: ShoBEVODSDT sees you" research paper developed by Artjoms Daskevics and Anastasija Nikiforova and presented at the International Conference on Internet of Things, Systems, Management and Security (IOTSMS2021), co-located with the 8th International Conference on Social Networks Analysis, Management and Security (SNAMS2021), December 6-9, 2021, Valencia, Spain (online).
Read paper here -> Daskevics, A., & Nikiforova, A. (2021, December). IoTSE-based open database vulnerability inspection in three Baltic countries: ShoBEVODSDT sees you. In 2021 8th International Conference on Internet of Things: Systems, Management and Security (IOTSMS) (pp. 1-8). IEEE -> https://ieeexplore.ieee.org/abstract/document/9704952?casa_token=NfEjYuud0wEAAAAA:6QxucVPuY762I3qzD6D_oWqa0B9eMUFRNMG-E7dyHKohSYIzI0bH1V9bLaAcly_Lp-Ll52ghO5Y
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Anastasija Nikiforova
This presentation is a supplementary material for the "Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business" research paper (authored by Anastasija Nikiforova and Natalija Kozmina) presented at the International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based).
Read paper here -> Nikiforova, A., & Kozmina, N. (2021, November). Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 66-73). IEEE -> https://ieeexplore.ieee.org/abstract/document/9660802?casa_token=LFJa20LrXAwAAAAA:wVwhTcCPWqxdloAvDQ3-l98KkkLx70xzG3zNvIIkJbC6wvJ4VxwX_VGc3mmW_7c1T-QJlOtTiao
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...Anastasija Nikiforova
This presentation is devoted to the "ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you" research paper developed by Artjoms Daskevics and Anastasija Nikiforova and presented at the International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based).
Read paper here -> Daskevics, A., & Nikiforova, A. (2021, November). ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 38-45). IEEE.
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...Anastasija Nikiforova
This presentation was prepared as part of my talk on openness (open data and open science) in the context of Society 5.0 at the International Conference and Expo on Nanotechnology and Nanomaterials. It was a pleasure to receive an invitation to deliver a talk on my recently published article Smarter Open Government Data for Society 5.0: Are Your Open Data Smart Enough? (Sensors 2021, 21(15), 5204), which I entitled "Open Data as a driver of Society 5.0: how you and your scientific outputs can contribute to the development of the Super Smart Society and transformation into Smart Living?". The paper was briefly discussed in my previous post, thus just a few words on this talk and the overall experience.
Towards enrichment of the open government data: a stakeholder-centered determ...Anastasija Nikiforova
This set of slides is a part of the presentation prepared and delivered in the scope of the 14th International Conference on Theory and Practice of Electronic Governance (ICEGOV 2021), 6-8 October, 2021, Smart Digital Governance for Global Sustainability
It is based on the paper -> Nikiforova, A. (2021, October). Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia. In 14th International Conference on Theory and Practice of Electronic Governance (pp. 367-372) -> https://dl.acm.org/doi/abs/10.1145/3494193.3494243?casa_token=bPeuwmFWwQwAAAAA:ls-xXIPK5uXDHyxtBxqsMJOCuV6ud_ip59BX8n78uJnqvql6e8H9urlDG9zzeNklRmGFwI4sCXU06w
The open lecture "Atvērto datu potenciāls" (The Potential of Open Data) took place as part of the University of Latvia Faculty of Social Sciences master's course "Datu sabiedrības vadība" (Data Society Management), delivered by Dr.sc.comp. Anastasija Ņikiforova, docent and researcher at the UL Faculty of Computing.
Open data are considered a valuable resource whose use can potentially deliver significant economic, technological and social benefits. Achieving this, however, requires a number of preconditions to be met, relating to the data, the infrastructure and the users alike; that is, the success factor of an open data initiative is the creation and maintenance of a sustainable open government data ecosystem. The aim of the lecture is to provide insight into the popularity of open data and their potential for the development of technological and economic processes, paying attention to their practical applications both in Latvia and beyond, transforming data into (innovative) solutions and services. It is also planned to give insight into the most important aspects that can potentially foster the creation of a sustainable open data ecosystem, enabling anyone interested to transform open data into value.
Dr.sc.comp. Anastasija Ņikiforova is a docent at the Faculty of Computing of the University of Latvia and a researcher in the Innovative Information Technologies Laboratory. Dr. Ņikiforova's research interests relate to data management, especially data quality, and open data related issues. At the UL Faculty of Computing, in addition to other courses she teaches, she has developed the special seminar "Atvērtie dati un datu kvalitāte" (Open Data and Data Quality) and the master's programme course "Atvērtie pārvaldes dati datu-virzītā pasaulē" (Open Government Data in a Data-Driven World). Dr. Ņikiforova is an expert of the Latvian Council of Science in Engineering and Technology (Electrical Engineering, Electronics, Information and Communication Technologies) and Natural Sciences (Computer Science and Informatics), as well as an associate member of LATA (Latvian Open Technology Association). She is a (co-)author of more than 25 scientific papers, 4 of which have been published in top-ranked Q1 journals.
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...Anastasija Nikiforova
This presentation is a supplementary material for the following article -> Nikiforova, A. (2020, October). Timeliness of open data in open government data portals through pandemic-related data: a long data way from the publisher to the user. In 2020 Fourth International Conference on Multimedia Computing, Networking and Applications (MCNA) (pp. 131-138). IEEE.
The paper addresses the "timeliness" of data in open government data (OGD) portals. Timeliness is one of the primary principles of open data and is considered a success factor, while at the same time it is one of the biggest barriers: it can disrupt users' trust in the data and even the desire to use the open data portal at all. However, assessing this aspect is a very difficult task that, in most cases, becomes impossible for open data users, and there is therefore a lack of comparative studies on the timeliness of data across national open data portals. Unfortunately, 2020 provided an opportunity to find this out: it became easy enough to compare how long the data path from the data holder to the OGD portal is by analysing the timeliness of COVID-19-related datasets in relation to the first case observed in a country. The study thus fills the gap in comparative studies by addressing 60 countries and their OGD portals with respect to the timeliness of the data, providing a report on how many, and which, countries provide open data as quickly as possible. It makes it possible to understand how quickly OGD portals react to emergencies by opening and updating data for further potential reuse, which is essential in the digital data-driven world.
Read paper here -> Nikiforova, A. (2020, October). Timeliness of open data in open government data portals through pandemic-related data: a long data way from the publisher to the user. In 2020 Fourth International Conference on Multimedia Computing, Networking and Applications (MCNA) (pp. 131-138). IEEE. https://ieeexplore.ieee.org/abstract/document/9264298?casa_token=FtfC_6bqZnsAAAAA:TaSnKrE7ZCxLyq5hvxX-X8O2sK_vZYcodTBtxoWOvaOAIFmMmy65f5dIK-kKYxFAMiC5jyl7Eeg
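The timeliness measure the study builds on can be sketched in a few lines: the delay between a country's first observed COVID-19 case and the first related dataset appearing on its OGD portal. The country labels and dates below are hypothetical, purely for illustration, not figures from the paper.

```python
from datetime import date

def publication_delay(first_case: date, first_dataset: date) -> int:
    """Days from the first observed case to the first related dataset on the OGD portal."""
    return (first_dataset - first_case).days

# Hypothetical example values, for illustration only
countries = {
    "A": (date(2020, 3, 2), date(2020, 3, 10)),
    "B": (date(2020, 2, 26), date(2020, 5, 1)),
}
delays = {c: publication_delay(fc, fd) for c, (fc, fd) in countries.items()}
print(delays)  # {'A': 8, 'B': 65}
```

Ranking portals by this delay is what allows the 60 countries to be compared on how quickly they react to an emergency.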
This presentation is a supplementary material for the following article -> Nikiforova, A., Bicevskis, J., & Karnitis, G. (2020, December). Towards a Concurrence Analysis in Business Processes. In 2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 1-6). IEEE.
This paper presents the first steps towards a solution aimed at providing a concurrent business process analysis methodology for predicting the probability of incorrect business process execution. The aim of the paper is to (a) look at approaches to describing and dealing with the execution of concurrent processes, mainly focusing on the transaction mechanisms in database management systems, and (b) present the idea and a preliminary version of an algorithm that detects the possibility of incorrect execution of concurrent business processes. Analyzing business processes according to the proposed procedure makes it possible to configure transaction processing optimally.
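The core intuition behind detecting potentially incorrect concurrent execution can be illustrated with a classic read/write-set check borrowed from transaction theory (a simplification I am adding for illustration; the paper's actual algorithm is more elaborate): two concurrently running processes may interfere whenever one writes a data item the other reads or writes.

```python
def may_conflict(p1_reads, p1_writes, p2_reads, p2_writes):
    """Two concurrent processes may interfere if one writes a data item
    the other reads or writes (write-read, read-write or write-write conflict)."""
    r1, w1, r2, w2 = map(set, (p1_reads, p1_writes, p2_reads, p2_writes))
    return bool(w1 & (r2 | w2)) or bool(w2 & r1)

# A process writing "balance" while another reads it -> possible incorrect execution
print(may_conflict({"invoice"}, {"balance"}, {"balance"}, set()))  # True
# Processes touching disjoint data cannot interfere
print(may_conflict({"a"}, {"b"}, {"c"}, {"d"}))                    # False
```

Only process pairs flagged by such a check would need transaction-style protection, which is the sense in which the analysis lets transaction processing be configured optimally.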
DATA QUALITY MODEL-BASED TESTING OF INFORMATION SYSTEMS: THE USE-CASE OF E-SC...Anastasija Nikiforova
This presentation is a supplementary material for the following article -> Nikiforova, A., Bicevskis, J., Bicevska, Z., & Oditis, I. (2020, December). Data quality model-based testing of information systems: the use-case of E-scooters. In 2020 7th International Conference on Internet of Things: Systems, Management and Security (IOTSMS) (pp. 1-8). IEEE.
The paper proposes a data quality model-based testing methodology aimed at improving the testing of information systems (IS) using a previously proposed data quality model. The solution involves creating a description of the data to be processed by the IS and of the data quality requirements used for developing the tests, followed by an automated test run of the system on the generated tests, verifying the correctness of the data to be entered and stored in the database. Generating tests for all possible data quality conditions yields a complete set of tests that verify the operation of the IS under all possible data quality conditions. The proposed solution is demonstrated on a real system dealing with e-scooters. Although it is demonstrated on a system that is already in use, it can also be applied when developing a new system.
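The core idea can be sketched as follows: pair each field with a value that satisfies and a value that violates its data quality condition, enumerate all combinations to obtain a complete test set, and check that the system accepts or rejects each generated record accordingly. The field names and rules below are hypothetical, not the e-scooter system's actual quality model.

```python
from itertools import product

# Hypothetical DQ conditions for two fields of a record
dq_rules = {
    "battery_level": lambda v: isinstance(v, int) and 0 <= v <= 100,
    "scooter_id": lambda v: isinstance(v, str) and v.startswith("SC-"),
}
# One valid and one invalid sample value per field
samples = {
    "battery_level": {"valid": 55, "invalid": -3},
    "scooter_id": {"valid": "SC-001", "invalid": "001"},
}

def generate_tests():
    """Enumerate all valid/invalid combinations -> a complete test set
    covering every data quality condition of the model."""
    fields = list(dq_rules)
    for combo in product(["valid", "invalid"], repeat=len(fields)):
        record = {f: samples[f][kind] for f, kind in zip(fields, combo)}
        expected_ok = all(kind == "valid" for kind in combo)
        yield record, expected_ok

def system_accepts(record):
    """Stand-in for the IS under test: accept only records meeting all DQ rules."""
    return all(rule(record[f]) for f, rule in dq_rules.items())

for record, expected in generate_tests():
    assert system_accepts(record) == expected
print(f"{2 ** len(dq_rules)} generated tests passed")  # 4 generated tests passed
```

With n conditioned fields this produces 2^n tests, which is what makes the generated suite complete with respect to the quality model.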
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives in sequential mode (i.e., sumAt, multiply).
Data Quality for AI or AI for Data quality: advances in Data Quality Management for the success and sustainability of emerging technologies, business and society
1. DATA QUALITY FOR AI OR AI FOR DATA QUALITY:
ADVANCES IN DATA QUALITY MANAGEMENT
FOR THE SUCCESS AND SUSTAINABILITY OF
EMERGING TECHNOLOGIES, BUSINESS AND SOCIETY
ANASTASIJA NIKIFOROVA
University of Tartu, Institute of Computer Science, Estonia
European Open Science Cloud, Task Force «FAIR metrics and data quality»
Expert of the Latvian Council of Sciences, Associate member of the Latvian Open Technology Association
https://anastasijanikiforova.com/
Guest Lecture for the University of South-Eastern Norway (USN), October 2023
2. “It is among the top 1% of the world's universities, making it one of Northern Europe's leading universities and the best-ranked university in the Baltics”
Sources: University of Tartu: Rankings, Fees & Courses Details | Top Universities, University of Tartu | World University Rankings | THE (timeshighereducation.com)
3. PHD IN COMPUTER SCIENCE – DATA PROCESSING SYSTEMS AND DATA NETWORKING
RESEARCH INTERESTS: DATA MANAGEMENT WITH A FOCUS ON DATA QUALITY, PUBLIC ADMINISTRATION, OPEN DATA- AND OPEN GOVERNMENT DATA (ECOSYSTEMS)-RELATED TOPICS, COVERING BOTH TECHNOLOGICAL AND SOCIETAL ASPECTS OF THE ABOVE, SOCIETY 5.0, SDG, SMART CITY, SUSTAINABLE DEVELOPMENT, IOT, HCI AND DIGITIZATION.
✔ASSISTANT PROFESSOR AT THE UNIVERSITY OF TARTU, FACULTY OF SCIENCE AND TECHNOLOGY, INSTITUTE OF COMPUTER SCIENCE,
CHAIR OF SOFTWARE ENGINEERING
✔EUROPEAN OPEN SCIENCE CLOUD TASK FORCE “FAIR METRICS AND DATA QUALITY”
✔EDSC AMBASSADOR (EUROPEAN DIGITAL SKILLS CERTIFICATE, AS PART OF ACTION 9 OF THE DIGITAL EDUCATION ACTION PLAN (2021-
2027) – JRC/SVQ/2022/OP/0013)
✔IFIP WG8.5 ON ICT AND PUBLIC ADMINISTRATION MEMBER
✔ASSOCIATE MEMBER OF THE LATVIAN OPEN TECHNOLOGY ASSOCIATION
✔EXPERT OF THE LATVIAN COUNCIL OF SCIENCES IN (1) NATURAL SCIENCES – COMPUTER SCIENCE & INFORMATICS, (2) ENGINEERING &
TECHNOLOGY-ELECTRICAL ENGINEERING, ELECTRONICS, ICT, (3) SOCIAL SCIENCES – ECONOMICS & BUSINESS
✔EXPERT OF THE COST – EUROPEAN COOPERATION IN SCIENCE & TECHNOLOGY
✔EDITORIAL BOARD MEMBER FOR SEVERAL JOURNALS, PROGRAM COMMITTEE MEMBER FOR SEVERAL INTERNATIONAL
CONFERENCES (20+), PART OF AN ORGANIZING COMMITTEE (5+), INVITED REVIEWER FOR 15+ HIGH-QUALITY JOURNALS
✔VISITING RESEARCHER AT THE DELFT UNIVERSITY OF TECHNOLOGY, FACULTY OF TECHNOLOGY, POLICY AND MANAGEMENT (TPM)
✔ASSISTANT PROFESSOR AT THE FACULTY OF COMPUTING, UNIVERSITY OF LATVIA
✔RESEARCHER IN THE INNOVATION LABORATORY, FACULTY OF COMPUTING, UNIVERSITY OF LATVIA
✔IT-EXPERT AT THE LATVIAN BIOMEDICAL RESEARCH AND STUDY CENTRE, BBMRI-ERIC LV NATIONAL NODE
✔ADVISOR FOR THE INSTITUTE FOR SOCIAL AND POLITICAL STUDIES, UNIVERSITY OF LATVIA
✔DATA SECURITY SOLUTIONS, LATVIA
BRIEFLY ABOUT ME… (MOST RECENT EXPERIENCE FIRST, FOLLOWED BY PAST EXPERIENCE)
9. DATA QUALITY - WHAT, WHY, HOW, 10 BEST PRACTICES & MORE - Enterprise Master Data Management • Profisee
10. DATA … DATA ARE EVERYWHERE
M-Files on Twitter: "Data is the New Oil – Especially in Oil and Gas! https://t.co/zFlrvQqlMs https://t.co/qE3Q4aLNQy" / Twitter
11. DATA … DATA ARE EVERYWHERE
Sources: Premium Vector | Artificial intelligence logo, icon. vector symbol ai, deep learning blockchain neural network concept. machine learning, artificial intelligence, ai. (freepik.com), Top 10 Successful Data Science Companies in 2023 - Learn | Hevo (hevodata.com),
How to Use Business Intelligence (BI) to Improve Organizational Alignment | Wyn Enterprise (grapecity.com), Machine learning logo - Wi6Labs, Business Intelligence Icon Gráfico por aimagenarium · Creative Fabrica, Open Data – GEOAFRICA,
https://www.gartner.com/en/articles/4-emerging-technologies-you-need-to-know-about?utm_medium=social&utm_source=linkedin&utm_campaign=SM_GB_YOY_GTR_SOC_SF1_SM-SWG&utm_content=&sf267111387=1
18. “DATA IS THE NEW OIL” – WHY IT IS NOT?
BUT: DATA, LIKE OIL, is a source of power, and those who control it are establishing themselves as «masters of the universe», just as oil barons did 100 years ago.
Source: Here's Why Data Is Not The New Oil (forbes.com); Image sources: Oil well – Wikipedia, How do we get oil and gas out of the ground? (world-petroleum.org), Customized Silos For Effective Storage of Food | Nextech Solutions (nextechagrisolutions.com)
19. “DATA IS THE NEW OIL” – WHY IT IS NOT?
OIL: a finite resource | DATA: effectively infinitely durable and reusable
OIL: requires huge amounts of resources to be transported to where it is needed | DATA: can be replicated indefinitely & moved around the world at the speed of light, at low cost, through fiber-optic networks
OIL: when used, its energy is lost as heat or light, or permanently converted into another form (e.g., plastic) | DATA: become more useful the more they are used – once processed, data often reveal further applications
OIL: as the world's oil reserves dwindle, extracting it becomes increasingly difficult and expensive | DATA: becoming increasingly available as computer technology advances
OIL: drilling involves damage to the natural environment and exploitation of finite natural resources | DATA: data mining doesn't intrinsically involve damage to the environment or exploitation of finite natural resources (*apart from the electricity used to run the system)
Treating data like oil – storing it in silos – has little benefit and reduces its usefulness.
Source: Here's Why Data Is Not The New Oil (forbes.com); Image sources: Oil well – Wikipedia, How do we get oil and gas out of the ground? (world-petroleum.org), Customized Silos For Effective Storage of Food | Nextech Solutions (nextechagrisolutions.com)
20. “IF WE THINK ABOUT DATA AS A POWER SOURCE OR FUEL, IT WOULD MAKE MORE SENSE TO COMPARE THEM WITH RENEWABLE SOURCES LIKE THE SUN, WIND AND TIDES” – B. Marr, Forbes
Here's Why Data Is Not The New Oil (forbes.com)
Image sources: Letter from the Editor: Here comes the sun (medicalnewstoday.com), A healthy wind | MIT News | Massachusetts Institute of Technology, Tidal phenomenon: high and low tides | Ponant Magazine
21. AMONG OTHER “NUANCES”, DATA QUALITY IS USE-CASE DEPENDENT AND DYNAMIC IN NATURE.
“ABSOLUTE DATA QUALITY” – THE DATA QUALITY LEVEL AT WHICH THE DATA WOULD SATISFY ALL POSSIBLE USE CASES – IS IMPOSSIBLE TO ACHIEVE, BUT IT IS A GOAL TO BE PURSUED.
24. Def. 1: FITNESS-FOR-USE = WARRANTY*
Def. 2: FITNESS-FOR-PURPOSE = UTILITY*
Def. 3: FREE OF ERRORS
*According to ITIL® 4, the framework for the management of IT-enabled services
25. ISO DEFINITION: THE DEGREE TO WHICH DATA SATISFIES THE REQUIREMENTS OF ITS INTENDED PURPOSE (ISO/IEC 25012)
27. NOT ONLY ABOUT WHAT, BUT ALSO ABOUT HOW: IT IS A PROCESS
28. NOT ONLY ABOUT WHAT, BUT ALSO ABOUT HOW: IT IS A PROCESS – THE DATA QUALITY MANAGEMENT PROCESS
31. DATA QUALITY MANAGEMENT PROCESS: THE TOTAL DATA QUALITY MANAGEMENT (TDQM) LIFECYCLE (BY MIT) – DEFINE, MEASURE, ANALYSE, IMPROVE
DEFINE: IDENTIFY RELEVANT DQ DIMENSIONS
MEASURE: PRODUCE DQ METRICS
ANALYSE: IDENTIFY ROOT CAUSES OF DQ PROBLEMS AND DETERMINE THE IMPACT OF POOR DQ
IMPROVE: IDENTIFY AND EMPLOY TECHNIQUES FOR IMPROVING DQ
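The four TDQM phases can be sketched as a loop over dimension-specific metrics. The dimensions, thresholds and toy dataset below are illustrative assumptions of mine, not prescribed by MIT's methodology:

```python
# Illustrative TDQM cycle: DEFINE dimensions, MEASURE metrics,
# ANALYSE shortfalls, IMPROVE, then repeat.
data = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},
]

# DEFINE: relevant DQ dimensions, each with a target level
targets = {"completeness": 1.0, "uniqueness": 1.0}

def measure(records):
    """MEASURE: produce a metric per defined dimension."""
    non_null = sum(r["email"] is not None for r in records)
    unique_ids = len({r["id"] for r in records})
    return {
        "completeness": non_null / len(records),
        "uniqueness": unique_ids / len(records),
    }

def analyse(metrics):
    """ANALYSE: dimensions falling short of their target, i.e. the DQ
    problems whose root causes and impact should be investigated."""
    return {d: m for d, m in metrics.items() if m < targets[d]}

problems = analyse(measure(data))
print(problems)  # both dimensions fall below target on this toy dataset
# IMPROVE would apply e.g. deduplication and enrichment, after which the cycle repeats.
```

The point of the lifecycle is that the loop does not terminate: after IMPROVE, measurement starts again on the changed data.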
32. Lacagnina, C., David, R., Nikiforova, A., Kuusniemi, M. E., Cappiello, C., Biehlmaier, O., Wright, L., Schubert, C., Bertino, A., Thiemann, H., & Dennis, R. (2023). Towards a data quality framework for EOSC. Zenodo. https://doi.org/10.5281/zenodo.7515816
35. IS THERE ANY COMMONLY ACCEPTED DQ DIMENSION CLASSIFICATION?
ISO/IEC 25012: SOFTWARE ENGINEERING – SOFTWARE PRODUCT QUALITY REQUIREMENTS AND EVALUATION (SQUARE) – DATA QUALITY MODEL
https://iso25000.com/index.php/en/iso-25000-standards/iso-25012/136-iso-iec-2012
36. DIMENSIONS VARY IN DEFINITION AND SCOPE: ONE AND THE SAME NOTION CAN REFER TO DIFFERENT DIMENSIONS, AND ONE AND THE SAME DIMENSION CAN HAVE DIFFERENT NOTIONS [IN DIFFERENT SOURCES].
DATA QUALITY RULES ARE THEN DEFINED FOR EACH DIMENSION, AND METRICS ARE THEN SELECTED FOR THEM.
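The dimension → rule → metric chain the slide describes can be sketched as follows; the column names, the IBAN-style pattern and the pass-rate metric are my illustrative assumptions, not taken from any particular standard's rule set:

```python
import re

records = [
    {"iban": "LV80BANK0000435195001", "age": 34},
    {"iban": "not-an-iban", "age": -5},
]

# Each DQ dimension is operationalized through one or more checkable rules...
rules = {
    "syntactic accuracy": lambda r: re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{13,30}", r["iban"]) is not None,
    "plausibility": lambda r: 0 <= r["age"] <= 120,
}

# ...and a metric is then selected for each rule: here, the pass rate over the dataset.
metrics = {
    dim: sum(rule(r) for r in records) / len(records)
    for dim, rule in rules.items()
}
print(metrics)  # {'syntactic accuracy': 0.5, 'plausibility': 0.5}
```

Because different sources attach different notions to the same dimension name, it is the rule and metric, not the dimension label, that make a DQ requirement unambiguous.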
37. A SIMPLER, USER-ORIENTED APPROACH BASED ON USER-DEFINED DATA QUALITY REQUIREMENTS
Nikiforova, A. (2020). Definition and Evaluation of Data Quality: User-Oriented Data Object-Driven Approach to Data Quality Assessment. Baltic Journal of Modern Computing, 8(3).
45. DQ TOOLS FOR (SEMI-)AUTOMATED DQM:
✓ STANDARDIZATION, NORMALIZATION AND PARSING
✓ MATCHING / DEDUPLICATION AND MERGING
✓ DATA CLEANSING
✓ VALIDATION
✓ DATA PROFILING / AUDITING
✓ A FEW OF THEM SUPPORT (SEMI-)AUTOMATED DQ RULE RECOGNITION
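To give a flavour of what such tools automate, here is a toy sketch of two of the listed capabilities, profiling a column and flagging near-duplicates after normalization; a real DQ tool's matching logic is far more sophisticated (fuzzy matching, survivorship rules, etc.):

```python
from collections import Counter

rows = ["Riga", "riga ", "Tartu", None, "Tartu"]

# DATA PROFILING: basic column statistics a DQ tool would report
profile = {
    "count": len(rows),
    "nulls": sum(v is None for v in rows),
    "distinct": len({v for v in rows if v is not None}),
}

# STANDARDIZATION + MATCHING/DEDUPLICATION: normalize, then group duplicates
def normalize(v):
    return v.strip().lower() if isinstance(v, str) else v

dupes = {v: n for v, n in Counter(normalize(v) for v in rows if v).items() if n > 1}
print(profile)  # {'count': 5, 'nulls': 1, 'distinct': 3}
print(dupes)    # {'riga': 2, 'tartu': 2}
```

Note how "Riga" and "riga " only count as duplicates after standardization, which is why the capabilities above are usually chained rather than applied in isolation.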
46. Systematic Search of DQ Tools:
❏ Research papers: searched from Scopus using keywords
❏ Technology reviewers: 16 technology reviewers – 128 DQ tools
❏ Suggestions by DQ professionals
Martinsaari, H. (2023). Toward an Automated Data Quality Rule Detection in Data Warehouses. Master Thesis (supervisor: Anastasija Nikiforova)
47. Tool Environment and Connectivity: DQ management is closely related to other information management functionalities, such as metadata management and master data management.
48. 10 DQ tools out of 151 are able to detect DQ rules in a DW. DQ rules were mainly discovered using metadata, built-in rules and machine learning.
54. NO ONE-SIZE-FITS-ALL: DATA OBJECT (DATASET, DATABASE, DATA REPOSITORY), INFORMATION SYSTEM, SOFTWARE; DATA OWNER: KNOWN vs THIRD-PARTY
Nikiforova, A. (2020). Definition and Evaluation of Data Quality: User-Oriented Data Object-Driven Approach to Data Quality Assessment. Baltic Journal of Modern Computing, 8(3).
Nikiforova, A. (2018). Open Data Quality Evaluation: A Comparative Analysis of Open Data in Latvia
Nikiforova, A. (2019). Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context
Nikiforova, A. (2020, October). Timeliness of open data in open government data portals through pandemic-related data: a long data way from the publisher to the user
The most frequently occurring data quality issues (for OGD) are: (a) contextual data quality issues; (b) empty values, even for primary data; (c) multiple denotations for the same object within one data object and even one parameter; (d) issues with interrelated parameters.
55. NO ONE-SIZE-FITS-ALL: DATA OBJECT (DATASET, DATABASE, DATA REPOSITORY), INFORMATION SYSTEM, SOFTWARE; DATA STRUCTURE: STRUCTURED, SEMI-STRUCTURED, UNSTRUCTURED DATA
Image sources: https://monkeylearn.com/blog/semi-structured-data/, https://www.pngitem.com/middle/ioJTTbR_organization-structure-icon-png-download-structures-icon-png/
58. THINK DATA QUALITY FIRST!!! OR TOWARDS DATA
QUALITY BY DESIGN
Guerra-García, C., Nikiforova, A., Jiménez, S., Perez-Gonzalez, H. G., Ramírez-Torres, M., & Ontañon-García, L. (2023). ISO/IEC 25012-based methodology for managing data quality requirements in the development of information systems: Towards Data Quality by Design. Data & Knowledge Engineering, 145, 102152.
DAQUAVORD – A METHODOLOGY FOR PROJECT MANAGEMENT OF DATA QUALITY REQUIREMENTS SPECIFICATION, AIMED AT ELICITING DQ REQUIREMENTS ARISING FROM DIFFERENT USERS’ VIEWPOINTS.
THESE DQ REQUIREMENTS SERVE AS DATA QUALITY SOFTWARE REQUIREMENTS DURING THE DEVELOPMENT OF SOFTWARE THAT TAKES DATA QUALITY INTO ACCOUNT BY DEFAULT.
IT IS BASED ON THE VIEWPOINT-ORIENTED REQUIREMENTS DEFINITION (VORD) METHOD AND THE LATEST, MOST WIDELY ACCEPTED ISO/IEC 25012 STANDARD.
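The idea of DQ requirements becoming software requirements can be sketched as executable checks built into the system. Everything below (viewpoints, field names, thresholds) is invented for illustration; only the pattern of labelling each requirement with an ISO/IEC 25012 characteristic follows the methodology above.

```python
# Sketch of 'Data Quality by Design': elicited DQ requirements (per user
# viewpoint, labelled with an ISO/IEC 25012 characteristic) become executable
# checks that the information system runs by default. Viewpoints, fields and
# thresholds are invented for illustration.

DQ_REQUIREMENTS = [
    # (viewpoint, ISO/IEC 25012 characteristic, check)
    ("clerk",   "completeness", lambda rec: bool(rec.get("name"))),
    ("analyst", "accuracy",     lambda rec: 0 <= rec.get("age", -1) <= 120),
]

def validate(record):
    """Return violated requirements as (viewpoint, characteristic) pairs."""
    return [(vp, ch) for vp, ch, check in DQ_REQUIREMENTS if not check(record)]

print(validate({"name": "Ann", "age": 34}))  # []
print(validate({"name": "", "age": 250}))    # [('clerk', 'completeness'), ('analyst', 'accuracy')]
```

Keeping the viewpoint attached to each rule preserves who required it, which matters when requirements from different viewpoints conflict.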
59. DATA OBJECT: DATASET · DATABASE · DATA REPOSITORY · INFORMATION SYSTEM · SOFTWARE
DATA WAREHOUSE · DATA LAKE – maybe even something else?
NO ONE-SIZE-FITS-ALL
60. DATA OBJECT: DATASET · DATABASE · DATA REPOSITORY · INFORMATION SYSTEM · SOFTWARE
Running Analytics on the Data Lake - The Databricks Blog
NO ONE-SIZE-FITS-ALL
62. Implementing a Data Lake or Data Warehouse Architecture for Business Intelligence? | by Lan Chu | Towards Data Science
NB: EXTRACT-TRANSFORM-LOAD
IS NOT DQM!!!
65. Image source: The abstracted future of data engineering | by Justin Gage | Datalogue | Medium
OR HOW TO AVOID GIGO*?
*“GARBAGE IN, GARBAGE OUT”
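One way to avoid GIGO is a validation gate at load time: plain ETL only moves and reshapes data, so the quarantine step below is the extra DQM part. A minimal sketch; field names and rules are invented:

```python
# Sketch of a DQ gate at load time to avoid GIGO: records failing basic checks
# are quarantined instead of being loaded. Plain ETL moves data; this extra
# validation step is the DQM part. Field names and rules are invented.

def is_valid(row):
    """Minimal checks: an id must exist and amounts must be non-negative."""
    return row.get("id") is not None and row.get("amount", 0) >= 0

def load_with_quarantine(rows):
    loaded, quarantined = [], []
    for row in rows:
        (loaded if is_valid(row) else quarantined).append(row)
    return loaded, quarantined

rows = [{"id": 1, "amount": 10}, {"id": None, "amount": 5}, {"id": 2, "amount": -3}]
ok, bad = load_with_quarantine(rows)
print(len(ok), len(bad))  # 1 2
```

Quarantining (rather than silently dropping) matters: the rejected records are themselves a DQ signal that can be fed back to the data producer.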
66. DATA LAKE FOR BI
BUSINESS DATA LAKE
https://www.capgemini.com/wp-content/uploads/2017/07/pivotal_data_lake_vs_traditional_bi_20140805.pdf
67. DATA LAKE
+
DATA WRANGLING
[an asset, not a silver bullet]
✔
Source: https://monkeylearn.com/blog/data-wrangling/, https://www.altair.com/what-is-data-wrangling/ , https://pediaa.com/what-is-the-difference-between-data-wrangling-and-data-cleaning
69. THE DATA WRANGLING PROCESS TO PREPARE DATA AND INTEGRATE IT INTO AN IS
DEPENDING ON THE IS AND THE DESIRED OR REQUIRED TARGET QUALITY*, INDIVIDUAL STEPS SHOULD BE CARRIED OUT SEVERAL TIMES ➔ !!! DATA WRANGLING IS A CONTINUOUS PROCESS THAT REPEATS AT REGULAR INTERVALS !!!
Information System
Azeroual, O., Schöpfel, J., Ivanovic, D., & Nikiforova, A. (2022). Combining data lake and
data wrangling for ensuring data quality in CRIS. Procedia Computer Science, 211, 3-16.
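The repeat-until-target-quality loop implied above can be sketched as follows. The quality metric, the cleaning step, and the target value are deliberately simplistic placeholders:

```python
# Sketch of the loop implied above: clean, re-measure quality, repeat until the
# target quality is reached. Metric, cleaning step and target are invented
# placeholders for illustration.

def quality(values):
    """Toy quality score: share of non-empty values without stray whitespace."""
    return sum(1 for v in values if v and v == v.strip()) / len(values)

def clean_step(values):
    """One wrangling pass: trim surrounding whitespace."""
    return [v.strip() for v in values]

values = ["  Riga", "Liepaja ", "", "Ventspils"]
target = 0.75                      # required target quality (assumed)
for _ in range(5):                 # bounded, repeated passes
    if quality(values) >= target:
        break
    values = clean_step(values)
print(round(quality(values), 2))   # 0.75; the empty value still caps the score
```

The bounded loop also illustrates the point that wrangling is an asset, not a silver bullet: some defects (here, the empty value) survive every cleaning pass and need a different remedy.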
71. DATA LAKE VS DATA WAREHOUSE: HOW TO GET THE ADVANTAGES OF BOTH?
DATA LAKEHOUSE
72. DATA LAKEHOUSE IS SEEN AS A COMBINATION OF DATA WAREHOUSING WORKLOADS & DATA LAKE ECONOMICS
Running Analytics on the Data Lake - The Databricks Blog
73. Running Analytics on the Data Lake - The Databricks Blog, Build a Lake House Architecture on AWS | AWS Big Data Blog (amazon.com), The Data Lakehouse, the Data Warehouse and a Modern Data platform architecture - Microsoft Community Hub
74. WHAT DOES THE DQM APPROACH DEPEND ON? DATA ARTIFACT · DEFINITION · USER · TIME · DIMENSION · PROCESS · PURPOSE
77. https://www.gqindia.com/get-smart/content/5-things-elon-musk-did-to-become-one-of-the-richest-men-in-the-world
MUSK’S TOP PRIORITY: TO IMPROVE THE PRODUCT…
Q: HOW DOES ONE ENSURE THE RELIABILITY OF DATA AND OF DECISIONS MADE BASED ON THAT DATA?
THE ANSWER LIES NOT IN MANAGING THE DATA ALONE, BUT ALSO IN MANAGING THE INFORMATION AROUND AND ABOUT DATA ACQUISITION, TRANSFORMATION AND VISUALIZATION, SO AS TO PROVIDE A BETTER UNDERSTANDING AND SUPPORT DECISION-MAKERS,
BY FOCUSING ON SUSTAINABLE DATA, CLEAR DATA GOVERNANCE AND STRONG DATA MANAGEMENT.
82. DATA MESH IS A NEW TREND!?
https://www.edq.com/blog/data-quality-vs-data-governance/, What is a data mesh? | IBM
A DATA MESH IS A DECENTRALIZED DATA ARCHITECTURE* THAT ORGANIZES DATA BY A SPECIFIC BUSINESS DOMAIN (E.G., MARKETING, SALES, CUSTOMER SERVICE), GIVING MORE OWNERSHIP TO THE PRODUCERS OF A GIVEN DATA(SET) ➔ DEMOCRATIZING DATA ACROSS A LARGE ORGANIZATION
*FOCUSES ON ORGANIZATIONAL CHANGE
“A data mesh involves a cultural shift in the way that companies think about their data”
83. DATA MESH IS A NEW TREND!?
Data Lakehouse, Data Mesh, and Data Fabric (r2) | PPT (slideshare.net)
96. FOR FURTHER READING IN CASE OF INTEREST…
✓ Nikiforova, A. (2020). Definition and Evaluation of Data Quality: User-Oriented Data Object-Driven Approach to Data Quality Assessment. Baltic Journal of
Modern Computing, 8(3).
✓ Guerra-García, C., Nikiforova, A., Jiménez, S., Perez-Gonzalez, H. G., Ramírez-Torres, M., & Ontañon-García, L. (2023). ISO/IEC 25012-based methodology for
managing data quality requirements in the development of information systems: Towards Data Quality by Design. Data & Knowledge Engineering, 145,
102152.
✓ Lacagnina, C., David, R., Nikiforova, A., Kuusniemi, M. E., Cappiello, C., Biehlmaier, O., ... & Dennis, R. (2022). Towards a Data Quality Framework for EOSC. EOSC Association.
✓ Nikiforova, A. (2020, October). Timeliness of open data in open government data portals through pandemic-related data: a long data way from the publisher
to the user. In 2020 Fourth International Conference on Multimedia Computing, Networking and Applications (MCNA) (pp. 131-138). IEEE.
✓ Azeroual, O., Jha, M., Nikiforova, A., Sha, K., Alsmirat, M., & Jha, S. (2022). A record linkage-based data deduplication framework with DataCleaner extension. Multimodal Technologies and Interaction, 6(4), 27.
✓ Azeroual, O., Nikiforova, A., & Sha, K. (2023, June). Overlooked Aspects of Data Governance: Workflow Framework For Enterprise Data Deduplication. In
2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS) (pp. 65-73). IEEE.
✓ Azeroual, O., Schöpfel, J., Ivanovic, D., & Nikiforova, A. (2022). Combining data lake and data wrangling for ensuring data quality in CRIS. Procedia
Computer Science, 211, 3-16.
✓ Nikiforova, A., Bicevskis, J., Bicevska, Z., & Oditis, I. (2020, December). Data quality model-based testing of information systems: the use-case of E-scooters. In 2020 7th International Conference on Internet of Things: Systems, Management and Security (IOTSMS) (pp. 1-8). IEEE.
✓ Nikiforova, A., & Kozmina, N. (2021, November). Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business. In 2021
Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 66-73). IEEE.