Data Quality Mining in Ontologies for Utilities

Fabian Grüning¹

Abstract

The energy market in Europe is undergoing diversification. Under pressure from the European Union, the former monopolies are being forced to open the market and allow competition from newcomers, with the aim of an economically and ecologically more efficient energy market. This new situation has implications for the IT infrastructure of those companies: databases that have grown over time have to be integrated into the new company structures. As with every long-lived database, and especially with a data integration process, problems with data quality emerge. In a current project we decided to use the domain ontology CIM (Common Information Model) to remodel and integrate those data sources. This paper deals with the data quality problems resulting from the long-used databases and the integration process. To this end, a definition of data quality is introduced, an overview of the algorithm classes needed to fulfil the defined requirements is given, and the term data quality mining is explained. The topics discussed are then represented in an ontology that specifies the knowledge needed for, and the processes of, data quality mining, and that will be merged with the domain ontology in order to carry out data quality management.

1. Introduction

In the field of knowledge engineering, the use of ontologies is becoming established. In a current project to re-engineer the IT infrastructure of a utility, the CIM (International Electrotechnical Commission 2003) data model is used, which can be expressed as RDF (Klyne/Carroll (eds.) 2004) or OWL (Dean/Schreiber (eds.) 2004). A subproject considers data quality issues in the restructuring process.
This paper deals with the first steps of that task: the term "data quality" is defined, and appropriate algorithm classes are identified that meet the requirements for achieving the data quality aims so defined. The knowledge about data quality, the algorithms that can be used to meet the requirements, and the processes of data mining are then modelled as an ontology, which can later be used to enrich the CIM with data quality considerations during the design of the main ontology.

This paper is structured as follows. Chapter 2 describes the meaning of data quality by introducing a data quality definition. Chapter 3 motivates the term data quality mining and describes the algorithm classes needed for assuring data quality. The ontology for data quality mining is developed in chapter 4. Finally, chapter 5 draws conclusions and highlights further work.

¹ Carl von Ossietzky Universität, Department für Informatik, Escherweg 2, 26121 Oldenburg, Germany, Email: fabian.gruening@informatik.uni-oldenburg.de, Internet: http://www.informatik.uni-oldenburg.de/

10.05.2010, Fabian Grüning
2. What is data quality?

The meaning of data quality has been discussed in several publications (Redman 1996, Bovee/Srivastava/Mak 2001, Wand/Wang 1996, English 1999, Hinrichs 2002). It is reasonable to choose a definition that fits the context of the application in which data quality is to be measured and improved. As the CIM mainly holds technical data about the electrical network, with possible extensions to business data, scheduling, and reservation, the definition from Redman fits this application domain. Its data quality dimensions are accuracy, completeness, consistency, and currency.

3. What is data quality mining?

Two terms are mainly used in connection with data quality: data quality management and data quality mining. The former is established in the data quality community, while the latter is relatively new and implies the use of data mining algorithms for the purpose of improving data quality (Hipp/Güntzer/Grimmer 2001). As this approach is used in this project, we also speak of data quality mining.

Having discussed the characteristics of data quality in the previous chapter, we now need to identify algorithm classes that are able to fulfil the requirements mentioned. These are the well-known algorithms of statistical process control, record linkage, logging of the database's activities, and especially classification as a data mining algorithm, each of which is connected with the data quality dimensions.

4. An ontology for data quality mining

The data structure used in the project to integrate several long-used databases and to plug new components into the processes of the utility is the CIM, an ontology of the energy market already used by utilities around the world.
This chapter aims to show the usefulness of ontologies in the process of data quality mining. We present an ontology for data quality mining that models the connections between the data quality dimensions, the algorithm classes useful for measuring and assuring data quality, and the processes of data quality management.

This ontology can then be used to link aspects of data quality with the concepts of a domain ontology, e.g. the CIM. This draws on the domain experts' knowledge about spots in the data management with potentially low data quality and gives them a simple tool to express those presumptions in the design process without needing deep knowledge about the aspects of data quality. The algorithms for detecting and correcting data quality problems are then configured using this knowledge and can therefore be applied more precisely and efficiently than without the experts' expertise.

We hope that domain experts will make heavy use of this convenient way of adding data quality aspects to a domain ontology, so that fine-tuning the applied algorithms noticeably improves the outcome of the data quality mining.

Criteria of data quality

As discussed in chapter 3, the criteria of data quality and the algorithms used to meet those criteria are connected to each other. The first part of the ontology, as shown in figure 1, explicitly specifies these relations.
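To make the idea of such a relation concrete, the following is a minimal sketch of how the "is covered by ... method" links between criteria and algorithm classes might be held as a plain lookup table. It is not the ontology itself: the Python names (`Criterion`, `AlgorithmClass`, `COVERS`, `methods_covering`) are hypothetical, and only the classification entry (covering completeness, consistency, and accuracy) is taken from this paper. The remaining entries would have to be read off figure 1 and are deliberately left out.

```python
from enum import Enum


class Criterion(Enum):
    """The four data quality criteria adopted from Redman's definition."""
    ACCURACY = "accuracy"
    COMPLETENESS = "completeness"
    CONSISTENCY = "consistency"
    CURRENCY = "currency"


class AlgorithmClass(Enum):
    """The four algorithm classes named in chapter 3."""
    STATISTICAL_PROCESS_CONTROL = "statistical process control"
    RECORD_LINKAGE = "record linkage"
    ACTIVITY_LOGGING = "logging of database activities"
    CLASSIFICATION = "classification"


# "is covered by ... method" relation, keyed by algorithm class.
# Assumption: only the classification entry is stated in the text;
# the other entries are not invented here and stay absent.
COVERS = {
    AlgorithmClass.CLASSIFICATION: {
        Criterion.COMPLETENESS,
        Criterion.CONSISTENCY,
        Criterion.ACCURACY,
    },
}


def methods_covering(criterion):
    """Return the algorithm classes recorded as covering a criterion."""
    return [method for method, criteria in COVERS.items() if criterion in criteria]


print([m.value for m in methods_covering(Criterion.ACCURACY)])
```

A lookup of this kind answers the question a configuration step would ask: given a criterion that a domain expert flags for a concept, which algorithm classes are candidates for measuring or improving it.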
As an example, algorithms for classification may be used for measuring and improving the data quality criteria completeness, consistency, and accuracy.

Figure 1: Graphical representation of the ontology for the applicability of algorithm classes regarding the data quality aims ("iCB...M" short for "is covered by ... method").

Classification of data quality criteria

The semantics of the next part of the ontology has also already been discussed in chapter 3. The methods for analysing and improving data quality can be divided into two different classes according to whether they apply to whole entities or only to single attributes of an entity. Figure 2 shows that only record linkage, which detects duplicate representations of real-world entities, is applicable to entities, in contrast to the other methods, which are applied to the attributes of the entities.

Figure 2: Graphical representation of the classification of algorithm classes regarding the applicability on entity or attribute level.

The knowledge of the connection between the identified algorithms and the data quality criteria is modelled with the two parts of the ontology presented in this and the previous subchapter.

Processes of data quality mining: finite state machines

We now come to the ontological representation of the processes of data quality mining. For each of the four classes in the ontology that get applied to data samples, a certain sequence of tasks has to be executed.
Such a process can be represented as a finite state machine, with the states being the tasks of the process and the transitions being the conditions for reaching the next state. Finite state machines can easily be expressed in ontologies by defining states, transitions, etc. (see e.g. Dolog 2004).

5. Conclusions and further work

We introduced a consistent data quality concept together with the appropriate algorithm classes, partly from the field of data mining, to achieve the specified aims of data quality. By building an ontology of data quality mining, we modelled the knowledge about the data quality criteria, the applicability of the algorithms with regard to the aims of data quality and to their execution on entity or attribute level, and the processes behind data quality mining.

The next step is to enrich the ontology of the CIM with the concepts of the data quality mining ontology in order to evaluate the advantage of considering data quality while modelling the data of a certain domain. Concrete implementations of the mentioned data quality algorithm classes are then needed to test the usability of the presented approach.

We hope that the possibility of considering data quality aspects in the design process of an ontology of a real-world extract leads to better data quality in a production system by drawing on the experts' domain knowledge. Using ontologies both for modelling the real-world extract and for the data quality considerations will hopefully lower the barrier for designers to take data quality aspects into account and thereby improve the outcome of the data quality mining.

Bibliography

Bovee, M.; Srivastava, R. P.; Mak, B. R.: A Conceptual Framework and Belief-Function Approach to Assessing Overall Information Quality, in: Proceedings of the 6th International Conference on Information Quality (ICIQ 01), Boston, MA, 2001.

Dean, M.; Schreiber, G. (eds.): OWL Web Ontology Language Reference, W3C Recommendation 10 February 2004, http://www.w3.org/TR/owl-ref/ (last access: 2006-02-20), 2004.

Dolog, P.: The Ontology for State Machines, http://www.l3s.de/~dolog/fsm/ (last access: 2006-05-08), 2004.

English, L. P.: Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits, Wiley Computer Publishing, 1999.

Hinrichs, H.: Datenqualitätsmanagement in Data Warehouse-Systemen, Dissertation, Universität Oldenburg, 2002.

Hipp, J.; Güntzer, U.; Grimmer, U.: Data Quality Mining – Making a Virtue of Necessity, in: Proceedings of the 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2001), pages 52-57, Santa Barbara, California, 2001.

International Electrotechnical Commission: International Standard IEC 61970-301: Energy management system application program interface (EMS-API) – Part 301: Common Information Model (CIM) Base, International Electrotechnical Commission, 2003.

Klyne, G.; Carroll, J. J. (eds.): Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation 10 February 2004, http://www.w3.org/TR/rdf-concepts/ (last access: 2006-02-20), 2004.

Redman, T. C.: Data Quality for the Information Age, Artech House, 1996.

Wand, Y.; Wang, R. Y.: Anchoring Data Quality Dimensions in Ontological Foundations, Communications of the ACM, vol. 39, no. 11, 1996.