Data Mining on Incomplete Data



  1. 1. Interactive Visual Data Mining Shouhong Wang* Department of Marketing/Business Information Systems University of Massachusetts Dartmouth 285 Old Westport Road Dartmouth, MA 02747-2300 USA voice: +1 508-999-8579 fax: +1 508-999-8646 email: Hai Wang Department of Finance and Management Science Saint Mary's University 903 Robie Street Halifax, NS B3H 2W3 Canada voice: +1 902-496-8231 email: (* Corresponding author)
  2. 2. Interactive Visual Data Mining Shouhong Wang, University of Massachusetts Dartmouth, USA Hai Wang, Saint Mary's University, Canada INTRODUCTION In the data mining field, it is widely accepted that high-level information (or knowledge) can be extracted from a database through the use of algorithms. However, a one-shot knowledge deduction is based on the assumption that the model developer knows the structure of the knowledge to be deduced. This assumption may not be valid in general. Hence, a general proposition for data mining is that, without human-computer interaction, any knowledge discovery algorithm (or program) will fail to meet the needs of a data miner who has a novel goal (Wang & Wang, 2002). Recently, interactive visual data mining techniques have opened new avenues in the data mining field (Chen, Zhu, & Chen, 2001; Shneiderman, 2002; Han, Hu, & Cercone, 2003; de Oliveira & Levkowitz, 2003; Yang, 2003). Interactive visual data mining differs from traditional data mining, standalone knowledge deduction algorithms, and one-way data visualization in many ways. Briefly, interactive visual data mining is human centered, and is implemented through knowledge discovery loops coupled with human-computer interaction and visual representations. Interactive visual data mining attempts to extract unsuspected and potentially useful patterns from the data for data miners with novel goals, rather than to use the data to derive certain information based on an a priori human knowledge structure.
  3. 3. BACKGROUND A single generic knowledge deduction algorithm is insufficient to handle the variety of goals of data mining, since a goal of data mining is often tied to its specific problem domain. In fact, knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). By this definition, two aspects of knowledge discovery are important to meaningful data mining. First, the criteria of validity, novelty, and usefulness of the knowledge to be discovered can be subjective. That is, the usefulness of a data pattern depends on the data miner and not solely on the statistical strength of the pattern. Second, heuristic search in combinatorial spaces built on computer and human interaction is useful for effective knowledge discovery. One strategy for effective knowledge discovery is the use of human-computer collaboration. One technique used for human-computer collaboration in the business information systems field is data visualization (Montazemi & Wang, 1988; Bell, 1991), which is particularly relevant to data mining (Keim & Kriegel, 1996; Wang, 2002). On the human side, graphics cognition and problem solving are the two major concepts of data visualization. It is a commonly accepted principle that visual perception is compounded out of processes in a way that is adaptive to the visual presentation and the particular problem to be solved (Newell & Simon, 1972; Kosslyn, 1980). MAIN FOCUS The current research theme in this field is the major components of interactive visual data mining and the functions through which they make data mining more effective. Wang and Wang (2002) have
  4. 4. developed a model of interactive visual data mining for human-computer collaboration knowledge discovery. According to this model, an interactive visual data mining system has three components on the computer side, besides the database: a data visualization instrument, a data and model assembly, and a human-computer interface. Data Visualization Instrument Data visualization instruments are tools for presenting data in human understandable graphics, images, or animation. While there are many techniques for data visualization, such as various statistical charts with colors and animations, the self-organizing maps (SOM) method based on the Kohonen neural network (Kohonen, 1989) has become one of the promising techniques of data visualization in data mining. SOM is a dynamic system that can learn the topological relations and abstract structures in high-dimensional input vectors and represent them in a low-dimensional space. These low-dimensional presentations can be viewed and interpreted by humans in discovering knowledge (Wang, 2000). Data and Model Assembly The data and model assembly is a set of query functions that assemble the data and data visualization instruments for data mining. Query tools are characterized by structured query language (SQL), the standard query language for relational database systems. To support human-computer collaboration effectively, query processing is necessary in data mining. As the ultimate objective of data retrieval and presentation is the formulation of knowledge, it is difficult to create a single standard query language for all purposes of data mining. Nevertheless,
  5. 5. the following functionalities can be implemented through the design of queries that support the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge. (1) Schematics examination - Through this query function, the data miner can set different values for the parameters of the data visualization instrument to perceive various schematic visual presentations. (2) Consistency examination - To cross-check the data mining results, the data miner may choose different sets of data from the database to check whether the conclusion from one set of data is consistent with the others. This query function allows the data miner to make such a consistency examination. (3) Relevancy examination - It is a fundamental principle that, to validate a data mining result, one must use external data, which are not used in generating this result but are relevant to the problem being investigated. For instance, the data of customer attributes can be used for clustering to identify significant market segments for the company. However, to determine whether the market segments are relevant to a particular product, one must use separate product survey data. This query function allows the data miner to use various external data to examine the data mining results. (4) Dependability examination - The concept of dependability examination in interactive visual data mining is similar to that of factor analysis in traditional statistical analysis, but the dependability examination query function is more comprehensive in determining whether a variable contributes to the data mining results in a certain way. (5) Homogeneousness examination - Knowledge formulation often needs to identify the ranges of values of a determinant variable so that observations with values in a certain range of this variable have a homogeneous behavior. This query function provides an interactive mechanism for the data miner to decompose variables for homogeneousness examination.
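Two of the query functions listed above, consistency examination and homogeneousness examination, can be sketched in a few lines. This is only an illustrative sketch in plain Python rather than SQL, and it is not the query design of Wang and Wang (2002). The function names `homogeneity_by_range` and `consistent_across_subsets`, the cut points, the tolerance, and the (age, spend) records are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical records: (determinant variable, outcome), e.g. (age, spend).
rows = [(22, 10.0), (25, 12.0), (31, 11.0), (45, 40.0),
        (48, 42.0), (52, 41.0), (67, 20.0), (70, 22.0)]


def homogeneity_by_range(records, cut_points):
    """Homogeneousness examination (sketch): split the determinant variable
    into half-open ranges [lo, hi) and summarize the outcome in each range."""
    edges = [float("-inf")] + sorted(cut_points) + [float("inf")]
    summary = []
    for lo, hi in zip(edges, edges[1:]):
        outcomes = [y for x, y in records if lo <= x < hi]
        if outcomes:  # skip ranges that contain no observations
            summary.append({"range": (lo, hi), "n": len(outcomes),
                            "mean": mean(outcomes),
                            "spread": pstdev(outcomes)})
    return summary


def consistent_across_subsets(records, n_subsets, statistic, tolerance):
    """Consistency examination (sketch): compute the same statistic on
    interleaved subsets and report whether the values agree within tolerance."""
    subsets = [records[i::n_subsets] for i in range(n_subsets)]
    values = [statistic(subset) for subset in subsets]
    return max(values) - min(values) <= tolerance, values


def avg_spend(records):
    """Mean outcome of a set of records."""
    return mean([y for _, y in records])


# The data miner would adjust cut points and tolerance interactively.
segments = homogeneity_by_range(rows, cut_points=[40, 60])
agrees, subset_means = consistent_across_subsets(rows, 2, avg_spend, 10.0)
```

In an interactive setting the data miner would change `cut_points` or the subset choice and inspect the summaries again. A small spread within each range suggests homogeneous behavior for that range.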
  6. 6. Human-Computer Interface The human-computer interface allows the data miner to dialog with the computer. It integrates the database, data visualization instruments, and data and model assembly into a single computing environment. Through the human-computer interface, the data miner is able to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results. FUTURE TRENDS Interactive visual data mining techniques will become key components of any data mining instrument. More theories and techniques of interactive visual data mining will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. Query systems along with data visualization functions on large-scale database systems will become available to data mining practitioners. CONCLUSION Because a one-shot knowledge deduction may not provide an alternative result if it fails, the data miner needs an integrated computing environment, which interactive visual data mining provides. An interactive visual data mining system consists of three intertwined components, besides the database: a data visualization instrument, a data and model assembly instrument, and a human-computer interface. In interactive visual data mining, the human-computer interaction and effective visual presentations of multivariate data allow the data miner to interpret the data mining results based on the particular problem domain, his/her
  7. 7. perception, specialty, and creativity. The ultimate objective of interactive visual data mining is to allow the data miner to conduct the experimental process and examination simultaneously through human-computer collaboration in order to obtain a “satisfactory” result. REFERENCES Bell, P. C. (1991). Visual interactive modelling: The past, the present, and the prospects. European Journal of Operational Research, 54(3), 274-286. Chen, M., Zhu, Q., & Chen, Z. (2001). An integrated interactive environment for knowledge discovery from heterogeneous data resources. Information and Software Technology, 43(8), 487-496. de Oliveira, M. C. F., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34. Han, J., Hu, X., & Cercone, N. (2003). A visualization model of interactive knowledge discovery systems and its implementations. Information Visualization, 2(2), 105-112. Keim, D. A., & Kriegel, H. P. (1996). Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 923-938. Kohonen, T. (1989). Self-Organization and Associative Memory (3rd ed.). Berlin: Springer-Verlag. Kosslyn, S. M. (1980). Image and Mind. Cambridge, MA: Harvard University Press.
  8. 8. Montazemi, A., & Wang, S. (1988). The impact of information presentation modes on decision making: A meta-analysis. Journal of Management Information Systems, 5(3), 101-127. Newell, A., & Simon, H. A. (1972). Human Problem Solving. Englewood Cliffs, NJ: Prentice Hall. Shneiderman, B. (2002). Inventing discovery tools: Combining information visualization with data mining. Information Visualization, 1, 5-12. Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM Handbook of IT in Business (pp. 382-391). London, UK: International Thomson Business Press. Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283. Wang, S., & Wang, H. (2002). Knowledge discovery through self-organizing maps: Data visualization and query processing. Knowledge and Information Systems, 4(1), 31-45. Yang, L. (2003). Visual exploration of large relational data sets through 3D projections and footprint splatting. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1460-1471. KEY TERMS AND THEIR DEFINITIONS Data Visualization: Presentation of data in human understandable graphics, images, or animation. Data and Model Assembly: A set of query functions that assemble the data and data visualization instruments for data mining.
  9. 9. Human-Computer Interface: Integrated computing environment that allows the data miner to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results. Interactive Data Mining: Knowledge discovery process based on human-computer collaboration, in which the data miner and the computer interact to extract novel, plausible, useful, relevant, and interesting knowledge from the database. Query Tool: Structured query language that supports the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge for interactive data mining. Self-Organizing Map (SOM): Two-layer neural network that maps high-dimensional data onto low-dimensional pictures through an unsupervised or competitive learning process. It allows the data miner to view the clusters on the output maps. Visual Data Mining: Data mining process through data visualization. The fundamental concept of visual data mining is the interaction between data visual presentation, human graphics cognition, and problem solving.
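The Self-Organizing Map entry above can be made concrete with a small sketch. This is a generic online SOM with a Gaussian neighborhood and linearly decaying learning rate and radius, written under stated assumptions. It is not the specific configuration used by the authors or by Kohonen (1989). The function names, the 3x3 grid, the decay schedule, and the four-attribute customer vectors are hypothetical.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return the grid cell whose weight vector is closest to the sample."""
    return min(weights, key=lambda cell: math.dist(weights[cell], sample))


def train_som(data, rows, cols, epochs=50, lr_start=0.5, seed=0):
    """Train a rows x cols map on data whose features are scaled to [0, 1]."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    dim = len(data[0])
    # One weight vector per grid cell, initialized at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma_start = max(rows, cols) / 2.0
    for epoch in range(epochs):
        progress = epoch / epochs
        lr = lr_start * (1.0 - progress)                    # assumed linear decay
        sigma = max(sigma_start * (1.0 - progress), 0.5)    # assumed radius floor
        for sample in data:
            bmu = best_matching_unit(weights, sample)
            for cell, w in weights.items():
                grid_dist_sq = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                # Cells near the winner on the grid are pulled more strongly.
                influence = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (sample[i] - w[i])
    return weights


# Hypothetical customer attribute vectors forming two loose groups.
customers = [
    [0.10, 0.20, 0.10, 0.15], [0.15, 0.10, 0.20, 0.10], [0.20, 0.15, 0.15, 0.20],
    [0.90, 0.80, 0.85, 0.90], [0.85, 0.90, 0.80, 0.95], [0.80, 0.85, 0.90, 0.85],
]
som = train_som(customers, rows=3, cols=3)
placements = [best_matching_unit(som, c) for c in customers]
```

Each entry of `placements` is the map cell assigned to one customer. Plotting those cells on the grid gives the kind of low-dimensional picture on which the data miner would look for clusters.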