A Data Mining Tool Using an Intelligent Processing System with a Clustering Application

A.M.S. Zalzala, A. Al-Zain and I. Sarafis
Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh EH14 4AS, UK
{A.Zalzala, A.T.I.Alzain, I.Sarafis}@hw.ac.uk

Abstract

This paper presents DIPS, a database using an intelligent processing system. DIPS is a generic data mining tool for use with real-world applications. The tool is developed in Java and has access to an Oracle server for data storage. A Control GUI facilitates data manipulation, and the tool incorporates a set of algorithms for general data mining and clustering applications, including neural networks and evolutionary computation techniques. Case studies are reported incorporating a rule-based genetic clustering algorithm in experimental and real-world applications.

1. KNOWLEDGE DISCOVERY IN DATABASES

In the last decade, we have seen an explosive growth in our capabilities to both generate and collect data. Advances in data collection (e.g. from remote sensors or from space satellites), the widespread introduction of bar codes for almost all commercial products, and the computerization of many business transactions (e.g. credit card purchases) and government transactions (e.g. tax returns) have generated a flood of data. Advances in storage technology, such as faster, higher capacity, and cheaper storage devices (e.g. magnetic disks, CD-ROMs), better database management systems, and data warehousing technology, have allowed us to transform this data deluge into “mountains” of stored data. Such volumes of data clearly overwhelm the traditional manual methods of data analysis such as spreadsheets and ad-hoc queries. Those methods can create informative reports from data, but cannot analyze the contents of those reports to focus on important knowledge.
A significant need exists for a new generation of techniques and tools with the ability to intelligently and automatically assist humans in analyzing the mountains of data for nuggets of useful knowledge. These techniques and tools are the subject of the field of Knowledge Discovery in Databases (KDD). KDD is considered to be the “extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases” [1]. The KDD process is interactive and iterative, involving numerous steps with many decisions being made by the user. The KDD steps include learning the application domain, data cleaning, data integration, data selection, data transformation, selection of the data mining task and algorithm(s), data mining, pattern evaluation, and knowledge presentation.

2. Data Mining

Although many people treat Data Mining (DM) as a synonym for KDD, DM is in essence only one step of the KDD process. It is nevertheless the vital step, because only during this step are the “hidden” patterns extracted. DM is defined by Han and Kamber as “the application of algorithms for extracting patterns from data without the additional steps of the KDD process” [2]. DM is a multi-disciplinary field that applies techniques ranging from machine learning (ML), information science and visualization to statistics and database systems.

Various classification schemes can be used to categorize DM techniques. Broadly speaking, these schemes are based on the type of databases, the type of knowledge and the techniques or algorithms used to mine hidden patterns. It is very difficult to build a generic DM system suitable for all cases, because of the diversity in the kinds of databases used, in the kinds of knowledge mined and in the available algorithms.

There are several kinds of DM tasks, depending on the application domain and the users’ interests. Identifying the bounds and the functionality of each DM task is of vital importance, as this determines the format of the patterns that are going to be mined. The format of the patterns (in essence the kind of mined “knowledge”) in turn imposes requirements and restrictions on the searching algorithm(s). The type of knowledge of interest is therefore a major factor in the selection of the searching algorithm(s).
The most popular and well-studied DM tasks are Characterization and Discrimination, Association Analysis, Classification and Prediction, Clustering, and Outlier Analysis.

3. DIPS

The software incorporates an implementation of various classes providing for data mining algorithms. The shell package is developed in Java and accesses an Oracle server. As depicted in Figure 1, it consists of a converter, two databases, a data provider and a set of data processing algorithms. The Converter reads data files provided by the client and stores them in Oracle tables as “database_1”. The data is checked for errors and the user is given a report of all errors. The user can view the errors, apply the necessary corrections, then resume the conversion process to store the data.

Figure 1: General DIPS description [components shown: Control GUI, Converter, Data Provider, Algorithms 1 to n, data manipulation classes, Database 1, Database 2, alphanumeric input data]
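
The check-then-store behaviour of the Converter described above can be illustrated with a minimal Java sketch. It is not the DIPS Converter: the class name `ConverterSketch`, the comma-separated numeric file format and the wording of the error messages are all assumptions made for illustration, and rows are kept in memory rather than written to Oracle tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the check-then-store step; not the actual DIPS Converter.
final class ConverterSketch {
    final List<double[]> validRows = new ArrayList<>(); // rows that would be stored
    final List<String> errors = new ArrayList<>();      // report shown to the user

    // Assumes comma-separated numeric fields; the real data file format is not specified.
    static ConverterSketch convert(List<String> lines, int expectedColumns) {
        ConverterSketch result = new ConverterSketch();
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(",", -1);
            if (fields.length != expectedColumns) {
                result.errors.add("line " + (i + 1) + ": expected " + expectedColumns
                        + " fields, found " + fields.length);
                continue;
            }
            double[] row = new double[expectedColumns];
            boolean rowOk = true;
            for (int j = 0; j < fields.length; j++) {
                try {
                    row[j] = Double.parseDouble(fields[j].trim());
                } catch (NumberFormatException e) {
                    // Record the problem and keep scanning so the report lists all errors.
                    result.errors.add("line " + (i + 1) + ", field " + (j + 1)
                            + ": not a number: '" + fields[j] + "'");
                    rowOk = false;
                }
            }
            if (rowOk) {
                result.validRows.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ConverterSketch r = convert(Arrays.asList("400,50", "abc,10", "1,2,3"), 2);
        System.out.println(r.validRows.size() + " valid row(s)");
        r.errors.forEach(System.out::println);
    }
}
```

Collecting every error before reporting, rather than stopping at the first one, mirrors the workflow in which the user reviews the full report, corrects the file and resumes the conversion.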
Data manipulation involves a number of stages, with each stage executing a specific “algorithm”. “Database_2” holds all data processed by one or more algorithms, while database_1 is never altered beyond what the converter provided earlier. (See the following Converter GUI.)

The algorithms depicted in Figure 1 include shell classes with the necessary data providers built into them. The actual algorithms include neural and genetic techniques and employ other packages for computation and display. (See the following GUI.)

Data manipulation can use all available algorithms or a subset of them, in any order; hence the Control GUI lets the user define a “processing path”, i.e. which algorithms to execute one after the other. The GUI also allows the user to edit any table at any time. The figures show various GUI displays for the converter, the algorithms and execution. (See the following execution GUI.)
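
A “processing path” of this kind can be sketched in Java as an ordered list of stages applied one after the other to a copy of the input data, so that the original (playing the role of database_1) is never altered. The class `ProcessingPath` and the two toy stages are hypothetical and are not taken from DIPS; tables are simplified to in-memory lists of numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a processing path; not the actual DIPS classes.
final class ProcessingPath {
    private final List<UnaryOperator<List<Double>>> stages = new ArrayList<>();

    // Appends a stage; stages run in the order in which they were added.
    ProcessingPath then(UnaryOperator<List<Double>> stage) {
        stages.add(stage);
        return this;
    }

    // Runs all stages on a copy of the input, which stands in for database_1.
    // The returned list stands in for database_2.
    List<Double> run(List<Double> database1) {
        List<Double> current = new ArrayList<>(database1);
        for (UnaryOperator<List<Double>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Toy stage: keeps only the non-negative values.
    static List<Double> dropNegatives(List<Double> in) {
        List<Double> out = new ArrayList<>();
        for (double v : in) {
            if (v >= 0) {
                out.add(v);
            }
        }
        return out;
    }

    // Toy stage: multiplies every value by ten.
    static List<Double> scaleByTen(List<Double> in) {
        List<Double> out = new ArrayList<>();
        for (double v : in) {
            out.add(v * 10);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> raw = Arrays.asList(1.0, -2.0, 3.0);
        ProcessingPath path = new ProcessingPath()
                .then(ProcessingPath::dropNegatives)
                .then(ProcessingPath::scaleByTen);
        System.out.println(path.run(raw)); // prints [10.0, 30.0]
    }
}
```

Because each stage returns a new list, the same path can be rerun with a different subset or ordering of stages without touching the source data.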
4. Software Implementation

Java has gained enormous popularity as a programming language for the World Wide Web, but it offers similar advantages for GUI applications [3-5]. In a sense, Java is a variation of the interface layer approach, in that the Java engine already provides the interface to the underlying platform. The consequences, however, are quite dramatic. The latest Java release, with its rich set of GUI tools, is being ported to virtually all platforms and, true to its “write once, run anywhere” slogan, requires no special maintenance for different platforms, which is reflected in reduced costs. Java provides a set of user interface components that are the same across Unix, Windows and the Macintosh. Hence, cross-platform applications and applets can be built.

4.1. Java objects used in DIPS

JFrame: the Swing equivalent of the AWT Frame. It adds double buffering to avoid flickering during drawing. It has a slightly different interface to geometry management: items are added to a contentPane. It can hold a JMenuBar.

JButton: can have an image and/or text label with controllable placement. The same holds for JLabel, JCheckBox and JRadioButton. These extend the appearance of the corresponding AWT objects. In DIPS the JButtons have a text label that briefly describes the button's functionality.

[Figure: class diagrams for AbstractButton, JLabel, JButton (with its constructors), JMenuItem and JToggleButton, and for Frame, JRootPane, JMenuBar and Container]

Event: The path taken by events is different for AWT components and Swing components. Native events are generated by user actions, such as mouse clicks. Native events are translated into Java events and sent to a Java object which has a native window. The difference is that all AWT objects have a native window, but only some Swing objects, such as JFrame, have one.
For the AWT, events finally have to be delivered to native objects to have a visual effect. For Swing, a container has to pass events to its Swing components for them to have a visual as well as a semantic effect. A mouse press on a Button inside a Frame, for example, triggers such a sequence for Java running on X Windows.
JList: extends the functionality of List, and requires programmatic changes. A JList holds a list of Object, not just String. The list can be set by a constructor. The contents of a JList are stored in its ListModel, and the list is changed through the methods of the model. The GUI uses a CellRenderer to paint each element of a list, and this can be set by an application. The listeners are ListSelectionListeners rather than ItemListeners.

Text: JTextArea acts as a drop-in replacement for TextArea. JTextField acts as a drop-in replacement for TextField. JPasswordField is a safer replacement for JTextField used with setEchoChar(). JTextPane can display multiple fonts, colours, etc.

Dialogs: The dialog types can be ERROR_MESSAGE, INFORMATION_MESSAGE, WARNING_MESSAGE, QUESTION_MESSAGE, and PLAIN_MESSAGE.

4.2. DIPS Connection with Oracle

Oracle8i changes the way information is managed and accessed to meet the demands of the Internet age, while providing significant new features for traditional online transaction processing (OLTP) and data warehouse applications. It provides advanced tools to manage all types of data in Web sites, and it also delivers the performance, scalability, and availability necessary to support very large database (VLDB) and mission-critical applications.

Figure 2: Primary Components of Oracle (Storage Management and Processes) [components shown: the instance with its SGA (shared pool with library cache and data dictionary cache, data buffer cache, redo log buffer), the user and server processes, the background processes SMON, DBW0, PMON, CKPT and LGWR, and the parameter, data, control, redo log, password and archived log files of the database]

The network-oriented nature of Java makes it an ideal candidate for client/server computing, especially now that the ability to integrate it with popular commercial Database Management Systems (DBMS) is in the making. The first standardized work on Java-DBMS connectivity appears in a draft specification known as the Java Database Connectivity (JDBC) Application Programming Interface (API) specification. Created with the help of the aforementioned database and database-tool vendors, it is intended to fill the current vacancy in this level of connectivity that has prompted companies like WebLogic to develop proprietary interfaces. JDBC creates a programming-level interface for communicating with databases in a uniform manner, similar in concept to Microsoft’s Open Database Connectivity (ODBC) component, which has become the standard for personal computers and LANs. The JDBC standard itself is based on the X/Open SQL Call Level Interface, the same basis as that of ODBC. This is one of the reasons why the initial development of JDBC is progressing so fast.

5. A Data Mining Rule-Based Clustering Tool using GAs

A rule-based clustering tool is developed, which exploits the search capability of genetic algorithms in order to generate a set of comprehensible IF-THEN clustering rules. Presenting the discovered knowledge of a clustering task to an end user as IF-THEN clustering rules, a high-level symbolic knowledge representation, contributes to the comprehensibility of that knowledge. For this reason we developed a genetic algorithm for mining IF-THEN clustering rules from large databases.

5.1. The RBCGA Algorithm

The rule-based genetic algorithm (RBCGA) used as the underlying search mechanism evolves individuals that each contain a set of clustering rules. Each cluster is described by one IF-THEN clustering rule. In particular, a complete solution to the clustering problem shown in Figure 3(a) is represented by the following set of rules:

Rule A: [[400 ≤ salary ≤ 1000] AND [0 ≤ tax ≤ 100]]
Rule B: [[0 ≤ salary ≤ 200] AND [300 ≤ tax ≤ 400]]

Consequently, the encoding of the individuals is as shown in Figure 3(b). As far as the fitness function is concerned, a function that incorporates the notions of density, selectivity, asymmetry and homogeneity is suggested.

Figure 3: (a) Distribution of patterns. (b) The structure of the individuals

Special recombination and mutation operators suitable for individuals like those shown in Figure 3(b) are developed.
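
To make the rule representation concrete, the following Java sketch encodes Rule A and Rule B as sets of inclusive intervals and checks which rule covers a given pattern. It is an illustration only, not the RBCGA encoding of Figure 3(b): the class `ClusteringRule`, its methods and the first-match assignment policy are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an IF-THEN clustering rule; not the RBCGA source code.
final class ClusteringRule {
    final String clusterId;
    // attribute name -> {lower bound, upper bound}, both inclusive
    private final Map<String, double[]> bounds = new LinkedHashMap<>();

    ClusteringRule(String clusterId) {
        this.clusterId = clusterId;
    }

    // Adds one condition of the form [lower <= attribute <= upper].
    ClusteringRule where(String attribute, double lower, double upper) {
        bounds.put(attribute, new double[] {lower, upper});
        return this;
    }

    // True when every condition of the IF part holds for the pattern.
    boolean covers(Map<String, Double> pattern) {
        for (Map.Entry<String, double[]> e : bounds.entrySet()) {
            Double v = pattern.get(e.getKey());
            if (v == null || v < e.getValue()[0] || v > e.getValue()[1]) {
                return false;
            }
        }
        return true;
    }

    // Returns the cluster id of the first rule covering the pattern, or null if none does.
    static String assign(ClusteringRule[] rules, Map<String, Double> pattern) {
        for (ClusteringRule rule : rules) {
            if (rule.covers(pattern)) {
                return rule.clusterId;
            }
        }
        return null;
    }

    // Rule A and Rule B exactly as given in the text.
    static ClusteringRule[] exampleRules() {
        return new ClusteringRule[] {
            new ClusteringRule("A").where("salary", 400, 1000).where("tax", 0, 100),
            new ClusteringRule("B").where("salary", 0, 200).where("tax", 300, 400)
        };
    }

    // Convenience constructor for a two-attribute pattern.
    static Map<String, Double> pattern(double salary, double tax) {
        Map<String, Double> p = new LinkedHashMap<>();
        p.put("salary", salary);
        p.put("tax", tax);
        return p;
    }

    public static void main(String[] args) {
        ClusteringRule[] rules = exampleRules();
        System.out.println(assign(rules, pattern(500, 50)));  // prints A
        System.out.println(assign(rules, pattern(100, 350))); // prints B
        System.out.println(assign(rules, pattern(300, 200))); // prints null
    }
}
```

A pattern that falls outside every rule, such as the third one, is left unassigned in this sketch; how the actual algorithm treats such patterns is not stated here.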
Details of the operators and the objective function are reported elsewhere [7], as they are beyond the scope of the current paper.

5.2. The Infrastructure of the Clustering Tool

The DM Clustering Toolkit (DMCT) is based on three existing software packages, DIPS, Eos and PtPlot, and employs the rule-based genetic algorithm described briefly above. The main components of DMCT are illustrated in Figure 4, and the functionality of each component is described in the following.

DIPS: This works as a bridge between the Oracle database and the clustering toolkit. When an end user decides to run the rule-based clustering algorithm through DIPS, the user must provide names for the input and output tables. The clustering algorithm runs against the data coming from the input table. DIPS connects to the Oracle database and passes a JDBC Statement object to the DMCT. This object is used to retrieve the data from the input table. DMCT instantiates a static Java object called Pattern in order to load all the data into main memory, making the algorithm faster.

Figure 4: Structure of the DIPS-based Clustering Tool

PtPlot: This is a plotting package, written entirely in Java, that is used for drawing graphs. Plots are essential parts of a simulation, as they provide a clear view of the progress of the running algorithm, and DMCT extends PtPlot to provide enhanced plotting features. Currently, the end user is able to plot more than 25 statistical metrics related to the EA running in the background. The update mechanism for the plots is based on the Observer design pattern. The wide availability of statistical metrics, along with the flexibility that Eos provides in defining and calculating new statistics, makes PtPlot a powerful monitoring tool for our simulations. For instance, in the running screenshot of the application shown in Figure 5, an end user is able to monitor various statistics related to the population’s objective value, density, coverage, and asymmetry.

Eos Platform: Eos is a software platform developed by BT’s Future Technologies Group [6]. Eos supports research and rapid implementation of evolutionary algorithms, ecosystem simulations and hybrid models. Amongst others, the toolkit supports Genetic Algorithms and Evolutionary Strategies. It defines basic classes and various implementations for genomes, recombination operators, mutation operators, selection strategies, replacement strategies, interactions, and more. The Eos platform is built using the object-oriented design paradigm so that it is customizable and extensible. The flexibility of the Eos platform makes it a powerful environment for developing new algorithms and architectures. Eos is entirely implemented in Java and runs on all major operating systems.

Figure 5: The main monitoring frame of clustering application

What makes Eos a powerful tool for rapid development of evolutionary algorithms is its flexibility in developing new types of genomes and genetic operators. For instance, our approach for mining IF-THEN clustering rules uses a special non-binary representation for the individuals. It is relatively easy to develop these new types of individuals by extending the core class of Eos that provides the necessary functionality for an individual, and adding new features according to the requirements of the problem.
Each type of individual requires dedicated genetic operators, which have been developed following the same idea: (a) extend the basic classes provided by Eos and (b) add new functionality. The core of the Eos platform provides the infrastructure for setting up and running an evolutionary algorithm.
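
The extend-and-add pattern can be sketched in Java as follows. None of the names below come from Eos, whose API is not reproduced here: `Individual` is a stand-in for a framework base class, `RuleSetIndividual` is an assumed non-binary genome made of intervals, and `BoundMutation` is an assumed operator (Gaussian perturbation of bounds) that is not the operator reported in [7].

```java
import java.util.Arrays;
import java.util.Random;

// Stand-in for a framework base class; NOT the real Eos API.
abstract class Individual {
    abstract Individual copy();
}

// (a) Extend the base class with a non-binary genome: one [lower, upper] interval per condition.
final class RuleSetIndividual extends Individual {
    final double[][] bounds; // bounds[i] = {lower, upper}

    RuleSetIndividual(double[][] bounds) {
        this.bounds = bounds;
    }

    @Override
    RuleSetIndividual copy() {
        double[][] c = new double[bounds.length][];
        for (int i = 0; i < bounds.length; i++) {
            c[i] = bounds[i].clone();
        }
        return new RuleSetIndividual(c);
    }
}

// (b) Add a dedicated operator for the new genome (assumed, for illustration only).
final class BoundMutation {
    private final double probability; // chance of perturbing each bound
    private final double step;        // standard deviation of the perturbation
    private final Random rng;

    BoundMutation(double probability, double step, Random rng) {
        this.probability = probability;
        this.step = step;
        this.rng = rng;
    }

    // Returns a mutated copy; the parent is left untouched.
    RuleSetIndividual mutate(RuleSetIndividual parent) {
        RuleSetIndividual child = parent.copy();
        for (double[] interval : child.bounds) {
            for (int j = 0; j < 2; j++) {
                if (rng.nextDouble() < probability) {
                    interval[j] += step * rng.nextGaussian();
                }
            }
            if (interval[0] > interval[1]) { // keep lower <= upper
                double tmp = interval[0];
                interval[0] = interval[1];
                interval[1] = tmp;
            }
        }
        return child;
    }

    public static void main(String[] args) {
        RuleSetIndividual parent = new RuleSetIndividual(new double[][] {{400, 1000}, {0, 100}});
        RuleSetIndividual child = new BoundMutation(0.5, 10.0, new Random(42)).mutate(parent);
        System.out.println(Arrays.deepToString(child.bounds));
    }
}
```

The operator works on a copy so that selection and replacement strategies supplied by the framework can still compare parents and offspring.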
6. Clustering Results

The output of a data mining clustering task should be interpretable, meaningful and easily readable. Typically, an end user of a data mining clustering tool requires the descriptions of the discovered clusters to be easily assimilated. Usually, the most desirable output of a clustering task consists of a set of clustering rules of the form IF (conditions in the feature space) THEN cluster ID. Rule-based cluster descriptors can help users to gain better insight into the distribution of patterns. Bearing in mind the above requirement, we have developed a powerful visualization tool to plot the IF-THEN clustering rules of the best individual (see Figure 6).

Figure 6: Visualization tool for presenting the discovered rules

The update “heartbeat” ripples recursively through the simulation environment, and at the end of each update cycle the visualization tool, which observes the environment, updates its “contents”. DMCT has built-in support for visualization in more than two dimensions. The user simply has to select the number of plotters and the attributes that appear in each plot.

6.1. Case Studies

We report the results of experiments with two data sets, namely DS1 and DS2, which contain patterns in two dimensions. The parameter settings for RBCGA are listed in Table 1.

Table 1: Algorithm parameters

Parameter                    Value
Generations                  200
Population Size              50
Mutation rate                95%
Mutation Probability         0.005
Recombination rate           100%
Selection strategy           Roulette Selection (k=2)
Replacement strategy         Replace Worst with elitism size 1
Threshold for sparse rules   5%
Stopping criterion           Maximum number of generations

The effectiveness of RBCGA is evaluated on different types of data distributions. DS1 was obtained from the SEQUOIA 2000 benchmark database, which provides real data sets that are representative of Earth Science tasks [8]. The database contains four types of datasets: raster data, point data, polygon data and directed graph data. The polygon datasets contain 79,607 polygons (delimited by 3,997,756 points) of homogeneous land use, extracted from the US Geological Survey's Geographic Information Retrieval and Analysis System (GIRAS). The RBCGA is run against the polygon 83 dataset, which corresponds to areas characterized as “wet tundra”.

DS1 contains 953 two-dimensional patterns that apparently can be grouped into four different clusters. The clusters are well separated from each other and are of different shapes, sizes and densities. The main challenge for RBCGA is to cope with the apparent discontinuities of patterns within the clusters. RBCGA was run 50 times against DS1 searching for k=4 clusters, and the output was always quite similar to the clustering depicted in Figure 7. It should be pointed out that, although there are two subclusters within the area defined by cluster 2, RBCGA never split cluster 2 into two clusters, because the subclusters are located close to each other. RBCGA converges in approximately 100 generations, and each generation takes around 0.9 seconds to complete.

Figure 7: Clustering DS1 (polygon 83) dataset

Dataset DS2 was synthetically generated (Figure 8) with a structure similar to that of DS1 (as described in [9]).
Many partitioning-based clustering algorithms, such as BIRCH [10], fail to reveal the apparent clusters and usually split the bigger one in order to minimize the square-error function used in k-means approaches [9]. We compared RBCGA against a standard k-means algorithm, and the clustering results are depicted in Figure 8. RBCGA generates partitions similar to Figure 8(b), which means that it can be used in cases where there are differences in the geometry and densities of the clusters. RBCGA converges for DS2 in approximately 150-200 generations.

Figure 8: Clustering DS2 using (a) the k-means algorithm and (b) RBCGA

7. Conclusions
This paper presented the implementation of a generic data mining tool incorporating Java, Oracle, and JDBC technologies. The tool provides a shell for the development of real-life applications using various knowledge discovery tasks. One primary advantage of the tool is the incorporation of new intelligent system algorithms within the data mining process, hence providing a research and development platform for a growing field of prime importance. Case studies were presented for a clustering application employing a rule-based genetic algorithm, which demonstrated the feasibility of incorporating an evolutionary formulation into the data mining tool using the Eos package. Further work is in progress, aiming at harnessing various Internet technologies to further enhance the potential of the developed tool.

Acknowledgements

The authors gratefully acknowledge the financial support of the SHEFC SLI Cluster under grant no. 24, and the provision of the Eos software by BT’s Intelligent Systems Laboratory.

References

[1] Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
[2] Han, J. and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[3] Flanagan, D., Java in a Nutshell, O’Reilly, 1999.
[4] Oracle Corp., Enterprise DBA Architecture and Administration, production 1.0, August 1999.
[5] Quatrani, T., Visual Modeling With Rational Rose 2000 and UML, Addison-Wesley, 1998.
[6] Bonsma, E., M. Shackleton and R. Shipman, Eos - an evolutionary and ecosystem research platform, BT Technology Journal, 18(14):24-31, 2000.
[7] Sarafis, I., A.M.S. Zalzala and P.W. Trinder, A Genetic Rule-Based Data Clustering Toolkit, In Proc. World Congress on Computational Intelligence, May 2002 (to appear).
[8] Stonebraker, M., J. Frew, K. Gardels, and J. Meredith, The Sequoia 2000 Storage Benchmark, In Proc. ACM-SIGMOD International Conference on Management of Data, pp. 2-11, Washington, D.C., May 1993.
[9] Guha, S., R. Rastogi, and K. Shim, CURE: An efficient clustering algorithm for large databases, In Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 73-84, New York, 1998.
[10] Zhang, T., R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, Montreal, Canada, 1996.
