Report on KDD and DataMining


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Report on KDD and DataMining

  1. 1. Survey of Algorithms used in Data Mining and Knowledge Discovery in Databases Ping-Herng Denny Lin Department of Computer Science, California State University, Fullerton. Abstract Due to the explosive growth of information available in databases, researchers have been looking for ways to make use of data stored. Knowledge Discovery in Databases is the process of finding useful information in a large database. Many algorithms have been suggested and used to find useful data, and the activity is called data mining. This paper briefly describes the relationship between Knowledge Discovery in Databases (KDD) and data mining (DM), and surveys algorithms and methods used for data mining. The report will attempt to identify the strengths and weaknesses of data mining algorithms, and list some examples of how these algorithms are used. Keywords: Algorithms, data mining, knowledge discovery in databases 0 Outline 1. Definition of terminology 2. Data mining tasks 3. Data mining methods and techniques 4. Example applications 5. Conclusion 6. References 1 Definition of terminology The purpose of Knowledge Discovery in Database is to find useful information in an ever growing data set size in databases. Computer assisted transactions have generated mountains of data that could be useful to maintain the competitive edge of a company. For instance, retailers such as Wal-Mart have been making the best use of information from their 20 million per day point-of-sale transactions [Hed95], to understand customer purchasing behavior and improve customer service. While the terms “Data mining” (DM) and “Knowledge Discovery in Databases” (KDD) have been used interchangeably, it is best to differentiate these terms since DM is one of the stages of KDD. The KDD process is generally made up of the following iterative stages [Rei99]: 1. Understand the business: This phase is used to identify the domain, objectives, requirements, and feasibility of the data mining project. 2. Understand data: This phase is based on data mining objectives identified in the previous stage. It is used to analyze and document available data and knowledge sources in the business, and to study characteristics of the data. 1
  2. 2. 3. Prepare data: Format or transform data to suitable media, clean or eliminate missing tuples, and focus the data for data mining. This step is usually the most time consuming part of the KDD project. 4. Explore data: This phase is used to explore interesting insights into the data, so that initial hypotheses and models of data can be evaluated and developed. 5. Data mining: Select and apply modeling and discovery algorithms to find patterns of interest in the data. This is accomplished by using different algorithms, which is the subject of this paper. 6. Evaluate results: Interpret and evaluate data mining results in business terms, and initiate new experiments if necessary. 7. Deploy results: Implement results obtained by the data mining process. Maintenance and monitoring data mining results are also part of this phase. 8. Document experience: This phase is really done throughout all phases of KDD projects, and documents aspects of the project and deployment of data mining results. This stage enables future projects to be more efficient and effective. The term “classification” is frequently used as an algorithm for all data mining tasks. Instead, it is best to use the term to refer to the category of supervised learning algorithms used to search interesting data patterns. While classification algorithms have become very popular and ubiquitous in DM research, it is just but one of the many types of algorithms available to solve a specific type of DM task. Scaleability refers to the ability of data mining algorithms to work under increasingly large databases. Because data mining deals with large databases, scaleability is a desirable feature. 2 Data mining tasks Based on the data collected, data mining algorithms are used to either produce a description of the data stored, or predict an outcome. Different kinds of algorithms are used to achieve either one of these tasks. However, in the overall KDD process, any mixture of these tasks may be called upon to achieve the desired results. For example, in determining consumer preference for a new product, an analyst might first use clustering to segment the customer database, and then apply regression to predict buying behavior for that cluster [Ggr99]. 1. Description tasks: These tasks describe the data being mined and they are: 2. Summarization: To extract compact patterns that describe subsets of data. The method used to achieve this task are Association Rule algorithms. 3. Segmentation or Clustering: To separate data items into subsets that are similar to each other. Partition-based clustering algorithms are used to achieve this task. 4. Change and Deviation Detection: To detect changes in sequential data (such as protein sequencing, behavioral sequences, etc.). 5. Dependency Modeling: To construct models of causality within the data. 6. Prediction tasks: To predict some field(s) in a database based on information in other fields. 7. Classification: To predict the most likely state of a categorical variable (its class). 2
  3. 3. 8. Regression: To predict results that are numeric continuous variables [Fay98]. 3 Data mining methods and techniques Because the DM and KDD fields are relatively new, different authors appear to survey methods in different ways. 1. Fayyad’s methods for data mining [Fay98]: Predictive Modeling; Clustering; Summarization; Dependency Modeling; Change & Deviation Detection 2. Goebel & Gruenwald’s methods for data mining [Ggr99]: Statistical Models; Case-Based Reasoning; Neural Networks; Decision Trees; Rule Induction; Bayesian Belief Networks; Genetic algorithms; Fuzzy Sets; Rough Sets 3. Aggarwal & Yu’s survey of techniques for data mining [Ayu98]: Association rules, Clustering, Classification In the course of gathering information for this paper, Aggarwal & Yu's survey turned out to be very helpful in organizing descriptions of various methods. What follows is a survey of methods and techniques for data mining using Aggarwal & Yu's outline, while using information from other sources to describe the different approaches and their strengths and weaknesses as it applies to data mining. 1. Association rules: This type of algorithm seeks to find interesting data correlations between data items in different attributes. These data correlations are rules that state the association between sets of items with some minimum level confidence factor. For example, an association rule is a statement that “90% [confidence factor] of transactions that purchase bread and butter [antecedent] also purchase milk [consequent]” [Ais93]. Advantages: Generates very understandable data, and has become very popular in market analysis (also known as basket analysis). Disadvantages: Is I/O intensive for repeated passes over the database. 2. Clustering: These algorithms group similar records into segments which have considerable similarity within a group of points. Each of these segments may be treated differently. These algorithms are also known as unsupervised learning algorithms, and they rely heavily on statistical methods. Gray & Orlowska claim [Gor98] that clustering algorithms are typically in one of the following two forms: 3. Hierarchical clustering: These algorithms form a grouping of the data set by repeatedly joining or dividing existing clusters into new clusters [Gor98]. S: {1..20} S1: {5 6 8 10 14 16 18 20} S2: {1..4 7 9 11 12 13 15 17 19} S11: {8 10 16 18 20} S12: {5 6 14} S 21: {1 2 7 9 11 17} S22: {3 4 12 13 15 19} 3 S211: {7 11} S212: {1 2 9 17} S221: {3 13 15} S222: {4 12 19} A Hierarchy of Clusters
  4. 4. Table with animal information, used to generate the above hierarchy of clusters: # Animal Feet Ears Eats Gives_Milk Flies Swims 1 tiger claw external meat yes unable able 2 dog claw external grain yes unable able 3 horse hoof external grass yes unable able 4 sheep hoof external grass yes unable unable 5 frog web middle meat no unable well 6 penguin web middle meat no unable able 7 bat claw external meat yes well unable 8 chicken claw middle grain no able unable 9 rat claw external grain yes unable able 10 crane claw middle meat no well unable 11 rabbit claw external grass yes unable unable 12 camel hoof external grass yes unable unable 13 pig hoof external grain yes unable able 14 duck web middle grain no able well 15 ox hoof external grass yes unable able 16 swallow claw middle meat no well unable 17 bear claw external meat yes unable able 18 ostrich claw middle grain no unable unable 19 horse hoof external grass yes unable well 20 eagle claw middle meat no well unable Results of Hierarchical clustering: # Decision Rules Discovered [Hma91] 1 [Feet=web] à [Gives_milk=no] 2 [Feet=hoof] à [Gives_milk=yes] 3 [Feet=claw] [Swims=able] à [Feet=claw] [Gives_milk=yes] Advantages: Hierarchical clustering generates clear rules from the data, without supervision or pre-defined labels. Disadvantages: Partition clustering algorithms may outperform hierarchical rule-based algorithms [Mpi99]. 1. Partition-clustering: These algorithms partition the data set into a specified number of clusters by iteratively selecting a representative point for each cluster, determining a partitioning of the data based on these representative points and evaluating the partition according to metric- distance- or model- based criteria [Fay98]. Metric-distance techniques finds the best k-way partition so that cases in each block of the partition are closer in distance to each other than to cases in other clusters. Model- based techniques find the best fit of a hypothesized model for each of the clusters, to each cluster. Probability measures are used to score the fit of a model to a cluster. 4
  5. 5. Advantages: Can produce results that promise to be better than rule- based clustering methods [Mpi99]. Disadvantages: Like hierarchical clustering, finding rules that optimally covers all data clusters is an NP-hard problem [Hma91]. 2. Classification: These algorithms rely on human supervision to train itself to classify data into pre-defined categorical classes. For example, given classes of patients that correspond to medical treatment responses, identify most responsive forms of treatment for the patient [Ggr99]. 3. Decision Tree: These classifiers use a decision tree to partition data until each partition contains mostly examples from a particular class. The split point in a decision tree is a non-leaf node in the tree that uses some condition (also known as predicates) to decide how data should be partitioned. The terminal nodes in a decision tree contains tuples in the same class. Ag > <= 55 55 Good Married Yes No ? Bad Ag > <= 38 38 Good <= 25 Ag > 25 Good Bad A credit-scoring decision tree [Rei99] Advantages: Decision tree algorithms scale well, run fast, and produce highly interpretable results [Eab98]. Disadvantages: Decision tree algorithms suffer because they may find local maximas, producing inaccurate results [Ffa99]. 4. k-Nearest Neighbor: This algorithm finds the nearest neighbors of the test example, and assigns its class label on the majority labels of the nearest neighbors. This technique assumes that locality in the feature space imply strong relationships among class labels [Ayu98]. Advantages: k-Nearest Neighbor algorithms are easy to implement, and its results are interpretable. Disadvantages: k-Nearest Neighbor algorithms produce huge models for small data sets. Scalability is a serious concern for these algorithms [Eab98]. 5. DNF Rules: On the left hand side of these rules, it specifies a union of a set of possible non-disjoint regions in the feature space; the right hand side 5
  6. 6. corresponds to the class label [Ayu98]. These algorithms aim to generate a minimal and complete rule set that is consistent with the training data [Aho96]. Advantages: DNF Rules generate a small set of consistent rules, so algorithm scales well. Disadvantages: Finding DNF Rules can be computationally expensive [Win95]. 6. Neural networks: These networks are loosely modeled after the human brain. It is a data structure of functions that given one or more weighted inputs, produces an output which is a class label. Individual pieces of data is fed into the neural network, and the functions in the network are modified (learning) according to the error rates of the output produced. This approach usually results in long training times even if the data set is small [Ayu98]. Advantages: Neural networks produce accurate results. Disadvantages: Neural networks require long training times and produce hard to understand results. These algorithms are not scaleable [Eab98], and significant data pre-processing is required. 7. Genetic algorithms: These algorithms are used to formulate hypotheses about dependencies between variables. A collection of potential problem solutions compete with each other, and the best solutions are selected and combined with each other, so that the solution set will improve over time [Ggr99]. Advantages: Genetic algorithms produce well tested accurate results. Disadvantages: Genetic algorithms arrive at results through evolutionary programming, and the results are often hard to understand. 8. Bayesian networks: These are directed acyclic graphs that represent probability distributions, which can be used to represent expert knowledge. The nodes represent attribute variables and states (events), and edges represent probabilistic dependencies between them. Associated with each node are local probability distributions, and arcs are drawn from cause nodes to effect nodes. Data mining a Bayesian network consists of using entering expert = 0.05 p(Fuel=empty | ξ) knowledge into the Bayesian network (using domain experts), and then p(Battery=bad | ξ) = 0.02 using Battery a database to update, refine, and improve on the knowledge in the Fuel Bayesian network; new graphs may result from this improvement, and the causal relationships between these resulting nodes can be easily interpreted [Hec96]. Gauge p(Gauge=empty | Battery=good, Fuel=not empty, ξ) = 0.04 p(Gauge=empty | Battery=good, Fuel=empty, ξ) = 0.97 p(Gauge=empty | Battery=bad, Fuel=not empty, ξ) = 0.10 p(Gauge=empty | Battery=bad, Fuel=empty, ξ) = 0.99 Turn Start p(Turn=no | Battery=good, ξ) = 0.03 p(Start=no | Turn=yes, Fuel=not empty, ξ) = 0.01 p(Turn=no | Battery=bad, ξ) = 0.98 p(Start=no | Turn=yes, Fuel=empty, ξ) = 0.92 p(Start=no | Turn=no, Fuel=not empty, ξ) = 1.0 6 p(Start=no | Turn=no, Fuel=empty, ξ) = 1.0 Bayesian-network for troubleshooting a car that won’t start [Hec96]
  7. 7. Note: ξ represents the state of knowledge of the person providing the probability. Advantages: Bayesian networks produce easy to understand results. Disadvantages: Must gather conditional probabilities from domain experts. 9. Rough and Fuzzy Sets: These are mathematical concepts that deal with uncertainty. For Rough set models, an upper and lower bound is defined, each of which has members and non-members, respectively. The upper bound of a rough set is the union between the lower bound and the boundary region. Members of the boundary region are "possible members" of a set [Ggr99]. Advantages: Fuzzy set models are less prone to local maximas than decision tree algorithms [Ffa99], and like rough set models, they cope with uncertainty better than any other algorithm. Disadvantages: Rough and Fuzzy sets require other algorithms to work. 4 Example applications 1. An Association rule discoverer: Wal-Mart uses an association rule algorithm to decide how goods are to be shelved. For example, one can find cough and cold medicine placed in the paper goods and facial tissue section; the pharmacy section that normally carries cough and cold medicine is usually more than 60 steps away in a different section, but data mining research has shown that a high number of those who buy facial tissue [antecedent] also need cough and cold medicine [consequent]. 2. A Hierarchical clustering unsupervised learner: Thought/KD1 discovers zoological truths [Hma91]. 3. A Metric-distance clustering unsupervised learner: 3-level customer profile fraud detector. Authors claim that their approach outperforms another rule-based method for unsupervised learning [Mpi99]. 4. A Decision tree classifier: SKICAT. Automatically analyzes and catalogues astronomical objects [Fdw96]. 5. A DNF rule classifier: R-MINI equity return predictor. Produces minimal complete and consistent rules to predict market movements that beat the S&P 500 index [Aho96]. 7
  8. 8. 6. A Bayesian classifier application: Spam miner. Detects spams with 97% accuracy [Zct99]. 7. A Rough-Set classifier: PRIMEROSE-REX2. Discovers diagnostic rules extracted from a medical temporal database [Tsu98]. The author has used Rough set model with rule induction to discover new medical information. 5 Conclusions This is my first attempt at understanding the research that has been done on the subject of Knowledge Discovery in Databases, and Data Mining. This paper is clearly far from being comprehensive, as the field of KDD and DM is an emerging one, and new techniques continue to be developed. I have found that every data mining algorithm has its strengths and limitations, and that no single algorithm can be used for all problems. Selecting the best algorithm for the business problem at hand requires an understanding of the capabilities of the algorithm, and how the end-user wishes to use the results of data mining. 6 References consulted [Aho96] C. Apte, S. J. Hong, “Predicting Equity Returns from Securities Data”, Advances in Knowledge Discovery And Data Mining, American Association for Artificial Intelligence, P. 541-560, 1996. [Ais93] R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules between Sets of Items in Large Databases”, Proceedings of the 1993 ACM SIGMOD Conference Washington DC, USA, P. 207-216, May 1993. [Ayu98] C. C. Aggarwal, P. S. Yu, “Data Mining Techniques for Associations, Clustering and Classification”, Lecture Notes in Computer Science 1574, P. 13-23, 1998. [Eab98] J. Elder, D. Abbott, “A Comparison of Leading Data Mining Tools”, Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998. [Fay98] U. Fayyad, “Mining Databases: Towards Algorithms for Knowledge Discovery”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 21, No. 1, P. 39-48, March 1998. [Fdw96] U. Fayyad, S. G. Djorgovski, N. Weir, “Automating the Analysis and Cataloguing of Sky Surveys”, Advances in Knowledge Discovery And Data Mining, American Association for Artificial Intelligence, P. 471-493, 1996. [Ffa99] C. Fertig, A. Freitas, L. Arruda, C. Kaestner, “A Fuzzy Beam-Search Rule Induction Algorithm”, Lecture Notes in Artificial Intelligence 1704, Berlin, Springer, P. 341-347, 1999. [Fps96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From Data Mining to Knowledge Discovery: An Overview”, Advances in Knowledge Discovery And Data Mining, American Association for Artificial Intelligence, P. 1-34, 1996. [Ggr99] M. Goebel, L. Gruenwald, “A Survey Of Data Mining And Knowledge Discovery Software Tools”, SIGKDD Explorations, Vol. 1, No. 1, P. 20-33, June 1999. 8
  9. 9. [Gor98] B. Gray, M. Orlowska, “CCAIIA: Clustering Categorical Attributes into Interesting Association Rules”, Lecture Notes in Artificial Intelligence 1394, P. 132-143, Berlin Springer, 1998. [Han99] D. Hand, “Statistics and Data Mining: Intersecting Disciplines”, SIGKDD Explorations, Vol. 1, No. 1, P. 16-19, June 1999. [Hec96] D. Heckerman, “Bayesian Networks for Knowledge Discovery”, Advances in Knowledge Discovery And Data Mining, American Association for Artificial Intelligence, P. 273-305, 1996. [Hed95] S. Hedberg, “The Data Gold Rush”, Byte, Vol. 20, No. 10, P. 83- 88, October 1995. [Hma91] J. Hong, C. Mao, “Incremental Discovery of Rules and Structure by Hierarchical and Parallel Clustering”, Knowledge Discovery In Databases, American Association for Artificial Intelligence, P. 177-194, 1991. [Mpi99] U. Murad, G. Pinkas, “Unsupervised Profiling for Identifying Superimposed Fraud”, Lecture Notes in Artificial Intelligence 1704, Berlin, Springer, P. 251-261, 1999. [Rei99] T. Reinartz, “Focusing Solutions for Data-Mining”, Lecture Notes in Artificial Intelligence 1623, Berlin, Springer, 1999. [Tsu98] S. Tsumoto, “Discovery of Approximate Medical Knowledge Based on Rough Set Model”, Lecture Notes in Artificial Intelligence 1510, Berlin, Springer, P. 468-476, 1998. [Win95] S. Weiss, N. Indurkhya, “Rule-based Machine Learning Methods for Functional Prediction”, Journal of Artificial Intelligence Research, Vol. 3, P. 383-403, December 1995. [Zct99] A. Zighed, M. Côté, N. Troudi, “The Data-Mining and the Technology of Agents to Fight the Illicit Electronic Messages”, Lecture Notes in Computer Science 1574, P. 464-468, Berlin, Springer, 1999. 9