From Data Mining to Knowledge Discovery in Databases

Articles

From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined. This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.1

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1996 / $2.00

FALL 1996 37
For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM's ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://www.ffly.com/>). CRAYON (<http://crayon.net/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www.sjmercury.com/hound/>) from the San Jose Mercury News and FARCAST (<http://www.farcast.com/>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported.
The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's "art" of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

[Figure 1. An Overview of the Steps That Compose the KDD Process.]

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data.
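The explicit interestingness function described above, an overall score combining validity, novelty, usefulness, and simplicity that a pattern must exceed before it counts as knowledge, can be sketched in a few lines. The weighted-sum form, the weights, and the threshold below are illustrative assumptions, not values from the article; as the text stresses, they are whatever the user chooses:

```python
# Sketch: interestingness as an explicit function of the four pattern
# qualities named in the text, each scored in [0, 1]. The weights and
# the threshold are user-chosen placeholders.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine per-pattern scores into one overall measure."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.6):
    """A pattern counts as knowledge if it exceeds the threshold."""
    return interestingness(**pattern_scores) > threshold

rule = {"validity": 0.9, "novelty": 0.5, "usefulness": 0.8,
        "simplicity": 0.7}
print(is_knowledge(rule))  # 0.36 + 0.10 + 0.24 + 0.07 = 0.77 > 0.6
```

An implicit interestingness function, the other option the article mentions, would instead rank the discovered patterns without ever exposing a numeric score.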
Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention in the literature.
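The nine steps above can be made concrete on a toy problem. In the sketch below, every record, field name, and threshold is an illustrative stand-in (not taken from the article), and the "mining" of step 7 is reduced to a one-rule threshold search for brevity:

```python
# A toy, end-to-end sketch of the KDD steps on synthetic loan records.

# Step 1 (domain understanding): the goal is to predict loan default.
records = [
    {"income": 20, "debt": 60, "default": True},
    {"income": 80, "debt": 10, "default": False},
    {"income": 35, "debt": 50, "default": True},
    {"income": 90, "debt": 40, "default": False},
    {"income": 15, "debt": 30, "default": True},
    {"income": 70, "debt": 20, "default": False},
    {"income": None, "debt": 25, "default": False},  # a damaged record
]

# Step 2 (target data set) and step 3 (cleaning): here, simply drop
# records with a missing income field.
target = [r for r in records if r["income"] is not None]

# Step 4 (reduction and projection): one derived feature.
for r in target:
    r["ratio"] = r["debt"] / r["income"]

# Steps 5-7 (method matching, model selection, mining): search for the
# debt/income threshold that best separates the two classes.
def accuracy(threshold):
    hits = sum((r["ratio"] > threshold) == r["default"] for r in target)
    return hits / len(target)

best = max((r["ratio"] for r in target), key=accuracy)

# Step 8 (interpretation): present the mined pattern as a readable rule.
rule = f"IF debt/income > {best:.2f} THEN default"
print(rule, f"(accuracy {accuracy(best):.0%} on the target data)")

# Step 9 (acting on the knowledge) would hand the rule to an analyst or
# a downstream loan-screening system, checking it against prior beliefs.
```

The loop structure the article emphasizes is not shown: in practice, a disappointing result at step 8 would send the analyst back to any earlier step.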
The Data-Mining Step of the KDD Process

The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods.

The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user's hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x's represent persons who have defaulted on their loans and (2) the o's represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

[Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.]
The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

Data-Mining Methods

The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data-mining methods.

Figure 3. A Simple Linear Classification Boundary for the Loan Data Set. The shaded region denotes class no loan.

Classification is learning a function that maps (classifies) a data item into one of several predefined classes (Weiss and Kulikowski 1991; Hand 1981). Examples of classification methods used as part of knowledge discovery applications include the classifying of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether future loan applicants will be given a loan or not.

Figure 4. A Simple Linear Regression for the Loan Data Set.

Regression is learning a function that maps a data item to a real-valued prediction variable. Regression applications are many, for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of simple linear regression where total debt is fitted as a linear function of income: The fit is poor because only a weak correlation exists between the two variables.

44 AI MAGAZINE

Clustering is a common descriptive task
where one seeks to identify a finite set of categories or clusters to describe the data (Jain and Dubes 1988; Titterington, Smith, and Makov 1985). The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements (Cheeseman and Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x’s and o’s in the previous figures) have been replaced by a + to indicate that the class membership is no longer assumed known. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database (Silverman 1986).

Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +.

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Basseville and Nikiforov 1993).

The Components of Data-Mining Algorithms

The next step is to construct specific algorithms to implement the general methods we outlined. One can identify three primary components in any data-mining algorithm: (1) model representation, (2) model evaluation, and (3) search. This reductionist view is not necessarily complete or fully encompassing; rather, it is a convenient way to express the key concepts of data-mining algorithms in a relatively unified and compact manner. Cheeseman (1990) outlines a similar structure.

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models increases the danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.

Model-evaluation criteria are quantitative
statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by the empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

Search method consists of two components: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model-evaluation criteria are fixed, then the data-mining problem has been reduced to purely an optimization task: Find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: The model representation is changed so that a family of models is considered.

Some Data-Mining Methods

A wide variety of data-mining methods exist, but here, we only focus on a subset of popular techniques. Each method is discussed in the context of model representation, model evaluation, and search.

Decision Trees and Rules

Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to comprehend. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and, thus, the approximation power) of the model. For example, figure 6 illustrates the effect of a threshold split applied to the income variable for a loan data set: It is clear that using such simple threshold splits (parallel to the feature axes) severely limits the type of classification boundaries that can be induced. If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to comprehend. A large number of decision tree and rule-induction algorithms are described in the machine-learning and applied statistics literature (Quinlan 1992; Breiman et al. 1984).

Figure 6. Using a Single Threshold on the Income Variable to Try to Classify the Loan Data Set.

To a large extent, they depend on likelihood-based model-evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to explore the superexponential space of possible models. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling (Agrawal et al. 1996).

Nonlinear Regression and Classification Methods

These methods consist of a family of techniques for prediction that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression (see Elder and Pregibon [1996], Cheng and Titterington [1994], and Friedman [1989] for more detailed discussions). Consider neural networks, for example. Figure 7 illustrates the type of nonlinear decision boundary that a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets. Also, the standard squared error and
cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively (Ripley 1994; Geman, Bienenstock, and Doursat 1992). Back propagation is a parameter-search method that performs gradient descent in parameter (weight) space to find a local maximum of the likelihood function starting from random initial conditions. Nonlinear regression methods, although powerful in representational power, can be difficult to interpret. For example, although the classification boundaries of figure 7 might be more accurate than the simple threshold boundary of figure 6, the threshold boundary has the advantage that the model can be expressed, to some degree of certainty, as a simple rule of the form “if income is greater than threshold, then loan will have good status.”

Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set.

Example-Based Methods

The representation is simple: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set.

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables. Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density
estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics). However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual
search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes. It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing of statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
  • 14. Articles This point is frequently missed by many ini- edge is important in all the steps of the KDD tial attempts at KDD. One way to deal with process. Bayesian approaches (for example, this problem is to use methods that adjust Cheeseman [1990]) use prior probabilities the test statistic as a function of the search, over data and distributions as one form of en- for example, Bonferroni adjustments for inde- coding prior knowledge. Others employ de- pendent tests or randomization testing. ductive database capabilities to discover Changing data and knowledge: Rapidly knowledge that is then used to guide the da- changing (nonstationary) data can make pre- ta-mining search (for example, Simoudis, viously discovered patterns invalid. In addi- Livezey, and Kerber [1995]). tion, the variables measured in a given appli- Integration with other systems: A stand- cation database can be modified, deleted, or alone discovery system might not be very augmented with new measurements over useful. Typical integration issues include inte- time. Possible solutions include incremental gration with a database management system methods for updating the patterns and treat- (for example, through a query interface), in- ing change as an opportunity for discovery tegration with spreadsheets and visualization by using it to cue the search for patterns of tools, and accommodating of real-time sensor change only (Matheus, Piatetsky-Shapiro, and readings. Examples of integrated KDD sys- McNeill 1996). See also Agrawal and Psaila tems are described by Simoudis, Livezey, and (1995) and Mannila, Toivonen, and Verkamo Kerber (1995) and Stolorz, Nakamura, Mesro- (1995). biam, Muntz, Shek, Santos, Yi, Ng, Chien, Missing and noisy data: This problem is Mechoso, and Farrara (1995). especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. 
Important Concluding Remarks: The attributes can be missing if the database was Potential Role of AI in KDD not designed with discovery in mind. Possible In addition to machine learning, other AI fiel- solutions include more sophisticated statisti- ds can potentially contribute significantly to cal strategies to identify hidden variables and various aspects of the KDD process. We men- dependencies (Heckerman 1996; Smyth et al. tion a few examples of these areas here: 1996). Natural language presents significant op- Complex relationships between fields: portunities for mining in free-form text, espe- Hierarchically structured attributes or values, cially for automated annotation and indexing relations between attributes, and more so- prior to classification of text corpora. Limited phisticated means for representing knowl- parsing capabilities can help substantially in edge about the contents of a database will re- the task of deciding what an article refers to. quire algorithms that can effectively use such Hence, the spectrum from simple natural lan- information. Historically, data-mining algo- guage processing all the way to language un- rithms have been developed for simple at- derstanding can help substantially. Also, nat- tribute-value records, although new tech- ural language processing can contribute niques for deriving relations between significantly as an effective interface for stat- variables are being developed (Dzeroski 1996; ing hints to mining algorithms and visualiz- Djoko, Cook, and Holder 1995). ing and explaining knowledge derived by a Understandability of patterns: In many KDD system. applications, it is important to make the dis- Planning considers a complicated data coveries more understandable by humans. analysis process. 
It involves conducting com- Possible solutions include graphic representa- plicated data-access and data-transformation tions (Buntine 1996; Heckerman 1996), rule operations; applying preprocessing routines; structuring, natural language generation, and and, in some cases, paying attention to re- techniques for visualization of data and source and data-access constraints. Typically, knowledge. Rule-refinement strategies (for ex- data processing steps are expressed in terms of ample, Major and Mangano [1995]) can be desired postconditions and preconditions for used to address a related problem: The discov- the application of certain routines, which ered knowledge might be implicitly or explic- lends itself easily to representation as a plan- itly redundant. ning problem. In addition, planning ability User interaction and prior knowledge: can play an important role in automated Many current KDD methods and tools are not agents (see next item) to collect data samples truly interactive and cannot easily incorpo- or conduct a search to obtain needed data sets. rate prior knowledge about a problem except Intelligent agents can be fired off to col- in simple ways. The use of domain knowl- lect necessary information from a variety of 50 AI MAGAZINE
sources. In addition, information agents can be activated remotely over the network or can trigger on the occurrence of a certain event and start an analysis operation. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area growing in importance.

Uncertainty in AI includes issues for managing uncertainty, proper inference mechanisms in the presence of uncertainty, and reasoning about causality, all fundamental to KDD theory and practice. In fact, the KDD-96 conference had a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the KDD system.

These potential contributions of AI are but a sampling; many others, including human-computer interaction, knowledge-acquisition techniques, and the study of mechanisms for reasoning, have the opportunity to contribute to KDD.

In conclusion, we presented some definitions of basic notions in the KDD field. Our primary aim was to clarify the relation between knowledge discovery and data mining. We provided an overview of the KDD process and basic data-mining methods. Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope: There are many data-mining techniques, particularly specialized methods for particular types of data and domains. Although various algorithms and applications might appear quite different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the task of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD. We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

Acknowledgments

We thank Sam Uthurusamy, Ron Brachman, and KDD-96 referees for their valuable suggestions and ideas.

Note

1. Throughout this article, we use the term pattern to designate a pattern found in data. We also refer to models. One can think of patterns as components of models, for example, a particular rule in a classification model or a linear component in a regression model.

References

Agrawal, R., and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3–8. Menlo Park, Calif.: American Association for Artificial Intelligence.

Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, I. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307–328. Menlo Park, Calif.: AAAI Press.

Apte, C., and Hong, S. J. 1996. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 514–560. Menlo Park, Calif.: AAAI Press.

Basseville, M., and Nikiforov, I. V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall.

Berndt, D., and Clifford, J. 1996. Finding Patterns in Time Series: A Dynamic Programming Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 229–248. Menlo Park, Calif.: AAAI Press.

Berry, J. 1994. Database Marketing. Business Week, September 5, 56–62.

Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 37–58. Menlo Park, Calif.: AAAI Press.

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth.

Brodley, C. E., and Smyth, P. 1996. Applying Classification Algorithms in Practice. Statistics and Computing. Forthcoming.

Buntine, W. 1996. Graphical Models for Discovering Knowledge. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 59–82. Menlo Park, Calif.: AAAI Press.

Cheeseman, P. 1990. On Finding the Most Probable Model. In Computational Models of Scientific Discovery and Theory Formation, eds. J. Shrager and P. Langley, 73–95. San Francisco, Calif.: Morgan Kaufmann.

Cheeseman, P., and Stutz, J. 1996. Bayesian Classification (AUTOCLASS): Theory and Results. In Advances in Knowledge Discovery and Data Mining, eds.
  • 16. Articles U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. ering Informative Patterns and Data Cleaning. In Uthurusamy, 73–95. Menlo Park, Calif.: AAAI Press. Advances in Knowledge Discovery and Data Mining, Cheng, B., and Titterington, D. M. 1994. Neural eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and Networks—A Review from a Statistical Perspective. R. Uthurusamy, 181–204. Menlo Park, Calif.: AAAI Statistical Science 9(1): 2–30. Press. Codd, E. F. 1993. Providing OLAP (On-Line Analyti- Hall, J.; Mani, G.; and Barr, D. 1996. Applying cal Processing) to User-Analysts: An IT Mandate. E. Computational Intelligence to the Investment Pro- F. Codd and Associates. cess. In Proceedings of CIFER-96: Computational Dasarathy, B. V. 1991. Nearest Neighbor (NN) Intelligence in Financial Engineering. Washington, Norms: NN Pattern Classification Techniques. D.C.: IEEE Computer Society. Washington, D.C.: IEEE Computer Society. Hand, D. J. 1994. Deconstructing Statistical Ques- Djoko, S.; Cook, D.; and Holder, L. 1995. Analyzing tions. Journal of the Royal Statistical Society A. 157(3): the Benefits of Domain Knowledge in Substructure 317–356. Discovery. In Proceedings of KDD-95: First Interna- Hand, D. J. 1981. Discrimination and Classification. tional Conference on Knowledge Discovery and Chichester, U.K.: Wiley. Data Mining, 75–80. Menlo Park, Calif.: American Heckerman, D. 1996. Bayesian Networks for Knowl- Association for Artificial Intelligence. edge Discovery. In Advances in Knowledge Discovery Dzeroski, S. 1996. Inductive Logic Programming for and Data Mining, eds. U. Fayyad, G. Piatetsky- Knowledge Discovery in Databases. In Advances in Shapiro, P. Smyth, and R. Uthurusamy, 273–306. Knowledge Discovery and Data Mining, eds. U. Menlo Park, Calif.: AAAI Press. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Hernandez, M., and Stolfo, S. 1995. The MERGE - Uthurusamy, 59–82. Menlo Park, Calif.: AAAI Press. PURGE Problem for Large Databases. 
In Proceedings Elder, J., and Pregibon, D. 1996. A Statistical Per- of the 1995 ACM-SIGMOD Conference, 127–138. spective on KDD. In Advances in Knowledge Discov- New York: Association for Computing Machinery. ery and Data Mining, eds. U. Fayyad, G. Piatetsky- Holsheimer, M.; Kersten, M. L.; Mannila, H.; and Shapiro, P. Smyth, and R. Uthurusamy, 83–116. Toivonen, H. 1996. Data Surveyor: Searching the Menlo Park, Calif.: AAAI Press. Nuggets in Parallel. In Advances in Knowledge Dis- Etzioni, O. 1996. The World Wide Web: Quagmire covery and Data Mining, eds. U. Fayyad, G. Piatet- or Gold Mine? Communications of the ACM (Special sky-Shapiro, P. Smyth, and R. Uthurusamy, Issue on Data Mining). November 1996. Forthcom- 447–471. Menlo Park, Calif.: AAAI Press. ing. Horvitz, E., and Jensen, F. 1996. Proceedings of the Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. Twelfth Conference of Uncertainty in Artificial Intelli- From Digitized Images to On-Line Catalogs: Data gence. San Mateo, Calif.: Morgan Kaufmann. Mining a Sky Survey. AI Magazine 17(2): 51–66. Jain, A. K., and Dubes, R. C. 1988. Algorithms for Fayyad, U. M.; Haussler, D.; and Stolorz, Z. 1996. Clustering Data. Englewood Cliffs, N.J.: Prentice- KDD for Science Data Analysis: Issues and Exam- Hall. ples. In Proceedings of the Second International Kloesgen, W. 1996. A Multipattern and Multistrate- Conference on Knowledge Discovery and Data gy Discovery Assistant. In Advances in Knowledge Mining (KDD-96), 50–56. Menlo Park, Calif.: Amer- Discovery and Data Mining, eds. U. Fayyad, G. Piatet- ican Association for Artificial Intelligence. sky-Shapiro, P. Smyth, and R. Uthurusamy, Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 249–271. Menlo Park, Calif.: AAAI Press. 1996. From Data Mining to Knowledge Discovery: Kloesgen, W., and Zytkow, J. 1996. Knowledge Dis- An Overview. In Advances in Knowledge Discovery covery in Databases Terminology. In Advances in and Data Mining, eds. U. Fayyad, G. 
Piatetsky- Knowledge Discovery and Data Mining, eds. U. Fayyad, Shapiro, P. Smyth, and R. Uthurusamy, 1–30. Men- G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, lo Park, Calif.: AAAI Press. 569–588. Menlo Park, Calif.: AAAI Press. Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Kolodner, J. 1993. Case-Based Reasoning. San Fran- Uthurusamy, R. 1996. Advances in Knowledge Dis- cisco, Calif.: Morgan Kaufmann. covery and Data Mining. Menlo Park, Calif.: AAAI Langley, P., and Simon, H. A. 1995. Applications of Press. Machine Learning and Rule Induction. Communica- Friedman, J. H. 1989. Multivariate Adaptive Regres- tions of the ACM 38:55–64. sion Splines. Annals of Statistics 19:1–141. Major, J., and Mangano, J. 1995. Selecting among Geman, S.; Bienenstock, E.; and Doursat, R. 1992. Rules Induced from a Hurricane Database. Journal Neural Networks and the Bias/Variance Dilemma. of Intelligent Information Systems 4(1): 39–52. Neural Computation 4:1–58. Manago, M., and Auriol, M. 1996. Mining for OR. Glymour, C.; Madigan, D.; Pregibon, D.; and ORMS Today (Special Issue on Data Mining), Febru- Smyth, P. 1996. Statistics and Data Mining. Com- ary, 28–32. munications of the ACM (Special Issue on Data Min- Mannila, H.; Toivonen, H.; and Verkamo, A. I. ing). November 1996. Forthcoming. 1995. Discovering Frequent Episodes in Sequences. Glymour, C.; Scheines, R.; Spirtes, P.; Kelly, K. 1987. In Proceedings of the First International Confer- Discovering Causal Structure. New York: Academic. ence on Knowledge Discovery and Data Mining Guyon, O.; Matic, N.; and Vapnik, N. 1996. Discov- (KDD-95), 210–215. Menlo Park, Calif.: American 52 AI MAGAZINE
  • 17. Articles Association for Artificial Intelligence. Spirtes, P.; Glymour, C.; and Scheines, R. 1993. Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. Causation, Prediction, and Search. New York: 1996. Selecting and Reporting What Is Interesting: Springer-Verlag. The KEfiR Application to Healthcare Data. In Ad- Stolorz, P.; Nakamura, H.; Mesrobian, E.; Muntz, R.; vances in Knowledge Discovery and Data Mining, eds. Shek, E.; Santos, J.; Yi, J.; Ng, K.; Chien, S.; Me- U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. choso, C.; and Farrara, J. 1995. Fast Spatio-Tempo- Uthurusamy, 495–516. Menlo Park, Calif.: AAAI ral Data Mining of Large Geophysical Datasets. In Press. Proceedings of KDD-95: First International Confer- Pearl, J. 1988. Probabilistic Reasoning in Intelligent ence on Knowledge Discovery and Data Mining, Systems. San Francisco, Calif.: Morgan Kaufmann. 300–305. Menlo Park, Calif.: American Association for Artificial Intelligence. Piatetsky-Shapiro, G. 1995. Knowledge Discovery Titterington, D. M.; Smith, A. F. M.; and Makov, U. in Personal Data versus Privacy—A Mini-Sympo- E. 1985. Statistical Analysis of Finite-Mixture Distribu- sium. IEEE Expert 10(5). tions. Chichester, U.K.: Wiley. Piatetsky-Shapiro, G. 1991. Knowledge Discovery U.S. News. 1995. Basketball’s New High-Tech Guru: in Real Databases: A Report on the IJCAI-89 Work- IBM Software Is Changing Coaches’ Game Plans. shop. AI Magazine 11(5): 68–70. U.S. News and World Report, 11 December. Piatetsky-Shapiro, G., and Matheus, C. 1994. The Weigend, A., and Gershenfeld, N., eds. 1993. Pre- Interestingness of Deviations. In Proceedings of dicting the Future and Understanding the Past. Red- KDD-94, eds. U. M. Fayyad and R. Uthurusamy. wood City, Calif.: Addison-Wesley. Technical Report WS-03. Menlo Park, Calif.: AAAI Press. Weiss, S. I., and Kulikowski, C. 1991. 
Computer Sys- tems That Learn: Classification and Prediction Meth- Piatetsky-Shapiro, G.; Brachman, R.; Khabaza, T.; ods from Statistics, Neural Networks, Machine Learn- Kloesgen, W.; and Simoudis, E., 1996. An Overview ing, and Expert Systems. San Francisco, Calif.: of Issues in Developing Industrial Data Mining and Morgan Kaufmann. Knowledge Discovery Applications. In Proceedings Whittaker, J. 1990. Graphical Models in Applied Mul- of the Second International Conference on Knowl- tivariate Statistics. New York: Wiley. edge Discovery and Data Mining (KDD-96), eds. J. Han and E. Simoudis, 89–95. Menlo Park, Calif.: Zembowicz, R., and Zytkow, J. 1996. From Contin- American Association for Artificial Intelligence. gency Tables to Various Forms of Knowledge in Databases. In Advances in Knowledge Discovery and Quinlan, J. 1992. C4.5: Programs for Machine Learn- Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. ing. San Francisco, Calif.: Morgan Kaufmann. Smyth, and R. Uthurusamy, 329–351. Menlo Park, Ripley, B. D. 1994. Neural Networks and Related Calif.: AAAI Press. Methods for Classification. Journal of the Royal Sta- tistical Society B. 56(3): 409–437. Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M. A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.; Marrone, M. P.; and Wong, R. W. H. 1995. The Fi- Usama Fayyad is a senior re- nancial Crimes Enforcement Network AI System searcher at Microsoft Research. ( FAIS ): Identifying Potential Money Laundering He received his Ph.D. in 1991 from Reports of Large Cash Transactions. AI Maga- from the University of Michigan zine 16(4): 21–39. at Ann Arbor. Prior to joining Mi- crosoft in 1996, he headed the Shrager, J., and Langley, P., eds. 1990. Computation- Machine Learning Systems Group al Models of Scientific Discovery and Theory Forma- at the Jet Propulsion Laboratory tion. San Francisco, Calif.: Morgan Kaufmann. (JPL), California Institute of Tech- Silberschatz, A., and Tuzhilin, A. 1995. 
On Subjec- nology, where he developed data-mining systems tive Measures of Interestingness in Knowledge Dis- for automated science data analysis. He remains covery. In Proceedings of KDD-95: First Interna- affiliated with JPL as a distinguished visiting scien- tional Conference on Knowledge Discovery and tist. Fayyad received the JPL 1993 Lew Allen Award Data Mining, 275–281. Menlo Park, Calif.: Ameri- for Excellence in Research and the 1994 National can Association for Artificial Intelligence. Aeronautics and Space Administration Exceptional Silverman, B. 1986. Density Estimation for Statistics Achievement Medal. His research interests include and Data Analysis. New York: Chapman and Hall. knowledge discovery in large databases, data min- Simoudis, E.; Livezey, B.; and Kerber, R. 1995. Using ing, machine-learning theory and applications, sta- Recon for Data Cleaning. In Proceedings of KDD-95: tistical pattern recognition, and clustering. He was First International Conference on Knowledge Discov- program cochair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery ery and Data Mining, 275–281. Menlo Park, Calif.: and Data Mining). He is general chair of KDD-96, American Association for Artificial Intelligence. an editor in chief of the journal Data Mining and Smyth, P.; Burl, M.; Fayyad, U.; and Perona, P. Knowledge Discovery, and coeditor of the 1996 AAAI 1996. Modeling Subjective Uncertainty in Image Press book Advances in Knowledge Discovery and Da- Annotation. In Advances in Knowledge Discovery and ta Mining. Data Mining, 517–540. Menlo Park, Calif.: AAAI Press. FALL 1996 53
  • 18. Articles Gregory Piatetsky-Shapiro is a cal Engineering Departments at Caltech (1994) and principal member of the technical regularly conducts tutorials on probabilistic learn- staff at GTE Laboratories and the ing algorithms at national conferences (including principal investigator of the UAI-93, AAAI-94, CAIA-95, IJCAI-95). He is general Knowledge Discovery in Databas- chair of the Sixth International Workshop on AI es (KDD) Project, which focuses and Statistics, to be held in 1997. Smyth’s research on developing and deploying ad- interests include statistical pattern recognition, ma- vanced KDD systems for business chine learning, decision theory, probabilistic rea- applications. Previously, he soning, information theory, and the application of worked on applying intelligent front ends to het- probability and statistics in AI. He has published 16 erogeneous databases. Piatetsky-Shapiro received journal papers, 10 book chapters, and 60 confer- several GTE awards, including GTE’s highest tech- ence papers on these topics. nical achievement award for the KEfiR system for health-care data analysis. His research interests in- clude intelligent database systems, dependency networks, and Internet resource discovery. Prior to GTE, he worked at Strategic Information develop- ing financial database systems. Piatetsky-Shapiro re- ceived his M.S. in 1979 and his Ph.D. in 1984, both from New York University (NYU). His Ph.D. disser- tation on self-organizing database systems received NYU awards as the best dissertation in computer science and in all natural sciences. Piatetsky- Shapiro organized and chaired the first three (1989, 1991, and 1993) KDD workshops and helped in de- veloping them into successful conferences (KDD-95 and KDD-96). He has also been on the program committees of numerous other conferences and workshops on AI and databases. 
He edited and coedited several collections on KDD, including two books—Knowledge Discovery in Databases (AAAI Press, 1991) and Advances in Knowledge Discovery in Databases (AAAI Press, 1996)—and has many other publications in the areas of AI and databases. He is a coeditor in chief of the new Data Mining and Knowledge Discovery journal. Piatetsky-Shapiro founded and moderates the KDD Nuggets electronic AAAI 97 newsletter (kdd@gte.com) and is the web master for Knowledge Discovery Mine (<http://info.gte.com/ ~kdd /index.html>). Providence, Rhode Island Padhraic Smyth received a first- class-honors Bachelor of Engi- July 27–31, 1997 neering from the National Uni- versity of Ireland in 1984 and an MSEE and a Ph.D. from the Elec- trical Engineering Department at the California Institute of Tech- nology (Caltech) in 1985 and Title pages due January 6, 1997 1988, respectively. From 1988 to 1996, he was a technical group leader at the Jet Papers due January 8, 1997 Propulsion Laboratory (JPL). Since April 1996, he has been a faculty member in the Information and Camera copy due April 2, 1997 Computer Science Department at the University of California at Irvine. He is also currently a principal investigator at JPL (part-time) and is a consultant to ncai@aaaai.org private industry. Smyth received the Lew Allen Award for Excellence in Research at JPL in 1993 http://www.aaai.org/ and has been awarded 14 National Aeronautics and Space Administration certificates for technical in- Conferences/National/1997/aaai97.html novation since 1991. He was coeditor of the book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996). Smyth was a visiting lecturer in the Computational and Neural Systems and Electri- 54 AI MAGAZINE