Data Mining:Concepts and TechniquesSecond Edition
The Morgan Kaufmann Series in Data Management SystemsSeries Editor: Jim Gray, Microsoft ResearchData Mining: Concepts and Techniques, Second EditionJiawei Han and Micheline KamberQuerying XML: XQuery, XPath, and SQL/XML in contextJim Melton and Stephen BuxtonFoundations of Multidimensional and Metric Data StructuresHanan SametDatabase Modeling and Design: Logical Design, Fourth EditionToby J. Teorey, Sam S. Lightstone and Thomas P. NadeauJoe Celko’s SQL for Smarties: Advanced SQL Programming, Third EditionJoe CelkoMoving Objects DatabasesRalf Guting and Markus SchneiderJoe Celko’s SQL Programming StyleJoe CelkoData Mining: Practical Machine Learning Tools and Techniques, Second EditionIan Witten and Eibe FrankFuzzy Modeling and Genetic Algorithms for Data Mining and ExplorationEarl CoxData Modeling Essentials, Third EditionGraeme C. Simsion and Graham C. WittLocation-Based ServicesJochen Schiller and Agnès VoisardDatabase Modeling with Microsft® Visio for Enterprise ArchitectsTerry Halpin, Ken Evans, Patrick Hallock, Bill MacleanDesigning Data-Intensive Web ApplicationsStephano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella MateraMining the Web: Discovering Knowledge from Hypertext DataSoumen ChakrabartiAdvanced SQL:II 1999—Understanding Object-Relational and Other Advanced FeaturesJim MeltonDatabase Tuning: Principles, Experiments, and Troubleshooting TechniquesDennis Shasha and Philippe BonnetSQL:1999—Understanding Relational Language ComponentsJim Melton and Alan R. SimonInformation Visualization in Data Mining and Knowledge DiscoveryEdited by Usama Fayyad, Georges G. Grinstein, and Andreas WierseTransactional Information Systems: Theory, Algorithms, and Practice of ConcurrencyControl and RecoveryGerhard Weikum and Gottfried VossenSpatial Databases: With Application to GISPhilippe Rigaux, Michel Scholl, and Agnes VoisardInformation Modeling and Relational Databases: From Conceptual Analysis to Logical DesignTerry HalpinComponent Database SystemsEdited by Klaus R. Dittrich and Andreas Geppert
Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider WorldMalcolm ChisholmData Mining: Concepts and TechniquesJiawei Han and Micheline KamberUnderstanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related TechnologiesJim Melton and Andrew EisenbergDatabase: Principles, Programming, and Performance, Second EditionPatrick and Elizabeth O’NeilThe Object Data Standard: ODMG 3.0Edited by R. G. G. Cattell and Douglas K. BarryData on the Web: From Relations to Semistructured Data and XMLSerge Abiteboul, Peter Buneman, and Dan SuciuData Mining: Practical Machine Learning Tools and Techniques with Java ImplementationsIan Witten and Eibe FrankJoe Celko’s SQL for Smarties: Advanced SQL Programming, Second EditionJoe CelkoJoe Celko’s Data and Databases: Concepts in PracticeJoe CelkoDeveloping Time-Oriented Database Applications in SQLRichard T. SnodgrassWeb Farming for the Data WarehouseRichard D. HackathornManagement of Heterogeneous and Autonomous Database SystemsEdited by Ahmed Elmagarmid, Marek Rusinkiewicz, and Amit ShethObject-Relational DBMSs: Tracking the Next Great Wave, Second EditionMichael Stonebraker and Paul Brown,with Dorothy MooreA Complete Guide to DB2 Universal DatabaseDon ChamberlinUniversal Database Management: A Guide to Object/Relational TechnologyCynthia Maro SaraccoReadings in Database Systems, Third EditionEdited by Michael Stonebraker and Joseph M. HellersteinUnderstanding SQL’s Stored Procedures: A Complete Guide to SQL/PSMJim MeltonPrinciples of Multimedia Database SystemsV. S. SubrahmanianPrinciples of Database Query Processing for Advanced ApplicationsClement T. Yu and Weiyi MengAdvanced Database SystemsCarlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass,V. S. Subrahmanian, and Roberto ZicariPrinciples of Transaction ProcessingPhilip A. Bernstein and Eric NewcomerUsing the New DB2: IBMs Object-Relational Database SystemDon ChamberlinDistributed AlgorithmsNancy A. Lynch
Active Database Systems: Triggers and Rules For Advanced Database ProcessingEdited by Jennifer Widom and Stefano CeriMigrating Legacy Systems: Gateways, Interfaces, & the Incremental ApproachMichael L. Brodie and Michael StonebrakerAtomic TransactionsNancy Lynch, Michael Merritt, William Weihl, and Alan FeketeQuery Processing for Advanced Database SystemsEdited by Johann Christoph Freytag, David Maier, and Gottfried VossenTransaction Processing: Concepts and TechniquesJim Gray and Andreas ReuterBuilding an Object-Oriented Database System: The Story of O2Edited by François Bancilhon, Claude Delobel, and Paris KanellakisDatabase Transaction Models for Advanced ApplicationsEdited by Ahmed K. ElmagarmidA Guide to Developing Client/Server SQL ApplicationsSetrag Khoshaﬁan, Arvola Chan, Anna Wong, and Harry K. T. WongThe Benchmark Handbook for Database and Transaction Processing Systems, Second EditionEdited by Jim GrayCamelot and Avalon: A Distributed Transaction FacilityEdited by Jeffrey L. Eppinger, Lily B. Mummert, and Alfred Z. SpectorReadings in Object-Oriented Database SystemsEdited by Stanley B. Zdonik and David Maier
Data Mining:Concepts and TechniquesSecond EditionJiawei HanUniversity of Illinois at Urbana-ChampaignMicheline KamberA M S T E R D A M B O S T O NH E I D E L B E R G L O N D O NN E W Y O R K O X F O R D P A R I SS A N D I E G O S A N F R A N C I S C OS I N G A P O R E S Y D N E Y T O K Y O
Publisher Diane CerraPublishing Services Managers Simon Crump, George MorrisonEditorial Assistant Asma StephanCover Design Ross Carron DesignCover Mosaic c Image Source/Getty ImagesComposition diacriTechTechnical Illustration Dartmouth Publishing, Inc.Copyeditor Multiscience PressProofreader Multiscience PressIndexer Multiscience PressInterior printer Maple-Vail Book Manufacturing GroupCover printer Phoenix ColorMorgan Kaufmann Publishers is an imprint of Elsevier.500 Sansome Street, Suite 400, San Francisco, CA 94111This book is printed on acid-free paper.c 2006 by Elsevier Inc. All rights reserved.Designations used by companies to distinguish their products are often claimed as trademarks orregistered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim,the product names appear in initial capital or all capital letters. Readers, however, should contactthe appropriate companies for more complete information regarding trademarks andregistration.No part of this publication may be reproduced, stored in a retrieval system, or transmitted in anyform or by any means—electronic, mechanical, photocopying, scanning, or otherwise—withoutprior written permission of the publisher.Permissions may be sought directly from Elsevier’s Science & Technology Rights Department inOxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:email@example.com. You may also complete your request on-line via the Elsevier homepage(http://elsevier.com) by selecting “Customer Support” and then “Obtaining Permissions.”Library of Congress Cataloging-in-Publication DataApplication submittedISBN 13: 978-1-55860-901-3ISBN 10: 1-55860-901-6For information on all Morgan Kaufmann publications, visit our Web site atwww.mkp.com or www.books.elsevier.comPrinted in the United States of America06 07 08 09 10 5 4 3 2 1
DedicationTo Y. Dora and Lawrence for your love and encouragementJ.H.To Erik, Kevan, Kian, and Mikael for your love and inspirationM.K.vii
ContentsForeword xixPreface xxiChapter 1 Introduction 11.1 What Motivated Data Mining? Why Is It Important? 11.2 So, What Is Data Mining? 51.3 Data Mining—On What Kind of Data? 91.3.1 Relational Databases 101.3.2 Data Warehouses 121.3.3 Transactional Databases 141.3.4 Advanced Data and Information Systems and AdvancedApplications 151.4 Data Mining Functionalities—What Kinds of Patterns Can BeMined? 211.4.1 Concept/Class Description: Characterization andDiscrimination 211.4.2 Mining Frequent Patterns, Associations, and Correlations 231.4.3 Classiﬁcation and Prediction 241.4.4 Cluster Analysis 251.4.5 Outlier Analysis 261.4.6 Evolution Analysis 271.5 Are All of the Patterns Interesting? 271.6 Classiﬁcation of Data Mining Systems 291.7 Data Mining Task Primitives 311.8 Integration of a Data Mining System witha Database or Data Warehouse System 341.9 Major Issues in Data Mining 36ix
x Contents1.10 Summary 39Exercises 40Bibliographic Notes 42Chapter 2 Data Preprocessing 472.1 Why Preprocess the Data? 482.2 Descriptive Data Summarization 512.2.1 Measuring the Central Tendency 512.2.2 Measuring the Dispersion of Data 532.2.3 Graphic Displays of Basic Descriptive Data Summaries 562.3 Data Cleaning 612.3.1 Missing Values 612.3.2 Noisy Data 622.3.3 Data Cleaning as a Process 652.4 Data Integration and Transformation 672.4.1 Data Integration 672.4.2 Data Transformation 702.5 Data Reduction 722.5.1 Data Cube Aggregation 732.5.2 Attribute Subset Selection 752.5.3 Dimensionality Reduction 772.5.4 Numerosity Reduction 802.6 Data Discretization and Concept Hierarchy Generation 862.6.1 Discretization and Concept Hierarchy Generation forNumerical Data 882.6.2 Concept Hierarchy Generation for Categorical Data 942.7 Summary 97Exercises 97Bibliographic Notes 101Chapter 3 Data Warehouse and OLAP Technology: An Overview 1053.1 What Is a Data Warehouse? 1053.1.1 Differences between Operational Database Systemsand Data Warehouses 1083.1.2 But, Why Have a Separate Data Warehouse? 1093.2 A Multidimensional Data Model 1103.2.1 From Tables and Spreadsheets to Data Cubes 1103.2.2 Stars, Snowﬂakes, and Fact Constellations:Schemas for Multidimensional Databases 1143.2.3 Examples for Deﬁning Star, Snowﬂake,and Fact Constellation Schemas 117
Contents xi3.2.4 Measures: Their Categorization and Computation 1193.2.5 Concept Hierarchies 1213.2.6 OLAP Operations in the Multidimensional Data Model 1233.2.7 A Starnet Query Model for QueryingMultidimensional Databases 1263.3 Data Warehouse Architecture 1273.3.1 Steps for the Design and Construction of Data Warehouses 1283.3.2 A Three-Tier Data Warehouse Architecture 1303.3.3 Data Warehouse Back-End Tools and Utilities 1343.3.4 Metadata Repository 1343.3.5 Types of OLAP Servers: ROLAP versus MOLAPversus HOLAP 1353.4 Data Warehouse Implementation 1373.4.1 Efﬁcient Computation of Data Cubes 1373.4.2 Indexing OLAP Data 1413.4.3 Efﬁcient Processing of OLAP Queries 1443.5 From Data Warehousing to Data Mining 1463.5.1 Data Warehouse Usage 1463.5.2 From On-Line Analytical Processingto On-Line Analytical Mining 1483.6 Summary 150Exercises 152Bibliographic Notes 154Chapter 4 Data Cube Computation and Data Generalization 1574.1 Efﬁcient Methods for Data Cube Computation 1574.1.1 A Road Map for the Materialization of Different Kindsof Cubes 1584.1.2 Multiway Array Aggregation for Full Cube Computation 1644.1.3 BUC: Computing Iceberg Cubes from the Apex CuboidDownward 1684.1.4 Star-cubing: Computing Iceberg Cubes Usinga Dynamic Star-tree Structure 1734.1.5 Precomputing Shell Fragments for Fast High-DimensionalOLAP 1784.1.6 Computing Cubes with Complex Iceberg Conditions 1874.2 Further Development of Data Cube and OLAPTechnology 1894.2.1 Discovery-Driven Exploration of Data Cubes 1894.2.2 Complex Aggregation at Multiple Granularity:Multifeature Cubes 1924.2.3 Constrained Gradient Analysis in Data Cubes 195
xii Contents4.3 Attribute-Oriented Induction—An AlternativeMethod for Data Generalization and Concept Description 1984.3.1 Attribute-Oriented Induction for Data Characterization 1994.3.2 Efﬁcient Implementation of Attribute-Oriented Induction 2054.3.3 Presentation of the Derived Generalization 2064.3.4 Mining Class Comparisons: Discriminating betweenDifferent Classes 2104.3.5 Class Description: Presentation of Both Characterizationand Comparison 2154.4 Summary 218Exercises 219Bibliographic Notes 223Chapter 5 Mining Frequent Patterns, Associations, and Correlations 2275.1 Basic Concepts and a Road Map 2275.1.1 Market Basket Analysis: A Motivating Example 2285.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules 2305.1.3 Frequent Pattern Mining: A Road Map 2325.2 Efﬁcient and Scalable Frequent Itemset Mining Methods 2345.2.1 The Apriori Algorithm: Finding Frequent Itemsets UsingCandidate Generation 2345.2.2 Generating Association Rules from Frequent Itemsets 2395.2.3 Improving the Efﬁciency of Apriori 2405.2.4 Mining Frequent Itemsets without Candidate Generation 2425.2.5 Mining Frequent Itemsets Using Vertical Data Format 2455.2.6 Mining Closed Frequent Itemsets 2485.3 Mining Various Kinds of Association Rules 2505.3.1 Mining Multilevel Association Rules 2505.3.2 Mining Multidimensional Association Rulesfrom Relational Databases and Data Warehouses 2545.4 From Association Mining to Correlation Analysis 2595.4.1 Strong Rules Are Not Necessarily Interesting: An Example 2605.4.2 From Association Analysis to Correlation Analysis 2615.5 Constraint-Based Association Mining 2655.5.1 Metarule-Guided Mining of Association Rules 2665.5.2 Constraint Pushing: Mining Guided by Rule Constraints 2675.6 Summary 272Exercises 274Bibliographic Notes 280
Contents xiiiChapter 6 Classiﬁcation and Prediction 2856.1 What Is Classiﬁcation? What Is Prediction? 2856.2 Issues Regarding Classiﬁcation and Prediction 2896.2.1 Preparing the Data for Classiﬁcation and Prediction 2896.2.2 Comparing Classiﬁcation and Prediction Methods 2906.3 Classiﬁcation by Decision Tree Induction 2916.3.1 Decision Tree Induction 2926.3.2 Attribute Selection Measures 2966.3.3 Tree Pruning 3046.3.4 Scalability and Decision Tree Induction 3066.4 Bayesian Classiﬁcation 3106.4.1 Bayes’ Theorem 3106.4.2 Naïve Bayesian Classiﬁcation 3116.4.3 Bayesian Belief Networks 3156.4.4 Training Bayesian Belief Networks 3176.5 Rule-Based Classiﬁcation 3186.5.1 Using IF-THEN Rules for Classiﬁcation 3196.5.2 Rule Extraction from a Decision Tree 3216.5.3 Rule Induction Using a Sequential Covering Algorithm 3226.6 Classiﬁcation by Backpropagation 3276.6.1 A Multilayer Feed-Forward Neural Network 3286.6.2 Deﬁning a Network Topology 3296.6.3 Backpropagation 3296.6.4 Inside the Black Box: Backpropagation and Interpretability 3346.7 Support Vector Machines 3376.7.1 The Case When the Data Are Linearly Separable 3376.7.2 The Case When the Data Are Linearly Inseparable 3426.8 Associative Classiﬁcation: Classiﬁcation by AssociationRule Analysis 3446.9 Lazy Learners (or Learning from Your Neighbors) 3476.9.1 k-Nearest-Neighbor Classiﬁers 3486.9.2 Case-Based Reasoning 3506.10 Other Classiﬁcation Methods 3516.10.1 Genetic Algorithms 3516.10.2 Rough Set Approach 3516.10.3 Fuzzy Set Approaches 3526.11 Prediction 3546.11.1 Linear Regression 3556.11.2 Nonlinear Regression 3576.11.3 Other Regression-Based Methods 358
xiv Contents6.12 Accuracy and Error Measures 3596.12.1 Classiﬁer Accuracy Measures 3606.12.2 Predictor Error Measures 3626.13 Evaluating the Accuracy of a Classiﬁer or Predictor 3636.13.1 Holdout Method and Random Subsampling 3646.13.2 Cross-validation 3646.13.3 Bootstrap 3656.14 Ensemble Methods—Increasing the Accuracy 3666.14.1 Bagging 3666.14.2 Boosting 3676.15 Model Selection 3706.15.1 Estimating Conﬁdence Intervals 3706.15.2 ROC Curves 3726.16 Summary 373Exercises 375Bibliographic Notes 378Chapter 7 Cluster Analysis 3837.1 What Is Cluster Analysis? 3837.2 Types of Data in Cluster Analysis 3867.2.1 Interval-Scaled Variables 3877.2.2 Binary Variables 3897.2.3 Categorical, Ordinal, and Ratio-Scaled Variables 3927.2.4 Variables of Mixed Types 3957.2.5 Vector Objects 3977.3 A Categorization of Major Clustering Methods 3987.4 Partitioning Methods 4017.4.1 Classical Partitioning Methods: k-Means and k-Medoids 4027.4.2 Partitioning Methods in Large Databases: Fromk-Medoids to CLARANS 4077.5 Hierarchical Methods 4087.5.1 Agglomerative and Divisive Hierarchical Clustering 4087.5.2 BIRCH: Balanced Iterative Reducing and ClusteringUsing Hierarchies 4127.5.3 ROCK: A Hierarchical Clustering Algorithm forCategorical Attributes 4147.5.4 Chameleon: A Hierarchical Clustering AlgorithmUsing Dynamic Modeling 4167.6 Density-Based Methods 4187.6.1 DBSCAN: A Density-Based Clustering Method Based onConnected Regions with Sufﬁciently High Density 418
Contents xv7.6.2 OPTICS: Ordering Points to Identify the ClusteringStructure 4207.6.3 DENCLUE: Clustering Based on DensityDistribution Functions 4227.7 Grid-Based Methods 4247.7.1 STING: STatistical INformation Grid 4257.7.2 WaveCluster: Clustering Using Wavelet Transformation 4277.8 Model-Based Clustering Methods 4297.8.1 Expectation-Maximization 4297.8.2 Conceptual Clustering 4317.8.3 Neural Network Approach 4337.9 Clustering High-Dimensional Data 4347.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method 4367.9.2 PROCLUS: A Dimension-Reduction Subspace ClusteringMethod 4397.9.3 Frequent Pattern–Based Clustering Methods 4407.10 Constraint-Based Cluster Analysis 4447.10.1 Clustering with Obstacle Objects 4467.10.2 User-Constrained Cluster Analysis 4487.10.3 Semi-Supervised Cluster Analysis 4497.11 Outlier Analysis 4517.11.1 Statistical Distribution-Based Outlier Detection 4527.11.2 Distance-Based Outlier Detection 4547.11.3 Density-Based Local Outlier Detection 4557.11.4 Deviation-Based Outlier Detection 4587.12 Summary 460Exercises 461Bibliographic Notes 464Chapter 8 Mining Stream, Time-Series, and Sequence Data 4678.1 Mining Data Streams 4688.1.1 Methodologies for Stream Data Processing andStream Data Systems 4698.1.2 Stream OLAP and Stream Data Cubes 4748.1.3 Frequent-Pattern Mining in Data Streams 4798.1.4 Classiﬁcation of Dynamic Data Streams 4818.1.5 Clustering Evolving Data Streams 4868.2 Mining Time-Series Data 4898.2.1 Trend Analysis 4908.2.2 Similarity Search in Time-Series Analysis 493
xvi Contents8.3 Mining Sequence Patterns in Transactional Databases 4988.3.1 Sequential Pattern Mining: Concepts and Primitives 4988.3.2 Scalable Methods for Mining Sequential Patterns 5008.3.3 Constraint-Based Mining of Sequential Patterns 5098.3.4 Periodicity Analysis for Time-Related Sequence Data 5128.4 Mining Sequence Patterns in Biological Data 5138.4.1 Alignment of Biological Sequences 5148.4.2 Hidden Markov Model for Biological Sequence Analysis 5188.5 Summary 527Exercises 528Bibliographic Notes 531Chapter 9 Graph Mining, Social Network Analysis, and MultirelationalData Mining 5359.1 Graph Mining 5359.1.1 Methods for Mining Frequent Subgraphs 5369.1.2 Mining Variant and Constrained Substructure Patterns 5459.1.3 Applications: Graph Indexing, Similarity Search, Classiﬁcation,and Clustering 5519.2 Social Network Analysis 5569.2.1 What Is a Social Network? 5569.2.2 Characteristics of Social Networks 5579.2.3 Link Mining: Tasks and Challenges 5619.2.4 Mining on Social Networks 5659.3 Multirelational Data Mining 5719.3.1 What Is Multirelational Data Mining? 5719.3.2 ILP Approach to Multirelational Classiﬁcation 5739.3.3 Tuple ID Propagation 5759.3.4 Multirelational Classiﬁcation Using Tuple ID Propagation 5779.3.5 Multirelational Clustering with User Guidance 5809.4 Summary 584Exercises 586Bibliographic Notes 587Chapter 10 Mining Object, Spatial, Multimedia, Text, and Web Data 59110.1 Multidimensional Analysis and Descriptive Mining of ComplexData Objects 59110.1.1 Generalization of Structured Data 59210.1.2 Aggregation and Approximation in Spatial and Multimedia DataGeneralization 593
Contents xvii10.1.3 Generalization of Object Identiﬁers and Class/SubclassHierarchies 59410.1.4 Generalization of Class Composition Hierarchies 59510.1.5 Construction and Mining of Object Cubes 59610.1.6 Generalization-Based Mining of Plan Databases byDivide-and-Conquer 59610.2 Spatial Data Mining 60010.2.1 Spatial Data Cube Construction and Spatial OLAP 60110.2.2 Mining Spatial Association and Co-location Patterns 60510.2.3 Spatial Clustering Methods 60610.2.4 Spatial Classiﬁcation and Spatial Trend Analysis 60610.2.5 Mining Raster Databases 60710.3 Multimedia Data Mining 60710.3.1 Similarity Search in Multimedia Data 60810.3.2 Multidimensional Analysis of Multimedia Data 60910.3.3 Classiﬁcation and Prediction Analysis of Multimedia Data 61110.3.4 Mining Associations in Multimedia Data 61210.3.5 Audio and Video Data Mining 61310.4 Text Mining 61410.4.1 Text Data Analysis and Information Retrieval 61510.4.2 Dimensionality Reduction for Text 62110.4.3 Text Mining Approaches 62410.5 Mining the World Wide Web 62810.5.1 Mining the Web Page Layout Structure 63010.5.2 Mining the Web’s Link Structures to IdentifyAuthoritative Web Pages 63110.5.3 Mining Multimedia Data on the Web 63710.5.4 Automatic Classiﬁcation of Web Documents 63810.5.5 Web Usage Mining 64010.6 Summary 641Exercises 642Bibliographic Notes 645Chapter 11 Applications and Trends in Data Mining 64911.1 Data Mining Applications 64911.1.1 Data Mining for Financial Data Analysis 64911.1.2 Data Mining for the Retail Industry 65111.1.3 Data Mining for the Telecommunication Industry 65211.1.4 Data Mining for Biological Data Analysis 65411.1.5 Data Mining in Other Scientiﬁc Applications 65711.1.6 Data Mining for Intrusion Detection 658
xviii Contents11.2 Data Mining System Products and Research Prototypes 66011.2.1 How to Choose a Data Mining System 66011.2.2 Examples of Commercial Data Mining Systems 66311.3 Additional Themes on Data Mining 66511.3.1 Theoretical Foundations of Data Mining 66511.3.2 Statistical Data Mining 66611.3.3 Visual and Audio Data Mining 66711.3.4 Data Mining and Collaborative Filtering 67011.4 Social Impacts of Data Mining 67511.4.1 Ubiquitous and Invisible Data Mining 67511.4.2 Data Mining, Privacy, and Data Security 67811.5 Trends in Data Mining 68111.6 Summary 684Exercises 685Bibliographic Notes 687Appendix An Introduction to Microsoft’s OLE DB forData Mining 691A.1 Model Creation 693A.2 Model Training 695A.3 Model Prediction and Browsing 697Bibliography 703Index 745
ForewordWe are deluged by data—scientiﬁc data, medical data, demographic data, ﬁnancial data,and marketing data. People have no time to look at this data. Human attention hasbecome the precious resource. So, we must ﬁnd ways to automatically analyze the data,to automatically classify it, to automatically summarize it, to automatically discover andcharacterize trends in it, and to automatically ﬂag anomalies. This is one of the mostactive and exciting areas of the database research community. Researchers in areas includ-ing statistics, visualization, artiﬁcial intelligence, and machine learning are contributingto this ﬁeld. The breadth of the ﬁeld makes it difﬁcult to grasp the extraordinary progressover the last few decades.Six years ago, Jiawei Han’s and Micheline Kamber’s seminal textbook organized andpresented Data Mining. It heralded a golden age of innovation in the ﬁeld. This revisionof their book reﬂects that progress; more than half of the references and historical notesare to recent work. The ﬁeld has matured with many new and improved algorithms, andhas broadened to include many more datatypes: streams, sequences, graphs, time-series,geospatial, audio, images, and video. We are certainly not at the end of the golden age—indeed research and commercial interest in data mining continues to grow—but we areall fortunate to have this modern compendium.The book gives quick introductions to database and data mining concepts withparticular emphasis on data analysis. It then covers in a chapter-by-chapter tour the con-cepts and techniques that underlie classiﬁcation, prediction, association, and clustering.These topics are presented with examples, a tour of the best algorithms for each prob-lem class, and with pragmatic rules of thumb about when to apply each technique. TheSocratic presentation style is both very readable and very informative. I certainly learneda lot from reading the ﬁrst edition and got re-educated and updated in reading the secondedition.Jiawei Han and Micheline Kamber have been leading contributors to data miningresearch. This is the text they use with their students to bring them up to speed on thexix
xx Forewordﬁeld. The ﬁeld is evolving very rapidly, but this book is a quick way to learn the basic ideas,and to understand where the ﬁeld is today. I found it very informative and stimulating,and believe you will too.Jim GrayMicrosoft ResearchSan Francisco, CA, USA
PrefaceOur capabilities of both generating and collecting data have been increasing rapidly.Contributing factors include the computerization of business, scientiﬁc, and governmenttransactions; the widespread use of digital cameras, publication tools, and bar codes formost commercial products; and advances in data collection tools ranging from scannedtext and image platforms to satellite remote sensing systems. In addition, popular useof the World Wide Web as a global information system has ﬂooded us with a tremen-dous amount of data and information. This explosive growth in stored or transient datahas generated an urgent need for new techniques and automated tools that can intelli-gently assist us in transforming the vast amounts of data into useful information andknowledge.This book explores the concepts and techniques of data mining, a promising andﬂourishing frontier in data and information systems and their applications. Data mining,also popularly referred to as knowledge discovery from data (KDD), is the automated orconvenient extraction of patterns representing knowledge implicitly stored or capturedin large databases, data warehouses, the Web, other massive information repositories, ordata streams.Data mining is a multidisciplinary ﬁeld, drawing work from areas including databasetechnology, machine learning, statistics, pattern recognition, information retrieval,neural networks, knowledge-based systems, artiﬁcial intelligence, high-performancecomputing, and data visualization. We present techniques for the discovery of patternshidden in large data sets, focusing on issues relating to their feasibility, usefulness, effec-tiveness, and scalability. As a result, this book is not intended as an introduction todatabase systems, machine learning, statistics, or other such areas, although we do pro-vide the background necessary in these areas in order to facilitate the reader’s compre-hension of their respective roles in data mining. Rather, the book is a comprehensiveintroduction to data mining, presented with effectiveness and scalability issues in focus.It should be useful for computing science students, application developers, and businessprofessionals, as well as researchers involved in any of the disciplines listed above.Data mining emerged during the late 1980s, made great strides during the 1990s, andcontinues to ﬂourish into the new millennium. This book presents an overall pictureof the ﬁeld, introducing interesting data mining techniques and systems and discussingxxi
xxii Prefaceapplications and research directions. An important motivation for writing this book wasthe need to build an organized framework for the study of data mining—a challengingtask, owing to the extensive multidisciplinary nature of this fast-developing ﬁeld. Wehope that this book will encourage people with different backgrounds and experiencesto exchange their views regarding data mining so as to contribute toward the furtherpromotion and shaping of this exciting and dynamic ﬁeld.Organization of the BookSince the publication of the ﬁrst edition of this book, great progress has been made inthe ﬁeld of data mining. Many new data mining methods, systems, and applications havebeen developed. This new edition substantially revises the ﬁrst edition of the book, withnumerous enhancements and a reorganization of the technical contents of the entirebook. In addition, several new chapters are included to address recent developments onmining complex types of data, including stream data, sequence data, graph structureddata, social network data, and multirelational data.The chapters are described brieﬂy as follows, with emphasis on the new material.Chapter 1 provides an introduction to the multidisciplinary ﬁeld of data mining.It discusses the evolutionary path of database technology, which has led to the needfor data mining, and the importance of its applications. It examines the types of datato be mined, including relational, transactional, and data warehouse data, as well ascomplex types of data such as data streams, time-series, sequences, graphs, social net-works, multirelational data, spatiotemporal data, multimedia data, text data, and Webdata. The chapter presents a general classiﬁcation of data mining tasks, based on thedifferent kinds of knowledge to be mined. In comparison with the ﬁrst edition, twonew sections are introduced: Section 1.7 is on data mining primitives, which allowusers to interactively communicate with data mining systems in order to direct themining process, and Section 1.8 discusses the issues regarding how to integrate a datamining system with a database or data warehouse system. These two sections repre-sent the condensed materials of Chapter 4, “Data Mining Primitives, Languages andArchitectures,” in the ﬁrst edition. Finally, major challenges in the ﬁeld are discussed.Chapter 2 introduces techniques for preprocessing the data before mining. Thiscorresponds to Chapter 3 of the ﬁrst edition. Because data preprocessing precedes theconstruction of data warehouses, we address this topic here, and then follow with anintroduction to data warehouses in the subsequent chapter. This chapter describes var-ious statistical methods for descriptive data summarization, including measuring bothcentral tendency and dispersion of data. The description of data cleaning methods hasbeen enhanced. Methods for data integration and transformation and data reduction arediscussed, including the use of concept hierarchies for dynamic and static discretization.The automatic generation of concept hierarchies is also described.Chapters 3 and 4 provide a solid introduction to data warehouse, OLAP (On-LineAnalytical Processing), and data generalization. These two chapters correspond toChapters 2 and 5 of the ﬁrst edition, but with substantial enhancement regarding data
Preface xxiiiwarehouse implementation methods. Chapter 3 introduces the basic concepts, archi-tectures and general implementations of data warehouse and on-line analytical process-ing, as well as the relationship between data warehousing and data mining. Chapter 4takes a more in-depth look at data warehouse and OLAP technology, presenting adetailed study of methods of data cube computation, including the recently developedstar-cubing and high-dimensional OLAP methods. Further explorations of data ware-house and OLAP are discussed, such as discovery-driven cube exploration, multifeaturecubes for complex data mining queries, and cube gradient analysis. Attribute-orientedinduction, an alternative method for data generalization and concept description, isalso discussed.Chapter 5 presents methods for mining frequent patterns, associations, and corre-lations in transactional and relational databases and data warehouses. In addition tointroducing the basic concepts, such as market basket analysis, many techniques for fre-quent itemset mining are presented in an organized way. These range from the basicApriori algorithm and its variations to more advanced methods that improve on efﬁ-ciency, including the frequent-pattern growth approach, frequent-pattern mining withvertical data format, and mining closed frequent itemsets. The chapter also presents tech-niques for mining multilevel association rules, multidimensional association rules, andquantitative association rules. In comparison with the previous edition, this chapter hasplaced greater emphasis on the generation of meaningful association and correlationrules. Strategies for constraint-based mining and the use of interestingness measures tofocus the rule search are also described.Chapter 6 describes methods for data classiﬁcation and prediction, including decisiontree induction, Bayesian classiﬁcation, rule-based classiﬁcation, the neural network tech-nique of backpropagation, support vector machines, associative classiﬁcation, k-nearestneighborclassiﬁers,case-basedreasoning,geneticalgorithms,roughsettheory,andfuzzyset approaches. Methods of regression are introduced. Issues regarding accuracy and howto choose the best classiﬁer or predictor are discussed. In comparison with the corre-sponding chapter in the ﬁrst edition, the sections on rule-based classiﬁcation and supportvector machines are new, and the discussion of measuring and enhancing classiﬁcationand prediction accuracy has been greatly expanded.Cluster analysisformsthetopicof Chapter7. Severalmajordataclusteringapproachesare presented, including partitioning methods, hierarchical methods, density-basedmethods, grid-based methods, and model-based methods. New sections in this editionintroduce techniques for clustering high-dimensional data, as well as for constraint-based cluster analysis. Outlier analysis is also discussed.Chapters 8 to 10 treat advanced topics in data mining and cover a large body ofmaterials on recent progress in this frontier. These three chapters now replace our pre-vious single chapter on advanced topics. Chapter 8 focuses on the mining of streamdata, time-series data, and sequence data (covering both transactional sequences andbiological sequences). The basic data mining techniques (such as frequent-pattern min-ing, classiﬁcation, clustering, and constraint-based mining) are extended for these typesof data. Chapter 9 discusses methods for graph and structural pattern mining, socialnetwork analysis and multirelational data mining. Chapter 10 presents methods for
xxiv Prefacemining object, spatial, multimedia, text, and Web data, which cover a great deal of newprogress in these areas.Finally, in Chapter 11, we summarize the concepts presented in this book and discussapplications and trends in data mining. New material has been added on data mining forbiological and biomedical data analysis, other scientiﬁc applications, intrusion detection,and collaborative ﬁltering. Social impacts of data mining, such as privacy and data secu-rity issues, are discussed, in addition to challenging research issues. Further discussionof ubiquitous data mining has also been added.The Appendix provides an introduction to Microsoft’s OLE DB for Data Mining(OLEDB for DM).Throughout the text, italic font is used to emphasize terms that are deﬁned, while boldfont is used to highlight or summarize main ideas. Sans serif font is used for reservedwords. Bold italic font is used to represent multidimensional quantities.This book has several strong features that set it apart from other texts on data min-ing. It presents a very broad yet in-depth coverage from the spectrum of data mining,especially regarding several recent research topics on data stream mining, graph min-ing, social network analysis, and multirelational data mining. The chapters precedingthe advanced topics are written to be as self-contained as possible, so they may be readin order of interest by the reader. All of the major methods of data mining are pre-sented. Because we take a database point of view to data mining, the book also presentsmany important topics in data mining, such as scalable algorithms and multidimensionalOLAP analysis, that are often overlooked or minimally treated in other books.To the InstructorThis book is designed to give a broad, yet detailed overview of the ﬁeld of data mining. Itcan be used to teach an introductory course on data mining at an advanced undergraduatelevel or at the ﬁrst-year graduate level. In addition, it can also be used to teach an advancedcourse on data mining.If you plan to use the book to teach an introductory course, you may ﬁnd that thematerials in Chapters 1 to 7 are essential, among which Chapter 4 may be omitted if youdo not plan to cover the implementation methods for data cubing and on-line analyticalprocessing in depth. Alternatively, you may omit some sections in Chapters 1 to 7 anduse Chapter 11 as the ﬁnal coverage of applications and trends on data mining.If you plan to use the book to teach an advanced course on data mining, you may useChapters 8 through 11. Moreover, additional materials and some recent research papersmay supplement selected themes from among the advanced topics of these chapters.Individual chapters in this book can also be used for tutorials or for special topicsin related courses, such as database systems, machine learning, pattern recognition, andintelligent data analysis.Each chapter ends with a set of exercises, suitable as assigned homework. The exercisesare either short questions that test basic mastery of the material covered, longer questionsthat require analytical thinking, or implementation projects. Some exercises can also be
Preface xxvused as research discussion topics. The bibliographic notes at the end of each chapter canbe used to ﬁnd the research literature that contains the origin of the concepts and meth-ods presented, in-depth treatment of related topics, and possible extensions. Extensiveteaching aids are available from the book’s websites, such as lecture slides, reading lists,and course syllabi.To the StudentWe hope that this textbook will spark your interest in the young yet fast-evolving ﬁeld ofdata mining. We have attempted to present the material in a clear manner, with carefulexplanation of the topics covered. Each chapter ends with a summary describing the mainpoints. We have included many ﬁgures and illustrations throughout the text in order tomake the book more enjoyable and reader-friendly. Although this book was designed asa textbook, we have tried to organize it so that it will also be useful to you as a referencebook or handbook, should you later decide to perform in-depth research in the relatedﬁelds or pursue a career in data mining.What do you need to know in order to read this book?You should have some knowledge of the concepts and terminology associated withdatabase systems, statistics, and machine learning. However, we do try to provideenough background of the basics in these ﬁelds, so that if you are not so familiar withthese ﬁelds or your memory is a bit rusty, you will not have trouble following thediscussions in the book.You should have some programming experience. In particular, you should be able toread pseudo-code and understand simple data structures such as multidimensionalarrays.To the ProfessionalThis book was designed to cover a wide range of topics in the ﬁeld of data mining. As aresult, it is an excellent handbook on the subject. Because each chapter is designed to beas stand-alone as possible, you can focus on the topics that most interest you. The bookcan be used by application programmers and information service managers who wish tolearn about the key ideas of data mining on their own. The book would also be useful fortechnical data analysis staff in banking, insurance, medicine, and retailing industries whoare interested in applying data mining solutions to their businesses. Moreover, the bookmay serve as a comprehensive survey of the data mining ﬁeld, which may also beneﬁtresearchers who would like to advance the state-of-the-art in data mining and extendthe scope of data mining applications.The techniques and algorithms presented are of practical utility. Rather than select-ing algorithms that perform well on small “toy” data sets, the algorithms describedin the book are geared for the discovery of patterns and knowledge hidden in large,
xxvi Prefacereal data sets. In Chapter 11, we brieﬂy discuss data mining systems in commercialuse, as well as promising research prototypes. Algorithms presented in the book areillustrated in pseudo-code. The pseudo-code is similar to the C programming lan-guage, yet is designed so that it should be easy to follow by programmers unfamiliarwith C or C++. If you wish to implement any of the algorithms, you should ﬁnd thetranslation of our pseudo-code into the programming language of your choice to bea fairly straightforward task.Book Websites with ResourcesThe book has a website at www.cs.uiuc.edu/∼hanj/bk2 and another with Morgan Kauf-mann Publishers at www.mkp.com/datamining2e. These websites contain many sup-plemental materials for readers of this book or anyone else with an interest in datamining. The resources include:Slide presentations per chapter. Lecture notes in Microsoft PowerPoint slides areavailable for each chapter.Artwork of the book. This may help you to make your own slides for your class-room teaching.Instructors’ manual. This complete set of answers to the exercises in the book isavailable only to instructors from the publisher’s website.Course syllabi and lecture plan. These are given for undergraduate and graduateversions of introductory and advanced courses on data mining, which use the textand slides.Supplemental reading lists with hyperlinks. Seminal papers for supplemental read-ing are organized per chapter.Links to data mining data sets and software. We will provide a set of links to datamining data sets and sites containing interesting data mining software pack-ages, such as IlliMine from the University of Illinois at Urbana-Champaign(http://illimine.cs.uiuc.edu).Sample assignments, exams, course projects. A set of sample assignments, exams,and course projects will be made available to instructors from the publisher’swebsite.Table of contents of the book in PDF.Errata on the different printings of the book. We welcome you to point out anyerrors in the book. Once the error is conﬁrmed, we will update this errata list andinclude acknowledgment of your contribution.Comments or suggestions can be sent to firstname.lastname@example.org. We would be happy tohear from you.
Preface xxviiAcknowledgments for the First Edition of the BookWe would like to express our sincere thanks to all those who have worked or are cur-rently working with us on data mining–related research and/or the DBMiner project, orhave provided us with various support in data mining. These include Rakesh Agrawal,Stella Atkins, Yvan Bedard, Binay Bhattacharya, (Yandong) Dora Cai, Nick Cercone,Surajit Chaudhuri, Sonny H. S. Chee, Jianping Chen, Ming-Syan Chen, Qing Chen,Qiming Chen, Shan Cheng, David Cheung, Shi Cong, Son Dao, Umeshwar Dayal,James Delgrande, Guozhu Dong, Carole Edwards, Max Egenhofer, Martin Ester, UsamaFayyad, Ling Feng, Ada Fu, Yongjian Fu, Daphne Gelbart, Randy Goebel, Jim Gray,Robert Grossman, Wan Gong, Yike Guo, Eli Hagen, Howard Hamilton, Jing He, LarryHenschen, Jean Hou, Mei-Chun Hsu, Kan Hu, Haiming Huang, Yue Huang, JuliaItskevitch, Wen Jin, Tiko Kameda, Hiroyuki Kawano, Rizwan Kheraj, Eddie Kim, WonKim, Krzysztof Koperski, Hans-Peter Kriegel, Vipin Kumar, Laks V. S. Lakshmanan,Joyce Man Lam, James Lau, Deyi Li, George (Wenmin) Li, Jin Li, Ze-Nian Li, NancyLiao, Gang Liu, Junqiang Liu, Ling Liu, Alan (Yijun) Lu, Hongjun Lu, Tong Lu, Wei Lu,Xuebin Lu, Wo-Shun Luk, Heikki Mannila, Runying Mao, Abhay Mehta, Gabor Melli,Alberto Mendelzon, Tim Merrett, Harvey Miller, Drew Miners, Behzad Mortazavi-Asl,Richard Muntz, Raymond T. Ng, Vicent Ng, Shojiro Nishio, Beng-Chin Ooi, TamerOzsu, Jian Pei, Gregory Piatetsky-Shapiro, Helen Pinto, Fred Popowich, Amynmo-hamed Rajan, Peter Scheuermann, Shashi Shekhar, Wei-Min Shen, Avi Silberschatz,Evangelos Simoudis, Nebojsa Stefanovic, Yin Jenny Tam, Simon Tang, Zhaohui Tang,Dick Tsur, Anthony K. H. Tung, Ke Wang, Wei Wang, Zhaoxia Wang, Tony Wind, LaraWinstone, Ju Wu, Betty (Bin) Xia, Cindy M. Xin, Xiaowei Xu, Qiang Yang, Yiwen Yin,Clement Yu, Jeffrey Yu, Philip S. Yu, Osmar R. Zaiane, Carlo Zaniolo, Shuhua Zhang,Zhong Zhang, Yvonne Zheng, Xiaofang Zhou, and Hua Zhu. We are also grateful toJean Hou, Helen Pinto, Lara Winstone, and Hua Zhu for their help with some of theoriginal ﬁgures in this book, and to Eugene Belchev for his careful proofreading ofeach chapter.We also wish to thank Diane Cerra, our Executive Editor at Morgan KaufmannPublishers, for her enthusiasm, patience, and support during our writing of this book,as well as Howard Severson, our Production Editor, and his staff for their conscien-tious efforts regarding production. We are indebted to all of the reviewers for theirinvaluable feedback. Finally, we thank our families for their wholehearted supportthroughout this project.Acknowledgments for the Second Edition of the BookWe would like to express our grateful thanks to all of the previous and current mem-bers of the Data Mining Group at UIUC, the faculty and students in the Data andInformation Systems (DAIS) Laboratory in the Department of Computer Science,the University of Illinois at Urbana-Champaign, and many friends and colleagues,
xxviii Prefacewhose constant support and encouragement have made our work on this edition arewarding experience. These include Gul Agha, Rakesh Agrawal, Loretta Auvil, PeterBajcsy, Geneva Belford, Deng Cai, Y. Dora Cai, Roy Cambell, Kevin C.-C. Chang, Sura-jit Chaudhuri, Chen Chen, Yixin Chen, Yuguo Chen, Hong Cheng, David Cheung,Shengnan Cong, Gerald DeJong, AnHai Doan, Guozhu Dong, Charios Ermopoulos,Martin Ester, Christos Faloutsos, Wei Fan, Jack C. Feng, Ada Fu, Michael Garland,Johannes Gehrke, Hector Gonzalez, Mehdi Harandi, Thomas Huang, Wen Jin, Chu-lyun Kim, Sangkyum Kim, Won Kim, Won-Young Kim, David Kuck, Young-Koo Lee,Harris Lewin, Xiaolei Li, Yifan Li, Chao Liu, Han Liu, Huan Liu, Hongyan Liu, Lei Liu,Ying Lu, Klara Nahrstedt, David Padua, Jian Pei, Lenny Pitt, Daniel Reed, Dan Roth,Bruce Schatz, Zheng Shao, Marc Snir, Zhaohui Tang, Bhavani M. Thuraisingham, JosepTorrellas, Peter Tzvetkov, Benjamin W. Wah, Haixun Wang, Jianyong Wang, Ke Wang,Muyuan Wang, Wei Wang, Michael Welge, Marianne Winslett, Ouri Wolfson, AndrewWu, Tianyi Wu, Dong Xin, Xifeng Yan, Jiong Yang, Xiaoxin Yin, Hwanjo Yu, JeffreyX. Yu, Philip S. Yu, Maria Zemankova, ChengXiang Zhai, Yuanyuan Zhou, and WeiZou. Deng Cai and ChengXiang Zhai have contributed to the text mining and Webmining sections, Xifeng Yan to the graph mining section, and Xiaoxin Yin to the mul-tirelational data mining section. Hong Cheng, Charios Ermopoulos, Hector Gonzalez,David J. Hill, Chulyun Kim, Sangkyum Kim, Chao Liu, Hongyan Liu, Kasif Manzoor,Tianyi Wu, Xifeng Yan, and Xiaoxin Yin have contributed to the proofreading of theindividual chapters of the manuscript.We also which to thank Diane Cerra, our Publisher at Morgan Kaufmann Pub-lishers, for her constant enthusiasm, patience, and support during our writing of thisbook. We are indebted to Alan Rose, the book Production Project Manager, for histireless and ever prompt communications with us to sort out all details of the pro-duction process. We are grateful for the invaluable feedback from all of the reviewers.Finally, we thank our families for their wholehearted support throughout this project.
1IntroductionThis book is an introduction to a young and promising ﬁeld called data mining and knowledgediscovery from data. The material in this book is presented from a database perspective,where emphasis is placed on basic data mining concepts and techniques for uncoveringinteresting data patterns hidden in large data sets. The implementation methods dis-cussed are particularly oriented toward the development of scalable and efﬁcient datamining tools. In this chapter, you will learn how data mining is part of the naturalevolution of database technology, why data mining is important, and how it is deﬁned.You will learn about the general architecture of data mining systems, as well as gaininsight into the kinds of data on which mining can be performed, the types of patternsthat can be found, and how to tell which patterns represent useful knowledge. Youwill study data mining primitives, from which data mining query languages can bedesigned. Issues regarding how to integrate a data mining system with a database ordata warehouse are also discussed. In addition to studying a classiﬁcation of data min-ing systems, you will read about challenging research issues for building data miningtools of the future.1.1What Motivated Data Mining? Why Is It Important?Necessity is the mother of invention. —PlatoData mining has attracted a great deal of attention in the information industry and insociety as a whole in recent years, due to the wide availability of huge amounts of dataand the imminent need for turning such data into useful information and knowledge.The information and knowledge gained can be used for applications ranging from mar-ket analysis, fraud detection, and customer retention, to production control and scienceexploration.Data mining can be viewed as a result of the natural evolution of informationtechnology. The database system industry has witnessed an evolutionary path in thedevelopment of the following functionalities (Figure 1.1): data collection and databasecreation, data management (including data storage and retrieval, and database1
2 Chapter 1 IntroductionFigure 1.1 The evolution of database system technology.
1.1 What Motivated Data Mining? Why Is It Important? 3transaction processing), and advanced data analysis (involving data warehousing anddata mining). For instance, the early development of data collection and databasecreation mechanisms served as a prerequisite for later development of effective mech-anisms for data storage and retrieval, and query and transaction processing. Withnumerous database systems offering query and transaction processing as commonpractice, advanced data analysis has naturally become the next target.Since the 1960s, database and information technology has been evolving system-atically from primitive ﬁle processing systems to sophisticated and powerful databasesystems. The research and development in database systems since the 1970s has pro-gressed from early hierarchical and network database systems to the development ofrelational database systems (where data are stored in relational table structures; seeSection 1.3.1), data modeling tools, and indexing and accessing methods. In addition,users gained convenient and ﬂexible data access through query languages, user inter-faces, optimized query processing, and transaction management. Efﬁcient methodsfor on-line transaction processing (OLTP), where a query is viewed as a read-onlytransaction, have contributed substantially to the evolution and wide acceptance ofrelational technology as a major tool for efﬁcient storage, retrieval, and managementof large amounts of data.Database technology since the mid-1980s has been characterized by the popularadoption of relational technology and an upsurge of research and developmentactivities on new and powerful database systems. These promote the development ofadvanced data models such as extended-relational, object-oriented, object-relational,and deductive models. Application-oriented database systems, including spatial, tem-poral, multimedia, active, stream, and sensor, and scientiﬁc and engineering databases,knowledge bases, and ofﬁce information bases, have ﬂourished. Issues related to thedistribution, diversiﬁcation, and sharing of data have been studied extensively. Hetero-geneous database systems and Internet-based global information systems such as theWorld Wide Web (WWW) have also emerged and play a vital role in the informationindustry.The steady and amazing progress of computer hardware technology in the pastthree decades has led to large supplies of powerful and affordable computers, datacollection equipment, and storage media. This technology provides a great boost tothe database and information industry, and makes a huge number of databases andinformation repositories available for transaction management, information retrieval,and data analysis.Data can now be stored in many different kinds of databases and informationrepositories. One data repository architecture that has emerged is the data warehouse(Section 1.3.2), a repository of multiple heterogeneous data sources organized under auniﬁed schema at a single site in order to facilitate management decision making. Datawarehouse technology includes data cleaning, data integration, and on-line analyticalprocessing (OLAP), that is, analysis techniques with functionalities such as summa-rization, consolidation, and aggregation as well as the ability to view information fromdifferent angles. Although OLAP tools support multidimensional analysis and deci-sion making, additional data analysis tools are required for in-depth analysis, such as
4 Chapter 1 IntroductionFigure 1.2 We are data rich, but information poor.data classiﬁcation, clustering, and the characterization of data changes over time. Inaddition, huge volumes of data can be accumulated beyond databases and data ware-houses. Typical examples include the World Wide Web and data streams, where dataﬂow in and out like streams, as in applications like video surveillance, telecommunica-tion, and sensor networks. The effective and efﬁcient analysis of data in such differentforms becomes a challenging task.The abundance of data, coupled with the need for powerful data analysis tools, hasbeen described as a data rich but information poor situation. The fast-growing, tremen-dous amount of data, collected and stored in large and numerous data repositories, hasfar exceeded our human ability for comprehension without powerful tools (Figure 1.2).As a result, data collected in large data repositories become “data tombs”—data archivesthat are seldom visited. Consequently, important decisions are often made based not onthe information-rich data stored in data repositories, but rather on a decision maker’sintuition, simply because the decision maker does not have the tools to extract the valu-able knowledge embedded in the vast amounts of data. In addition, consider expertsystem technologies, which typically rely on users or domain experts to manually inputknowledge into knowledge bases. Unfortunately, this procedure is prone to biases anderrors, and is extremely time-consuming and costly. Data mining tools perform dataanalysis and may uncover important data patterns, contributing greatly to business
1.2 So, What Is Data Mining? 5strategies, knowledge bases, and scientiﬁc and medical research. The widening gapbetween data and information calls for a systematic development of data mining toolsthat will turn data tombs into “golden nuggets” of knowledge.1.2So, What Is Data Mining?Simply stated, data mining refers to extracting or “mining” knowledge from large amountsof data. The term is actually a misnomer. Remember that the mining of gold from rocksor sand is referred to as gold mining rather than rock or sand mining. Thus, data miningshould have been more appropriately named “knowledge mining from data,” which isunfortunately somewhat long. “Knowledge mining,” a shorter term, may not reﬂect theemphasis on mining from large amounts of data. Nevertheless, mining is a vivid termcharacterizing the process that ﬁnds a small set of precious nuggets from a great deal ofraw material (Figure 1.3). Thus, such a misnomer that carries both “data” and “min-ing” became a popular choice. Many other terms carry a similar or slightly differentmeaning to data mining, such as knowledge mining from data, knowledge extraction,data/pattern analysis, data archaeology, and data dredging.Many people treat data mining as a synonym for another popularly used term, Knowl-edge Discovery from Data, or KDD. Alternatively, others view data mining as simply anKnowledgeFigure 1.3 Data mining—searching for knowledge (interesting patterns) in your data.
6 Chapter 1 IntroductionFigure 1.4 Data mining as a step in the process of knowledge discovery.
1.2 So, What Is Data Mining? 7essential step in the process of knowledge discovery. Knowledge discovery as a processis depicted in Figure 1.4 and consists of an iterative sequence of the following steps:1. Data cleaning (to remove noise and inconsistent data)2. Data integration (where multiple data sources may be combined)13. Dataselection (where data relevant to the analysis task are retrieved from the database)4. Data transformation (where data are transformed or consolidated into forms appro-priate for mining by performing summary or aggregation operations, for instance)25. Data mining (an essential process where intelligent methods are applied in order toextract data patterns)6. Pattern evaluation (to identify the truly interesting patterns representing knowledgebased on some interestingness measures; Section 1.5)7. Knowledge presentation (where visualization and knowledge representation tech-niques are used to present the mined knowledge to the user)Steps 1 to 4 are different forms of data preprocessing, where the data are preparedfor mining. The data mining step may interact with the user or a knowledge base. Theinteresting patterns are presented to the user and may be stored as new knowledge inthe knowledge base. Note that according to this view, data mining is only one step in theentire process, albeit an essential one because it uncovers hidden patterns for evaluation.We agree that data mining is a step in the knowledge discovery process. However, inindustry, in media, and in the database research milieu, the term data mining is becomingmore popular than the longer term of knowledge discovery from data. Therefore, in thisbook, we choose to use the term data mining. We adopt a broad view of data miningfunctionality: data mining is the process of discovering interesting knowledge from largeamounts of data stored in databases, data warehouses, or other information repositories.Based on this view, the architecture of a typical data mining system may have thefollowing major components (Figure 1.5):Database, data warehouse, World Wide Web, or other information repository: Thisis one or a set of databases, data warehouses, spreadsheets, or other kinds of informa-tion repositories. Data cleaning and data integration techniques may be performedon the data.Database or data warehouse server: The database or data warehouse server is respon-sible for fetching the relevant data, based on the user’s data mining request.1A popular trend in the information industry is to perform data cleaning and data integration as apreprocessing step, where the resulting data are stored in a data warehouse.2Sometimes data transformation and consolidation are performed before the data selection process,particularly in the case of data warehousing. Data reduction may also be performed to obtain a smallerrepresentation of the original data without sacriﬁcing its integrity.
8 Chapter 1 IntroductionDatabaseDataWarehouseWorld WideWebOther InfoRepositoriesUser InterfacePattern EvaluationData Mining EngineDatabase orData Warehouse Serverdata cleaning, integration and selectionKnowledgeBaseFigure 1.5 Architecture of a typical data mining system.Knowledge base: This is the domain knowledge that is used to guide the search orevaluate the interestingness of resulting patterns. Such knowledge can include con-cept hierarchies, used to organize attributes or attribute values into different levels ofabstraction. Knowledge such as user beliefs, which can be used to assess a pattern’sinterestingness based on its unexpectedness, may also be included. Other examplesof domain knowledge are additional interestingness constraints or thresholds, andmetadata (e.g., describing data from multiple heterogeneous sources).Data mining engine: This is essential to the data mining system and ideally consists ofa set of functional modules for tasks such as characterization, association and correla-tion analysis, classiﬁcation, prediction, cluster analysis, outlier analysis, and evolutionanalysis.Pattern evaluation module: This component typically employs interestingness mea-sures (Section 1.5) and interacts with the data mining modules so as to focus thesearch toward interesting patterns. It may use interestingness thresholds to ﬁlterout discovered patterns. Alternatively, the pattern evaluation module may be inte-grated with the mining module, depending on the implementation of the datamining method used. For efﬁcient data mining, it is highly recommended to push
1.3 Data Mining—On What Kind of Data? 9the evaluation of pattern interestingness as deep as possible into the mining processso as to conﬁne the search to only the interesting patterns.Userinterface: This module communicates between users and the data mining system,allowing the user to interact with the system by specifying a data mining query ortask, providing information to help focus the search, and performing exploratory datamining based on the intermediate data mining results. In addition, this componentallows the user to browse database and data warehouse schemas or data structures,evaluate mined patterns, and visualize the patterns in different forms.From a data warehouse perspective, data mining can be viewed as an advanced stageof on-line analytical processing (OLAP). However, data mining goes far beyond the nar-row scope of summarization-style analytical processing of data warehouse systems byincorporating more advanced techniques for data analysis.Although there are many “data mining systems” on the market, not all of them canperform true data mining. A data analysis system that does not handle large amounts ofdata should be more appropriately categorized as a machine learning system, a statisticaldata analysis tool, or an experimental system prototype. A system that can only per-form data or information retrieval, including ﬁnding aggregate values, or that performsdeductive query answering in large databases should be more appropriately categorizedas a database system, an information retrieval system, or a deductive database system.Data mining involves an integration of techniques from multiple disciplines such asdatabase and data warehouse technology, statistics, machine learning, high-performancecomputing, pattern recognition, neural networks, data visualization, informationretrieval, image and signal processing, and spatial or temporal data analysis. We adopta database perspective in our presentation of data mining in this book. That is, empha-sis is placed on efﬁcient and scalable data mining techniques. For an algorithm to bescalable, its running time should grow approximately linearly in proportion to the sizeof the data, given the available system resources such as main memory and disk space.By performing data mining, interesting knowledge, regularities, or high-level informa-tion can be extracted from databases and viewed or browsed from different angles. Thediscovered knowledge can be applied to decision making, process control, informationmanagement, and query processing. Therefore, data mining is considered one of the mostimportant frontiers in database and information systems and one of the most promisinginterdisciplinary developments in the information technology.1.3Data Mining—On What Kind of Data?In this section, we examine a number of different data repositories on which miningcan be performed. In principle, data mining should be applicable to any kind of datarepository, as well as to transient data, such as data streams. Thus the scope of ourexamination of data repositories will include relational databases, data warehouses,transactional databases, advanced database systems, ﬂat ﬁles, data streams, and the
10 Chapter 1 IntroductionWorld Wide Web. Advanced database systems include object-relational databases andspeciﬁc application-oriented databases, such as spatial databases, time-series databases,text databases, and multimedia databases. The challenges and techniques of mining maydiffer for each of the repository systems.Although this book assumes that readers have basic knowledge of informationsystems, we provide a brief introduction to each of the major data repository systemslisted above. In this section, we also introduce the ﬁctitious AllElectronics store, whichwill be used to illustrate concepts throughout the text.1.3.1 Relational DatabasesA database system, also called a database management system (DBMS), consists of acollection of interrelated data, known as a database, and a set of software programs tomanage and access the data. The software programs involve mechanisms for the deﬁni-tion of database structures; for data storage; for concurrent, shared, or distributed dataaccess; and for ensuring the consistency and security of the information stored, despitesystem crashes or attempts at unauthorized access.A relational database is a collection of tables, each of which is assigned a unique name.Each table consists of a set of attributes (columns or ﬁelds) and usually stores a large setof tuples (records or rows). Each tuple in a relational table represents an object identiﬁedby a unique key and described by a set of attribute values. A semantic data model, suchas an entity-relationship (ER) data model, is often constructed for relational databases.An ER data model represents the database as a set of entities and their relationships.Consider the following example.Example 1.1 A relational database for AllElectronics. The AllElectronics company is described by thefollowing relation tables: customer, item, employee, and branch. Fragments of the tablesdescribed here are shown in Figure 1.6.The relation customer consists of a set of attributes, including a unique customeridentity number (cust ID), customer name, address, age, occupation, annual income,credit information, category, and so on.Similarly, each of the relations item, employee, and branch consists of a set of attributesdescribing their properties.Tables can also be used to represent the relationships between or among multiplerelation tables. For our example, these include purchases (customer purchases items,creating a sales transaction that is handled by an employee), items sold (lists theitems sold in a given transaction), and works at (employee works at a branch ofAllElectronics).Relational data can be accessed by database queries written in a relational querylanguage, such as SQL, or with the assistance of graphical user interfaces. In the latter,the user may employ a menu, for example, to specify attributes to be included in thequery, and the constraints on these attributes. A given query is transformed into a set of
1.3 Data Mining—On What Kind of Data? 11Figure 1.6 Fragments of relations from a relational database for AllElectronics.relational operations, such as join, selection, and projection, and is then optimized forefﬁcient processing. A query allows retrieval of speciﬁed subsets of the data. Suppose thatyour job is to analyze the AllElectronics data. Through the use of relational queries, youcan ask things like “Show me a list of all items that were sold in the last quarter.” Rela-tional languages also include aggregate functions such as sum, avg (average), count, max(maximum), and min (minimum). These allow you to ask things like “Show me the totalsales of the last month, grouped by branch,” or “How many sales transactions occurredin the month of December?” or “Which sales person had the highest amount of sales?”
12 Chapter 1 IntroductionWhen data mining is applied to relational databases, we can go further by searching fortrends or data patterns. For example, data mining systems can analyze customer data topredict the credit risk of new customers based on their income, age, and previous creditinformation. Data mining systems may also detect deviations, such as items whose salesare far from those expected in comparison with the previous year. Such deviations canthen be further investigated (e.g., has there been a change in packaging of such items, ora signiﬁcant increase in price?).Relational databases are one of the most commonly available and rich informationrepositories, and thus they are a major data form in our study of data mining.1.3.2 Data WarehousesSuppose that AllElectronics is a successful international company, with branches aroundthe world. Each branch has its own set of databases. The president of AllElectronics hasasked you to provide an analysis of the company’s sales per item type per branch for thethird quarter. This is a difﬁcult task, particularly since the relevant data are spread outover several databases, physically located at numerous sites.If AllElectronics had a data warehouse, this task would be easy. A data ware-house is a repository of information collected from multiple sources, stored undera uniﬁed schema, and that usually resides at a single site. Data warehouses are con-structed via a process of data cleaning, data integration, data transformation, dataloading, and periodic data refreshing. This process is discussed in Chapters 2 and 3.Figure 1.7 shows the typical framework for construction and use of a data warehousefor AllElectronics.Data source in ChicagoData source in TorontoData source in VancouverData source in New York DataWarehouseCleanIntegrateTransformLoadRefreshQuery andAnalysis ToolsClientClientFigure 1.7 Typical framework of a data warehouse for AllElectronics.
1.3 Data Mining—On What Kind of Data? 13To facilitate decision making, the data in a data warehouse are organized aroundmajor subjects, such as customer, item, supplier, and activity. The data are stored toprovide information from a historical perspective (such as from the past 5–10 years)and are typically summarized. For example, rather than storing the details of eachsales transaction, the data warehouse may store a summary of the transactions peritem type for each store or, summarized to a higher level, for each sales region.A data warehouse is usually modeled by a multidimensional database structure,where each dimension corresponds to an attribute or a set of attributes in the schema,and each cell stores the value of some aggregate measure, such as count or sales amount.The actual physical structure of a data warehouse may be a relational data store or amultidimensional data cube. A data cube provides a multidimensional view of dataand allows the precomputation and fast accessing of summarized data.Example 1.2 A data cube for AllElectronics. A data cube for summarized sales data of AllElectronicsis presented in Figure 1.8(a). The cube has three dimensions: address (with city valuesChicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), anditem (with item type values home entertainment, computer, phone, security). The aggregatevalue stored in each cell of the cube is sales amount (in thousands). For example, the totalsalesfortheﬁrstquarter,Q1,foritemsrelatingtosecuritysystemsinVancouveris$400,000,as stored in cell Vancouver, Q1, security . Additional cubes may be used to store aggregatesumsovereachdimension,correspondingtotheaggregatevaluesobtainedusingdifferentSQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, orper quarter and item, or per each individual dimension).“I have also heard about data marts. What is the difference between a data warehouse anda data mart?” you may ask. A data warehouse collects information about subjects thatspan an entire organization, and thus its scope is enterprise-wide. A data mart, on theother hand, is a department subset of a data warehouse. It focuses on selected subjects,and thus its scope is department-wide.By providing multidimensional data views and the precomputation of summarizeddata, data warehouse systems are well suited for on-line analytical processing, orOLAP. OLAP operations use background knowledge regarding the domain of thedata being studied in order to allow the presentation of data at different levels ofabstraction. Such operations accommodate different user viewpoints. Examples ofOLAP operations include drill-down and roll-up, which allow the user to view thedata at differing degrees of summarization, as illustrated in Figure 1.8(b). For instance,we can drill down on sales data summarized by quarter to see the data summarizedby month. Similarly, we can roll up on sales data summarized by city to view the datasummarized by country.Although data warehouse tools help support data analysis, additional tools for datamining are required to allow more in-depth and automated analysis. An overview ofdata warehouse and OLAP technology is provided in Chapter 3. Advanced issues regard-ing data warehouse and OLAP implementation and data generalization are discussed inChapter 4.
14 Chapter 1 Introduction605 825 14 400Q1Q2Q3Q4ChicagoNew YorkToronto4401560395Vancouvertime(quarters)address(cities)homeentertainmentcomputerphoneitem (types)security<Vancouver,Q1, security>Q1Q2Q3Q4USACanada20001000time(quarters)address(countries)homeentertainmentcomputerphoneitem (types)security150100150JanFebMarchChicagoNew YorkTorontoVancouvertime(months)address(cities)homeentertainmentcomputerphoneitem (types)securityDrill-downon time data for Q1Roll-upon address(a)(b)Figure 1.8 A multidimensional data cube, commonly used for data warehousing, (a) showing summa-rized data for AllElectronics and (b) showing summarized data resulting from drill-down androll-up operations on the cube in (a). For improved readability, only some of the cube cellvalues are shown.1.3.3 Transactional DatabasesIn general, a transactional database consists of a ﬁle where each record represents a trans-action. A transaction typically includes a unique transaction identity number (trans ID)and a list of the items making up the transaction (such as items purchased in a store).
1.3 Data Mining—On What Kind of Data? 15trans ID list of item IDsT100 I1, I3, I8, I16T200 I2, I8... ...Figure 1.9 Fragment of a transactional database for sales at AllElectronics.The transactional database may have additional tables associated with it, which containother information regarding the sale, such as the date of the transaction, the customer IDnumber, the ID number of the salesperson and of the branch at which the sale occurred,and so on.Example 1.3 A transactional database for AllElectronics. Transactions can be stored in a table, withone record per transaction. A fragment of a transactional database for AllElectronicsis shown in Figure 1.9. From the relational database point of view, the sales table inFigure 1.9 is a nested relation because the attribute list of item IDs contains a set of items.Because most relational database systems do not support nested relational structures, thetransactional database is usually either stored in a ﬂat ﬁle in a format similar to that ofthe table in Figure 1.9 or unfolded into a standard relation in a format similar to that ofthe items sold table in Figure 1.6.As an analyst of the AllElectronics database, you may ask, “Show me all the itemspurchased by Sandy Smith” or “How many transactions include item number I3?”Answering such queries may require a scan of the entire transactional database.Suppose you would like to dig deeper into the data by asking, “Which items sold welltogether?” This kind of market basket data analysis would enable you to bundle groups ofitems together as a strategy for maximizing sales. For example, given the knowledge thatprinters are commonly purchased together with computers, you could offer an expensivemodel of printers at a discount to customers buying selected computers, in the hopes ofselling more of the expensive printers. A regular data retrieval system is not able to answerqueries like the one above. However, data mining systems for transactional data can doso by identifying frequent itemsets, that is, sets of items that are frequently sold together.The mining of such frequent patterns for transactional data is discussed in Chapter 18.104.22.168 Advanced Data and Information Systems andAdvanced ApplicationsRelational database systems have been widely used in business applications. With theprogress of database technology, various kinds of advanced data and information sys-tems have emerged and are undergoing development to address the requirements of newapplications.
16 Chapter 1 IntroductionThe new database applications include handling spatial data (such as maps),engineering design data (such as the design of buildings, system components, or inte-grated circuits), hypertext and multimedia data (including text, image, video, and audiodata), time-related data (such as historical records or stock exchange data), stream data(such as video surveillance and sensor data, where data ﬂow in and out like streams), andthe World Wide Web (a huge, widely distributed information repository made availableby the Internet). These applications require efﬁcient data structures and scalable meth-ods for handling complex object structures; variable-length records; semistructured orunstructured data; text, spatiotemporal, and multimedia data; and database schemaswith complex structures and dynamic changes.Inresponsetotheseneeds,advanceddatabasesystemsandspeciﬁcapplication-orienteddatabase systems have been developed. These include object-relational database systems,temporal and time-series database systems, spatial and spatiotemporal database systems,text and multimedia database systems, heterogeneous and legacy database systems, datastream management systems, and Web-based global information systems.While such databases or information repositories require sophisticated facilities toefﬁciently store, retrieve, and update large amounts of complex data, they also providefertile grounds and raise many challenging research and implementation issues for datamining. In this section, we describe each of the advanced database systems listed above.Object-Relational DatabasesObject-relational databases are constructed based on an object-relational data model.This model extends the relational model by providing a rich data type for handling com-plex objects and object orientation. Because most sophisticated database applicationsneed to handle complex objects and structures, object-relational databases are becom-ing increasingly popular in industry and applications.Conceptually, the object-relational data model inherits the essential concepts ofobject-oriented databases, where, in general terms, each entity is considered as anobject. Following the AllElectronics example, objects can be individual employees, cus-tomers, or items. Data and code relating to an object are encapsulated into a singleunit. Each object has associated with it the following:A set of variables that describe the objects. These correspond to attributes in theentity-relationship and relational models.A set of messages that the object can use to communicate with other objects, or withthe rest of the database system.A set of methods, where each method holds the code to implement a message. Uponreceiving a message, the method returns a value in response. For instance, the methodfor the message get photo(employee) will retrieve and return a photo of the givenemployee object.Objects that share a common set of properties can be grouped into an object class.Each object is an instance of its class. Object classes can be organized into class/subclass
1.3 Data Mining—On What Kind of Data? 17hierarchies so that each class represents properties that are common to objects in thatclass. For instance, an employee class can contain variables like name, address, and birth-date. Suppose that the class, sales person, is a subclass of the class, employee. A sales personobject would inherit all of the variables pertaining to its superclass of employee. In addi-tion, it has all of the variables that pertain speciﬁcally to being a salesperson (e.g., com-mission). Such a class inheritance feature beneﬁts information sharing.For data mining in object-relational systems, techniques need to be developed forhandling complex object structures, complex data types, class and subclass hierarchies,property inheritance, and methods and procedures.Temporal Databases, Sequence Databases, andTime-Series DatabasesA temporal database typically stores relational data that include time-related attributes.These attributes may involve several timestamps, each having different semantics.A sequence database stores sequences of ordered events, with or without a concretenotion of time. Examples include customer shopping sequences, Web click streams, andbiological sequences. A time-seriesdatabase stores sequences of values or events obtainedover repeated measurements of time (e.g., hourly, daily, weekly). Examples include datacollected from the stock exchange, inventory control, and the observation of naturalphenomena (like temperature and wind).Data mining techniques can be used to ﬁnd the characteristics of object evolution, orthe trend of changes for objects in the database. Such information can be useful in deci-sion making and strategy planning. For instance, the mining of banking data may aid inthe scheduling of bank tellers according to the volume of customer trafﬁc. Stock exchangedata can be mined to uncover trends that could help you plan investment strategies (e.g.,when is the best time to purchase AllElectronics stock?). Such analyses typically requiredeﬁning multiple granularity of time. For example, time may be decomposed accordingto ﬁscal years, academic years, or calendar years. Years may be further decomposed intoquarters or months.Spatial Databases and Spatiotemporal DatabasesSpatial databases contain spatial-related information. Examples include geographic(map)databases,verylarge-scaleintegration(VLSI)orcomputed-aideddesigndatabases,and medical and satellite image databases. Spatial data may be represented in raster for-mat, consisting of n-dimensional bit maps or pixel maps. For example, a 2-D satelliteimage may be represented as raster data, where each pixel registers the rainfall in a givenarea. Maps can be represented in vector format, where roads, bridges, buildings, andlakes are represented as unions or overlays of basic geometric constructs, such as points,lines, polygons, and the partitions and networks formed by these components.Geographic databases have numerous applications, ranging from forestry and ecol-ogy planning to providing public service information regarding the location of telephoneand electric cables, pipes, and sewage systems. In addition, geographic databases are
18 Chapter 1 Introductioncommonly used in vehicle navigation and dispatching systems. An example of such asystem for taxis would store a city map with information regarding one-way streets, sug-gested routes for moving from region A to region B during rush hour, and the locationof restaurants and hospitals, as well as the current location of each driver.“What kind of data mining can be performed on spatial databases?” you may ask. Datamining may uncover patterns describing the characteristics of houses located near a spec-iﬁed kind of location, such as a park, for instance. Other patterns may describe the cli-mate of mountainous areas located at various altitudes, or describe the change in trendof metropolitan poverty rates based on city distances from major highways. The relation-ships among a set of spatial objects can be examined in order to discover which subsets ofobjects are spatially auto-correlated or associated. Clusters and outliers can be identiﬁedby spatial cluster analysis. Moreover, spatial classiﬁcation can be performed to constructmodels for prediction based on the relevant set of features of the spatial objects. Further-more, “spatial data cubes” may be constructed to organize data into multidimensionalstructures and hierarchies, on which OLAP operations (such as drill-down and roll-up)can be performed.A spatial database that stores spatial objects that change with time is called aspatiotemporal database, from which interesting information can be mined. For exam-ple, we may be able to group the trends of moving objects and identify some strangelymoving vehicles, or distinguish a bioterrorist attack from a normal outbreak of the ﬂubased on the geographic spread of a disease with time.Text Databases and Multimedia DatabasesText databases are databases that contain word descriptions for objects. These worddescriptions are usually not simple keywords but rather long sentences or paragraphs,such as product speciﬁcations, error or bug reports, warning messages, summary reports,notes, or other documents. Text databases may be highly unstructured (such as someWeb pages on the World Wide Web). Some text databases may be somewhat structured,that is, semistructured (such as e-mail messages and many HTML/XML Web pages),whereas others are relatively well structured (such as library catalogue databases). Textdatabases with highly regular structures typically can be implemented using relationaldatabase systems.“What can data mining on text databases uncover?” By mining text data, one mayuncover general and concise descriptions of the text documents, keyword or contentassociations, as well as the clustering behavior of text objects. To do this, standard datamining methods need to be integrated with information retrieval techniques and theconstruction or use of hierarchies speciﬁcally for text data (such as dictionaries and the-sauruses), as well as discipline-oriented term classiﬁcation systems (such as in biochemi-stry, medicine, law, or economics).Multimedia databases store image, audio, and video data. They are used in appli-cations such as picture content-based retrieval, voice-mail systems, video-on-demandsystems, the World Wide Web, and speech-based user interfaces that recognize spokencommands. Multimedia databases must support large objects, because data objects such
1.3 Data Mining—On What Kind of Data? 19as video can require gigabytes of storage. Specialized storage and search techniques arealso required. Because video and audio data require real-time retrieval at a steady andpredetermined rate in order to avoid picture or sound gaps and system buffer overﬂows,such data are referred to as continuous-media data.For multimedia data mining, storage and search techniques need to be integratedwith standard data mining methods. Promising approaches include the construction ofmultimedia data cubes, the extraction of multiple features from multimedia data, andsimilarity-based pattern matching.Heterogeneous Databases and Legacy DatabasesA heterogeneous database consists of a set of interconnected, autonomous componentdatabases. The components communicate in order to exchange information and answerqueries. Objects in one component database may differ greatly from objects in othercomponent databases, making it difﬁcult to assimilate their semantics into the overallheterogeneous database.Many enterprises acquire legacy databases as a result of the long history of infor-mation technology development (including the application of different hardware andoperating systems). A legacy database is a group of heterogeneous databases that com-bines different kinds of data systems, such as relational or object-oriented databases,hierarchical databases, network databases, spreadsheets, multimedia databases, or ﬁlesystems. The heterogeneous databases in a legacy database may be connected by intra-or inter-computer networks.Information exchange across such databases is difﬁcult because it would requireprecise transformation rules from one representation to another, considering diversesemantics. Consider, for example, the problem in exchanging information regardingstudent academic performance among different schools. Each school may have its owncomputer system and use its own curriculum and grading system. One university mayadopt a quarter system, offer three courses on database systems, and assign grades fromA+ to F, whereas another may adopt a semester system, offer two courses on databases,and assign grades from 1 to 10. It is very difﬁcult to work out precise course-to-gradetransformation rules between the two universities, making information exchange dif-ﬁcult. Data mining techniques may provide an interesting solution to the informationexchange problem by performing statistical data distribution and correlation analysis,and transforming the given data into higher, more generalized, conceptual levels (suchas fair, good, or excellent for student grades), from which information exchange can thenmore easily be performed.Data StreamsMany applications involve the generation and analysis of a new kind of data, called streamdata, where data ﬂow in and out of an observation platform (or window) dynamically.Such data streams have the following unique features: huge or possibly inﬁnite volume,dynamically changing, ﬂowing in and out in a ﬁxed order, allowing only one or a small
20 Chapter 1 Introductionnumber of scans, and demanding fast (often real-time) response time. Typical examples ofdata streams include various kinds of scientiﬁc and engineering data, time-series data,and data produced in other dynamic environments, such as power supply, network traf-ﬁc, stock exchange, telecommunications, Web click streams, video surveillance, andweather or environment monitoring.Because data streams are normally not stored in any kind of data repository, effec-tive and efﬁcient management and analysis of stream data poses great challenges toresearchers. Currently, many researchers are investigating various issues relating to thedevelopmentofdatastreammanagementsystems.Atypicalquerymodelinsuchasystemis the continuous query model, where predeﬁned queries constantly evaluate incomingstreams, collect aggregate data, report the current status of data streams, and respond totheir changes.Mining data streams involves the efﬁcient discovery of general patterns and dynamicchanges within stream data. For example, we may like to detect intrusions of a computernetwork based on the anomaly of message ﬂow, which may be discovered by clusteringdata streams, dynamic construction of stream models, or comparing the current frequentpatterns with that at a certain previous time. Most stream data reside at a rather low levelof abstraction, whereas analysts are often more interested in higher and multiple levelsof abstraction. Thus, multilevel, multidimensional on-line analysis and mining shouldbe performed on stream data as well.The World Wide WebThe World Wide Web and its associated distributed information services, such asYahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-line infor-mation services, where data objects are linked together to facilitate interactive access.Users seeking information of interest traverse from one object via links to another.Such systems provide ample opportunities and challenges for data mining. For exam-ple, understanding user access patterns will not only help improve system design (byproviding efﬁcient access between highly correlated objects), but also leads to bettermarketing decisions (e.g., by placing advertisements in frequently visited documents,or by providing better customer/user classiﬁcation and behavior analysis). Capturinguser access patterns in such distributed information environments is called Web usagemining (or Weblog mining).Although Web pages may appear fancy and informative to human readers, they can behighly unstructured and lack a predeﬁned schema, type, or pattern. Thus it is difﬁcult forcomputers to understand the semantic meaning of diverse Web pages and structure themin an organized way for systematic information retrieval and data mining. Web servicesthat provide keyword-based searches without understanding the context behind the Webpages can only offer limited help to users. For example, a Web search based on a singlekeyword may return hundreds of Web page pointers containing the keyword, but mostof the pointers will be very weakly related to what the user wants to ﬁnd. Data miningcan often provide additional help here than Web search services. For example, authori-tative Web page analysis based on linkages among Web pages can help rank Web pages
1.4 Data Mining Functionalities—What Kinds of Patterns Can Be Mined? 21based on their importance, inﬂuence, and topics. Automated Web page clustering andclassiﬁcation help group and arrange Web pages in a multidimensional manner basedon their contents. Web community analysis helps identify hidden Web social networksand communities and observe their evolution. Web mining is the development of scal-able and effective Web data analysis and mining methods. It may help us learn about thedistribution of information on the Web in general, characterize and classify Web pages,and uncover Web dynamics and the association and other relationships among differentWeb pages, users, communities, and Web-based activities.Data mining in advanced database and information systems is discussed in Chapters 8to 10.1.4Data Mining Functionalities—What Kinds of PatternsCan Be Mined?We have observed various types of databases and information repositories on which datamining can be performed. Let us now examine the kinds of data patterns that can bemined.Data mining functionalities are used to specify the kind of patterns to be found indata mining tasks. In general, data mining tasks can be classiﬁed into two categories:descriptive and predictive. Descriptive mining tasks characterize the general propertiesof the data in the database. Predictive mining tasks perform inference on the current datain order to make predictions.In some cases, users may have no idea regarding what kinds of patterns in their datamay be interesting, and hence may like to search for several different kinds of patterns inparallel. Thus it is important to have a data mining system that can mine multiple kinds ofpatterns to accommodate different user expectations or applications. Furthermore, datamining systems should be able to discover patterns at various granularity (i.e., differentlevels of abstraction). Data mining systems should also allow users to specify hints toguide or focus the search for interesting patterns. Because some patterns may not holdfor all of the data in the database, a measure of certainty or “trustworthiness” is usuallyassociated with each discovered pattern.Data mining functionalities, and the kinds of patterns they can discover, are describedbelow.1.4.1 Concept/Class Description: Characterization andDiscriminationData can be associated with classes or concepts. For example, in the AllElectronics store,classes of items for sale include computers and printers, and concepts of customers includebigSpenders and budgetSpenders. It can be useful to describe individual classes and con-cepts in summarized, concise, and yet precise terms. Such descriptions of a class ora concept are called class/concept descriptions. These descriptions can be derived via(1) data characterization, by summarizing the data of the class under study (often called