PANG-NING TAN
Michigan State University

MICHAEL STEINBACH
University of Minnesota

VIPIN KUMAR
University of Minnesota
and Army High Performance Computing Research Center
Boston  San Francisco  New York
London  Toronto  Sydney  Tokyo  Singapore  Madrid
Mexico City  Munich  Paris  Cape Town  Hong Kong  Montreal
If you purchased this book within the United States or Canada you should be
aware that it has been wrongfully imported without the approval of the
Publisher or the Author.
Acquisitions Editor: Matt Goldstein
Project Editor: Katherine Harutunian
Production Supervisor: Marilyn Lloyd
Production Services: Paul C. Anagnostopoulos of Windfall Software
Marketing Manager: Michelle Brown
Copyeditor: Kathy Smith
Proofreader: Jennifer McClain
Technical Illustration: George Nichols
Cover Design Supervisor: Joyce Cosentino Wells
Cover Design: Night & Day Design
Cover Image: © 2005 Rob Casey/Brand X Pictures
Prepress and Manufacturing: Caroline Fell
Printer: Hamilton Printing
Access the latest information about Addison-Wesley titles from our World Wide
Web site:
http://www.aw-bc.com/computing
Many of the designations used by manufacturers and sellers to distinguish
their products are claimed as trademarks. Where those designations appear in
this book, and Addison-Wesley was aware of a trademark claim, the designations
have been printed in initial caps or all caps.
The programs and applications presented in this book have been included for
their instructional value. They have been tested with care, but are not
guaranteed for any particular purpose. The publisher does not offer any
warranties or representations, nor does it accept any liabilities with respect
to the programs or applications.
Copyright © 2006 by Pearson Education, Inc.
For information on obtaining permission for use of material in this work,
please submit a written request to Pearson Education, Inc., Rights and
Contract Department, 75 Arlington Street, Suite 300, Boston, MA 02116, or fax
your request to (617) 848-7047.
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording, or any other media embodiments now known
or hereafter to become known, without the prior written permission of the
publisher. Printed in the United States of America.
ISBN 0-321-42052-7
2 3 4 5 6 7 8 9 10-HAM-08 07 06
our families
Preface
Advances in data generation and collection are producing data
sets of mas-
sive size in commerce and a variety of scientific disciplines.
Data warehouses
store details of the sales and operations of businesses, Earth-
orbiting satellites
beam high-resolution images and sensor data back to Earth, and
genomics ex-
periments generate sequence, structural, and functional data for
an increasing
number of organisms. The ease with which data can now be
gathered and
stored has created a new attitude toward data analysis: Gather
whatever data
you can whenever and wherever possible. It has become an
article of faith
that the gathered data will have value, either for the purpose
that initially
motivated its collection or for purposes not yet envisioned.
The field of data mining grew out of the limitations of current
data anal-
ysis techniques in handling the challenges posed by these new
types of data
sets. Data mining does not replace other areas of data analysis,
but rather
takes them as the foundation for much of its work. While some
areas of data
mining, such as association analysis, are unique to the field,
other areas, such
as clustering, classification, and anomaly detection, build upon
a long history
of work on these topics in other fields. Indeed, the willingness
of data mining
researchers to draw upon existing techniques has contributed to
the strength
and breadth of the field, as well as to its rapid growth.
Another strength of the field has been its emphasis on
collaboration with
researchers in other areas. The challenges of analyzing new
types of data
cannot be met by simply applying data analysis techniques in
isolation from
those who understand the data and the domain in which it
resides. Often, skill
in building multidisciplinary teams has been as responsible for
the success of
data mining projects as the creation of new and innovative
algorithms. Just
as, historically, many developments in statistics were driven by
the needs of
agriculture, industry, medicine, and business, many of the
developments in
data mining are being driven by the needs of those same fields.
This book began as a set of notes and lecture slides for a data
mining
course that has been offered at the University of Minnesota
since Spring 1998
to upper-division undergraduate and graduate students.
Presentation slides
and exercises developed in these offerings grew with time and
served as a basis
for the book. A survey of clustering techniques in data mining,
originally
written in preparation for research in the area, served as a
starting point
for one of the chapters in the book. Over time, the clustering
chapter was
joined by chapters on data, classification, association analysis,
and anomaly
detection. The book in its current form has been class tested at
the home
institutions of the authors (the University of Minnesota and Michigan State
University), as well as several other universities.
A number of data mining books appeared in the meantime, but
were not
completely satisfactory for our students: primarily graduate and
undergraduate students in computer science, but including students from
industry and
a wide variety of other disciplines. Their mathematical and
computer back-
grounds varied considerably, but they shared a common goal: to
learn about
data mining as directly as possible in order to quickly apply it
to problems
in their own domains. Thus, texts with extensive mathematical
or statistical
prerequisites were unappealing to many of them, as were texts
that required a
substantial database background. The book that evolved in
response to these
students' needs focuses as directly as possible on the key
concepts of data min-
ing by illustrating them with examples, simple descriptions of
key algorithms,
and exercises.
Overview Specifically, this book provides a comprehensive
introduction to
data mining and is designed to be accessible and useful to
students, instructors,
researchers, and professionals. Areas covered include data
preprocessing, vi-
sualization, predictive modeling, association analysis,
clustering, and anomaly
detection. The goal is to present fundamental concepts and
algorithms for
each topic, thus providing the reader with the necessary
background for the
application of data mining to real problems. In addition, this
book also pro-
vides a starting point for those readers who are interested in
pursuing research
in data mining or related fields.
The book covers five main topics: data, classification,
association analysis,
clustering, and anomaly detection. Except for anomaly
detection, each of these
areas is covered in a pair of chapters. For classification,
association analysis,
and clustering, the introductory chapter covers basic concepts,
representative
algorithms, and evaluation techniques, while the more advanced
chapter dis-
cusses advanced concepts and algorithms. The objective is to
provide the
reader with a sound understanding of the foundations of data
mining, while
still covering many important advanced topics. Because of this
approach, the
book is useful both as a learning tool and as a reference.
To help the readers better understand the concepts that have
been pre-
sented, we provide an extensive set of examples, figures, and
exercises. Bib-
liographic notes are included at the end of each chapter for
readers who are
interested in more advanced topics, historically important
papers, and recent
trends. The book also contains a comprehensive subject and
author index.
To the Instructor As a textbook, this book is suitable for a wide
range
of students at the advanced undergraduate or graduate level.
Since students
come to this subject with diverse backgrounds that may not
include extensive
knowledge of statistics or databases, our book requires minimal
prerequisites:
no database knowledge is needed and we assume only a modest
background
in statistics or mathematics. To this end, the book was designed
to be as
self-contained as possible. Necessary material from statistics,
linear algebra,
and machine learning is either integrated into the body of the
text, or for some
advanced topics, covered in the appendices.
Since the chapters covering major data mining topics are self-
contained,
the order in which topics can be covered is quite flexible. The
core material
is covered in Chapters 2, 4, 6, 8, and 10. Although the
introductory data
chapter (2) should be covered first, the basic classification,
association analy-
sis, and clustering chapters (4, 6, and 8, respectively) can be
covered in any
order. Because of the relationship of anomaly detection (10) to
classification
(4) and clustering (8), these chapters should precede Chapter
10. Various
topics can be selected from the advanced classification,
association analysis,
and clustering chapters (5, 7, and 9, respectively) to fit the
schedule and in-
terests of the instructor and students. We also advise that the
lectures be
augmented by projects or practical exercises in data mining.
Although they
are time consuming, such hands-on assignments greatly enhance
the value of
the course.
Support Materials The supplements for the book are available at Addison-
Wesley's Web site www.aw.com/cssupport. Support materials available to all
readers of this book include

  - PowerPoint lecture slides
  - Suggestions for student projects
  - Data mining resources such as data mining algorithms and data sets
  - On-line tutorials that give step-by-step examples for selected data mining
    techniques described in the book using actual data sets and data analysis
    software
Additional support materials, including solutions to exercises,
are available
only to instructors adopting this textbook for classroom use.
Please contact
your school's Addison-Wesley representative for information on
obtaining ac-
cess to this material. Comments and suggestions, as well as
reports of errors,
can be sent to the authors through [email protected]
Acknowledgments Many people contributed to this book. We
begin by
acknowledging our families to whom this book is dedicated.
Without their
patience and support, this project would have been impossible.
We would like to thank the current and former students of our
data mining
groups at the University of Minnesota and Michigan State for
their contribu-
tions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the
initial data min-
ing classes. Some of the exercises and presentation slides that
they created can
be found in the book and its accompanying slides. Students in
our data min-
ing groups who provided comments on drafts of the book or
who contributed
in other ways include Shyam Boriah, Haibin Cheng, Varun
Chandola, Eric
Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer,
Jung-Eun Lee,
Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey,
Kashif Riaz,
Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and
Pusheng Zhang. We
would also like to thank the students of our data mining classes
at the Univer-
sity of Minnesota and Michigan State University who worked
with early drafts
of the book and provided invaluable feedback. We specifically
note the helpful
suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid
Vayghan, and Yu
Wei.
Joydeep Ghosh (University of Texas) and Sanjay Ranka
(University of
Florida) class tested early versions of the book. We also
received many useful
suggestions directly from the following UT students: Pankaj
Adhikari, Ra-
jiv Bhatia, Frederic Bosche, Arindam Chakraborty, Meghana
Deodhar, Chris
Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones,
Ajay Joshi,
Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung
Park, Aashish
Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and
Mei Yang.
Ronald Kostoff (ONR) read an early version of the clustering
chapter and
offered numerous suggestions. George Karypis provided
invaluable LaTeX as-
sistance in creating an author index. Irene Moulitsas also
provided assistance
with LaTeX and reviewed some of the appendices. Musetta
Steinbach was very
helpful in finding errors in the figures.
We would like to acknowledge our colleagues at the University
of Min-
nesota and Michigan State who have helped create a positive
environment for
data mining research. They include Dan Boley, Joyce Chai, Anil
Jain, Ravi
Janardan, Rong Jin, George Karypis, Haesun Park, William F.
Punch, Shashi
Shekhar, and Jaideep Srivastava. The collaborators on our many
data mining
projects, who also have our gratitude, include Ramesh Agrawal,
Steve Can-
non, Piet C. de Groen, Fran Hill, Yongdae Kim, Steve
Klooster, Kerry Long,
Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin
Silverstein, Nevin
Young, and Zhi-Li Zhang.
The departments of Computer Science and Engineering at the
University of
Minnesota and Michigan State University provided computing
resources and
a supportive environment for this project. ARDA, ARL, ARO,
DOE, NASA,
and NSF provided research support for Pang-Ning Tan, Michael
Steinbach,
and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney,
Jagdish Chan-
dra, Joe Coughlan, Michael Coyle, Stephen Davis, Frederica
Darema, Richard
Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan,
James Sido-
ran, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova,
and Xiaodong
Zhang have been supportive of our research in data mining and
high-performance
computing.
It was a pleasure working with the helpful staff at Pearson
Education. In
particular, we would like to thank Michelle Brown, Matt
Goldstein, Katherine
Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We
would also
like to thank George Nichols, who helped with the artwork, and Paul Anag-
nostopoulos, who provided LaTeX support. We are grateful to
the following
Pearson reviewers: Chien-Chung Chan (University of Akron),
Zhengxin Chen
(University of Nebraska at Omaha), Chris Clifton (Purdue
University), Joy-
deep Ghosh (University of Texas, Austin), Nazli Goharian
(Illinois Institute
of Technology), J. Michael Hardin (University of Alabama),
James Hearne
(Western Washington University), Hillol Kargupta (University
of Maryland,
Baltimore County and Agnik, LLC), Eamonn Keogh (University
of California-
Riverside), Bing Liu (University of Illinois at Chicago),
Mariofanna Milanova
(University of Arkansas at Little Rock), Srinivasan
Parthasarathy (Ohio State
University), Zbigniew W. Ras (University of North Carolina at
Charlotte),
Xintao Wu (University of North Carolina at Charlotte), and
Mohammed J.
Zaki (Rensselaer Polytechnic Institute).
Contents
Preface vii
1 Introduction 1
1.1 What Is Data Mining? 2
1.2 Motivating Challenges 4
1.3 The Origins of Data Mining 6
1.4 Data Mining Tasks 7
1.5 Scope and Organization of the Book 11
1.6 Bibliographic Notes 13
1.7 Exercises 16
2 Data 19
2.1 Types of Data 22
2.1.1 Attributes and Measurement 23
2.1.2 Types of Data Sets 29
2.2 Data Quality 36
2.2.1 Measurement and Data Collection Issues 37
2.2.2 Issues Related to Applications 43
2.3 Data Preprocessing 44
2.3.1 Aggregation 45
2.3.2 Sampling 47
2.3.3 Dimensionality Reduction 50
2.3.4 Feature Subset Selection 52
2.3.5 Feature Creation 55
2.3.6 Discretization and Binarization 57
2.3.7 Variable Transformation 63
2.4 Measures of Similarity and Dissimilarity 65
2.4.1 Basics 66
2.4.2 Similarity and Dissimilarity between Simple Attributes 67
2.4.3 Dissimilarities between Data Objects 69
2.4.4 Similarities between Data Objects 72
2.4.5 Examples of Proximity Measures 73
2.4.6 Issues in Proximity Calculation 80
2.4.7 Selecting the Right Proximity Measure 83
2.5 Bibliographic Notes 84
2.6 Exercises 88
3 Exploring Data 97
3.1 The Iris Data Set 98
3.2 Summary Statistics 98
3.2.1 Frequencies and the Mode 99
3.2.2 Percentiles 100
3.2.3 Measures of Location: Mean and Median 101
3.2.4 Measures of Spread: Range and Variance 102
3.2.5 Multivariate Summary Statistics 104
3.2.6 Other Ways to Summarize the Data 105
3.3 Visualization 105
3.3.1 Motivations for Visualization 105
3.3.2 General Concepts 106
3.3.3 Techniques 110
3.3.4 Visualizing Higher-Dimensional Data 124
3.3.5 Do's and Don'ts 130
3.4 OLAP and Multidimensional Data Analysis 131
3.4.1 Representing Iris Data as a Multidimensional Array 131
3.4.2 Multidimensional Data: The General Case 133
3.4.3 Analyzing Multidimensional Data 135
3.4.4 Final Comments on Multidimensional Data Analysis 139
3.5 Bibliographic Notes 139
3.6 Exercises 141
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 145
4.1 Preliminaries 146
4.2 General Approach to Solving a Classification Problem 148
4.3 Decision Tree Induction 150
4.3.1 How a Decision Tree Works 150
4.3.2 How to Build a Decision Tree 151
4.3.3 Methods for Expressing Attribute Test Conditions 155
4.3.4 Measures for Selecting the Best Split 158
4.3.5 Algorithm for Decision Tree Induction 164
4.3.6 An Example: Web Robot Detection 166
4.3.7 Characteristics of Decision Tree Induction 168
4.4 Model Overfitting 172
4.4.1 Overfitting Due to Presence of Noise 175
4.4.2 Overfitting Due to Lack of Representative Samples 177
4.4.3 Overfitting and the Multiple Comparison Procedure 178
4.4.4 Estimation of Generalization Errors 179
4.4.5 Handling Overfitting in Decision Tree Induction 184
4.5 Evaluating the Performance of a Classifier 186
4.5.1 Holdout Method 186
4.5.2 Random Subsampling 187
4.5.3 Cross-Validation 187
4.5.4 Bootstrap 188
4.6 Methods for Comparing Classifiers 188
4.6.1 Estimating a Confidence Interval for Accuracy 189
4.6.2 Comparing the Performance of Two Models 191
4.6.3 Comparing the Performance of Two Classifiers 192
4.7 Bibliographic Notes 193
4.8 Exercises 198
5 Classification: Alternative Techniques 207
5.1 Rule-Based Classifier 207
5.1.1 How a Rule-Based Classifier Works 209
5.1.2 Rule-Ordering Schemes 211
5.1.3 How to Build a Rule-Based Classifier 212
5.1.4 Direct Methods for Rule Extraction 213
5.1.5 Indirect Methods for Rule Extraction 221
5.1.6 Characteristics of Rule-Based Classifiers 223
5.2 Nearest-Neighbor Classifiers 223
5.2.1 Algorithm 225
5.2.2 Characteristics of Nearest-Neighbor Classifiers 226
5.3 Bayesian Classifiers 227
5.3.1 Bayes Theorem 228
5.3.2 Using the Bayes Theorem for Classification 229
5.3.3 Naive Bayes Classifier 231
5.3.4 Bayes Error Rate 238
5.3.5 Bayesian Belief Networks 240
5.4 Artificial Neural Network (ANN) 246
5.4.1 Perceptron 247
5.4.2 Multilayer Artificial Neural Network 251
5.4.3 Characteristics of ANN 255
5.5 Support Vector Machine (SVM) 256
5.5.1 Maximum Margin Hyperplanes 256
5.5.2 Linear SVM: Separable Case 259
5.5.3 Linear SVM: Nonseparable Case 266
5.5.4 Nonlinear SVM 270
5.5.5 Characteristics of SVM 276
5.6 Ensemble Methods 276
5.6.1 Rationale for Ensemble Method 277
5.6.2 Methods for Constructing an Ensemble Classifier 278
5.6.3 Bias-Variance Decomposition 281
5.6.4 Bagging 283
5.6.5 Boosting 285
5.6.6 Random Forests 290
5.6.7 Empirical Comparison among Ensemble Methods 294
5.7 Class Imbalance Problem 294
5.7.1 Alternative Metrics 295
5.7.2 The Receiver Operating Characteristic Curve 298
5.7.3 Cost-Sensitive Learning 302
5.7.4 Sampling-Based Approaches 305
5.8 Multiclass Problem 306
5.9 Bibliographic Notes 309
5.10 Exercises 315
6 Association Analysis: Basic Concepts and Algorithms 327
6.1 Problem Definition 328
6.2 Frequent Itemset Generation 332
6.2.1 The Apriori Principle 333
6.2.2 Frequent Itemset Generation in the Apriori Algorithm 335
6.2.3 Candidate Generation and Pruning 338
6.2.4 Support Counting 342
6.2.5 Computational Complexity 345
6.3 Rule Generation 349
6.3.1 Confidence-Based Pruning 350
6.3.2 Rule Generation in Apriori Algorithm 350
6.3.3 An Example: Congressional Voting Records 352
6.4 Compact Representation of Frequent Itemsets 353
6.4.1 Maximal Frequent Itemsets 354
6.4.2 Closed Frequent Itemsets 355
6.5 Alternative Methods for Generating Frequent Itemsets 359
6.6 FP-Growth Algorithm 363
6.6.1 FP-Tree Representation 363
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm 366
6.7 Evaluation of Association Patterns 370
6.7.1 Objective Measures of Interestingness 371
6.7.2 Measures beyond Pairs of Binary Variables 382
6.7.3 Simpson's Paradox 384
6.8 Effect of Skewed Support Distribution 386
6.9 Bibliographic Notes 390
6.10 Exercises 404
7 Association Analysis: Advanced Concepts 415
7.1 Handling Categorical Attributes 415
7.2 Handling Continuous Attributes 418
7.2.1 Discretization-Based Methods 418
7.2.2 Statistics-Based Methods 422
7.2.3 Non-discretization Methods 424
7.3 Handling a Concept Hierarchy 426
7.4 Sequential Patterns 429
7.4.1 Problem Formulation 429
7.4.2 Sequential Pattern Discovery 431
7.4.3 Timing Constraints 436
7.4.4 Alternative Counting Schemes 439
7.5 Subgraph Patterns 442
7.5.1 Graphs and Subgraphs 443
7.5.2 Frequent Subgraph Mining 444
7.5.3 Apriori-like Method 447
7.5.4 Candidate Generation 448
7.5.5 Candidate Pruning 453
7.5.6 Support Counting 457
7.6 Infrequent Patterns 457
7.6.1 Negative Patterns 458
7.6.2 Negatively Correlated Patterns 458
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 460
7.6.4 Techniques for Mining Interesting Infrequent Patterns 461
7.6.5 Techniques Based on Mining Negative Patterns 463
7.6.6 Techniques Based on Support Expectation 465
7.7 Bibliographic Notes 469
7.8 Exercises 473
8 Cluster Analysis: Basic Concepts and Algorithms 487
8.1 Overview 490
8.1.1 What Is Cluster Analysis? 490
8.1.2 Different Types of Clusterings 491
8.1.3 Different Types of Clusters 493
8.2 K-means 496
8.2.1 The Basic K-means Algorithm 497
8.2.2 K-means: Additional Issues 506
8.2.3 Bisecting K-means 508
8.2.4 K-means and Different Types of Clusters 510
8.2.5 Strengths and Weaknesses 510
8.2.6 K-means as an Optimization Problem 513
8.3 Agglomerative Hierarchical Clustering 515
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 516
8.3.2 Specific Techniques 518
8.3.3 The Lance-Williams Formula for Cluster Proximity 524
8.3.4 Key Issues in Hierarchical Clustering 524
8.3.5 Strengths and Weaknesses 526
8.4 DBSCAN 526
8.4.1 Traditional Density: Center-Based Approach 527
8.4.2 The DBSCAN Algorithm 528
8.4.3 Strengths and Weaknesses 530
8.5 Cluster Evaluation 532
8.5.1 Overview 533
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 536
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 542
8.5.4 Unsupervised Evaluation of Hierarchical Clustering 544
8.5.5 Determining the Correct Number of Clusters 546
8.5.6 Clustering Tendency 547
8.5.7 Supervised Measures of Cluster Validity 548
8.5.8 Assessing the Significance of Cluster Validity Measures 553
8.6 Bibliographic Notes 555
8.7 Exercises 559
9 Cluster Analysis: Additional Issues and Algorithms 569
9.1 Characteristics of Data, Clusters, and Clustering Algorithms 570
9.1.1 Example: Comparing K-means and DBSCAN 570
9.1.2 Data Characteristics 571
9.1.3 Cluster Characteristics 573
9.1.4 General Characteristics of Clustering Algorithms 575
9.2 Prototype-Based Clustering 577
9.2.1 Fuzzy Clustering 577
9.2.2 Clustering Using Mixture Models 583
9.2.3 Self-Organizing Maps (SOM) 594
9.3 Density-Based Clustering 600
9.3.1 Grid-Based Clustering 601
9.3.2 Subspace Clustering 604
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 608
9.4 Graph-Based Clustering 612
9.4.1 Sparsification 613
9.4.2 Minimum Spanning Tree (MST) Clustering 614
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 616
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 616
9.4.5 Shared Nearest Neighbor Similarity 622
9.4.6 The Jarvis-Patrick Clustering Algorithm 625
9.4.7 SNN Density 627
9.4.8 SNN Density-Based Clustering 629
9.5 Scalable Clustering Algorithms 630
9.5.1 Scalability: General Issues and Approaches 630
9.5.2 BIRCH 633
9.5.3 CURE 635
9.6 Which Clustering Algorithm? 639
9.7 Bibliographic Notes 643
9.8 Exercises 647
10 Anomaly Detection 651
10.1 Preliminaries 653
10.1.1 Causes of Anomalies 653
10.1.2 Approaches to Anomaly Detection 654
10.1.3 The Use of Class Labels 655
10.1.4 Issues 656
10.2 Statistical Approaches 658
10.2.1 Detecting Outliers in a Univariate Normal Distribution 659
10.2.2 Outliers in a Multivariate Normal Distribution 661
10.2.3 A Mixture Model Approach for Anomaly Detection 662
10.2.4 Strengths and Weaknesses 665
10.3 Proximity-Based Outlier Detection 666
10.3.1 Strengths and Weaknesses 666
10.4 Density-Based Outlier Detection 668
10.4.1 Detection of Outliers Using Relative Density 669
10.4.2 Strengths and Weaknesses 670
10.5 Clustering-Based Techniques 671
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster 672
10.5.2 Impact of Outliers on the Initial Clustering 674
10.5.3 The Number of Clusters to Use 674
10.5.4 Strengths and Weaknesses 674
10.6 Bibliographic Notes 675
10.7 Exercises 680
Appendix A Linear Algebra 685
A.1 Vectors 685
A.1.1 Definition 685
A.1.2 Vector Addition and Multiplication by a Scalar 685
A.1.3 Vector Spaces 687
A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections 688
A.1.5 Vectors and Data Analysis 690
A.2 Matrices 691
A.2.1 Matrices: Definitions 691
A.2.2 Matrices: Addition and Multiplication by a Scalar 692
A.2.3 Matrices: Multiplication 693
A.2.4 Linear Transformations and Inverse Matrices 695
A.2.5 Eigenvalue and Singular Value Decomposition 697
A.2.6 Matrices and Data Analysis 699
A.3 Bibliographic Notes 700
Appendix B Dimensionality Reduction 701
B.1 PCA and SVD 701
B.1.1 Principal Components Analysis (PCA) 701
B.1.2 SVD 706
B.2 Other Dimensionality Reduction Techniques 708
B.2.1 Factor Analysis 708
B.2.2 Locally Linear Embedding (LLE) 710
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP 712
B.2.4 Common Issues 715
B.3 Bibliographic Notes 716
Appendix C Probability and Statistics 719
C.1 Probability 719
C.1.1 Expected Values 722
C.2 Statistics 723
C.2.1 Point Estimation 724
C.2.2 Central Limit Theorem 724
C.2.3 Interval Estimation 725
C.3 Hypothesis Testing 726
Appendix D Regression 729
D.1 Preliminaries 729
D.2 Simple Linear Regression 730
D.2.1 Least Square Method 731
D.2.2 Analyzing Regression Errors 733
D.2.3 Analyzing Goodness of Fit 735
D.3 Multivariate Linear Regression 736
D.4 Alternative Least-Square Regression Methods 737
Appendix E Optimization 739
E.1 Unconstrained Optimization 739
E.1.1 Numerical Methods 742
E.2 Constrained Optimization 746
E.2.1 Equality Constraints 746
E.2.2 Inequality Constraints 747
Author Index 750
Subject Index 758
Copyright Permissions 769
1
Introduction
Rapid advances in data collection and storage technology have
enabled or-
ganizations to accumulate vast amounts of data. However,
extracting useful
information has proven extremely challenging. Often,
traditional data analy-
sis tools and techniques cannot be used because of the massive
size of a data
set. Sometimes, the non-traditional nature of the data means that
traditional
approaches cannot be applied even if the data set is relatively
small. In other
situations, the questions that need to be answered cannot be
addressed using
existing data analysis techniques, and thus, new methods need
to be devel-
oped.
Data mining is a technology that blends traditional data analysis
methods
with sophisticated algorithms for processing large volumes of
data. It has also
opened up exciting opportunities for exploring and analyzing
new types of
data and for analyzing old types of data in new ways. In this
introductory
chapter, we present an overview of data mining and outline the
key topics
to be covered in this book. We start with a description of some
well-known
applications that require new techniques for data analysis.
Business Point-of-sale data collection (bar code scanners, radio
frequency
identification (RFID), and smart card technology) has allowed
retailers to
collect up-to-the-minute data about customer purchases at the
checkout coun-
ters of their stores. Retailers can utilize this information, along
with other
business-critical data such as Web logs from e-commerce Web
sites and cus-
tomer service records from call centers, to help them better
understand the
needs of their customers and make more informed business
decisions.
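As a concrete illustration of the kind of analysis such point-of-sale data supports, the short sketch below counts how often pairs of products appear together in a small set of transactions, the simplest ingredient of the association analysis developed later in the book. The transactions and item names are made up for illustration; this is not code from the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "cola"},
]

# Count the number of baskets containing each pair of items (the pair's support count).
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least 60% of baskets are natural cross-selling candidates.
for pair, count in pair_counts.most_common():
    support = count / len(transactions)
    if support >= 0.6:
        print(pair, support)
```

Real retail data sets contain millions of baskets and thousands of items, which is why the efficient algorithms of Chapters 6 and 7 are needed rather than this brute-force enumeration.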
Data mining techniques can be used to support a wide range of
business
intelligence applications such as customer profiling, targeted
marketing, work-
flow management, store layout, and fraud detection. It can also
help retailers
answer important business questions such as "Who are the most
profitable
customers?" "What products can be cross-sold or up-sold?" and
"What is the
revenue outlook of the company for next year?" Some of these
questions …
Data Mining
Third Edition
The Morgan Kaufmann Series in Data Management Systems
(Selected Titles)
Joe Celko’s Data, Measurements, and Standards in SQL
Joe Celko
Information Modeling and Relational Databases, 2nd Edition
Terry Halpin, Tony Morgan
Joe Celko’s Thinking in Sets
Joe Celko
Business Metadata
Bill Inmon, Bonnie O’Neil, Lowell Fryman
Unleashing Web 2.0
Gottfried Vossen, Stephan Hagemann
Enterprise Knowledge Management
David Loshin
The Practitioner’s Guide to Data Quality Improvement
David Loshin
Business Process Change, 2nd Edition
Paul Harmon
IT Manager’s Handbook, 2nd Edition
Bill Holtsnider, Brian Jaffe
Joe Celko’s Puzzles and Answers, 2nd Edition
Joe Celko
Architecture and Patterns for IT Service Management, 2nd
Edition, Resource Planning
and Governance
Charles Betz
Joe Celko’s Analytics and OLAP in SQL
Joe Celko
Data Preparation for Data Mining Using SAS
Mamdouh Refaat
Querying XML: XQuery, XPath, and SQL/ XML in Context
Jim Melton, Stephen Buxton
Data Mining: Concepts and Techniques, 3rd Edition
Jiawei Han, Micheline Kamber, Jian Pei
Database Modeling and Design: Logical Design, 5th Edition
Toby J. Teorey, Sam S. Lightstone, Thomas P. Nadeau, H. V.
Jagadish
Foundations of Multidimensional and Metric Data Structures
Hanan Samet
Joe Celko’s SQL for Smarties: Advanced SQL Programming,
4th Edition
Joe Celko
Moving Objects Databases
Ralf Hartmut Güting, Markus Schneider
Joe Celko’s SQL Programming Style
Joe Celko
Fuzzy Modeling and Genetic Algorithms for Data Mining and
Exploration
Earl Cox
Data Modeling Essentials, 3rd Edition
Graeme C. Simsion, Graham C. Witt
Developing High Quality Data Models
Matthew West
Location-Based Services
Jochen Schiller, Agnes Voisard
Managing Time in Relational Databases: How to Design,
Update, and Query Temporal Data
Tom Johnston, Randall Weis
Database Modeling with Microsoft R© Visio for Enterprise
Architects
Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean
Designing Data-Intensive Web Applications
Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla,
Sara Comai, Maristella Matera
Mining the Web: Discovering Knowledge from Hypertext Data
Soumen Chakrabarti
Advanced SQL: 1999—Understanding Object-Relational and
Other Advanced Features
Jim Melton
Database Tuning: Principles, Experiments, and Troubleshooting
Techniques
Dennis Shasha, Philippe Bonnet
SQL: 1999—Understanding Relational Language Components
Jim Melton, Alan R. Simon
Information Visualization in Data Mining and Knowledge
Discovery
Edited by Usama Fayyad, Georges G. Grinstein, Andreas Wierse
Transactional Information Systems
Gerhard Weikum, Gottfried Vossen
Spatial Databases
Philippe Rigaux, Michel Scholl, and Agnes Voisard
Managing Reference Data in Enterprise Databases
Malcolm Chisholm
Understanding SQL and Java Together
Jim Melton, Andrew Eisenberg
Database: Principles, Programming, and Performance, 2nd
Edition
Patrick and Elizabeth O’Neil
The Object Data Standard
Edited by R. G. G. Cattell, Douglas Barry
Data on the Web: From Relations to Semistructured Data and
XML
Serge Abiteboul, Peter Buneman, Dan Suciu
Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations,
3rd Edition
Ian Witten, Eibe Frank, Mark A. Hall
Joe Celko’s Data and Databases: Concepts in Practice
Joe Celko
Developing Time-Oriented Database Applications in SQL
Richard T. Snodgrass
Web Farming for the Data Warehouse
Richard D. Hackathorn
Management of Heterogeneous and Autonomous Database
Systems
Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, Amit Sheth
Object-Relational DBMSs, 2nd Edition
Michael Stonebraker, Paul Brown, with Dorothy Moore
Universal Database Management: A Guide to Object/Relational
Technology
Cynthia Maro Saracco
Readings in Database Systems, 3rd Edition
Edited by Michael Stonebraker, Joseph M. Hellerstein
Understanding SQL’s Stored Procedures: A Complete Guide to
SQL/PSM
Jim Melton
Principles of Multimedia Database Systems
V. S. Subrahmanian
Principles of Database Query Processing for Advanced
Applications
Clement T. Yu, Weiyi Meng
Advanced Database Systems
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T.
Snodgrass, V. S. Subrahmanian,
Roberto Zicari
Principles of Transaction Processing, 2nd Edition
Philip A. Bernstein, Eric Newcomer
Using the New DB2: IBM’s Object-Relational Database System
Don Chamberlin
Distributed Algorithms
Nancy A. Lynch
Active Database Systems: Triggers and Rules for Advanced
Database Processing
Edited by Jennifer Widom, Stefano Ceri
Migrating Legacy Systems: Gateways, Interfaces, and the
Incremental Approach
Michael L. Brodie, Michael Stonebraker
Atomic Transactions
Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete
Query Processing for Advanced Database Systems
Edited by Johann Christoph Freytag, David Maier, Gottfried
Vossen
Transaction Processing
Jim Gray, Andreas Reuter
Database Transaction Models for Advanced Applications
Edited by Ahmed K. Elmagarmid
A Guide to Developing Client/Server SQL Applications
Setrag Khoshafian, Arvola Chan, Anna Wong, Harry K. T.
Wong
Data Mining
Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber
Jian Pei
Simon Fraser University
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Morgan Kaufmann Publishers is an imprint of Elsevier.
225 Wyman Street, Waltham, MA 02451, USA
© 2012 by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in
any form or by any means,
electronic or mechanical, including photocopying, recording, or
any information storage and
retrieval system, without permission in writing from the
publisher. Details on how to seek
permission, further information about the Publisher’s
permissions policies and our
arrangements with organizations such as the Copyright
Clearance Center and the Copyright
Licensing Agency, can be found at our website:
www.elsevier.com/permissions.
This book and the individual contributions contained in it are
protected under copyright by
the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly
changing. As new research and
experience broaden our understanding, changes in research
methods or professional practices,
may become necessary. Practitioners and researchers must
always rely on their own experience
and knowledge in evaluating and using any information or
methods described herein. In using
such information or methods they should be mindful of their
own safety and the safety of others,
including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the
authors, contributors, or editors,
assume any liability for any injury and/or damage to persons or
property as a matter of products
liability, negligence or otherwise, or from any use or operation
of any methods, products,
instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Han, Jiawei.
Data mining : concepts and techniques / Jiawei Han, Micheline
Kamber, Jian Pei. – 3rd ed.
p. cm.
ISBN 978-0-12-381479-1
1. Data mining. I. Kamber, Micheline. II. Pei, Jian. III. Title.
QA76.9.D343H36 2011
006.3′12–dc22 2011010635
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British
Library.
For information on all Morgan Kaufmann publications, visit our
Web site at www.mkp.com or www.elsevierdirect.com
Printed in the United States of America
11 12 13 14 15 10 9 8 7 6 5 4 3 2 1
To Y. Dora and Lawrence for your love and encouragement
J.H.
To Erik, Kevan, Kian, and Mikael for your love and inspiration
M.K.
To my wife, Jennifer, and daughter, Jacqueline
J.P.
Contents
Foreword xix
Foreword to Second Edition xxi
Preface xxiii
Acknowledgments xxxi
About the Authors xxxv
Chapter 1 Introduction 1
1.1 Why Data Mining? 1
1.1.1 Moving toward the Information Age 1
1.1.2 Data Mining as the Evolution of Information Technology
2
1.2 What Is Data Mining? 5
1.3 What Kinds of Data Can Be Mined? 8
1.3.1 Database Data 9
1.3.2 Data Warehouses 10
1.3.3 Transactional Data 13
1.3.4 Other Kinds of Data 14
1.4 What Kinds of Patterns Can Be Mined? 15
1.4.1 Class/Concept Description: Characterization and
Discrimination 15
1.4.2 Mining Frequent Patterns, Associations, and Correlations
17
1.4.3 Classification and Regression for Predictive Analysis 18
1.4.4 Cluster Analysis 19
1.4.5 Outlier Analysis 20
1.4.6 Are All Patterns Interesting? 21
1.5 Which Technologies Are Used? 23
1.5.1 Statistics 23
1.5.2 Machine Learning 24
1.5.3 Database Systems and Data Warehouses 26
1.5.4 Information Retrieval 26
1.6 Which Kinds of Applications Are Targeted? 27
1.6.1 Business Intelligence 27
1.6.2 Web Search Engines 28
1.7 Major Issues in Data Mining 29
1.7.1 Mining Methodology 29
1.7.2 User Interaction 30
1.7.3 Efficiency and Scalability 31
1.7.4 Diversity of Database Types 32
1.7.5 Data Mining and Society 32
1.8 Summary 33
1.9 Exercises 34
1.10 Bibliographic Notes 35
Chapter 2 Getting to Know Your Data 39
2.1 Data Objects and Attribute Types 40
2.1.1 What Is an Attribute? 40
2.1.2 Nominal Attributes 41
2.1.3 Binary Attributes 41
2.1.4 Ordinal Attributes 42
2.1.5 Numeric Attributes 43
2.1.6 Discrete versus Continuous Attributes 44
2.2 Basic Statistical Descriptions of Data 44
2.2.1 Measuring the Central Tendency: Mean, Median, and
Mode 45
2.2.2 Measuring the Dispersion of Data: Range, Quartiles,
Variance,
Standard Deviation, and Interquartile Range 48
2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
51
2.3 Data Visualization 56
2.3.1 Pixel-Oriented Visualization Techniques 57
2.3.2 Geometric Projection Visualization Techniques 58
2.3.3 Icon-Based Visualization Techniques 60
2.3.4 Hierarchical Visualization Techniques 63
2.3.5 Visualizing Complex Data and Relations 64
2.4 Measuring Data Similarity and Dissimilarity 65
2.4.1 Data Matrix versus Dissimilarity Matrix 67
2.4.2 Proximity Measures for Nominal Attributes 68
2.4.3 Proximity Measures for Binary Attributes 70
2.4.4 Dissimilarity of Numeric Data: Minkowski Distance 72
2.4.5 Proximity Measures for Ordinal Attributes 74
2.4.6 Dissimilarity for Attributes of Mixed Types 75
2.4.7 Cosine Similarity 77
2.5 Summary 79
2.6 Exercises 79
2.7 Bibliographic Notes 81
Chapter 3 Data Preprocessing 83
3.1 Data Preprocessing: An Overview 84
3.1.1 Data Quality: Why Preprocess the Data? 84
3.1.2 Major Tasks in Data Preprocessing 85
3.2 Data Cleaning 88
3.2.1 Missing Values 88
3.2.2 Noisy Data 89
3.2.3 Data Cleaning as a Process 91
3.3 Data Integration 93
3.3.1 Entity Identification Problem 94
3.3.2 Redundancy and Correlation Analysis 94
3.3.3 Tuple Duplication 98
3.3.4 Data Value Conflict Detection and Resolution 99
3.4 Data Reduction 99
3.4.1 Overview of Data Reduction Strategies 99
3.4.2 Wavelet Transforms 100
3.4.3 Principal Components Analysis 102
3.4.4 Attribute Subset Selection 103
3.4.5 Regression and Log-Linear Models: Parametric
Data Reduction 105
3.4.6 Histograms 106
3.4.7 Clustering 108
3.4.8 Sampling 108
3.4.9 Data Cube Aggregation 110
3.5 Data Transformation and Data Discretization 111
3.5.1 Data Transformation Strategies Overview 112
3.5.2 Data Transformation by Normalization 113
3.5.3 Discretization by Binning 115
3.5.4 Discretization by Histogram Analysis 115
3.5.5 Discretization by Cluster, Decision Tree, and Correlation
Analyses 116
3.5.6 Concept Hierarchy Generation for Nominal Data 117
3.6 Summary 120
3.7 Exercises 121
3.8 Bibliographic Notes 123
Chapter 4 Data Warehousing and Online Analytical Processing
125
4.1 Data Warehouse: Basic Concepts 125
4.1.1 What Is a Data Warehouse? 126
4.1.2 Differences between Operational Database Systems
and Data Warehouses 128
4.1.3 But, Why Have a Separate Data Warehouse? 129
4.1.4 Data Warehousing: A Multitiered Architecture 130
4.1.5 Data Warehouse Models: Enterprise Warehouse, Data
Mart,
and Virtual Warehouse 132
4.1.6 Extraction, Transformation, and Loading 134
4.1.7 Metadata Repository 134
4.2 Data Warehouse Modeling: Data Cube and OLAP 135
4.2.1 Data Cube: A Multidimensional Data Model 136
4.2.2 Stars, Snowflakes, and Fact Constellations: Schemas
for Multidimensional Data Models 139
4.2.3 Dimensions: The Role of Concept Hierarchies 142
4.2.4 Measures: Their Categorization and Computation 144
4.2.5 Typical OLAP Operations 146
4.2.6 A Starnet Query Model for Querying Multidimensional
Databases 149
4.3 Data Warehouse Design and Usage 150
4.3.1 A Business Analysis Framework for Data Warehouse
Design 150
4.3.2 Data Warehouse Design Process 151
4.3.3 Data Warehouse Usage for Information Processing 153
4.3.4 From Online Analytical Processing to Multidimensional
Data Mining 155
4.4 Data Warehouse Implementation 156
4.4.1 Efficient Data Cube Computation: An Overview 156
4.4.2 Indexing OLAP Data: Bitmap Index and Join Index 160
4.4.3 Efficient Processing of OLAP Queries 163
4.4.4 OLAP Server Architectures: ROLAP versus MOLAP
versus HOLAP 164
4.5 Data Generalization by Attribute-Oriented Induction 166
4.5.1 Attribute-Oriented Induction for Data Characterization
167
4.5.2 Efficient Implementation of Attribute-Oriented Induction
172
4.5.3 Attribute-Oriented Induction for Class Comparisons 175
4.6 Summary 178
4.7 Exercises 180
4.8 Bibliographic Notes 184
Chapter 5 Data Cube Technology 187
5.1 Data Cube Computation: Preliminary Concepts 188
5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed
Cube,
and Cube Shell 188
5.1.2 General Strategies for Data Cube Computation 192
5.2 Data Cube Computation Methods 194
5.2.1 Multiway Array Aggregation for Full Cube Computation
195
5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid
Downward 200
5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic
Star-Tree Structure 204
5.2.4 Precomputing Shell Fragments for Fast High-Dimensional
OLAP 210
5.3 Processing Advanced Kinds of Queries by Exploring Cube
Technology 218
5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data
218
5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries
225
5.4 Multidimensional Data Analysis in Cube Space 227
5.4.1 Prediction Cubes: Prediction Mining in Cube Space 227
5.4.2 Multifeature Cubes: Complex Aggregation at Multiple
Granularities 230
5.4.3 Exception-Based, Discovery-Driven Cube Space
Exploration 231
5.5 Summary 234
5.6 Exercises 235
5.7 Bibliographic Notes 240
Chapter 6 Mining Frequent Patterns, Associations, and
Correlations: Basic
Concepts and Methods 243
6.1 Basic Concepts 243
6.1.1 Market Basket Analysis: A Motivating Example 244
6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
246
6.2 Frequent Itemset Mining Methods 248
6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined
Candidate Generation 248
6.2.2 Generating Association Rules from Frequent Itemsets 254
6.2.3 Improving the Efficiency of Apriori 254
6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
257
6.2.5 Mining Frequent Itemsets Using Vertical Data Format 259
6.2.6 Mining Closed and Max Patterns 262
6.3 Which Patterns Are Interesting?—Pattern Evaluation
Methods 264
6.3.1 Strong Rules Are Not Necessarily Interesting 264
6.3.2 From Association Analysis to Correlation Analysis 265
6.3.3 A Comparison of Pattern Evaluation Measures 267
6.4 Summary 271
6.5 Exercises 273
6.6 Bibliographic Notes 276
Chapter 7 Advanced Pattern Mining 279
7.1 Pattern Mining: A Road Map 279
7.2 Pattern Mining in Multilevel, Multidimensional Space 283
7.2.1 Mining Multilevel Associations 283
7.2.2 Mining Multidimensional Associations 287
7.2.3 Mining Quantitative Association Rules 289
7.2.4 Mining Rare Patterns and Negative Patterns 291
7.3 Constraint-Based Frequent Pattern Mining 294
7.3.1 Metarule-Guided Mining of Association Rules 295
7.3.2 Constraint-Based Pattern Generation: Pruning Pattern
Space
and Pruning Data Space 296
7.4 Mining High-Dimensional Data and Colossal Patterns 301
7.4.1 Mining Colossal Patterns by Pattern-Fusion 302
7.5 Mining Compressed or Approximate Patterns 307
7.5.1 Mining Compressed Patterns by Pattern Clustering 308
7.5.2 Extracting Redundancy-Aware Top-k Patterns 310
7.6 Pattern Exploration and Application 313
7.6.1 Semantic Annotation of Frequent Patterns 313
7.6.2 Applications of Pattern Mining 317
7.7 Summary 319
7.8 Exercises 321
7.9 Bibliographic Notes 323
Chapter 8 Classification: Basic Concepts 327
8.1 Basic Concepts 327
8.1.1 What Is Classification? 327
8.1.2 General Approach to Classification 328
8.2 Decision Tree Induction 330
8.2.1 Decision Tree Induction 332
8.2.2 Attribute Selection Measures 336
8.2.3 Tree Pruning 344
8.2.4 Scalability and Decision Tree Induction 347
8.2.5 Visual Mining for Decision Tree Induction 348
8.3 Bayes Classification Methods 350
8.3.1 Bayes’ Theorem 350
8.3.2 Naïve Bayesian Classification 351
8.4 Rule-Based Classification 355
8.4.1 Using IF-THEN Rules for Classification 355
8.4.2 Rule Extraction from a Decision Tree 357
8.4.3 Rule Induction Using a Sequential Covering Algorithm
359
8.5 Model Evaluation and Selection 364
8.5.1 Metrics for Evaluating Classifier Performance 364
8.5.2 Holdout Method and Random Subsampling 370
8.5.3 Cross-Validation 370
8.5.4 Bootstrap 371
8.5.5 Model Selection Using Statistical Tests of Significance
372
8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC
Curves 373
8.6 Techniques to Improve Classification Accuracy 377
8.6.1 Introducing Ensemble Methods 378
8.6.2 Bagging 379
8.6.3 Boosting and AdaBoost 380
8.6.4 Random Forests 382
8.6.5 Improving Classification Accuracy of Class-Imbalanced
Data 383
8.7 Summary 385
8.8 Exercises 386
8.9 Bibliographic Notes 389
Chapter 9 Classification: Advanced Methods 393
9.1 Bayesian Belief Networks 393
9.1.1 Concepts and Mechanisms 394
9.1.2 Training Bayesian Belief Networks 396
9.2 Classification by Backpropagation 398
9.2.1 A Multilayer Feed-Forward Neural Network 398
9.2.2 Defining a Network Topology 400
9.2.3 Backpropagation 400
9.2.4 Inside the Black Box: Backpropagation and
Interpretability 406
9.3 Support Vector Machines 408
9.3.1 The Case When the Data Are Linearly Separable 408
9.3.2 The Case When the Data Are Linearly Inseparable 413
9.4 Classification Using Frequent Patterns 415
9.4.1 Associative Classification 416
9.4.2 Discriminative Frequent Pattern–Based Classification 419
9.5 Lazy Learners (or Learning from Your Neighbors) 422
9.5.1 k-Nearest-Neighbor Classifiers 423
9.5.2 Case-Based Reasoning 425
9.6 Other Classification Methods 426
9.6.1 Genetic Algorithms 426
9.6.2 Rough Set Approach 427
9.6.3 Fuzzy Set Approaches 428
9.7 Additional Topics Regarding Classification 429
9.7.1 Multiclass Classification 430
9.7.2 Semi-Supervised Classification 432
9.7.3 Active Learning 433
9.7.4 Transfer Learning 434
9.8 Summary 436
9.9 Exercises 438
9.10 Bibliographic Notes 439
Chapter 10 Cluster Analysis: Basic Concepts and Methods 443
10.1 Cluster Analysis 444
10.1.1 What Is Cluster Analysis? 444
10.1.2 Requirements for Cluster Analysis 445
10.1.3 Overview of Basic Clustering Methods 448
10.2 Partitioning Methods 451
10.2.1 k-Means: A Centroid-Based Technique 451
10.2.2 k-Medoids: A Representative Object-Based Technique
454
10.3 Hierarchical Methods 457
10.3.1 Agglomerative versus Divisive Hierarchical Clustering
459
10.3.2 Distance Measures in Algorithmic Methods 461
10.3.3 BIRCH: Multiphase Hierarchical Clustering Using
Clustering
Feature Trees 462
10.3.4 Chameleon: Multiphase Hierarchical Clustering Using
Dynamic
Modeling 466
10.3.5 Probabilistic Hierarchical Clustering 467
10.4 Density-Based Methods 471
10.4.1 DBSCAN: Density-Based Clustering Based on Connected
Regions with High Density 471
10.4.2 OPTICS: Ordering Points to Identify the Clustering
Structure 473
10.4.3 DENCLUE: Clustering Based on Density Distribution
Functions 476
10.5 Grid-Based Methods 479
10.5.1 STING: STatistical INformation Grid 479
10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method
481
10.6 Evaluation of Clustering 483
10.6.1 Assessing Clustering Tendency 484
10.6.2 Determining the Number of Clusters 486
10.6.3 Measuring Clustering Quality 487
10.7 Summary 490
10.8 Exercises 491
10.9 Bibliographic Notes 494
Chapter 11 Advanced Cluster Analysis 497
11.1 Probabilistic Model-Based Clustering 497
11.1.1 Fuzzy Clusters 499
11.1.2 Probabilistic Model-Based Clusters 501
11.1.3 Expectation-Maximization Algorithm 505
11.2 Clustering High-Dimensional Data 508
11.2.1 Clustering High-Dimensional Data: Problems,
Challenges,
and Major Methodologies 508
11.2.2 Subspace Clustering Methods 510
11.2.3 Biclustering 512
11.2.4 Dimensionality Reduction Methods and Spectral
Clustering 519
11.3 Clustering Graph and Network Data 522
11.3.1 Applications and Challenges 523
11.3.2 Similarity Measures 525
11.3.3 Graph Clustering Methods 528
11.4 Clustering with Constraints 532
11.4.1 Categorization of Constraints 533
11.4.2 Methods for Clustering with Constraints 535
11.5 Summary 538
11.6 Exercises 539
11.7 Bibliographic Notes 540
Chapter 12 Outlier Detection 543
12.1 Outliers and Outlier Analysis 544
12.1.1 What Are Outliers? 544
12.1.2 Types of Outliers 545
12.1.3 Challenges of Outlier Detection 548
12.2 Outlier Detection Methods 549
12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods
549
12.2.2 Statistical Methods, Proximity-Based Methods, and
Clustering-Based Methods 551
12.3 Statistical Approaches 553
12.3.1 Parametric Methods 553
12.3.2 Nonparametric Methods 558
12.4 Proximity-Based Approaches 560
12.4.1 Distance-Based Outlier Detection and a Nested Loop
Method 561
12.4.2 A Grid-Based Method 562
12.4.3 Density-Based Outlier Detection 564
12.5 Clustering-Based Approaches 567
12.6 Classification-Based Approaches 571
12.7 Mining Contextual and Collective Outliers 573
12.7.1 Transforming Contextual Outlier Detection to
Conventional
Outlier Detection 573
12.7.2 Modeling Normal Behavior with Respect to Contexts 574
12.7.3 Mining Collective Outliers 575
12.8 Outlier Detection in High-Dimensional Data 576
12.8.1 Extending Conventional Outlier Detection 577
12.8.2 Finding Outliers in Subspaces 578
12.8.3 Modeling High-Dimensional Outliers 579
12.9 Summary 581
12.10 Exercises 582
12.11 Bibliographic Notes 583
Chapter 13 Data Mining Trends and Research Frontiers 585
13.1 Mining Complex Data Types 585
13.1.1 Mining Sequence Data: Time-Series, Symbolic
Sequences,
and Biological Sequences 586
13.1.2 Mining Graphs and Networks 591
13.1.3 Mining Other Kinds of Data 595
13.2 Other Methodologies of Data Mining 598
13.2.1 Statistical Data Mining 598
13.2.2 Views on Data Mining Foundations 600
13.2.3 Visual and Audio Data Mining 602
13.3 Data Mining Applications 607
13.3.1 Data Mining for Financial Data Analysis 607
13.3.2 Data Mining for Retail and Telecommunication
Industries 609
13.3.3 Data Mining in Science and Engineering 611
13.3.4 Data Mining for Intrusion Detection and Prevention 614
13.3.5 Data Mining and Recommender Systems 615
13.4 Data Mining and Society 618
13.4.1 Ubiquitous and Invisible Data Mining 618
13.4.2 Privacy, Security, and Social Impacts of Data Mining
620
13.5 Data Mining Trends 622
13.6 Summary 625
13.7 Exercises 626
13.8 Bibliographic Notes 628
Bibliography 633
Index 673
Foreword
Analyzing large amounts of data is a necessity. Even popular
science books, like “super
crunchers,” give compelling cases where large amounts of data
yield discoveries and
intuitions that surprise even experts. Every enterprise benefits
from collecting and ana-
lyzing its data: Hospitals can spot trends and anomalies in their
patient records, search
engines can do better ranking and ad placement, and
environmental and public health
agencies can spot patterns and abnormalities in their data. The
list continues, with
cybersecurity and computer network intrusion detection;
monitoring of the energy
consumption of household appliances; pattern analysis in
bioinformatics and pharma-
ceutical data; financial and business intelligence data; spotting
trends in blogs, Twitter,
and many more. Storage is inexpensive and getting even less so,
as are data sensors. Thus,
collecting and storing data is easier than ever before.
The problem then becomes how to analyze the data. This is
exactly the focus of this
Third Edition of the book. Jiawei, Micheline, and Jian give
encyclopedic coverage of all
the related methods, from the classic topics of clustering and
classification, to database
methods (e.g., association rules, data cubes) to more recent and
advanced topics (e.g.,
SVD/PCA, wavelets, support vector machines).
The exposition is extremely accessible to beginners and
advanced readers alike. The
book gives the fundamental material first and the more
advanced material in follow-up
chapters. It also has numerous rhetorical questions, which I
found extremely helpful for
maintaining focus.
We have used the first two editions as textbooks in data mining
courses at Carnegie
Mellon and plan to continue to do so with this Third Edition.
The new version has
significant additions: Notably, it has more than 100 citations to
works from 2006
onward, focusing on more recent material such as graphs and
social networks, sen-
sor networks, and outlier detection. This book has a new section
for visualization, has
expanded outlier detection into a whole chapter, and has
separate chapters for advanced
methods—for example, pattern mining with top-k patterns and
more and clustering
methods with biclustering and graph clustering.
Overall, it is an excellent book on classic and modern data
mining methods, and it is
ideal not only for teaching but also as a reference book.
Christos Faloutsos
Carnegie Mellon University
Foreword to Second Edition
We are deluged by data—scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades.

Six years ago, Jiawei Han's and Micheline Kamber's seminal textbook organized and presented Data Mining. It heralded a golden age of innovation in the field. This revision of their book reflects that progress; more than half of the references and historical notes are to recent work. The field has matured with many new and improved algorithms, and has broadened to include many more datatypes: streams, sequences, graphs, time-series, geospatial, audio, images, and video. We are certainly not at the end of the golden age—indeed research and commercial interest in data mining continues to grow—but we are all fortunate to have this modern compendium.

The book gives quick introductions to database and data mining concepts with particular emphasis on data analysis. It then covers in a chapter-by-chapter tour the concepts and techniques that underlie classification, prediction, association, and clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and with pragmatic rules of thumb about when to apply each technique. The Socratic presentation style is both very readable and very informative. I certainly learned a lot from reading the first edition and got re-educated and updated in reading the second edition.

Jiawei Han and Micheline Kamber have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed …
  • 5. The field of data mining grew out of the limitations of current data anal- ysis techniques in handling the challenges posedl by these new types of data sets. Data mining does not replace other areas of data analysis, but rather takes them as the foundation for much of its work. While some areas of data mining, such as association analysis, are unique to the field, other areas, such as clustering, classification, and anomaly detection, build upon a long history of work on these topics in other fields. Indeed, the willingness of data mining researchers to draw upon existing techniques has contributed to the strength and breadth of the field, as well as to its rapid growth. Another strength of the field has been its emphasis on collaboration with researchers in other areas. The challenges of analyzing new types of data cannot be met by simply applying data analysis techniques in isolation from those who understand the data and the domain in which it resides. Often, skill in building multidisciplinary teams has been as responsible for the success of data mining projects as the creation of new and innovative algorithms. Just as, historically, many developments in statistics were driven by the needs of agriculture, industry, medicine, and business, rxrany of the developments in data mining are being driven by the needs of those same fields.
  • 6. This book began as a set of notes and lecture slides for a data mining course that has been offered at the University of Minnesota since Spring 1998 to upper-division undergraduate and graduate Students. Presentation slides viii Preface and exercises developed in these offerings grew with time and served as a basis for the book. A survey of clustering techniques in data mining, originally written in preparation for research in the area, served as a starting point for one of the chapters in the book. Over time, the clustering chapter was joined by chapters on data, classification, association analysis, and anomaly detection. The book in its current form has been class tested at the home institutions of the authors-the University of Minnesota and Michigan State University-as well as several other universities. A number of data mining books appeared in the meantime, but were not completely satisfactory for our students primarily graduate and undergrad- uate students in computer science, but including students from industry and a wide variety of other disciplines. Their mathematical and computer back- grounds varied considerably, but they shared a common goal: to
  • 7. learn about data mining as directly as possible in order to quickly apply it to problems in their own domains. Thus, texts with extensive mathematical or statistical prerequisites were unappealing to many of them, as were texts that required a substantial database background. The book that evolved in response to these students needs focuses as directly as possible on the key concepts of data min- ing by illustrating them with examples, simple descriptions of key algorithms, and exercises. Overview Specifically, this book provides a comprehensive introduction to data mining and is designed to be accessible and useful to students, instructors, researchers, and professionals. Areas covered include data preprocessing, vi- sualization, predictive modeling, association analysis, clustering, and anomaly detection. The goal is to present fundamental concepts and algorithms for each topic, thus providing the reader with the necessary background for the application of data mining to real problems. In addition, this book also pro- vides a starting point for those readers who are interested in pursuing research in data mining or related fields. The book covers five main topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly
  • 8. detection, each of these areas is covered in a pair of chapters. For classification, association analysis, and clustering, the introductory chapter covers basic concepts, representative algorithms, and evaluation techniques, while the more advanced chapter dis- cusses advanced concepts and algorithms. The objective is to provide the reader with a sound understanding of the foundations of data mining, while still covering many important advanced topics. Because of this approach, the book is useful both as a learning tool and as a reference. Preface ix To help the readers better understand the concepts that have been pre- sented, we provide an extensive set of examples, figures, and exercises. Bib- Iiographic notes are included at the end of each chapter for readers who are interested in more advanced topics, historically important papers, and recent trends. The book also contains a comprehensive subject and author index. To the Instructor As a textbook, this book is suitable for a wide range of students at the advanced undergraduate or graduate level. Since students come to this subject with diverse backgrounds that may not include extensive
  • 9. knowledge of statistics or databases, our book requires minimal prerequisites- no database knowledge is needed and we assume only a modest background in statistics or mathematics. To this end, the book was designed to be as self-contained as possible. Necessary material from statistics, linear algebra, and machine learning is either integrated into the body of the text, or for some advanced topics, covered in the appendices. Since the chapters covering major data mining topics are self- contained, the order in which topics can be covered is quite flexible. The core material is covered in Chapters 2, 4, 6, 8, and 10. Although the introductory data chapter (2) should be covered first, the basic classification, association analy- sis, and clustering chapters (4, 6, and 8, respectively) can be covered in any order. Because of the relationship of anomaly detection (10) to classification (4) and clustering (8), these chapters should precede Chapter 10. Various topics can be selected from the advanced classification, association analysis, and clustering chapters (5, 7, and 9, respectively) to fit the schedule and in- terests of the instructor and students. We also advise that the lectures be augmented by projects or practical exercises in data mining. Although they are time consuming, such hands-on assignments greatly enhance the value of
  • 10. the course. Support Materials The supplements for the book are available at Addison- Wesley's Website www.aw.con/cssupport. Support materials available to all readers of this book include PowerPoint lecture slides Suggestions for student projects Data mining resources such as data mining algorithms and data sets On-line tutorials that give step-by-step examples for selected data mining techniques described in the book using actual data sets and data analysis software o o o o x Preface Additional support materials, including solutions to exercises, are available only to instructors adopting this textbook for classroom use.
  • 11. Please contact your school's Addison-Wesley representative for information on obtaining ac- cess to this material. Comments and suggestions, as well as reports of errors, can be sent to the authors through [email protected] Acknowledgments Many people contributed to this book. We begin by acknowledging our families to whom this book is dedicated. Without their patience and support, this project would have been impossible. We would like to thank the current and former students of our data mining groups at the University of Minnesota and Michigan State for their contribu- tions. Eui-Hong (Sam) Han and Mahesh Joshi helped with the initial data min- ing classes. Some ofthe exercises and presentation slides that they created can be found in the book and its accompanying slides. Students in our data min- ing groups who provided comments on drafts of the book or who contributed in other ways include Shyam Boriah, Haibin Cheng, Varun Chandola, Eric Eilertson, Levent Ertoz, Jing Gao, Rohit Gupta, Sridhar Iyer, Jung-Eun Lee, Benjamin Mayer, Aysel Ozgur, Uygar Oztekin, Gaurav Pandey, Kashif Riaz, Jerry Scripps, Gyorgy Simon, Hui Xiong, Jieping Ye, and Pusheng Zhang. We would also like to thank the students of our data mining classes at the Univer- sity of Minnesota and Michigan State University who worked with early drafbs
  • 12. of the book and provided invaluable feedback. We specifically note the helpful suggestions of Bernardo Craemer, Arifin Ruslim, Jamshid Vayghan, and Yu Wei. Joydeep Ghosh (University of Texas) and Sanjay Ranka (University of Florida) class tested early versions of the book. We also received many useful suggestions directly from the following UT students: Pankaj Adhikari, Ra- jiv Bhatia, Fbederic Bosche, Arindam Chakraborty, Meghana Deodhar, Chris Everson, David Gardner, Saad Godil, Todd Hay, Clint Jones, Ajay Joshi, Joonsoo Lee, Yue Luo, Anuj Nanavati, Tyler Olsen, Sunyoung Park, Aashish Phansalkar, Geoff Prewett, Michael Ryoo, Daryl Shannon, and Mei Yang. Ronald Kostoff (ONR) read an early version of the clustering chapter and offered numerous suggestions. George Karypis provided invaluable IATEX as- sistance in creating an author index. Irene Moulitsas also provided assistance with IATEX and reviewed some of the appendices. Musetta Steinbach was very helpful in finding errors in the figures. We would like to acknowledge our colleagues at the University of Min- nesota and Michigan State who have helped create a positive environment for data mining research. They include Dan Boley, Joyce Chai, Anil
  • 13. Jain, Ravi Preface xi Janardan, Rong Jin, George Karypis, Haesun Park, William F. Punch, Shashi Shekhar, and Jaideep Srivastava. The collaborators on our many data mining projects, who also have our gratitude, include Ramesh Agrawal, Steve Can- non, Piet C. de Groen, FYan Hill, Yongdae Kim, Steve Klooster, Kerry Long, Nihar Mahapatra, Chris Potter, Jonathan Shapiro, Kevin Silverstein, Nevin Young, and Zhi-Li Zhang. The departments of Computer Science and Engineering at the University of Minnesota and Michigan State University provided computing resources and a supportive environment for this project. ARDA, ARL, ARO, DOE, NASA, and NSF provided research support for Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. In particular, Kamal Abdali, Dick Brackney, Jagdish Chan- dra, Joe Coughlan, Michael Coyle, Stephen Davis, Flederica Darema, Richard Hirsch, Chandrika Kamath, Raju Namburu, N. Radhakrishnan, James Sido- ran, Bhavani Thuraisingham, Walt Tiernin, Maria Zemankova, and Xiaodong Zhanghave been supportive of our research in data mining and high-performance
  • 14. computing. It was a pleasure working with the helpful staff at Pearson Education. In particular, we would like to thank Michelle Brown, Matt Goldstein, Katherine Harutunian, Marilyn Lloyd, Kathy Smith, and Joyce Wells. We would also like to thank George Nichols, who helped with the art work and Paul Anag- nostopoulos, who provided I4.T[X support. We are grateful to the following Pearson reviewers: Chien-Chung Chan (University of Akron), Zhengxin Chen (University of Nebraska at Omaha), Chris Clifton (Purdue University), Joy- deep Ghosh (University of Texas, Austin), Nazli Goharian (Illinois Institute of Technology), J. Michael Hardin (University of Alabama), James Hearne (Western Washington University), Hillol Kargupta (University of Maryland, Baltimore County and Agnik, LLC), Eamonn Keogh (University of California- Riverside), Bing Liu (University of Illinois at Chicago), Mariofanna Milanova (University of Arkansas at Little Rock), Srinivasan Parthasarathy (Ohio State University), Zbigniew W. Ras (University of North Carolina at Charlotte), Xintao Wu (University of North Carolina at Charlotte), and Mohammed J. Zaki (Rensselaer Polvtechnic Institute).
  • 15. Gontents Preface Introduction 1 1.1 What Is Data Mining? 2 7.2 Motivating Challenges 4 1.3 The Origins of Data Mining 6 1.4 Data Mining Tasks 7 1.5 Scope and Organization of the Book 11 1.6 Bibliographic Notes 13 v l l t.7 Exercises Data 1 6 1 9 2.I Types of Data 22 2.1.I Attributes and Measurement 23 2.L.2 Types of Data Sets . 29 2.2 Data Quality 36 2.2.I Measurement and Data Collection Issues 37 2.2.2 Issues Related to Applications 2.3 Data Preprocessing 2.3.L Aggregation 2.3.2 Sampling 2.3.3 Dimensionality Reduction
  • 16. 2.3.4 Feature Subset Selection 2.3.5 Feature Creation 2.3.6 Discretization and Binarization 2.3:7 Variable Tlansformation . 2.4 Measures of Similarity and Dissimilarity . . . 2.4.L Basics 2.4.2 Similarity and Dissimilarity between Simple Attributes . 2.4.3 Dissimilarities between Data Objects . 2.4.4 Similarities between Data Objects 43 44 45 47 50 5 2 5 5 5 7 63 6 5 66 6 7 6 9 72 xiv Contents 2.4.5 Examples of Proximity Measures 2.4.6 Issues in Proximity Calculation 2.4.7 Selecting the Right Proximity Measure 2.5 BibliographicNotes 2.6 Exercises
  • 17. Exploring Data 3.i The Iris Data Set 3.2 Summary Statistics 3.2.L Frequencies and the Mode 3.2.2 Percentiles 3.2.3 Measures of Location: Mean and Median 3.2.4 Measures of Spread: Range and Variance 3.2.5 Multivariate Summary Statistics 3.2.6 Other Ways to Summarize the Data 3.3 Visualization 3.3.1 Motivations for Visualization 3.3.2 General Concepts 3.3.3 Techniques 3.3.4 Visualizing Higher-Dimensional Data . 3.3.5 Do's and Don'ts 3.4 OLAP and Multidimensional Data Analysis 3.4.I Representing Iris Data as a Multidimensional Array 3.4.2 Multidimensional Data: The General Case . 3.4.3 Analyzing Multidimensional Data 3.4.4 Final Comments on Multidimensional Data Analysis Bibliographic Notes Exercises Classification: Basic Concepts, Decision Tlees, and Model Evaluation 4.1 Preliminaries 4.2 General Approach to Solving a Classification Problem 4.3 Decision Tlee Induction 4.3.1 How a Decision Tlee Works 4.3.2 How to Build a Decision TYee 4.3.3 Methods for Expressing Attribute Test Conditions
  • 18. 4.3.4 Measures for Selecting the Best Split . 4.3.5 Algorithm for Decision Tlee Induction 4.3.6 An Examole: Web Robot Detection 3 . 5 3 . 6 73 80 83 84 88 9 7 98 98 99 1 0 0 1 0 1 102 704 1 0 5 1 0 5 1 0 5 1 0 6 1 1 0 724 1 3 0 1 3 1 1 3 1 1 3 3 1 3 5 1 3 9 1 3 9 747
  • 19. L45 746 748 1 5 0 150 1 5 1 1 5 5 1 5 8 164 1 6 6 Contents xv 4.3.7 Characteristics of Decision Tlee Induction 4.4 Model Overfitting 4.4.L Overfitting Due to Presence of Noise 4.4.2 Overfitting Due to Lack of Representative Samples 4.4.3 Overfitting and the Multiple Comparison Procedure 4.4.4 Estimation of Generalization Errors 4.4.5 Handling Overfitting in Decision Tlee Induction 4.5 Evaluating the Performance of a Classifier 4.5.I Holdout Method 4.5.2 Random Subsampling . . . 4.5.3 Cross-Validation 4.5.4 Bootstrap 4.6 Methods for Comparing Classifiers 4.6.L Estimating a Confidence Interval for Accuracy 4.6.2 Comparing the Performance of Two Models . 4.6.3 Comparing the Performance of Two Classifiers
  • 20. 4.7 BibliographicNotes 4.8 Exercises 5 Classification: Alternative Techniques 5.1 Rule-Based Classifier 5.1.1 How a Rule-Based Classifier Works 5.1.2 Rule-Ordering Schemes 5.1.3 How to Build a Rule-Based Classifier 5.1.4 Direct Methods for Rule Extraction 5.1.5 Indirect Methods for Rule Extraction 5.1.6 Characteristics of Rule-Based Classifiers 5.2 Nearest-Neighbor classifiers 5.2.L Algorithm 5.2.2 Characteristics of Nearest-Neighbor Classifiers 5.3 Bayesian Classifiers 5.3.1 Bayes Theorem 5.3.2 Using the Bayes Theorem for Classification 5.3.3 Naive Bayes Classifier 5.3.4 Bayes Error Rate 5.3.5 Bayesian Belief Networks 5.4 Artificial Neural Network (ANN) 5.4.I Perceptron 5.4.2 Multilayer Artificial Neural Network 5.4.3 Characteristics of ANN 1 6 8 1 7 2 L75 L 7 7 178 179 184
  • 21. 186 1 8 6 1 8 7 1 8 7 1 8 8 1 8 8 1 8 9 1 9 1 192 193 1 9 8 207 207 209 2 I I 2r2 2r3 2 2 L 223 223 225 226 227 228 229 23L 238 240 246 247 25r 255
  • 22. xvi Contents 5.5 Support Vector Machine (SVM) 5.5.1 Maximum Margin Hyperplanes 5.5.2 Linear SVM: Separable Case 5.5.3 Linear SVM: Nonseparable Case 5.5.4 Nonlinear SVM . 5.5.5 Characteristics of SVM Ensemble Methods 5.6.1 Rationale for Ensemble Method 5.6.2 Methods for Constructing an Ensemble Classifier 5.6.3 Bias-Variance Decomposition 5.6.4 Bagging 5.6.5 Boosting 5.6.6 Random Forests 5.6.7 Empirical Comparison among Ensemble Methods Class Imbalance Problem 5.7.1 Alternative Metrics 5.7.2 The Receiver Operating Characteristic Curve 5.7.3 Cost-Sensitive Learning . . 5.7.4 Sampling-Based Approaches . Multiclass Problem Bibliographic Notes Exercises 5 . 6 o . t 256 256 259 266 270 276 276
  • 23. 277 278 28r 2 8 3 285 290 294 294 295 298 302 305 306 309 3 1 5 c . 6 5 . 9 5 . 1 0 Association Analysis: Basic Concepts and Algorithms 327 6.1 Problem Definition . 328 6.2 Flequent Itemset Generation 332 6.2.I The Apri,ori Principle 333 6.2.2 Fbequent Itemset Generation in the Apri,ori, Algorithm . 335 6.2.3 Candidate Generation and Pruning . . . 338 6.2.4 Support Counting 342 6.2.5 Computational Complexity 345 6.3 Rule Generatiorr 349 6.3.1 Confidence-Based Pruning 350 6.3.2 Rule Generation in Apri,ori, Algorithm 350 6.3.3 An Example: Congressional Voting Records 352
  • 24. 6.4 Compact Representation of Fbequent Itemsets 353 6.4.7 Maximal Flequent Itemsets 354 6.4.2 Closed Frequent Itemsets 355 6.5 Alternative Methods for Generating Frequent Itemsets 359 6.6 FP-Growth Alsorithm 363 Contents xvii 6.6.1 FP-tee Representation 6.6.2 Frequent Itemset Generation in FP-Growth Algorithm . 6.7 Evaluation of Association Patterns 6.7.l Objective Measures of Interestingness 6.7.2 Measures beyond Pairs of Binary Variables 6.7.3 Simpson's Paradox 6.8 Effect of Skewed Support Distribution 6.9 Bibliographic Notes 363 366 370 37r 382 384 386 390 404 4L5 415 4t8
  • 25. 4 1 8 422 424 426 429 429 431 436 439 442 443 444 447 448 453 457 457 458 458 460 461 463 465 469 473 6.10 Exercises 7 Association Analysis: Advanced 7.I Handling Categorical Attributes 7.2 Handling Continuous Attributes Concepts 7.2.I Discretization-Based Methods
  • 26. 7.2.2 Statistics-Based Methods 7.2.3 Non-discretizalion Methods Handling a Concept Hierarchy Seouential Patterns 7.4.7 Problem Formulation 7.4.2 Sequential Pattern Discovery 7.4.3 Timing Constraints 7.4.4 Alternative Counting Schemes 7.5 Subgraph Patterns 7.5.1 Graphs and Subgraphs . 7.5.2 Frequent Subgraph Mining 7.5.3 Apri,od-like Method 7.5.4 Candidate Generation 7.5.5 Candidate Pruning 7.5.6 Support Counting 7.6 Infrequent Patterns 7.6.7 Negative Patterns 7.6.2 Negatively Correlated Patterns 7.6.3 Comparisons among Infrequent Patterns, Negative Pat- terns, and Negatively Correlated Patterns 7.6.4 Techniques for Mining Interesting Infrequent Patterns 7.6.5 Techniques Based on Mining Negative Patterns 7.6.6 Techniques Based on Support Expectation . 7.7 Bibliographic Notes 7.8 Exercises 7 . 3 7 . 4 xviii Contents
  • 27. Cluster Analysis: Basic Concepts and Algorithms 8.1 Overview 8.1.1 What Is Cluster Analysis? 8.I.2 Different Types of Clusterings . 8.1.3 Different Types of Clusters 8.2 K-means 8.2.7 The Basic K-means Algorithm 8.2.2 K-means: Additional Issues 8.2.3 Bisecting K-means 8.2.4 K-means and Different Types of Clusters 8.2.5 Strengths and Weaknesses 8.2.6 K-means as an Optimization Problem 8.3 Agglomerative Hierarchical Clustering 8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 8.3.2 Specific Techniques 8.3.3 The Lance-Williams Formula for Cluster Proximity . 8.3.4 Key Issues in Hierarchical Clustering . 8.3.5 Strengths and Weaknesses DBSCAN 8.4.1 Tladitional Density: Center-Based Approach 8.4.2 The DBSCAN Algorithm 8.4.3 Strengths and Weaknesses Cluster Evaluation 8.5.1 Overview 8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 8.5.4 Unsupervised Evaluation of Hierarchical Clustering . 8.5.5 Determining the Correct Number of Clusters
  • 28. 8.5.6 Clustering Tendency 8.5.7 Supervised Measures of Cluster Validity 8.5.8 Assessing the Significance of Cluster Validity Measures . 8 . 4 8 . 5 487 490 490 49r 493 496 497 506 508 5 1 0 5 1 0 5 1 3 5 1 5 5 1 6 5 1 8 524 524 526 526 5 2 7 528 530 532 5 3 3 536 542
  • 29. 544 546 547 548 5 5 3 o o o 5 5 9 8.6 Bibliograph 8.7 Exercises ic Notes Cluster Analysis: Additional Issues and Algorithms 569 9.1 Characteristics of Data, Clusters, and Clustering Algorithms . 570 9.1.1 Example: Comparing K-means and DBSCAN . . . . . . 570 9.1.2 Data Characteristics 577 Contents xix 9.1.3 Cluster Characteristics . . 573 9.L.4 General Characteristics of Clustering Algorithms 575 9.2 Prototype-Based Clustering 577 9.2.1 Fzzy Clustering 577 9.2.2 Clustering Using Mixture Models 583 9.2.3 Self-Organizing Maps (SOM) 594 9.3 Density-Based Clustering 600 9.3.1 Grid-Based Clustering 601 9.3.2 Subspace Clustering 604 9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based
  • 30. Clustering 608 9.4 Graph-Based Clustering 612 9.4.1 Sparsification 613 9.4.2 Minimum Spanning Tlee (MST) Clustering . . . 674 9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 616 9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 9.4.5 Shared Nearest Neighbor Similarity 9.4.6 The Jarvis-Patrick Clustering Algorithm 9.4.7 SNN Density 9.4.8 SNN Density-Based Clustering 9.5 Scalable Clustering Algorithms 9.5.1 Scalability: General Issues and Approaches 9 . 5 . 2 B I R C H 9.5.3 CURE 9.6 Which Clustering Algorithm? 9.7 Bibliographic Notes 9.8 Exercises 6 1 6 622 625 627 629 630 630 633 635 639
  • 31. 643 647 10 Anomaly Detection 651 10.1 Preliminaries 653 10.1.1 Causes of Anomalies 653 10.1.2 Approaches to Anomaly Detection 654 10.1.3 The Use of Class Labels 655 1 0 . 1 . 4 I s s u e s 6 5 6 10.2 Statistical Approaches 658 t0.2.7 Detecting Outliers in a Univariate Normal Distribution 659 1 0 . 2 . 2 O u t l i e r s i n a M u l t i v a r i a t e N o r m a l D i s t r i b u t i o n . . . . . 6 6 1 10.2.3 A Mixture Model Approach for Anomaly Detection. 662 xx Contents 10.2.4 Strengths and Weaknesses 10.3 Proximity-Based Outlier Detection 10.3.1 Strengths and Weaknesses 10.4 Density-Based Outlier Detection 10.4.1 Detection of Outliers Using Relative Density 70.4.2 Strengths and Weaknesses 10.5 Clustering-Based Techniques 10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster 10.5.2 Impact of Outliers on the Initial Clustering
  • 32. 10.5.3 The Number of Clusters to Use 10.5.4 Strengths and Weaknesses 665 666 666 668 669 670 67L 672 674 674 674 675 680 6 8 5 b6i) 10.6 Bibliograph 10.7 Exercises ic Notes Appendix A Linear Algebra A.1 Vectors A.1.1 Definition 685 4.I.2 Vector Addition and Multiplication by a Scalar 685 A.1.3 Vector Spaces 687 4.7.4 The Dot Product, Orthogonality, and Orthogonal Projections 688 A.1.5 Vectors and Data Analysis 690
  • 33. 42 Matrices 691 A.2.1 Matrices: Definitions 691 A-2.2 Matrices: Addition and Multiplication by a Scalar 692 4.2.3 Matrices: Multiplication 693 4.2.4 Linear tansformations and Inverse Matrices 695 4.2.5 Eigenvalue and Singular Value Decomposition . 697 4.2.6 Matrices and Data Analysis 699 A.3 Bibliographic Notes 700 Appendix B Dimensionality Reduction 7OL 8.1 PCA and SVD 70I B.1.1 Principal Components Analysis (PCA) 70L 8 . 7 . 2 S V D . 7 0 6 8.2 Other Dimensionality Reduction Techniques 708 8.2.I Factor Analysis 708 8.2.2 Locally Linear Embedding (LLE) . 770 8.2.3 Multidimensional Scaling, FastMap, and ISOMAP 7I2 Contents xxi 8.2.4 Common Issues B.3 Bibliographic Notes Appendix C Probability and Statistics C.1 Probability C.1.1 Expected Values C.2 Statistics C.2.L Point Estimation
  • 34. C.2.2 Central Limit Theorem C.2.3 Interval Estimation C.3 Hypothesis Testing Appendix D Regression D.1 Preliminaries D.2 Simple Linear Regression D.2.L Least Square Method D.2.2 Analyzing Regression Errors D.2.3 Analyzing Goodness of Fit D.3 Multivariate Linear Regression D.4 Alternative Least-Square Regression Methods Appendix E Optimization E.1 Unconstrained Optimizafion E.1.1 Numerical Methods 8.2 Constrained Optimization E.2.I Equality Constraints 8.2.2 Inequality Constraints Author Index Subject Index Copyright Permissions 715 7L6 7L9 7L9
  • 36. Rapid advances in data collection and storage technology have enabled or- ganizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analy- sis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be devel- oped. Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis. Business Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed
• 37. retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions. Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. They can also help retailers answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" Some of these questions …
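To make the cross-selling question above concrete, here is a minimal sketch, written for this overview rather than taken from either textbook, of how point-of-sale baskets might be scanned for simple cross-sell (association) rules. The toy transactions, the pairwise counting, and the support and confidence thresholds are all invented for illustration; the frequent-pattern chapters listed in the contents later in this compilation develop the full Apriori and pattern-growth machinery behind the same idea.

from itertools import combinations
from collections import Counter

# Toy point-of-sale baskets; each set is one checkout transaction.
# Items and thresholds are made up purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

min_support = 0.4      # fraction of baskets containing the item pair
min_confidence = 0.6   # estimate of P(consequent | antecedent)

n = len(transactions)
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))

# Report cross-sell rules A -> B whose support and confidence
# clear the thresholds.
for (a, b), count in pair_counts.items():
    support = count / n
    if support < min_support:
        continue
    for antecedent, consequent in ((a, b), (b, a)):
        confidence = count / item_counts[antecedent]
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}: "
                  f"support={support:.2f}, confidence={confidence:.2f}")

Running this on real checkout data would of course require the scalable candidate-generation and pruning strategies that those chapters describe; the sketch only shows how support and confidence quantify a question like "What products can be cross-sold?"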
• 38. Data Mining, Third Edition The Morgan Kaufmann Series in Data Management Systems (Selected Titles) Joe Celko’s Data, Measurements, and Standards in SQL Joe Celko Information Modeling and Relational Databases, 2nd Edition Terry Halpin, Tony Morgan Joe Celko’s Thinking in Sets Joe Celko Business Metadata Bill Inmon, Bonnie O’Neil, Lowell Fryman Unleashing Web 2.0 Gottfried Vossen, Stephan Hagemann Enterprise Knowledge Management David Loshin The Practitioner’s Guide to Data Quality Improvement David Loshin Business Process Change, 2nd Edition Paul Harmon
  • 39. IT Manager’s Handbook, 2nd Edition Bill Holtsnider, Brian Jaffe Joe Celko’s Puzzles and Answers, 2nd Edition Joe Celko Architecture and Patterns for IT Service Management, 2nd Edition, Resource Planning and Governance Charles Betz Joe Celko’s Analytics and OLAP in SQL Joe Celko Data Preparation for Data Mining Using SAS Mamdouh Refaat Querying XML: XQuery, XPath, and SQL/ XML in Context Jim Melton, Stephen Buxton Data Mining: Concepts and Techniques, 3rd Edition Jiawei Han, Micheline Kamber, Jian Pei Database Modeling and Design: Logical Design, 5th Edition Toby J. Teorey, Sam S. Lightstone, Thomas P. Nadeau, H. V. Jagadish Foundations of Multidimensional and Metric Data Structures Hanan Samet Joe Celko’s SQL for Smarties: Advanced SQL Programming, 4th Edition Joe Celko Moving Objects Databases Ralf Hartmut Güting, Markus Schneider
• 40. Joe Celko’s SQL Programming Style Joe Celko Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration Earl Cox Data Modeling Essentials, 3rd Edition Graeme C. Simsion, Graham C. Witt Developing High Quality Data Models Matthew West Location-Based Services Jochen Schiller, Agnes Voisard Managing Time in Relational Databases: How to Design, Update, and Query Temporal Data Tom Johnston, Randall Weis Database Modeling with Microsoft® Visio for Enterprise Architects Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean Designing Data-Intensive Web Applications Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, Maristella Matera Mining the Web: Discovering Knowledge from Hypertext Data Soumen Chakrabarti
  • 41. Advanced SQL: 1999—Understanding Object-Relational and Other Advanced Features Jim Melton Database Tuning: Principles, Experiments, and Troubleshooting Techniques Dennis Shasha, Philippe Bonnet SQL: 1999—Understanding Relational Language Components Jim Melton, Alan R. Simon Information Visualization in Data Mining and Knowledge Discovery Edited by Usama Fayyad, Georges G. Grinstein, Andreas Wierse Transactional Information Systems Gerhard Weikum, Gottfried Vossen Spatial Databases Philippe Rigaux, Michel Scholl, and Agnes Voisard Managing Reference Data in Enterprise Databases Malcolm Chisholm Understanding SQL and Java Together Jim Melton, Andrew Eisenberg Database: Principles, Programming, and Performance, 2nd Edition Patrick and Elizabeth O’Neil The Object Data Standard Edited by R. G. G. Cattell, Douglas Barry Data on the Web: From Relations to Semistructured Data and XML
• 42. Serge Abiteboul, Peter Buneman, Dan Suciu Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 3rd Edition Ian Witten, Eibe Frank, Mark A. Hall Joe Celko’s Data and Databases: Concepts in Practice Joe Celko Developing Time-Oriented Database Applications in SQL Richard T. Snodgrass Web Farming for the Data Warehouse Richard D. Hackathorn Management of Heterogeneous and Autonomous Database Systems Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, Amit Sheth Object-Relational DBMSs, 2nd Edition Michael Stonebraker, Paul Brown, with Dorothy Moore Universal Database Management: A Guide to Object/Relational Technology Cynthia Maro Saracco Readings in Database Systems, 3rd Edition Edited by Michael Stonebraker, Joseph M. Hellerstein Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM
  • 43. Jim Melton Principles of Multimedia Database Systems V. S. Subrahmanian Principles of Database Query Processing for Advanced Applications Clement T. Yu, Weiyi Meng Advanced Database Systems Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, Roberto Zicari Principles of Transaction Processing, 2nd Edition Philip A. Bernstein, Eric Newcomer Using the New DB2: IBM’s Object-Relational Database System Don Chamberlin Distributed Algorithms Nancy A. Lynch Active Database Systems: Triggers and Rules for Advanced Database Processing Edited by Jennifer Widom, Stefano Ceri Migrating Legacy Systems: Gateways, Interfaces, and the Incremental Approach Michael L. Brodie, Michael Stonebraker Atomic Transactions Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete Query Processing for Advanced Database Systems Edited by Johann Christoph Freytag, David Maier, Gottfried
• 44. Vossen Transaction Processing Jim Gray, Andreas Reuter Database Transaction Models for Advanced Applications Edited by Ahmed K. Elmagarmid A Guide to Developing Client/Server SQL Applications Setrag Khoshafian, Arvola Chan, Anna Wong, Harry K. T. Wong
Data Mining: Concepts and Techniques, Third Edition
Jiawei Han, University of Illinois at Urbana–Champaign
Micheline Kamber
Jian Pei, Simon Fraser University
AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO Morgan Kaufmann is an imprint of Elsevier
• 45. Morgan Kaufmann Publishers is an imprint of Elsevier. 225 Wyman Street, Waltham, MA 02451, USA © 2012 by Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using
  • 46. such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data Han, Jiawei. Data mining : concepts and techniques / Jiawei Han, Micheline Kamber, Jian Pei. – 3rd ed. p. cm. ISBN 978-0-12-381479-1 1. Data mining. I. Kamber, Micheline. II. Pei, Jian. III. Title. QA76.9.D343H36 2011 006.3′12–dc22 2011010635 British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library. For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com Printed in the United States of America 11 12 13 14 15 10 9 8 7 6 5 4 3 2 1
• 47. To Y. Dora and Lawrence for your love and encouragement J.H. To Erik, Kevan, Kian, and Mikael for your love and inspiration M.K. To my wife, Jennifer, and daughter, Jacqueline J.P.
Contents
Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi
  • 48. About the Authors xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1 1.1.2 Data Mining as the Evolution of Information Technology 2 1.2 What Is Data Mining? 5 1.3 What Kinds of Data Can Be Mined? 8 1.3.1 Database Data 9 1.3.2 Data Warehouses 10 1.3.3 Transactional Data 13 1.3.4 Other Kinds of Data 14 1.4 What Kinds of Patterns Can Be Mined? 15 1.4.1 Class/Concept Description: Characterization and Discrimination 15 1.4.2 Mining Frequent Patterns, Associations, and Correlations 17 1.4.3 Classification and Regression for Predictive Analysis 18 1.4.4 Cluster Analysis 19 1.4.5 Outlier Analysis 20 1.4.6 Are All Patterns Interesting? 21 1.5 Which Technologies Are Used? 23 1.5.1 Statistics 23 1.5.2 Machine Learning 24 1.5.3 Database Systems and Data Warehouses 26 1.5.4 Information Retrieval 26 ix
• 49. 1.6 Which Kinds of Applications Are Targeted? 27 1.6.1 Business Intelligence 27 1.6.2 Web Search Engines 28 1.7 Major Issues in Data Mining 29 1.7.1 Mining Methodology 29 1.7.2 User Interaction 30 1.7.3 Efficiency and Scalability 31 1.7.4 Diversity of Database Types 32 1.7.5 Data Mining and Society 32 1.8 Summary 33 1.9 Exercises 34 1.10 Bibliographic Notes 35 Chapter 2 Getting to Know Your Data 39 2.1 Data Objects and Attribute Types 40 2.1.1 What Is an Attribute? 40 2.1.2 Nominal Attributes 41 2.1.3 Binary Attributes 41 2.1.4 Ordinal Attributes 42 2.1.5 Numeric Attributes 43 2.1.6 Discrete versus Continuous Attributes 44 2.2 Basic Statistical Descriptions of Data 44 2.2.1 Measuring the Central Tendency: Mean, Median, and Mode 45
  • 50. 2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range 48 2.2.3 Graphic Displays of Basic Statistical Descriptions of Data 51 2.3 Data Visualization 56 2.3.1 Pixel-Oriented Visualization Techniques 57 2.3.2 Geometric Projection Visualization Techniques 58 2.3.3 Icon-Based Visualization Techniques 60 2.3.4 Hierarchical Visualization Techniques 63 2.3.5 Visualizing Complex Data and Relations 64 2.4 Measuring Data Similarity and Dissimilarity 65 2.4.1 Data Matrix versus Dissimilarity Matrix 67 2.4.2 Proximity Measures for Nominal Attributes 68 2.4.3 Proximity Measures for Binary Attributes 70 2.4.4 Dissimilarity of Numeric Data: Minkowski Distance 72 2.4.5 Proximity Measures for Ordinal Attributes 74 2.4.6 Dissimilarity for Attributes of Mixed Types 75 2.4.7 Cosine Similarity 77 2.5 Summary 79 2.6 Exercises 79 2.7 Bibliographic Notes 81 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xi #3 Contents xi Chapter 3 Data Preprocessing 83
  • 51. 3.1 Data Preprocessing: An Overview 84 3.1.1 Data Quality: Why Preprocess the Data? 84 3.1.2 Major Tasks in Data Preprocessing 85 3.2 Data Cleaning 88 3.2.1 Missing Values 88 3.2.2 Noisy Data 89 3.2.3 Data Cleaning as a Process 91 3.3 Data Integration 93 3.3.1 Entity Identification Problem 94 3.3.2 Redundancy and Correlation Analysis 94 3.3.3 Tuple Duplication 98 3.3.4 Data Value Conflict Detection and Resolution 99 3.4 Data Reduction 99 3.4.1 Overview of Data Reduction Strategies 99 3.4.2 Wavelet Transforms 100 3.4.3 Principal Components Analysis 102 3.4.4 Attribute Subset Selection 103 3.4.5 Regression and Log-Linear Models: Parametric Data Reduction 105 3.4.6 Histograms 106 3.4.7 Clustering 108 3.4.8 Sampling 108 3.4.9 Data Cube Aggregation 110 3.5 Data Transformation and Data Discretization 111 3.5.1 Data Transformation Strategies Overview 112 3.5.2 Data Transformation by Normalization 113 3.5.3 Discretization by Binning 115 3.5.4 Discretization by Histogram Analysis 115 3.5.5 Discretization by Cluster, Decision Tree, and Correlation
• 52. Analyses 116 3.5.6 Concept Hierarchy Generation for Nominal Data 117 3.6 Summary 120 3.7 Exercises 121 3.8 Bibliographic Notes 123 Chapter 4 Data Warehousing and Online Analytical Processing 125 4.1 Data Warehouse: Basic Concepts 125 4.1.1 What Is a Data Warehouse? 126 4.1.2 Differences between Operational Database Systems and Data Warehouses 128 4.1.3 But, Why Have a Separate Data Warehouse? 129
  • 53. for Multidimensional Data Models 139 4.2.3 Dimensions: The Role of Concept Hierarchies 142 4.2.4 Measures: Their Categorization and Computation 144 4.2.5 Typical OLAP Operations 146 4.2.6 A Starnet Query Model for Querying Multidimensional Databases 149 4.3 Data Warehouse Design and Usage 150 4.3.1 A Business Analysis Framework for Data Warehouse Design 150 4.3.2 Data Warehouse Design Process 151 4.3.3 Data Warehouse Usage for Information Processing 153 4.3.4 From Online Analytical Processing to Multidimensional Data Mining 155 4.4 Data Warehouse Implementation 156 4.4.1 Efficient Data Cube Computation: An Overview 156 4.4.2 Indexing OLAP Data: Bitmap Index and Join Index 160 4.4.3 Efficient Processing of OLAP Queries 163 4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP 164 4.5 Data Generalization by Attribute-Oriented Induction 166 4.5.1 Attribute-Oriented Induction for Data Characterization 167 4.5.2 Efficient Implementation of Attribute-Oriented Induction 172 4.5.3 Attribute-Oriented Induction for Class Comparisons 175 4.6 Summary 178 4.7 Exercises 180
  • 54. 4.8 Bibliographic Notes 184 Chapter 5 Data Cube Technology 187 5.1 Data Cube Computation: Preliminary Concepts 188 5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell 188 5.1.2 General Strategies for Data Cube Computation 192 5.2 Data Cube Computation Methods 194 5.2.1 Multiway Array Aggregation for Full Cube Computation 195 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xiii #5 Contents xiii 5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward 200 5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure 204 5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP 210 5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology 218 5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data 218 5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries
  • 55. 225 5.4 Multidimensional Data Analysis in Cube Space 227 5.4.1 Prediction Cubes: Prediction Mining in Cube Space 227 5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities 230 5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration 231 5.5 Summary 234 5.6 Exercises 235 5.7 Bibliographic Notes 240 Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods 243 6.1 Basic Concepts 243 6.1.1 Market Basket Analysis: A Motivating Example 244 6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules 246 6.2 Frequent Itemset Mining Methods 248 6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation 248 6.2.2 Generating Association Rules from Frequent Itemsets 254 6.2.3 Improving the Efficiency of Apriori 254 6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets 257 6.2.5 Mining Frequent Itemsets Using Vertical Data Format 259 6.2.6 Mining Closed and Max Patterns 262
  • 56. 6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods 264 6.3.1 Strong Rules Are Not Necessarily Interesting 264 6.3.2 From Association Analysis to Correlation Analysis 265 6.3.3 A Comparison of Pattern Evaluation Measures 267 6.4 Summary 271 6.5 Exercises 273 6.6 Bibliographic Notes 276 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xiv #6 xiv Contents Chapter 7 Advanced Pattern Mining 279 7.1 Pattern Mining: A Road Map 279 7.2 Pattern Mining in Multilevel, Multidimensional Space 283 7.2.1 Mining Multilevel Associations 283 7.2.2 Mining Multidimensional Associations 287 7.2.3 Mining Quantitative Association Rules 289 7.2.4 Mining Rare Patterns and Negative Patterns 291 7.3 Constraint-Based Frequent Pattern Mining 294 7.3.1 Metarule-Guided Mining of Association Rules 295 7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space 296 7.4 Mining High-Dimensional Data and Colossal Patterns 301
  • 57. 7.4.1 Mining Colossal Patterns by Pattern-Fusion 302 7.5 Mining Compressed or Approximate Patterns 307 7.5.1 Mining Compressed Patterns by Pattern Clustering 308 7.5.2 Extracting Redundancy-Aware Top-k Patterns 310 7.6 Pattern Exploration and Application 313 7.6.1 Semantic Annotation of Frequent Patterns 313 7.6.2 Applications of Pattern Mining 317 7.7 Summary 319 7.8 Exercises 321 7.9 Bibliographic Notes 323 Chapter 8 Classification: Basic Concepts 327 8.1 Basic Concepts 327 8.1.1 What Is Classification? 327 8.1.2 General Approach to Classification 328 8.2 Decision Tree Induction 330 8.2.1 Decision Tree Induction 332 8.2.2 Attribute Selection Measures 336 8.2.3 Tree Pruning 344 8.2.4 Scalability and Decision Tree Induction 347 8.2.5 Visual Mining for Decision Tree Induction 348 8.3 Bayes Classification Methods 350 8.3.1 Bayes’ Theorem 350 8.3.2 Näıve Bayesian Classification 351 8.4 Rule-Based Classification 355 8.4.1 Using IF-THEN Rules for Classification 355 8.4.2 Rule Extraction from a Decision Tree 357
  • 58. 8.4.3 Rule Induction Using a Sequential Covering Algorithm 359 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xv #7 Contents xv 8.5 Model Evaluation and Selection 364 8.5.1 Metrics for Evaluating Classifier Performance 364 8.5.2 Holdout Method and Random Subsampling 370 8.5.3 Cross-Validation 370 8.5.4 Bootstrap 371 8.5.5 Model Selection Using Statistical Tests of Significance 372 8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC Curves 373 8.6 Techniques to Improve Classification Accuracy 377 8.6.1 Introducing Ensemble Methods 378 8.6.2 Bagging 379 8.6.3 Boosting and AdaBoost 380 8.6.4 Random Forests 382 8.6.5 Improving Classification Accuracy of Class-Imbalanced Data 383 8.7 Summary 385 8.8 Exercises 386 8.9 Bibliographic Notes 389 Chapter 9 Classification: Advanced Methods 393 9.1 Bayesian Belief Networks 393
  • 59. 9.1.1 Concepts and Mechanisms 394 9.1.2 Training Bayesian Belief Networks 396 9.2 Classification by Backpropagation 398 9.2.1 A Multilayer Feed-Forward Neural Network 398 9.2.2 Defining a Network Topology 400 9.2.3 Backpropagation 400 9.2.4 Inside the Black Box: Backpropagation and Interpretability 406 9.3 Support Vector Machines 408 9.3.1 The Case When the Data Are Linearly Separable 408 9.3.2 The Case When the Data Are Linearly Inseparable 413 9.4 Classification Using Frequent Patterns 415 9.4.1 Associative Classification 416 9.4.2 Discriminative Frequent Pattern–Based Classification 419 9.5 Lazy Learners (or Learning from Your Neighbors) 422 9.5.1 k-Nearest-Neighbor Classifiers 423 9.5.2 Case-Based Reasoning 425 9.6 Other Classification Methods 426 9.6.1 Genetic Algorithms 426 9.6.2 Rough Set Approach 427 9.6.3 Fuzzy Set Approaches 428 9.7 Additional Topics Regarding Classification 429 9.7.1 Multiclass Classification 430 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xvi #8 xvi Contents
  • 60. 9.7.2 Semi-Supervised Classification 432 9.7.3 Active Learning 433 9.7.4 Transfer Learning 434 9.8 Summary 436 9.9 Exercises 438 9.10 Bibliographic Notes 439 Chapter 10 Cluster Analysis: Basic Concepts and Methods 443 10.1 Cluster Analysis 444 10.1.1 What Is Cluster Analysis? 444 10.1.2 Requirements for Cluster Analysis 445 10.1.3 Overview of Basic Clustering Methods 448 10.2 Partitioning Methods 451 10.2.1 k-Means: A Centroid-Based Technique 451 10.2.2 k-Medoids: A Representative Object-Based Technique 454 10.3 Hierarchical Methods 457 10.3.1 Agglomerative versus Divisive Hierarchical Clustering 459 10.3.2 Distance Measures in Algorithmic Methods 461 10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees 462 10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling 466 10.3.5 Probabilistic Hierarchical Clustering 467
  • 61. 10.4 Density-Based Methods 471 10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density 471 10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure 473 10.4.3 DENCLUE: Clustering Based on Density Distribution Functions 476 10.5 Grid-Based Methods 479 10.5.1 STING: STatistical INformation Grid 479 10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method 481 10.6 Evaluation of Clustering 483 10.6.1 Assessing Clustering Tendency 484 10.6.2 Determining the Number of Clusters 486 10.6.3 Measuring Clustering Quality 487 10.7 Summary 490 10.8 Exercises 491 10.9 Bibliographic Notes 494 Chapter 11 Advanced Cluster Analysis 497 11.1 Probabilistic Model-Based Clustering 497 11.1.1 Fuzzy Clusters 499 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xvii #9 Contents xvii
  • 62. 11.1.2 Probabilistic Model-Based Clusters 501 11.1.3 Expectation-Maximization Algorithm 505 11.2 Clustering High-Dimensional Data 508 11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies 508 11.2.2 Subspace Clustering Methods 510 11.2.3 Biclustering 512 11.2.4 Dimensionality Reduction Methods and Spectral Clustering 519 11.3 Clustering Graph and Network Data 522 11.3.1 Applications and Challenges 523 11.3.2 Similarity Measures 525 11.3.3 Graph Clustering Methods 528 11.4 Clustering with Constraints 532 11.4.1 Categorization of Constraints 533 11.4.2 Methods for Clustering with Constraints 535 11.5 Summary 538 11.6 Exercises 539 11.7 Bibliographic Notes 540 Chapter 12 Outlier Detection 543 12.1 Outliers and Outlier Analysis 544 12.1.1 What Are Outliers? 544 12.1.2 Types of Outliers 545 12.1.3 Challenges of Outlier Detection 548
  • 63. 12.2 Outlier Detection Methods 549 12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods 549 12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods 551 12.3 Statistical Approaches 553 12.3.1 Parametric Methods 553 12.3.2 Nonparametric Methods 558 12.4 Proximity-Based Approaches 560 12.4.1 Distance-Based Outlier Detection and a Nested Loop Method 561 12.4.2 A Grid-Based Method 562 12.4.3 Density-Based Outlier Detection 564 12.5 Clustering-Based Approaches 567 12.6 Classification-Based Approaches 571 12.7 Mining Contextual and Collective Outliers 573 12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection 573 HAN 03-toc-ix-xviii-9780123814791 2011/6/1 3:32 Page xviii #10 xviii Contents 12.7.2 Modeling Normal Behavior with Respect to Contexts 574
  • 64. 12.7.3 Mining Collective Outliers 575 12.8 Outlier Detection in High-Dimensional Data 576 12.8.1 Extending Conventional Outlier Detection 577 12.8.2 Finding Outliers in Subspaces 578 12.8.3 Modeling High-Dimensional Outliers 579 12.9 Summary 581 12.10 Exercises 582 12.11 Bibliographic Notes 583 Chapter 13 Data Mining Trends and Research Frontiers 585 13.1 Mining Complex Data Types 585 13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences 586 13.1.2 Mining Graphs and Networks 591 13.1.3 Mining Other Kinds of Data 595 13.2 Other Methodologies of Data Mining 598 13.2.1 Statistical Data Mining 598 13.2.2 Views on Data Mining Foundations 600 13.2.3 Visual and Audio Data Mining 602 13.3 Data Mining Applications 607 13.3.1 Data Mining for Financial Data Analysis 607 13.3.2 Data Mining for Retail and Telecommunication Industries 609 13.3.3 Data Mining in Science and Engineering 611 13.3.4 Data Mining for Intrusion Detection and Prevention 614 13.3.5 Data Mining and Recommender Systems 615
• 65. 13.4 Data Mining and Society 618 13.4.1 Ubiquitous and Invisible Data Mining 618 13.4.2 Privacy, Security, and Social Impacts of Data Mining 620 13.5 Data Mining Trends 622 13.6 Summary 625 13.7 Exercises 626 13.8 Bibliographic Notes 628 Bibliography 633 Index 673
Foreword
Analyzing large amounts of data is a necessity. Even popular science books, like "super crunchers," give compelling cases where large amounts of data yield discoveries and intuitions that surprise even experts. Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with
• 66. cybersecurity and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and many more. Storage is inexpensive and getting even less so, as are data sensors. Thus, collecting and storing data is easier than ever before. The problem then becomes how to analyze the data. This is exactly the focus of this Third Edition of the book. Jiawei, Micheline, and Jian give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines). The exposition is extremely accessible to beginners and advanced readers alike. The book gives the fundamental material first and the more advanced material in follow-up chapters. It also has numerous rhetorical questions, which I found extremely helpful for maintaining focus. We have used the first two editions as textbooks in data mining courses at Carnegie Mellon and plan to continue to do so with this Third Edition. The new version has significant additions: Notably, it has more than 100 citations to works from 2006 onward, focusing on more recent material such as graphs and social networks,
• 67. sensor networks, and outlier detection. This book has a new section for visualization, has expanded outlier detection into a whole chapter, and has separate chapters for advanced methods—for example, pattern mining with top-k patterns and more, and clustering methods with biclustering and graph clustering. Overall, it is an excellent book on classic and modern data mining methods, and it is ideal not only for teaching but also as a reference book.
Christos Faloutsos
Carnegie Mellon University
Foreword to Second Edition
We are deluged by data—scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has
• 68. become the precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas including statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp the extraordinary progress over the last few decades. Six years ago, Jiawei Han’s and Micheline Kamber’s seminal textbook organized and presented Data Mining. It heralded a golden age of innovation in the field. This revision of their book reflects that progress; more than half of the references and historical notes are to recent work. The field has matured with many new and improved algorithms, and has broadened to include many more datatypes: streams, sequences, graphs, time-series, geospatial, audio, images, and video. We are certainly not at the end of the golden age—indeed, research and commercial interest in data mining continues to grow—but we are all fortunate to have this modern compendium. The book gives quick introductions to database and data mining concepts with particular emphasis on data analysis. It then covers in a chapter-by-chapter tour the concepts and techniques that underlie classification, prediction, association, and
• 69. clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and pragmatic rules of thumb about when to apply each technique. The Socratic presentation style is both very readable and very informative. I certainly learned a lot from reading the first edition and got re-educated and updated in reading the second edition. Jiawei Han and Micheline Kamber have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed …