1. Find all frequent itemsets of length 1 by scanning the database to count item occurrences.
2. Iteratively generate candidate itemsets of length k from frequent itemsets of length k-1, and prune unpromising candidates using the Apriori property.
3. Scan the database to determine truly frequent itemsets.
4. Generate association rules from each frequent itemset by splitting its items between an antecedent and a consequent, keeping only the rules that meet minimum confidence.
1. This is part of your
ADVANCED ALGORITHMS IN
COMPUTATIONAL BIOLOGY (C3),
1
2. Class Info
• Lecturer: Chi-Yao Tseng ( 曾祺堯 )
cytseng@citi.sinica.edu.tw
• Grading:
– No assignments
– Midterm:
• 2012/04/20
• I’m in charge of 17x2 points out of 120
• No take-home questions
2
3. Outline
• Introduction
– From data warehousing to data mining
• Mining Capabilities
– Association rules
– Classification
– Clustering
• More about Data Mining
3
4. Main Reference
• Jiawei Han, Micheline Kamber, Data Mining:
Concepts and Techniques, 2nd Edition, Morgan
Kaufmann, 2006.
– Official website:
http://www.cs.uiuc.edu/homes/hanj/bk2/
4
5. Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes (10^15 B = 1 million GB)
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, Facebook
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
5
6. Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as terabytes
• High-dimensionality of data
– Micro-array may have tens of thousands of dimensions
• High complexity of data
• New and sophisticated applications
6
7. Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
7
8. What is Data Mining?
• Knowledge discovery in databases
– Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data.
• Alternative names:
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
8
9. Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
9
10. Knowledge Discovery (KDD) Process
[Figure: the KDD process as a pipeline: Databases → Data Cleaning & Integration → Data warehouse → Selection & Transformation → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge]
• This is a view from typical database systems
and data warehousing communities.
• Data mining plays an essential role in the
knowledge discovery process.
10
11. Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers from top to bottom:]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
11
13. Typical Data Mining System
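[Figure: layered architecture, from top to bottom: Graphical User Interface, Pattern Evaluation, Data Mining Engine, Database or Data Warehouse Server. A Knowledge Base sits beside the Pattern Evaluation and Data Mining Engine layers. Below the server, data cleaning, integration, and selection draw on: Database, Data warehouse, World-Wide Web, Other info. repositories]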
13
14. Data Warehousing
• A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision making process. —W. H. Inmon
14
15. Data Warehousing
• Subject-oriented:
– Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
• Integrated:
– Constructed by integrating multiple, heterogeneous data sources.
• Time-variant:
– Provide information from a historical perspective (e.g., past 5-10
years.)
• Nonvolatile:
– Operational update of data does not occur in the data warehouse
environment
– Usually requires only two operations: load data & access data.
15
16. Data Warehousing
• The process of constructing and using data
warehouses
• A decision support database that is maintained
separately from the organization’s operational
database
• Support information processing by providing a solid
platform of consolidated, historical data for analysis
• Set up stages for effective data mining
16
17. Illustration of Data Warehousing
[Figure: data sources in Taipei, New York, London, … are cleaned, transformed, integrated, and loaded into the Data Warehouse; clients reach the warehouse through Query and Analysis Tools]
17
18. OLTP vs. OLAP
• OLTP (On-line Transaction Processing)
– Short online transactions: update, insert, delete
– Current & detailed data, versatile
– Runs on the online transaction processing (Tx.) database
• OLAP (On-line Analytical Processing)
– Complex queries: analytics, data mining, decision making
– Aggregated & historical data, static and low volume
– Runs on the data warehouse
18
19. Multi-Dimensional View of Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
19
20. Mining Capabilities (1/4)
• Multi-dimensional concept description:
Characterization and discrimination
– Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
• Frequent patterns (or frequent itemsets),
association
– Diaper → Beer [0.5%, 75%] (support, confidence)
20
21. Mining Capabilities (2/4)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown or missing numerical values
21
22. Mining Capabilities (3/4)
• Clustering
– Class label is unknown: Group data to form new categories
(i.e., clusters), e.g., cluster houses to find distribution
patterns
– Maximizing intra-class similarity & minimizing interclass
similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general
behavior of the data
– Noise or exception? Useful in fraud detection, rare events
analysis
22
23. Mining Capabilities (4/4)
• Time and ordering, trend and evolution
analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera →
large SD memory
– Periodicity analysis
– Motifs and biological sequence analysis
• Approximate and consecutive motifs
– Similarity-based analysis
23
24. More Advanced Mining
Techniques
• Data stream mining
– Mining data that is ordered, time-varying, potentially infinite.
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
• A person could be in multiple information networks: friends, family,
classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– Web is a big information network: from PageRank to Google
– Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
24
25. Challenges for Data Mining
• Handling of different types of data
• Efficiency and scalability of mining algorithms
• Usefulness and certainty of mining results
• Expression of various kinds of mining results
• Interactive mining at multiple abstraction levels
• Mining information from different sources of data
• Protection of privacy and data security
25
26. Brief Summary
• Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
26
27. A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and
Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001),
etc.
More details here: http://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html
• ACM Transactions on KDD starting in 2007
27
28. Conferences and Journals on Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
– DB: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT
– WEB & IR: CIKM, WWW, SIGIR
– ML & PR: ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
28
31. Basic Concepts
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis
31
32. Mining Association Rules
• Transaction data analysis. Given:
– A database of transactions (Each tx. has a list of
items purchased)
– Minimum confidence and minimum support
• Find all association rules: the presence of one
set of items implies the presence of another
set of items
Diaper → Beer [0.5%, 75%]
(support, confidence)
32
33. Two Parameters
• Confidence (how true)
– The rule X&Y ⇒Z has 90% confidence:
means 90% of customers who bought X and Y also
bought Z.
• Support (how useful is the rule)
– Useful rules should have some minimum
transaction support.
33
34. Mining Strong Association Rules in
Transaction Databases (1/2)
• Measurement of rule strength in a transaction
database.
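A → B [support, confidence]
$$\text{support} = \Pr(A \cup B) = \frac{\#\text{ of tx containing all items in } A \cup B}{\text{total } \#\text{ of tx}}$$
$$\text{confidence} = \Pr(B \mid A) = \frac{\#\text{ of tx containing all items in } A \cup B}{\#\text{ of tx containing } A}$$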
34
35. Mining Strong Association Rules in
Transaction Databases (2/2)
• We are often interested in only strong
associations, i.e.,
– support ≥ min_sup
– confidence ≥ min_conf
• Examples:
– milk → bread [5%, 60%]
– tire and auto_accessories → auto_services [2%,
80%].
35
36. Example of Association Rules
Transaction-id Items bought
1 A, B, D
2 A, C, D
3 A, D, E
4 B, E, F
5 B, C, D, E, F
Let min. support = 50%, min. confidence = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A → D (s = 60%, c = 100%)
D → A (s = 60%, c = 75%)
36
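The numbers in this example can be checked with a few lines of Python. This is a minimal sketch; the helper names `count`, `support`, and `confidence` are illustrative and not part of the lecture.

```python
# Transactions 1-5 from the example slide.
TRANSACTIONS = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for tx in transactions if itemset <= tx)

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate Pr(consequent | antecedent) from transaction counts."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

print(support({"A", "D"}, TRANSACTIONS))       # 0.6
print(confidence({"A"}, {"D"}, TRANSACTIONS))  # 1.0
print(confidence({"D"}, {"A"}, TRANSACTIONS))  # 0.75
```

D → A reaches only 75% confidence because D occurs in four transactions while A and D occur together in three.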
37. Two Steps for Mining Association Rules
• Determining “large (frequent) itemsets”
– The main factor for overall performance
– The downward closure property of frequent
patterns
• Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
• Generating rules
37
38. The Apriori Algorithm
• Apriori (R. Agrawal and R. Srikant. Fast algorithms
for mining association rules. VLDB'94.)
– Derivation of large 1-itemsets L1: At the first
iteration, scan all the transactions and count the
number of occurrences for each item.
– Level-wise derivation: At the kth iteration, the
candidate set Ck are those whose every (k-1)-item
subset is in Lk-1. Scan DB and count the # of
occurrences for each candidate itemset.
38
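The level-wise derivation can be sketched as follows. This is a simplified illustration rather than the optimized procedure of the VLDB'94 paper: candidates are formed by joining pairs of frequent (k-1)-itemsets, pruned with the downward closure property, and counted by a plain database scan. The function name `apriori` and the absolute threshold `min_count` are assumptions made for this sketch.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): count} for itemsets with count >= min_count."""
    transactions = [frozenset(tx) for tx in transactions]
    # First iteration: count every single item to obtain L1.
    counts = {}
    for tx in transactions:
        for item in tx:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-item subset must be in L(k-1).
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)
        # Scan the database and count each surviving candidate.
        counts = {c: sum(1 for tx in transactions if c <= tx) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# The five transactions of the earlier association-rule example.
TRANSACTIONS = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
for itemset, n in sorted(apriori(TRANSACTIONS, 3).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```

With a minimum count of 3 (50% of five transactions, rounded up) the only frequent 2-itemset is {A, D}, so no 3-item candidates are generated and the loop stops.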
40. From Large Itemsets to Rules
• For each large itemset m
– For each non-empty proper subset p of m
if ( sup(m) / sup(m-p) ≥ min_conf )
• output the rule (m-p)→p
– conf. = sup(m)/sup(m-p)
– support = sup(m)
• m = {a,c,d,e,f,g} 2000 tx’s,
p = {c,e,f,g}
m-p = {a,d} 5000 tx’s
– conf. = # {a,c,d,e,f,g} / # {a,d}
– rule: {a,d} →{c,e,f,g}
confidence: 40%, support: 2000 tx’s
40
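The rule-generation loop can be sketched in Python. The sketch assumes a dictionary `support_counts` that maps every frequent itemset, and each of its subsets, to its support count; the name `generate_rules` and the sample counts (taken from the earlier five-transaction example) are illustrative.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence, support_count) tuples.

    `support_counts` maps frozenset -> count and must contain every
    frequent itemset together with all of its non-empty subsets.
    """
    rules = []
    for m, sup_m in support_counts.items():
        if len(m) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(m)):
            for p in combinations(sorted(m), size):
                p = frozenset(p)
                antecedent = m - p
                conf = sup_m / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, p, conf, sup_m))
    return rules

# Support counts of the frequent itemsets from the five-transaction example.
COUNTS = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,
}
for antecedent, consequent, conf, sup in generate_rules(COUNTS, 0.5):
    print(sorted(antecedent), "->", sorted(consequent), conf, sup)
```

For the slide's own numbers the same division gives 2000 / 5000 = 0.4, so {a,d} → {c,e,f,g} is output only if min_conf is at most 40%.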
41. Redundant Rules
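• Given a rule {a,d} → {c,e,f,g} that meets min_sup and min_conf,
are the following rules guaranteed to meet the same thresholds
[agga98a]?
– {a,d} → {c,e,f} ? Yes! (support can only grow; the antecedent is unchanged)
– {a} → {c,e,f,g} ? No! (sup({a}) may far exceed sup({a,d}), so confidence may drop)
– {a,d,c} → {e,f,g} ? Yes! (same itemset with a larger antecedent, so confidence cannot drop)
– {a} → {c,d,e,f,g} ? No! (same itemset with a smaller antecedent, so confidence may drop)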
41
42. Practice
• Suppose we additionally have
– 500 ACE
– 600 BCD
– Support = 3 tx’s (50%), confidence = 66%
• Repeat the large itemset generation
– Identify all large itemsets
• Derive up to 4 rules
– Generate rules from the large itemsets with the
biggest number of elements (from big to small)
42
43. Discussion of The Apriori Algorithm
• Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. VLDB'94.)
– Derivation of large 1-itemsets L1: At the first iteration, scan
all the transactions and count the number of occurrences
for each item.
– Level-wise derivation: At the kth iteration, the candidate set
Ck are those whose every (k-1)-item subset is in Lk-1. Scan DB
and count the # of occurrences for each candidate
itemset.
• The cardinality (number of elements) of C2 is huge.
• The execution time for the first 2 iterations is the
dominating factor to overall performance!
• Database scan is expensive.
44. Improvement of the Apriori Algorithm
• Reduce passes of transaction database scans
• Shrink the number of candidates
• Facilitate the support counting of candidates
44
45. Example Improvement 1- Partition: Scan
Database Only Twice
• Any itemset that is potentially frequent in DB
must be frequent in at least one of the
partitions of DB
– Scan 1: partition database and find local frequent
patterns
– Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association in large databases. In
VLDB’95
45
46. Example Improvement 2- DHP
• DHP (direct hashing with pruning): Apriori + hashing
– Use hash-based method to reduce the size of C2.
– Allow effective reduction on tx database size (tx number and
each tx size.)
Tid Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E
J. Park, M.-S. Chen, and P. Yu.
An effective hash-based algorithm for mining association rules. In SIGMOD’95.
47. Mining Frequent Patterns w/o Candidate
Generation
• A highly compact data structure: frequent
pattern tree.
• An FP-tree-based pattern fragment growth
mining method.
• Search technique in mining: partitioning-
based, divide-and-conquer method.
• J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without
Candidate Generation, in SIGMOD’2000.
47
48. Frequent Pattern Tree (FP-tree)
• 3 parts:
– One root labeled as ‘null’
– A set of item prefix subtrees
– Frequent item header table
• Each node in the prefix subtree consists of
– Item name
– Count
– Node-link
• Each entry in the frequent-item header table consists of
– Item-name
– Head of node-link
48
49. The FP-tree Structure
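[Figure: FP-tree with its frequent-item header table. Header table items, in order: f, c, a, b, m, p; each entry's head of node-links points to the chain of tree nodes carrying that item. The tree, with children indented under their parent:]
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1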
49
50. FP-tree Construction: Step1
• Scan the transaction database DB once (the
first time), and derive a list of frequent
items.
• Sort frequent items in frequency descending
order.
• This ordering is important since each path of a
tree will follow this order.
50
51. Example (min. support = 3)
Tx ID   Items bought        (Ordered) frequent items
100     f,a,c,d,g,i,m,p     f,c,a,m,p
200     a,b,c,f,l,m,o       f,c,a,b,m
300     b,f,h,j,o           f,b
400     b,c,k,s,p           c,b,p
500     a,f,c,e,l,p,m,n     f,c,a,m,p
List of frequent items:
(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)
Frequent item header table (item order): f, c, a, b, m, p
51
52. FP-tree Construction: Step 2
• Create a root of a tree, label with “null”
• Scan the database the second time. The scan of the first tx
leads to the construction of the first branch of the tree.
Scan of 1st transaction: f,a,c,d,g,i,m,p → f,c,a,m,p
The 1st branch of the tree, starting at the root: <(f:1),(c:1),(a:1),(m:1),(p:1)>
52
53. FP-tree Construction: Step 2 (cont’d)
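• Scan of 2nd transaction:
a,b,c,f,l,m,o → f,c,a,b,m
– The shared prefix f,c,a is reused, so its counts become f:2, c:2, a:2
– Two new nodes, (b:1) and (m:1), are added below a
[Figure: root → f:2 → c:2 → a:2, with two branches under a: m:1 → p:1 and b:1 → m:1]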
53
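The two construction steps can be sketched in Python as follows. This is a minimal sketch: the class `Node`, the function `build_fp_tree`, and the `order` parameter are names chosen for illustration, and the header table is kept as a plain list of nodes per item instead of a linked chain. Because f and c both occur four times, the item order of the slides (f, c, a, b, m, p) is passed in explicitly; the default tie-break would be alphabetical, which is an assumption of the sketch.

```python
from collections import Counter

class Node:
    """One FP-tree node: item name, count, parent link, and children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> Node

def build_fp_tree(transactions, min_count, order=None):
    """Return (root, header); header maps item -> list of nodes for that item."""
    # Scan 1: count items and keep the frequent ones.
    freq = Counter(item for tx in transactions for item in set(tx))
    if order is None:
        frequent = [i for i, c in freq.items() if c >= min_count]
        # Descending frequency, ties broken alphabetically (assumption).
        order = sorted(frequent, key=lambda i: (-freq[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}
    # Scan 2: insert each transaction's ordered frequent items as a path.
    for tx in transactions:
        path = sorted((i for i in set(tx) if i in rank), key=rank.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree with children indented under their parent."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

TRANSACTIONS = [
    "facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn",
]
root, header = build_fp_tree(TRANSACTIONS, 3, order=list("fcabmp"))
show(root)
```

Each string stands for one transaction, with one character per item, matching Tx IDs 100 to 500 of the example.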
54. The FP-tree
Tx ID   Items bought        (Ordered) frequent items
100     f,a,c,d,g,i,m,p     f,c,a,m,p
200     a,b,c,f,l,m,o       f,c,a,b,m
300     b,f,h,j,o           f,b
400     b,c,k,s,p           c,b,p
500     a,f,c,e,l,p,m,n     f,c,a,m,p
[Figure: the resulting FP-tree. Frequent item header table (item order): f, c, a, b, m, p. The tree, with children indented under their parent:]
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
55. Mining Process
• Starts from the least frequent item p
– Mining order: p -> m -> b -> a -> c -> f
[Figure: frequent item header table listing f, c, a, b, m, p; mining proceeds from the bottom entry upward]
55
56. Mining Process for item p
• Starts from the least frequent item p (min. support = 3)
• Two paths of the FP-tree contain p:
– <f:4, c:3, a:3, m:2, p:2>
– <c:1, b:1, p:1>
• Conditional pattern base of “p”:
– <f:2, c:2, a:2, m:2>
– <c:1, b:1>
• Conditional frequent pattern:
– <c:3>
• So we have two frequent patterns:
{p:3}, {cp:3}
57. Mining Process for Item m
• min. support = 3
• Two paths of the FP-tree contain m:
– <f:4, c:3, a:3, m:2>
– <f:4, c:3, a:3, b:1, m:1>
• Conditional pattern base of “m”:
– <f:2, c:2, a:2>
– <f:1, c:1, a:1, b:1>
• Conditional frequent pattern:
– <f:3, c:3, a:3>
58. Mining m’s Conditional FP-tree
• Mine (<f:3, c:3, a:3> | m) gives (am:3), (cm:3), (fm:3)
– Mine (<f:3, c:3> | am) gives (cam:3), (fam:3)
• Mine (<f:3> | cam) gives (fcam:3)
– Mine (<f:3> | cm) gives (fcm:3)
• So we have frequent patterns:
{m:3}, {am:3}, {cm:3}, {fm:3}, {cam:3}, {fam:3}, {fcm:3}, {fcam:3}
58
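The recursion over conditional pattern bases can be sketched without building the tree itself. The sketch below is a simplification, not the FP-growth implementation of the SIGMOD 2000 paper: it works directly on weighted lists of ordered frequent items, which carry the same information as the prefix paths of the FP-tree. The name `pattern_growth` is an assumption made for illustration.

```python
from collections import Counter

def pattern_growth(paths, min_count, suffix=()):
    """Mine frequent itemsets from weighted, consistently ordered item tuples.

    `paths` is a list of (items, count) pairs and plays the role of a
    conditional pattern base. Returns {tuple(itemset): count}.
    """
    totals = Counter()
    for items, count in paths:
        for item in items:
            totals[item] += count
    results = {}
    for item, total in totals.items():
        if total < min_count:
            continue
        pattern = (item,) + suffix
        results[pattern] = total
        # Conditional pattern base of `item`: the prefix before it in each path.
        conditional = [
            (items[: items.index(item)], count)
            for items, count in paths
            if item in items and items.index(item) > 0
        ]
        results.update(pattern_growth(conditional, min_count, pattern))
    return results

# Ordered frequent items of Tx 100-500, each with weight 1.
ORDERED = [
    (("f", "c", "a", "m", "p"), 1),
    (("f", "c", "a", "b", "m"), 1),
    (("f", "b"), 1),
    (("c", "b", "p"), 1),
    (("f", "c", "a", "m", "p"), 1),
]
patterns = pattern_growth(ORDERED, 3)
for pattern in sorted(p for p in patterns if p[-1] in ("m", "p")):
    print("".join(pattern), patterns[pattern])
```

The patterns ending in p and m reproduce the two slides above: {p:3} and {cp:3} for p, and eight patterns with support 3 for m.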
59. Analysis of the FP-tree-based method
• Find the complete set of frequent itemsets
• Efficient because
– Works on a reduced set of pattern bases
– Performs mining operations less costly than
generation & test
• Cons:
– No advantage if most tx’s are short
– The FP-tree does not always fit into main memory
59
60. Generalized Association Rules
• Given the class hierarchy (taxonomy), one would
like to choose proper data granularities for
mining.
• Different confidence/support may be
considered.
• R. Srikant and R. Agrawal, Mining generalized association
rules, VLDB’95.
60
63. Other Relevant Topics
• Max patterns
– R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98.
• Closed patterns
– N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99.
• Sequential Patterns
– What items one will purchase if he/she has bought some certain
items.
– R. Srikant and R. Agrawal, Mining sequential patterns, ICDE’95
• Traversal Patterns
– Mining path traversal patterns in a web environment where
documents or objects are linked together to facilitate interactive
access.
– M.-S. Chen, J. Park and P. Yu. Efficient Data Mining for Path Traversal
Patterns. TKDE’98.
and more…
65. Classification
• Classifying tuples in a database.
• Each tuple has some attributes with known
values.
• In training set E
– Each tuple consists of the same set of multiple
attributes as the tuples in the large database W.
– Additionally, each tuple has a known class
identity.
65
66. Classification (cont’d)
• Derive the classification mechanism from the
training set E, and then use this mechanism to
classify general data (in testing set.)
• A decision tree based approach has been
influential in machine learning studies.
66
67. Classification –
Step 1: Model Construction
• Train model from the existing data pool
[Figure: Training data → Classification algorithm → Classification rules]
name     age      income   own cars?
Sandy    <=30     low      no
Bill     <=30     low      yes
Fox      31…40    high     yes
Susan    >40      med      no
Claire   >40      med      no
Andy     31…40    high     yes
67
68. Classification –
Step 2: Model Usage
[Figure: Testing data → Classification rules → predicted class]
name     age      income   own cars?
John     >40      high     ?
Sally    <=30     low      ?
Annie    31…40    high     ?
Predicted classes shown in the figure: No, No, Yes
68
69. What is Prediction?
• Prediction is similar to classification
– First, construct model
– Second, use model to predict future of unknown
objects
• Prediction is different from classification
– Classification refers to predicting categorical class
labels.
– Prediction refers to predicting continuous values.
• Major method: regression
69
70. Supervised vs. Unsupervised
Learning
• Supervised learning (e.g., classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations.
• Unsupervised learning (e.g., clustering)
– We are given a set of measurements, observations,
etc. with the aim of establishing the existence of
classes or clusters in the data.
– No training data, or the “training data” are not
accompanied by class labels.
70
71. Evaluating Classification Methods
• Predictive accuracy
• Speed
– Time to construct the model and time to use the model
• Robustness
– Handling noise and missing values
• Scalability
– Efficiency in large databases (not memory resident data)
• Goodness of rules
– Decision tree size
– The compactness of classification rules
71
72. A Decision-Tree Based Classification
• A decision tree for deciding whether to play tennis (leaves are labeled P or N):
outlook
  sunny → humidity
    high → N
    low → P
  overcast → P
  rainy → windy
    yes → N
    no → P
• ID-3 and its extended version C4.5 (Quinlan’93):
A top-down decision tree generation algorithm
72
73. Algorithm for Decision Tree Induction
(1/2)
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-
conquer manner.
– Attributes are categorical.
(If an attribute is a continuous number, it needs to be discretized in
advance.) E.g., 0 <= age <= 100 is divided into the ranges
0 ~ 20, 21 ~ 40, 41 ~ 60, 61 ~ 80, 81 ~ 100.
– At start, all the training examples are at the root.
– Examples are partitioned recursively based on selected
attributes.
73
74. Algorithm for Decision Tree Induction
(2/2)
• Basic algorithm (a greedy algorithm)
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain): maximizing an
information gain measure, i.e., favoring the partitioning
which makes the majority of examples belong to a single
class.
– Conditions for stopping partitioning:
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
74
75. Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
75
77. Primary Issues in Tree Construction (1/2)
• Split criterion: Goodness function
– Used to select the attribute to be split at a tree
node during the tree generation phase
– Different algorithms may use different goodness
functions:
• Information gain (used in ID3/C4.5)
• Gini index (used in CART)
77
78. Primary Issues in Tree Construction (2/2)
• Branching scheme:
– Determining the tree branch to which a sample
belongs
– Binary vs. k-ary splitting (e.g., three branches for Income: high, medium, low)
• When to stop the further splitting of a node?
e.g. impurity measure
• Labeling rule: a node is labeled as the class to
which most samples at the node belong.
78
79. How to Use a Tree?
• Directly
– Test the attribute value of unknown sample against the
tree.
– A path is traced from root to a leaf which holds the label.
• Indirectly
– Decision tree is converted to classification rules.
– One rule is created for each path from the root to a leaf.
– IF-THEN is easier for humans to understand .
79
80. Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Expected information (entropy):
Entropy is a measure of how "mixed up" an attribute is.
It is sometimes equated to the purity or impurity of a
variable.
High entropy means that we are sampling from a uniform
(boring) distribution.
81. Expected Information (Entropy)
Expected information (entropy) needed to classify a tuple in D:
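$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \quad (m\text{: number of labels})$$
$$Info(D) = I(3,2) = -\frac{3}{5}\log_2\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\left(\frac{2}{5}\right) \approx -\frac{3}{5}\times(-0.737) - \frac{2}{5}\times(-1.322) \approx 0.971$$
$$Info(D) = I(5,0) = -\frac{5}{5}\log_2\left(\frac{5}{5}\right) - \frac{0}{5}\log_2\left(\frac{0}{5}\right) = 0 - 0 = 0$$
(The term with probability 0 is taken to be 0.)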
81
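The two worked values can be reproduced with a short function. This is a minimal sketch; the name `info` mirrors the slide's Info(D) = I(...) notation and takes the class counts as arguments.

```python
import math

def info(*class_counts):
    """Entropy, in bits, of a class distribution given by its counts."""
    total = sum(class_counts)
    result = 0.0
    for c in class_counts:
        if c:  # a class with zero tuples contributes 0 (0 * log2(0) := 0)
            p = c / total
            result -= p * math.log2(p)
    return result

print(round(info(3, 2), 3))  # 0.971
print(info(5, 0))            # 0.0
```

A pure set such as I(5,0) needs no further information, while an even two-class split such as I(1,1) needs exactly one bit.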
82. Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = I(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using A to split D into v partitions) to
classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times I(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
82
83. Expected Information (Entropy)
Information needed (after using A to split D into v
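partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times I(D_j)$$
Two example splits of a 5-tuple set D into partitions of 2 and 3 tuples:
$$Info_A(D) = \frac{2}{5} Info(1,1) + \frac{3}{5} Info(2,1)$$
$$Info_A(D) = \frac{2}{5} Info(2,0) + \frac{3}{5} Info(3,0)$$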
83
84. Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$
age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971
$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
(The training data are the 14 tuples listed under “Decision Tree Induction: Training Dataset”.)
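The four gains can be recomputed from the 14 training tuples. This is a minimal sketch; `ROWS`, `entropy`, and `gain` are names chosen for illustration. Computed at full floating-point precision, the values agree with the slide to within rounding in the third decimal place, because the slide subtracts already rounded intermediate values.

```python
import math
from collections import Counter

ATTRS = ("age", "income", "student", "credit_rating")
# (age, income, student, credit_rating, buys_computer) from the training dataset.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Info(D) for a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at `attr_index`."""
    labels = [r[-1] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(ROWS, i):.3f}")
```

age has the largest gain, so it would be selected as the splitting attribute at the root.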
85. Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values.
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain.)
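$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
– GainRatio(A) = Gain(A)/SplitInfo_A(D)
– Ex. income splits D into partitions of 4, 6, and 4 tuples:
$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557$$
GainRatio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the
splitting attribute.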
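The split information for income can be evaluated directly from the partition sizes. This is a small sketch; `split_info` is an illustrative name, the sizes 4, 6, 4 are the low, medium, and high counts in the 14-tuple training set, and 0.029 is the slide's Gain(income).

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes of the partitions D_1..D_v."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n)

income_sizes = [4, 6, 4]          # low, medium, high
si = split_info(income_sizes)
print(round(si, 3))               # 1.557
print(round(0.029 / si, 3))       # 0.019
```

Three partitions of sizes 4, 6, and 4 are close to an even three-way split, whose split information would be log2(3) ≈ 1.585, so a value above 1.5 is expected here.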
86. Gini index (CART, IBM
IntelligentMiner)
• If a data set D contains examples from n classes, the gini index Gini(D) is
defined as
$$Gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D_1 and D_2, the gini index Gini_A(D)
is defined as:
$$Gini_A(D) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$$
• Reduction in impurity:
$$\Delta Gini(A) = Gini(D) - Gini_A(D)$$
• The attribute that provides the smallest Gini_A(D) (or the largest reduction in
impurity) is chosen to split the node (need to enumerate all the possible
splitting points for each attribute.)
86
87. Gini index (CART, IBM
IntelligentMiner)
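• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no.”
$$Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
• Suppose the attribute income partitions D into 10 in D_1: {low, medium}
and 4 in D_2: {high}. D_1 holds 7 “yes” and 3 “no” tuples; D_2 holds 2 of each.
$$Gini_{income \in \{low,medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)$$
$$= \frac{10}{14}\left[1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right] + \frac{4}{14}\left[1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right]$$
$$= 0.443 = Gini_{income \in \{high\}}(D)$$
• The other binary splits score higher: Gini for income ∈ {low, high} is 0.458 and for income ∈ {medium, high} is 0.450, so {low, medium} vs. {high} is the best split since its value is the lowest.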
87
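The Gini values for every binary split on income can be recomputed from the class counts in the training dataset. This is a minimal sketch; `INCOME`, `gini`, `gini_split`, and `binary_split` are illustrative names, and the (yes, no) counts per income value are tallied from the 14 training tuples.

```python
def gini(counts):
    """Gini(D) = 1 - sum of squared class frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini index of a split; each group is a tuple of class counts."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

# (yes, no) counts of buys_computer for each income value.
INCOME = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def binary_split(left_values):
    """Gini index of splitting income into `left_values` vs. the rest."""
    left = [sum(x) for x in zip(*(INCOME[v] for v in left_values))]
    right = [sum(x) for x in zip(*(INCOME[v] for v in INCOME if v not in left_values))]
    return gini_split([left, right])

print(round(gini((9, 5)), 3))                       # 0.459
print(round(binary_split({"low", "medium"}), 3))    # 0.443
print(round(binary_split({"low", "high"}), 3))      # 0.458
print(round(binary_split({"medium", "high"}), 3))   # 0.45
```

Under these counts the split {low, medium} vs. {high} has the lowest Gini index of the three binary splits on income.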
88. Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
• G-statistics: has a close approximation to χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
– The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear combination of
attributes.
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others
88
89. Other Types of Classification Methods
• Bayes Classification Methods
• Rule-Based Classification
• Support Vector Machine (SVM)
• Some of these methods will be taught in the
following lessons.
89
91. What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster Analysis
– Grouping a set of data objects into clusters
• Typical applications:
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
91
92. General Applications of Clustering
• Spatial data analysis
– Create thematic maps in GIS by clustering feature spaces.
– Detect spatial clusters and explain them in spatial data mining.
• Image Processing
• Pattern recognition
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Web-log data to discover groups of similar access
patterns
93. Examples of Clustering
Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs.
• Land use: Identification of areas of similar land use in an earth
observation database.
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost.
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location.
94. What is Good Clustering?
• A good clustering method will produce high quality
clusters with
– High intra-class similarity
– Low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
• The quality of a clustering method is also measured by
its ability to discover hidden patterns.
95. Requirements of Clustering
in Data Mining (1/2)
• Scalability
• Ability to deal with different types of
attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements of domain knowledge
for input
• Able to deal with outliers
96. Requirements of Clustering
in Data Mining (2/2)
• Insensitive to order of input records
• High dimensionality
– Curse of dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
97. Clustering Methods (I)
• Partitioning Method
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– K-means, k-medoids, CLARANS
• Hierarchical Method
– Create a hierarchical decomposition of the set of data (or objects)
using some criterion
– DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based Method
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
98. Clustering Methods (II)
• Grid-based approach
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
• Model-based approach
– A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
– Typical methods: EM, SOM, COBWEB
• Frequent pattern-based
– Based on the analysis of frequent patterns
– Typical methods: pCluster
• User-guided or constraint-based
– Clustering by considering user-specified or application-specific constraints
– Typical methods: cluster-on-demand, constrained clustering
99. Typical Alternatives to
Calculate the Distance between Clusters
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq) over all tip in Ki, tjq in Kj
• Complete link: largest distance between an element in one cluster and
an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq) over all tip in Ki, tjq in Kj
• Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq) over all tip in Ki, tjq in Kj
• Centroid: distance between the centroids of two clusters,
i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters,
i.e., dis(Ki, Kj) = dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
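The five inter-cluster distances above can be sketched in Python as follows. This is an illustration under stated assumptions: points are tuples, the element distance is Euclidean (`math.dist`, Python 3.8+), and the medoid is taken to be the object with the smallest total distance to the other objects of its cluster, which is one common way to pick a "centrally located" object. All function names are hypothetical.

```python
import math


def single_link(Ki, Kj):
    """Smallest pairwise distance between the two clusters."""
    return min(math.dist(p, q) for p in Ki for q in Kj)


def complete_link(Ki, Kj):
    """Largest pairwise distance between the two clusters."""
    return max(math.dist(p, q) for p in Ki for q in Kj)


def average_link(Ki, Kj):
    """Average of all pairwise distances between the two clusters."""
    return sum(math.dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))


def centroid(K):
    """Coordinate-wise mean of the cluster."""
    return tuple(sum(coord) / len(K) for coord in zip(*K))


def medoid(K):
    """Assumption: the object minimizing the total distance to all others."""
    return min(K, key=lambda m: sum(math.dist(m, x) for x in K))


def centroid_dist(Ki, Kj):
    return math.dist(centroid(Ki), centroid(Kj))


def medoid_dist(Ki, Kj):
    return math.dist(medoid(Ki), medoid(Kj))


Ki = [(0, 0), (0, 1)]
Kj = [(3, 0), (4, 0)]
for f in (single_link, complete_link, average_link, centroid_dist, medoid_dist):
    print(f.__name__, round(f(Ki, Kj), 3))
```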
100. Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
• Radius: square root of the average squared distance from any point
of the cluster to its centroid
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$$
• Diameter: square root of the average squared distance between all
pairs of points in the cluster
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
• Note: diameter ≠ 2 × radius
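A small numeric check of these three definitions, written as a sketch. It assumes points are numeric tuples and that the squared difference between two points means the squared Euclidean distance; the function names are illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the cluster."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))


def diameter(points):
    """Square root of the average squared distance over all ordered pairs."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))


square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(centroid(square), radius(square), diameter(square))
```

For the four corners of this square the diameter comes out smaller than twice the radius, which illustrates the slide's remark that the two quantities are not simply related by a factor of 2.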
101. Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a
database D of n objects into a set of k clusters.
• Given a number k, find a partition of k clusters
that optimizes the chosen partitioning criterion.
– Global optimal: exhaustively enumerate all partitions.
– Heuristic methods: k-means, k-medoids
• k-means (MacQueen’67)
• k-medoids or PAM, Partitioning Around Medoids (Kaufman &
Rousseeuw’87)
102. The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in
four steps:
1. Arbitrarily choose k points as initial cluster centroids.
2. Update means (centroids): compute seed points as
the centers of the clusters of the current partition
(center: mean point of the cluster). On the first pass,
the initial centroids from Step 1 serve as the seed points.
3. Re-assign points: assign each object to the cluster with
the nearest seed point.
4. Go back to Step 2 (Steps 2 and 3 form the loop); stop
when no assignment changes.
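The four steps can be sketched in plain Python as below. This is a minimal illustration, not a production implementation: the `init`, `seed` and `max_iter` parameters are assumptions added for reproducibility, distance is Euclidean, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Plain k-means on a list of numeric tuples."""
    # Step 1: arbitrarily choose k points as initial centroids
    # (or use the caller-supplied ones, a hypothetical convenience).
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean point of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, k=2, init=[(1, 1), (1, 2)])
print(centroids, assignment)
```

Even though both initial centroids are taken from the same group here, the loop separates the two groups after two updates; on less well-separated data a poor start can leave k-means in a local optimum.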
103. Example of the
K-Means Clustering Method
[Figure: k-means example with k = 2. Panels: arbitrarily choose k objects as initial cluster centroids; assign each object to the most similar centroid; update the cluster means; re-assign; update the cluster means; re-assign.]
104. Comments on the K-Means Clustering
• Time complexity: O(tkn), where n is the number of objects, k is
the number of clusters, and t is the number of iterations. Normally,
k, t << n.
• Often terminates at a local optimum.
(The global optimum may be found using techniques such as
deterministic annealing and genetic algorithms.)
• Weaknesses:
– Applicable only when a mean is defined; what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
105. Why is K-Means Unable to
Handle Outliers?
• The k-means algorithm is sensitive to outliers
– An object with an extremely large value may
substantially distort the distribution of the data.
• K-Medoids: instead of taking the mean value of the
objects in a cluster as a reference point, a medoid can
be used, which is the most centrally located object in
the cluster.
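A one-dimensional illustration of this sensitivity, as a sketch. The data values are made up for the example, and the medoid is assumed to be the object with the smallest total absolute distance to the others.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """Assumption: the object with the smallest total distance to all others."""
    return min(values, key=lambda m: sum(abs(m - v) for v in values))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [100]  # one extreme value added

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

Adding the single value 100 drags the mean far outside the range of the original objects, while the medoid remains one of them.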
106. PAM: The K-Medoids Method
• PAM: Partitioning Around Medoids
• Uses real objects to represent the clusters
1. Randomly select k representative objects as medoids.
2. Assign each data point to the closest medoid.
3. For each medoid m:
a. For each non-medoid data point o:
b. Swap m and o, and compute the total cost of the
configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
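The steps above can be sketched as follows. This is a simplified illustration under assumptions: the total cost of a configuration is taken to be the sum of Euclidean distances from each object to its closest medoid, the initial medoids are passed in by the caller rather than drawn at random, and each pass applies the single best swap. The names `total_cost` and `pam` are illustrative.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum of distances from every object to its closest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def pam(points, medoids):
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:  # repeat until no swap lowers the cost
        improved = False
        best_swap = None
        # Try swapping every medoid m with every non-medoid object o.
        for i, o in product(range(len(medoids)), points):
            if o in medoids:
                continue
            candidate = medoids[:i] + [o] + medoids[i + 1:]
            cost = total_cost(points, candidate)
            if cost < best:
                best, best_swap = cost, candidate
        if best_swap is not None:  # keep the lowest-cost configuration
            medoids, improved = best_swap, True
    # Final assignment of each object to its closest medoid.
    clusters = [
        min(range(len(medoids)), key=lambda c: math.dist(p, medoids[c]))
        for p in points
    ]
    return medoids, clusters, best


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, clusters, cost = pam(points, [(1, 1), (1, 2)])
print(medoids, clusters, cost)
```

Every candidate swap recomputes the full cost here, which keeps the sketch short but is slower than computing only the change caused by the swap.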
107. A Typical K-Medoids Algorithm (PAM)
[Figure: PAM example with k = 2. Panels: arbitrarily choose k objects as initial medoids (m1, m2); assign each remaining object to the nearest medoid; swap each medoid and each data point, and compute the total cost of the configuration.]
108. PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$
• Notation:
– t, i: original medoids
– h: the object to swap with i
– j: any non-selected object
• Four cases for the contribution Cjih of object j:
– j belongs to i and d(j, h) < d(j, t), so j moves to h: Cjih = d(j, h) − d(j, i)
– j belongs to t and stays with t: Cjih = 0
– j belongs to i and d(j, h) > d(j, t), so j moves to t: Cjih = d(j, t) − d(j, i)
– j belongs to t but is closer to h, so j moves to h: Cjih = d(j, h) − d(j, t)
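The case analysis can be turned into a short function and checked against a direct recomputation of the total cost. This is a sketch under assumptions: distances are Euclidean, d(j, t) is the distance to the closest of the remaining medoids, and the sum runs over every object (including i and h themselves) so that TCih equals the exact change in total cost. The names `swap_cost` and `total_cost` are illustrative.

```python
import math


def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def swap_cost(points, medoids, i, h):
    """TC_ih: change in total cost when medoid i is replaced by object h."""
    others = [m for m in medoids if m != i]
    tc = 0.0
    for j in points:
        d_i, d_h = math.dist(j, i), math.dist(j, h)
        d_t = min((math.dist(j, t) for t in others), default=math.inf)
        if d_i <= d_t:
            # j currently belongs to i: it moves to h or to t, whichever is closer.
            tc += min(d_h, d_t) - d_i
        elif d_h < d_t:
            # j belongs to t but h is closer: it moves to h.
            tc += d_h - d_t
        # Otherwise j stays with t and contributes 0.
    return tc


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
old = [(1, 1), (1, 2)]
new = [(1, 1), (8, 8)]
tc = swap_cost(points, old, i=(1, 2), h=(8, 8))
print(tc, total_cost(points, new) - total_cost(points, old))
```

A negative TCih means the swap lowers the total cost, so the swap is worth making.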
109. What is the Problem with PAM?
• PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean.
• PAM works efficiently for small data sets but does not
scale well to large data sets.
– O(k(n−k)²) for each iteration,
where n is the number of data objects and k is the number of clusters
– Improvements: CLARA (uses a sampled set to determine
medoids), CLARANS
110. Hierarchical Clustering
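• Uses a distance matrix as the clustering criterion.
• This method does not require the number of clusters k as an
input, but needs a termination condition.
[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. Divisive clustering (DIANA) runs the same steps in reverse order, from abcde back to the single objects.]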
111. AGNES (Agglomerative Nesting)
• Introduced in Kaufman and Rousseeuw (1990)
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
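A brute-force sketch of single-link agglomerative merging in the spirit of AGNES. It is an illustration under assumptions: Euclidean distance, a target number of clusters `k` as the termination condition, and a full pairwise scan per merge instead of a maintained dissimilarity matrix. The function name is hypothetical.

```python
import math


def agnes_single_link(points, k=1):
    """Merge the two closest clusters (single link) until k clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    merges = []                       # (distance, left cluster, right cluster)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]  # new list; recorded merges stay intact
        del clusters[b]
    return clusters, merges


points = [(0, 0), (1, 0), (4, 0), (6, 0), (20, 0)]
clusters, merges = agnes_single_link(points, k=2)
print(clusters)
print([d for d, _, _ in merges])
```

The recorded merge distances never decrease from one merge to the next, which matches the "non-descending" remark on the slide; the list of merges is also the information a dendrogram displays.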
112. Dendrogram:
Shows How the Clusters are Merged
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected
component then forms a cluster.
113. DIANA (Divisive Analysis)
• Introduced in Kaufman and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own.
114. More on Hierarchical Clustering
• Major weaknesses:
– Does not scale well: time complexity is at least O(n²), where n is
the total number of objects.
– Can never undo what was done previously.
• Integration of hierarchical with distance-based clustering
– BIRCH (1996): uses a CF-tree data structure and incrementally
adjusts the quality of sub-clusters.
– CURE (1998): selects well-scattered points from the cluster and
then shrinks them towards the center of the cluster by a
specified fraction.
115. Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
117. Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: maximum radius of the neighborhood
– MinPts: minimum number of points in an Eps-neighborhood
of that point
• NEps(q) = {p | dist(p, q) ≤ Eps}, where p and q are two data points
• Directly density-reachable: a point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: |NEps(q)| ≥ MinPts
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
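These definitions translate almost directly into code. The sketch below assumes Euclidean distance and counts q itself as a member of its own Eps-neighborhood; the function names and sample points are illustrative.

```python
import math


def eps_neighborhood(points, q, eps):
    """N_Eps(q): all points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]


def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is core."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)


points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.4, 0)]
eps, min_pts = 1.0, 5

print(is_core(points, (0, 0), eps, min_pts))
print(directly_density_reachable(points, (1.4, 0), (0.5, 0), eps, min_pts))
print(directly_density_reachable(points, (0.5, 0), (1.4, 0), eps, min_pts))
```

The last two calls show that the relation is not symmetric: the point at (1.4, 0) is reachable from the core point (0.5, 0), but not the other way round, because (1.4, 0) is not a core point.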
118. Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi.
• Density-connected:
– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps
and MinPts.
[Figure: a chain from q through p2 to p; points p and q both density-reachable from o]
119. DBSCAN: Density Based Spatial Clustering of
Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with
noise.
[Figure: core and border points of a cluster, with Eps = 1 cm and MinPts = 5]
120. DBSCAN: The Algorithm
• Arbitrarily select an unvisited point p.
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts.
• If p is a core point, a cluster is formed. Mark all these points
as visited.
• If p is a border point (no points are density-reachable from
p), mark p as visited; DBSCAN then visits the next point of the
database.
• Continue the process until all of the points have been visited.
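The procedure can be sketched compactly in Python. This is an illustrative brute-force version under assumptions: Euclidean distance, a linear scan for each neighborhood query instead of a spatial index, and the label -1 for points that end up in no cluster. The names `dbscan` and `NOISE` are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:           # collect everything density-reachable from i
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:   # j is a core point too: expand through it
                queue.extend(nj)
    return labels


points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.4, 0), (10, 10)]
labels = dbscan(points, eps=1.0, min_pts=5)
print(labels)
```

In this sample the point (1.4, 0) joins the cluster as a border point through the core point (0.5, 0), while (10, 10) has no neighbors and stays labelled as noise.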
121. References (1)
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional
data for data mining applications. SIGMOD'98
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering
structure, SIGMOD’99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996
• Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
• M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large
spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for
efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172,
1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic
systems. VLDB’98.
122. References (2)
• V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.
• S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
• S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-
521, Sydney, Australia, March 1999.
• A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
COMPUTER, 32(8): 68-75, 1999.
• L. Kaufman and P. J. Rousseeuw, 1987. Clustering by Means of Medoids. In: Dodge, Y. (Ed.), Statistical Data Analysis
Based on the L1 Norm, North Holland, Amsterdam. pp. 405-416.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons,
1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
• J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings
of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press,
1:281-297
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons,
1988.
• P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
123. References (3)
• L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review , SIGKDD
Explorations, 6(1), June 2004
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996
Int. Conf. on Pattern Recognition.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for
very large spatial databases. VLDB’98.
• A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases,
ICDT'01.
• A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01
• H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD’ 02.
• W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB’97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large
databases. SIGMOD'96.
• Wikipedia: DBSCAN. http://en.wikipedia.org/wiki/DBSCAN.
126. The Top 10 Algorithms
The 3-Step Identification Process
1. Nominations. ACM KDD Innovation Award and IEEE
ICDM Research Contributions Award winners were
invited in September 2006 to each nominate up to 10
best-known algorithms.
2. Verification. Each nomination was verified for its
citations on Google Scholar in late October 2006, and
those nominations that did not have at least 50
citations were removed. 18 nominations survived and
were then organized in 10 topics.
3. Voting by the wider community.
127. Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
• Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.,
1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth, 1984.
– #3. K Nearest Neighbors (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant
Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
– #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-
Verlag.
– #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New
York.
• Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In SIGMOD '00.
Editor's Notes
IMS= information management system
Primary goal: discovering knowledge. Thus, we want to find proper values of confidence and support in different hierarchical levels.
I: the expected information needed to classify a given sample
E (entropy): expected information based on the partitioning into subsets by A