# Data Mining Algorithms


1. **Data Mining Algorithms: Cluster Analysis**

   Graham Williams, Principal Data Miner, ATO; Adjunct Associate Professor, ANU

   Copyright © 2006, Graham J. Williams, http://togaware.com

   **Overview**

   - Cluster Analysis: Introduction, Requirements
   - Measuring Similarity: Distances, Data Types
   - Algorithms: Cluster Methods, KMeans

   **What is Cluster Analysis?**

   1. How do we understand the behaviour of an individual?
      - Paint everyone with the same brush;
      - Treat everyone as an individual.
   2. How do we understand the world: through understanding every individual in the world?
   3. We categorise, for good or bad, entities into groups:
      - Socio-economic groups: "the poor", "the rich";
      - Political: a lefty, a new right;
      - Racial: religious, geographical.
   4. We find that to get through in life we generally talk about groups, not individuals, but computers don't need to: they have the power to build an understanding of the individual, for better or worse.

   **What is Cluster Analysis?** (continued)

   - Cluster: a collection of data objects
     - Similar to one another within the same cluster
     - Dissimilar to the objects in other clusters
   - Cluster analysis
     - Grouping a set of data objects into clusters
   - Clustering is unsupervised classification: no predefined classes (descriptive data mining).
   - Typical applications
     - As a stand-alone tool to get insight into data distribution
     - As a preprocessing step for other algorithms

   **General Applications of Clustering**

   - Pattern Recognition
   - Spatial Data Analysis
     - create thematic maps in GIS by clustering feature spaces
     - detect spatial clusters and explain them in spatial data mining
   - Image Processing
   - Economic Science (especially market research)
   - WWW
     - Document Classification
     - Question Categorisation
     - Weblog Access Patterns

   **Specific Examples**

   - Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
   - Land use: Identification of areas of similar land use in an earth observation database
   - Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
   - City-planning: Identifying groups of houses according to their house type, value, and geographical location
   - Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
2. **What Is Good Clustering?**

   - High quality:
     - high intra-class similarity
     - low inter-class similarity
   - Depends on:
     - similarity measure
     - algorithm for searching
   - Ability to discover hidden patterns

   **Clustering Caveats**

   Clustering may not be the best way to discover interesting groups in a data set. Visualisation methods often work well, allowing the human expert to identify useful groups. However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups.

   Different algorithms deliver different clusterings.

   **Requirements of Clustering in Data Mining**

   - Scalability
   - Different attribute types
   - Clusters with arbitrary shape
   - Minimal domain knowledge required
   - Can cope with noise and outliers
   - Insensitive to order of input records
   - High dimensionality

   **Similarity and Dissimilarity Between Objects**

   Distance measures the similarity or dissimilarity between two data objects $a = (a_1, a_2, \dots, a_p)$ and $b = (b_1, b_2, \dots, b_p)$.

   Properties:

   - $d(a, b) \ge 0$
   - $d(a, a) = 0$
   - $d(a, b) = d(b, a)$
   - $d(a, b) \le d(a, c) + d(c, b)$

   **Minkowski distance**

   $$d(a, b) = \sqrt[q]{|a_1 - b_1|^q + |a_2 - b_2|^q + \dots + |a_p - b_p|^q}$$

   If $q = 1$, $d$ is the Manhattan distance:

   $$d(a, b) = |a_1 - b_1| + |a_2 - b_2| + \dots + |a_p - b_p|$$

   If $q = 2$, $d$ is the Euclidean distance:

   $$d(a, b) = \sqrt{|a_1 - b_1|^2 + |a_2 - b_2|^2 + \dots + |a_p - b_p|^2}$$
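The Minkowski distance above can be sketched in a few lines of R. This is a minimal illustration, not code from the slides: the helper name `minkowski` and the example vectors are assumptions made for the sake of the example. Base R's `dist()` is used only as a cross-check.

```r
# Minkowski distance between two numeric vectors a and b.
# `minkowski` is a hypothetical helper name, not part of the slides.
minkowski <- function(a, b, q = 2) {
  stopifnot(length(a) == length(b), q >= 1)
  sum(abs(a - b)^q)^(1 / q)
}

# Example vectors (made up for illustration).
a <- c(1, 2, 3)
b <- c(4, 6, 3)

d1 <- minkowski(a, b, q = 1)  # Manhattan: 3 + 4 + 0
d2 <- minkowski(a, b, q = 2)  # Euclidean: sqrt(9 + 16 + 0)

# Base R's dist() computes the same distances between rows of a matrix.
m <- rbind(a, b)
d1_base <- as.numeric(dist(m, method = "manhattan"))
d2_base <- as.numeric(dist(m, method = "euclidean"))
```

Here `d1` is the Manhattan distance and `d2` the Euclidean distance between the same pair of objects; other values of `q` interpolate between the two and go beyond them.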
3. **Type of data in clustering analysis**

   - Interval-scaled variables
   - Binary variables
   - Nominal, ordinal, and ratio variables
   - Variables of mixed types

   **Major Clustering Approaches**

   - Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.
   - Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion.
   - Density-based: based on connectivity and density functions.
   - Grid-based: based on a multiple-level granularity structure.
   - Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model.

   **Basic Partitioning Algorithm**

   - Partition a database D of n objects into k clusters.
   - Given k, find the k clusters that optimise the partitioning criterion.
     - Global optimum: exhaustively enumerate all partitions.
     - Heuristic methods: k-means and k-medoids algorithms.
     - k-means: each cluster is represented by the center of the cluster.
     - k-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster.

   **The K-Means Clustering Method**

   Given k, the k-means algorithm is implemented in 4 steps:

   1. Partition the objects into k nonempty subsets.
   2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
   3. Assign each object to the cluster with the nearest seed point.
   4. Go back to Step 2; stop when no objects change clusters.

   **The K-Means Clustering Method** (continued)

   [Figure: a sequence of scatter plots, both axes running from 0 to 10, showing the clustering at successive steps.]
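The four k-means steps listed above can be sketched directly in R. This is an illustrative sketch under stated assumptions, not the author's code and not the algorithm used inside R's own `kmeans()`: the function name `simple_kmeans`, its `init` argument, and the toy data are all hypothetical, and the sketch does not handle a cluster becoming empty.

```r
# Hypothetical helper following the four steps; assumes nrow(x) >= k
# and that no cluster becomes empty during the iterations.
simple_kmeans <- function(x, k,
                          init = sample(rep_len(seq_len(k), nrow(x))),
                          max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: start from a partition of the objects into k nonempty subsets.
  cluster <- init
  for (iter in seq_len(max_iter)) {
    # Step 2: the seed points are the centroids (column means) of each cluster.
    centers <- do.call(rbind, lapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # Step 3: squared Euclidean distance of every object to every centroid,
    # then assign each object to the nearest one.
    d <- sapply(seq_len(k), function(j)
      rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- apply(d, 1, which.min)
    # Step 4: stop when no object changes cluster, otherwise repeat.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Toy data (made up): two well-separated groups of three points each.
toy <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))

# A deliberately poor, fixed starting partition keeps the run repeatable.
fit <- simple_kmeans(toy, k = 2, init = c(1, 2, 1, 2, 1, 2))
```

Passing an explicit `init` makes the run deterministic; leaving it out uses a random starting partition, in which case different runs may end in different clusterings.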
4. **Comments on K-Means: Strengths**

   - Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations. Normally, $k, t \ll n$.

   **Comments on K-Means: Weaknesses**

   - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
   - Applicable only when the mean is defined. What about categorical data?
   - Need to specify k, the number of clusters, in advance.
   - Unable to handle noisy data and outliers.
   - Not suitable for non-convex clusters.

   **KMeans in R**

   ```r
   clusters <- 5

   # Load the wine data; the file is expected to define a data frame `wine`.
   load("wine.Rdata")

   # Cluster on columns 2 and 3 (Alcohol and Malic in the plot).
   wine.cl <- kmeans(wine[, 2:3], clusters)

   # Plot the observations coloured by cluster, then overlay the centres.
   plot(wine[, 2:3], col = wine.cl$cluster)
   points(wine.cl$centers, pch = 19, cex = 1.5, col = 1:clusters)

   # Save a copy of the plot as a PDF.
   dev.copy(device = pdf, file = "wine-clusters.pdf")
   dev.off()
   ```

   [Figure: scatter plot of Malic against Alcohol, with points coloured by cluster and the cluster centres overlaid.]

   **Rattle: Hierarchical Variable Cluster**

   [Slide content not preserved in the transcript.]

   **Rattle: Hierarchical Data Cluster**

   [Slide content not preserved in the transcript.]

   **Summary**

   - Cluster analysis is unsupervised learning.
   - Useful for partitioning a very large population, perhaps for data mining each sub-population separately.
   - Often more effective under expert guidance.
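Two of the weaknesses noted for k-means, termination at a local optimum and the need to fix k in advance, can be explored with a short experiment. The sketch below does not use the annealing or genetic techniques mentioned in the slides; it swaps in the simpler idea of repeated random starts through the `nstart` argument of R's `kmeans()`, and compares the total within-cluster sum of squares across several values of k. The built-in `iris` measurements are an assumption standing in for the wine data, which does not accompany the slides.

```r
# Assumption: the built-in iris measurements stand in for the wine data.
x <- iris[, 1:4]

set.seed(42)  # arbitrary seed, only to make the random starts repeatable

# Total within-cluster sum of squares for k = 2..6, keeping the best of
# 25 random starts for each k.
ks <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Total sum of squares about the overall mean, for reference.
tss <- sum(scale(x, scale = FALSE)^2)

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting `wss` against `ks` and looking for the point where the curve flattens is one informal way to choose k; it is a heuristic, and expert judgement about the data still matters.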