Data Applied: Clustering


Transcript

• 1. Data-Applied.com: Clustering
• 2. Introduction
Definition:
Clustering techniques apply when, rather than predicting the class, we simply want the instances to be divided into natural groups
Problem statement:
Given a desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function
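As a concrete (toy) illustration of this objective, the sketch below computes the within-cluster sum of squared Euclidean distances for a fixed partition; the function name and data are made up for illustration, not taken from the slides:

```python
import numpy as np

def clustering_cost(points, labels, centroids):
    # Sum of squared Euclidean distances from each point to the
    # centroid of its assigned cluster -- one common choice of
    # distance-based measurement function.
    return sum(np.sum((points[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

# Toy dataset: N = 6 points, K = 2 clusters
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
print(clustering_cost(points, labels, centroids))
```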
• 3. Algorithm: BIRCH
Data Applied uses the BIRCH algorithm for clustering
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies
BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources
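For readers who want to try BIRCH directly, scikit-learn ships a Birch estimator; below is a minimal usage sketch on synthetic data (generic library usage, not the Data Applied service itself):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data: 1000 points drawn around 3 centers
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# threshold and branching_factor play the roles of the T and B
# parameters described on the CF-tree slide later in this deck
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(labels[:10])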
• 4. Why BIRCH?
It takes into account that the whole data set may not fit in memory, as many other algorithms require
It minimizes I/O time by producing results from a single scan of the data
It also offers an option for handling outliers
• 5. Background
Given a data set of N d-dimensional points $X_i$, $i = 1, \dots, N$:
Centroid $X_0$:
$X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$
Radius $R$ (average distance from member points to the centroid):
$R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$
Diameter $D$ (average pairwise distance within the cluster):
$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
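These definitions translate directly into NumPy; a small sketch with randomly generated data (the data is made up for illustration):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # N = 100 points, d = 3
N = len(X)

x0 = X.sum(axis=0) / N                           # centroid X0
R = np.sqrt(np.sum((X - x0) ** 2) / N)           # radius R
diffs = X[:, None, :] - X[None, :, :]            # all pairwise Xi - Xj
D = np.sqrt(np.sum(diffs ** 2) / (N * (N - 1)))  # diameter D
print(x0, R, D)
```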
• 6. Background
We also define five distance metrics between clusters:
$D_0$: Euclidean distance between centroids
$D_1$: Manhattan distance between centroids
$D_2$: average inter-cluster distance
$D_3$: average intra-cluster distance
$D_4$: variance increase distance
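Spelled out on raw points (the CF-based shortcuts come two slides later), these five metrics for two clusters A and B look roughly like this; the function names are made up for illustration:

```python
import numpy as np

def d0(A, B):  # Euclidean distance between the two centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def d1(A, B):  # Manhattan distance between the two centroids
    return np.abs(A.mean(axis=0) - B.mean(axis=0)).sum()

def d2(A, B):  # average inter-cluster distance
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum() / (len(A) * len(B)))

def d3(A, B):  # average intra-cluster distance of the merged cluster
    M = np.vstack([A, B])
    diffs = M[:, None, :] - M[None, :, :]
    return np.sqrt((diffs ** 2).sum() / (len(M) * (len(M) - 1)))

def d4(A, B):  # variance increase caused by merging A and B
    sse = lambda X: ((X - X.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([A, B])) - sse(A) - sse(B)
```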
• 7. Clustering feature
A clustering feature is a triple summarizing the information we maintain about a cluster:
$CF = (N, LS, SS)$
$N$ is the number of data points in the cluster
$LS$ is the linear sum of the data points
$SS$ is the square sum of the data points
Additivity theorem: if $CF_1$ and $CF_2$ are the features of two disjoint clusters, then the merged cluster is represented by $CF_1 + CF_2$
We can calculate $D_0, D_1, \dots, D_4$ using clustering features alone
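A minimal sketch of the CF triple and its additivity (a hypothetical class, not Data Applied's code); note how the radius falls out of (N, LS, SS) alone, with no access to the original points:

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) for one (sub)cluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), float((p ** 2).sum())

    def __add__(self, other):
        # Additivity: the CF of two merged disjoint clusters is the
        # component-wise sum of their CFs.
        merged = CF(self.ls)
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, expanded from the definition of R
        return np.sqrt(max(self.ss / self.n - (self.centroid() ** 2).sum(), 0.0))

# Merging two single-point clusters
a, b = CF([0.0, 0.0]), CF([1.0, 1.0])
print((a + b).centroid(), (a + b).radius())
```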
• 8. CF Trees
A CF tree is a height-balanced tree with two parameters: branching factor $B$ and threshold $T$
Each non-leaf node contains at most $B$ entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its $i$-th child node and $CF_i$ is the clustering feature of the subcluster represented by that child
A leaf node contains at most $L$ entries, each of the form $[CF_i]$; it also has two pointers, prev and next, which chain all leaf nodes together
The tree size is a function of $T$: the larger $T$ is, the smaller the tree
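To make the threshold's role concrete, here is a sketch of the leaf-level insertion rule, reusing the hypothetical CF class (and numpy import) from the previous sketch; it keeps a flat list of entries and omits node splitting:

```python
def insert(leaf_entries, point, T, L=8):
    # Absorb the point into the closest leaf entry if that entry's
    # radius would stay <= T after absorbing it; otherwise start a
    # new entry, up to the leaf capacity L.
    point = np.asarray(point, dtype=float)
    new = CF(point)
    if leaf_entries:
        closest = min(leaf_entries,
                      key=lambda e: np.linalg.norm(e.centroid() - point))
        merged = closest + new
        if merged.radius() <= T:
            closest.n, closest.ls, closest.ss = merged.n, merged.ls, merged.ss
            return
    if len(leaf_entries) < L:
        leaf_entries.append(new)
    # A real CF tree would split the leaf node when it overflows.
```

A smaller T forces entries to stay tight, producing more entries and a larger tree; a larger T lets each entry absorb more points, shrinking the tree, which matches the size behavior noted above.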
• 9. BIRCH Algorithm
• 10. Building of CF tree
• 11. Using the DA-API’s web interface to execute Clustering
• 12. Step 1: Selection of data
• 13. Step 2: Select Cluster
• 14. Step 3: Clustering Result
• 15. Visit more self-help tutorials
• Pick a tutorial of your choice and browse through it at your own pace.
• 16. The tutorials section is free and self-guided, and does not include additional support.
• 17. Visit us at www.dataminingtools.net