DATA MINING WITH
CLUSTERING AND
CLASSIFICATION
Overview
• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Classification
• Classification examples
• Conclusion
Definition
• Clustering can be considered the most important
unsupervised learning technique; like every other
problem of this kind, it deals with finding structure in a
collection of unlabeled data.
• Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
• A cluster is therefore a collection of objects that are
“similar” to one another and “dissimilar” to the
objects belonging to other clusters.
Mu-Yu Lu, SJSU
Why clustering?
A few good reasons ...
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnosis
Which method should I use?
• Type of attributes in the data
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation
Major Existing clustering methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Measuring Similarity
• Dissimilarity/similarity metric: similarity is expressed in
terms of a distance function, which is typically a metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Professor Lee, Sin-Min
Distance based method
• In the figure on the original slide we can easily identify the 4 clusters into
which the data can be divided; the similarity criterion is distance: two or more
objects belong to the same cluster if they are “close” according to a given
distance. This is called distance-based clustering.
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as a
singleton cluster.
2. Recursively merge the most
appropriate clusters.
3. Stop when k clusters are
reached.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into
smaller clusters.
3. Stop when k clusters are
reached.
General steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N distance (or
similarity) matrix, the basic process of hierarchical clustering
(defined by S.C. Johnson in 1967) is this:
• Start by assigning each item to its own cluster, so that if you have N
items, you now have N clusters, each containing just one item.
Let the distances (similarities) between the clusters be the same as
the distances (similarities) between the items they contain.
• Find the closest (most similar) pair of clusters and merge them
into a single cluster, so that you now have one cluster fewer.
• Compute the distances (similarities) between the new cluster and
each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into K clusters.
Mu-Yu Lu, SJSU
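These steps map directly onto standard library routines. Below is a minimal sketch, assuming SciPy (the slides name no library) and toy 2-D points:

```python
# Hypothetical sketch of the bottom-up procedure above using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 0]])  # N = 5 items

# The N*N distance information of step 1, in condensed form.
dists = pdist(points, metric="euclidean")

# Steps 2-3: repeatedly merge the closest pair of clusters.
Z = linkage(dists, method="single")

# Step 4: cut the tree to obtain K = 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```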
Exclusive vs. non exclusive
clustering
• In the first case data are grouped in an
exclusive way, so that if a certain datum
belongs to a definite cluster it cannot
be included in another cluster. A
simple example is a separation of
points achieved by a straight line on a
two-dimensional plane.
• The second type, overlapping
clustering, instead uses fuzzy sets to
cluster data, so that each point may
belong to two or more clusters with
different degrees of membership.
Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset
and relocate points between clusters
(in contrast to the visit-once approach
of hierarchical methods).
This recursive relocation yields higher-quality clusters.
Probabilistic clustering
1. Data are assumed to be drawn from a
mixture of probability distributions.
2. Use the mean and variance of each
distribution as the parameters of a cluster.
3. Each point has a single cluster membership.
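A common concrete instance of this idea is the Gaussian mixture model fitted by expectation-maximization. The sketch below is illustrative only; scikit-learn and the synthetic data are assumptions, not the slides' own material:

```python
# A minimal sketch of probabilistic clustering with a Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Step 1: data drawn from a mixture of two distributions.
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Step 2: each cluster is parameterized by a mean and (co)variance.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Step 3: hard, single-cluster membership via the most probable component.
labels = gmm.predict(data)
print(gmm.means_, labels[:5])
```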
Single-Linkage Clustering (hierarchical)
• The N*N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence
numbers 0, 1, …, (n-1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is
denoted (m)
• The proximity between clusters (r) and (s)
is denoted d [(r),(s)]
Mu-Yu Lu, SJSU
The algorithm is composed of the
following steps:
• Begin with the disjoint clustering having level
L(0) = 0 and sequence number m = 0.
• Find the least dissimilar pair of clusters in the
current clustering, say pair (r), (s), according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters
in the current clustering.
The algorithm is composed of the following steps (cont.):
• Increment the sequence number : m = m +1. Merge
clusters (r) and (s) into a single cluster to form the next
clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
• Update the proximity matrix, D, by deleting the rows and
columns corresponding to clusters (r) and (s) and adding
a row and column corresponding to the newly formed
cluster. The proximity between the new cluster, denoted
(r,s) and old cluster (k) is defined in this way:
d[(k), (r,s)] = min { d[(k),(r)], d[(k),(s)] }
• If all objects are in one cluster, stop. Else, go to step 2.
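A direct, unoptimized Python rendering of these steps follows; the dictionary-based proximity matrix and the name-joining convention are illustrative assumptions:

```python
def single_linkage(D, names):
    """Johnson's single-linkage loop. D maps alphabetically sorted
    name pairs, e.g. ("MI", "TO"): 138, to dissimilarities."""
    clusters = list(names)
    merges = []  # (new cluster, level L(m)) for m = 1, 2, ...
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair (r), (s).
        r, s = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda pair: D[pair])
        merged = f"{r}/{s}"
        merges.append((merged, D[(r, s)]))  # L(m) = d[(r),(s)]
        # Step 3: d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
        for k in clusters:
            if k not in (r, s):
                D[tuple(sorted((k, merged)))] = min(
                    D[tuple(sorted((k, r)))], D[tuple(sorted((k, s)))])
        # Step 4: replace (r) and (s) with the merged cluster; loop
        # until a single cluster remains.
        clusters = [c for c in clusters if c not in (r, s)] + [merged]
    return merges
```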
Hierarchical clustering example
• Let’s now see a simple example: a hierarchical clustering
of distances in kilometers between some Italian cities.
The method used is single-linkage.
• Input distance matrix (L = 0 for all the clusters); the matrix itself appeared as an image in the original slides.
• The nearest pair of cities is MI and TO, at distance 138. These
are merged into a single cluster called "MI/TO". The level of
the new cluster is L(MI/TO) = 138 and the new sequence
number is m = 1.
Then we compute the distance from this new compound object
to all other objects. In single link clustering the rule is that
the distance from the compound object to another object is
equal to the shortest distance from any member of the
cluster to the outside object. So the distance from "MI/TO"
to RM is chosen to be 564, which is the distance from MI to
RM, and so on.
• After merging MI with TO we obtain the
updated distance matrix (also an image in the original slides).
• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a
new cluster called NA/RM
L(NA/RM) = 219
m = 2
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM
into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3
• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM
and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4
• Finally, we merge the last two clusters at level 295.
• The process is summarized by a hierarchical tree (dendrogram), shown as an image in the original slides.
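The worked example can be reproduced in a few lines. The input matrix was an image in the original slides; the values below are the commonly cited distances for this classic example, so every entry beyond the quoted merge levels (138, 219, 255, 268, 295) is an assumption:

```python
# Reconstructed Italian-cities distance matrix (assumed values; only the
# merge levels are confirmed by the slides).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],   # BA
    [662,   0, 295, 468, 268, 400],   # FI
    [877, 295,   0, 754, 564, 138],   # MI
    [255, 468, 754,   0, 219, 869],   # NA
    [412, 268, 564, 219,   0, 669],   # RM
    [996, 400, 138, 869, 669,   0],   # TO
])

Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels: [138. 219. 255. 268. 295.]
```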
K-means algorithm
1. It accepts as input the number of clusters to group
the data into, and the dataset to cluster.
2. It then creates the first K initial clusters (K=
number of clusters needed) from the dataset by
choosing K rows of data randomly from the
dataset. For Example, if there are 10,000 rows of
data in the dataset and 3 clusters need to be
formed, then the first K=3 initial clusters will be
created by selecting 3 records randomly from the
dataset as the initial clusters. Each of the 3 initial
clusters formed will have just one row of data.
3. The K-Means algorithm calculates the arithmetic mean of
each cluster formed in the dataset. The arithmetic mean of
a cluster is the mean of all the individual records in the cluster. In
each of the first K initial clusters, there is only one record, so the
arithmetic mean of a cluster with one record is simply the set of
values that make up that record. For example, if the dataset is a set
of Height, Weight and Age measurements for students at a
university, where a record P in the dataset S is represented by an
Age, Height and Weight measurement, then P =
{Age, Height, Weight}. A record containing
the measurements of a student John would be represented as
John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm
and Weight = 80 pounds. Since there is only one record in
each initial cluster, the arithmetic mean of a cluster with only
the record for John as a member = {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the initial
clusters. Each record is assigned to the nearest cluster (the cluster which it is
most similar to) using a measure of distance or similarity like the Euclidean
Distance Measure or Manhattan/City-Block Distance Measure.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-
calculates the arithmetic mean of all the clusters in the dataset. The arithmetic
mean of a cluster is the arithmetic mean of all the records in that cluster. For
example, if a cluster contains two records, John = {20, 170, 80} and
Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as
P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2,
Height_mean = (170 + 160)/2 and Weight_mean = (80 + 120)/2. The arithmetic
mean of this cluster is therefore {25, 165, 100}. This new arithmetic mean
becomes the center of the cluster.
Following the same procedure, new cluster centers are formed for all the
existing clusters.
6. K-Means re-assigns each record in the dataset to only one of the
new clusters formed. A record or data point is assigned to the
nearest cluster (the cluster which it is most similar to) using a
measure of distance or similarity
7. The preceding steps are repeated until stable clusters are formed
and the K-Means clustering procedure is completed. Stable clusters
are formed when new iterations of the K-Means algorithm no
longer change the clusters, i.e. the cluster center (arithmetic mean)
of each cluster is the same as the old cluster center. There are
different techniques for determining when stable clusters have
formed, i.e. when the K-Means procedure is complete.
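A compact NumPy sketch of steps 1-7 (illustrative; empty clusters and alternative distance measures are not handled):

```python
import numpy as np

def k_means(data, k, rng=np.random.default_rng(0)):
    # Step 2: pick K rows at random as the initial clusters.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    while True:
        # Steps 4/6: assign each record to the nearest center (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each center as the arithmetic mean of its
        # records (a real implementation would guard against empty clusters).
        new_centers = np.array([data[labels == j].mean(axis=0)
                                for j in range(k)])
        # Step 7: stop when the centers no longer move (stable clusters).
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers

# Example with records like John = {Age, Height, Weight}:
records = np.array([[20, 170, 80], [30, 160, 120],
                    [21, 172, 78], [29, 158, 118]], dtype=float)
labels, centers = k_means(records, k=2)
print(labels, centers)
```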
Classification
• Classification Problem Overview
• Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
Goal: Provide an overview of the classification
problem and introduce some of the basic
algorithms
Classification Examples
• Teachers classify students’ grades as A,
B, C, D, or F.
• Identify mushrooms as poisonous or
edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
• Pattern recognition
Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
The corresponding decision tree (redrawn from the slide's figure):
x >= 90 → A
x < 90: x >= 80 → B
        x < 80: x >= 70 → C
                x < 70: x >= 60 → D
                        x < 60 → F
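The same rule chain written as a small function (a sketch of the slide's logic):

```python
def grade(x: float) -> str:
    # Thresholds checked from highest to lowest, as in the rules above.
    if x >= 90: return "A"
    if x >= 80: return "B"
    if x >= 70: return "C"
    if x >= 60: return "D"
    return "F"

assert grade(85) == "B" and grade(59) == "F"
```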
Classification Techniques
• Approach:
1. Create specific model by evaluating
training data (or using domain
experts’ knowledge).
2. Apply model developed to new data.
• Classes must be predefined
• Most common techniques use DTs,
NNs, or are based on distances or
statistical methods.
Defining Classes: the original slides illustrate two views, partitioning based and distance based.
Classification Using Regression
• Division: Use regression function to
divide area into regions.
• Prediction: Use regression function to
predict a class membership function.
Input includes desired class.
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Division and Prediction (shown graphically in the original slides).
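As one hedged reading of classification using regression, the sketch below fits one linear regression per class on the height data above (one-vs-rest on Output1; scikit-learn, and the encoding itself, are assumptions) and predicts the class whose membership function scores highest:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[1.6], [2.0], [1.9], [1.88], [1.7], [1.85], [1.6],
                    [1.7], [2.2], [2.1], [1.8], [1.95], [1.9], [1.8], [1.75]])
output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short",
           "Short", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium",
           "Medium"]

# One regression per class: target is 1 for members, 0 otherwise.
models = {c: LinearRegression().fit(
              heights, [1.0 if y == c else 0.0 for y in output1])
          for c in sorted(set(output1))}

def predict(h):
    # Class whose membership function gives the largest predicted value.
    scores = {c: m.predict([[h]])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

print(predict(1.65), predict(2.05))  # roughly: Short, Tall
```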
Classification Using Distance
• Place items in class to which they are
“closest”.
• Must determine distance between an
item and a class.
• Classes represented by
– Centroid: Central value.
– Medoid: Representative point.
– Individual points
• Algorithm: KNN
K Nearest Neighbor
(KNN):
• Training set includes classes.
• Examine K items near item to be
classified.
• The new item is placed in the class most
common among those K items.
• O(q) for each tuple to be classified.
(Here q is the size of the training set.)
KNN (illustrated with a figure in the original slides)
KNN Algorithm
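The algorithm slide's pseudocode did not survive extraction; the minimal sketch below (NumPy and Euclidean distance are assumptions) follows the description above:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, item, k=3):
    # Examine the K training items nearest to the new item; this scan is
    # the O(q) cost per tuple noted above (q = len(train_X)).
    dists = np.linalg.norm(train_X - item, axis=1)
    nearest = np.argsort(dists)[:k]
    # Place the item in the class most common among those K neighbors.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

train_X = np.array([[1.6], [1.7], [1.9], [2.0], [2.1]])
train_y = ["Short", "Short", "Medium", "Tall", "Tall"]
print(knn_classify(train_X, train_y, np.array([1.95]), k=3))  # "Tall"
```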
Classification Using Decision
Trees
• Partitioning based: Divide search
space into rectangular regions.
• Tuple placed into class based on the
region within which it falls.
• DT approaches differ in how the tree is
built: DT Induction
• Internal nodes associated with attribute
and arcs with values for that attribute.
• Algorithms: ID3, C4.5, CART
Decision Tree
Given:
– D = {t1, …, tn} where ti=<ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C={C1, …., Cm}
Decision or Classification Tree is a tree
associated with D such that
– Each internal node is labeled with attribute, Ai
– Each arc is labeled with predicate which can
be applied to attribute at parent
– Each leaf node is labeled with a class, Cj
DT Induction, and a comparison of balanced versus deep trees, were illustrated graphically in the original slides.
ID3
• Creates the tree using information-theory concepts and
tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest
information gain, using entropy as the basis for the
calculation.
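The gain computation ID3 relies on can be sketched as follows (illustrative helper names; not the full tree-building loop):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Gain = H(labels) - weighted average of H(labels | attr value).
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder

# ID3 splits on the attribute with the highest gain, e.g.:
rows = [{"Gender": "F"}, {"Gender": "M"}, {"Gender": "F"}, {"Gender": "M"}]
labels = ["Short", "Tall", "Medium", "Tall"]
print(information_gain(rows, labels, "Gender"))  # 1.0
```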
Conclusion
• Clustering and classification are very useful in data mining
• Applicable to both text-based and graphical data
• Help simplify data complexity
• Detect hidden patterns in data
Reference
• M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Sin-Min Lee, San Jose State University
• Mu-Yu Lu, SJSU
• Silberschatz, Korth, Sudarshan, Database System Concepts