SlideShare a Scribd company logo
Welcome To My Presentation
On
Clustering Analysis
Submitted By
Ruhul Amin
Department of Statistics
Pabna University of Science & Technology
Department of Statistics, Pabna University of Science & Technology
OUTLINEOF PRESENTATION
 Clustering : basic concept
 Types of clustering
 Clustering techniques
 K-means clustering
 K-means clustering algorithm
 Requirements
 Applications
 Advantages & Disadvantages
 Conclusion
Department of Statistics, Pabna University of Science & Technology 2
CLUSTERING: BASICCONCEPT
CLUSTERING
Clustering is traditionally viewed as an unsupervised method for data analysis. Clustering is the task of
the population or data points into a number of groups such that data points in the same groups are more
to other data points in the same group than those in other groups. In simple words, the aim is to segregate
groups with similar traits and assign them into clusters. It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information
retrieval, bioinformatics, data compression, and computer graphics.
Department of Statistics, Pabna University of Science & Technology 3
TYPESOF CLUSTERING
Broadly speaking, clustering can be divided into two subgroups :
HARD CLUSTERING:
In hard clustering, each data point either belongs to a cluster completely or not.
As an instance, we want the algorithm to read all of the tweets and determine if a tweet is a positive or a negative
tweet.
SOFT CLUSTERING:
In the soft clustering method, each data point will not completely belong to one cluster, instead, it can be a member of
more than one cluster it has a set of membership coefficients corresponding to the probability of being in a given
cluster.
As an instance, if you are attempting to forecast the rating changes for the counterparties who you trade with. The
algorithm can create clusters for each rating and indicate the likelihood of a counterparty to belong to a cluster.
Department of Statistics, Pabna University of Science & Technology 4
TYPES OF CLUSTERING
Is clustering typically …?
A. Supervised
B. Unsupervised
Department of Statistics, Pabna University of Science & Technology 5
Supervised
Unsupervised
CLUSTERING TECHNIQUES
Department of Statistics, Pabna University of Science & Technology 6
CLUSTERINGTECHNIQUES
A CATEGORIZATION OF MAJOR CLUSTERING METHODS
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
Department of Statistics, Pabna University of Science & Technology 7
CLUSTERINGTECHNIQUES
Partitional clustering decomposes a data set into a set of disjoint clusters.
Partitional clustering (or partitioning clustering) are clustering method
used to classify observations, within a data set, into multiple groups based on their
similarity. The algorithms require the analyst to specify the number of clusters to
be generated (N ≥ K). This course describes the
commonly used partitional, including: k means clustering
Department of Statistics, Pabna University of Science & Technology 8
K MEANSCLUSTERING
K-means clustering (Macqueen, 1967) is a method commonly used to automatically partition a data set
into k groups. It proceeds by selecting k initial cluster.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.E., Data without defined
categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the
variable K (N ≥ K). The algorithm works iteratively to assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. The results of the k-means clustering algorithm are:
1.the centroids of the K clusters, which can be used to label new data.
2.labels for the training data (each data point is assigned to a single cluster).
.
Department of Statistics, Pabna University of Science & Technology 9
K MEANS CLUSTERING ALGORITHMS
AS, YOU CAN SEE, K-MEANS ALGORITHM IS COMPOSED OF 3 STEPS:
STEP 1: INITIALIZATION
The first thing k-means does, is randomly choose K examples (data points) from the dataset as initial
centroids and that’s simply because it does not know yet where the center of each cluster is. (A
centroid is the center of a cluster).
STEP 2: CLUSTER ASSIGNMENT
Then, all the data points that are the closest (similar) to a centroid will create a cluster. If we’re using
the Euclidean distance between data points and every centroid, a straight line is drawn between two
centroids, then a perpendicular bisector (boundary line) divides this line into two clusters.
STEP 3: MOVE THE CENTROID
Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean of all the
examples (data points) in a cluster.
We’ll keep repeating step 2 and 3 until the centroids stop moving, in other words, k-means algorithm is
converged.
Department of Statistics, Pabna University of Science & Technology 10
K MEANS CLUSTERING ALGORITHMS
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 9 10
Step 1 Step 2 Step 3
Step 4
Department of Statistics, Pabna University of Science & Technology 11
K MEANSCLUSTERINGALGORITHM
CLUSTER ANALYSIS – EXAMPLE
We will work with a real-number example of the well-known k-means clustering algorithm.
We will try to find clusters in the below dataset, consisting of 5 points.
Department of Statistics, Pabna University of Science & Technology 12
K MEANSCLUSTERINGALGORITHMS
STEP 1: SET CLUSTER QUANTITY
The k-means algorithm requires you to set a number of clusters k beforehand. Here, we take k=2(the data look like there
clusters – one on the bottom left and one on the top right).
STEP 2: ASSIGNMENT OF DATA POINTS
In the assignment step, each data point gets assigned to the nearest cluster centroid. The cluster centroids can be seen as
centers of gravity within each cluster. To start with, we chose random points as centroids. Here, we take point A(1,1)
Instead of taking actual data points, we could have taken completely random points as well.
To calculate the nearest cluster centroid for each data point, you need a distance measure. There is a large number of
available metrics doing the job. We will work with the ordinary Euclidian distance.
Department of Statistics, Pabna University of Science & Technology 13
K MEANSCLUSTERINGALGORITHMS
STEP 3: MOVE THE CENTROID
Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean of
all the examples in a cluster.
We’ll keep repeating step 2 and 3 until the centroids stop moving, in other words, k-means
algorithm is converged.
Department of Statistics, Pabna University of Science & Technology 14
K MEANSCLUSTERINGALGORITHMS
Department of Statistics, Pabna University of Science & Technology 15
K MEANSCLUSTERINGALGORITHMS
K MEANS CLUSTERINGALGORITHMS
Department of Statistics, Pabna University of Science & Technology 17
Requirements
Requirements
Requirements of clustering in data mining:-
1. Scalability - we need highly scalable clustering algorithms to deal with large databases.
2. Ability to deal with different kind of attributes - algorithms should be capable to be applied on any kind of data such as
interval based (numerical) data, categorical, binary data.
3. Discovery of clusters with attribute shape - the clustering algorithm should be capable of detect cluster of arbitrary
shape. The should not be bounded to only distance measures that tend to find spherical cluster of small size.
4. High dimensionality - the clustering algorithm should not only be able to handle low- dimensional data but also the high
dimensional space.
5. Ability to deal with noisy data - databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such
data and may lead to poor quality clusters.
6. Interpretability - the clustering results should be interpretable, comprehensible and usable.
Department of Statistics, Pabna University of Science & Technology 18
APPLICATIONS
HERE ARE 7 EXAMPLES OF CLUSTERING ALGORITHMS IN ACTION.
1. IDENTIFYING FAKE NEWS
How clustering works:
The way that the algorithm works is by taking in the content of the fake news article, the corpus,
examining the words used and then clustering them. These clusters are what helps the algorithm
determine which pieces are genuine and which are fake news. Certain words are found more
commonly in sensationalized, click-bait articles. When you see a high percentage of specific
terms in an article, it gives a higher probability of the material being fake news.
2. SPAM FILTER
How clustering works:
k-means clustering techniques have proven to be an effective way of identifying spam. The way
that it works is by looking at the different sections of the email (header, sender, and content). The
data is then grouped together.
These groups can then be classified to identify which are spam. Including clustering in the
classification process improves the accuracy of the filter to 97%. This is excellent news for
people who want to be sure they’re not missing out on your favorite newsletters and offers.
Department of Statistics, Pabna University of Science & Technology 19
APPLICATIONS
3. ASTRONOMY:
It helps to find groups of similar stars and galaxies.
4. GENOMICS:
It can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in
populations.
5. CLASSIFYING NETWORK TRAFFIC
How clustering works:
k-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic
types. The process is faster and more accurate than the previous autoclass method. By having precise information on traffic sources, you are able
to grow your site and plan capacity effectively.
6. IDENTIFYING FRAUDULENT OR CRIMINAL ACTIVITY
How clustering works:
By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups you are then able to
classify them into those that are real and which are fraudulent.
7. DOCUMENT ANALYSIS
HOW CLUSTERING WORKS:
Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different
themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the
paragraph.
8.CALL RECORD DETAIL ANALYSIS
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a
customer.
Department of Statistics, Pabna University of Science & Technology 20
K-means advantages and disadvantages
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples (data points).
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantage of k-means
Choosing k manually being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different
initial values and picking the best result. As k increases, you need advanced versions of k-means to
pick better values of the initial centroids (called k-means seeding).
Clustering data of varying sizes and density.
Clustering outliers.
Scaling with number of dimensions.
Department of Statistics, Pabna University of Science & Technology 21
CONCLUSION
Conclusion:
K means algorithm is useful for undirected knowledge discovery and is relatively simple.
K mean has found wide spread usage in lot of field raging from unsupervised learning of neural
,Pattern recognitions, classification analysis, Artificial intelligence ,Image processing and many others
Department of Statistics, Pabna University of Science & Technology 22

More Related Content

What's hot

K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
Afzaal Subhani
 
K MEANS CLUSTERING
K MEANS CLUSTERINGK MEANS CLUSTERING
K MEANS CLUSTERING
singh7599
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Kamalakshi Deshmukh-Samag
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
Kamal Acharya
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
Krish_ver2
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
Valerii Klymchuk
 
Clustering
ClusteringClustering
Clustering
M Rizwan Aqeel
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
Vinit Dantkale
 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning
ANKUSH PAL
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
rajshreemuthiah
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
YaswanthHariKumarVud
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
Neha Kulkarni
 
K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
Ujjawal
 
KNN
KNN KNN
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
Archana Swaminathan
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
Mohammad Junaid Khan
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 

What's hot (20)

K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
 
K MEANS CLUSTERING
K MEANS CLUSTERINGK MEANS CLUSTERING
K MEANS CLUSTERING
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Clustering
ClusteringClustering
Clustering
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
 
KNN
KNN KNN
KNN
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 

Similar to Presentation on K-Means Clustering

Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
IRJET Journal
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
DrGnaneswariG
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
IJCSIS Research Publications
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
IOSR Journals
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
Clustering & classification
Clustering & classificationClustering & classification
Clustering & classification
Jamshed Khan
 
47 292-298
47 292-29847 292-298
47 292-298
idescitation
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
Editor IJCATR
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
refedey275
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
IRJET Journal
 
Cluster analysis (2).docx
Cluster analysis (2).docxCluster analysis (2).docx
Cluster analysis (2).docx
YaseenRashid4
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
GandhiMathy6
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
KathleneNgo
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
Natasha Grant
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
IJCSIS Research Publications
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 

Similar to Presentation on K-Means Clustering (20)

Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Clustering & classification
Clustering & classificationClustering & classification
Clustering & classification
 
47 292-298
47 292-29847 292-298
47 292-298
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
 
Cluster analysis (2).docx
Cluster analysis (2).docxCluster analysis (2).docx
Cluster analysis (2).docx
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 

Recently uploaded

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 

Recently uploaded (20)

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 

Presentation on K-Means Clustering

  • 1. Welcome To My Presentation On Clustering Analysis Submitted By Ruhul Amin Department of Statistics Pabna University of Science & Technology Department of Statistics, Pabna University of Science & Technology
  • 2. OUTLINEOF PRESENTATION  Clustering : basic concept  Types of clustering  Clustering techniques  K-means clustering  K-means clustering algorithm  Requirements  Applications  Advantages & Disadvantages  Conclusion Department of Statistics, Pabna University of Science & Technology 2
  • 3. CLUSTERING: BASICCONCEPT CLUSTERING Clustering is traditionally viewed as an unsupervised method for data analysis. Clustering is the task of the population or data points into a number of groups such that data points in the same groups are more to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters. It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Department of Statistics, Pabna University of Science & Technology 3
  • 4. TYPESOF CLUSTERING Broadly speaking, clustering can be divided into two subgroups : HARD CLUSTERING: In hard clustering, each data point either belongs to a cluster completely or not. As an instance, we want the algorithm to read all of the tweets and determine if a tweet is a positive or a negative tweet. SOFT CLUSTERING: In the soft clustering method, each data point will not completely belong to one cluster, instead, it can be a member of more than one cluster it has a set of membership coefficients corresponding to the probability of being in a given cluster. As an instance, if you are attempting to forecast the rating changes for the counterparties who you trade with. The algorithm can create clusters for each rating and indicate the likelihood of a counterparty to belong to a cluster. Department of Statistics, Pabna University of Science & Technology 4
  • 5. TYPES OF CLUSTERING Is clustering typically …? A. Supervised B. Unsupervised Department of Statistics, Pabna University of Science & Technology 5 Supervised Unsupervised
  • 6. CLUSTERING TECHNIQUES Department of Statistics, Pabna University of Science & Technology 6
  • 7. CLUSTERINGTECHNIQUES A CATEGORIZATION OF MAJOR CLUSTERING METHODS Partitioning Methods Hierarchical Methods Density-based Methods Grid-based Methods Model-based Methods Department of Statistics, Pabna University of Science & Technology 7
  • 8. CLUSTERINGTECHNIQUES Partitional clustering decomposes a data set into a set of disjoint clusters. Partitional clustering (or partitioning clustering) are clustering method used to classify observations, within a data set, into multiple groups based on their similarity. The algorithms require the analyst to specify the number of clusters to be generated (N ≥ K). This course describes the commonly used partitional, including: k means clustering Department of Statistics, Pabna University of Science & Technology 8
  • 9. K MEANSCLUSTERING K-means clustering (Macqueen, 1967) is a method commonly used to automatically partition a data set into k groups. It proceeds by selecting k initial cluster. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.E., Data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K (N ≥ K). The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the k-means clustering algorithm are: 1.the centroids of the K clusters, which can be used to label new data. 2.labels for the training data (each data point is assigned to a single cluster). . Department of Statistics, Pabna University of Science & Technology 9
  • 10. K MEANS CLUSTERING ALGORITHMS AS, YOU CAN SEE, K-MEANS ALGORITHM IS COMPOSED OF 3 STEPS: STEP 1: INITIALIZATION The first thing k-means does, is randomly choose K examples (data points) from the dataset as initial centroids and that’s simply because it does not know yet where the center of each cluster is. (A centroid is the center of a cluster). STEP 2: CLUSTER ASSIGNMENT Then, all the data points that are the closest (similar) to a centroid will create a cluster. If we’re using the Euclidean distance between data points and every centroid, a straight line is drawn between two centroids, then a perpendicular bisector (boundary line) divides this line into two clusters. STEP 3: MOVE THE CENTROID Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean of all the examples (data points) in a cluster. We’ll keep repeating step 2 and 3 until the centroids stop moving, in other words, k-means algorithm is converged. Department of Statistics, Pabna University of Science & Technology 10
  • 11. K MEANS CLUSTERING ALGORITHMS 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Step 1 Step 2 Step 3 Step 4 Department of Statistics, Pabna University of Science & Technology 11
  • 12. K MEANSCLUSTERINGALGORITHM CLUSTER ANALYSIS – EXAMPLE We will work with a real-number example of the well-known k-means clustering algorithm. We will try to find clusters in the below dataset, consisting of 5 points. Department of Statistics, Pabna University of Science & Technology 12
  • 13. K MEANSCLUSTERINGALGORITHMS STEP 1: SET CLUSTER QUANTITY The k-means algorithm requires you to set a number of clusters k beforehand. Here, we take k=2(the data look like there clusters – one on the bottom left and one on the top right). STEP 2: ASSIGNMENT OF DATA POINTS In the assignment step, each data point gets assigned to the nearest cluster centroid. The cluster centroids can be seen as centers of gravity within each cluster. To start with, we chose random points as centroids. Here, we take point A(1,1) Instead of taking actual data points, we could have taken completely random points as well. To calculate the nearest cluster centroid for each data point, you need a distance measure. There is a large number of available metrics doing the job. We will work with the ordinary Euclidian distance. Department of Statistics, Pabna University of Science & Technology 13
  • 14. K MEANSCLUSTERINGALGORITHMS STEP 3: MOVE THE CENTROID Now, we have new clusters, that need centers. A centroid’s new value is going to be the mean of all the examples in a cluster. We’ll keep repeating step 2 and 3 until the centroids stop moving, in other words, k-means algorithm is converged. Department of Statistics, Pabna University of Science & Technology 14
  • 15. K MEANSCLUSTERINGALGORITHMS Department of Statistics, Pabna University of Science & Technology 15
  • 17. K MEANS CLUSTERINGALGORITHMS Department of Statistics, Pabna University of Science & Technology 17
  • 18. Requirements Requirements Requirements of clustering in data mining:- 1. Scalability - we need highly scalable clustering algorithms to deal with large databases. 2. Ability to deal with different kind of attributes - algorithms should be capable to be applied on any kind of data such as interval based (numerical) data, categorical, binary data. 3. Discovery of clusters with attribute shape - the clustering algorithm should be capable of detect cluster of arbitrary shape. The should not be bounded to only distance measures that tend to find spherical cluster of small size. 4. High dimensionality - the clustering algorithm should not only be able to handle low- dimensional data but also the high dimensional space. 5. Ability to deal with noisy data - databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters. 6. Interpretability - the clustering results should be interpretable, comprehensible and usable. Department of Statistics, Pabna University of Science & Technology 18
  • 19. APPLICATIONS HERE ARE 7 EXAMPLES OF CLUSTERING ALGORITHMS IN ACTION. 1. IDENTIFYING FAKE NEWS How clustering works: The way that the algorithm works is by taking in the content of the fake news article, the corpus, examining the words used and then clustering them. These clusters are what helps the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. When you see a high percentage of specific terms in an article, it gives a higher probability of the material being fake news. 2. SPAM FILTER How clustering works: k-means clustering techniques have proven to be an effective way of identifying spam. The way that it works is by looking at the different sections of the email (header, sender, and content). The data is then grouped together. These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%. This is excellent news for people who want to be sure they’re not missing out on your favorite newsletters and offers. Department of Statistics, Pabna University of Science & Technology 19
  • 20. APPLICATIONS 3. ASTRONOMY: It helps to find groups of similar stars and galaxies. 4. GENOMICS: It can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations. 5. CLASSIFYING NETWORK TRAFFIC How clustering works: k-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. The process is faster and more accurate than the previous autoclass method. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively. 6. IDENTIFYING FRAUDULENT OR CRIMINAL ACTIVITY How clustering works: By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups you are then able to classify them into those that are real and which are fraudulent. 7. DOCUMENT ANALYSIS HOW CLUSTERING WORKS: Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the paragraph. 8.CALL RECORD DETAIL ANALYSIS A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. Department of Statistics, Pabna University of Science & Technology 20
  • 21. K-means advantages and disadvantages Advantages of k-means Relatively simple to implement. Scales to large data sets. Guarantees convergence. Can warm-start the positions of centroids. Easily adapts to new examples (data points). Generalizes to clusters of different shapes and sizes, such as elliptical clusters. Disadvantage of k-means Choosing k manually being dependent on initial values. For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). Clustering data of varying sizes and density. Clustering outliers. Scaling with number of dimensions. Department of Statistics, Pabna University of Science & Technology 21
  • 22. CONCLUSION Conclusion: K means algorithm is useful for undirected knowledge discovery and is relatively simple. K mean has found wide spread usage in lot of field raging from unsupervised learning of neural ,Pattern recognitions, classification analysis, Artificial intelligence ,Image processing and many others Department of Statistics, Pabna University of Science & Technology 22