3/12/24, 11:25 AM K-Means Clustering Explained: Algorithm And Sklearn Implementation | by Marius Borcan | Towards Data Science
https://towardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822
K-Means Clustering Explained:
Algorithm And Sklearn
Implementation
Introduction to clustering and k-means clusters. Detailed overview
and sklearn implementation.
Marius Borcan
Published in Towards Data Science · 9 min read · Apr 13, 2020
K-Means clustering is one of the most widely used clustering algorithms in the
Data Science and Machine Learning world. It is very simple, yet it often delivers
good results. And because clustering is a very important step for
understanding a dataset, in this article we are going to discuss what
clustering is, why we need it and how k-means clustering can help
us in data science.
Article overview:
What is Clustering
What is Unsupervised Machine Learning
Clustering applications
K-Means Clustering explained
K-Means Clustering Algorithm
K-Means Clustering Implementation using Scikit-Learn and Python
What is Clustering
Clustering is the task of grouping data into two or more groups based on the
properties of the data, and more precisely based on certain patterns which are
more or less obvious in the data. The goal is to find those patterns so
that, given a certain item in our dataset, we are able
to place the item in the right group: it should be similar to other
items in that group, but different from items in other groups.
That means clustering actually consists of two parts: one is to identify
the groups and the other one is to try as much as possible to place every
item in the correct group.
The ideal result for a clustering algorithm is that two items in the same
group are as similar to each other as possible, while two items from different groups are
as different as possible.
Cluster example — Source: Wikipedia
A real-world example would be customer segmentation. As a business
selling various types of products/services, it would be very difficult to find
the perfect business strategy for each and every customer. But we can be
smart about it and try to group our customers into a few subgroups,
understand what the customers in each subgroup have in common and adapt our
business strategy for every group. Applying the wrong business
strategy to a customer could mean losing that customer, so it’s
important that we achieve a good clustering of our market.
What is Unsupervised Machine Learning
Unsupervised Machine Learning is a family of Machine Learning algorithms
that try to infer patterns in the data without any labeled examples. The
opposite is Supervised Machine Learning, where we have a labeled training set and
the algorithm will try to find the patterns in the data by matching inputs to
predefined outputs.
The reason I am writing about this is that clustering is an Unsupervised
Machine Learning task. When applying a clustering algorithm, we don’t
know the categories a priori (although we can set the number of categories
that we want to be identified).
The categories will emerge from the algorithm analyzing the data. Because
of that, we may call clustering an exploratory machine learning task,
because we only know the number of categories, but not their properties.
Then we can try playing around with different numbers of categories and see
if our data is better clustered or not.
And then we have to understand our clusters, which may actually be the
most difficult task. Let’s reuse the example with customer segmentation.
Let’s say we have run a clustering algorithm and we get our customers
clustered into 3 groups. But what are those groups? Why has the algorithm
decided that these customers fit into this group, and those customers fit into
that group? This is the part where you need very skilled data scientists along
with people who understand your business very well. They will look at the
data, try to analyze a few items in each category and try to guess a few
criteria. Then they will extrapolate from there once they find a valid pattern.
What happens when we get a new customer? We have to put this customer
into one of the clusters we already have, so we can run the data about this
customer through our algorithm and the algorithm will fit our customer into
one of our clusters. Also, in the future, after we acquire a large number of
new customers, we might need to rebuild our clusters — maybe new clusters
will appear or old clusters will disappear.
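As a rough sketch of that step for a centroid-based method such as K-Means: placing a new customer comes down to finding the nearest existing cluster centre. The feature names (monthly spend, visits per month), the cluster names and the centroid values below are all invented for illustration; they are not from the article.

```python
# Hypothetical cluster centres from an earlier clustering run:
# (monthly spend, visits per month). These numbers are made up.
centroids = {
    "occasional": (20.0, 1.0),
    "regular": (80.0, 6.0),
    "premium": (300.0, 12.0),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_cluster(customer, centroids):
    """Return the name of the centroid closest to the customer."""
    return min(centroids, key=lambda name: squared_distance(customer, centroids[name]))


new_customer = (95.0, 5.0)
print(assign_cluster(new_customer, centroids))  # regular
```

In practice the features would usually be scaled first, otherwise the feature with the largest range (here the spend) dominates the distance.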
Clustering applications
What are some common clustering applications? Before we fall in love with
clustering algorithms, we need to understand when we can use them and
when not.
The most common use case is the one we’ve already discussed:
customer/market segmentation. Companies run these types of analysis all
the time so they can understand their customers and markets and tailor their
business strategies, services and products for a better fit.
Another common use case is represented by information extraction tasks.
In information extraction tasks we often need to find relations between
entities, words, documents and so on. Now, if your intuition tells you we
have a higher chance of finding relations between items which are more
similar to each other, then you’re right, because clustering our data points
might help us figure out where to look for relations. (Note: if you want to
read more about information extraction, you can also try this article: Python
NLP Tutorial: Information Extraction and Knowledge Graphs).
Another very popular use case is to use clustering for image segmentation.
Image segmentation is the task of looking at an image and trying to identify
different items in that image. We can use clustering to analyze the pixels of
the image and to identify which pixels belong to which item in the image.
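To make the idea concrete, here is a minimal sketch that clusters the pixels of a tiny synthetic image by colour. The image, its size and the choice of two clusters are assumptions made for this example; real images need more care (colour spaces, spatial information, choice of K).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 4x6 RGB "image" (invented): dark left half, bright right half.
image = np.zeros((4, 6, 3), dtype=float)
image[:, :3] = [10, 10, 10]
image[:, 3:] = [240, 240, 240]

# One row per pixel, one column per colour channel.
pixels = image.reshape(-1, 3)

# n_init and random_state are set explicitly so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Reshape the per-pixel labels back into the image grid to get a segment mask.
mask = kmeans.labels_.reshape(image.shape[:2])
print(mask)
```

Each entry of `mask` is the cluster index of the corresponding pixel, so pixels with the same value belong to the same segment.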
K-Means Clustering explained
The K-Means clustering algorithm is an iterative clustering algorithm which
assigns each data point to exactly one of K clusters, where K is a number
we choose in advance.
As with any other clustering algorithm, it tries to make the items in one
cluster as similar as possible, while also making the clusters as different
from each other as possible. It does so by trying to minimize the sum of
squared distances between the data points in a cluster and the centroid of
that cluster. The centroid of a cluster is the mean of all
the data points in the cluster, which is also where the name
K-Means comes from.
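The quantity being minimized can be computed by hand. The sketch below uses four made-up 2-D points and compares a sensible grouping with a poor one; the function names `centroid` and `wcss` (within-cluster sum of squares) are chosen for this example.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def wcss(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total


# Two ways of splitting the same four points into two clusters.
good = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
bad = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

print(wcss(good))  # 1.0
print(wcss(bad))   # 99.0
```

The grouping that keeps nearby points together has the much smaller sum, which is the behaviour K-Means is looking for.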
In more technical terms, we try to make the data in each cluster as
homogeneous as possible, while making the clusters as heterogeneous as
possible with respect to each other. K is the number of clusters we try to obtain. We can
play around with K until we are satisfied with our results.
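One common way to "play around with K" is to fit the model for several values and compare the resulting sum of squared distances, which scikit-learn exposes as `inertia_`. The sketch below uses synthetic data with three blobs, invented for this example; the drop in inertia typically flattens once K reaches the number of natural groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data (invented): three well-separated 2-D blobs of 30 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps shrinking as K grows, so the usual heuristic is to look for the point where the improvement levels off rather than for the smallest value.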
K-Means Clustering algorithm
The K-Means Clustering algorithm works with a few simple steps.
1. Choose K, the number of clusters.
2. Shuffle the data and randomly assign each data point to one of the K
clusters and assign initial random centroids.
3. Calculate the squared distance between each data point and every centroid.
4. Reassign each data point to the closest centroid based on the
computation for step 3.
5. Recompute each centroid as the mean value of the data points in its cluster.
6. Repeat steps 3, 4, 5 until we no longer have to change anything in the
clusters.
The time needed to run the K-Means Clustering algorithm depends on the
size of the dataset, the K number we define and the patterns in the data.
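The steps can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: it assumes the initial centroids are K data points picked at random (one common way to initialise), and the function name `kmeans`, its parameters and the sample points are invented for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 4: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop when no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean(members)
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On data this well separated the two groups are recovered from any starting pair; on harder data the result depends on the initial centroids, which is why library implementations usually restart several times.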
K-Means Clustering Implementation using Scikit-Learn and Python
We are going to use the Scikit-Learn Python library to run a K-Means
Clustering algorithm on a small dataset.
Dataset for K-Means Clustering algorithm
The data consists of 3 texts about London, Paris and Berlin. We are going to
extract the summary sections of the Wikipedia articles about these 3 cities
and run them through our clustering algorithm.
We will then provide 3 new sentences of our own and check if they are
correctly assigned to individual clusters. If that happens, then we will know
our clustering algorithm worked.
K-Means Clustering implementation
First let’s install our dependencies.
# Sklearn library for our cluster
pip3 install scikit-learn
# We will use nltk (Natural Language Toolkit) to remove stopwords from the text
pip3 install nltk
# We will use the wikipedia library to download our texts from the Wikipedia pages
pip3 install wikipedia
Now let's define a small class to help us gather the texts from the Wikipedia
pages. We will store the text in 3 files on our local disk so that we don't
download the texts again every time we run the algorithm. Use the class as it is
right now for your first run of the algorithm; for a second run you can
comment out the lines marked 8-12 and uncomment the three lines that follow them.
import wikipedia

class TextFetcher:

    def __init__(self, title):
        self.title = title
        page = wikipedia.page(title)       # 8
        f = open(title + ".txt", "w")      # 9
        f.write(page.summary)              # 10
        f.close()                          # 11
        self.text = page.summary           # 12
        #f = open(title + ".txt", "r")
        #self.text = f.read()
        #f.close()

    def getText(self):
        return self.text
Now let’s build the dataset. We will take the text about each city and remove
stopwords. Stopwords are words we usually filter out before a text
processing task. They are very common words in the English language
which add little meaning to a text. Because most of them
are used everywhere, they would make it harder to cluster our texts correctly.
# Assumes the TextFetcher class above is saved as text_fetcher.py
from text_fetcher import TextFetcher
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk

# Download the NLTK data once, not on every call to preprocessor().
# word_tokenize also needs tokenizer data; newer NLTK releases may ask for
# 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')
STOP_WORDS = set(stopwords.words('english'))

def preprocessor(text):
    # Split the text into tokens and drop the English stopwords
    tokens = word_tokenize(text)
    return " ".join(word for word in tokens if word.lower() not in STOP_WORDS)

if __name__ == "__main__":
    textFetcher = TextFetcher("London")
    text1 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("Paris")
    text2 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("Berlin")
    text3 = preprocessor(textFetcher.getText())
    docs = [text1, text2, text3]
Word vectorization techniques
It’s a known fact that computers are typically very bad at understanding text,
but they perform far better when working with numbers. Because our
dataset is made out of words, we need to transform the words into numbers.
Word embeddings or word vectorization represent a collection of
techniques used to map words or documents to vectors of real numbers that can be
used by Machine Learning for certain purposes, one of which is text
clustering.
The Scikit-Learn library contains a few word vectorizers, but for this article
we are going to choose the TfidfVectorizer.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
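To get a feel for what the vectorizer produces, here is a hand-rolled TF-IDF sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of scikit-learn's defaults; treat that as an assumption. The three toy documents are invented for this example.

```python
import math
from collections import Counter

# Toy documents (invented): "city" appears everywhere, the other words once.
docs = ["london city thames", "paris city seine", "berlin city spree"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed to mirror scikit-learn's default).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_vector(tokens):
    """Term count times idf for every vocabulary word, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


vectors = [tfidf_vector(tokens) for tokens in tokenized]
first = dict(zip(vocabulary, vectors[0]))
print(round(first["london"], 3), round(first["city"], 3))
```

The word shared by all documents gets a lower weight than the words that single out one document, which is what makes the vectors useful for clustering.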
Now it's time to apply our K-Means clustering algorithm. Luckily,
Scikit-Learn has a very good implementation of the K-Means algorithm and
we are going to use that. Because we know that we want to group our texts
into 3 categories (one for each city) we will define the K value to be 3.
kmeans = KMeans(n_clusters=3).fit(tfidf)
# labels_ holds the cluster index assigned to each document
print(kmeans.labels_)
# Output: [0 1 2] (the numbering of the clusters can differ between runs)
I know, it’s that simple! Now what does our output mean? Simply put, those 3
values are the cluster labels of our 3 documents: each text ended up in its own cluster.
To test them, we can now provide a few texts for which we know the
cluster they should end up in and see if they are assigned correctly. We have
to make sure we don’t forget to also vectorize these texts, using the same
fitted vectorizer, so that our algorithm can understand them.
test = ["This is one is about London.", "London is a beautiful city", "I love London"]
results = kmeans.predict(tfidf_vectorizer.transform(test))
print(results)
# Prints [0, 0, 0]
test = ["This is one is about Paris.", "Paris is a beautiful city", "I love Paris"]
results = kmeans.predict(tfidf_vectorizer.transform(test))
print(results)
# Prints [2, 2, 2]
test = ["This is one is about Berlin.", "Berlin is a beautiful city", "I love Berlin"]
results = kmeans.predict(tfidf_vectorizer.transform(test))
print(results)
# Prints [1, 1, 1]
test = ["This is about London", "This is about Paris", "This is about Vienna"]
results = kmeans.predict(tfidf_vectorizer.transform(test))
print(results)
# Prints [0, 2, 1]
And it seems our clustering worked! Now let’s suppose we get another
text about which we don’t know anything. We can pass that text through our
model and see which cluster it falls into. I see this as a simple and
efficient way to group texts, keeping in mind that the clusters were learned without labels.
Conclusions
Today we discussed the K-Means Clustering algorithm. We first went through
a general overview about Clustering algorithms and Unsupervised Machine
Learning techniques, then we discussed the K-Means Algorithm and we
implemented it using the Scikit-Learn Python library.
This article was originally published on the Programmer Backpack Blog. Make
sure to visit this blog if you want to read more stories of this kind.
Thank you so much for reading this! Interested in more stories like this? Follow me
on Twitter at @b_dmarius and I’ll post there every new article.
Written by Marius Borcan
Passionate software engineer since ever. Interested in software architecture and machine
learning. Writing on https://programmerbackpack.com
Follow
More from Marius Borcan and Towards Data Science
3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science
https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 16/20
Marius Borcan in Towards Data Science
TF-IDF Explained And Python
Sklearn Implementation
What is TF-IDF and how you can implement it
in Python and Scikit-Learn.
6 min read · Jun 8, 2020
224 3
Leonie Monigatti in Towards Data Science
Intro to DSPy: Goodbye Prompting,
Hello Programming!
How the DSPy framework solves the fragility
problem in LLM-based applications by…
· 13 min read · Feb 27, 2024
2.7K 10
Dave Melillo in Towards Data Science
Building a Data Platform in 2024
Marius Borcan in The Startup
Python NLP Tutorial: Information
Extraction and Knowledge Graphs
3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science
https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 17/20
How to build a modern, scalable data platform
to power your analytics and data science…
9 min read · Feb 5, 2024
2K 30
This article was originally published on the
Programmer Backpack blog. Make sure to…
7 min read · Feb 3, 2020
286 2
See all from Marius Borcan See all from Towards Data Science
Recommended from Medium
3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science
https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 18/20
Avijit Bhattacharjee
Implementing K-Means Clustering
from Scratch in Python
K-Means is a widely used clustering algorithm
that helps organize data points into groups o…
3 min read · Sep 24, 2023
9
Cristian Leo in Towards Data Science
The Math behind Adam Optimizer
Why is Adam the most popular optimizer in
Deep Learning? Let’s understand it by diving…
16 min read · Jan 30, 2024
2.5K 20
Lists
Predictive Modeling w/
Python
20 stories · 992 saves
General Coding Knowledge
20 stories · 1002 saves
Coding & Development
11 stories · 495 saves
Natural Language Processing
1276 stories · 767 saves
3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science
https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 19/20
Dr. Ernesto Lee
The Ultimate Step-by-Step Guide
to Data Mining with PCA and…
An Analysis of Cereal Brands Using the Steps
31 min read · Nov 10, 2023
186
Tahera Firdose
Comparing Hierarchical, K-Means,
and DBSCAN Clustering…
Hierarchical Clustering
· 4 min read · Dec 4, 2023
4
Nirmal Sankalana
K-means Clustering: Choosing
Optimal K, Process, and Evaluatio…
Prof. Frenzel
Statistical Measures Every Analyst
Must Know—Part1
3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science
https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 20/20
In today’s data-driven world, businesses and
researchers encounter a huge amount of…
16 min read · Sep 19, 2023
109
Measures of Central Tendency, Variability,
Quartiles, Z-Scores, and as always:…
11 min read · Feb 4, 2024
739 5
See more recommendations
Help Status About Careers Blog Privacy Terms Text to speech Teams

More Related Content

Similar to K-Means Clustering Explained_ Algorithm And Sklearn Implementation _ by Marius Borcan _ Towards Data Science.pdf

Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_Sagar
Sagar Kumar
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
 
slides.pptx
slides.pptxslides.pptx
slides.pptx
kunwarpratap8055
 

Similar to K-Means Clustering Explained_ Algorithm And Sklearn Implementation _ by Marius Borcan _ Towards Data Science.pdf (20)

Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
Clustering
ClusteringClustering
Clustering
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
lecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptxlecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptx
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_Sagar
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 
Customer segmentation.pptx
Customer segmentation.pptxCustomer segmentation.pptx
Customer segmentation.pptx
 
Clustering in Machine Learning.pdf
Clustering in Machine Learning.pdfClustering in Machine Learning.pdf
Clustering in Machine Learning.pdf
 
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...A Study in Employing Rough Set Based Approach for Clustering  on Categorical ...
A Study in Employing Rough Set Based Approach for Clustering on Categorical ...
 
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
IRJET-  	  Optimal Number of Cluster Identification using Robust K-Means for ...IRJET-  	  Optimal Number of Cluster Identification using Robust K-Means for ...
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
 
slides.pptx
slides.pptxslides.pptx
slides.pptx
 
Clusterix at VDS 2016
Clusterix at VDS 2016Clusterix at VDS 2016
Clusterix at VDS 2016
 
G0354451
G0354451G0354451
G0354451
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

K-Means Clustering Explained_ Algorithm And Sklearn Implementation _ by Marius Borcan _ Towards Data Science.pdf

  • 1. 3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 1/20 K-Means Clustering Explained: Algorithm And Sklearn Implementation Introduction to clustering and k-means clusters. Detailed overview and sklearn implementation. Marius Borcan · Follow Published in Towards Data Science · 9 min read · Apr 13, 2020 30 K-Means clustering is one of the most powerful clustering algorithms in the Data Science and Machine Learning world. It is very simple, yet it delivers wonderful results. And because clustering is a very important step for understanding a dataset, in this article we are going to discuss what is clustering, why do we need it and what is k-means clustering going to help us with in data science. Search Sign up Sign in Write
  • 2. 3/12/24, 11:25 AM K-Means Clustering Explained: AlgorithmAnd Sklearn Implementation | by Marius Borcan | Tow ards Data Science https://tow ardsdatascience.com/k-means-clustering-explained-algorithm-and-sklearn-implementation-1fe8e104e822 2/20 Article overview: What is Clustering What is Unsupervised Machine Learning Clustering applications K-Means Clustering explained K-Means Clustering Algorithm K-Means Clustering Implementation using Scikit-Learn and Python What is Clustering Clustering is the task of grouping data into two or more groups based on the properties of the data, and more exactly based on certain patterns which are more or less obvious in the data. The goal is to find those patterns in the data that help us be sure that, given a certain item in our dataset, we will be able to correctly place the item in a correct group, so that it is similar to other items in that group, but different from items in other groups. That means the clustering actually consists of two parts: one is to identify the groups and the other one is to try as much as possible to place every item in the correct group.
  • 3. The ideal result for a clustering algorithm is that two items in the same group are as similar to each other as possible, while two items from different groups are as different as possible.
    [Figure: Cluster example. Source: Wikipedia]
    A real-world example would be customer segmentation. As a business selling various types of products/services, it would be very difficult to find
  • 4. the perfect business strategy for each and every customer. But we can be smart about it and try to group our customers into a few subgroups, understand what the customers in each subgroup have in common, and adapt our business strategy for every group. Coming up with the wrong business strategy for a customer might mean losing that customer, so it’s important that we achieve a good clustering of our market.
    What is Unsupervised Machine Learning
    Unsupervised Machine Learning is a type of machine learning that tries to infer patterns in the data without any prior knowledge of the expected outputs. The opposite is Supervised Machine Learning, where we have a training set and the algorithm tries to find the patterns in the data by matching inputs to predefined outputs.
    The reason I am writing about this is that clustering is an Unsupervised Machine Learning task. When applying a clustering algorithm, we don’t know the categories a priori (although we can set the number of categories that we want to be identified). The categories emerge from the algorithm analyzing the data. Because of that, we may call clustering an exploratory machine learning task: we only know the number of categories, but not their properties.
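    To make the point about "setting the number of categories without supplying labels" concrete, here is a minimal sketch. It assumes scikit-learn and NumPy are installed; the synthetic blobs, the range of K values and the variable names are made up for illustration and are not part of the original article. The model only ever sees the feature matrix `X`, and `inertia_` (the sum of squared distances to the closest centroid) gives one rough way to compare different values of K.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Unsupervised: fit() receives only X, never a column of expected labels.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the closest centroid
    print(k, round(model.inertia_, 1))
```

    On data like this, the inertia typically drops sharply until K reaches the number of real groups and only slightly afterwards, which is one informal hint for choosing K.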
  • 5. Then we can try playing around with different numbers of categories and see whether our data is better clustered or not. And then we have to understand our clusters, which may actually be the most difficult task.
    Let’s reuse the example with customer segmentation. Let’s say we have run a clustering algorithm and we get our customers clustered into 3 groups. But what are those groups? Why has the algorithm decided that these customers fit into this group, and those customers fit into that group? This is the part where you need very skilled data scientists along with people who understand your business very well. They will look at the data, analyze a few items in each category and try to guess a few criteria. Then they will extrapolate from there once they find a valid pattern.
    What happens when we get a new customer? We have to put this customer into one of the clusters we already have, so we run the data about this customer through our algorithm and the algorithm fits the customer into one of our clusters. Also, in the future, after we acquire a large number of new customers, we might need to rebuild our clusters, because new clusters may appear or old clusters may disappear.
    Clustering applications
  • 6. What are some common clustering applications? Before we fall in love with clustering algorithms, we need to understand when we can use them and when we cannot.
    The most common use case is the one we’ve already discussed: customer/market segmentation. Companies run these types of analysis all the time so they can understand their customers and markets and tailor their business strategies, services and products for a better fit.
    Another common use case is information extraction. In information extraction tasks we often need to find relations between entities, words, documents and so on. Now, if your intuition tells you we have a higher chance of finding relations between items which are more similar to each other, then you’re right: clustering our data points might help us figure out where to look for relations. (Note: if you want to read more about information extraction, you can also try this article: Python NLP Tutorial: Information Extraction and Knowledge Graphs).
    Another very popular use case is image segmentation. Image segmentation is the task of looking at an image and trying to identify the different items in that image. We can use clustering to analyze the pixels of the image and to identify which pixels belong to which item.
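    As a rough illustration of the image segmentation idea, the sketch below clusters the pixels of a tiny synthetic image by color. Everything here is an assumption made for the example: the 4x6 two-tone "image", the noise level and the choice of 2 clusters are invented, and real segmentation pipelines usually need more than raw pixel colors. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 4x6 RGB "image": left half reddish, right half bluish, plus slight noise.
rng = np.random.default_rng(0)
image = np.zeros((4, 6, 3))
image[:, :3] = [200, 30, 30]
image[:, 3:] = [30, 30, 200]
image += rng.normal(scale=5.0, size=image.shape)

# Treat every pixel as one data point with 3 features (R, G, B).
pixels = image.reshape(-1, 3)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Put the cluster labels back into the image grid: one segment id per pixel.
segments = model.labels_.reshape(image.shape[:2])
print(segments)
```

    Each entry of `segments` is the cluster id of the corresponding pixel, so pixels with similar colors end up in the same segment.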
  • 7. K-Means Clustering explained
    The K-Means clustering algorithm is an iterative clustering algorithm which tries to assign each data point to exactly one of the K clusters we predefine. As with any other clustering algorithm, it tries to make the items in one cluster as similar as possible, while also making the clusters as different from each other as possible. It does so by minimizing the sum of squared distances between the data points in a cluster and the centroid of that cluster. The centroid of a cluster is the mean of all the data points in the cluster. This is also where the name K-Means comes from.
    In more technical terms, we try to make the data within one cluster as homogeneous as possible, while making the clusters as heterogeneous from each other as possible. K is the number of clusters we try to obtain. We can play around with K until we are satisfied with our results.
    K-Means Clustering algorithm
    The K-Means Clustering algorithm works with a few simple steps.
    1. Choose K, the number of clusters.
  • 8. 2. Shuffle the data, randomly assign each data point to one of the K clusters, and pick initial random centroids.
    3. Calculate the squared distance between each data point and every centroid.
    4. Reassign each data point to the closest centroid, based on the computation in step 3.
    5. Recompute each centroid as the mean of the data points in its cluster.
    6. Repeat steps 3, 4 and 5 until nothing changes in the clusters anymore.
    The time needed to run the K-Means Clustering algorithm depends on the size of the dataset, the K number we define and the patterns in the data.
    K-Means Clustering Implementation using Scikit-Learn and Python
    We are going to use the Scikit-Learn Python library to run a K-Means Clustering algorithm on a small dataset.
    Dataset for K-Means Clustering algorithm
    The data consists of 3 texts about London, Paris and Berlin. We are going to extract the summary sections of the Wikipedia articles about these 3 cities
  • 9. and run them through our clustering algorithm. We will then provide 3 new sentences of our own and check if they are correctly assigned to individual clusters. If that happens, then we will know our clustering algorithm worked.
    K-Means Clustering implementation
    First let’s install our dependencies.

        # Sklearn library for our cluster
        pip3 install scikit-learn
        # We will use nltk (Natural Language Toolkit) to remove stopwords from the text
        pip3 install nltk
        # We will use the wikipedia library to download our texts from the Wikipedia pages
        pip3 install wikipedia

    Now let's define a small class to help us gather the texts from the Wikipedia pages. We will store the texts in 3 files on our local disk so that we don't download them again every time we run the algorithm. Use the class as it is right now for your first run of the algorithm; for a second run you can comment out lines 8-12 and uncomment lines 13-15.
  • 10.
        import wikipedia

        class TextFetcher:

            def __init__(self, title):
                self.title = title
                # First run: download the summary and cache it in <title>.txt
                page = wikipedia.page(title)      # 8
                f = open(title + ".txt", "w")     # 9
                f.write(page.summary)             # 10
                f.close()                         # 11
                self.text = page.summary          # 12
                # Later runs: read the cached file instead of downloading
                #f = open(title + ".txt", "r")
                #self.text = f.read()
                #f.close()

            def getText(self):
                return self.text

    Now let’s build the dataset. We will take the text about each city and remove stopwords. Stopwords are words we usually filter out before a text processing task. They are very common words in the English language which carry little meaning on their own. Because most of them are used everywhere, they would prevent us from clustering our texts correctly.

        from text_fetcher import TextFetcher
        from nltk.corpus import stopwords
        from nltk.tokenize import word_tokenize
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
  • 11.
        import nltk

        # One-time downloads: the stopword list and the tokenizer data used by
        # word_tokenize ('punkt'; newer NLTK releases may ask for 'punkt_tab').
        nltk.download('stopwords')
        nltk.download('punkt')

        # Build the English stopword set once instead of once per token.
        STOPWORDS = set(stopwords.words('english'))

        def preprocessor(text):
            tokens = word_tokenize(text)
            return " ".join(word for word in tokens if word.lower() not in STOPWORDS)

        if __name__ == "__main__":
            textFetcher = TextFetcher("London")
            text1 = preprocessor(textFetcher.getText())
            textFetcher = TextFetcher("Paris")
            text2 = preprocessor(textFetcher.getText())
            textFetcher = TextFetcher("Berlin")
            text3 = preprocessor(textFetcher.getText())
            docs = [text1, text2, text3]

    Word vectorization techniques
    It’s a known fact that computers are typically very bad at understanding text, but they perform much better when working with numbers. Because our dataset is made out of words, we need to transform the words into numbers. Word embeddings or word vectorization represent a collection of techniques used to map a word to a vector of real numbers that can be used by Machine Learning for certain purposes, one of which is text clustering.
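    To give a feel for what "turning words into numbers" can look like, here is a minimal pure-Python sketch of TF-IDF, one common vectorization scheme. It uses a textbook formula (term frequency times log(N / document frequency)); scikit-learn's `TfidfVectorizer` applies a smoothed variant and normalizes each vector by default, so its numbers will differ. The function name `tfidf_vectors` and the three toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with tf = count / len(doc) and idf = log(N / df).

    Assumes every document contains at least one word.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[word] / len(tokens)) * math.log(n_docs / df[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

docs = [
    "london river thames city",
    "paris river seine city",
    "berlin river spree city",
]
vocabulary, vectors = tfidf_vectors(docs)
for word, weight in zip(vocabulary, vectors[0]):
    print(f"{word:8s} {weight:.3f}")
```

    Words that occur in every document (here "river" and "city") get a weight of zero, while words specific to one document get a positive weight, which is what lets a clustering algorithm tell the texts apart.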
  • 12. The Scikit-Learn library contains a few word vectorizers, but for this article we are going to choose the TfidfVectorizer.

        tfidf_vectorizer = TfidfVectorizer()
        tfidf = tfidf_vectorizer.fit_transform(docs)

    Now it's time to apply our K-Means clustering algorithm. We are lucky that Scikit-Learn has a very good implementation of the K-Means algorithm, and we are going to use that. Because we know that we want to group our texts into 3 categories (one for each city), we will set the K value to 3.

        kmeans = KMeans(n_clusters=3).fit(tfidf)
        # labels_ holds the cluster assigned to each input document
        print(kmeans.labels_)
        # Output: [0 1 2]  (the cluster numbering is arbitrary and can differ between runs)

    I know, it’s that simple! Now what does our output mean? Simply put, each of those 3 values is the cluster assigned to the corresponding text, so our 3 texts ended up in 3 different clusters.
    To test them, we can now provide 3 texts about which we know for sure they should be in different clusters and see if they are assigned correctly. We have
  • 13. to make sure we don’t forget to also vectorize these 3 texts, using the same fitted vectorizer, so that our algorithm can understand them.

        test = ["This is one is about London.", "London is a beautiful city", "I love London"]
        results = kmeans.predict(tfidf_vectorizer.transform(test))
        print(results)
        # Prints [0 0 0]

        test = ["This is one is about Paris.", "Paris is a beautiful city", "I love Paris"]
        results = kmeans.predict(tfidf_vectorizer.transform(test))
        print(results)
        # Prints [2 2 2]

        test = ["This is one is about Berlin.", "Berlin is a beautiful city", "I love Berlin"]
        results = kmeans.predict(tfidf_vectorizer.transform(test))
        print(results)
        # Prints [1 1 1]

        test = ["This is about London", "This is about Paris", "This is about Vienna"]
        results = kmeans.predict(tfidf_vectorizer.transform(test))
        print(results)
        # Prints [0 2 1]
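    As a rough picture of what `fit` and `predict` are doing in the snippets above, the steps from the algorithm section can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it picks K random data points as the initial centroids (a simplification of step 2), whereas scikit-learn uses a smarter initialization (k-means++) by default. The function names, the `seed` and `max_iter` parameters and the toy points are all invented for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points (step 3)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(point, centroids):
    """Index of the centroid nearest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def fit(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Simplified initialization: use k random data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its closest centroid.
        new_labels = [closest(p, centroids) for p in points]
        if new_labels == labels:
            break  # Step 6: nothing changed, so we have converged.
        labels = new_labels
        # Step 5: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def predict(points, centroids):
    """Assign new points to the nearest already-fitted centroid."""
    return [closest(p, centroids) for p in points]

# Toy data: two obvious groups.
points = [[1.0, 1.0], [1.5, 1.5], [2.0, 2.0], [8.0, 8.0], [8.5, 8.5], [9.0, 9.0]]
centroids, labels = fit(points, k=2)
print(labels)
print(predict([[1.2, 1.2], [8.8, 8.8]], centroids))
```

    Note that prediction does not re-run the clustering: a new point is simply given the label of its nearest centroid, which is why an unrelated text (such as the one about Vienna) still lands in one of the existing clusters.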
  • 14. And it seems our clustering worked! Now let’s suppose we get another text about which we don’t know anything. We can pass that text through our model and see which cluster it fits into. I see this as a very good and efficient text classifier.
    Conclusions
    Today we discussed the K-Means Clustering algorithm. We first went through a general overview of clustering algorithms and Unsupervised Machine Learning techniques, then we discussed the K-Means algorithm and implemented it using the Scikit-Learn Python library.
    This article was originally published on the Programmer Backpack Blog. Make sure to visit this blog if you want to read more stories of this kind.
    Thank you so much for reading this! Interested in more stories like this? Follow me on Twitter at @b_dmarius and I’ll post every new article there.