User Categorization of Stack Overflow Data
Using the K-Means Clustering Algorithm
Project Presentation
Team Members: Afzal Ahmad and Abhishek Barnwal
What is Stack Overflow?
• Stack Overflow is a question-and-answer site for professional and
enthusiast programmers. The site is written in C# and is built and
run by its users as part of the Stack Exchange network of Q&A sites.
About User Accounts on Stack Overflow
• The site is all about getting answers. Good answers are voted up and
rise to the top.
• A user's reputation score goes up when others vote up their questions,
answers and edits.
• Badges are special achievements a user earns for participating on the
site. They come in three levels: bronze, silver, and gold.
• The person who asked can mark one answer as "accepted".
Dataset Overview
• The dataset is obtained from the Stack Exchange data dump at the
Internet Archive.
• The link to the dataset is as follows:
www.archive.org/details/stackexchange
• Each site under Stack Exchange is provided as a separate archive
of XML files compressed with 7-Zip.
Dataset Overview
• The Stack Overflow dataset consists of the following files, each of which
is treated as a table in our database design.
1. Posts
2. PostLinks
3. Tags
4. Users
5. Votes
6. Badges
7. Comments
• We are interested only in the Users file, which contains each user's Id and
features such as age, reputation, upvotes, and downvotes.
Features of Users Data
1. Age
2. Reputation
3. Upvotes
4. Downvotes
5. Views
Data Preprocessing
• Our dataset is in XML format, which our algorithm cannot process
directly, so we first preprocess it into a format the algorithm can
read.
• Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.
• To get the data into the desired format, we need to parse it.
Python script to convert XML to CSV
import csv
import xml.etree.ElementTree as ET
# The imports below are not used by the conversion itself
from copy import deepcopy
import numpy as np
import pandas as pd

# Parse the Users.xml file from the data dump
tree = ET.parse("Users.xml")
root = tree.getroot()

count = 0  # number of user rows written
# newline='' lets the csv module control the line endings
with open('user_data1.csv', 'w', newline='') as user_data:
    csvwriter = csv.writer(user_data)
    csvwriter.writerow(['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age'])
    for row in root.findall('row'):
        # A missing Age attribute is written as '0'
        data = [row.get('Reputation'), row.get('Views'), row.get('UpVotes'),
                row.get('DownVotes'), row.get('Age') or '0']
        csvwriter.writerow(data)
        count += 1
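The script above loads the whole XML tree into memory before writing anything. For a large Users.xml, a streaming variant may be more practical. The following is a minimal sketch, not the authors' code: the function name `users_xml_to_csv` and the two-row sample are hypothetical, and the sample values are made up so that the example runs without the real dump.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["Reputation", "Views", "UpVotes", "DownVotes", "Age"]

def users_xml_to_csv(xml_source, csv_target):
    """Stream <row> elements from xml_source and write one CSV line per user.

    Returns the number of user rows written.
    """
    writer = csv.writer(csv_target)
    writer.writerow(FIELDS)
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so the full tree is never built up front.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            data = [elem.get(name) for name in FIELDS]
            data[-1] = data[-1] or "0"  # missing Age becomes '0', as in the script above
            writer.writerow(data)
            count += 1
            elem.clear()  # drop the attributes of the row just processed
    return count

# Hypothetical sample standing in for Users.xml (illustrative values only).
SAMPLE = b"""<users>
  <row Id="1" Reputation="101" Views="7" UpVotes="3" DownVotes="0" Age="29" />
  <row Id="2" Reputation="15" Views="1" UpVotes="0" DownVotes="0" />
</users>"""

out = io.StringIO()
n_rows = users_xml_to_csv(io.BytesIO(SAMPLE), out)
csv_lines = out.getvalue().splitlines()
print(n_rows, csv_lines)
```

To run it on the real file, pass the path "Users.xml" as `xml_source` and a file opened with `newline=''` as `csv_target`.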
Converted CSV file format
What is Clustering?
Clustering is the task of dividing the population or data points
into a number of groups such that data points in the same group
are more similar to each other than to those in other groups.
In simple words, the aim is to segregate data points with similar
traits and assign them to clusters.
Pictorial representation of Clustering
Types of Clustering
1. Hard Clustering: In hard clustering, each data point either
belongs to a cluster completely or not.
2. Soft Clustering: In soft clustering, instead of assigning each
data point to exactly one cluster, a probability or likelihood
of that data point belonging to each cluster is assigned.
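The difference between the two types can be sketched in a few lines. This is an illustrative example only: the centroids and the point are made up, and the softmax over negative squared distances is just one possible way to turn distances into membership weights, chosen here for simplicity.

```python
import math

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hard_assign(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids):
    """Soft clustering: return one membership weight per centroid (sums to 1).

    Weights come from a softmax over negative squared distances,
    an illustrative choice rather than a standard prescribed by the slides.
    """
    scores = [-squared_distance(point, c) for c in centroids]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical centroids and point in 2-D.
centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

hard = hard_assign(point, centroids)   # a single cluster index
soft = soft_assign(point, centroids)   # one weight per cluster
print(hard, soft)
```

K-Means, used in this project, is a hard clustering method: each user ends up in exactly one cluster.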
Algorithm Used
• We use the K-Means clustering algorithm to categorise users into
different types on the basis of the given features.
• K-Means clustering is a data mining/machine learning algorithm used
to cluster observations into groups of related observations without
any prior knowledge of those relationships.
• It is an unsupervised learning algorithm, as it is given no cluster
labels in advance.
• Using this algorithm we find K different categories, where K is
chosen beforehand.
Unsupervised Learning
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
The most common unsupervised learning method is cluster analysis, which
is used in exploratory data analysis to find hidden patterns or groupings
in data. The clusters are modeled using a measure of similarity, which is
defined by metrics such as Euclidean or probabilistic distance.
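For reference, the Euclidean distance between two feature vectors p and q is the square root of the sum of the squared differences of their components. A minimal sketch follows; the two users are hypothetical and their values are made up for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical users as [Reputation, Views, UpVotes, DownVotes, Age].
user_a = [100, 10, 5, 0, 30]
user_b = [103, 14, 5, 0, 30]

distance = euclidean_distance(user_a, user_b)
print(distance)  # sqrt(3**2 + 4**2) = 5.0
```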
Working of K-Means Algorithm
1. Specify the desired number of clusters K: let us choose K=2 for
these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: let's assign three
points to cluster 1, shown in red, and two points to cluster 2,
shown in grey.
3. Compute cluster centroids: the centroid of the data points in the red
cluster is shown as a red cross and that of the grey cluster as a grey
cross.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute cluster centroids: now re-compute the centroids for
both clusters.
6. Repeat steps 4 and 5 until no improvements are possible.
When no data points switch between the two clusters for two successive
repeats, the algorithm terminates, unless a different stopping rule is
explicitly specified.
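The six steps above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the `seed` and `max_iter` parameters, and the five sample points are assumptions made for the example.

```python
import random

def mean_point(points):
    """Centroid of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 2: random assignment. The first k shuffled points are dealt one
    # to each cluster so that no cluster starts empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, idx in enumerate(order):
        labels[idx] = position if position < k else rng.randrange(k)

    centroids = []
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of every cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # A cluster that lost all its points keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        centroids = new_centroids

        # Step 4: re-assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]

        # Step 6: stop when no point switched clusters.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Step 1: choose K=2 for 5 made-up data points in 2-D space.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5)]
labels, centroids = k_means(points, k=2, seed=0)
print(labels, centroids)
```

Which cluster gets index 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.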
Pictorial representation of K-Means Algorithm
Implementation of K-Means Algorithm
1. Convert the XML data into CSV.
2. Run the K-Means algorithm on the Stack Overflow data.
3. With K=4, we get four cluster centers with the values given
below.
array([[1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
       [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
       [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
       [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01]])
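The raw array is easier to read with column names attached. The sketch below assumes the five columns follow the order of the CSV header written by the conversion script (Reputation, Views, UpVotes, DownVotes, Age); the slides do not state this explicitly, although the last column does look like an age.

```python
# Column order is an assumption taken from the CSV header in the conversion script.
COLUMNS = ["Reputation", "Views", "UpVotes", "DownVotes", "Age"]

# Cluster centers copied from the K=4 result above.
centers = [
    [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
    [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
    [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
    [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01],
]

# Attach names to each value and order the clusters by reputation.
labelled = sorted(
    (dict(zip(COLUMNS, row)) for row in centers),
    key=lambda center: center["Reputation"],
)

for center in labelled:
    print("  ".join(f"{name}={value:.2f}" for name, value in center.items()))
```

Sorted this way, the centers run from a low-reputation cluster to a high-reputation one, which makes the comparisons in the insights slide easier to follow.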
Pictorial form of the data with 4 cluster centres
Important information regarding the insights
1. We processed the data of Android users of Stack Overflow.
2. All results and insights here apply only to Android-specific users.
3. We used only the users' numerical attributes, because the K-Means
algorithm works on Euclidean distance.
4. The user attributes used are as follows:
Age, Views, Reputation, Upvotes, Downvotes
Insights from Stack Overflow data
1. Almost all Android-specific users are above 30 years of age.
2. The users with the highest reputation, views, upvotes and
downvotes are the youngest among all users. This suggests that the
young community is more involved in Android than the older one.
3. As age increases, users become less inclined to downvote answers.
The young community is the most active in both downvoting and
upvoting answers.
4. Profile views are mostly affected by reputation: they increase 3-4
times when reputation doubles.
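As a rough arithmetic check on insight 4, the cluster centers from the K=4 result can be compared pairwise. The sketch below is an illustration under stated assumptions: the first two columns of the centers are taken to be Reputation and Views (the CSV header order), and views are assumed to follow a power law in reputation between neighbouring centers, which the slides do not claim.

```python
import math

# (Reputation, Views) of the four cluster centers, rounded from the K=4 result.
centres = [(182.71, 0.887), (4180.19, 13.875), (17191.2, 73.4), (38965.0, 347.0)]
centres.sort()  # order by reputation

def views_factor_per_doubling(low, high):
    """Factor by which views grow when reputation doubles.

    Assumes views ~ reputation ** exponent between the two centers.
    """
    rep_ratio = high[0] / low[0]
    views_ratio = high[1] / low[1]
    exponent = math.log(views_ratio) / math.log(rep_ratio)
    return 2 ** exponent

# One factor for each pair of neighbouring centers, lowest reputation first.
factors = [views_factor_per_doubling(a, b) for a, b in zip(centres, centres[1:])]
print([round(f, 2) for f in factors])
```

Under these assumptions, only the step between the two highest-reputation centers gives a factor in the 3-4 range; the lower-reputation steps give smaller factors, so the 3-4 times figure seems to describe the high-reputation end of the data.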
Thank You
