User Categorization of Stack Overflow Data
Using the K-Means Clustering Algorithm
Project Presentation
Team Members – Afzal Ahmad and Abhishek Barnwal
What is Stack Overflow?
• Stack Overflow is a question and answer site, written in C#, for
professional and enthusiast programmers. It is built and run by its
users as part of the Stack Exchange network of Q&A sites.
About User Accounts on Stack Overflow
• This site is all about getting answers. Good answers are voted up and
rise to the top.
• A user's reputation score goes up when others vote up their questions,
answers and edits.
• Badges are special achievements a user earns for participating on the
site. They come in three levels: bronze, silver, and gold.
• The person who asked can mark one answer as "accepted".
Dataset Overview
• The dataset is obtained from the Stack Exchange data dump at the
Internet Archive.
• The link to the dataset is as follows.
www.archive.org/details/stackexchange
• Each site under Stack Exchange is provided as a separate archive,
compressed with 7-Zip, that contains several XML files.
Dataset overview
• The Stack Overflow dataset consists of the following files, each of which is
treated as a table in our database design.
1. Posts
2. PostLinks
3. Tags
4. Users
5. Votes
6. Badges
7. Comments
• We are interested only in the Users file, which contains each user's Id and
features such as age, reputation, upvotes and downvotes.
Features of Users Data
1. Age
2. Reputation
3. Upvotes
4. Downvotes
5. Views
Data preprocessing
• Our dataset is in XML format, which our algorithm cannot process
directly, so we need to preprocess the data into a form it can use.
• Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.
• To get the data into the desired format we need to parse it.
python script to convert xml to csv
from copy import deepcopy
import numpy as np
import pandas as pd
#from matplotlib import pyplot as plt
#%matplotlib inline
#plt.rcParams['figure.figsize'] = (16, 9)
#plt.style.use('ggplot')
import xml.etree.ElementTree as ET
import csv

# parse the Users.xml file from the data dump
tree = ET.parse("Users.xml")
root = tree.getroot()
count = 0
# open a file for writing; newline='' stops the csv module adding blank rows on Windows
with open('user_data1.csv', 'w', newline='') as User_data:
    # create the csv writer object
    csvwriter = csv.writer(User_data)
    # header row
    csvwriter.writerow(['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age'])
    for i in root.findall('row'):
        # a missing Age attribute is written as '0'
        data = [i.get('Reputation'), i.get('Views'), i.get('UpVotes'),
                i.get('DownVotes'), i.get('Age') or '0']
        count = count + 1
        csvwriter.writerow(data)
# the with-block closes the file; count holds the number of rows written
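The script above loads the whole XML tree into memory before writing anything. For a large Users.xml, a streaming parse may be more practical. The following is a minimal sketch of that alternative, assuming the same `<row>` elements and attribute names; the function name `users_xml_to_csv` and the `FIELDS` constant are illustrative and do not come from the slides.

```python
import csv
import xml.etree.ElementTree as ET

FIELDS = ['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age']

def users_xml_to_csv(xml_path, csv_path):
    """Stream <row> elements from xml_path and write FIELDS to csv_path.

    Returns the number of data rows written. The name and signature are
    illustrative, not taken from the slides.
    """
    count = 0
    with open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(FIELDS)
        # iterparse yields each element once its end tag has been read,
        # so rows can be handled one at a time
        for _event, elem in ET.iterparse(xml_path, events=('end',)):
            if elem.tag != 'row':
                continue
            values = [elem.get(name) for name in FIELDS[:-1]]
            # a missing Age attribute is written as '0', as in the script above
            values.append(elem.get('Age') or '0')
            writer.writerow(values)
            count += 1
            elem.clear()  # drop the attributes of the row just processed
    return count
```

A call such as `users_xml_to_csv('Users.xml', 'user_data1.csv')` would produce the same columns as the original script.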
Converted CSV file format
What is clustering?
Clustering is the task of dividing the population or data points
into a number of groups such that data points in the same group
are more similar to each other than to those in other groups.
In simple words, the aim is to segregate data points with similar
traits and assign them to clusters.
Pictorial representation of Clustering
Types of Clustering
1. Hard Clustering: In hard clustering, each data point either
belongs to a cluster completely or not at all.
2. Soft Clustering: In soft clustering, instead of putting each
data point into a single cluster, a probability or
likelihood of that data point belonging to each cluster is
assigned.
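The difference between the two types can be sketched in a few lines. In this illustration, hard assignment picks the single nearest centroid, while soft assignment gives every centroid a weight. The softmax over negative distances is just one simple way to produce such weights and is an assumption of this sketch, not a method from the slides. The example points are made up.

```python
import math

def hard_assignment(point, centroids):
    """Return the index of the single closest centroid (hard clustering)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assignment(point, centroids):
    """Return one membership weight per centroid, summing to 1 (soft clustering).

    The weights are a softmax over negative distances, an illustrative choice.
    """
    distances = [math.dist(point, c) for c in centroids]
    closest = min(distances)
    # subtracting the smallest distance keeps exp() from underflowing to all zeros
    scores = [math.exp(-(d - closest)) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]

example_centroids = [(0.0, 0.0), (4.0, 0.0)]
example_point = (1.0, 0.0)
hard = hard_assignment(example_point, example_centroids)
soft = soft_assignment(example_point, example_centroids)
print(hard)  # index of the nearest centroid
print(soft)  # one weight per centroid
```

K-means, used in the rest of this presentation, is a hard clustering method.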
Algorithm Used
• We use the K-means clustering algorithm to categorise users into
different types on the basis of the given features.
• K-means clustering is a data mining/machine learning algorithm used
to cluster observations into groups of related observations without
any prior knowledge of those relationships.
• It is an unsupervised learning algorithm, as it is given no
cluster labels.
• Using this algorithm we find K different categories, where K is
chosen in advance.
Unsupervised Learning
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
The most common unsupervised learning method is cluster analysis, which
is used for exploratory data analysis to find hidden patterns or grouping in
data. The clusters are modeled using a measure of similarity which is
defined upon metrics such as Euclidean or probabilistic distance.
Working of K-Means Algorithm
1. Specify the desired number of clusters K: Let us choose K = 2 for
these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: Let’s assign three
points to cluster 1, shown in red, and two points to cluster 2,
shown in grey.
3. Compute cluster centroids: The centroid of the data points in the red
cluster is shown as a red cross and that of the grey cluster as a grey
cross.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute cluster centroids: Now re-compute the
centroids of both clusters.
6. Repeat steps 4 and 5 until no improvements are possible.
The algorithm terminates when no data point switches cluster between
two successive iterations, unless a different stopping condition is
specified explicitly.
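The six steps above can be sketched in plain Python. This is a minimal illustration that follows the random-assignment start described on the slides. It is not the code used in the project, which is not shown. The names `k_means` and `centroid` and the parameters `max_iter` and `seed` are assumptions of this sketch.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 2: randomly assign each point to a cluster; shuffling a
    # round-robin labelling guarantees that no cluster starts empty
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = []
    for _ in range(max_iter):
        # Steps 3 and 5: compute the centroid of every cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # a cluster that lost all its points is reseeded with a random point
            new_centroids.append(centroid(members) if members else rng.choice(points))
        centroids = new_centroids
        # Step 4: re-assign each point to the closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point switches cluster
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# five made-up points in 2-D space, K = 2 as in the walkthrough
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which cluster receives index 0 depends on the random start, so only the grouping of the points is meaningful, not the label values.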
Pictorial representation of K-means
Algorithm
Implementation of K-means Algorithm
1. We have converted our XML data into CSV.
2. We run the K-Means algorithm on the Stack Overflow data.
3. With K = 4, we get four cluster centers with the values given
below.
array([[ 1.82709702e+02, 8.86936593e-01, 8.58670741e-01,
3.59052712e-02, 3.21581360e+01],
[ 1.71912000e+04, 7.34000000e+01, 1.92800000e+02,
1.29000000e+01, 3.92000000e+01],
[ 3.89650000e+04, 3.47000000e+02, 5.10000000e+02,
8.60000000e+01, 3.00000000e+01],
[ 4.18018750e+03, 1.38750000e+01, 3.42187500e+01,
1.40625000e+00, 3.27187500e+01]])
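The array is easier to read with column names attached. The sketch below assumes that the columns follow the order of the CSV header written by the conversion script (Reputation, Views, UpVotes, DownVotes, Age). The slides do not state this order explicitly, so treat it as an assumption. The helper name `as_records` is illustrative.

```python
# assumed column order, taken from the CSV header in the conversion script
COLUMNS = ['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age']

# cluster centers copied from the array above
CENTERS = [
    [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
    [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
    [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
    [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01],
]

def as_records(centers, columns):
    """Pair each center with the column names and sort by the first column."""
    records = [dict(zip(columns, row)) for row in centers]
    return sorted(records, key=lambda r: r[columns[0]])

records = as_records(CENTERS, COLUMNS)
# print a small table, lowest reputation first
print(''.join(f'{name:>12}' for name in COLUMNS))
for rec in records:
    print(''.join(f'{rec[name]:>12.2f}' for name in COLUMNS))
```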
Pictorial form of data with 4 cluster centres
Important information regarding insights of
data
1. We processed the data of Android users of Stack Overflow.
2. All results and insights here apply only to Android-specific users.
3. We used only the numerical attributes of users, as the K-Means
algorithm works on Euclidean distance.
4. The user attributes used are as follows.
'Age', 'Views', 'Reputation', 'Upvotes', 'Downvotes'
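Because K-Means relies on Euclidean distance, features with large numeric ranges carry most of the weight. The sketch below illustrates this with two of the cluster centers reported earlier, assuming the column order Reputation, Views, UpVotes, DownVotes, Age. The slides do not say whether the features were scaled before clustering, so this only shows what happens on the raw values.

```python
import math

# assumed column order, taken from the CSV header in the conversion script
COLUMNS = ['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age']

# two of the four cluster centers reported in the presentation
center_a = [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01]
center_b = [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01]

def squared_contributions(u, v):
    """Per-feature squared differences; their sum is the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(u, v)]

contrib = squared_contributions(center_a, center_b)
distance = math.sqrt(sum(contrib))
print(f'distance: {distance:.2f}')
for name, c in zip(COLUMNS, contrib):
    # share of the squared distance explained by each feature
    print(f'{name:>10}: {c / sum(contrib):.6f}')
```

On these two centers, almost all of the squared distance comes from the Reputation column.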
Insights from Stack Overflow data
1. Almost all the Android-specific users are above 30 years of age.
2. The users with the highest reputation, views, upvotes and
downvotes are the youngest among all users. This suggests that the
younger community is more involved in Android than the older one.
3. As age increases, users are less inclined to downvote
answers. The younger community is the most involved in both downvoting
and upvoting answers.
4. Profile views are mostly affected by reputation: they increase 3-4
times when reputation doubles.
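The last insight can be checked roughly against the four cluster centers reported earlier. The sketch below assumes that the first two columns of that array are Reputation and Views, and compares consecutive centers after sorting by reputation. With only four centers this is a coarse sanity check, not a fitted relationship.

```python
# (Reputation, Views) of the four reported cluster centers, sorted by reputation;
# the column order is an assumption based on the CSV header
centres = [
    (1.82709702e+02, 8.86936593e-01),
    (4.18018750e+03, 1.38750000e+01),
    (1.71912000e+04, 7.34000000e+01),
    (3.89650000e+04, 3.47000000e+02),
]

def growth_ratios(pairs):
    """For each consecutive pair, return (reputation ratio, views ratio)."""
    return [(b[0] / a[0], b[1] / a[1]) for a, b in zip(pairs, pairs[1:])]

ratios = growth_ratios(centres)
for rep_ratio, view_ratio in ratios:
    print(f'reputation x{rep_ratio:.2f} -> views x{view_ratio:.2f}')
```

Between the two highest-reputation centers, reputation grows by a factor of about 2.3 while views grow by a factor of about 4.7.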
Thank You
