User Categorization of Stack Overflow Data
Using the K-Means Clustering Algorithm
Project Presentation
Team Members: Afzal Ahmad and Abhishek Barnwal
What is Stack Overflow?
• Stack Overflow is a question-and-answer site, written in C#, for
professional and enthusiast programmers. It is built and run by its
users as part of the Stack Exchange network of Q&A sites.
About User Accounts on Stack Overflow
• The site is all about getting answers. Good answers are voted up and
rise to the top.
• A user's reputation score goes up when others vote up their questions,
answers, and edits.
• Badges are special achievements that users earn for participating on the
site. They come in three levels: bronze, silver, and gold.
• The person who asked a question can mark one answer as "accepted".
Dataset Overview
• The dataset is obtained from the Stack Exchange data dump at the
Internet Archive.
• The link to the dataset is as follows:
www.archive.org/details/stackexchange
• Each Stack Exchange site is published as a separate archive,
compressed with 7-Zip, that contains several XML files.
Dataset Overview
• The Stack Overflow dataset consists of the following files, each of which is
treated as a table in our database design.
1. Posts
2. PostLinks
3. Tags
4. Users
5. Votes
6. Badges
7. Comments
• We are interested only in the Users file, which contains each user's Id and
features such as age, reputation, upvotes, and downvotes.
Features of Users Data
1. Age
2. Reputation
3. Upvotes
4. Downvotes
5. Views
Data Preprocessing
• Our dataset is in XML format, which our algorithm cannot process
directly, so we first preprocess it into a form the algorithm can use.
• Data preprocessing is a data mining technique that transforms raw
data into an understandable format.
• To get the data into the desired format (CSV), we need to parse the XML.
Python script to convert XML to CSV
# Only csv and ElementTree are needed for the conversion below.
from copy import deepcopy
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET
import csv

# Parse Users.xml from the data dump; each user is one <row> element.
tree = ET.parse("Users.xml")
root = tree.getroot()

count = 0  # number of user rows written

# Open the output file for writing; newline='' avoids blank lines on Windows.
with open('user_data1.csv', 'w', newline='') as user_data:
    # Create the csv writer object and write the header row.
    csvwriter = csv.writer(user_data)
    csvwriter.writerow(['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age'])
    for row in root.findall('row'):
        # A missing Age attribute is written as '0'.
        data = [row.get('Reputation'), row.get('Views'), row.get('UpVotes'),
                row.get('DownVotes'), row.get('Age') or '0']
        csvwriter.writerow(data)
        count = count + 1
Converted CSV file format
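A minimal sketch of what the converted file looks like, applying the same conversion to a tiny in-memory XML fragment. The two `<row>` elements and all their values are invented for illustration and are not taken from the real dataset; the helper name `users_xml_to_csv` is also hypothetical. Only the attribute names and the "missing Age becomes 0" rule come from the script above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of Users.xml; the values are invented.
SAMPLE_XML = """<users>
  <row Id="1" Reputation="101" Views="12" UpVotes="5" DownVotes="0" Age="29" />
  <row Id="2" Reputation="1" Views="0" UpVotes="0" DownVotes="0" />
</users>"""

COLUMNS = ['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age']


def users_xml_to_csv(xml_text):
    """Return CSV text with one line per <row>; a missing Age is written as '0'."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in root.findall('row'):
        values = [row.get(name) for name in COLUMNS[:-1]]
        values.append(row.get('Age') or '0')
        writer.writerow(values)
    return out.getvalue()


csv_text = users_xml_to_csv(SAMPLE_XML)
print(csv_text)
```

Each user becomes one comma-separated line under the header, with the columns in the order Reputation, Views, UpVotes, DownVotes, Age.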
What is Clustering?
Clustering is the task of dividing a population of data points
into groups such that points in the same group are more similar
to each other than to points in other groups. In simple words,
the aim is to group data points with similar traits into clusters.
Pictorial representation of Clustering
Types of Clustering
1. Hard clustering: each data point either belongs to a cluster
completely or does not belong to it at all.
2. Soft clustering: instead of assigning each data point to a
single cluster, each point is given a probability or likelihood
of belonging to each cluster.
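The difference between the two types can be sketched in a few lines. This is an illustration only: the centres and the data point are hypothetical, and the soft weights use a softmax over negative squared distances, which is one simple choice assumed for this sketch (Gaussian mixture models are a more common way to obtain soft memberships).

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def hard_assignment(point, centres):
    """Index of the single nearest centre (hard clustering)."""
    distances = [squared_distance(point, c) for c in centres]
    return distances.index(min(distances))


def soft_assignment(point, centres):
    """Membership weights that sum to 1 (soft clustering).

    Assumption for this sketch: softmax over negative squared distances.
    """
    scores = [-squared_distance(point, c) for c in centres]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


centres = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
point = (1.5, 0.0)                  # hypothetical data point

label = hard_assignment(point, centres)
weights = soft_assignment(point, centres)
print(label, weights)
```

The hard result is a single cluster index, while the soft result is one weight per cluster. K-means, used in the rest of this presentation, produces hard assignments.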
Algorithm Used
• We use the K-means clustering algorithm to categorise users into
different types on the basis of the given features.
• K-means clustering is a data mining/machine learning algorithm that
groups observations into clusters of related observations without
any prior knowledge of those relationships.
• It is an unsupervised learning algorithm, because it has no
cluster labels to learn from.
• The algorithm finds K categories, where the value of K is chosen
in advance.
Unsupervised Learning
Unsupervised learning is a type of machine learning algorithm used to
draw inferences from datasets consisting of input data without labeled
responses.
The most common unsupervised learning method is cluster analysis, which
is used for exploratory data analysis to find hidden patterns or groupings in
data. The clusters are modeled using a measure of similarity that is
defined by metrics such as Euclidean or probabilistic distance.
Working of the K-Means Algorithm
1. Specify the desired number of clusters K: let us choose K = 2 for
these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: let's assign three
points to cluster 1, shown in red, and two points to cluster 2,
shown in grey.
3. Compute the cluster centroids: the centroid of the points in the red
cluster is shown as a red cross, and that of the grey cluster as a grey
cross.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids for both clusters.
6. Repeat steps 4 and 5 until no improvement is possible.
Unless another stopping rule is specified explicitly, the algorithm
terminates when no data point switches cluster between two successive
iterations.
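The six steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names, the `seed` and `max_iter` parameters, and the five sample points are all assumptions made for this example. It also assumes there are at least K points, and it seeds every cluster with one point so that no cluster starts empty.

```python
import random


def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) following steps 1-6 above."""
    rng = random.Random(seed)
    # Step 2: random initial assignment. Labels 0..k-1 each appear once so
    # that every cluster starts non-empty; the rest are drawn at random.
    labels = list(range(k)) + [rng.randrange(k) for _ in points[k:]]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 3 and 5: (re)compute each centroid from its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 4: re-assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop when no point switches cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Five hypothetical points in 2-D, K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

With two well-separated groups like these, the three points near the origin end up in one cluster and the two distant points in the other. Which cluster receives index 0 depends on the random initial assignment.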
Pictorial representation of the K-means Algorithm
Implementation of the K-means Algorithm
1. We converted our XML data into CSV.
2. We ran the K-means algorithm on the Stack Overflow data.
3. With K = 4, we get four cluster centres with the values given
below.
array([[1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
       [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
       [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
       [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01]])
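Once the centres are known, categorising a user amounts to finding the nearest centre. The sketch below uses the four centres reported above and assumes that their columns follow the CSV header order (Reputation, Views, UpVotes, DownVotes, Age); the presentation does not state the column order explicitly. The function name and the two example users are hypothetical.

```python
# Cluster centres reported above (K = 4). Assumption: the columns follow the
# CSV header order Reputation, Views, UpVotes, DownVotes, Age.
CENTRES = [
    [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01],
    [1.71912000e+04, 7.34000000e+01, 1.92800000e+02, 1.29000000e+01, 3.92000000e+01],
    [3.89650000e+04, 3.47000000e+02, 5.10000000e+02, 8.60000000e+01, 3.00000000e+01],
    [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01],
]


def nearest_centre(user, centres=CENTRES):
    """Index of the centre with the smallest Euclidean distance to `user`."""
    def squared_distance(c):
        return sum((u - v) ** 2 for u, v in zip(user, c))
    return min(range(len(centres)), key=lambda i: squared_distance(centres[i]))


# Hypothetical users: [Reputation, Views, UpVotes, DownVotes, Age]
newcomer = [150, 2, 1, 0, 25]
veteran = [40000, 300, 450, 60, 31]
print(nearest_centre(newcomer), nearest_centre(veteran))
```

Under this column assumption, the first row reads as a low-reputation cluster and the third row as the highest-reputation cluster, so the two example users fall into those two groups respectively.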
Pictorial form of the data with 4 cluster centres
Important Information Regarding the Insights
1. We processed the data of Android users of Stack Overflow.
2. All results and insights here apply only to Android-specific users.
3. We used only the numerical attributes of users, because the K-means
algorithm works on Euclidean distance.
4. The user attributes used are as follows:
Age, Views, Reputation, Upvotes, Downvotes
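Because Euclidean distance adds up squared differences across all features, a feature with a much larger numeric range contributes most of the distance. The sketch below measures this for two of the cluster centres reported earlier, assuming the columns are in CSV header order. The presentation does not say whether the features were rescaled before clustering; standardising each feature is a common remedy when one feature dominates.

```python
# Two of the reported cluster centres (K = 4). Assumption: the columns are
# Reputation, Views, UpVotes, DownVotes, Age, as in the CSV header.
FEATURES = ['Reputation', 'Views', 'UpVotes', 'DownVotes', 'Age']
centre_a = [1.82709702e+02, 8.86936593e-01, 8.58670741e-01, 3.59052712e-02, 3.21581360e+01]
centre_b = [4.18018750e+03, 1.38750000e+01, 3.42187500e+01, 1.40625000e+00, 3.27187500e+01]


def squared_gaps(p, q):
    """Per-feature squared differences; their sum is the squared Euclidean distance."""
    return [(a - b) ** 2 for a, b in zip(p, q)]


gaps = squared_gaps(centre_a, centre_b)
total = sum(gaps)
shares = {name: gap / total for name, gap in zip(FEATURES, gaps)}

for name, share in shares.items():
    print(f"{name:>10}: {share:.6f} of the squared distance")
```

For these two centres, reputation accounts for nearly all of the squared distance, which suggests that the clusters are separated mainly by reputation.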
Insights from the Stack Overflow Data
1. Almost all Android-specific users are above 30 years of age.
2. The users with the highest reputation, views, upvotes, and
downvotes are the youngest among all users. This suggests that the
younger community is more involved in Android than the older one.
3. As age increases, users are less inclined to downvote answers.
The younger community is the most involved in both downvoting and
upvoting answers.
4. Profile views are mostly affected by reputation: they increase 3-4
times when reputation doubles.
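Insight 4 can be compared against the reported cluster centres with a little arithmetic. The sketch below assumes that the first two columns of the centre array are Reputation and Views, and that views follow a power law in reputation between neighbouring centres; both are assumptions made for this example. Four centres are a very coarse basis, so this is only a rough consistency check.

```python
import math

# (Reputation, Views) of the four reported cluster centres, sorted by
# reputation. Assumption: columns 1 and 2 of the array are Reputation and Views.
centres = [
    (1.82709702e+02, 8.86936593e-01),
    (4.18018750e+03, 1.38750000e+01),
    (1.71912000e+04, 7.34000000e+01),
    (3.89650000e+04, 3.47000000e+02),
]


def views_factor_per_doubling(low, high):
    """Factor by which views grow per doubling of reputation between two
    centres, assuming a power-law relation views ~ reputation ** b."""
    (rep_lo, views_lo), (rep_hi, views_hi) = low, high
    b = math.log(views_hi / views_lo) / math.log(rep_hi / rep_lo)
    return 2 ** b


pairs = list(zip(centres, centres[1:]))
factors = [views_factor_per_doubling(lo, hi) for lo, hi in pairs]
for (lo, hi), factor in zip(pairs, factors):
    print(f"reputation {lo[0]:.0f} -> {hi[0]:.0f}: views x{factor:.2f} per doubling")
```

The factor differs from one pair of centres to the next, so the 3-4 times figure is best read as describing the high-reputation end of the data.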
Thank You
