SlideShare a Scribd company logo
1 of 6
Download to read offline
Profile Analysis of StackOverflow Users in Data
Analytics Domain
Sri Saranya. M, *Vigneshwari S, Jabez J, Vimali J S
School of Computing, Sathyabama Institute of Science and Technology, Chennai
Abstract— Data Analytics and Data Science is in the fast forward
mode recently. We see a lot of companies hiring people for data
analysis and data science, especially in India. Also, many
recruiting firms use stackoverflow to fish their potential
candidates. The industry has also started to recruit people based
on the shapes of expertise. Expertise of a personal is
metaphorically outlined by shapes of letters like I, T, M and
hyphen betting on her experiencein a section (depth) and
therefore the variety of areas of interest (width).This proposal
builds upon the work of mining shapes of user expertise in a
typical online social Question and Answer (Q&A) community
where expert users often answer questions posed by other
users.We have dealt with the temporal analysis of the expertise
among the Q&A community users in terms how the user/ expert
have evolved over time.
Keywords— Shapes of expertise, Graph communities, Expertise
evolution, Q&A community
I. INTRODUCTION
The Question and Answer Community is a place
where people can gain and share knowledge. It provides a
valuable resource that cannot be easily obtained using search
engines. Q&A communities can serve as a source to locate
experts in certain areas. StackOverflow is a one which allows
users to post questions which are answered by other users. The
users can also vote for an answer. StackOverflow in particular
are used by developers could help us in better improving
developer experience in using these sites. It would also help us
to understand how information within StackOverflow could be
useful to various software development activities, and the
community of knowledge is formed.
StackOverflow.com is one of the most popular Q&A
websites focused mainly on questions that are related to
programming languages but it also contains different domains
in the IT industry.
II. STATE OF ART
Xiaomo, present a three-phase framework that
consequently produces expertise profiles of online group
individuals. The three stages include a document-topic
relevance model, a user document association model and a
filtering strategy. LDA model is used to identify best
performances [5]. Riahi, in his paper the main objective is to
concentrate on discovering specialists for a recently posted
query. They route new questions to the best suited experts.
Here two models have been used known as LDA (Content of
User profile) and STM (Structure of profiles). The STM
performs better than LDA (Segmented Topic Model) [8].
A temporal study of experts in CQA and analyze the
changes over time. Here the Gaussian Mixture Model based
clustering algorithm is used to find the experts in the domain.
Probability of providing a best answer increases for experts &
it decreases for ordinary users over time.
III. MATERIALS & METHODS
A. Data collection
The data is collected from the StackOverflow Q&A
forum. Data Science domain is selected for Profile analysis of
users. Input dataset has records starting from May2014 to
December2016.
B. Data Cleaning
Data cleaning (Pre-processing) is an important and
critical step in Data mining process. Raw data that includes
noise, incomplete and inconsistent values which affect the
accuracy of data in analysis is need to be preprocessed to
improve the quality of data and also to achieve better
results[11-17]. The Null values are checked and removed
&duplicate records are also removedfrom the attributes.
Finally, the cleaned data is transformed for further data mining
process.
C. Data Preparation
Import the data from XML to CSV file format. Select
the relevant data that has good range values. Post data is
selected with the required fields such as Date, PostType Id,
User Id, Accepted AnswerId, and Answer Countfor EDA
analysis. Then data transformation was carried out. Create a
Calendar for the required period to find response time and
merge the calendar with the Post data for performing EDA
analysis.
D. Exploratory Data Analysis
The cleaneddata is used for Exploratory Data
Analysis. In EDA analysis we find the total number of
questions and answers in data science domain, number of
questions asked by each user, total number of answers given
by each user, maximum, minimum and average number of
answers for a question, number of accepted answers given by
each user. Based on the above information the further analysis
is performed.
IV. EXPERIMENTAL DESIGN
International Journal of Pure and Applied Mathematics
Volume 119 No. 7 2018, 565-570
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue
ijpam.eu
565
R is a programming language and it is an
environment for statistical and graphical techniques. It is
highly extensible. R studio V3.3.1 is used for this data
analysis.
Algorithm for finding expert users
1. Load the dataset from the stack exchange Q&A community
2. Parse and read the dataset from XML to CSV file
3. Pre-processing: Clean, remove null values and duplicates
4. Perform Exploratory Data Analysis for the cleaned dataset
5. Use graph partitioning algorithms to find the communities
from the tag groups
6. Compare the partitioning algorithms to find the best communities
based on their modularity
7. Find the expertise profile and define shapes for the expert users
8. Perform the prediction for the evolution of experts
9. Evaluate the performance analysis for the dataset
10. Visualize the results
E. Finding Areas of expertise
The next part of this paper deals with finding the
areas of expertise within a community. The graph partitioning
approach is used to find the areas of expertise. Graph is
created using co-occurrence of tags given by users to the
questions. Modularity is calculated to find the connections
within the community. High modularity reflects dense
connections within communities in the graph & sparse
connections across communities in the graph. Sub
communities obtained by the partitioning algorithms are
known as Area of Expertise within the community. In this
paper several graph partitioning algorithms are used and a
comparison of these algorithms is also performed.
a) Girvan Newman Algorithm
In this algorithm, at each step the edge with the
highest betweenness is removed from the graph. Modularity of
the graph is computed for every division. Modularity value for
Girvan Newman algorithm for finding community is
0.007191502.
Fig.1: Graph for edge.betweeness.community detection
Fig.1 shows the graph for GirvanNewman and here
most of the members have fall in a single community.
b) Fast Greedy Algorithm
The algorithm is agglomerative. At each step two
groups merge & merging is decided by optimizing the
modularity. It is a fast algorithm but it doesn’t produce the
best overall community for the given data.
Fig.2: Fast greedy community detection
Fig.2 shows the graph for community detection using
fast greedy algorithm. Modularity of fast greedy is 0.2088943.
Modularity is good when compared to Girvan Newman but the
community formation is not properly split here. Thus Fast
Greedy algorithm is not acceptable for the given dataset.
c) Spin Glass clustering
This algorithm is based on Spin glass model. Here
the cluster size is defined as 5 to find the communities.
Fig.3: Spin glass community detection
The Fig.3 shows the graph for spin glass community
detection. Modularity of spin glass is 0.2221456 and it is less
than fast greedy but members are not divided properly in all
the 5 groups.
d) Multilevel Community
Multilevel community detection is based on the
modularity measure and hierarchical approach. In this
algorithm each vertex is assigned to a community on its own
and in every step, vertices are re-assigned to communities in a
local or in greedy way.
The Fig.4 shows the multilevel community detection
and here we got 5 communities and we can also see sparse
connections across communities.Modularity of multilevel
community is 0.2021255. So, multilevel approach is well
suited for the given dataset.
International Journal of Pure and Applied Mathematics Special Issue
566
Fig.4: Multilevel community detection
F) Finding Expertise patterns
Based on the communities obtained from the
multilevel graph partitioning approach the areas of expertise
have been obtained for the data science users. To find the
levels of expertise within an area it is defined as Low,
Medium and High.High- Users answering several questions &
providing several best answers. Medium- Users answering
several questions some of them being best answers. Low-
Users interest in some areas & providing some answers.
Shapes are assigned based on the range of number of
answers given by users. The shapes for the expert users can be
given as follows,Hyphen: Contributing in one or more areas
but not expertise in any area.I-shaped: Expertise in only one
area & no contribution in other areas.T-shaped: Expertise in
one area & also contributing in many areas.M-shaped:
Expertise in more than one areas then he is considered as
master in the community.
G) Evolution of Expertise of each profile
Figure 5: User Activity Graph
Fig.5 shows the distribution of profile users over a
period of 32 months data. Here the I-shaped users are
neglected because their contribution is very few in this period
of 32 months. So, it’s not shown in the above graph.
V. RESULTS & DISCUSSIONS
Generally prediction is used to forecast an event for
futuretime period. Several prediction approaches are used in
data analytics such as regression, time series etc. In this paper
prediction using time series is performed. It is used to predict
the future values based on previously observed values. Auto
ARIMA (Autoregressive Moving Average) model and
NNETAR (Neural Network) model is used in this prediction
analysis and the accuracy for prediction is also compared.
a) Hyphen-shaped users
The figure 6(a) shows the comparison of original
time series and the forecasted time series of Hyphen shaped
users as modelled by auto Arima.
Fig.6 (a) Auto Arima: Original Vs Predicted
Figure 6(b) shows the original time series followed
by the forecasted values of Hyphen shaped users using neural
network model.
Fig.6 (b) NNETAR: Original Vs Predicted
b) T-shaped users
The fig.7(a) shows the comparison of original time
series and the forecasted time series of T-shaped users as
modelled by auto Arima. Fig.7(b) shows the original time
series followed by the forecasted values of T-shaped users
using neural network model.
Fig.7(a) Auto Arima: Original Vs Predicted
International Journal of Pure and Applied Mathematics Special Issue
567
Fig.7 (b) NNETAR: Original Vs Predicted
c) M-shaped users
The fig.8(a) shows the comparison of original time
series and the forecasted time series for M-shaped users as
modelled by auto Arima.
Fig.8(a) Auto Arima: Original Vs Predicted
Fig.8 (b) shows the original time series followed by
the forecasted values of M-shaped users using neural network
model.
Figure 8(b) NNETAR: Original Vs Predicted
TABLE 1: Statistics of Training and Testing data using Auto ARIMA &
NNETAR
The above table 1 shows the evaluation metrics for
Hyphen shaped, T-shaped and M-shaped profile users
modelled by using auto Arima and neural networks. Among
the above, the model for forecasting the Hyphen and M-
shaped users is acceptable in both ARIMA & NNETAR model
and the model for forecasting the T-shaped users is relatively
acceptable in both the models since the MAPE value is high.
VI. CONCLUSION & FUTURE WORKS
In this paper, we obtained a tendency towards the preditction
of social Q & A forurms which involves experience in
learning.TheExploratory Data analysis has been done and the
results were shown. Tag groups were used to form the
Communities using graph. Then the shapes were given to
every user based on their answer count. We also extend this to
do temporal analysis of change in expertise shapes over time.
It can be extended to other communities also. For expert
identification, we have only considered number of answers.
We can extend this to include number of interactions between
the expert users and can also include number of upvotes to
decide the expertise.
ACKNOWLEDGMENT
The authors would like to acknowledge that this work has
been carried out at DST-FIST sponsored Cloud Computing
Lab (order Saction No. : SR/FST/ETI-364/2014 Dated: 21
November, 2014), School of Computing, Sathyabama
University.
REFERNCES
[1] Bazelli, B. et al. 2013. On the personality traits of StackOverflow users.
IEEE Int. Conf. on Software Maintenance, ICSM. 460–463.
[2] Bouguessa, M. et al. 2008. Identifying authoritative actors in question-
answering forums: the case of Yahoo! answers." Proc. of the 14th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. 866-
874.
[3] Daniel Czyczyn-Egird(B) and Rafal Wojszczyk, “Determining the
Popularity of Design Patterns Used by Programmers Based on the
Analysis of Questions and Answers on Stackoverflow.com Social
Network”, Springer International Publishing Switzerland 2016, pp.
421–433.
[4] https://archive.org/details/stackexchange
[5] Liu, X. et al. “Harnessing global expertise: A comparative study of
expertise profiling methods for online communities”,2012 Information
Systems Frontiers. 1–13.
[6] Varun Kumar, Niranjan Pedanekar, “Mining Shapes of Expertise in
Online Social Q&A Communities” In Companion Proceedings of
CSCW 2016, the 19th ACM Conference on Computer Supported
Cooperative Work and Social Computing.
[7] Yang, Liu, CQArank, ”Jointly model topics and expertise in community
question answering”, 2012 Proc. of the 22nd ACM International
Conference on Info. & Knowledge Management. 99-108.
[8] Riahi, “Finding expert users in community question answering”, 2012
Proc. of the 21st
International Conference Companion on World Wide
Web. 791– 798.
[9] Sourcing on StackOverflow. 2015.
http://sourcingrecruitment.info/sourcing-onstackoverflow/. Retrieved on
June 1, 2015.
[10] Zhan,” Expertise networks in online communities: structure and
algorithms”, 2007 Forum American Bar Association.
International Journal of Pure and Applied Mathematics Special Issue
568
[11] Vigneshwari. S and Aramudhan. M (2015), “Personalized cross
ontological framework for secured document retrieval in the cloud”,
National Academy Science Letters-India, Vol. 38 (5), pp. 421–424.
[12] Mary Posonia A, Vigneshwari S, Jyothi V.L,(2018),An Approach for
Efficient Ranking of XML Documents using BPN based RANN,
IIOABJ , Vol. 9 , 2 , pp. 1-7 ,
[13] Vigneshwari. S and Aramudhan. M (2015), “Social Information
Retrieval Based on Semantic Annotation and Hashing upon the Multiple
Ontologies”, Indian Journal of Science and Technology, Vol. No: 8(2),
pp. 103–107.SJR: 1.3.
[14] Vigneshwari. S and Aramudhan. M (2013), “An Approach for Ontology
Integration for Personalization with the Support of XML”, International
Journal of Engineering and Technology, Vol., No: 5(6), pp. 4556-4571.
SJR: 0.18.
[15] Vigneshwari. S and Aramudhan. M (2013), “A Technique To User
Profiling Ontology Mining And Relationship Ranking”, Journal of
Theoretical & Applied Information Technology, Vol. 58(3), pp. 635-
640. SJR: 0.15.
[16] Merlin Ann Roy,Vigneshwari. S, “An Effective Framework to Improve
The Efficiency of Semantic Based Search, Journal of Theoretical and
Applied Information Technology” ,Vol.73.No.2,pp. 220-225,2015
[17] S.Monisha, S.Vigneshwari, “A Framework for Ontology Based Link
Analysisfor Web Mining”, Journal of Theoretical and Applied
Information Technology,Vol.73.No.2,pp.307-312,2015
International Journal of Pure and Applied Mathematics Special Issue
569
570

More Related Content

What's hot

INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
IJwest
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification
Alexander Decker
 
2014 USA
2014 USA2014 USA
2014 USA
LI HE
 

What's hot (20)

INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Sources of errors in distributed development projects implications for colla...
Sources of errors in distributed development projects implications for colla...Sources of errors in distributed development projects implications for colla...
Sources of errors in distributed development projects implications for colla...
 
Programmer information needs after memory failure
Programmer information needs after memory failureProgrammer information needs after memory failure
Programmer information needs after memory failure
 
Semantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with IdiomsSemantic Based Model for Text Document Clustering with Idioms
Semantic Based Model for Text Document Clustering with Idioms
 
Semantic Annotation Framework For Intelligent Information Retrieval Using KIM...
Semantic Annotation Framework For Intelligent Information Retrieval Using KIM...Semantic Annotation Framework For Intelligent Information Retrieval Using KIM...
Semantic Annotation Framework For Intelligent Information Retrieval Using KIM...
 
Scalable recommendation with social contextual information
Scalable recommendation with social contextual informationScalable recommendation with social contextual information
Scalable recommendation with social contextual information
 
Meta Classification Technique for Improving Credit Card Fraud Detection
Meta Classification Technique for Improving Credit Card Fraud Detection Meta Classification Technique for Improving Credit Card Fraud Detection
Meta Classification Technique for Improving Credit Card Fraud Detection
 
Modeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector QuantizationModeling Text Independent Speaker Identification with Vector Quantization
Modeling Text Independent Speaker Identification with Vector Quantization
 
Topic modeling
Topic modelingTopic modeling
Topic modeling
 
Sentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data MiningSentiment Analysis and Classification of Tweets using Data Mining
Sentiment Analysis and Classification of Tweets using Data Mining
 
On the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisonsOn the benefit of logic-based machine learning to learn pairwise comparisons
On the benefit of logic-based machine learning to learn pairwise comparisons
 
IRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and Challenges
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
 
IRJET- Personality Prediction System using AI
IRJET- Personality Prediction System using AIIRJET- Personality Prediction System using AI
IRJET- Personality Prediction System using AI
 
IRJET - Recommendation System using Big Data Mining on Social Networks
IRJET -  	  Recommendation System using Big Data Mining on Social NetworksIRJET -  	  Recommendation System using Big Data Mining on Social Networks
IRJET - Recommendation System using Big Data Mining on Social Networks
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification
 
Hybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classificationHybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classification
 
Recommendation based on Clustering and Association Rules
Recommendation based on Clustering and Association RulesRecommendation based on Clustering and Association Rules
Recommendation based on Clustering and Association Rules
 
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Enhanced Retrieval of Web Pages using Improved Page Rank AlgorithmEnhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
 
2014 USA
2014 USA2014 USA
2014 USA
 

Similar to Profile Analysis of Users in Data Analytics Domain

Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
ssuser33da69
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
IJERA Editor
 
Dad (Data Analysis And Design)
Dad (Data Analysis And Design)Dad (Data Analysis And Design)
Dad (Data Analysis And Design)
Jill Lyons
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
Editor IJCATR
 
A Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft DetectionA Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft Detection
Camella Taylor
 

Similar to Profile Analysis of Users in Data Analytics Domain (20)

A Survey on Recommendation System based on Knowledge Graph and Machine Learning
A Survey on Recommendation System based on Knowledge Graph and Machine LearningA Survey on Recommendation System based on Knowledge Graph and Machine Learning
A Survey on Recommendation System based on Knowledge Graph and Machine Learning
 
Social Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network ContextSocial Friend Overlying Communities Based on Social Network Context
Social Friend Overlying Communities Based on Social Network Context
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Twitter Text Sentiment Analysis: A Comparative Study on Unigram and Bigram Fe...
Twitter Text Sentiment Analysis: A Comparative Study on Unigram and Bigram Fe...Twitter Text Sentiment Analysis: A Comparative Study on Unigram and Bigram Fe...
Twitter Text Sentiment Analysis: A Comparative Study on Unigram and Bigram Fe...
 
Socially Shared Images with Automated Annotation Process by Using Improved Us...
Socially Shared Images with Automated Annotation Process by Using Improved Us...Socially Shared Images with Automated Annotation Process by Using Improved Us...
Socially Shared Images with Automated Annotation Process by Using Improved Us...
 
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
Decision treeinductionmethodsandtheirapplicationtobigdatafinal 5
 
A Brief Survey on Recommendation System for a Gradient Classifier based Inade...
A Brief Survey on Recommendation System for a Gradient Classifier based Inade...A Brief Survey on Recommendation System for a Gradient Classifier based Inade...
A Brief Survey on Recommendation System for a Gradient Classifier based Inade...
 
IRJET- Comparative Study of Classification Algorithms for Sentiment Analy...
IRJET-  	  Comparative Study of Classification Algorithms for Sentiment Analy...IRJET-  	  Comparative Study of Classification Algorithms for Sentiment Analy...
IRJET- Comparative Study of Classification Algorithms for Sentiment Analy...
 
IRJET- Opinion Mining and Sentiment Analysis for Online Review
IRJET-  	  Opinion Mining and Sentiment Analysis for Online ReviewIRJET-  	  Opinion Mining and Sentiment Analysis for Online Review
IRJET- Opinion Mining and Sentiment Analysis for Online Review
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional DatasetsProjection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
 
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
 
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
 
Dad (Data Analysis And Design)
Dad (Data Analysis And Design)Dad (Data Analysis And Design)
Dad (Data Analysis And Design)
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdfData Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
Data Analytics Course Curriculum_ What to Expect and How to Prepare in 2023.pdf
 
IRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data ScienceIRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data Science
 
A Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft DetectionA Hybrid Theory Of Power Theft Detection
A Hybrid Theory Of Power Theft Detection
 
Sentiment Analysis on Twitter Data
Sentiment Analysis on Twitter DataSentiment Analysis on Twitter Data
Sentiment Analysis on Twitter Data
 

More from Drjabez

Survey of various methods used for integrating machine learning into brain tu...
Survey of various methods used for integrating machine learning into brain tu...Survey of various methods used for integrating machine learning into brain tu...
Survey of various methods used for integrating machine learning into brain tu...
Drjabez
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
Drjabez
 

More from Drjabez (7)

The power of_deep_learning_models_applications
The power of_deep_learning_models_applicationsThe power of_deep_learning_models_applications
The power of_deep_learning_models_applications
 
Survey of various methods used for integrating machine learning into brain tu...
Survey of various methods used for integrating machine learning into brain tu...Survey of various methods used for integrating machine learning into brain tu...
Survey of various methods used for integrating machine learning into brain tu...
 
Automated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning TechniquesAutomated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning Techniques
 
Novel Methodology of Data Management in Ad Hoc Network Formulated using Nanos...
Novel Methodology of Data Management in Ad Hoc Network Formulated using Nanos...Novel Methodology of Data Management in Ad Hoc Network Formulated using Nanos...
Novel Methodology of Data Management in Ad Hoc Network Formulated using Nanos...
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
 
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
Intrusion Detection System (IDS): Anomaly Detection using Outlier Detection A...
 
Anomaly detection by using CFS subset and neural network with WEKA tools
Anomaly detection by using CFS subset and neural network with WEKA tools Anomaly detection by using CFS subset and neural network with WEKA tools
Anomaly detection by using CFS subset and neural network with WEKA tools
 

Recently uploaded

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 

Recently uploaded (20)

chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 

Profile Analysis of Users in Data Analytics Domain

  • 1. Profile Analysis of StackOverflow Users in Data Analytics Domain Sri Saranya. M, *Vigneshwari S, Jabez J, Vimali J S School of Computing, Sathyabama Institute of Science and Technology, Chennai Abstract— Data Analytics and Data Science is in the fast forward mode recently. We see a lot of companies hiring people for data analysis and data science, especially in India. Also, many recruiting firms use stackoverflow to fish their potential candidates. The industry has also started to recruit people based on the shapes of expertise. Expertise of a personal is metaphorically outlined by shapes of letters like I, T, M and hyphen betting on her experiencein a section (depth) and therefore the variety of areas of interest (width).This proposal builds upon the work of mining shapes of user expertise in a typical online social Question and Answer (Q&A) community where expert users often answer questions posed by other users.We have dealt with the temporal analysis of the expertise among the Q&A community users in terms how the user/ expert have evolved over time. Keywords— Shapes of expertise, Graph communities, Expertise evolution, Q&A community I. INTRODUCTION The Question and Answer Community is a place where people can gain and share knowledge. It provides a valuable resource that cannot be easily obtained using search engines. Q&A communities can serve as a source to locate experts in certain areas. StackOverflow is a one which allows users to post questions which are answered by other users. The users can also vote for an answer. StackOverflow in particular are used by developers could help us in better improving developer experience in using these sites. It would also help us to understand how information within StackOverflow could be useful to various software development activities, and the community of knowledge is formed. StackOverflow.com is one of the most popular Q&A websites focused mainly on questions that are related to programming languages but it also contains different domains in the IT industry. II. STATE OF ART Xiaomo, present a three-phase framework that consequently produces expertise profiles of online group individuals. The three stages include a document-topic relevance model, a user document association model and a filtering strategy. LDA model is used to identify best performances [5]. Riahi, in his paper the main objective is to concentrate on discovering specialists for a recently posted query. They route new questions to the best suited experts. Here two models have been used known as LDA (Content of User profile) and STM (Structure of profiles). The STM performs better than LDA (Segmented Topic Model) [8]. A temporal study of experts in CQA and analyze the changes over time. Here the Gaussian Mixture Model based clustering algorithm is used to find the experts in the domain. Probability of providing a best answer increases for experts & it decreases for ordinary users over time. III. MATERIALS & METHODS A. Data collection The data is collected from the StackOverflow Q&A forum. Data Science domain is selected for Profile analysis of users. Input dataset has records starting from May2014 to December2016. B. Data Cleaning Data cleaning (Pre-processing) is an important and critical step in Data mining process. Raw data that includes noise, incomplete and inconsistent values which affect the accuracy of data in analysis is need to be preprocessed to improve the quality of data and also to achieve better results[11-17]. The Null values are checked and removed &duplicate records are also removedfrom the attributes. Finally, the cleaned data is transformed for further data mining process. C. Data Preparation Import the data from XML to CSV file format. Select the relevant data that has good range values. Post data is selected with the required fields such as Date, PostType Id, User Id, Accepted AnswerId, and Answer Countfor EDA analysis. Then data transformation was carried out. Create a Calendar for the required period to find response time and merge the calendar with the Post data for performing EDA analysis. D. Exploratory Data Analysis The cleaneddata is used for Exploratory Data Analysis. In EDA analysis we find the total number of questions and answers in data science domain, number of questions asked by each user, total number of answers given by each user, maximum, minimum and average number of answers for a question, number of accepted answers given by each user. Based on the above information the further analysis is performed. IV. EXPERIMENTAL DESIGN International Journal of Pure and Applied Mathematics Volume 119 No. 7 2018, 565-570 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu Special Issue ijpam.eu 565
  • 2. R is a programming language and it is an environment for statistical and graphical techniques. It is highly extensible. R studio V3.3.1 is used for this data analysis. Algorithm for finding expert users 1. Load the dataset from the stack exchange Q&A community 2. Parse and read the dataset from XML to CSV file 3. Pre-processing: Clean, remove null values and duplicates 4. Perform Exploratory Data Analysis for the cleaned dataset 5. Use graph partitioning algorithms to find the communities from the tag groups 6. Compare the partitioning algorithms to find the best communities based on their modularity 7. Find the expertise profile and define shapes for the expert users 8. Perform the prediction for the evolution of experts 9. Evaluate the performance analysis for the dataset 10. Visualize the results E. Finding Areas of expertise The next part of this paper deals with finding the areas of expertise within a community. The graph partitioning approach is used to find the areas of expertise. Graph is created using co-occurrence of tags given by users to the questions. Modularity is calculated to find the connections within the community. High modularity reflects dense connections within communities in the graph & sparse connections across communities in the graph. Sub communities obtained by the partitioning algorithms are known as Area of Expertise within the community. In this paper several graph partitioning algorithms are used and a comparison of these algorithms is also performed. a) Girvan Newman Algorithm In this algorithm, at each step the edge with the highest betweenness is removed from the graph. Modularity of the graph is computed for every division. Modularity value for Girvan Newman algorithm for finding community is 0.007191502. Fig.1: Graph for edge.betweeness.community detection Fig.1 shows the graph for GirvanNewman and here most of the members have fall in a single community. b) Fast Greedy Algorithm The algorithm is agglomerative. At each step two groups merge & merging is decided by optimizing the modularity. It is a fast algorithm but it doesn’t produce the best overall community for the given data. Fig.2: Fast greedy community detection Fig.2 shows the graph for community detection using fast greedy algorithm. Modularity of fast greedy is 0.2088943. Modularity is good when compared to Girvan Newman but the community formation is not properly split here. Thus Fast Greedy algorithm is not acceptable for the given dataset. c) Spin Glass clustering This algorithm is based on Spin glass model. Here the cluster size is defined as 5 to find the communities. Fig.3: Spin glass community detection The Fig.3 shows the graph for spin glass community detection. Modularity of spin glass is 0.2221456 and it is less than fast greedy but members are not divided properly in all the 5 groups. d) Multilevel Community Multilevel community detection is based on the modularity measure and hierarchical approach. In this algorithm each vertex is assigned to a community on its own and in every step, vertices are re-assigned to communities in a local or in greedy way. The Fig.4 shows the multilevel community detection and here we got 5 communities and we can also see sparse connections across communities.Modularity of multilevel community is 0.2021255. So, multilevel approach is well suited for the given dataset. International Journal of Pure and Applied Mathematics Special Issue 566
  • 3. Fig.4: Multilevel community detection F) Finding Expertise patterns Based on the communities obtained from the multilevel graph partitioning approach the areas of expertise have been obtained for the data science users. To find the levels of expertise within an area it is defined as Low, Medium and High.High- Users answering several questions & providing several best answers. Medium- Users answering several questions some of them being best answers. Low- Users interest in some areas & providing some answers. Shapes are assigned based on the range of number of answers given by users. The shapes for the expert users can be given as follows,Hyphen: Contributing in one or more areas but not expertise in any area.I-shaped: Expertise in only one area & no contribution in other areas.T-shaped: Expertise in one area & also contributing in many areas.M-shaped: Expertise in more than one areas then he is considered as master in the community. G) Evolution of Expertise of each profile Figure 5: User Activity Graph Fig.5 shows the distribution of profile users over a period of 32 months data. Here the I-shaped users are neglected because their contribution is very few in this period of 32 months. So, it’s not shown in the above graph. V. RESULTS & DISCUSSIONS Generally prediction is used to forecast an event for futuretime period. Several prediction approaches are used in data analytics such as regression, time series etc. In this paper prediction using time series is performed. It is used to predict the future values based on previously observed values. Auto ARIMA (Autoregressive Moving Average) model and NNETAR (Neural Network) model is used in this prediction analysis and the accuracy for prediction is also compared. a) Hyphen-shaped users The figure 6(a) shows the comparison of original time series and the forecasted time series of Hyphen shaped users as modelled by auto Arima. Fig.6 (a) Auto Arima: Original Vs Predicted Figure 6(b) shows the original time series followed by the forecasted values of Hyphen shaped users using neural network model. Fig.6 (b) NNETAR: Original Vs Predicted b) T-shaped users The fig.7(a) shows the comparison of original time series and the forecasted time series of T-shaped users as modelled by auto Arima. Fig.7(b) shows the original time series followed by the forecasted values of T-shaped users using neural network model. Fig.7(a) Auto Arima: Original Vs Predicted International Journal of Pure and Applied Mathematics Special Issue 567
  • 4. Fig.7 (b) NNETAR: Original Vs Predicted c) M-shaped users The fig.8(a) shows the comparison of original time series and the forecasted time series for M-shaped users as modelled by auto Arima. Fig.8(a) Auto Arima: Original Vs Predicted Fig.8 (b) shows the original time series followed by the forecasted values of M-shaped users using neural network model. Figure 8(b) NNETAR: Original Vs Predicted TABLE 1: Statistics of Training and Testing data using Auto ARIMA & NNETAR The above table 1 shows the evaluation metrics for Hyphen shaped, T-shaped and M-shaped profile users modelled by using auto Arima and neural networks. Among the above, the model for forecasting the Hyphen and M- shaped users is acceptable in both ARIMA & NNETAR model and the model for forecasting the T-shaped users is relatively acceptable in both the models since the MAPE value is high. VI. CONCLUSION & FUTURE WORKS In this paper, we obtained a tendency towards the preditction of social Q & A forurms which involves experience in learning.TheExploratory Data analysis has been done and the results were shown. Tag groups were used to form the Communities using graph. Then the shapes were given to every user based on their answer count. We also extend this to do temporal analysis of change in expertise shapes over time. It can be extended to other communities also. For expert identification, we have only considered number of answers. We can extend this to include number of interactions between the expert users and can also include number of upvotes to decide the expertise. ACKNOWLEDGMENT The authors would like to acknowledge that this work has been carried out at DST-FIST sponsored Cloud Computing Lab (order Saction No. : SR/FST/ETI-364/2014 Dated: 21 November, 2014), School of Computing, Sathyabama University. REFERNCES [1] Bazelli, B. et al. 2013. On the personality traits of StackOverflow users. IEEE Int. Conf. on Software Maintenance, ICSM. 460–463. [2] Bouguessa, M. et al. 2008. Identifying authoritative actors in question- answering forums: the case of Yahoo! answers." Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. 866- 874. [3] Daniel Czyczyn-Egird(B) and Rafal Wojszczyk, “Determining the Popularity of Design Patterns Used by Programmers Based on the Analysis of Questions and Answers on Stackoverflow.com Social Network”, Springer International Publishing Switzerland 2016, pp. 421–433. [4] https://archive.org/details/stackexchange [5] Liu, X. et al. “Harnessing global expertise: A comparative study of expertise profiling methods for online communities”,2012 Information Systems Frontiers. 1–13. [6] Varun Kumar, Niranjan Pedanekar, “Mining Shapes of Expertise in Online Social Q&A Communities” In Companion Proceedings of CSCW 2016, the 19th ACM Conference on Computer Supported Cooperative Work and Social Computing. [7] Yang, Liu, CQArank, ”Jointly model topics and expertise in community question answering”, 2012 Proc. of the 22nd ACM International Conference on Info. & Knowledge Management. 99-108. [8] Riahi, “Finding expert users in community question answering”, 2012 Proc. of the 21st International Conference Companion on World Wide Web. 791– 798. [9] Sourcing on StackOverflow. 2015. http://sourcingrecruitment.info/sourcing-onstackoverflow/. Retrieved on June 1, 2015. [10] Zhan,” Expertise networks in online communities: structure and algorithms”, 2007 Forum American Bar Association. International Journal of Pure and Applied Mathematics Special Issue 568
  • 5. [11] Vigneshwari. S and Aramudhan. M (2015), “Personalized cross ontological framework for secured document retrieval in the cloud”, National Academy Science Letters-India, Vol. 38 (5), pp. 421–424. [12] Mary Posonia A, Vigneshwari S, Jyothi V.L,(2018),An Approach for Efficient Ranking of XML Documents using BPN based RANN, IIOABJ , Vol. 9 , 2 , pp. 1-7 , [13] Vigneshwari. S and Aramudhan. M (2015), “Social Information Retrieval Based on Semantic Annotation and Hashing upon the Multiple Ontologies”, Indian Journal of Science and Technology, Vol. No: 8(2), pp. 103–107.SJR: 1.3. [14] Vigneshwari. S and Aramudhan. M (2013), “An Approach for Ontology Integration for Personalization with the Support of XML”, International Journal of Engineering and Technology, Vol., No: 5(6), pp. 4556-4571. SJR: 0.18. [15] Vigneshwari. S and Aramudhan. M (2013), “A Technique To User Profiling Ontology Mining And Relationship Ranking”, Journal of Theoretical & Applied Information Technology, Vol. 58(3), pp. 635- 640. SJR: 0.15. [16] Merlin Ann Roy,Vigneshwari. S, “An Effective Framework to Improve The Efficiency of Semantic Based Search, Journal of Theoretical and Applied Information Technology” ,Vol.73.No.2,pp. 220-225,2015 [17] S.Monisha, S.Vigneshwari, “A Framework for Ontology Based Link Analysisfor Web Mining”, Journal of Theoretical and Applied Information Technology,Vol.73.No.2,pp.307-312,2015 International Journal of Pure and Applied Mathematics Special Issue 569
  • 6. 570