Profile Analysis of Users in Data Analytics Domain

Profile Analysis of StackOverflow Users in Data
Analytics Domain
Sri Saranya. M, *Vigneshwari S, Jabez J, Vimali J S
School of Computing, Sathyabama Institute of Science and Technology, Chennai
Abstract— Data Analytics and Data Science is in the fast forward
mode recently. We see a lot of companies hiring people for data
analysis and data science, especially in India. Also, many
recruiting firms use stackoverflow to fish their potential
candidates. The industry has also started to recruit people based
on the shapes of expertise. Expertise of a personal is
metaphorically outlined by shapes of letters like I, T, M and
hyphen betting on her experiencein a section (depth) and
therefore the variety of areas of interest (width).This proposal
builds upon the work of mining shapes of user expertise in a
typical online social Question and Answer (Q&A) community
where expert users often answer questions posed by other
users.We have dealt with the temporal analysis of the expertise
among the Q&A community users in terms how the user/ expert
have evolved over time.
Keywords— Shapes of expertise, Graph communities, Expertise
evolution, Q&A community
I. INTRODUCTION
The Question and Answer Community is a place
where people can gain and share knowledge. It provides a
valuable resource that cannot be easily obtained using search
engines. Q&A communities can serve as a source to locate
experts in certain areas. StackOverflow is a one which allows
users to post questions which are answered by other users. The
users can also vote for an answer. StackOverflow in particular
are used by developers could help us in better improving
developer experience in using these sites. It would also help us
to understand how information within StackOverflow could be
useful to various software development activities, and the
community of knowledge is formed.
StackOverflow.com is one of the most popular Q&A
websites focused mainly on questions that are related to
programming languages but it also contains different domains
in the IT industry.
II. STATE OF ART
Xiaomo, present a three-phase framework that
consequently produces expertise profiles of online group
individuals. The three stages include a document-topic
relevance model, a user document association model and a
filtering strategy. LDA model is used to identify best
performances [5]. Riahi, in his paper the main objective is to
concentrate on discovering specialists for a recently posted
query. They route new questions to the best suited experts.
Here two models have been used known as LDA (Content of
User profile) and STM (Structure of profiles). The STM
performs better than LDA (Segmented Topic Model) [8].
A temporal study of experts in CQA and analyze the
changes over time. Here the Gaussian Mixture Model based
clustering algorithm is used to find the experts in the domain.
Probability of providing a best answer increases for experts &
it decreases for ordinary users over time.
III. MATERIALS & METHODS
A. Data collection
The data is collected from the StackOverflow Q&A
forum. Data Science domain is selected for Profile analysis of
users. Input dataset has records starting from May2014 to
December2016.
B. Data Cleaning
Data cleaning (Pre-processing) is an important and
critical step in Data mining process. Raw data that includes
noise, incomplete and inconsistent values which affect the
accuracy of data in analysis is need to be preprocessed to
improve the quality of data and also to achieve better
results[11-17]. The Null values are checked and removed
&duplicate records are also removedfrom the attributes.
Finally, the cleaned data is transformed for further data mining
process.
C. Data Preparation
Import the data from XML to CSV file format. Select
the relevant data that has good range values. Post data is
selected with the required fields such as Date, PostType Id,
User Id, Accepted AnswerId, and Answer Countfor EDA
analysis. Then data transformation was carried out. Create a
Calendar for the required period to find response time and
merge the calendar with the Post data for performing EDA
analysis.
D. Exploratory Data Analysis
The cleaneddata is used for Exploratory Data
Analysis. In EDA analysis we find the total number of
questions and answers in data science domain, number of
questions asked by each user, total number of answers given
by each user, maximum, minimum and average number of
answers for a question, number of accepted answers given by
each user. Based on the above information the further analysis
is performed.
IV. EXPERIMENTAL DESIGN
International Journal of Pure and Applied Mathematics
Volume 119 No. 7 2018, 565-570
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue
ijpam.eu
565

R is a programming language and it is an
environment for statistical and graphical techniques. It is
highly extensible. R studio V3.3.1 is used for this data
analysis.
Algorithm for finding expert users
1. Load the dataset from the stack exchange Q&A community
2. Parse and read the dataset from XML to CSV file
3. Pre-processing: Clean, remove null values and duplicates
4. Perform Exploratory Data Analysis for the cleaned dataset
5. Use graph partitioning algorithms to find the communities
from the tag groups
6. Compare the partitioning algorithms to find the best communities
based on their modularity
7. Find the expertise profile and define shapes for the expert users
8. Perform the prediction for the evolution of experts
9. Evaluate the performance analysis for the dataset
10. Visualize the results
E. Finding Areas of expertise
The next part of this paper deals with finding the
areas of expertise within a community. The graph partitioning
approach is used to find the areas of expertise. Graph is
created using co-occurrence of tags given by users to the
questions. Modularity is calculated to find the connections
within the community. High modularity reflects dense
connections within communities in the graph & sparse
connections across communities in the graph. Sub
communities obtained by the partitioning algorithms are
known as Area of Expertise within the community. In this
paper several graph partitioning algorithms are used and a
comparison of these algorithms is also performed.
a) Girvan Newman Algorithm
In this algorithm, at each step the edge with the
highest betweenness is removed from the graph. Modularity of
the graph is computed for every division. Modularity value for
Girvan Newman algorithm for finding community is
0.007191502.
Fig.1: Graph for edge.betweeness.community detection
Fig.1 shows the graph for GirvanNewman and here
most of the members have fall in a single community.
b) Fast Greedy Algorithm
The algorithm is agglomerative. At each step two
groups merge & merging is decided by optimizing the
modularity. It is a fast algorithm but it doesn’t produce the
best overall community for the given data.
Fig.2: Fast greedy community detection
Fig.2 shows the graph for community detection using
fast greedy algorithm. Modularity of fast greedy is 0.2088943.
Modularity is good when compared to Girvan Newman but the
community formation is not properly split here. Thus Fast
Greedy algorithm is not acceptable for the given dataset.
c) Spin Glass clustering
This algorithm is based on Spin glass model. Here
the cluster size is defined as 5 to find the communities.
Fig.3: Spin glass community detection
The Fig.3 shows the graph for spin glass community
detection. Modularity of spin glass is 0.2221456 and it is less
than fast greedy but members are not divided properly in all
the 5 groups.
d) Multilevel Community
Multilevel community detection is based on the
modularity measure and hierarchical approach. In this
algorithm each vertex is assigned to a community on its own
and in every step, vertices are re-assigned to communities in a
local or in greedy way.
The Fig.4 shows the multilevel community detection
and here we got 5 communities and we can also see sparse
connections across communities.Modularity of multilevel
community is 0.2021255. So, multilevel approach is well
suited for the given dataset.
International Journal of Pure and Applied Mathematics Special Issue
566

Fig.4: Multilevel community detection
F) Finding Expertise patterns
Based on the communities obtained from the
multilevel graph partitioning approach the areas of expertise
have been obtained for the data science users. To find the
levels of expertise within an area it is defined as Low,
Medium and High.High- Users answering several questions &
providing several best answers. Medium- Users answering
several questions some of them being best answers. Low-
Users interest in some areas & providing some answers.
Shapes are assigned based on the range of number of
answers given by users. The shapes for the expert users can be
given as follows,Hyphen: Contributing in one or more areas
but not expertise in any area.I-shaped: Expertise in only one
area & no contribution in other areas.T-shaped: Expertise in
one area & also contributing in many areas.M-shaped:
Expertise in more than one areas then he is considered as
master in the community.
G) Evolution of Expertise of each profile
Figure 5: User Activity Graph
Fig.5 shows the distribution of profile users over a
period of 32 months data. Here the I-shaped users are
neglected because their contribution is very few in this period
of 32 months. So, it’s not shown in the above graph.
V. RESULTS & DISCUSSIONS
Generally prediction is used to forecast an event for
futuretime period. Several prediction approaches are used in
data analytics such as regression, time series etc. In this paper
prediction using time series is performed. It is used to predict
the future values based on previously observed values. Auto
ARIMA (Autoregressive Moving Average) model and
NNETAR (Neural Network) model is used in this prediction
analysis and the accuracy for prediction is also compared.
a) Hyphen-shaped users
The figure 6(a) shows the comparison of original
time series and the forecasted time series of Hyphen shaped
users as modelled by auto Arima.
Fig.6 (a) Auto Arima: Original Vs Predicted
Figure 6(b) shows the original time series followed
by the forecasted values of Hyphen shaped users using neural
network model.
Fig.6 (b) NNETAR: Original Vs Predicted
b) T-shaped users
The fig.7(a) shows the comparison of original time
series and the forecasted time series of T-shaped users as
modelled by auto Arima. Fig.7(b) shows the original time
series followed by the forecasted values of T-shaped users
using neural network model.
Fig.7(a) Auto Arima: Original Vs Predicted
567

Fig.7 (b) NNETAR: Original Vs Predicted
c) M-shaped users
The fig.8(a) shows the comparison of original time
series and the forecasted time series for M-shaped users as
modelled by auto Arima.
Fig.8(a) Auto Arima: Original Vs Predicted
Fig.8 (b) shows the original time series followed by
the forecasted values of M-shaped users using neural network
model.
Figure 8(b) NNETAR: Original Vs Predicted
TABLE 1: Statistics of Training and Testing data using Auto ARIMA &
NNETAR
The above table 1 shows the evaluation metrics for
Hyphen shaped, T-shaped and M-shaped profile users
modelled by using auto Arima and neural networks. Among
the above, the model for forecasting the Hyphen and M-
shaped users is acceptable in both ARIMA & NNETAR model
and the model for forecasting the T-shaped users is relatively
acceptable in both the models since the MAPE value is high.
VI. CONCLUSION & FUTURE WORKS
In this paper, we obtained a tendency towards the preditction
of social Q & A forurms which involves experience in
learning.TheExploratory Data analysis has been done and the
results were shown. Tag groups were used to form the
Communities using graph. Then the shapes were given to
every user based on their answer count. We also extend this to
do temporal analysis of change in expertise shapes over time.
It can be extended to other communities also. For expert
identification, we have only considered number of answers.
We can extend this to include number of interactions between
the expert users and can also include number of upvotes to
decide the expertise.
ACKNOWLEDGMENT
The authors would like to acknowledge that this work has
been carried out at DST-FIST sponsored Cloud Computing
Lab (order Saction No. : SR/FST/ETI-364/2014 Dated: 21
November, 2014), School of Computing, Sathyabama
University.
REFERNCES
[1] Bazelli, B. et al. 2013. On the personality traits of StackOverflow users.
IEEE Int. Conf. on Software Maintenance, ICSM. 460–463.
[2] Bouguessa, M. et al. 2008. Identifying authoritative actors in question-
answering forums: the case of Yahoo! answers." Proc. of the 14th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. 866-
874.
[3] Daniel Czyczyn-Egird(B) and Rafal Wojszczyk, “Determining the
Popularity of Design Patterns Used by Programmers Based on the
Analysis of Questions and Answers on Stackoverflow.com Social
Network”, Springer International Publishing Switzerland 2016, pp.
421–433.
[4] https://archive.org/details/stackexchange
[5] Liu, X. et al. “Harnessing global expertise: A comparative study of
expertise profiling methods for online communities”,2012 Information
Systems Frontiers. 1–13.
[6] Varun Kumar, Niranjan Pedanekar, “Mining Shapes of Expertise in
Online Social Q&A Communities” In Companion Proceedings of
CSCW 2016, the 19th ACM Conference on Computer Supported
Cooperative Work and Social Computing.
[7] Yang, Liu, CQArank, ”Jointly model topics and expertise in community
question answering”, 2012 Proc. of the 22nd ACM International
Conference on Info. & Knowledge Management. 99-108.
[8] Riahi, “Finding expert users in community question answering”, 2012
Proc. of the 21st
International Conference Companion on World Wide
Web. 791– 798.
[9] Sourcing on StackOverflow. 2015.
http://sourcingrecruitment.info/sourcing-onstackoverflow/. Retrieved on
June 1, 2015.
[10] Zhan,” Expertise networks in online communities: structure and
algorithms”, 2007 Forum American Bar Association.
568

[11] Vigneshwari. S and Aramudhan. M (2015), “Personalized cross
ontological framework for secured document retrieval in the cloud”,
National Academy Science Letters-India, Vol. 38 (5), pp. 421–424.
[12] Mary Posonia A, Vigneshwari S, Jyothi V.L,(2018),An Approach for
Efficient Ranking of XML Documents using BPN based RANN,
IIOABJ , Vol. 9 , 2 , pp. 1-7 ,
[13] Vigneshwari. S and Aramudhan. M (2015), “Social Information
Retrieval Based on Semantic Annotation and Hashing upon the Multiple
Ontologies”, Indian Journal of Science and Technology, Vol. No: 8(2),
pp. 103–107.SJR: 1.3.
[14] Vigneshwari. S and Aramudhan. M (2013), “An Approach for Ontology
Integration for Personalization with the Support of XML”, International
Journal of Engineering and Technology, Vol., No: 5(6), pp. 4556-4571.
SJR: 0.18.
[15] Vigneshwari. S and Aramudhan. M (2013), “A Technique To User
Profiling Ontology Mining And Relationship Ranking”, Journal of
Theoretical & Applied Information Technology, Vol. 58(3), pp. 635-
640. SJR: 0.15.
[16] Merlin Ann Roy,Vigneshwari. S, “An Effective Framework to Improve
The Efficiency of Semantic Based Search, Journal of Theoretical and
Applied Information Technology” ,Vol.73.No.2,pp. 220-225,2015
[17] S.Monisha, S.Vigneshwari, “A Framework for Ontology Based Link
Analysisfor Web Mining”, Journal of Theoretical and Applied
Information Technology,Vol.73.No.2,pp.307-312,2015
569

Profile Analysis of Users in Data Analytics Domain

More Related Content

What's hot

Similar to Profile Analysis of Users in Data Analytics Domain

More from Drjabez

Recently uploaded

Profile Analysis of Users in Data Analytics Domain