CodeCV: Mining Expertise of GitHub Users
from Coding Activities
SCAM 2022
October 3–4, 2022 – Limassol, Cyprus
Daniel Atzberger, Nico Scordialo, Tim Cech, Willy Scheibel, Matthias Trapp, Jürgen Döllner
Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany 04.10.2022
Motivation
”The people working in a software organization are its greatest
assets. It is expensive to recruit and retain good people, and it is
up to software managers to ensure that the engineers working on a
project are as productive as possible.”
I. Sommerville, Software Engineering. 9th Ed., Harlow 2016, p. 652
2 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Motivation
3 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Problem Statement
• Existing approaches focus on locating expertise in programming languages
[Moradi Dakhel et al., 2021], library usage [Teyton et al., 2013, Montandon et al., 2019],
and higher-level concepts [Saxena and Pedanekar, 2017, Teyton et al., 2014] and
summarizing them in a CV
[Kourtzanidis et al., 2020, Kuttal et al., 2021, Greene and Fischer, 2016].
• All approaches either rely on a set of predefined terms, or only analyze a small amount of
the actual source code activites, i.e., the vocabulary used in comments and identifier
names is largely ignored. In no case the approach provides explainability to the user.
4 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Approach
The processing pipeline of the CodeCV system
5 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Latent Dirichlet Allocation
The generative process of LDA [Blei et al., 2003]
6 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Labeled LDA I
• The topics infered by the LDA model are
not labeled, it is therefore not possible to
link the vocabulary to the higher-level
concepts automatically.
• We utilize additional information from
GitHub projects for generating labeled
topics, to collect a corpus of representative
software projects associated to higher-level
concepts, similar to
[Atzberger et al., 2022].
7 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Labeled LDA II
LLDA additionaly considers a set of labels [Ramage et al., 2009]. Therefore we can link the source
code with GitHub tags by the usage of LLDA
8 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Example
First ten projects from GitHub that are marked as
representative projects for the concepts Machine
Learning and Cryptocurrency.
Machine Learning Cryptocurrency
tensorflow bitcoin
transformers ccxt
pytorch freqtrade
keras OpenBBTerminal
scikit-learn lbry-sdk
tesseract monero
face_recognition xmrig
TensorFlow-Examples blockchain
faceswap lnd
julia tendermint
Top ten words for the higher-level concepts
Machine Learning and Cryptocurrency by applying
the weighting scheme of Sievert and Shirley
[Sievert and Shirley, 2014].
Machine Learning Cryptocurrency
th fe
tensor order
self symbol
cuda crypto
input binance
model price
license trade
output wallet
layer exchange
size block
9 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Scoring Function
The score sd,c of a developer d in a programming language, library, or concept c is given by:
sd,c =
ÿ
wœVc
rw,c · Nw,d
NVc,d
,
where Vc denotes the vocabulary associated to c, Nw,D is the overall count of a word w in the
commit histories of all developers d œ D, NVc,D is given by
NVc,D =
ÿ
wúœVc
Nwú,d,
and the relevance rw,c of a word w for a concept c is given by:
rw,c =
1
Nw,D
·
ÿ
wúœVc
Nwú,D
10 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Dashboard
An example screenshot from our CodeCV system
11 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Conclusions
• Low input process for generating expertise profiles for developers
• Promising first results: Able to identify high-level concepts
Results of our first experiments. The results indicate that our approach is able to locate expertise in
higher-level concepts adequatly.
12 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Future Work
”The people working in a software organization are its greatest
assets. It is expensive to recruit and retain good people, and it is up
to software managers to ensure that the engineers working on
a project are as productive as possible. In successful companies
and economies, this productivity is achieved when people are
respected by the organization and are assigned responsibilities
that reflect their skills and experience.”
I. Sommerville, Software Engineering. 9th Ed., Harlow 2016, p. 652
13 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Contact
• Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
• Nico Scordialo
• Tim Cech, tim.cech@hpi.uni-potsdam.de
• Willy Scheibel
• Matthias Trapp
• Jürgen Döllner
Acknowledgements
We want to thank the anonymous reviewers for their valuable comments and suggestions to improve this article. This work is
part of the „Software-DNA“ project, which is funded by the European Regional Development Fund (ERDF or EFRE in
German) and the State of Brandenburg (ILB). This work is part of the KMU project „KnowhowAnalyzer“ (Förderkennzeichen
01IS20088B), which is funded by the German Ministry for Education and Research (Bundesministerium für Bildung und
Forschung).
14 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
References I
[Atzberger et al., 2022] Atzberger, D., Cech, T., Jobst, A., Scheibel, W., Limberger, D.,
Trapp, M., and Döllner, J. (2022). Visualization of knowledge distribution across
development teams using 2.5D semantic software maps. In Proc. 17th International Joint
Conf. on Computer Vision, Imaging and Computer Graphics Theory and Applications,
IVAPP ’22, pages 210–217. INSTICC, SciTePress.
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022.
[Greene and Fischer, 2016] Greene, G. J. and Fischer, B. (2016). CVExplorer: Identifying
candidate developers by mining and exploring their open source contributions. In Proc. 31st
International Conf. on Automated Software Engineering, ASE ’16, pages 804–809.
IEEE/ACM.
15 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
References II
[Kourtzanidis et al., 2020] Kourtzanidis, S., Chatzigeorgiou, A., and Ampatzoglou, A. (2020).
RepoSkillMiner: Identifying software expertise from GitHub repositories using natural
language processing. In Proc. 35th International Conf. on Automated Software Engineering,
ASE ’16, pages 1353–1357. IEEE/ACM.
[Kuttal et al., 2021] Kuttal, S. K., Chen, X., Wang, Z., Balali, S., and Sarma, A. (2021).
Visual resume: Exploring developers’ online contributions for hiring. Information and
Software Technology, 138:106633.
[Montandon et al., 2019] Montandon, J. E., Lourdes Silva, L., and Valente, M. T. (2019).
Identifying experts in software libraries and frameworks among GitHub users. In Proc. 16th
International Conf. on Mining Software Repositories, MSR ’19, pages 276–287. IEEE/ACM.
[Moradi Dakhel et al., 2021] Moradi Dakhel, A., C. Desmarais, M., and Khomh, F. (2021).
Assessing developer expertise from the statistical distribution of programming syntax
patterns. In Proc. 25th Conf. on Evaluation and Assessment in Software Engineering, EASE
’21, pages 90–99. ACM.
16 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
References III
[Ramage et al., 2009] Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009).
Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In
Proc. 2009 Conf. on Empirical Methods in Natural Language Processing, EMNLP ’09, pages
248–256. ACL.
[Saxena and Pedanekar, 2017] Saxena, R. and Pedanekar, N. (2017). I know what you coded
last summer: Mining candidate expertise from GitHub repositories. In Companion of the
2017 Conf. on Computer Supported Cooperative Work and Social Computing, CSCW ’17,
pages 299–302. ACM.
[Sievert and Shirley, 2014] Sievert, C. and Shirley, K. (2014). LDAvis: A method for visualizing
and interpreting topics. In Proc. Workshop on Interactive Language Learning, Visualization,
and Interfaces, pages 63–70. ACL.
[Teyton et al., 2013] Teyton, C., Falleri, J.-R., Morandat, F., and Blanc, X. (2013). Find your
library experts. In Proc. 20th Working Conf. on Reverse Engineering, WCRE ’13, pages
202–211. IEEE.
17 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
References IV
[Teyton et al., 2014] Teyton, C., Palyart, M., Falleri, J.-R., Morandat, F., and Blanc, X.
(2014). Automatic extraction of developer expertise. In Proc. 18th International Conf. on
Evaluation and Assessment in Software Engineering, EASE ’14, pages 1–10. ACM.
18 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
Provocative statements
• We should focus on the actual software code rather than more abstract concepts (e.g.
software metrics) when investigating developer expertise.
• When a machine learning system interacts with humans, it is almost always necessary to
explain the decision.
19 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
View publication stats

CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf

  • 1.
    CodeCV: Mining Expertiseof GitHub Users from Coding Activities SCAM 2022 October 3–4, 2022 – Limassol, Cyprus Daniel Atzberger, Nico Scordialo, Tim Cech, Willy Scheibel, Matthias Trapp, Jürgen Döllner Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany 04.10.2022
  • 2.
    Motivation ”The people workingin a software organization are its greatest assets. It is expensive to recruit and retain good people, and it is up to software managers to ensure that the engineers working on a project are as productive as possible.” I. Sommerville, Software Engineering. 9th Ed., Harlow 2016, p. 652 2 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 3.
    Motivation 3 CodeCV: MiningExpertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 4.
    Problem Statement • Existingapproaches focus on locating expertise in programming languages [Moradi Dakhel et al., 2021], library usage [Teyton et al., 2013, Montandon et al., 2019], and higher-level concepts [Saxena and Pedanekar, 2017, Teyton et al., 2014] and summarizing them in a CV [Kourtzanidis et al., 2020, Kuttal et al., 2021, Greene and Fischer, 2016]. • All approaches either rely on a set of predefined terms, or only analyze a small amount of the actual source code activites, i.e., the vocabulary used in comments and identifier names is largely ignored. In no case the approach provides explainability to the user. 4 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 5.
    Approach The processing pipelineof the CodeCV system 5 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 6.
    Latent Dirichlet Allocation Thegenerative process of LDA [Blei et al., 2003] 6 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 7.
    Labeled LDA I •The topics infered by the LDA model are not labeled, it is therefore not possible to link the vocabulary to the higher-level concepts automatically. • We utilize additional information from GitHub projects for generating labeled topics, to collect a corpus of representative software projects associated to higher-level concepts, similar to [Atzberger et al., 2022]. 7 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 8.
    Labeled LDA II LLDAadditionaly considers a set of labels [Ramage et al., 2009]. Therefore we can link the source code with GitHub tags by the usage of LLDA 8 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 9.
    Example First ten projectsfrom GitHub that are marked as representative projects for the concepts Machine Learning and Cryptocurrency. Machine Learning Cryptocurrency tensorflow bitcoin transformers ccxt pytorch freqtrade keras OpenBBTerminal scikit-learn lbry-sdk tesseract monero face_recognition xmrig TensorFlow-Examples blockchain faceswap lnd julia tendermint Top ten words for the higher-level concepts Machine Learning and Cryptocurrency by applying the weighting scheme of Sievert and Shirley [Sievert and Shirley, 2014]. Machine Learning Cryptocurrency th fe tensor order self symbol cuda crypto input binance model price license trade output wallet layer exchange size block 9 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 10.
    Scoring Function The scoresd,c of a developer d in a programming language, library, or concept c is given by: sd,c = ÿ wœVc rw,c · Nw,d NVc,d , where Vc denotes the vocabulary associated to c, Nw,D is the overall count of a word w in the commit histories of all developers d œ D, NVc,D is given by NVc,D = ÿ wúœVc Nwú,d, and the relevance rw,c of a word w for a concept c is given by: rw,c = 1 Nw,D · ÿ wúœVc Nwú,D 10 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 11.
    Dashboard An example screenshotfrom our CodeCV system 11 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 12.
    Conclusions • Low inputprocess for generating expertise profiles for developers • Promising first results: Able to identify high-level concepts Results of our first experiments. The results indicate that our approach is able to locate expertise in higher-level concepts adequatly. 12 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 13.
    Future Work ”The peopleworking in a software organization are its greatest assets. It is expensive to recruit and retain good people, and it is up to software managers to ensure that the engineers working on a project are as productive as possible. In successful companies and economies, this productivity is achieved when people are respected by the organization and are assigned responsibilities that reflect their skills and experience.” I. Sommerville, Software Engineering. 9th Ed., Harlow 2016, p. 652 13 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 14.
    Contact • Daniel Atzberger,daniel.atzberger@hpi.uni-potsdam.de • Nico Scordialo • Tim Cech, tim.cech@hpi.uni-potsdam.de • Willy Scheibel • Matthias Trapp • Jürgen Döllner Acknowledgements We want to thank the anonymous reviewers for their valuable comments and suggestions to improve this article. This work is part of the „Software-DNA“ project, which is funded by the European Regional Development Fund (ERDF or EFRE in German) and the State of Brandenburg (ILB). This work is part of the KMU project „KnowhowAnalyzer“ (Förderkennzeichen 01IS20088B), which is funded by the German Ministry for Education and Research (Bundesministerium für Bildung und Forschung). 14 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 15.
    References I [Atzberger etal., 2022] Atzberger, D., Cech, T., Jobst, A., Scheibel, W., Limberger, D., Trapp, M., and Döllner, J. (2022). Visualization of knowledge distribution across development teams using 2.5D semantic software maps. In Proc. 17th International Joint Conf. on Computer Vision, Imaging and Computer Graphics Theory and Applications, IVAPP ’22, pages 210–217. INSTICC, SciTePress. [Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. [Greene and Fischer, 2016] Greene, G. J. and Fischer, B. (2016). CVExplorer: Identifying candidate developers by mining and exploring their open source contributions. In Proc. 31st International Conf. on Automated Software Engineering, ASE ’16, pages 804–809. IEEE/ACM. 15 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 16.
    References II [Kourtzanidis etal., 2020] Kourtzanidis, S., Chatzigeorgiou, A., and Ampatzoglou, A. (2020). RepoSkillMiner: Identifying software expertise from GitHub repositories using natural language processing. In Proc. 35th International Conf. on Automated Software Engineering, ASE ’16, pages 1353–1357. IEEE/ACM. [Kuttal et al., 2021] Kuttal, S. K., Chen, X., Wang, Z., Balali, S., and Sarma, A. (2021). Visual resume: Exploring developers’ online contributions for hiring. Information and Software Technology, 138:106633. [Montandon et al., 2019] Montandon, J. E., Lourdes Silva, L., and Valente, M. T. (2019). Identifying experts in software libraries and frameworks among GitHub users. In Proc. 16th International Conf. on Mining Software Repositories, MSR ’19, pages 276–287. IEEE/ACM. [Moradi Dakhel et al., 2021] Moradi Dakhel, A., C. Desmarais, M., and Khomh, F. (2021). Assessing developer expertise from the statistical distribution of programming syntax patterns. In Proc. 25th Conf. on Evaluation and Assessment in Software Engineering, EASE ’21, pages 90–99. ACM. 16 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 17.
    References III [Ramage etal., 2009] Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proc. 2009 Conf. on Empirical Methods in Natural Language Processing, EMNLP ’09, pages 248–256. ACL. [Saxena and Pedanekar, 2017] Saxena, R. and Pedanekar, N. (2017). I know what you coded last summer: Mining candidate expertise from GitHub repositories. In Companion of the 2017 Conf. on Computer Supported Cooperative Work and Social Computing, CSCW ’17, pages 299–302. ACM. [Sievert and Shirley, 2014] Sievert, C. and Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proc. Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70. ACL. [Teyton et al., 2013] Teyton, C., Falleri, J.-R., Morandat, F., and Blanc, X. (2013). Find your library experts. In Proc. 20th Working Conf. on Reverse Engineering, WCRE ’13, pages 202–211. IEEE. 17 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 18.
    References IV [Teyton etal., 2014] Teyton, C., Palyart, M., Falleri, J.-R., Morandat, F., and Blanc, X. (2014). Automatic extraction of developer expertise. In Proc. 18th International Conf. on Evaluation and Assessment in Software Engineering, EASE ’14, pages 1–10. ACM. 18 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022
  • 19.
    Provocative statements • Weshould focus on the actual software code rather than more abstract concepts (e.g. software metrics) when investigating developer expertise. • When a machine learning system interacts with humans, it is almost always necessary to explain the decision. 19 CodeCV: Mining Expertise of GitHub Users from Coding Activities Tim Cech 04.10.2022 View publication stats