Exploring Language Communities on Github
2. Introduction
This study explores underlying patterns and detects communities in the programming languages used by GitHub users, via network analysis.
Two graphs are derived from the whole dataset, plus two location-specific graphs, in order to study both the general GitHub audience and the trends in some sample locations.
Goal: Understand how languages are practically grouped by the way developers use them, and discover trends both worldwide and in specific locations.
Nodes → Languages
Edges → Language co-occurrence in User Profiles (based on the user repositories)
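The node/edge scheme above can be sketched with networkx; the user-to-languages mapping here is hypothetical toy data standing in for the real dataset:

```python
from itertools import combinations

import networkx as nx

# Hypothetical mapping of users to the languages found in their repositories
user_languages = {
    "alice": ["JavaScript", "CSS", "Python"],
    "bob": ["JavaScript", "CSS"],
    "carol": ["C", "Python"],
}

G = nx.Graph()
for langs in user_languages.values():
    # One edge per unordered pair of languages in the same profile;
    # the edge weight counts how many users share that pair
    for a, b in combinations(sorted(set(langs)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G["CSS"]["JavaScript"]["weight"])  # 2 users pair CSS with JavaScript
```

The resulting weighted graph can be exported (e.g. with `nx.write_gexf`) and explored in Gephi as in the slides that follow.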
3. GitHub
● GitHub is a web-based Git repository hosting service
● It offers distributed revision control and source code
management (SCM)
● It is the largest host of source code in the world! [1]
Why Github?
“The introduction of social features in a code hosting site has drawn
particular attention from researchers while the integrated social
features, and the availability of metadata through an accessible api
have made GitHub very attractive for software engineering
researchers” [3]
Top Image Source: https://goo.gl/CWBMqb
Bottom Image Source: https://github.com/logos
4. Pros
● Developers will get a hint of which languages are used jointly, and thus perhaps serve the same purpose.
● Language creators will get a hint of what their audience prefers and trusts.
● Language communities might actually be another way to explore developer communities.
Challenges
● Programming language categorization ambiguity
● GitHub bias towards Web Development
● Locations and users have a power-law distribution: there are numerous developers from a few locations (such as California, London etc.) and a significant number of locations with few users
5. Fundamentals
Dataset Features
➔ ID, Username, Location, Followers, Public Repos, Languages & Bytes of code
Network Structure
➔ Nodes: Languages
◆ Attribute: Total Bytes of Code
➔ Edges: Pairs of Languages that co-occurred in at least one user profile
◆ Weight: Number of users that use both languages
Challenges in the Data
➔ Only public repositories are accessible (users mainly work on private ones!)
➔ Locations are added by the user (may be empty, not real, or not written consistently)
PyGithub [2]
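Gathering the dataset features with PyGithub [2] might look like the sketch below. Only the aggregation helper is concrete; the commented PyGithub calls (`Github`, `get_user`, `get_repos`, `get_languages`) use placeholder token and login values, and `get_languages()` returns a language-to-bytes mapping per repository:

```python
from collections import Counter

def aggregate_languages(repo_languages):
    """Sum bytes of code per language over a user's repositories.

    repo_languages: iterable of dicts shaped like PyGithub's
    Repository.get_languages() result, e.g. {"Python": 1234, ...}.
    """
    total = Counter()
    for langs in repo_languages:
        total.update(langs)
    return dict(total)

profile = aggregate_languages([{"JavaScript": 200, "CSS": 100}, {"JavaScript": 50}])
# profile == {"JavaScript": 250, "CSS": 100}

# With PyGithub (placeholders, not run here):
#   from github import Github
#   g = Github("YOUR_TOKEN")
#   user = g.get_user("octocat")
#   langs = aggregate_languages(r.get_languages() for r in user.get_repos())
#   # user.location, user.followers, user.public_repos give the other features
```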
6. Final Datasets
❏ 4,000 users since GitHub's foundation + 150,000 from 2012
❏ Filter: Get only users with locations!
❏ Final: 2,300 users since GitHub's foundation + 37,000 from 2012
9. Methodology
Create graph (as described):
● Filters: Degree Range
● Layout: Force Atlas 2
● Node size: “Bytes of Code” Range
● Label size: Degree Range
Compute Modularity & get communities:
● Sometimes using edge weights, sometimes not
Visualize pairs of languages and the number of developers that use both
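The modularity step above could also be reproduced outside Gephi, for instance with networkx's greedy modularity communities (a different implementation of the same idea, shown here on a toy co-occurrence graph with made-up weights):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy co-occurrence graph: a web-oriented cluster and a systems cluster,
# loosely bridged through Python
G = nx.Graph()
G.add_weighted_edges_from([
    ("JavaScript", "CSS", 9), ("JavaScript", "HTML", 8), ("CSS", "HTML", 7),
    ("C", "C++", 6), ("C", "Assembly", 5), ("C++", "Assembly", 4),
    ("Python", "JavaScript", 1), ("Python", "C", 1),
])

# Use edge weights, as the methodology sometimes does
communities = greedy_modularity_communities(G, weight="weight")
for c in communities:
    print(sorted(c))
```

On this toy input the web-oriented and systems languages separate into distinct communities, mirroring the kind of grouping the study looks for.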
19. Conclusions
Language-Oriented
➔ “Web-oriented” is the most robust category of languages used on GitHub
➔ “JavaScript - CSS” is the leading pair of languages, always outnumbering all other pairs
➔ Even though JavaScript almost always dominates the pairs of languages, C is consistently the
most-used language in terms of bytes of code [perhaps C users are not language-extroverts…]
Scheme-Oriented
➔ With a user-based scheme we can understand the general preferences of developers and the
patterns between languages. [difficult when the dataset is big!]
➔ With a repo-based scheme we can uncover hidden (or at least not widely known)
patterns of languages that are used for the same purposes.
➔ General purpose: repo-based scheme
Location purpose: user-based scheme
20. Future Work
● More Data !
● More Locations and Comparisons
● Language Graphs based on Top/Most influential Users [using followers or stars]
● Association Rules on Languages for community detection
● User Graph to detect user communities per Location (e.g. web developers, game
developers) and compare with Language Graph of Location
21. References
1. GitHub on Wikipedia: https://en.wikipedia.org/wiki/GitHub
2. PyGithub library: https://github.com/PyGithub/PyGithub
3. Kalliamvakou, Eirini, et al. "The promises and perils of mining GitHub." Proceedings of the
11th Working Conference on Mining Software Repositories. ACM, 2014.
4. Thung, Ferdian, et al. "Network structure of social coding in GitHub." 2013 17th European
Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2013.
5. Takhteyev, Yuri, and Andrew Hilts. "Investigating the geography of open source software
through GitHub." (2010).
6. Figueira Filho, Fernando, et al. "A study on the geographical distribution of Brazil's
prestigious software developers." Journal of Internet Services and Applications 6.1 (2015): 1.
Image Source: http://wifflegif.com/tags/58347-octocat-gifs
22. Thank you for your attention! Any questions?
Image Source: https://octodex.github.com/images/heisencat.png