Efficient GitHub Crawling using the GraphQL API

ICCSA | July 5th, 2022, Málaga
Adrian Jobst, Daniel Atzberger, Tim Cech,
Willy Scheibel, Matthias Trapp, and Jürgen Döllner
Hasso-Plattner-Institute, Digital Engineering Faculty, University of Potsdam, Germany
05.07.2022

Software Analytics
“Software analytics aims to obtain insightful and actionable
information from software artifacts that help practitioners
accomplish tasks related to software development, systems, and
user.”
Zhang, D. et al. (2013). Software Analytics in Practice. Software, IEEE, 30:30–37.

Software Development Process
Illustration of a software development process (Source: Seerene).

GitHub
• Heavily used platform where developers can host and review code and manage projects
• 83 million developers
• 61 million repositories created in 2021 alone
• 170 million merged pull requests
• Wealth of data:
  • Source control repository (git)
  • Bug tracking (issues & pull requests)
  • Project management (projects)
  • Developer data
  • Documentation (pages)
  • Metadata, e.g., GitHub topics
• Complete API access over REST (v3) and GraphQL (v4)
Jan Cossiers’ Prometheus

Application Example
• Developers' knowledge is directly encoded in the source code
• Training a Labeled Latent Dirichlet Allocation model on a corpus of GitHub projects yields a
concept-specific vocabulary
• Locating a concept's keywords in a developer's commit history yields a skill level for that concept
Machine Learning | Cryptocurrency | Database | Server   | Data Visualization
th               | order          | db       | request  | chart
tensor           | crypto         | table    | server   | series
self             | binance        | key      | header   | axis
cuda             | price          | name     | http     | pixi
model            | trade          | value    | body     | datum
license          | wallet         | opt      | response | point
layer            | exchange       | sql      | message  | style
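The slide does not say how the skill level is computed. As a minimal sketch of the general idea, assuming the simplest possible scoring (counting how often a concept's vocabulary occurs in a developer's commit texts), the following uses two of the word lists from the table. The function name `concept_hits` and the raw-count score are illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set

# Word lists copied from two columns of the table; the scoring below is an assumption.
VOCABULARY: Dict[str, Set[str]] = {
    "Machine Learning": {"th", "tensor", "self", "cuda", "model", "license", "layer"},
    "Database": {"db", "table", "key", "name", "value", "opt", "sql"},
}


def concept_hits(commit_texts: Iterable[str],
                 vocabulary: Dict[str, Set[str]] = VOCABULARY) -> Counter:
    """Count occurrences of each concept's keywords across commit texts."""
    hits: Counter = Counter()
    for text in commit_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for concept, words in vocabulary.items():
            hits[concept] += sum(1 for token in tokens if token in words)
    return hits


commits = ["add cuda tensor layer", "fix sql table key"]
print(concept_hits(commits))
```

A real scoring would likely normalize by commit volume or weight terms by their probability under the topic model; raw counts are used here only to keep the example short.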

Related Work
Sourcerer (Linstead et al., 2009)
• Crawls source code repositories
• Applies LDA and its variant, the Author-Topic model, to them
GHTorrent (Gousios and Spinellis, 2012)
• Monitors the GitHub public event timeline
• Based on the GitHub v3 API
Boa (Dyer et al., 2015)
• Domain-specific programming language
• Runs on infrastructure backed by a Hadoop MapReduce cluster

Prometheus System
Event-driven microservice architecture
• Microservice architecture properties:
  • Applications are divided into services
  • Primitive communication medium (e.g., REST)
  • Less standardization of technologies
  • Decentralized data management
  • Easier upgradability and replaceability
• Event-driven properties, explained by the purpose of the events themselves:
  • Event notification
  • State transfer
  • Global state record
[Figure: Prometheus architecture concept. Components shown: GitHub, Fetcher, Importer, Event Backbone, RDBMS, REST.]
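The three event purposes listed above can be illustrated with a small in-process publish/subscribe sketch. This is not the Prometheus implementation: the class `EventBackbone`, the topic names, and the event fields are hypothetical, and a real deployment would use a message broker rather than a Python object.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Tuple

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class EventBackbone:
    """In-process stand-in for a publish/subscribe channel (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        # Append-only log of everything published: a simple "global state record".
        self.log: List[Tuple[str, Event]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Record the event, deliver it to all subscribers, return their count."""
        self.log.append((topic, event))
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Event notification: says only that something happened; receivers look up details.
notification: Event = {"type": "job.finished", "job_id": 7}

# State transfer: carries the data itself, so receivers need no further lookup.
state_transfer: Event = {
    "type": "response.received",
    "job_id": 7,
    "nodes": [{"number": 1, "title": "Example issue"}],
}

bus = EventBackbone()
received: List[Event] = []
bus.subscribe("jobs", received.append)
bus.publish("jobs", notification)
bus.publish("responses", state_transfer)  # no subscriber yet, but still logged
```

Keeping the log separate from delivery means a late subscriber could replay past events, which is the practical benefit of treating the event stream as a state record.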

GitHub Data Model

GitHub Fetcher
• Query similar to GitHub GraphQL query
• Arguments can be less restrictive
• Fields can be overloaded
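For orientation, the sketch below shows an ordinary GitHub GraphQL query for a repository's issues, together with a hypothetical helper that turns a larger requested count into page sizes the API accepts. GitHub's GraphQL API only accepts `first`/`last` values between 1 and 100. Reading "less restrictive" as lifting that limit is an assumption, as are the helper names `split_first` and `build_payload` and the fields selected in the query.

```python
from typing import Any, Dict, List, Optional

GITHUB_MAX_PAGE_SIZE = 100  # GitHub's GraphQL API accepts first/last values from 1 to 100

# A plain GitHub GraphQL query: one page of issues plus the cursor for the next page.
ISSUES_QUERY = """
query($owner: String!, $name: String!, $first: Int!, $after: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $first, after: $after) {
      pageInfo { hasNextPage endCursor }
      nodes { number title }
    }
  }
}
"""


def split_first(requested: int, page_size: int = GITHUB_MAX_PAGE_SIZE) -> List[int]:
    """Split a requested item count into page sizes of at most `page_size`."""
    if requested < 0:
        raise ValueError("requested must be non-negative")
    full_pages, remainder = divmod(requested, page_size)
    return [page_size] * full_pages + ([remainder] if remainder else [])


def build_payload(owner: str, name: str, first: int,
                  after: Optional[str] = None) -> Dict[str, Any]:
    """Build the JSON body for one POST to the GraphQL endpoint (nothing is sent here)."""
    return {
        "query": ISSUES_QUERY,
        "variables": {"owner": owner, "name": name, "first": first, "after": after},
    }


print(split_first(250))
```

A job asking for 250 issues would therefore translate into three requests of 100, 100, and 50 items, each carrying the cursor returned by the previous one.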

Requirements
GitHub Fetcher
• Receive and execute job descriptions in the form of GraphQL queries
• Split queries in the background
• Handle pagination until the specified number of items is reached
• Forward parameters from responses to new queries
• Publish responses
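The pagination requirement can be sketched as a loop that follows cursors until the requested number of items is collected, forwarding `endCursor` from each response into the next query. This is a sketch under stated assumptions, not the Fetcher's code: `crawl_connection` and `make_fake_fetch` are hypothetical names, and the in-memory fake stands in for the GitHub endpoint so that the example runs without network access.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Page = Dict[str, Any]
FetchPage = Callable[[int, Optional[str]], Page]


def crawl_connection(fetch_page: FetchPage, limit: int, page_size: int = 100) -> List[dict]:
    """Collect up to `limit` nodes, requesting at most `page_size` per call."""
    items: List[dict] = []
    cursor: Optional[str] = None
    while len(items) < limit:
        first = min(page_size, limit - len(items))
        page = fetch_page(first, cursor)
        items.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            break  # the connection is exhausted before the limit was reached
        cursor = page["pageInfo"]["endCursor"]  # forwarded into the next query
    return items


def make_fake_fetch(total: int) -> Tuple[FetchPage, List[Tuple[int, Optional[str]]]]:
    """In-memory stand-in for the endpoint; the cursor is a stringified offset."""
    calls: List[Tuple[int, Optional[str]]] = []

    def fetch_page(first: int, after: Optional[str]) -> Page:
        calls.append((first, after))
        start = int(after) if after is not None else 0
        end = min(start + first, total)
        return {
            "nodes": [{"number": n} for n in range(start + 1, end + 1)],
            "pageInfo": {"hasNextPage": end < total, "endCursor": str(end)},
        }

    return fetch_page, calls


fetch, calls = make_fake_fetch(total=1000)
issues = crawl_connection(fetch, limit=250)
print(len(issues), calls)
```

Injecting `fetch_page` keeps the loop independent of the transport, so the same function could be driven by a real HTTP client.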
Importer
• Receive response events
• Insert or update objects in the database
• Publish create, update, and delete events for database changes
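The Importer's insert-or-update behaviour can be illustrated with SQLite from the Python standard library. The sketch assumes a single hypothetical `issue` table and collects events in a list instead of publishing them; the table layout, the function names, and the event shape are all assumptions for illustration, and delete events are omitted.

```python
import sqlite3
from typing import Any, Dict, List


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical one-table schema used by this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issue (id TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )


def import_issue(conn: sqlite3.Connection, issue: Dict[str, Any],
                 events: List[Dict[str, Any]]) -> None:
    """Insert a new issue or update a changed one, recording a matching event."""
    row = conn.execute(
        "SELECT title FROM issue WHERE id = ?", (issue["id"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO issue (id, title) VALUES (?, ?)", (issue["id"], issue["title"])
        )
        events.append({"type": "created", "entity": "issue", "id": issue["id"]})
    elif row[0] != issue["title"]:
        conn.execute(
            "UPDATE issue SET title = ? WHERE id = ?", (issue["title"], issue["id"])
        )
        events.append({"type": "updated", "entity": "issue", "id": issue["id"]})
    # An unchanged object produces neither a write nor an event.
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
events: List[Dict[str, Any]] = []
import_issue(conn, {"id": "I_1", "title": "Crash on start"}, events)
import_issue(conn, {"id": "I_1", "title": "Crash on startup"}, events)
print([event["type"] for event in events])
```

Skipping unchanged rows keeps repeated crawls of the same repository from flooding subscribers with no-op update events.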
[Figure: Prometheus architecture concept. Components shown: GitHub, Fetcher, Importer, Event Backbone, RDBMS, REST.]

Architecture
[Figure: Prometheus architecture concept. Rectangles with rounded corners represent Docker containers, arrows represent events. Components shown: GitHub GraphQL Endpoint, GitHub Fetcher, Importer, Metastore, Metastore Publisher, RDBMS (binlog), REST (Post / Get - Job), Redis Publish / Subscribe. Events shown: Created / Finished - Job / Blueprint / WP; Created / Updated / Deleted - Entity.]

Ghcrawler
• Archived open-source project from Microsoft
• The Microsoft Open Source Programs Office uses it to track thousands of repositories in which Microsoft is involved
• Implements GitHub's best-practice guidelines for API access
• API token management (extends the rate limit)
• Is able to crawl most GitHub entities

Experiment
• Dashed line indicates where issues would be crawled completely
• Prometheus is 3.5 times faster, even if ghcrawler had no overhead from the included pull requests
• Prometheus needs 98 API calls, ghcrawler 11,990
• ghcrawler used 11,990 of 20,000 available requests
• Prometheus used 98 of 5,000 points
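The reported numbers can be put in proportion with a little arithmetic. The snippet below only restates the figures from this slide; the variable names and the helper `budget_share` are illustrative.

```python
def budget_share(used: int, available: int) -> float:
    """Return the used fraction of a rate-limit budget as a percentage."""
    return 100.0 * used / available


ghcrawler_calls, ghcrawler_budget = 11_990, 20_000    # REST requests
prometheus_calls, prometheus_budget = 98, 5_000       # GraphQL calls and points

call_ratio = ghcrawler_calls / prometheus_calls
print(f"ghcrawler issued about {call_ratio:.1f} times as many calls")
print(f"ghcrawler budget used:  {budget_share(ghcrawler_calls, ghcrawler_budget):.2f} %")
print(f"Prometheus budget used: {budget_share(prometheus_calls, prometheus_budget):.2f} %")
```

In other words, ghcrawler issued roughly 122 times as many calls and consumed about 60 % of its request budget, while Prometheus stayed below 2 % of its point budget.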

Conclusions
Contributions
• We propose a system for crawling data from public GitHub repositories based on the
GraphQL API
• We propose a mechanism for crawling nested GraphQL queries
Findings
• The GitHub GraphQL API has advantages in terms of performance
• However, the query graph structure has a large impact on the crawling time
Future Work
• Concurrent API calls
• More flexible splitting approach to eliminate redundant queries

Contact
• Adrian Jobst
• Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
• Tim Cech
• Willy Scheibel
• Dr. Matthias Trapp
• Prof. Dr. Jürgen Döllner
Acknowledgements
This work is part of the “Software-DNA” project, which is funded
by the European Regional Development Fund (ERDF, or EFRE in
German) and the State of Brandenburg (ILB). This work is also part of
the KMU project “KnowhowAnalyzer” (grant number
01IS20088B), which is funded by the German Federal Ministry of
Education and Research (Bundesministerium für Bildung und
Forschung).