This document discusses efficient crawling of GitHub data using the GraphQL API compared to traditional REST APIs. It presents the Prometheus system, which uses a microservices architecture and event-driven approach to split GraphQL queries and import response data into a database. An experiment shows the Prometheus system is over 3 times faster than an existing crawler when retrieving issues data from GitHub repositories. The document concludes the GraphQL API enables better performance for crawling but query structure also impacts efficiency.
Efficient GitHub Crawling using the GraphQL API
1. Efficient GitHub Crawling using the GraphQL API
ICCSA | July 5th, 2022, Málaga
Adrian Jobst, Daniel Atzberger, Tim Cech,
Willy Scheibel, Matthias Trapp, and Jürgen Döllner
Hasso-Plattner-Institute, Digital Engineering Faculty, University of Potsdam, Germany
05.07.2022
2. Software Analytics
“Software analytics aims to obtain insightful and actionable information from software artifacts that help practitioners accomplish tasks related to software development, systems, and users.”
Zhang, D. et al. (2013). Software Analytics in Practice. Software, IEEE, 30:30–37.
2 Efficient GitHub Crawling using the GraphQL API Daniel Atzberger 05.07.2021
3. Software Development Process
Illustration of a software development process (Source: Seerene).
4. GitHub
• Heavily used platform where developers can host and review code and also manage projects
• 83 million developers
• 61 million repositories created in 2021 alone
• 170 million merged pull requests
• Wealth of data:
• Source control repository (git)
• Bug tracking (issues & pull requests)
• Project management (projects)
• Developer data
• Documentation (pages)
• Metadata, e.g., GitHub topics
• Complete API access over REST (v3) and GraphQL (v4)
Jan Cossiers’ Prometheus
5. Application Example
• Developers' knowledge is directly encoded in the source code
• Training a Labeled Latent Dirichlet Allocation model on a corpus of GitHub projects leads to concept-specific vocabulary
• Locating the keywords of a concept in the commit history of a developer yields a skill level
Machine Learning | Cryptocurrency | Database | Server   | Data Visualization
th               | order          | db       | request  | chart
tensor           | crypto         | table    | server   | series
self             | binance        | key      | header   | axis
cuda             | price          | name     | http     | pixi
model            | trade          | value    | body     | datum
license          | wallet         | opt      | response | point
layer            | exchange       | sql      | message  | style
6. Related Work
Sourcerer: Linstead et al. (2009)
• Crawls source code repositories
• Applies LDA and its variant, the Author-Topic model, to them
GHTorrent: Gousios, G. and Spinellis, D. (2012)
• Monitors the GitHub public event timeline
• Based on the GitHub v3 API
Boa: Dyer et al. (2015)
• Domain-specific programming language
• Runs on infrastructure that uses a Hadoop MapReduce cluster
7. Prometheus System
Event-driven microservice architecture
• Microservice architecture properties:
• Applications are divided into services
• Primitive communication medium (e.g., REST)
• Less standardization of technologies
• Decentralized data management
• Easier upgradability and replaceability
• Event-driven properties, explained by the purpose of the events themselves:
• Event notification
• State transfer
• Global state record
[Figure: Prometheus architecture concept. Components shown: GitHub, REST, Fetcher, Importer, Event Backbone, RDBMS.]
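The event notification idea can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and not taken from the Prometheus implementation: the class names, the topic name `github.response`, and the event shape are assumptions chosen for illustration, and a real deployment would presumably use a message broker rather than direct function calls.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBackbone:
    """Hypothetical in-memory stand-in for the event backbone."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in list(self._handlers.get(topic, [])):
            handler(event)


class Fetcher:
    """Publishes API responses; it does not know who consumes them."""

    def __init__(self, backbone: EventBackbone) -> None:
        self.backbone = backbone

    def deliver(self, response: dict) -> None:
        self.backbone.publish("github.response", response)


class Importer:
    """Consumes response events; here it only records them."""

    def __init__(self, backbone: EventBackbone) -> None:
        self.received: List[dict] = []
        backbone.subscribe("github.response", self.received.append)


backbone = EventBackbone()
importer = Importer(backbone)
Fetcher(backbone).deliver(
    {"repository": "octocat/Hello-World", "issues": [{"number": 1}]}
)
```

The point of the sketch is the decoupling: the fetcher and the importer share only a topic name, so either service could be replaced without touching the other.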
8. GitHub Data Model
9. GitHub Data Model
10. GitHub Fetcher
• Query similar to GitHub GraphQL query
• Arguments can be less restrictive
• Fields can be overloaded
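For reference, a plain GitHub GraphQL query for the issues of one repository might be prepared as sketched below. The query uses fields from GitHub's public schema (`repository`, `issues`, `pageInfo`, `nodes`); the helper `build_payload`, the repository `octocat/Hello-World`, and the reading of "less restrictive" as "a job may ask for more than one page" are assumptions for illustration, not the Prometheus query syntax.

```python
import json

# GitHub limits the `first`/`last` argument of a connection to 100 items.
GITHUB_MAX_PAGE_SIZE = 100

ISSUES_QUERY = """
query($owner: String!, $name: String!, $pageSize: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $pageSize, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { number title state createdAt }
    }
  }
}
"""


def build_payload(owner, name, wanted, cursor=None):
    """Hypothetical helper: build the JSON body for one page request.

    `wanted` may exceed GitHub's limit; it is clamped to one page here,
    and the remaining items would be requested in follow-up queries.
    """
    return {
        "query": ISSUES_QUERY,
        "variables": {
            "owner": owner,
            "name": name,
            "pageSize": min(wanted, GITHUB_MAX_PAGE_SIZE),
            "cursor": cursor,
        },
    }


# The body would be sent as an authenticated POST to
# https://api.github.com/graphql; no request is made in this sketch.
payload = build_payload("octocat", "Hello-World", wanted=1000)
body = json.dumps(payload)
```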
11. Requirements
GitHub Fetcher
• Receive and execute job descriptions in the form of GraphQL queries
• Split the query in the background
• Handle pagination until the specified number is reached
• Forward parameters from responses to new queries
• Publish responses
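The pagination requirement can be sketched as a loop that forwards the cursor from each response into the next query until the requested number of items is reached. This is a minimal sketch under stated assumptions: `crawl_connection` and `make_fake_api` are hypothetical names, and the fake API only mimics the `nodes`/`pageInfo` shape of a GitHub GraphQL connection so the loop can run without network access.

```python
def crawl_connection(execute_page, limit, page_size=100):
    """Collect up to `limit` nodes by following cursors page by page.

    `execute_page(first, after)` is an assumed callable that runs one
    query and returns {"nodes": [...], "pageInfo": {...}}.
    """
    collected = []
    cursor = None
    while len(collected) < limit:
        want = min(page_size, limit - len(collected))
        page = execute_page(want, cursor)
        if not page["nodes"]:
            break  # defensive stop: an empty page cannot make progress
        collected.extend(page["nodes"])
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        # Forward a parameter from the response into the next query.
        cursor = info["endCursor"]
    return collected


def make_fake_api(total):
    """Hypothetical stand-in for the API: serves the integers 1..total."""
    data = list(range(1, total + 1))
    calls = []

    def execute_page(first, after):
        start = int(after) if after is not None else 0
        end = min(start + first, total)
        calls.append((first, after))
        return {
            "nodes": data[start:end],
            "pageInfo": {"hasNextPage": end < total, "endCursor": str(end)},
        }

    return execute_page, calls


api, calls = make_fake_api(total=250)
issues = crawl_connection(api, limit=230)
```

With 250 available items and a limit of 230, the loop issues three calls, the last one asking only for the 30 items still missing.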
Importer
• Receive response events
• Insert or update objects in the database
• Publish create, update, and delete events on the database
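The insert-or-update step of the importer could look like the following sketch, which uses SQLite from the Python standard library as a stand-in for the RDBMS. The table layout, the function name `upsert_issue`, and the returned event names are assumptions for illustration; the slides do not specify the schema or the database product.

```python
import sqlite3


def upsert_issue(conn, issue):
    """Insert or update one issue row and return the event to publish."""
    exists = conn.execute(
        "SELECT 1 FROM issues WHERE id = ?", (issue["id"],)
    ).fetchone()
    if exists:
        conn.execute(
            "UPDATE issues SET title = ?, state = ? WHERE id = ?",
            (issue["title"], issue["state"], issue["id"]),
        )
        return "update"
    conn.execute(
        "INSERT INTO issues (id, title, state) VALUES (?, ?, ?)",
        (issue["id"], issue["title"], issue["state"]),
    )
    return "create"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (id TEXT PRIMARY KEY, title TEXT, state TEXT)"
)

# The same issue arrives twice, first open and then closed.
events = [
    upsert_issue(conn, {"id": "I_1", "title": "Crash on start", "state": "OPEN"}),
    upsert_issue(conn, {"id": "I_1", "title": "Crash on start", "state": "CLOSED"}),
]
conn.commit()
rows = conn.execute("SELECT id, state FROM issues").fetchall()
```

Returning the event name from the write keeps the database change and the published create or update event in one place.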
13. Ghcrawler
• Archived open source project from Microsoft
• The Microsoft Open Source Programs Office uses this to track 1000s of repos in which Microsoft is involved
• Implements GitHub best practices guidelines for API access
• API token management (extend rate limit)
• Is able to crawl most GitHub entities
14. Experiment
• The dashed line indicates the point at which all issues would be crawled completely
• Prometheus is 3.5 times faster, even if ghcrawler had no overhead from the included pull requests
• Prometheus needs 98 API calls, ghcrawler 11,990
• ghcrawler used 11,990 of 20,000 available requests
• Prometheus used 98 of 5,000 points
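The reported numbers can be put in relation with a short calculation. The figures are taken from the slide; the variable names are illustrative, and note that the two budgets are counted in different units (REST requests versus GraphQL rate-limit points).

```python
ghcrawler_calls = 11_990
ghcrawler_budget = 20_000   # available REST requests
prometheus_calls = 98
prometheus_budget = 5_000   # available GraphQL rate-limit points

# How many ghcrawler requests correspond to one Prometheus call.
call_ratio = ghcrawler_calls / prometheus_calls

# Share of the respective rate-limit budget that was consumed.
ghcrawler_share = ghcrawler_calls / ghcrawler_budget
prometheus_share = prometheus_calls / prometheus_budget

print(f"call ratio: {call_ratio:.1f}")
print(f"ghcrawler budget used: {ghcrawler_share:.2%}")
print(f"Prometheus budget used: {prometheus_share:.2%}")
```

ghcrawler issued roughly 122 times as many calls and consumed about 60% of its budget, while Prometheus consumed about 2% of its own.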
15. Conclusions
Contributions
• We propose a system for crawling data from public GitHub repositories based on the GraphQL API
• We propose a mechanism for crawling nested GraphQL queries
Findings
• The GitHub GraphQL API has advantages in terms of performance.
• However, the query graph structure has a large impact on the crawling time.
Future Work
• Concurrent API calls
• More flexible splitting approach to eliminate redundant queries.
16. Contact
• Adrian Jobst
• Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
• Tim Cech
• Willy Scheibel
• Dr. Matthias Trapp
• Prof. Dr. Jürgen Döllner
Acknowledgements
This work is part of the “Software-DNA” project, which is funded by the European Regional Development Fund (ERDF or EFRE in German) and the State of Brandenburg (ILB). This work is also part of the KMU project “KnowhowAnalyzer” (grant number, Förderkennzeichen, 01IS20088B), which is funded by the German Ministry for Education and Research (Bundesministerium für Bildung und Forschung).