Efficient GitHub Crawling using the GraphQL API
ICCSA | July 5th, 2022, Málaga
Adrian Jobst, Daniel Atzberger, Tim Cech,
Willy Scheibel, Matthias Trapp, and Jürgen Döllner
Hasso-Plattner-Institute, Digital Engineering Faculty, University of Potsdam, Germany
05.07.2022
Software Analytics
“Software analytics aims to obtain insightful and actionable
information from software artifacts that help practitioners
accomplish tasks related to software development, systems, and
user.”
Zhang, D. et al. (2013). Software Analytics in Practice. IEEE Software, 30:30–37.
2 Efficient GitHub Crawling using the GraphQL API Daniel Atzberger 05.07.2021
Software Development Process
Illustration of a software development process (Source: Seerene).
GitHub
• Heavily used platform where developers can host and review code and manage projects
• 83 million developers
• 61 million repositories created in 2021 alone
• 170 million merged pull requests
• Wealth of data:
• Source control repository (git)
• Bug tracking (issues & pull requests)
• Project management (projects)
• Developer data
• Documentation (pages)
• Metadata, e.g., GitHub topics
• Complete API access over REST (v3) and GraphQL (v4)
Image: Prometheus by Jan Cossiers
Application Example
• Developers' knowledge is directly encoded in the source code
• Training a Labeled Latent Dirichlet Allocation model on a corpus of GitHub projects yields a concept-specific vocabulary
• Locating the keywords of a concept in the commit history of a developer yields a skill level for that concept
Machine Learning  Cryptocurrency  Database  Server    Data Visualization
th                order           db        request   chart
tensor            crypto          table     server    series
self              binance         key       header    axis
cuda              price           name      http      pixi
model             trade           value     body      datum
license           wallet          opt       response  point
layer             exchange        sql       message   style
Related Work
Sourcerer (Linstead et al., 2009)
• Crawls source code repositories
• Applies LDA and its variant, the Author-Topic model, to them
GHTorrent (Gousios and Spinellis, 2012)
• Monitors the GitHub public event timeline
• Based on the GitHub v3 API
Boa (Dyer et al., 2015)
• Domain-specific programming language
• Runs on infrastructure that uses a Hadoop MapReduce cluster
Prometheus System
Event-driven microservice architecture
• Microservice architecture properties:
• Applications are divided into services
• Primitive communication medium (e.g., REST)
• Less standardization of technologies
• Decentralized data management
• Easier upgradability and replaceability
• Event-driven properties, explained by the purpose of the events themselves:
• Event notification
• State transfer
• Global state record
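The difference between state transfer and event notification can be illustrated with a minimal in-memory sketch. It is not the Prometheus implementation: the class `EventBackbone`, the topic names, and the event payloads are hypothetical and only stand in for a real publish/subscribe backbone.

```python
from collections import defaultdict


class EventBackbone:
    """Hypothetical in-memory stand-in for a publish/subscribe backbone."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every handler registered for the topic.
        for handler in list(self._subscribers[topic]):
            handler(event)


def run_backbone_demo():
    bus = EventBackbone()
    store = {}          # stands in for the importer's database
    notifications = []  # what a downstream service would observe

    def importer(event):
        # State transfer: the event carries the full payload, so the
        # importer does not have to call back to the fetcher.
        store[event["id"]] = event["payload"]
        # Event notification: only the identifier is announced.
        bus.publish("entity.created", {"id": event["id"]})

    bus.subscribe("response.fetched", importer)
    bus.subscribe("entity.created", notifications.append)

    bus.publish("response.fetched",
                {"id": "repo-1", "payload": {"name": "example"}})
    return store, notifications
```

The services only share topic names, which is what keeps them replaceable independently of each other.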
Figure: Prometheus architecture concept (components: GitHub, REST, Fetcher, Importer, Event Backbone, RDBMS).
GitHub Data Model
GitHub Fetcher
• Queries are similar to GitHub GraphQL queries
• Arguments can be less restrictive
• Fields can be overloaded
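One plausible reading of "less restrictive" arguments is that a job may ask for more nodes than GitHub returns in a single call; the GitHub GraphQL API accepts values from 1 to 100 for `first` and `last` on a connection. Under that assumption, a fetcher could plan the page sizes as sketched below. The function name `plan_page_sizes` is hypothetical.

```python
GITHUB_MAX_PAGE_SIZE = 100  # upper bound for `first`/`last` on a connection


def plan_page_sizes(requested, max_page_size=GITHUB_MAX_PAGE_SIZE):
    """Split a requested node count into page sizes the API accepts."""
    if requested < 0:
        raise ValueError("requested must not be negative")
    full_pages, remainder = divmod(requested, max_page_size)
    sizes = [max_page_size] * full_pages
    if remainder:
        sizes.append(remainder)
    return sizes
```

A job asking for 250 issues would then be served by three calls with `first` set to 100, 100, and 50.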
Requirements
GitHub Fetcher
• Receive and execute job descriptions in the form of GraphQL queries
• Split the query in the background
• Handle pagination until the specified number of items is reached
• Forward parameters from responses to new queries
• Publish responses
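Pagination and parameter forwarding can be sketched with GitHub's cursor-based connections: the `endCursor` of one response is forwarded as the `after` argument of the next query. This is a minimal sketch, not the fetcher's actual code. The `execute` callable is an assumed transport that returns the `data` part of a response, and `fake_execute` is a hypothetical offline stand-in for it.

```python
ISSUES_QUERY = """
query($owner: String!, $name: String!, $first: Int!, $after: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $first, after: $after) {
      nodes { number title }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""


def crawl_issues(execute, owner, name, limit, page_size=100):
    """Page through issues until `limit` nodes are collected."""
    collected = []
    cursor = None  # no cursor on the first call
    while len(collected) < limit:
        variables = {
            "owner": owner,
            "name": name,
            "first": min(page_size, limit - len(collected)),
            "after": cursor,
        }
        issues = execute(ISSUES_QUERY, variables)["repository"]["issues"]
        collected.extend(issues["nodes"])
        if not issues["pageInfo"]["hasNextPage"]:
            break
        # Forward a parameter from the response to the next query.
        cursor = issues["pageInfo"]["endCursor"]
    return collected


def fake_execute(query, variables):
    """Hypothetical transport serving seven issues from memory."""
    all_issues = [{"number": i, "title": f"issue {i}"} for i in range(1, 8)]
    start = int(variables["after"] or 0)
    end = min(start + variables["first"], len(all_issues))
    return {"repository": {"issues": {
        "nodes": all_issues[start:end],
        "pageInfo": {"hasNextPage": end < len(all_issues),
                     "endCursor": str(end)},
    }}}
```

Passing the transport as an argument keeps the loop testable without network access; a real transport would send the query to the GitHub GraphQL endpoint with an access token.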
Importer
• Receive response events
• Insert or update objects in database
• Publish create, update, and delete events for the database
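The importer's insert-or-update step can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an in-memory SQLite database, a hypothetical `issues` table, and publishes events directly from the importer through a `publish` callback, which may differ from how Prometheus derives its database events.

```python
import sqlite3


def import_issue(conn, publish, issue):
    """Insert or update one issue and publish the matching event."""
    row = conn.execute(
        "SELECT title FROM issues WHERE id = ?", (issue["id"],)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO issues (id, title) VALUES (?, ?)",
            (issue["id"], issue["title"]),
        )
        publish("created", issue["id"])
    elif row[0] != issue["title"]:
        conn.execute(
            "UPDATE issues SET title = ? WHERE id = ?",
            (issue["title"], issue["id"]),
        )
        publish("updated", issue["id"])
    # An unchanged row produces no event.
    conn.commit()


def run_importer_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (id TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    events = []

    def publish(kind, entity_id):
        events.append((kind, entity_id))

    import_issue(conn, publish, {"id": "I_1", "title": "Crash on start"})
    import_issue(conn, publish, {"id": "I_1", "title": "Crash on startup"})
    import_issue(conn, publish, {"id": "I_1", "title": "Crash on startup"})
    rows = conn.execute("SELECT id, title FROM issues").fetchall()
    return events, rows
```

Importing the same response twice is harmless in this sketch, which matters when overlapping queries return the same entity more than once.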
Figure: Prometheus architecture concept (components: GitHub, REST, Fetcher, Importer, Event Backbone, RDBMS).
Architecture
Figure: Prometheus architecture concept. Rectangles with rounded corners represent Docker containers, arrows represent events.
• Components: GitHub GraphQL Endpoint, GitHub Fetcher, Importer, Metastore, Metastore Publisher, binlog, RDBMS, REST
• Events: Created / Updated / Deleted - Entity; Created / Finished - Job / Blueprint / WP; Post / Get - Job
• Event backbone: Redis Publish / Subscribe
Ghcrawler
• Archived open-source project from Microsoft
• The Microsoft Open Source Programs Office uses it to track thousands of repositories in which Microsoft is involved
• Implements GitHub's best-practice guidelines for API access
• API token management (extends the rate limit)
• Is able to crawl most GitHub entities
Experiment
• The dashed line indicates where the issues would be crawled completely
• Prometheus is 3.5 times faster, even if ghcrawler did not have the overhead of the included pull requests
• Prometheus needs 98 API calls, ghcrawler 11,990
• ghcrawler used 11,990 of 20,000 available requests
• Prometheus used 98 of 5,000 available points
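The reported call counts can be put into proportion with a short calculation. The numbers are the ones stated on this slide; the helper `budget_share` is a hypothetical name, and note that the two budgets use different units (requests versus points).

```python
def budget_share(used, available):
    """Fraction of the available rate-limit budget that was consumed."""
    return used / available


ghcrawler_calls, ghcrawler_budget = 11_990, 20_000    # requests
prometheus_calls, prometheus_budget = 98, 5_000       # points

call_ratio = ghcrawler_calls / prometheus_calls
ghcrawler_share = budget_share(ghcrawler_calls, ghcrawler_budget)
prometheus_share = budget_share(prometheus_calls, prometheus_budget)

print(f"ghcrawler issued {call_ratio:.1f}x as many calls as Prometheus")
print(f"ghcrawler used {ghcrawler_share:.2%} of its budget")
print(f"Prometheus used {prometheus_share:.2%} of its budget")
```

ghcrawler thus issued roughly 122 times as many calls and consumed about 60% of its budget, while Prometheus consumed about 2% of its own.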
Conclusions
Contributions
• We propose a system for crawling data from public GitHub repositories based on the GraphQL API
• We propose a mechanism for crawling nested GraphQL queries
Findings
• The GitHub GraphQL API has advantages in terms of performance
• However, the structure of the query graph has a large impact on the crawling time
Future Work
• Concurrent API calls
• A more flexible splitting approach to eliminate redundant queries
Contact
• Adrian Jobst
• Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
• Tim Cech
• Willy Scheibel
• Dr. Matthias Trapp
• Prof. Dr. Jürgen Döllner
Acknowledgements
This work is part of the “Software-DNA” project, which is funded
by the European Regional Development Fund (ERDF, or EFRE in
German) and the State of Brandenburg (ILB). This work is also part of
the KMU (SME) project “KnowhowAnalyzer” (grant number
01IS20088B), which is funded by the German Federal Ministry of
Education and Research (Bundesministerium für Bildung und
Forschung).