Efficient GitHub Crawling using the GraphQL API
ICCSA | July 5th, 2022, Málaga
Adrian Jobst, Daniel Atzberger, Tim Cech,
Willy Scheibel, Matthias Trapp, and Jürgen Döllner
Hasso-Plattner-Institute, Digital Engineering Faculty, University of Potsdam, Germany
05.07.2022
Software Analytics
“Software analytics aims to obtain insightful and actionable
information from software artifacts that help practitioners
accomplish tasks related to software development, systems, and
user.”
Zhang, D. et al. (2013). Software Analytics in Practice. Software, IEEE, 30:30–37.
2 Efficient GitHub Crawling using the GraphQL API Daniel Atzberger 05.07.2021
Software Development Process
Illustration of a software development process (Source: Seerene).
GitHub
• Heavily used platform where developers can host and
review code and manage projects
• 83 million developers
• 61 million repositories created in 2021 alone
• 170 million merged pull requests
• Wealth of data:
• Source control repository (git)
• Bug tracking (issues & pull requests)
• Project management (projects)
• Developer data
• Documentation (pages)
• Metadata, e.g., GitHub topics
• Complete API access over REST (v3) and GraphQL
(v4)
Illustration: Jan Cossiers' Prometheus.
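To make the GraphQL (v4) access mentioned above concrete, the following minimal sketch builds the request for one page of a repository's issues. It only constructs the URL, headers, and body and sends nothing. The function name `build_request`, the selected fields, and the placeholder token are illustrative assumptions and are not taken from the slides.

```python
import json
from typing import Dict, Optional, Tuple

# Public GitHub GraphQL endpoint; all queries are sent to it via POST.
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# One page of issues plus the cursor information needed to request the next page.
ISSUES_QUERY = """
query($owner: String!, $name: String!, $pageSize: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: $pageSize, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { number title state }
    }
  }
}
"""


def build_request(owner: str, name: str, token: str,
                  page_size: int = 100,
                  cursor: Optional[str] = None) -> Tuple[str, Dict[str, str], str]:
    """Return (url, headers, body) for one page of issues (hypothetical helper)."""
    if not 1 <= page_size <= 100:
        raise ValueError("GitHub connections accept 1 to 100 items per page")
    headers = {
        "Authorization": f"bearer {token}",
        "Content-Type": "application/json",
    }
    variables = {"owner": owner, "name": name,
                 "pageSize": page_size, "cursor": cursor}
    body = json.dumps({"query": ISSUES_QUERY, "variables": variables})
    return GITHUB_GRAPHQL_URL, headers, body


# "TOKEN" is a placeholder, not a real credential.
url, headers, body = build_request("octocat", "Hello-World", "TOKEN")
print(url)
print(json.loads(body)["variables"])
```

The body could then be sent with any HTTP client; a single GraphQL call like this returns only the requested fields, where REST would typically need one call per resource page.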
Application Example
• Developers' knowledge is directly encoded in the source code
• Training a Labeled Latent Dirichlet Allocation model on a corpus of GitHub projects yields
concept-specific vocabulary
• Locating the keywords of a concept in a developer's commit history yields a skill level
Machine Learning   Cryptocurrency   Database   Server     Data Visualization
th                 order            db         request    chart
tensor             crypto           table      server     series
self               binance          key        header     axis
cuda               price            name       http       pixi
model              trade            value      body       datum
license            wallet           opt        response   point
layer              exchange         sql        message    style
Related Work
Sourcerer (Linstead et al., 2009)
• Crawls source code repositories.
• Applies LDA and its variant, the Author-Topic model, to them.
GHTorrent (Gousios and Spinellis, 2012)
• Monitors the GitHub public event timeline.
• Based on the GitHub v3 API.
Boa (Dyer et al., 2015)
• Domain-specific programming language.
• Runs on infrastructure that uses a Hadoop MapReduce cluster.
Prometheus System
Event-driven microservice architecture
• Microservice architecture properties:
• Applications are divided into
services
• Simple communication medium (e.g.,
REST)
• Less standardization of technologies
• Decentralized data management
• Easier upgradability and replaceability
• Event-driven properties, explained by the
purpose of the events themselves:
• Event notification
• State transfer
• Global state record
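The three purposes of events listed above can be illustrated with a small in-memory event backbone. This is only a sketch under stated assumptions: the class `EventBackbone`, the topic names, and the payload shapes are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]


class EventBackbone:
    """In-memory stand-in for an event backbone (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        # Global state record: every published event, in publication order.
        self.log: List[Tuple[str, Dict[str, Any]]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Dict[str, Any]) -> None:
        self.log.append((topic, payload))
        for handler in list(self._subscribers[topic]):
            handler(payload)


backbone = EventBackbone()
imported: List[List[Dict[str, Any]]] = []

# An importer-like subscriber: the event itself carries the data (state transfer).
backbone.subscribe("response", lambda event: imported.append(event["nodes"]))

backbone.publish("response", {"job_id": 1, "nodes": [{"number": 42}]})
# Event notification: only signals that something happened, without the data.
backbone.publish("job.finished", {"job_id": 1})

print(imported)
print([topic for topic, _ in backbone.log])
```

Because services only share the backbone, a fetcher-like publisher and an importer-like subscriber stay decoupled and can be replaced independently.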
Prometheus architecture concept (components shown: GitHub, Fetcher, Importer, Event Backbone, RDBMS, REST interface).
GitHub Data Model
GitHub Fetcher
• Query similar to GitHub GraphQL query
• Arguments can be less restrictive
• Fields can be overloaded
Requirements
GitHub Fetcher
• Receive and execute job descriptions in
the form of GraphQL queries
• Split queries in the background
• Handle pagination until the specified number of items is
reached
• Forward parameters from responses to new
queries
• Publish responses
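The pagination requirement can be sketched as a cursor loop that stops once the requested number of items is reached or the connection is exhausted. The names `crawl_connection` and `fake_fetch_page` are hypothetical; `fake_fetch_page` simulates a connection with 250 items so that the example runs without network access.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A page is (items, cursor for the next page, or None if there is no next page).
Page = Tuple[List[Dict], Optional[str]]


def crawl_connection(fetch_page: Callable[[Optional[str], int], Page],
                     limit: int, page_size: int = 100) -> List[Dict]:
    """Follow cursors until `limit` items are collected or no page remains."""
    items: List[Dict] = []
    cursor: Optional[str] = None
    while len(items) < limit:
        want = min(page_size, limit - len(items))  # never request more than needed
        nodes, cursor = fetch_page(cursor, want)
        items.extend(nodes[:want])
        if cursor is None or not nodes:
            break
    return items


TOTAL_ITEMS = 250  # size of the simulated connection (assumption for the demo)


def fake_fetch_page(cursor: Optional[str], first: int) -> Page:
    """Stand-in for one API call; the cursor is simply the offset as a string."""
    start = int(cursor) if cursor else 0
    end = min(start + first, TOTAL_ITEMS)
    nodes = [{"number": n} for n in range(start + 1, end + 1)]
    return nodes, (str(end) if end < TOTAL_ITEMS else None)


print(len(crawl_connection(fake_fetch_page, limit=120)))
print(len(crawl_connection(fake_fetch_page, limit=1000)))
```

In a real crawler the cursor would come from the `pageInfo { hasNextPage endCursor }` fields of the previous response, which is also one way to forward parameters from a response to the next query.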
Importer
• Receive response events
• Insert or update objects in database
• Publish create, update, and delete events for
database changes
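The Importer's insert-or-update step can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration: the table layout, the function `import_issue`, and the direct `publish` callback are assumptions made for the example and are not the authors' implementation.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def import_issue(conn: sqlite3.Connection, issue: Dict[str, str],
                 publish: Callable[[str, Dict[str, str]], None]) -> None:
    """Insert a new issue or update a changed one, then publish a matching event."""
    row = conn.execute("SELECT title FROM issues WHERE id = ?",
                       (issue["id"],)).fetchone()
    if row is None:
        conn.execute("INSERT INTO issues (id, title) VALUES (?, ?)",
                     (issue["id"], issue["title"]))
        publish("created", issue)
    elif row[0] != issue["title"]:
        conn.execute("UPDATE issues SET title = ? WHERE id = ?",
                     (issue["title"], issue["id"]))
        publish("updated", issue)
    # Unchanged rows produce no write and no event.
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id TEXT PRIMARY KEY, title TEXT NOT NULL)")

events: List[Tuple[str, str]] = []


def record(kind: str, entity: Dict[str, str]) -> None:
    events.append((kind, entity["id"]))


import_issue(conn, {"id": "I_1", "title": "Crash on start"}, record)
import_issue(conn, {"id": "I_1", "title": "Crash on startup"}, record)
import_issue(conn, {"id": "I_1", "title": "Crash on startup"}, record)
print(events)
```

Comparing against the stored row before writing keeps repeated crawls idempotent, so downstream subscribers only see events for actual changes.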
Prometheus architecture concept (components shown: GitHub, Fetcher, Importer, Event Backbone, RDBMS, REST interface).
Architecture
Prometheus architecture concept. Rectangles with rounded corners represent Docker containers; arrows
represent events.
Components shown: GitHub GraphQL Endpoint, GitHub Fetcher, Importer, Metastore, binlog, Metastore Publisher, RDBMS.
Labels shown: REST (Post / Get - Job); Redis Publish / Subscribe; Created / Finished - Job / Blueprint / WP; Created / Updated / Deleted - Entity.
Ghcrawler
• Archived open-source project from
Microsoft
• The Microsoft Open Source Programs
Office uses it to track thousands of repos in
which Microsoft is involved
• Implements GitHub's best-practice
guidelines for API access
• API token management (extends the rate limit)
• Can crawl most GitHub entities
Experiment
• Dashed line indicates where
issues would be crawled
completely
• Prometheus is 3.5 times faster, even if
ghcrawler had no overhead from
included pull requests
• Prometheus needs 98 API
calls, ghcrawler 11,990
• ghcrawler used 11,990 of
20,000 available requests
• Prometheus used 98 of
5,000 points
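The figures above can be put in relation with a few lines of arithmetic. The numbers are taken from the slide; the variable names are illustrative.

```python
# Figures reported on the slide.
ghcrawler_calls, ghcrawler_budget = 11_990, 20_000       # REST requests
prometheus_points, prometheus_budget = 98, 5_000         # GraphQL rate-limit points

# How many ghcrawler requests correspond to one Prometheus call.
call_ratio = ghcrawler_calls / prometheus_points

# Share of the available rate limit each crawler consumed.
ghcrawler_share = ghcrawler_calls / ghcrawler_budget
prometheus_share = prometheus_points / prometheus_budget

print(f"ghcrawler requests per Prometheus call: {call_ratio:.1f}")
print(f"ghcrawler rate-limit usage: {ghcrawler_share:.2%}")
print(f"Prometheus rate-limit usage: {prometheus_share:.2%}")
```

ghcrawler therefore consumed about 60% of its request budget, while Prometheus consumed about 2% of its point budget, with roughly 122 times fewer calls.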
Conclusions
Contributions
• We propose a system for crawling data from public GitHub repositories based on the
GraphQL API
• We propose a mechanism for crawling nested GraphQL queries
Findings
• The GitHub GraphQL API has advantages in terms of performance.
• However, the query graph structure has a large impact on the crawling time.
Future Work
• Concurrent API calls
• More flexible splitting approach to eliminate redundant queries.
Contact
• Adrian Jobst
• Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
• Tim Cech
• Willy Scheibel
• Dr. Matthias Trapp
• Prof. Dr. Jürgen Döllner
Acknowledgements
This work is part of the “Software-DNA” project, which is funded
by the European Regional Development Fund (ERDF, or EFRE in
German) and the State of Brandenburg (ILB). This work is also part of
the SME (KMU) project “KnowhowAnalyzer” (funding code
01IS20088B), which is funded by the German Federal Ministry of
Education and Research (Bundesministerium für Bildung und
Forschung).
16 Efficient GitHub Crawling using the GraphQL API Daniel Atzberger 05.07.2021
View publication stats

More Related Content

Similar to Efficient GitHub Crawling using the GraphQL API

Future of jobs and digital economy citi conference 090618
Future of jobs and digital economy citi conference 090618Future of jobs and digital economy citi conference 090618
Future of jobs and digital economy citi conference 090618Economic Strategy Institute
 
Increase the Velocity of Your Software Releases Using GitHub and DeployHub
Increase the Velocity of Your Software Releases Using GitHub and DeployHubIncrease the Velocity of Your Software Releases Using GitHub and DeployHub
Increase the Velocity of Your Software Releases Using GitHub and DeployHubDevOps.com
 
OpenWhisk - Serverless Architecture
OpenWhisk - Serverless Architecture OpenWhisk - Serverless Architecture
OpenWhisk - Serverless Architecture Dev_Events
 
Building Community APIs using GraphQL, Neo4j, and Kotlin
Building Community APIs using GraphQL, Neo4j, and KotlinBuilding Community APIs using GraphQL, Neo4j, and Kotlin
Building Community APIs using GraphQL, Neo4j, and KotlinNeo4j
 
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle Cloud Platform - Japan
 
Data Science Meets DevOps: GitOps with OpenShift (1).pdf
Data Science Meets DevOps: GitOps with OpenShift (1).pdfData Science Meets DevOps: GitOps with OpenShift (1).pdf
Data Science Meets DevOps: GitOps with OpenShift (1).pdfHemaVeeradhi1
 
Secure Your Open Source Projects For Free
Secure Your Open Source Projects For FreeSecure Your Open Source Projects For Free
Secure Your Open Source Projects For FreeDavide Benvegnù
 
An overview of BigQuery
An overview of BigQuery An overview of BigQuery
An overview of BigQuery GirdhareeSaran
 
Serverless Apps with Open Whisk
Serverless Apps with Open Whisk Serverless Apps with Open Whisk
Serverless Apps with Open Whisk Dev_Events
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteWeaveworks
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsData Driven Innovation
 
Building A Distributed Build System at Google Scale (StrangeLoop 2016)
Building A Distributed Build System at Google Scale (StrangeLoop 2016)Building A Distributed Build System at Google Scale (StrangeLoop 2016)
Building A Distributed Build System at Google Scale (StrangeLoop 2016)Aysylu Greenberg
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Vadym Kazulkin
 
Serverless apps with OpenWhisk
Serverless apps with OpenWhiskServerless apps with OpenWhisk
Serverless apps with OpenWhiskDaniel Krook
 
API Management for GraphQL
API Management for GraphQLAPI Management for GraphQL
API Management for GraphQLWSO2
 
Azure_DevOps_Customer_201903.pptx
Azure_DevOps_Customer_201903.pptxAzure_DevOps_Customer_201903.pptx
Azure_DevOps_Customer_201903.pptxSherman37
 
Introducing GitLab (September 2018)
Introducing GitLab (September 2018)Introducing GitLab (September 2018)
Introducing GitLab (September 2018)Noa Harel
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 

Similar to Efficient GitHub Crawling using the GraphQL API (20)

Future of jobs and digital economy citi conference 090618
Future of jobs and digital economy citi conference 090618Future of jobs and digital economy citi conference 090618
Future of jobs and digital economy citi conference 090618
 
Increase the Velocity of Your Software Releases Using GitHub and DeployHub
Increase the Velocity of Your Software Releases Using GitHub and DeployHubIncrease the Velocity of Your Software Releases Using GitHub and DeployHub
Increase the Velocity of Your Software Releases Using GitHub and DeployHub
 
OpenWhisk - Serverless Architecture
OpenWhisk - Serverless Architecture OpenWhisk - Serverless Architecture
OpenWhisk - Serverless Architecture
 
Building Community APIs using GraphQL, Neo4j, and Kotlin
Building Community APIs using GraphQL, Neo4j, and KotlinBuilding Community APIs using GraphQL, Neo4j, and Kotlin
Building Community APIs using GraphQL, Neo4j, and Kotlin
 
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
 
DevOps on GCP Course Compared to AWS
DevOps on GCP Course Compared to AWSDevOps on GCP Course Compared to AWS
DevOps on GCP Course Compared to AWS
 
Data Science Meets DevOps: GitOps with OpenShift (1).pdf
Data Science Meets DevOps: GitOps with OpenShift (1).pdfData Science Meets DevOps: GitOps with OpenShift (1).pdf
Data Science Meets DevOps: GitOps with OpenShift (1).pdf
 
Secure Your Open Source Projects For Free
Secure Your Open Source Projects For FreeSecure Your Open Source Projects For Free
Secure Your Open Source Projects For Free
 
An overview of BigQuery
An overview of BigQuery An overview of BigQuery
An overview of BigQuery
 
Benchmarking of distributed linked data streaming systems
Benchmarking of distributed linked data streaming systemsBenchmarking of distributed linked data streaming systems
Benchmarking of distributed linked data streaming systems
 
Serverless Apps with Open Whisk
Serverless Apps with Open Whisk Serverless Apps with Open Whisk
Serverless Apps with Open Whisk
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event Keynote
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and Analytics
 
Building A Distributed Build System at Google Scale (StrangeLoop 2016)
Building A Distributed Build System at Google Scale (StrangeLoop 2016)Building A Distributed Build System at Google Scale (StrangeLoop 2016)
Building A Distributed Build System at Google Scale (StrangeLoop 2016)
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
 
Serverless apps with OpenWhisk
Serverless apps with OpenWhiskServerless apps with OpenWhisk
Serverless apps with OpenWhisk
 
API Management for GraphQL
API Management for GraphQLAPI Management for GraphQL
API Management for GraphQL
 
Azure_DevOps_Customer_201903.pptx
Azure_DevOps_Customer_201903.pptxAzure_DevOps_Customer_201903.pptx
Azure_DevOps_Customer_201903.pptx
 
Introducing GitLab (September 2018)
Introducing GitLab (September 2018)Introducing GitLab (September 2018)
Introducing GitLab (September 2018)
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 

More from Matthias Trapp

Interactive Control over Temporal Consistency while Stylizing Video Streams
Interactive Control over Temporal Consistency while Stylizing Video StreamsInteractive Control over Temporal Consistency while Stylizing Video Streams
Interactive Control over Temporal Consistency while Stylizing Video StreamsMatthias Trapp
 
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...Matthias Trapp
 
A Framework for Interactive 3D Photo Stylization Techniques on Mobile Devices
A Framework for Interactive 3D Photo Stylization Techniques on Mobile DevicesA Framework for Interactive 3D Photo Stylization Techniques on Mobile Devices
A Framework for Interactive 3D Photo Stylization Techniques on Mobile DevicesMatthias Trapp
 
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...Matthias Trapp
 
A Service-based Preset Recommendation System for Image Stylization Applications
A Service-based Preset Recommendation System for Image Stylization ApplicationsA Service-based Preset Recommendation System for Image Stylization Applications
A Service-based Preset Recommendation System for Image Stylization ApplicationsMatthias Trapp
 
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...Matthias Trapp
 
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...Matthias Trapp
 
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdfCodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdfMatthias Trapp
 
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic Visualization
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic VisualizationNon-Photorealistic Rendering of 3D Point Clouds for Cartographic Visualization
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic VisualizationMatthias Trapp
 
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...Matthias Trapp
 
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...Matthias Trapp
 
Web-based and Mobile Provisioning of Virtual 3D Reconstructions
Web-based and Mobile Provisioning of Virtual 3D ReconstructionsWeb-based and Mobile Provisioning of Virtual 3D Reconstructions
Web-based and Mobile Provisioning of Virtual 3D ReconstructionsMatthias Trapp
 
Visualization of Knowledge Distribution across Development Teams using 2.5D S...
Visualization of Knowledge Distribution across Development Teams using 2.5D S...Visualization of Knowledge Distribution across Development Teams using 2.5D S...
Visualization of Knowledge Distribution across Development Teams using 2.5D S...Matthias Trapp
 
Real-time Screen-space Geometry Draping for 3D Digital Terrain Models
Real-time Screen-space Geometry Draping for 3D Digital Terrain ModelsReal-time Screen-space Geometry Draping for 3D Digital Terrain Models
Real-time Screen-space Geometry Draping for 3D Digital Terrain ModelsMatthias Trapp
 
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & Morphing
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & MorphingFERMIUM - A Framework for Real-time Procedural Point Cloud Animation & Morphing
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & MorphingMatthias Trapp
 
Interactive Editing of Signed Distance Fields
Interactive Editing of Signed Distance FieldsInteractive Editing of Signed Distance Fields
Interactive Editing of Signed Distance FieldsMatthias Trapp
 
Integration of Image Processing Techniques into the Unity Game Engine
Integration of Image Processing Techniques into the Unity Game EngineIntegration of Image Processing Techniques into the Unity Game Engine
Integration of Image Processing Techniques into the Unity Game EngineMatthias Trapp
 
Interactive GPU-based Image Deformation for Mobile Devices
Interactive GPU-based Image Deformation for Mobile DevicesInteractive GPU-based Image Deformation for Mobile Devices
Interactive GPU-based Image Deformation for Mobile DevicesMatthias Trapp
 
Interactive Photo Editing on Smartphones via Intrinsic Decomposition
Interactive Photo Editing on Smartphones via Intrinsic DecompositionInteractive Photo Editing on Smartphones via Intrinsic Decomposition
Interactive Photo Editing on Smartphones via Intrinsic DecompositionMatthias Trapp
 
Service-based Analysis and Abstraction for Content Moderation of Digital Images
Service-based Analysis and Abstraction for Content Moderation of Digital ImagesService-based Analysis and Abstraction for Content Moderation of Digital Images
Service-based Analysis and Abstraction for Content Moderation of Digital ImagesMatthias Trapp
 

More from Matthias Trapp (20)

Interactive Control over Temporal Consistency while Stylizing Video Streams
Interactive Control over Temporal Consistency while Stylizing Video StreamsInteractive Control over Temporal Consistency while Stylizing Video Streams
Interactive Control over Temporal Consistency while Stylizing Video Streams
 
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...
A Framework for Art-directed Augmentation of Human Motion in Videos on Mobile...
 
A Framework for Interactive 3D Photo Stylization Techniques on Mobile Devices
A Framework for Interactive 3D Photo Stylization Techniques on Mobile DevicesA Framework for Interactive 3D Photo Stylization Techniques on Mobile Devices
A Framework for Interactive 3D Photo Stylization Techniques on Mobile Devices
 
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...
ALIVE-Adaptive Chromaticity for Interactive Low-light Image and Video Enhance...
 
A Service-based Preset Recommendation System for Image Stylization Applications
A Service-based Preset Recommendation System for Image Stylization ApplicationsA Service-based Preset Recommendation System for Image Stylization Applications
A Service-based Preset Recommendation System for Image Stylization Applications
 
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...
Design Space of Geometry-based Image Abstraction Techniques with Vectorizatio...
 
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...
 
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdfCodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf
CodeCV - Mining Expertise of GitHub Users from Coding Activities - Online.pdf
 
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic Visualization
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic VisualizationNon-Photorealistic Rendering of 3D Point Clouds for Cartographic Visualization
Non-Photorealistic Rendering of 3D Point Clouds for Cartographic Visualization
 
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...
TWIN4ROAD - Erfassung Analyse und Auswertung mobiler Multi Sensorik im Strass...
 
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...
Interactive Close-Up Rendering for Detail+Overview Visualization of 3D Digita...
 
Web-based and Mobile Provisioning of Virtual 3D Reconstructions
Web-based and Mobile Provisioning of Virtual 3D ReconstructionsWeb-based and Mobile Provisioning of Virtual 3D Reconstructions
Web-based and Mobile Provisioning of Virtual 3D Reconstructions
 
Visualization of Knowledge Distribution across Development Teams using 2.5D S...
Visualization of Knowledge Distribution across Development Teams using 2.5D S...Visualization of Knowledge Distribution across Development Teams using 2.5D S...
Visualization of Knowledge Distribution across Development Teams using 2.5D S...
 
Real-time Screen-space Geometry Draping for 3D Digital Terrain Models
Real-time Screen-space Geometry Draping for 3D Digital Terrain ModelsReal-time Screen-space Geometry Draping for 3D Digital Terrain Models
Real-time Screen-space Geometry Draping for 3D Digital Terrain Models
 
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & Morphing
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & MorphingFERMIUM - A Framework for Real-time Procedural Point Cloud Animation & Morphing
FERMIUM - A Framework for Real-time Procedural Point Cloud Animation & Morphing
 
Interactive Editing of Signed Distance Fields
Interactive Editing of Signed Distance FieldsInteractive Editing of Signed Distance Fields
Interactive Editing of Signed Distance Fields
 
Integration of Image Processing Techniques into the Unity Game Engine
Integration of Image Processing Techniques into the Unity Game EngineIntegration of Image Processing Techniques into the Unity Game Engine
Integration of Image Processing Techniques into the Unity Game Engine
 
Interactive GPU-based Image Deformation for Mobile Devices
Interactive GPU-based Image Deformation for Mobile DevicesInteractive GPU-based Image Deformation for Mobile Devices
Interactive GPU-based Image Deformation for Mobile Devices
 
Interactive Photo Editing on Smartphones via Intrinsic Decomposition
Interactive Photo Editing on Smartphones via Intrinsic DecompositionInteractive Photo Editing on Smartphones via Intrinsic Decomposition
Interactive Photo Editing on Smartphones via Intrinsic Decomposition
 
Service-based Analysis and Abstraction for Content Moderation of Digital Images
Service-based Analysis and Abstraction for Content Moderation of Digital ImagesService-based Analysis and Abstraction for Content Moderation of Digital Images
Service-based Analysis and Abstraction for Content Moderation of Digital Images
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 

Efficient GitHub Crawling using the GraphQL API

  • 1. Efficient GitHub Crawling using the GraphQL API ICCSA | July 5th 2022, Málaga Adrian Jobst, Daniel Atzberger, Tim Cech, Willy Scheibel, Matthias Trapp, and Jürgen Döllner Hasso-Plattner-Institute, Digital Engineering Faculty, University of Potsdam, Germany 05.07.2022
  • 2. Software Analytics
      “Software analytics aims to obtain insightful and actionable information from software artifacts that help practitioners accomplish tasks related to software development, systems, and user.”
      Zhang, D. et al. (2013). Software Analytics in Practice. Software, IEEE, 30:30–37.
      2 Efficient GitHub Crawling using the GraphQL API Daniel Atzberger 05.07.2021
  • 3. Software Development Process
      [Figure: Illustration of a software development process (Source: Seerene).]
  • 4. GitHub
      • Heavily used platform where developers can host and review code and manage projects
      • 83 million developers
      • 61 million repositories created (in 2021 alone)
      • 170 million merged pull requests
      • Wealth of data:
          • Source control repository (git)
          • Bug tracking (issues & pull requests)
          • Project management (projects)
          • Developer data
          • Documentation (pages)
          • Metadata, e.g., GitHub topics
      • Complete API access over REST (v3) and GraphQL (v4)
      [Figure: Jan Cossiers’ Prometheus]
  • 5. Application Example
      • Developers' knowledge is directly encoded in the source code
      • Training a Labeled Latent Dirichlet Allocation model on a corpus of GitHub projects leads to concept-specific vocabulary
      • Locating the keywords of a concept in the commit history of a developer yields a skill level

      Machine Learning   Cryptocurrency   Database   Server     Data Visualization
      th                 order            db         request    chart
      tensor             crypto           table      server     series
      self               binance          key        header     axis
      cuda               price            name       http       pixi
      model              trade            value      body       datum
      license            wallet           opt        response   point
      layer              exchange         sql        message    style
  • 6. Related Work
      Sourcerer, Linstead et al. (2009)
          • Crawls source code repositories.
          • Applies LDA and its variant, the Author-topic model, to them.
      GHTorrent, Gousios, G. and Spinellis, D. (2012)
          • Monitors the GitHub public event timeline.
          • Based on the GitHub v3 API.
      Boa, Dyer et al. (2015)
          • Domain-specific programming language.
          • Runs on infrastructure that uses a Hadoop MapReduce cluster.
  • 7. Prometheus System
      Event-driven microservice architecture
      • Microservice architecture properties:
          • Applications are divided into services
          • Primitive communication medium (e.g., REST)
          • Less standardization of technologies
          • Decentralized data management
          • Easier upgradability and replaceability
      • Event-driven properties, explained by the purpose of the events themselves:
          • Event notification
          • State transfer
          • Global state record
      [Figure: Prometheus architecture concept. Labels: RDBMS, Event Backbone, Fetcher, Importer, REST, GitHub.]
  • 8. GitHub Data Model
  • 9. GitHub Data Model
  • 10. GitHub Fetcher
      • The query is similar to a GitHub GraphQL query
      • Arguments can be less restrictive
      • Fields can be overloaded
  • 11. Requirements
      GitHub Fetcher
          • Receive and execute job descriptions in the form of GraphQL queries
          • Split the query in the background
          • Handle pagination until the specified number is reached
          • Forward parameters from responses to new queries
          • Publish responses
      Importer
          • Receive response events
          • Insert or update objects in the database
          • Publish create, update, and delete events for the database
      [Figure: Prometheus architecture concept. Labels: RDBMS, Event Backbone, Fetcher, Importer, REST, GitHub.]
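The pagination requirement on the GitHub Fetcher can be illustrated with a short sketch. This is not the Prometheus implementation: `crawl_connection`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a real call to the GraphQL endpoint so that the example runs offline. It assumes cursor-based pagination in which every response carries a cursor for the next page.

```python
from typing import Callable, List, Optional, Tuple

# A page is (nodes, cursor for the next page or None if this was the last page).
Page = Tuple[List[dict], Optional[str]]


def crawl_connection(
    fetch_page: Callable[[Optional[str], int], Page],
    limit: int,
    page_size: int = 100,
) -> List[dict]:
    """Follow cursors until `limit` nodes are collected or the data runs out."""
    nodes: List[dict] = []
    cursor: Optional[str] = None
    while len(nodes) < limit:
        # Never request more than is still needed to reach the limit.
        want = min(page_size, limit - len(nodes))
        batch, cursor = fetch_page(cursor, want)
        nodes.extend(batch)
        if cursor is None or not batch:
            break  # last page reached
    return nodes


def fake_fetch_page(after: Optional[str], first: int) -> Page:
    """Hypothetical stand-in for one API call: serves 250 numbered issues."""
    total = 250
    start = int(after) if after is not None else 0
    end = min(start + first, total)
    batch = [{"number": i} for i in range(start, end)]
    return batch, (str(end) if end < total else None)


issues = crawl_connection(fake_fetch_page, limit=120)
```

Here the first call asks for 100 nodes and the second for the remaining 20, so the crawl stops as soon as the specified number is reached.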
  • 12. Architecture
      [Figure: Prometheus architecture concept. Rectangles with rounded corners represent Docker containers, arrows represent events. Labels: Importer, GitHub GraphQL Endpoint, Metastore Importer, binlog, Metastore Publisher, RDBMS, Created / Updated / Deleted - Entity, GitHub Fetcher, REST, Created / Finished - Job / Blueprint / WP, Post / Get - Job, Redis Publish / Subscribe.]
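The publish/subscribe flow between fetcher, importer, and database events can be sketched with an in-memory stand-in for the event backbone. This is only an illustration under stated assumptions: the `EventBackbone` class, the channel names, and the dictionary used as a database are all hypothetical, and no Redis client is involved.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBackbone:
    """Minimal in-memory publish/subscribe bus (stand-in for a real broker)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, event: dict) -> None:
        # Deliver the event to every handler registered on this channel.
        for handler in list(self._subscribers[channel]):
            handler(event)


bus = EventBackbone()
database: Dict[str, dict] = {}      # hypothetical stand-in for the RDBMS
log: List[Tuple[str, str]] = []     # records the entity events that were seen


def importer(response: dict) -> None:
    """Insert or update the object, then publish the matching entity event."""
    key = response["id"]
    kind = "updated" if key in database else "created"
    database[key] = response
    bus.publish(f"entity.{kind}", {"id": key})


bus.subscribe("fetcher.response", importer)
bus.subscribe("entity.created", lambda e: log.append(("created", e["id"])))
bus.subscribe("entity.updated", lambda e: log.append(("updated", e["id"])))

# The fetcher publishes two responses for the same issue.
bus.publish("fetcher.response", {"id": "issue-1", "title": "First"})
bus.publish("fetcher.response", {"id": "issue-1", "title": "First (edited)"})
```

The first response leads to a created event and the second to an updated event, which mirrors the insert-or-update behaviour listed for the Importer.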
  • 13. Ghcrawler
      • Archived open source project from Microsoft
      • The Microsoft Open Source Programs Office uses it to track thousands of repositories in which Microsoft is involved
      • Implements GitHub's best-practice guidelines for API access
      • API token management (extends the rate limit)
      • Is able to crawl most GitHub entities
  • 14. Experiment
      • The dashed line indicates where the issues would be crawled completely
      • 3.5 times faster, even if ghcrawler had no overhead from the included pull requests
      • Prometheus needs 98 API calls, ghcrawler 11,990
      • Ghcrawler used 11,990 of 20,000 available requests
      • Prometheus used 98 of 5,000 available points
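The request figures from the experiment can be put side by side with a few lines of arithmetic. The variable names are illustrative and the numbers are the ones reported on the slide. The two budgets use different units (requests for ghcrawler, points for Prometheus), so the shares are only a rough comparison.

```python
# Figures reported on the experiment slide.
ghcrawler_calls = 11_990
ghcrawler_budget = 20_000    # available requests
prometheus_calls = 98
prometheus_budget = 5_000    # available points

# How many ghcrawler calls correspond to one Prometheus call.
call_ratio = ghcrawler_calls / prometheus_calls

# Fraction of each budget consumed by the crawl.
ghcrawler_share = ghcrawler_calls / ghcrawler_budget
prometheus_share = prometheus_calls / prometheus_budget

print(f"call ratio: {call_ratio:.1f}")
print(f"ghcrawler budget used: {ghcrawler_share:.2%}")
print(f"Prometheus budget used: {prometheus_share:.2%}")
```

Ghcrawler issues roughly 122 calls for every Prometheus call and consumes about 60% of its budget, while Prometheus stays at about 2%.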
  • 15. Conclusions
      Contributions
          • We propose a system for crawling data from public GitHub repositories based on the GraphQL API.
          • We propose a mechanism for crawling nested GraphQL queries.
      Findings
          • The GitHub GraphQL API has advantages in terms of performance.
          • However, the query graph structure has a large impact on the crawling time.
      Future Work
          • Concurrent API calls.
          • A more flexible splitting approach to eliminate redundant queries.
  • 16. Contact
      • Adrian Jobst
      • Daniel Atzberger, daniel.atzberger@hpi.uni-potsdam.de
      • Tim Cech
      • Willy Scheibel
      • Dr. Matthias Trapp
      • Prof. Dr. Jürgen Döllner
      Acknowledgements
      This work is part of the “Software-DNA” project, which is funded by the European Regional Development Fund (ERDF, or EFRE in German) and the State of Brandenburg (ILB). This work is also part of the SME project “KnowhowAnalyzer” (funding code 01IS20088B), which is funded by the German Ministry for Education and Research (Bundesministerium für Bildung und Forschung).