Analysing GitHub commits with R
@BasiaFusinska
barbarafusinska.com
barbara.fusinska@gmail.com
https://github.com/BasiaFusinska/RTalk
About me
Programmer
Math enthusiast
Sweet tooth
Goals
Agenda
• Data analysis
• Capturing GitHub data
• Using R in Data Analysis
– R basics
– Data exploration & processing
– I/O operations
• Azure ML:
– Datasets
– Experiments
Data analysis process
Data Capture
Raw Data
Processed Data
Exploratory Data Analysis
Data Analysis & Visualization
What do we want to find out?
Language statistics
GitHut Visualisation
http://githut.info/
Data Capture
GitHub Archive
https://www.githubarchive.org/
Event Data
Data Source
(Events/Logs/Files)
Store & reorganize
Query
Google BigQuery
https://cloud.google.com/bigquery/what-is-bigquery
GitHub API
https://developer.github.com/v3/
Why R?
• Ross Ihaka & Robert Gentleman
• Name:
– First letter of the authors' first names
– Play on the name of S
– S-PLUS – commercial alternative
• Open source
• A leading language for statistical computing
Development environment
R Environment
• R project
– console environment
– http://www.r-project.org/
• IDE
– Any editor
– RStudio
http://www.rstudio.com/products/rstudio/download/
RStudio
Editor
Console
Environment
variables
Plots
Files
Help
Packages
R Basics
Filtering
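A minimal sketch of filtering in base R. The data frame below is made up for illustration; its column names and values are assumptions, not data from the talk.

```r
# Hypothetical sample data; names and values are made up for illustration.
repos <- data.frame(
  name     = c("alpha", "beta", "gamma", "delta"),
  language = c("R", "Python", NA, "R"),
  pushes   = c(12, 3, 7, 25),
  stringsAsFactors = FALSE
)

# A comparison on a column yields a logical vector.
busy <- repos$pushes > 5

# Filter rows with a logical index; keep all columns.
busy_repos <- repos[busy, ]

# Combine conditions; which() drops rows where the test is NA.
busy_r <- repos[which(repos$language == "R" & repos$pushes > 5), ]

# subset() is a convenient equivalent for interactive use.
r_names <- subset(repos, language == "R", select = name)
```

Wrapping the condition in which() matters when a column contains NA: a bare logical index with NA would otherwise produce rows filled with NA.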
Goal: Language distribution
Distribution of active repositories per language
What is an active repository?
Source of truth
GitHub Archive CreateEvent
GitHub Archive PushEvent
GitHub Archive PullRequestEvent
Task: Reading Pull Requests
1. Read the file line by line and extract only pull request events
2. Extract id and language information
3. Count and visualise language distribution
Data: 1h GitHub Archive Events from 01-01-2015, 3 PM
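Steps 1 and 2 above could be sketched as follows. This is an illustration, not the code from the talk: it assumes the jsonlite package is installed, and it assumes each archive line is a JSON object with `type`, `repo$id` and `payload$pull_request$base$repo$language` fields. The three sample lines are hand-written stand-ins for a real archive file.

```r
library(jsonlite)

# Hand-written sample events; the field layout is an assumption
# about the GitHub Archive format, not taken from the slides.
sample_lines <- c(
  '{"type":"PushEvent","repo":{"id":1,"name":"a/x"},"payload":{}}',
  '{"type":"PullRequestEvent","repo":{"id":2,"name":"b/y"},"payload":{"pull_request":{"base":{"repo":{"language":"R"}}}}}',
  '{"type":"PullRequestEvent","repo":{"id":3,"name":"c/z"},"payload":{"pull_request":{"base":{"repo":{"language":null}}}}}'
)
events_file <- tempfile(fileext = ".json")
writeLines(sample_lines, events_file)

# Step 1: read line by line, keep only pull request events.
lines <- readLines(events_file)
events <- lapply(lines, fromJSON)
is_pr <- vapply(events, function(e) identical(e$type, "PullRequestEvent"),
                logical(1))
pull_requests <- events[is_pr]

# Step 2: extract id and language; a missing language becomes NA.
get_language <- function(e) {
  lang <- e$payload$pull_request$base$repo$language
  if (is.null(lang) || is.na(lang)) NA_character_ else as.character(lang)
}
pr_data <- data.frame(
  id       = vapply(pull_requests, function(e) e$repo$id, numeric(1)),
  language = vapply(pull_requests, get_language, character(1)),
  stringsAsFactors = FALSE
)
```

For a compressed archive file, readLines(gzfile(path)) would likely replace the plain readLines() call.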
Reading Events
Read Pull Requests
Unique data
Language information
Language information output
Missing information
Omitting information
Plotting
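The steps named on the preceding slides (unique data, missing information, omitting information, plotting) might look like this in base R. The input data frame is hypothetical and stands in for the extracted pull request data.

```r
# Hypothetical pull request data: one row per event, so ids can repeat.
pr_data <- data.frame(
  id       = c(1, 1, 2, 3, 4, 5),
  language = c("R", "R", "Python", NA, "JavaScript", "Python"),
  stringsAsFactors = FALSE
)

# Unique data: keep one row per repository id.
repos <- pr_data[!duplicated(pr_data$id), ]

# Missing information: count repositories without a language.
n_missing <- sum(is.na(repos$language))

# Omitting information: drop rows that contain NA values.
repos <- na.omit(repos)

# Count repositories per language and sort, most common first.
language_counts <- sort(table(repos$language), decreasing = TRUE)

# Plotting: bar chart of the distribution.
barplot(language_counts, las = 2, main = "Repositories per language")
```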
Now everything is sorted…
Goal: Analyze ACTUAL active repositories
Missing data
Language information
• Active repositories: Create, Push and PullRequest events
• Missing language information:
– Google BigQuery
– GitHub API
• Process various data sources
Google BigQuery
Different sources of data
• GitHub Archive:
– id,
– url in the form: https://api.github.com/repos/:name
– (rare cases) language
• Google BigQuery:
– no id,
– url in the form: https://github.com/:name
– language
Task: Reading Active Repositories
1. Read the file line by line and extract only create, push and pull request events
2. Extract id and url information
3. Read Google BigQuery data from a saved file
4. Combine repository data and Google data based on the same url and fill in missing language information
5. Count and visualise language distribution
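Step 4 is the least obvious one, because the two sources use different url forms. A sketch under stated assumptions: both data frames below are made up, and in practice the BigQuery data would probably come from read.csv() on the saved file.

```r
# Hypothetical GitHub Archive extract: id, API-style url, language mostly missing.
archive <- data.frame(
  id       = c(1, 2, 3),
  url      = c("https://api.github.com/repos/a/x",
               "https://api.github.com/repos/b/y",
               "https://api.github.com/repos/c/z"),
  language = c(NA, "R", NA),
  stringsAsFactors = FALSE
)

# Hypothetical BigQuery export: no id, web-style url, language present.
bigquery <- data.frame(
  url      = c("https://github.com/a/x", "https://github.com/b/y"),
  language = c("Python", "R"),
  stringsAsFactors = FALSE
)

# Bring both sources to the same url form before joining.
archive$url <- sub("^https://api\\.github\\.com/repos/",
                   "https://github.com/", archive$url)

# Left join on url; the BigQuery language column gets the ".bq" suffix.
combined <- merge(archive, bigquery, by = "url", all.x = TRUE,
                  suffixes = c("", ".bq"))

# Fill in missing language information from the BigQuery column.
missing <- is.na(combined$language)
combined$language[missing] <- combined$language.bq[missing]
combined$language.bq <- NULL
```

Repositories absent from both sources keep NA as their language and can be omitted or looked up through the GitHub API.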
Read GitHub Archive
Read Google Data
Repository data
Various data sources
Combining data
Sorted data
Active repositories per language
Goal: Using GitHub API
Task: Retrieve language info for a repository
• GET /repos/:owner/:repo/languages
• Owner: BasiaFusinska
• Repo: RWorkshop
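One possible way to make this call from R, assuming the jsonlite package is installed. The helper names `languages_url` and `get_languages` are hypothetical. The actual request is left commented out because it needs network access and is subject to GitHub rate limits.

```r
library(jsonlite)

# Build the endpoint URL for a repository's language breakdown.
languages_url <- function(owner, repo) {
  sprintf("https://api.github.com/repos/%s/%s/languages", owner, repo)
}

# Fetch and parse the response (requires network access).
get_languages <- function(owner, repo) {
  fromJSON(languages_url(owner, repo))
}

url <- languages_url("BasiaFusinska", "RWorkshop")
# languages <- get_languages("BasiaFusinska", "RWorkshop")
```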
GitHub API from R
Task: Calling GitHub Search
• GET /search/repositories
• Querying: q parameter
• Paging: page parameter
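A small sketch of building search requests with the q and page parameters. The helper `search_url` and the example query are assumptions made for illustration; fetching each URL would additionally need network access.

```r
# Build a search URL; the query string is URL-encoded.
search_url <- function(query, page = 1) {
  sprintf("https://api.github.com/search/repositories?q=%s&page=%d",
          URLencode(query, reserved = TRUE), as.integer(page))
}

# Hypothetical example: URLs for the first three result pages.
first_pages <- vapply(1:3, function(p) search_url("language:r", p),
                      character(1))
```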
GitHub API Search
Digression
Goal: Measure productivity
Push events across a week
Task: Gather & Save Week Data
1. Read the files line by line and count push events for every day
2. Fill the retrieved data into a data frame
3. Save the data to a CSV file
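The three steps could be sketched like this. The file names, the JSON layout and the simple string match for push events are all assumptions; two tiny generated files stand in for the real archive files.

```r
# Create two tiny stand-in event files; names and content are illustrative.
data_dir <- tempfile("events")
dir.create(data_dir)
writeLines(c('{"type":"PushEvent"}', '{"type":"CreateEvent"}',
             '{"type":"PushEvent"}'),
           file.path(data_dir, "2015-01-05-15.json"))
writeLines(c('{"type":"PushEvent"}', '{"type":"WatchEvent"}'),
           file.path(data_dir, "2015-01-06-15.json"))

# Step 1: read each file line by line and count push events.
count_pushes <- function(path) {
  lines <- readLines(path)
  sum(grepl('"type":"PushEvent"', lines, fixed = TRUE))
}
files <- list.files(data_dir, pattern = "\\.json$", full.names = TRUE)

# Step 2: fill the results into a data frame (date taken from the file name).
week_data <- data.frame(
  date   = as.Date(substr(basename(files), 1, 10)),
  pushes = vapply(files, count_pushes, numeric(1), USE.NAMES = FALSE)
)

# Step 3: save the data to a CSV file.
csv_file <- file.path(data_dir, "push_events.csv")
write.csv(week_data, csv_file, row.names = FALSE)
```

Matching on the raw text is fast but fragile; parsing each line as JSON and checking its type field would be the more robust choice.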
Reading multiple files
Save to file
Push Events (CSV File)
Exercise: Analyse the week
1. Read activity data from the file.
2. Define a new column holding the part of the day.
3. Calculate the mean number of pushes for every part of the day.
4. Compare and visualize data for Monday, Wednesday and Friday.
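One way to approach the exercise in base R. The column names (day, hour, pushes), the generated values and the boundaries for the parts of the day are assumptions; the real activity file may be laid out differently.

```r
# Hypothetical activity data: one row per weekday and hour; values are made up.
activity <- expand.grid(hour = 0:23,
                        day = c("Monday", "Wednesday", "Friday"),
                        stringsAsFactors = FALSE)
activity$pushes <- 100 + 10 * activity$hour
csv_file <- tempfile(fileext = ".csv")
write.csv(activity, csv_file, row.names = FALSE)

# 1. Read activity data from the file.
week <- read.csv(csv_file, stringsAsFactors = FALSE)

# 2. Define a new column holding the part of the day (assumed boundaries).
week$part <- cut(week$hour, breaks = c(0, 6, 12, 18, 24), right = FALSE,
                 labels = c("night", "morning", "afternoon", "evening"))

# 3. Mean number of pushes for every day and part of the day.
means <- aggregate(pushes ~ day + part, data = week, FUN = mean)

# 4. Compare the three days side by side.
wide <- xtabs(pushes ~ day + part, data = means)
barplot(wide, beside = TRUE, legend.text = rownames(wide),
        main = "Mean pushes by part of the day")
```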
Read week data
Analyse week
Analyse week output
Plot week data
Push Events during the week
Thank you
barbara.fusinska@gmail.com
@BasiaFusinska
barbarafusinska.com
https://github.com/BasiaFusinska/RTalk
