Analysing GitHub commits with R

@BasiaFusinska
barbarafusinska.com
barbara.fusinska@gmail.com

About me
Programmer
Math enthusiast
Sweet tooth
• @BasiaFusinska
• barbarafusinska.com
• barbara.fusinska@gmail.com
https://github.com/BasiaFusinska/RTalk

Agenda
• Data analysis
• Capturing GitHub data
• Using R in Data Analysis
• Azure ML

Data analysis process
Data Capture → Raw Data → Processed Data → Exploratory Data Analysis → Data Analysis & Visualization

GitHut Visualisation
http://githut.info/

GitHub Archive
https://www.githubarchive.org/

Event Data
Data Source (Events/Logs/Files) → Store & reorganize → Query

Google BigQuery
https://cloud.google.com/bigquery/what-is-bigquery

GitHub API
https://developer.github.com/v3/

Why R?
• Created by Ross Ihaka & Robert Gentleman
• Name:
– First letter of the creators' first names
– Play on the name of the S language
– S-PLUS – commercial alternative
• Open source
• Among the most widely used languages for statistical computing

R Environment
• R project
– console environment
– http://www.r-project.org/
• IDE
– Any editor
– RStudio: http://www.rstudio.com/products/rstudio/download/

RStudio
• Editor
• Console
• Environment (variables)
• Plots, Files, Help, Packages

Distribution of active repositories per language

GitHub Archive PullRequestEvent

Task: Reading Pull Requests
1. Read the file line by line and extract only pull request events
2. Extract id and language information
3. Count and visualise language distribution
Data: 1h GitHub Archive Events from 01-01-2015, 3 PM
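
The three steps above can be sketched in base R. This is a minimal illustration, not the code from the talk: the sample lines are made up and heavily shortened, the field layout ("repo" with an "id", a "language" field inside the pull request payload) is an assumption about the 2015 GitHub Archive format, and the regular expressions stand in for a proper JSON parser such as the jsonlite package.

```r
# Hypothetical, shortened sample lines standing in for one hourly
# GitHub Archive file (real events carry many more fields).
events <- c(
  '{"id":"1","type":"PushEvent","repo":{"id":10,"name":"a/b"}}',
  '{"id":"2","type":"PullRequestEvent","repo":{"id":11,"name":"c/d"},"payload":{"pull_request":{"base":{"repo":{"language":"R"}}}}}',
  '{"id":"3","type":"PullRequestEvent","repo":{"id":12,"name":"e/f"},"payload":{"pull_request":{"base":{"repo":{"language":"Python"}}}}}',
  '{"id":"4","type":"PullRequestEvent","repo":{"id":13,"name":"g/h"},"payload":{"pull_request":{"base":{"repo":{"language":"R"}}}}}',
  '{"id":"5","type":"PullRequestEvent","repo":{"id":14,"name":"i/j"},"payload":{"pull_request":{"base":{"repo":{"language":null}}}}}'
)
path <- tempfile(fileext = ".json")
writeLines(events, path)

# 1. Read the file line by line and keep only pull request events.
lines <- readLines(path)
pull_requests <- lines[grepl('"type":"PullRequestEvent"', lines, fixed = TRUE)]

# Helper: first capture group of `pattern` per line, NA when there is no match.
first_capture <- function(lines, pattern) {
  matches <- regmatches(lines, regexec(pattern, lines))
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

# 2. Extract id and language information (assumed field layout).
repos <- data.frame(
  id = first_capture(pull_requests, '"repo":[{]"id":([0-9]+)'),
  language = first_capture(pull_requests, '"language":"([^"]+)"'),
  stringsAsFactors = FALSE
)

# 3. Count and visualise the language distribution (NA is dropped by table).
language_counts <- sort(table(repos$language), decreasing = TRUE)
if (interactive()) {
  barplot(language_counts, las = 2, main = "Pull requests per language")
}
```

The event with a null language shows up as NA in repos$language, which is the gap the later slides fill from other sources.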

Goal: Analyze ACTUAL active repositories

Language information
• Active repositories – Create, Push and PullRequest events
• Missing language information:
– Google BigQuery
– GitHub API
• Process various data sources

Different sources of data
• GitHub Archive:
– id,
– url in the form: https://api.github.com/repos/:name
– (rare cases) language
• Google BigQuery:
– no id,
– url in the form: https://github.com/:name
– language
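
Because the two sources spell the same repository differently, they need a common key before they can be compared. A minimal sketch, assuming the two url forms listed above are the only ones present; the function name repo_name is hypothetical.

```r
# Strip either prefix so both sources yield the same ":name" key.
repo_name <- function(url) {
  sub("^https://(api\\.github\\.com/repos|github\\.com)/", "", url)
}

archive_url  <- "https://api.github.com/repos/BasiaFusinska/RTalk"
bigquery_url <- "https://github.com/BasiaFusinska/RTalk"

# Both forms reduce to the same repository name.
same_repo <- repo_name(archive_url) == repo_name(bigquery_url)
```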

Task: Reading Active Repositories
1. Read the file line by line and extract only create, push and pull request events
2. Extract id and url information
3. Read Google BigQuery data from a saved file
4. Combine the repository data and the Google data based on matching urls and fill in missing language information
5. Count and visualise language distribution
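
Steps 3 to 5 could look roughly like this in base R. The two data frames are made-up stand-ins: in practice the first would come from the filtered GitHub Archive events and the second from the saved BigQuery export (for example via read.csv on a file of your choosing). Column names are assumptions.

```r
# Hypothetical repositories extracted from GitHub Archive (language mostly missing).
archive <- data.frame(
  id = c(1, 2, 3, 4),
  url = c("https://api.github.com/repos/a/x",
          "https://api.github.com/repos/b/y",
          "https://api.github.com/repos/c/z",
          "https://api.github.com/repos/d/w"),
  language = c(NA, "R", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical BigQuery export: no id, web-style url, language present.
bigquery <- data.frame(
  url = c("https://github.com/a/x", "https://github.com/c/z"),
  language = c("Python", "R"),
  stringsAsFactors = FALSE
)

# Bring the archive urls into the BigQuery form so the two sets can be joined.
archive$url <- sub("^https://api\\.github\\.com/repos/", "https://github.com/",
                   archive$url)

# Left join on url; the BigQuery language lands in "language.bq".
combined <- merge(archive, bigquery, by = "url", all.x = TRUE,
                  suffixes = c("", ".bq"))

# Fill in language only where the archive had none.
missing <- is.na(combined$language)
combined$language[missing] <- combined$language.bq[missing]
combined$language.bq <- NULL

# Count the distribution; repositories still without a language are dropped.
language_counts <- sort(table(combined$language), decreasing = TRUE)
if (interactive()) {
  barplot(language_counts, las = 2, main = "Active repositories per language")
}
```

Repositories that neither source can describe (d/w here) stay NA, which is where a per-repository GitHub API call becomes useful.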

Active repositories per language

Task: Retrieve language info for repository
• GET /repos/:owner/:repo/languages
• Owner: BasiaFusinska
• Repo: RWorkshop
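
A small sketch of how the request above might be issued from R. The helper names are hypothetical, https://api.github.com is assumed as the API root, and fetch_languages relies on the jsonlite package and network access, so only the URL builder runs offline.

```r
# Build the endpoint URL for a given owner and repository.
languages_url <- function(owner, repo, base = "https://api.github.com") {
  sprintf("%s/repos/%s/%s/languages", base, owner, repo)
}

# Hypothetical helper: download and parse the response.
# Assumes jsonlite is installed; unauthenticated calls are rate limited.
fetch_languages <- function(owner, repo) {
  if (!requireNamespace("jsonlite", quietly = TRUE)) {
    stop("the jsonlite package is needed to parse the response")
  }
  jsonlite::fromJSON(languages_url(owner, repo))
}

url <- languages_url("BasiaFusinska", "RWorkshop")
```

Calling fetch_languages("BasiaFusinska", "RWorkshop") should return a named list mapping each language to a byte count, provided the repository is reachable.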

Task: Calling GitHub Search
• GET /search/repositories
• Querying: q parameter
• Paging: page parameter
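
The q and page parameters can be combined as in the sketch below. Function names are hypothetical, the example queries are illustrative, and search_repositories assumes jsonlite plus network access.

```r
# Build a search URL: `query` goes into q, `page` selects the result page.
search_url <- function(query, page = 1L, base = "https://api.github.com") {
  sprintf("%s/search/repositories?q=%s&page=%d",
          base,
          utils::URLencode(query, reserved = TRUE),
          as.integer(page))
}

# Hypothetical helper: fetch several result pages, one request per page.
# Assumes jsonlite is installed; the search endpoint is rate limited.
search_repositories <- function(query, pages = 1:2) {
  lapply(pages, function(p) jsonlite::fromJSON(search_url(query, p)))
}

first_page  <- search_url("rstats")
second_page <- search_url("language:r", page = 2)
```

Each parsed page is expected to carry the matching repositories in its items element, so results from several pages can be bound together afterwards.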

Big Data in R
• What’s Big Data anyway?
• R processes data in memory
• Bring down only the data you need
• Stream the data from a database
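
One way to keep memory use down, sketched here for a plain text file rather than a database: read the input in fixed-size chunks and keep only the lines of interest. The function name and the chunk size are assumptions for illustration.

```r
# Read `path` in chunks of `chunk_size` lines and keep only the lines
# that contain `pattern` (matched literally, not as a regular expression).
read_filtered <- function(path, pattern, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    kept[[length(kept) + 1L]] <- chunk[grepl(pattern, chunk, fixed = TRUE)]
  }
  as.character(unlist(kept))
}

# Usage on a small made-up file: 20 push lines and 5 pull request lines.
sample_path <- tempfile()
writeLines(c(rep('{"type":"PushEvent"}', 20),
             rep('{"type":"PullRequestEvent"}', 5)),
           sample_path)
pulls <- read_filtered(sample_path, "PullRequestEvent", chunk_size = 10L)
```

Only one chunk plus the matching lines is held in memory at any time, so the approach scales to files much larger than the filtered result.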

Reading pull requests experiments

To summarize…
• Data science – not rocket science, but plenty of yak shaving
• Different sources – different truths
• Capturing & storing data
• Data science UI – visualization is the key
• Desktop – hypothesis, development
• Cloud – production

What’s next?
• Data Exploration in R, workshop
• basiafusinska.com, blog
• katacoda.com, interactive learning platform

Thank you
barbara.fusinska@gmail.com
@BasiaFusinska
barbarafusinska.com
https://github.com/BasiaFusinska/RTalk
