15. Why R?
• Ross Ihaka & Robert Gentleman
• Name:
– First letter of the creators' first names
– Play on the name of the S language
– S-PLUS – commercial alternative
• Open source
• No. 1 for statistical computing
17. R Environment
• R project
– console environment
– http://www.r-project.org/
• IDE
– Any editor
– RStudio: http://www.rstudio.com/products/rstudio/download/
32. Task: Reading Pull Requests
1. Read the file line by line and extract only pull request events
2. Extract id and language information
3. Count and visualise language distribution
Data: 1 hour of GitHub Archive events from 01-01-2015, 3 PM
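The three steps of this task can be sketched as follows. This is only an illustration, written in Python rather than the workshop's R. It assumes each line of the archive file is one JSON event and that the repository id and language sit under payload.pull_request.base.repo, which should be checked against the real data. The function names and the sample events are invented for the example.

```python
import json
from collections import Counter


def pull_request_rows(lines):
    """Step 1 and 2: keep only pull request events, return (id, language) pairs."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") != "PullRequestEvent":
            continue
        # Assumed field path; verify it against the actual archive file.
        payload = event.get("payload") or {}
        pull_request = payload.get("pull_request") or {}
        base = pull_request.get("base") or {}
        repo = base.get("repo") or {}
        rows.append((repo.get("id"), repo.get("language")))
    return rows


def language_counts(rows):
    """Step 3: count how often each language occurs, ignoring missing values."""
    return Counter(language for _, language in rows if language is not None)


def make_event(event_type, repo_id, language):
    """Build one invented sample line in the assumed shape."""
    repo = {"id": repo_id, "language": language}
    return json.dumps(
        {"type": event_type, "payload": {"pull_request": {"base": {"repo": repo}}}}
    )


# Invented sample lines standing in for the one-hour archive file.
sample_lines = [
    make_event("PullRequestEvent", 1, "R"),
    json.dumps({"type": "PushEvent", "payload": {}}),
    make_event("PullRequestEvent", 2, "Python"),
    make_event("PullRequestEvent", 3, "R"),
]

rows = pull_request_rows(sample_lines)
counts = language_counts(rows)

# A text bar chart as a stand-in for a real plot.
for language, n in counts.most_common():
    print(f"{language:<12}{'#' * n} {n}")
```

With a real file, the sample list would be replaced by the open file handle, which also yields one line at a time.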
45. Language information
• Active repositories – those with Create, Push and PullRequest events
• Missing language information can be looked up in:
– Google BigQuery
– GitHub API
• Process various data sources
47. Different sources of data
• GitHub Archive:
– id,
– url in the form:
https://api.github.com/repos/:name
– (rare cases) language
• Google BigQuery:
– no id,
– url in the form:
https://github.com/:name
– language
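Because the two sources write the same repository with different url prefixes, they need a common key before they can be compared. A minimal sketch in Python (used here only for illustration) that reduces both forms to the :name part; the function name and the example repository are hypothetical.

```python
API_PREFIX = "https://api.github.com/repos/"  # GitHub Archive form
WEB_PREFIX = "https://github.com/"            # Google BigQuery form


def repo_name(url):
    """Return the ':name' part of a repository url, or None if the form is unknown."""
    for prefix in (API_PREFIX, WEB_PREFIX):
        if url.startswith(prefix):
            return url[len(prefix):].rstrip("/")
    return None


# Both forms of the same (hypothetical) repository give the same key.
archive_key = repo_name("https://api.github.com/repos/alice/stats")
bigquery_key = repo_name("https://github.com/alice/stats")
print(archive_key == bigquery_key)
```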
48. Task: Reading Active Repositories
1. Read the file line by line and extract only create, push and pull request events
2. Extract id and url information
3. Read Google BigQuery data from saved file
4. Combine repositories data and Google data based on matching urls and fill in missing language information
5. Count and visualise language distribution
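Steps 4 and 5 amount to a lookup join on the url. The sketch below, in Python for illustration only, rewrites the archive url into the BigQuery form, fills in a language where the archive record has none, and counts the result. All record layouts, function names and sample repositories are assumptions made for the example, not the workshop's actual data.

```python
from collections import Counter


def to_web_url(api_url):
    """Rewrite an archive-style url into the BigQuery-style form."""
    return api_url.replace("https://api.github.com/repos/", "https://github.com/")


def combine(archive_repos, bigquery_rows):
    """Keep one record per repository id and fill in missing languages."""
    language_by_url = {row["url"]: row["language"] for row in bigquery_rows}
    unique = {}
    for repo in archive_repos:
        unique.setdefault(repo["id"], repo)  # first event per repository wins
    combined = []
    for repo in unique.values():
        # Prefer the archive's own language, fall back to the BigQuery lookup.
        language = repo.get("language") or language_by_url.get(to_web_url(repo["url"]))
        combined.append({**repo, "language": language})
    return combined


# Invented sample data in the assumed shapes.
archive_repos = [
    {"id": 1, "url": "https://api.github.com/repos/alice/stats", "language": None},
    {"id": 2, "url": "https://api.github.com/repos/bob/webapp", "language": "JavaScript"},
    {"id": 1, "url": "https://api.github.com/repos/alice/stats", "language": None},
    {"id": 3, "url": "https://api.github.com/repos/carol/notes", "language": None},
]
bigquery_rows = [
    {"url": "https://github.com/alice/stats", "language": "R"},
    {"url": "https://github.com/bob/webapp", "language": "JavaScript"},
]

combined = combine(archive_repos, bigquery_rows)
counts = Counter(r["language"] for r in combined if r["language"] is not None)
print(counts.most_common())
```

Repositories found in neither source keep a missing language and are left out of the count, so the distribution only covers repositories whose language is known.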
71. To summarize…
• Data science – not rocket science, but a lot of yak shaving
• Different sources – different truths
• Capturing & storing data
• Data science UI – visualization is the key
• Desktop – hypothesis, development
• Cloud – production
72. What’s next?
• Data Exploration in R, workshop
• basiafusinska.com, blog
• katacoda.com, interactive learning platform