3. Agenda
• R ecosystem
• R basics
• Analysing GitHub events
• Data sources
• Code… a lot of code
4. Why R?
• Ross Ihaka & Robert Gentleman
• Name:
• First letter of the authors' first names
• Play on the name of the S language
• S-PLUS – commercial alternative
• Open source
• A leading language for statistical computing
5. R Environment
• R project
• console environment
• http://www.r-project.org/
• IDE
• Any editor
• RStudio
http://www.rstudio.com/products/rstudio/download/
11. Named elements
> myVector <- c(a="a", b="b", c="c")
> myVector
  a   b   c
"a" "b" "c"
> myList <- list(a="a", b="b", c="c")
> myList
$a
[1] "a"
$b
[1] "b"
$c
[1] "c"
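The names attached above can also be inspected and removed programmatically. The following is a minimal sketch using only base R; the variables `plainVector`, `mixedList` and `mixedTypes` are illustrative and not part of the original slides.

```r
# Same objects as on the slide
myVector <- c(a = "a", b = "b", c = "c")
myList <- list(a = "a", b = "b", c = "c")

# names() returns the element names for both vectors and lists
vectorNames <- names(myVector)
listNames <- names(myList)

# unname() returns a copy without names (illustrative variable)
plainVector <- unname(myVector)

# An atomic vector holds a single type; a list can mix types under its names
mixedList <- list(a = "a", b = 2, c = TRUE)
mixedTypes <- sapply(mixedList, class)

print(vectorNames)
print(mixedTypes)
```

Lists are therefore the usual choice when the named elements are not all of the same type.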
12. Accessing elements
> myVector[1]
a
"a"
> myVector[[1]]
[1] "a"
> myVector['a']
a
"a"
> myVector[['a']]
[1] "a"
> myList[1]
$a
[1] "a"
> myList[[1]]
[1] "a"
> myList['a']
$a
[1] "a"
> myList[['a']]
[1] "a"
> myList$a
[1] "a"
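The outputs above differ because `[` keeps the container while `[[` and `$` extract a single element. A short sketch that makes the difference explicit; the variable names `subList`, `element`, `firstTwo` and `absent` are illustrative.

```r
myVector <- c(a = "a", b = "b", c = "c")
myList <- list(a = "a", b = "b", c = "c")

# [ on a list returns a (shorter) list, name included
subList <- myList["a"]

# [[ returns the element itself
element <- myList[["a"]]

print(class(subList))
print(class(element))

# [ accepts several indices at once, [[ accepts only one
firstTwo <- myVector[c("a", "b")]

# For a list, asking [[ for a name that does not exist yields NULL
absent <- myList[["z"]]
```

In practice `[[` (or `$`) is used to work with one value, and `[` to select a subset that keeps the original structure.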
23. Language information
• Only pull request events have language information
• Data source: 1 hour of events from 01.01.2015, 3 PM
• ~11k events
• ~500 pull requests
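Selecting the pull request events and counting their languages could look roughly like the sketch below. The events here are hand-written, hypothetical records, and the field path `payload$pull_request$base$repo$language` is an assumption about the event layout rather than something shown on the slides; real records carry many more fields.

```r
# Hypothetical, hand-written events standing in for parsed event records
events <- list(
  list(type = "PushEvent", payload = list()),
  list(type = "PullRequestEvent",
       payload = list(pull_request = list(base = list(repo = list(language = "R"))))),
  list(type = "PullRequestEvent",
       payload = list(pull_request = list(base = list(repo = list(language = "Java"))))),
  list(type = "WatchEvent", payload = list())
)

# Keep only the pull request events
isPullRequest <- sapply(events, function(e) e$type == "PullRequestEvent")
pullRequests <- events[isPullRequest]

# Extract the language of each pull request (assumed field path)
languages <- sapply(pullRequests,
                    function(e) e$payload$pull_request$base$repo$language)

# Count pull requests per language
languageCounts <- table(languages)
print(languageCounts)
```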
24. Gender bias?
• 4,037,953 GitHub user profiles
• 1,426,121 identified (35.3%)
http://arstechnica.com/information-technology/2016/02/data-analysis-of-github-contributions-reveals-unexpected-gender-bias/
        Open     Closed
Women   8,216    111,011
Men     150,248  2,181,517
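The figures on this slide can be checked and compared in R. This is a minimal sketch: the numbers are copied from the slide, while the object names (`counts`, `rowShares`, `identifiedShare`) are illustrative, and no interpretation beyond the row-wise shares is implied.

```r
# Share of profiles that could be identified
profiles <- 4037953
identified <- 1426121
identifiedShare <- round(100 * identified / profiles, 1)
print(identifiedShare)

# Counts from the slide as a 2 x 2 matrix
counts <- matrix(
  c(8216, 111011,
    150248, 2181517),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Women", "Men"), c("Open", "Closed"))
)

# Row-wise proportions: the share of Open and Closed within each group
rowShares <- prop.table(counts, margin = 1)
print(round(rowShares, 3))
```

Comparing proportions within each row, rather than raw counts, keeps the much larger number of entries for men from dominating the comparison.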
26. Reading data from files - csv
> sizes <- read.csv(sizesFile)
> sizes
   category length width
1         B   20.0   3.0
2         A   23.0   3.6
3         B   75.0  18.0
4         B   44.0  10.0
5         C    2.5   6.0
6         B    7.2  27.0
7         A   45.8  34.0
8         C   12.0   2.0
9         A    5.0  13.0
10        A   68.0  14.5
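The slide does not show where `sizesFile` comes from. To try the call without the original file, the sketch below writes the same rows to a temporary CSV first; the temporary file and the per-category summary `meanLength` are additions for illustration only.

```r
# Write the slide's data to a temporary CSV so the example is self-contained
sizesFile <- tempfile(fileext = ".csv")
writeLines(c(
  "category,length,width",
  "B,20.0,3.0",
  "A,23.0,3.6",
  "B,75.0,18.0",
  "B,44.0,10.0",
  "C,2.5,6.0",
  "B,7.2,27.0",
  "A,45.8,34.0",
  "C,12.0,2.0",
  "A,5.0,13.0",
  "A,68.0,14.5"
), sizesFile)

# read.csv treats the first line as the header by default
sizes <- read.csv(sizesFile)
str(sizes)

# Illustrative follow-up: mean length per category
meanLength <- aggregate(length ~ category, data = sizes, FUN = mean)
print(meanLength)

# Remove the temporary file
unlink(sizesFile)
```

The result is a data frame, so columns are available as `sizes$length`, `sizes$width` and so on.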