Getting started with R when analysing GitHub events
Barbara Fusinska
barbarafusinska.com
About me
Programmer
Math enthusiast
Sweet tooth
@BasiaFusinska
https://github.com/BasiaFusinska/RTalk
Agenda
• R ecosystem
• R basics
• Analysing GitHub events
• Data sources
• Code… a lot of code
Why R?
• Ross Ihaka & Robert Gentleman
• Name:
• First letter of names
• Play on the name of S
• S-PLUS – commercial alternative
• Open source
• Widely used for statistical computing
R Environment
• R project
• console environment
• http://www.r-project.org/
• IDE
• Any editor
• RStudio
http://www.rstudio.com/products/rstudio/download/
RStudio
Editor
Console
Environment
variables
Plots
Files
Help
Packages
R Basics
Basics - Types
> myChar <- "a"
> myChar
[1] "a"
> typeof(myChar)
[1] "character"
> myNum <- 10
> myNum
[1] 10
> typeof(myNum)
[1] "double"
> # Dynamic
> myNum <- "some text"
> typeof(myNum)
[1] "character"
Vectors
> myVector <- c("a", "b", "c")
> myVector
[1] "a" "b" "c"
> typeof(myVector)
[1] "character"
myVector <- 1:10
myVector <- double(0)
myVector <- c(2, 5:10, 20)
myVector <- letters[1:5]
myVector[5]
Lists
> myList <- list("a", "b", "c")
> myList
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
> typeof(myList)
[1] "list"
Named elements
> myVector <- c(a="a", b="b", c="c")
> myVector
a b c
"a" "b" "c"
> myList <- list(a="a", b="b", c="c")
> myList
$a
[1] "a"
$b
[1] "b"
$c
[1] "c"
Accessing elements
> myVector[1]
a
"a"
> myVector[[1]]
[1] "a"
> myVector['a']
a
"a"
> myVector[['a']]
[1] "a"
> myList[1]
$a
[1] "a"
> myList[[1]]
[1] "a"
> myList['a']
$a
[1] "a"
> myList[['a']]
[1] "a"
> myList$a
[1] "a"
Data frames
> dataFrame <- data.frame(col1=c(1,2,3), col2=c(4,5,6))
> dataFrame
col1 col2
1 1 4
2 2 5
3 3 6
> typeof(dataFrame)
[1] "list"
Summary
> summary(dataFrame)
col1 col2
Min. :1.0 Min. :4.0
1st Qu.:1.5 1st Qu.:4.5
Median :2.0 Median :5.0
Mean :2.0 Mean :5.0
3rd Qu.:2.5 3rd Qu.:5.5
Max. :3.0 Max. :6.0
Summary statistics
mean(dataFrame$col1)
max(dataFrame$col1)
min(dataFrame$col1)
sum(dataFrame$col1)
median(dataFrame$col1)
quantile(dataFrame$col1)
Filtering vectors and lists
> a <- 1:10
> a[a > 4]
[1] 5 6 7 8 9 10
> select <- function(x) { x > 4}
> a[select(a)]
[1] 5 6 7 8 9 10
> Filter(select, a)
[1] 5 6 7 8 9 10
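The slide title mentions lists, but the examples above filter only a vector. A minimal sketch of the same idea applied to a list, using base R only. The `records` list and its `name` and `size` fields are made up for illustration.

```r
# Hypothetical data: a list of records, each record itself a list
records <- list(
  list(name = "a", size = 3),
  list(name = "b", size = 7),
  list(name = "c", size = 5)
)

# Predicate applied to one record at a time
isLarge <- function(r) { r$size > 4 }

# Filter() keeps the elements for which the predicate returns TRUE
large <- Filter(isLarge, records)

# Same result with logical indexing: build the logical vector first
largeToo <- records[sapply(records, isLarge)]

# Pull one field out of every remaining record
largeNames <- sapply(large, function(r) r$name)
print(largeNames)
```

Filter() returns a list here because its input is a list; on a plain vector it returns a vector, as in the example above.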
Filtering data frames
dataFrame <- data.frame(
age=c(20, 15, 31, 45, 17),
gender=c('F', 'F', 'M', 'M', 'F'),
smoker=c(TRUE, TRUE, FALSE, TRUE, FALSE))
> dataFrame
age gender smoker
1 20 F TRUE
2 15 F TRUE
3 31 M FALSE
4 45 M TRUE
5 17 F FALSE
Filtering by rows
> dataFrame$age[
dataFrame$gender == 'F']
[1] 20 15 17
> dataFrame[2:4, ]
age gender smoker
2 15 F TRUE
3 31 M FALSE
4 45 M TRUE
> dataFrame[
dataFrame$age < 30, ]
age gender smoker
1 20 F TRUE
2 15 F TRUE
5 17 F FALSE
> dataFrame[
dataFrame$gender == 'M', ]
age gender smoker
3 31 M FALSE
4 45 M TRUE
Filtering by columns
> dataFrame[, 3]
[1] TRUE TRUE FALSE TRUE FALSE
> dataFrame[, c(1,3)]
age smoker
1 20 TRUE
2 15 TRUE
3 31 FALSE
4 45 TRUE
5 17 FALSE
> dataFrame[, c(3,2)]
smoker gender
1 TRUE F
2 TRUE F
3 FALSE M
4 TRUE M
5 FALSE F
> dataFrame[, c('age', 'smoker')]
age smoker
1 20 TRUE
2 15 TRUE
3 31 FALSE
4 45 TRUE
5 17 FALSE
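Row and column filters can be combined in a single indexing call. The sketch below reuses the data frame from the slides; the variable names `youngSmokers` and `sameResult` are illustrative, and `subset()` is shown only as one possible alternative in base R.

```r
# Same data frame as on the slides
dataFrame <- data.frame(
  age = c(20, 15, 31, 45, 17),
  gender = c('F', 'F', 'M', 'M', 'F'),
  smoker = c(TRUE, TRUE, FALSE, TRUE, FALSE))

# Row condition goes before the comma, column selection after it
youngSmokers <- dataFrame[dataFrame$smoker & dataFrame$age < 30,
                          c('age', 'gender')]
print(youngSmokers)

# subset() expresses the same query without repeating the data frame name
sameResult <- subset(dataFrame, smoker & age < 30, select = c(age, gender))
```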
Goal: Language distribution
https://www.githubarchive.org/
Google BigQuery
Language information
• Only Pull Request events have language information
• Data source: 1 hour of events from 01.01.2015, 3 PM
• ~11k events
• ~500 pull requests
Gender bias?
• 4,037,953 GitHub user profiles
• 1,426,121 identified (35.3%)
http://arstechnica.com/information-technology/2016/02/data-analysis-of-github-contributions-reveals-unexpected-gender-bias/
        Open      Closed
Women   8,216     111,011
Men     150,248   2,181,517
Reading data from files - csv
> sizes <- read.csv(sizesFile)
> sizes
category length width
1 B 20.0 3.0
2 A 23.0 3.6
3 B 75.0 18.0
4 B 44.0 10.0
5 C 2.5 6.0
6 B 7.2 27.0
7 A 45.8 34.0
8 C 12.0 2.0
9 A 5.0 13.0
10 A 68.0 14.5
Reading data from files - lines
> lines <- readLines(sizesFile)
> lines
[1] "category,length,width" "B,20,3"
[3] "A,23,3.6" "B,75,18"
[5] "B,44,10" "C,2.5,6"
[7] "B,7.2,27" "A,45.8,34"
[9] "C,12,2" "A,5,13"
[11] "A,68,14.5"
Writing data to csv file
write.csv(sizes, file=outputFile)
write.csv(sizes, file=outputFile, row.names = FALSE)
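The two calls above differ in whether the row names are written as an extra first column. A small sketch of the round trip, assuming a cut-down version of the `sizes` data and temporary files; the file variables are illustrative.

```r
# Cut-down stand-in for the sizes data frame
sizes <- data.frame(
  category = c("B", "A"),
  length = c(20, 23),
  width = c(3, 3.6))

withNames <- tempfile(fileext = ".csv")
withoutNames <- tempfile(fileext = ".csv")

# Default: row names are written as an unnamed first column
write.csv(sizes, file = withNames)
# row.names = FALSE writes only the data columns
write.csv(sizes, file = withoutNames, row.names = FALSE)

# Reading back shows the extra column, which read.csv names "X"
colsWith <- names(read.csv(withNames))
colsWithout <- names(read.csv(withoutNames))
print(colsWith)
print(colsWithout)
```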
Applying operation across elements
> myVector <- c(1, 4, 9, 16, 25)
> sapply(myVector, sqrt)
[1] 1 2 3 4 5
> lapply(myVector, sqrt)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
Read GitHub Archive events
library("rjson")
readEvents <- function(file, eventNames) {
lines <- readLines(file)
jsonEvents <- lapply(lines, fromJSON)
specificEvents <- Filter(
function(e) { e$type %in% eventNames },
jsonEvents)
return(specificEvents)
}
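The filtering step of readEvents can be tried without any files. The sketch below replaces the parsed JSON with a hand-built list, so `jsonEvents` and its contents are hypothetical stand-ins for the result of `lapply(lines, fromJSON)`.

```r
# Hypothetical stand-in for parsed events: one list per JSON line
jsonEvents <- list(
  list(type = "PushEvent", id = "1"),
  list(type = "PullRequestEvent", id = "2"),
  list(type = "WatchEvent", id = "3"),
  list(type = "PullRequestEvent", id = "4")
)

eventNames <- c("PullRequestEvent")

# Same predicate as in readEvents: keep events whose type is requested
specificEvents <- Filter(
  function(e) { e$type %in% eventNames },
  jsonEvents)

# Count events per type to see how many of each kind there are
typeCounts <- table(sapply(jsonEvents, function(e) e$type))
print(typeCounts)
```

Because the predicate uses %in%, eventNames can hold several event types at once.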
Missing data
# Missing values
> a <- c(1,2,NA,3,4,5)
> a
[1] 1 2 NA 3 4 5
# Checking for missing data
> is.na(a)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
> anyNA(a)
[1] TRUE
# Setting missing values
> is.na(a) <- c(2,4)
> a
[1] 1 NA NA NA 4 5
# Setting null values
> a <- NULL
> is.null(a)
[1] TRUE
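Missing values also affect the summary statistics shown earlier. A short sketch of the usual options in base R, reusing the vector from this slide; the variable names are illustrative.

```r
a <- c(1, 2, NA, 3, 4, 5)

# Any NA in the input makes the result NA by default
meanAll <- mean(a)

# na.rm = TRUE drops the missing values before computing
meanKnown <- mean(a, na.rm = TRUE)

# Alternatively, remove the missing values explicitly
known <- a[!is.na(a)]

print(meanAll)
print(meanKnown)
```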
Read pull requests
pullRequestEvents <- readEvents(fileName,"PullRequestEvent")
select <- function(x) {
id <- x$payload$pull_request$base$repo$id
language <- x$payload$pull_request$base$repo$language
if (!is.null(language)) {
c(ID=id, Language=language)
} else {
c(ID=id, Language="")
}
}
pullRequests <- sapply(pullRequestEvents, select)
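When every call returns a named vector of the same length, sapply simplifies the results into a matrix with one column per event and the names as row names. The sketch below shows this with hand-built events; `makeEvent` is a hypothetical helper that fills in only the fields `select` reads, not part of the talk's code.

```r
# Hypothetical helper: build an event with just the fields select() needs
makeEvent <- function(id, language) {
  list(payload = list(pull_request = list(base = list(
    repo = list(id = id, language = language)))))
}

events <- list(makeEvent(1, "R"), makeEvent(2, NULL), makeEvent(3, "Python"))

# select() as on the slide
select <- function(x) {
  id <- x$payload$pull_request$base$repo$id
  language <- x$payload$pull_request$base$repo$language
  if (!is.null(language)) {
    c(ID = id, Language = language)
  } else {
    c(ID = id, Language = "")
  }
}

pullRequests <- sapply(events, select)

# A 2 x 3 character matrix: rows "ID" and "Language", one column per event
print(dim(pullRequests))
print(pullRequests["Language", ])
```

Note that c() coerces the numeric id to character, so both rows of the matrix hold strings.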
Some solutions
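# version 1: bind one row at a time (the result of rbind must be assigned)
dataFrame <- data.frame()
# version 2: grow one vector per column
idColumn <- c()
languageColumn <- c()
# pullRequests is a matrix, so iterate over its columns
for (i in seq_len(ncol(pullRequests))) {
x <- pullRequests[, i]
# version 1
dataFrame <- rbind(dataFrame,
data.frame(id=x[["ID"]], language=x[["Language"]]))
# version 2
idColumn <- c(idColumn, x[["ID"]])
languageColumn <- c(languageColumn, x[["Language"]])
}
# version 2
dataFrame <- data.frame(
id=idColumn,
language=languageColumn)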
Prepare data
reposLanguages <- data.frame(
id=pullRequests["ID",],
language=pullRequests["Language",])
head(reposLanguages)
summary(reposLanguages)
A quick look at the data
> head(reposLanguages)
id language
1 3542607 C++
2 10391073 Python
3 28668460 Python
4 28608107 Ruby
5 5452699 JavaScript
6 19777872 C#
> summary(reposLanguages)
id language
28648149: 12 Ruby : 66
28688863: 8 PHP : 55
20413356: 5 Python : 53
28668553: 5 : 51
10160141: 4 JavaScript: 47
206084 : 4 C++ : 30
(Other) :436 (Other) :172
Duplicated data
> myData <- c(1,2,3,4,3,2,5,6)
> duplicated(myData)
[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
> anyDuplicated(myData)
[1] 5
> unique(myData)
[1] 1 2 3 4 5 6
Unique repositories data
> reposLanguages <- unique(reposLanguages)
> summary(reposLanguages)
id language
25994257: 2 Python : 36
28528325: 2 JavaScript: 35
10126031: 1 Ruby : 35
10160141: 1 PHP : 34
10344201: 1 : 27
10391073: 1 Java : 22
(Other) :297 (Other) :116
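In the summary above, two ids still appear twice after unique(). One possible explanation, which is an assumption and not confirmed by the slides, is that unique() on a data frame compares whole rows, so a repository reported with two different language values keeps both rows. A sketch with made-up data showing how to keep one row per id instead:

```r
# Made-up data: id 1 is an exact duplicate, id 2 appears with two languages
reposLanguages <- data.frame(
  id = c(1, 1, 2, 2, 3),
  language = c("Ruby", "Ruby", "Python", "", "PHP"))

# unique() drops only rows that match in every column
uniqueRows <- unique(reposLanguages)

# duplicated() on the id column keeps the first row seen for each id
onePerRepo <- uniqueRows[!duplicated(uniqueRows$id), ]

print(uniqueRows)
print(onePerRepo)
```

Keeping the first row per id is an arbitrary choice; which row to keep depends on the analysis.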
Distribution tables
> collection <-
c('A','C','B','C','B','C')
> oneWayTable <-
table(collection)
> oneWayTable
collection
A B C
1 2 3
> attributes(oneWayTable)
$dim
[1] 3
$dimnames
$dimnames$collection
[1] "A" "B" "C"
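Counts from table() can be turned into shares with prop.table(). A brief sketch using the same `collection` vector; `shares` and `percentages` are illustrative names.

```r
collection <- c('A', 'C', 'B', 'C', 'B', 'C')
oneWayTable <- table(collection)

# Divide every count by the total so the entries sum to 1
shares <- prop.table(oneWayTable)

# Express the shares as percentages rounded to one decimal place
percentages <- round(100 * shares, 1)

print(shares)
print(percentages)
```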
Language distribution
> languages <- table(reposLanguages$language)
> head(languages)
             ActionScript     Bluespec            C
          27            1            1            9
          C#          C++
          11           20
> languages <- sort(languages, decreasing=TRUE)
> head(languages)
    Python JavaScript       Ruby        PHP
        36         35         35         34         27
      Java
        22
Recognised languages
reposLanguages <-
reposLanguages[reposLanguages$language != "",]
languages <- table(reposLanguages$language)
languages <- sort(languages, decreasing=TRUE)
Language names
> languagesNames <- names(languages)
> languagesNames
[1] "Python" "JavaScript" "Ruby"
[4] "PHP" "Java" "C++"
[7] "CSS" "C#" "C"
[10] "Go" "Shell" "CoffeeScript"
[13] "Objective-C" "Puppet" "Scala"
[16] "Lua" "Rust" "Clojure"
[19] "Emacs Lisp" "Haskell" "Julia"
[22] "Makefile" "Perl" "VimL"
[25] "ActionScript" "Bluespec" "DM"
[28] "Elixir" "F#" "Haxe"
[31] "Matlab" "Swift" "TeX"
[34] ""
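The empty name at position 34 is what one would expect if the language column is a factor: filtering rows does not remove unused factor levels, so table() still reports the empty level with a count of zero. This is an assumption about the data, since the slides do not show the column type (and since R 4.0, data.frame() no longer converts strings to factors by default). A sketch with made-up data using droplevels():

```r
# Made-up data with the language column stored explicitly as a factor
reposLanguages <- data.frame(
  id = 1:5,
  language = factor(c("Python", "", "Ruby", "Python", "")))

# Row filter as on the slide
recognised <- reposLanguages[reposLanguages$language != "", ]

# The empty level is still present, with a count of zero
namesBefore <- names(table(recognised$language))

# droplevels() removes levels that no longer occur in the data
recognised$language <- droplevels(recognised$language)
namesAfter <- names(table(recognised$language))

print(namesBefore)
print(namesAfter)
```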
Plotting
languages2Display <- languages[languages > 5]
barplot(languages2Display)
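With many bars, the default horizontal labels tend to overlap or get dropped. A sketch of a few base graphics options, using made-up counts in place of the real `languages` table; `pdf(NULL)` is used here only so the example runs without opening a window or writing a file.

```r
# Made-up counts standing in for the real language table
languages <- sort(
  table(c(rep("Python", 7), rep("JavaScript", 6), rep("Go", 2))),
  decreasing = TRUE)

languages2Display <- languages[languages > 5]

# Null PDF device: draws nowhere, useful for scripted runs
pdf(NULL)

# las = 2 rotates the axis labels, cex.names shrinks the bar labels
midpoints <- barplot(
  languages2Display,
  las = 2,
  cex.names = 0.8,
  ylab = "Repositories",
  main = "Languages with more than 5 repositories")

dev.off()
```

barplot() returns the x positions of the bar midpoints, which can be used to add text or other annotations above the bars.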
Summary
• GitHub Archive
• Introduction to R
• Data types
• Filtering
• I/O
• Applying operations
• Missing values & duplicates
• Binding data
• Distribution tables
• Plotting (barplot)
Thank you
barbara.fusinska@gmail.com
@BasiaFusinska
barbarafusinska.com
https://github.com/BasiaFusinska/RTalk
Questions?

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Getting started with R when analysing GitHub commits

  • 1. Getting started with R when analysing GitHub events Barbara Fusinska barbarafusinska.com
  • 2. About me Programmer Math enthusiast Sweet tooth @BasiaFusinska https://github.com/BasiaFusinska/RTalk
  • 3. Agenda • R ecosystem • R basics • Analysing GitHub events • Data sources • Code… a lot of code
  • 4. Why R? • Ross Ihaka & Robert Gentleman • Name: • First letter of names • Play on the name of S • S-PLUS – commercial alternative • Open source • Nr 1 for statistical computing
  • 5. R Environment • R project • console environment • http://www.r-project.org/ • IDE • Any editor • RStudio http://www.rstudio.com/products/rstudio/download/
  • 8. Basics - Types > myChar <- "a" > myChar [1] "a" > typeof(myChar) [1] "character" > myNum <- 10 > myNum [1] 10 > typeof(myNum) [1] "double" > # Dynamic > myNum <- "some text" > typeof(myNum) [1] "character"
  • 9. Vectors > myVector <- c("a", "b", "c") > myVector [1] "a" "b" "c" > typeof(myVector) [1] "character" myVector <- 1:10 myVector <- double(0) myVector <- c(2, 5:10, 20) myVector <- letters[1:5] myVector[5]
  • 10. Lists > myList <- list("a", "b", "c") > myList [[1]] [1] "a" [[2]] [1] "b" [[3]] [1] "c" > typeof(myList) [1] "list"
  • 11. Named elements > myVector <- c(a="a", b="b", c="c") > myVector a b c "a" "b" "c" > myList <- list(a="a", b="b", c="c") > myList $a [1] "a" $b [1] "b" $c [1] "c"
  • 12. Accessing element > myVector[1] a "a" > myVector[[1]] [1] "a" > myVector['a'] a "a" > myVector[['a']] [1] "a" > myList[1] $a [1] "a" > myList[[1]] [1] "a" > myList['a'] $a [1] "a" > myList[['a']] [1] "a" > myList$a [1] "a"
  • 13. Data frames > dataFrame <- data.frame(col1=c(1,2,3), col2=c(4,5,6)) > dataFrame col1 col2 1 1 4 2 2 5 3 3 6 > typeof(dataFrame) [1] "list"
  • 14. Summary > summary(dataFrame) col1 col2 Min. :1.0 Min. :4.0 1st Qu.:1.5 1st Qu.:4.5 Median :2.0 Median :5.0 Mean :2.0 Mean :5.0 3rd Qu.:2.5 3rd Qu.:5.5 Max. :3.0 Max. :6.0
  • 16. Filtering vectors and lists > a <- 1:10 > a[a > 4] [1] 5 6 7 8 9 10 > select <- function(x) { x > 4} > a[select(a)] [1] 5 6 7 8 9 10 > Filter(select, a) [1] 5 6 7 8 9 10
  • 17. Filtering data frames dataFrame <- data.frame( age=c(20, 15, 31, 45, 17), gender=c('F', 'F', 'M', 'M', 'F'), smoker=c(TRUE, TRUE, FALSE, TRUE, FALSE)) > dataFrame age gender smoker 1 20 F TRUE 2 15 F TRUE 3 31 M FALSE 4 45 M TRUE 5 17 F FALSE
  • 18. Filtering by rows > dataFrame$age[ dataFrame$gender == 'F'] [1] 20 15 17 > dataFrame[2:4, ] age gender smoker 2 15 F TRUE 3 31 M FALSE 4 45 M TRUE > dataFrame[ dataFrame$age < 30, ] age gender smoker 1 20 F TRUE 2 15 F TRUE 5 17 F FALSE > dataFrame[ dataFrame$gender == 'M', ] age gender smoker 3 31 M FALSE 4 45 M TRUE
  • 19. Filtering by columns > dataFrame[, 3] [1] TRUE TRUE FALSE TRUE FALSE > dataFrame[, c(1,3)] age smoker 1 20 TRUE 2 15 TRUE 3 31 FALSE 4 45 TRUE 5 17 FALSE > dataFrame[, c(3,2)] smoker gender 1 TRUE F 2 TRUE F 3 FALSE M 4 TRUE M 5 FALSE F > dataFrame[, c('age', 'smoker')] age smoker 1 20 TRUE 2 15 TRUE 3 31 FALSE 4 45 TRUE 5 17 FALSE
  • 23. Language information • Only Pull Requests event types have language information • Data source – 1h events from 01.01.2015 3 PM • ~11k events • ~500 pull requests
  • 24. Gender bias? • 4,037,953 GitHub user profiles • 1,426,121 identified (35.3%) http://arstechnica.com/information-technology/2016/02/data-analysis-of-github-contributions-reveals-unexpected-gender-bias/
              Open      Closed
      Women   8,216     111,011
      Men     150,248   2,181,517
  • 26. Reading data from files - csv > sizes <- read.csv(sizesFile) > sizes category length width 1 B 20.0 3.0 2 A 23.0 3.6 3 B 75.0 18.0 4 B 44.0 10.0 5 C 2.5 6.0 6 B 7.2 27.0 7 A 45.8 34.0 8 C 12.0 2.0 9 A 5.0 13.0 10 A 68.0 14.5
  • 27. Reading data from files - lines > lines <- readLines(sizesFile) > lines [1] "category,length,width" "B,20,3" [3] "A,23,3.6" "B,75,18" [5] "B,44,10" "C,2.5,6" [7] "B,7.2,27" "A,45.8,34" [9] "C,12,2" "A,5,13" [11] "A,68,14.5"
  • 28. Writing data to csv file write.csv(sizes, file=outputFile) write.csv(sizes, file=outputFile, row.names = FALSE)
  • 29. Applying operation across elements > myVector <- c(1, 4, 9, 16, 25) > sapply(myVector, sqrt) [1] 1 2 3 4 5 > lapply(myVector, sqrt) [[1]] [1] 1 [[2]] [1] 2 [[3]] [1] 3 [[4]] [1] 4 [[5]] [1] 5
  • 30. Read GitHub Archive events library("rjson") readEvents <- function(file, eventNames) { lines <- readLines(file) jsonEvents <- lapply(lines, fromJSON) specificEvents <- Filter( function(e) { e$type %in% eventNames }, jsonEvents) return(specificEvents) }
  • 31. Missing data # Missing values > a <- c(1,2,NA,3,4,5) > a [1] 1 2 NA 3 4 5 # Checking if missing data > is.na(a) [1] FALSE FALSE TRUE FALSE FALSE FALSE > anyNA(a) [1] TRUE # Setting missing values > is.na(a) <- c(2,4) > a [1] 1 NA NA NA 4 5 # Setting null values > a <- NULL > is.null(a) [1] TRUE
  • 32. Read pull requests pullRequestEvents <- readEvents(fileName,"PullRequestEvent") select <- function(x) { id <- x$payload$pull_request$base$repo$id language <- x$payload$pull_request$base$repo$language if (!is.null(language)) { c(ID=id, Language=language) } else { c(ID=id, Language="") } } pullRequests <- sapply(pullRequestEvents, select)
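  • 33. Some solutions
      # assumes dataFrame, idColumn and languageColumn were initialised earlier
      for (i in seq_len(ncol(pullRequests))) {
        x <- pullRequests[, i]
        # version 1: grow the data frame row by row
        dataFrame <- rbind(dataFrame, x)
        # version 2: grow one vector per column
        idColumn <- c(idColumn, x["ID"])
        languageColumn <- c(languageColumn, x["Language"])
      }
      # version 2: build the data frame once, after the loop
      dataFrame <- data.frame(id=idColumn, language=languageColumn)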
  • 34. Prepare data reposLanguages <- data.frame( id=pullRequests["ID",], language=pullRequests["Language",]) head(reposLanguages) summary(reposLanguages)
  • 35. A quick look at the data
      > head(reposLanguages)
              id   language
      1  3542607        C++
      2 10391073     Python
      3 28668460     Python
      4 28608107       Ruby
      5  5452699 JavaScript
      6 19777872         C#
      > summary(reposLanguages)
              id            language
       28648149: 12   Ruby      : 66
       28688863:  8   PHP       : 55
       20413356:  5   Python    : 53
       28668553:  5             : 51
       10160141:  4   JavaScript: 47
       206084  :  4   C++       : 30
       (Other) :436   (Other)   :172
  • 36. Duplicated data > myData <- c(1,2,3,4,3,2,5,6) > duplicated(myData) [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE > anyDuplicated(myData) [1] 5 > unique(myData) [1] 1 2 3 4 5 6
  • 37. Unique repositories data > reposLanguages <- unique(reposLanguages) > summary(reposLanguages) id language 25994257: 2 Python : 36 28528325: 2 JavaScript: 35 10126031: 1 Ruby : 35 10160141: 1 PHP : 34 10344201: 1 : 27 10391073: 1 Java : 22 (Other) :297 (Other) :116
  • 38. Distribution tables > collection <- c('A','C','B','C','B','C') > oneWayTable <- table(collection) > oneWayTable collection A B C 1 2 3 > attributes(oneWayTable) $dim [1] 3 $dimnames $dimnames$collection [1] "A" "B" "C"
  • 39. Language distribution
      > languages <- table(reposLanguages$language)
      > head(languages)
                   ActionScript     Bluespec            C
                27            1            1            9
                C#          C++
                11           20
      > languages <- sort(languages, decreasing=TRUE)
      > head(languages)
          Python JavaScript       Ruby        PHP
              36         35         35         34         27
            Java
              22
  • 40. Recognised languages reposLanguages <- reposLanguages[reposLanguages$language != "",] languages <- table(reposLanguages$language) languages <- sort(languages, decreasing=TRUE)
  • 41. Language names > languagesNames <- names(languages) > languagesNames [1] "Python" "JavaScript" "Ruby" [4] "PHP" "Java" "C++" [7] "CSS" "C#" "C" [10] "Go" "Shell" "CoffeeScript" [13] "Objective-C" "Puppet" "Scala" [16] "Lua" "Rust" "Clojure" [19] "Emacs Lisp" "Haskell" "Julia" [22] "Makefile" "Perl" "VimL" [25] "ActionScript" "Bluespec" "DM" [28] "Elixir" "F#" "Haxe" [31] "Matlab" "Swift" "TeX" [34] ""
  • 42. Plotting languages2Display <- languages[languages > 5] barplot(languages2Display)
  • 43. Summary • GitHub Archive • Introduction to R • Data types • Filtering • I/O • Applying operations • Missing values & duplicates • Binding data • Distribution tables • Plotting (barplot)
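
The pull-request steps from slides 32 to 40 (extract repository id and language, bind into a data frame, de-duplicate, drop unknown languages, tabulate) can be sketched end to end in base R. This is only an illustration under stated assumptions: the events are hand-written stand-ins rather than real GitHub Archive records, `makeEvent` and `toRow` are hypothetical helper names, and a missing language is stored as `NA` instead of the empty string used in the slides.

```r
# Hypothetical, hand-written events that mimic the nested shape accessed in
# the slides (payload$pull_request$base$repo); real records have more fields.
makeEvent <- function(id, language) {
  list(type = "PullRequestEvent",
       payload = list(pull_request = list(base = list(
         repo = list(id = id, language = language)))))
}

events <- list(
  makeEvent(1, "Python"),
  makeEvent(2, "Ruby"),
  makeEvent(1, "Python"),  # same repository seen twice
  makeEvent(3, NULL),      # language not recognised
  makeEvent(4, "Python")
)

# Build a one-row data frame per event; a missing language becomes NA
# (an assumption of this sketch, the slides use "" instead).
toRow <- function(e) {
  repo <- e$payload$pull_request$base$repo
  language <- if (is.null(repo$language)) NA_character_ else repo$language
  data.frame(id = repo$id, language = language, stringsAsFactors = FALSE)
}

# Bind all rows in a single call instead of growing a data frame in a loop.
reposLanguages <- do.call(rbind, lapply(events, toRow))

# Keep each repository once, then drop rows without a language.
reposLanguages <- unique(reposLanguages)
reposLanguages <- reposLanguages[!is.na(reposLanguages$language), ]

# Count repositories per language, most frequent first.
languages <- sort(table(reposLanguages$language), decreasing = TRUE)
print(languages)
```

Binding once with `do.call(rbind, ...)` avoids copying the data frame on every loop iteration, and using `NA` lets `is.na()` from slide 31 do the filtering; with the empty-string convention the filter would be `language != ""` as on slide 40.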