R - the language

R - scripted data
History
Language
Packages
Tools
RPubs
Slidify
Shiny

A Brief History of R
– 1976 S - Bell Labs; Fortran
– John Chambers
– 1988 S Version 3; C language
● 1991 R Created
– Ross Ihaka and Robert Gentleman
● 1993 R Announced
– 1993 S licensed to StatSci (now Insightful)
● 2000 R Version 1.0.0 released
– 2004 Insightful purchases S from Lucent ($2 million)
– 2008 TIBCO acquires Insightful ($25 million)

Other “Stats” Tools
● R – additional, commercial support
Oracle: “Big Data Appliance” - R + Hadoop
+ Linux + NoSQL + Exadata(H/W)
IBM: R executing in Hadoop (massively
parallel in-database analytics)
● SAS (SAS Institute) developed from 1966, first release 1972
● SPSS (IBM) first release 1968

Model Development and
Execution Comparison
http://inside-bigdata.com/2014/06/25/revolution-r-enterprise-vs-sas-performance/

Oracle + INTEL Libraries
https://blogs.oracle.com/R/entry/oracle_r_distribution_performance_benchmark

Language
● Derivative of S (S-PLUS)
● Portable (includes Playstation 3)
● Interpreted, calls into C libraries
● Functional!
● GPL
● 40 year old technology
● Open Source (you want it, you do it)

Data Types
● Symbols refer to objects
● Object attributes
– names
– dimnames
– dimensions
– class
– length
– user defined attributes/metadata

Data Types
● Object types – single class, except list
– List
(may have mixed classes)
– Vectors
(scalar is a vector of length 1)
– Matrices
(vector with 'dimension' attribute)
(column major order)
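The object types above can be checked at the console. This is a minimal sketch using only base R; the variable names are made up for illustration.

```r
# A "scalar" is just a vector of length 1.
x <- 42
length(x)                 # 1
is.vector(x)              # TRUE

# A matrix is a vector with a 'dim' attribute, filled column by column.
m <- matrix(1:6, nrow = 2)
attributes(m)             # $dim: 2 3
m[1, 2]                   # 3, the first element of the second column
as.vector(m)              # 1 2 3 4 5 6, the underlying column-major order

# A list may mix classes; an atomic vector is coerced to a single class.
mixed <- list(1L, "a", TRUE)
sapply(mixed, class)      # "integer" "character" "logical"
class(c(1L, "a", TRUE))   # "character"
```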

Data Types
● Object types
– Factors
● Categorical data (like an enumeration)
– Data frames
● Special list, each element has same length
● Elements are columns; each column's length is the number of rows
● Each element (column) has its own type
● row.names() attribute to name the rows
● Convert to matrix with data.matrix()
● Load with read.table(), read.csv()
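A short sketch of factors and data frames, using a small hypothetical data set invented for illustration (the column names and values are assumptions, not from the talk).

```r
# Hypothetical data, invented for illustration.
df <- data.frame(
  id    = 1:3,
  group = factor(c("ctl", "trt", "ctl")),
  score = c(2.5, 3.1, 4.0)
)
row.names(df) <- c("a", "b", "c")

is.list(df)              # TRUE: a data frame is a special list
sapply(df, class)        # each column keeps its own class
levels(df$group)         # "ctl" "trt"
as.integer(df$group)     # 1 2 1, the integer codes behind the levels

# data.matrix() converts to a numeric matrix; factors become their codes.
dm <- data.matrix(df)
dim(dm)                  # 3 3

# read.csv() builds a data frame from a file path or, as here, inline text.
df2 <- read.csv(text = "id,score\n1,2.5\n2,3.1")
```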

Data Types
● Object “atomic” classes
– character
– numeric (double precision real)
– integer
– complex
– logical (booleans)
Numeric includes Inf, -Inf and NaN; integer has only NA
1 / Inf == 0 !
any class can be NA
NaN is NA, NA is not NaN: is.na(NaN) is TRUE, is.nan(NA) is FALSE
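The special values can be probed directly. A minimal base R sketch:

```r
1 / Inf == 0            # TRUE
is.nan(0 / 0)           # TRUE
is.na(NaN)              # TRUE:  NaN counts as NA
is.nan(NA)              # FALSE: NA is not NaN

# Every atomic class has its own NA.
class(NA_integer_)      # "integer"
class(NA_character_)    # "character"

# Inf is stored as a double; integer overflow gives NA (with a warning).
is.double(Inf)          # TRUE
overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(overflow)         # TRUE
```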

Data Types
● Dates
– “Date” class
– Days since epoch (1970-01-01)
● Times
– “POSIXct” or “POSIXlt” class
– Seconds since epoch
● Coerce from string with as.Date(); back to string with format()
● Generic functions include 'weekdays()',
'months()', 'quarters()'
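A small sketch of the date and time classes. The date used is arbitrary, and the names returned by weekdays() and months() depend on the session locale.

```r
d <- as.Date("2014-06-25")       # string -> Date
unclass(d)                       # 16246 days since 1970-01-01
format(d, "%d/%m/%Y")            # Date -> string: "25/06/2014"
weekdays(d); months(d)           # locale-dependent names
quarters(d)                      # "Q2"

tm <- as.POSIXct("2014-06-25 12:00:00", tz = "UTC")
as.numeric(tm)                   # seconds since the epoch
lt <- as.POSIXlt(tm)             # list-like form with named fields
lt$hour                          # 12
```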

Operators
● Grouping: ()
● Assignment: to<-from AND from->to
● Vectorized: + - ! * / ^ %% & |
● ~ ? : %/% %*% %o% %x% %in% < > == >=
<= && ||
● Element access: [[]] [] $
● Function argument types:
– symbol, symbol=default, ...
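A few of the operators above in use. This is an illustrative sketch; scale_by is a made-up function showing the three kinds of formal argument.

```r
x <- c(1, 2, 3)
10 -> y                          # right-assignment form
x * 2 + y                        # vectorized: 12 14 16
7 %/% 2; 7 %% 2                  # integer division 3, remainder 1
x %*% x                          # inner product, a 1x1 matrix holding 14
x %o% 1:2                        # outer product, a 3x2 matrix
2 %in% x                         # TRUE
c(TRUE, FALSE) & c(TRUE, TRUE)   # element-wise: TRUE FALSE
TRUE && FALSE                    # single value, short-circuits: FALSE

lst <- list(a = 1:3, b = "text")
lst["a"]                         # [ ] returns a sub-list
lst[["a"]]                       # [[ ]] returns the element itself
lst$b                            # $ extracts by name: "text"

# Formal arguments: a plain symbol, a symbol with a default, and ...
scale_by <- function(v, k = 2, ...) v * k
scale_by(x)                      # 2 4 6
```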

Control Structures
● if, else
● for
● while
● repeat
● break, next, return
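The control structures look much like those in C-family languages. A minimal sketch with made-up variable names:

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next          # skip even numbers
  total <- total + i
}
total                            # 25

n <- 1
while (n < 100) n <- n * 2
n                                # 128

tries <- 0
repeat {                         # loops until an explicit break
  tries <- tries + 1
  if (tries >= 3) break
}

sign_of <- function(v) {
  if (v > 0) {
    return("positive")
  } else if (v < 0) {
    return("negative")
  }
  "zero"                         # the last value is returned implicitly
}
```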

Apply
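● apply – apply a function over array margins
● lapply – apply a function over a list / vector, returns a list
● sapply – like lapply, but simplifies the result to a vector or matrix
● tapply – apply a function over a ragged array (groups defined by a factor)
● mapply – apply a function to multiple lists / vectors in parallel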
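One call to each member of the apply family, as a sketch on small made-up inputs:

```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)                  # over rows: 9 12

parts <- list(a = 1:3, b = 4:6)
as_list <- lapply(parts, mean)                # a list: $a 2, $b 5
as_vec  <- sapply(parts, mean)                # a named vector: a 2, b 5

values <- c(1, 2, 3, 4)
groups <- c("x", "y", "x", "y")
by_group <- tapply(values, groups, sum)       # x 4, y 6

pairwise <- mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9
```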

Functions
● Functions are objects
● Functional closure consists of:
– Formal argument list
– Function body (definition)
– Environment
● Each of these can be assigned to (formals<-, body<-, environment<-)
● Assigning to the environment can eliminate
unwanted environment capture
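The three parts of a closure can be inspected and replaced. The sketch below uses made-up functions (make_counter, f, h) to show environment capture and assignment to formals, body and environment.

```r
# The inner function captures the environment where 'count' lives.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1          # updates 'count' in the enclosing environment
    count
  }
}
counter <- make_counter()
counter(); counter()             # 1, then 2

f <- function(x, y = 2) x + y
formals(f)$y                     # 2
body(f)                          # x + y

body(f) <- quote(x * y)          # replace the body
formals(f)$y <- 10               # change the default
f(3)                             # 30

# 'h' captures a local environment that also holds 'unused'.
h <- local({
  unused <- 1:10
  function(x) x + 1
})
environment(h) <- globalenv()    # h no longer keeps 'unused' alive
h(1)                             # 2
```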

Packages
● CRAN (Comprehensive R Archive Network)
– Main site, includes R download
● Bioconductor
– Analysis of genomic data
– Next generation high-throughput
sequencing
● R-forge
● GitHub and Personal repositories

Packages
● Analysis
– Statistical analysis (stats, linprog)
● Linear (and general linear) modeling
● Tree models
● Analysis of variance
– Machine learning (caret, kernlab)
● Classification and clustering (random forests, k-means, knn, etc.)
● Training and predictions
● Cross validation and error analysis

Packages
● Graphics
– Base graphics
● Plot: plot, hist, ...
● Annotate: text, lines, points, axis, ...
– lattice
● Single command: xyplot, bwplot, ...
– ggplot2
● Single command: qplot
● Defining objects: aesthetics, geoms
● Chain commands: ggplot, geom_*, ...

Packages
● Data visualization
– rCharts (GitHub), converts visualizations to
Javascript (e.g. d3.js)
http://www.google.com/trends/explore#q=R%20language%2C%20Data%20Visualization%2C%20D3.js%2C%20Processing.js&cmpt=q

Tools
● Command line
● RStudio (can run on remote Linux server)
● RKWard
● R Commander (tcl/tk)
● JGR – Java (GUI for R)
● Rattle - RGtk2

Tools
● Debugging
– Print statements!
– Interactive tools:
● traceback() – stack trace on error
● debug() – flags function for stepping
● browser() - stops function and enters debug
● trace() - insert trace statements
● recover() - modify error behavior, can
browse call stack

Tools
● Profiling
– “We should forget about small efficiencies,
say about 97% of the time: premature
optimization is the root of all evil”
– Donald Knuth
– system.time() - CPU, wall times
– Rprof() - use symmaryRprof() to see results
● Do not use Rprof() and system.time()
together
● Calls to C/Fortran libraries not profiled
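A sketch of both profiling tools, run one after the other rather than together. slow_sum is a made-up, deliberately slow function; the sampling profiler only reports results if the profiled code runs long enough to be sampled, so the iteration count is an assumption that may need raising on a fast machine.

```r
# Deliberately slow: an interpreted loop instead of sum(sqrt(1:n)).
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# system.time() reports user CPU, system CPU and elapsed (wall) time.
timing <- system.time(slow_sum(1e6))
timing[["elapsed"]]

# Rprof() samples the call stack into a file; summaryRprof() reads it back.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
invisible(slow_sum(1e7))
Rprof(NULL)                      # stop profiling
head(summaryRprof(prof_file)$by.self)
```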

Data Exploration
● Script it!
– If you can't repeat it, it didn't happen
● Get the data (ingest)
– Functions to download, uncompress,
unarchive, store, read, and organize
● Clean the data
– Handle missing and incomplete data,
impute values, identify outliers
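The cleaning step can be scripted in a few lines. This sketch uses hypothetical readings invented for illustration, median imputation, and the common 1.5 * IQR rule for outliers; these are one choice among many, not the method from the talk.

```r
# Hypothetical measurements; NA marks a missing reading.
raw <- data.frame(id = 1:6,
                  value = c(10.2, NA, 9.8, 10.5, 55.0, 10.1))

sum(is.na(raw$value))            # 1 missing value

# Impute missing values with the median of the observed ones.
clean <- raw
clean$value[is.na(clean$value)] <- median(raw$value, na.rm = TRUE)

# Flag outliers with the 1.5 * IQR rule.
q <- unname(quantile(clean$value, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
clean$outlier <- clean$value < q[1] - fence | clean$value > q[2] + fence
clean[clean$outlier, ]           # the row with value 55.0
```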

Data Exploration
● Look at the data (models, visualization)
– Model – regressions (linear, logistic),
clustering, ANOVA
– Refine models and plot the result
● Look for systematic issues – unexpected
trends, bias, unexplained variance, error
estimates, residual analysis
● Explore complexity – number of explanatory
factors
– Plot the models
● What does it look like?
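A sketch of the model-and-plot loop using the mtcars data set that ships with R. The choice of variables is an assumption made for illustration. The null PDF device only keeps the sketch runnable in batch mode; drop that line to see the plots.

```r
# Fit a simple linear regression, then a refinement with one more factor.
fit  <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)                        # intercept and slope
summary(fit)$r.squared           # share of variance explained
anova(fit, fit2)                 # does the extra factor help?

# Residual analysis: look for trends the model failed to capture.
res <- residuals(fit)

pdf(NULL)                        # null device; remove to draw on screen
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
abline(fit)                      # overlay the fitted line
plot(fitted(fit), res, xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)
invisible(dev.off())
```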

Reproducible Research
● Allows others to validate the work
● Ensures that the results are accepted
● Reduces the chance of errors propagating
– http://youtu.be/7gYIs7uYbMo
– 2010 Anil Potti resigns from Duke after
research was found flawed (off by 1!)
● Clinical trials based on the flawed research
were finally cancelled
● Closed data, non-reproducible research
exacerbated the problem

● Don't do things by hand – especially editing
spreadsheets to “clean up” data (removing
outliers, validating, editing) or downloading
files
● Actions taken by hand need very detailed
documentation to reproduce – such as
download sites and what files were
downloaded to
● GUIs are convenient, but can't be repeated

● Capture the steps in a script:
– download.file("http://...", "localfile.zip")
● Can be repeated as long as the link is
available. Can keep and manage the
downloaded file if that is an issue
– Use version control
● Capture small steps at a time (git is good
for this!)
● Can track changes and revert if needed
● Can use GitHub, Bitbucket, SourceForge to
publish the results as well
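The download step can be wrapped so that the script is repeatable and keeps the local copy once it exists. fetch_once is a hypothetical helper written for this sketch; the URL in the usage comment is the elided one from the slide, left as is.

```r
# Hypothetical helper: download only when the local copy is missing.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")   # "wb" keeps binary files intact
    message("Downloaded ", url, " on ", format(Sys.time()))
  }
  invisible(dest)
}

# Usage, with the real link filled in:
# fetch_once("http://...", "localfile.zip")
# unzip("localfile.zip", exdir = "data")
```

Logging the download time gives a record of which version of the remote file the analysis used.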

● Capture environment – OS, tools, versions
● Don't save outputs – regenerate
– Ok to cache results while in use, but don't
store the results, just the code+data that
produced it
– If you keep intermediate files, document
how they were created
● Set random seed
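Setting the seed and recording the environment takes only a few lines of base R. The seed value here is arbitrary.

```r
# The same seed gives the same "random" draws on every run.
set.seed(2014)
a <- rnorm(3)
set.seed(2014)
b <- rnorm(3)
identical(a, b)                  # TRUE

# Record the environment alongside the results.
R.version.string                 # R version
Sys.info()[["sysname"]]          # operating system
sessionInfo()                    # attached packages and their versions
```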

Sharing Research
● R Markdown – markdown with embedded R
– knitr package executes the R fragments
and embeds the code and results into
markdown, which can convert to HTML or
PDF
– Literate programming!
● Hosted documentation
– RPubs (rpubs.com)
– GitHub gh-pages (github.io)

Sharing Research
● Embedded presentations
– Author using slidify package
– R Markdown with embedded R code
– Creates HTML5 presentation slide deck
– Can include inline quizzes

Data Products
● Interactive visualizations
– shiny, shinyapps packages
– RStudio includes interactive display of
shiny applications during development
– Generates bootstrap + HTML5 + javascript
+ d3 application
● Hosted!
– Hosted at shinyapps.io
– Private? Server images available (for
purchase)
