What is R?
● “R is a language and environment for statistical computing and graphics”
● Paradigms: array, object-oriented, imperative, functional, procedural, reflective
● Everything resides in memory (no big data)
● Easy to get started!
Why R?
● Free Software (GNU General Public License)
● Mature: v1.0 was released in 2000
● Widely used
● Good documentation and manuals
● Lots of freely available packages
● Excellent graphics capabilities
Getting the data (CSV)
● MySQL
    SELECT * INTO OUTFILE '/path/to/file.csv'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
      FROM table WHERE <condition>;
● Hive + sed
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp_path/'
      SELECT * FROM table WHERE <condition>;
    cat /tmp_path/* | sed 's/[Ctrl-V][Ctrl-A]/\t/g' > out.txt
  (Hive separates fields with Ctrl-A by default; [Ctrl-V][Ctrl-A] are the keystrokes that type it literally in the shell, and sed replaces it with a tab.)
● Consider sampling!
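Once a file has been exported, loading and sampling it in R can look roughly like the sketch below. The file is a hypothetical stand-in written to a temporary path, and the column names `id` and `x` are made up for illustration; MySQL's `INTO OUTFILE` writes no header row, so the names are supplied by hand.

```r
# Hypothetical stand-in for the exported file: a header-less CSV in a temp path.
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:100, x = seq(0.5, 50, by = 0.5))
write.table(demo, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# Read the comma-separated export; column names are assumptions for this demo.
df <- read.csv(csv_path, header = FALSE, col.names = c("id", "x"))

# For the tab-separated Hive output, read.delim(path, header = FALSE)
# is the counterpart, since its default separator is a tab.

# Sampling: keep a random subset of rows so the data fits comfortably in memory.
set.seed(42)  # fixed seed so the sample is reproducible
sampled <- df[sample(nrow(df), size = 10), ]
```

Sampling on the database side (in the `WHERE` clause) avoids exporting the full table in the first place; the in-R version is handy once the file already exists.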
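Linear Regression
$$y = \alpha + \beta x$$
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}[x, y]}{\mathrm{Var}[x]}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
Just use lm() in R! (But check the assumptions)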
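As a quick check of the formulas, the sketch below computes the slope and intercept by hand and compares them with `lm()`. The data are simulated; the true coefficients (2 and 3), the sample size, and the noise level are arbitrary choices for illustration.

```r
# Simulated data: y depends linearly on x plus Gaussian noise (assumed values).
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)

# Closed-form estimates from the slide: slope = Cov[x, y] / Var[x],
# intercept = mean(y) - slope * mean(x).
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same fit with lm(); coef() returns the intercept first, then the slope.
fit <- lm(y ~ x)
print(coef(fit))
print(c(alpha_hat, beta_hat))

# Checking the assumptions: in an interactive session, the standard
# diagnostic plots (residuals vs fitted, normal Q-Q) are available via
# plot(fit, which = 1:2)
```

The two sets of estimates should agree up to floating-point error, since `cov()` and `var()` use the same n - 1 denominator and it cancels in the ratio.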
Want more?
● Computing for Data Analysis – Roger D. Peng
  www.coursera.org/course/compdata
● Statistics One – Andrew Conway
  www.coursera.org/course/stats1
● An Introduction to R – The R Core Team
  cran.r-project.org/doc/manuals/r-release/R-intro.pdf