Introduction to R
Arin Basu MD MPH
DataAnalytics
dataanalytics@rediffmail.com
http://dataanalytics.objectis.net
We’ll Cover
• What is R
• How to obtain and install R
• How to read and export data
• How to do basic statistical analyses
• Econometric packages in R
What is R
• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software
Obtaining R
• Current Version: R-2.0.0
• Comprehensive R Archive Network:
http://cran.r-project.org
• Binaries and source code
• Windows executables
• Compiled RPMs for Linux
• Can be obtained on a CD
Installing R
• Binary (Windows/Linux): One step process
– exe, rpm (Red Hat/Mandrake), apt-get (Debian)
• Linux, from sources:
$ tar -zxvf filename.tar.gz
$ cd filename
$ ./configure
$ make
$ make check
$ make install
Starting R
Windows: double-click the desktop icon
Linux: type R at the command prompt
$ R
Strengths and Weaknesses
• Strengths
– Free and Open Source
– Strong User Community
– Highly extensible, flexible
– Implementation of high end statistical methods
– Flexible graphics and intelligent defaults
• Weaknesses
– Steep learning curve
– Slow for large datasets
Basics
• Highly Functional
– Everything done through functions
– Arguments are matched by name or by position
– Abbreviations OK: argument names can be shortened,
and T stands for TRUE
• Object Oriented
– Everything is an object
– “<-” is an assignment operator
– “X <- 5”: X GETS the value 5
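The points on this slide can be tried directly at the prompt. The following is a minimal sketch; the variable names (x, v, m1, m2, s) are made up for illustration.

```r
# Assignment: x "gets" the value 5
x <- 5

# Everything is an object, and every object has a class
class(x)      # a number
class(mean)   # a function is an object too

# Arguments can be given by name
v  <- c(4, 8, NA, 15)
m1 <- mean(v, na.rm = TRUE)

# T is a predefined variable equal to TRUE (it can be overwritten,
# so TRUE is the safer spelling in scripts)
m2 <- mean(v, na.rm = T)

# Argument names can be abbreviated: 'length' partially matches 'length.out'
s <- seq(1, 10, length = 4)
```

Spelling argument names out in full is usually easier to read later, even though the abbreviated form works.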
Getting Help in R
• From Documentation:
– ?WhatIWantToKnow
– help("WhatIWantToKnow")
– help.search("WhatIWantToKnow")
– help.start()
– getAnywhere("WhatIWantToKnow")
– example("WhatIWantToKnow")
• Documents: “An Introduction to R”
• Active Mailing List
– Archives
– Directly Asking Questions on the List
Data Structures
• Supports virtually any type of data
• Numbers, characters, logicals (TRUE/ FALSE)
• Arrays of virtually unlimited sizes
• Simplest: Vectors and Matrices
• Lists: Can Contain mixed type variables
• Data Frame: Rectangular Data Set
Data Structure in R
                 Linear     Rectangular
All Same Type    VECTORS    MATRIX*
Mixed            LIST       DATA FRAME
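As a rough illustration of the four structures in the table, each one can be built with a single call. The object names and values below are invented for the example.

```r
# Vector: linear, all elements of one type
v <- c(1.5, 2.0, 3.5)

# Matrix: rectangular, all elements of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: linear, elements may be of mixed types
l <- list(name = "a", values = v, ok = TRUE)

# Data frame: rectangular, columns may be of different types
df <- data.frame(id = 1:3, score = v, passed = c(TRUE, FALSE, TRUE))

str(df)   # compact description of the structure
dim(m)    # rows and columns of the matrix
```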
Running R
• Directly in the Windowing System
(Console)
• Using Editors
– Notepad, WinEdt, Tinn-R: Windows
– XEmacs with ESS (Emacs Speaks Statistics)
• From the editor:
– source("filename.R")
– Outputs can be diverted by using
• sink("filename.Rout")
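A small sketch of running a script with source() and diverting its output with sink(). The script is written to a temporary file here only so the example is self-contained; in practice it would be a file saved from the editor.

```r
# Hypothetical script and output files, created in a temporary location
script <- tempfile(fileext = ".R")
out    <- tempfile(fileext = ".Rout")
writeLines(c("x <- 1:10", "print(mean(x))"), script)

sink(out)        # divert printed output to the file
source(script)   # run the script
sink()           # restore output to the console

readLines(out)   # what the script printed
```

Calling sink() with no arguments is what turns the diversion off again; forgetting it leaves the console silent.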
R Working Area
This is the area where all commands are issued and where
non-graphical output appears when R is run interactively
In an R Session…
• First, read data from other sources
• Use packages, libraries, and functions
• Write functions wherever necessary
• Conduct Statistical Data Analysis
• Save outputs to files, write tables
• Save R workspace if necessary (exit prompt)
Specific Tasks
• To see which directories and data are loaded,
type: search()
• To see which objects are stored, type: ls()
• To include a dataset in the searchpath for
analysis, type:
attach(NameOfTheDataset)
• To detach a dataset from the searchpath after
analysis, type:
detach(NameOfTheDataset)
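These commands can be tried with the built-in cars dataset. This is only a sketch; mean_speed is a name chosen for the example.

```r
search()        # packages and data currently on the search path
ls()            # objects stored in the workspace

attach(cars)                 # put the columns of 'cars' on the search path
mean_speed <- mean(speed)    # 'speed' is now found without writing cars$speed
detach(cars)                 # remove it again after the analysis
```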
Reading data into R
• R not well suited for data preprocessing
• Preprocess data elsewhere (SPSS, etc…)
• Easiest form of data to input: text file
• Spreadsheet like data:
– Small/medium size: use read.table()
– Large data: use scan()
• Read from other systems:
– Use the library “foreign”: library(foreign)
– Can import from SAS, SPSS, Epi Info
– Can export to Stata
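A minimal sketch of reading and writing a text file. The file contents and column names are invented, and the file is created in a temporary location so the example runs on its own.

```r
# Create a small whitespace-separated text file (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("id age sex",
             "1 34 female",
             "2 51 male",
             "3 28 female"), path)

# read.table() returns a data frame; header = TRUE uses the first line as names
dat <- read.table(path, header = TRUE)
str(dat)

# scan() reads a stream of values into a vector
nums <- scan(text = "1 2 3 4 5")

# write.table() exports a data frame back to text
outpath <- tempfile(fileext = ".txt")
write.table(dat, file = outpath, row.names = FALSE, quote = FALSE)
```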
Reading Data: summary
• Directly using a vector e.g.: x <- c(1,2,3…)
• Using scan and read.table function
• Using matrix function to read data matrices
• Using data.frame to read mixed data
• library(foreign) for data from other programs
Accessing Variables
• edit(<mydataobject>)
• Subscripts are essential tools
– x[1] identifies first element in vector x
– y[1,] identifies first row in matrix y
– y[,1] identifies first column in matrix y
• $ sign for lists and data frames
– myframe$age gets age variable of myframe
– attach(dataframe) lets you refer to variables by name
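The subscript and $ notations can be sketched on small made-up objects (x, y and myframe below are illustrative only).

```r
x <- c(10, 20, 30)
y <- matrix(1:6, nrow = 2)    # filled column by column
myframe <- data.frame(age = c(23, 35, 47), weight = c(60, 72, 81))

x[1]          # first element of vector x
y[1, ]        # first row of matrix y
y[, 1]        # first column of matrix y
myframe$age   # the age variable of myframe
```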
Subset Data
• Using subset function
– subset() will subset the dataframe
• Subscripting from data frames
– myframe[,1] gives first column of myframe
• Specifying a vector
– myframe[1:5,] gives first 5 rows of data
• Using logical expressions
– myframe[myframe[,1] < 5,] gets all rows in which the
first column contains values less than 5
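The subsetting approaches can be compared side by side. This is a sketch on an invented data frame; note that the comma inside the brackets separates the row selector from the column selector.

```r
myframe <- data.frame(score = c(2, 7, 4, 9, 1, 6),
                      group = c("a", "b", "a", "b", "a", "b"))

by_subset  <- subset(myframe, score < 5)      # using the subset function
first_col  <- myframe[, 1]                    # first column
first_rows <- myframe[1:5, ]                  # first 5 rows (note the comma)
by_logical <- myframe[myframe[, 1] < 5, ]     # rows where column 1 is below 5
```

Here subset() and the logical subscript select the same rows; subset() is often easier to read, while subscripts are more general.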
Graphics
• Plot an object, like: plot(num.vec)
– here plots against index numbers
• Plot sends to graphic devices
– can specify which graphic device you want
• postscript, pdf, png, jpeg, etc…
• you open a device before plotting and close it when done, like: dev.off()
• Two types of plotting
– High level: graphs drawn with one call
– Low level: add additional information to
existing graph
High Level: generated with plot()
Low Level: Scattergram with Lowess
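The two figure slides can be approximated as follows. This is a sketch that assumes the built-in cars dataset rather than the data shown in the original figures, and it writes to a temporary PDF file instead of the screen.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)                                   # open a graphics device

# High level: one call draws the whole scatterplot
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")

# Low level: add a lowess smoother to the existing graph
lines(lowess(cars$speed, cars$dist))

dev.off()                                  # close the device, finishing the file
```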
Programming in R
• Functions & Operators typically work on
entire vectors
• Expressions are grouped with {}
• Statements are separated by newlines; “;” is not
necessary
• You can write your own functions and use
them
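A short sketch of these points: operators acting on whole vectors, and a user-written function whose body is grouped with {}. The function name std_err is made up for the example.

```r
x <- c(1, 2, 3, 4)

doubled <- x * 2    # arithmetic applies to every element
above2  <- x > 2    # so do comparisons

# A user-defined function: standard error of the mean
std_err <- function(v) {
  n <- length(v)
  sd(v) / sqrt(n)   # value of the last expression is returned
}

se <- std_err(x)
```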
Statistical Functions in R
• Descriptive Statistics
• Statistical Modeling
– Regressions: Linear and Logistic
– Probit, Tobit Models
– Time Series
• Multivariate Functions
• Inbuilt Packages, contributed packages
Descriptive Statistics
• Has functions for all common statistics
• summary() gives the minimum, first quartile,
median, mean, third quartile, and maximum for
numeric variables
• stem() gives stem-leaf plots
• table() gives tabulation of categorical
variables
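The three functions can be tried on built-in datasets. The choice of cars and mtcars here is only for illustration.

```r
s <- summary(cars$speed)     # six-number summary of a numeric variable
s

stem(cars$speed)             # stem-and-leaf plot printed to the console

tab <- table(mtcars$cyl)     # counts for each category
tab
```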
Statistical Modeling
• Over 400 functions
– lm, glm, aov, ts
• Numerous libraries & packages
– survival (coxph), tree (recursive trees), nls, …
• Distinction between factors and regressors
– factors: categorical, regressors: continuous
– you must specify factors unless they are obvious
to R
– dummy variables for factors created automatically
• Use of data.frame makes life easy
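The factor and regressor distinction can be sketched with the built-in mtcars data. The number of cylinders is stored as a number, so it is not "obvious to R" that it is categorical and it has to be declared with factor(); the model itself is just an illustration.

```r
# wt is a continuous regressor; cyl is declared as a factor
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# One coefficient per non-reference level: the dummy variables
# were created automatically
coef(fit)
```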
How to model
• Specify your model like this:
– y ~ xi+ci, where
– y = outcome variable, xi = main explanatory
variables, ci = covariates, + = add terms
– Operators have special meanings
• + = add terms, : = interactions, / = nesting, so on…
• Modeling -- object oriented
– each modeling procedure produces objects
– classes and functions for each object
Synopsis of Operators
Operator    Usually means      In Formula means
+ or -      add or subtract    add or remove terms
*           multiplication     main effect and interactions
/           division           main effect and nesting
:           sequence           interaction only
^           exponentiation     limiting interaction depths
%in%        no specific        nesting only
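One way to see what the operators do inside a formula is to ask R which terms a formula expands to. The helper labels_of and the variable names y, a, b, c are hypothetical; no data is needed because the formulas are only parsed, not fitted.

```r
# Helper (illustrative): list the terms a formula expands to
labels_of <- function(f) attr(terms(f), "term.labels")

labels_of(y ~ a + b)           # "a" "b"
labels_of(y ~ a * b)           # "a" "b" "a:b"
labels_of(y ~ a:b)             # "a:b"
labels_of(y ~ a / b)           # "a" "a:b"
labels_of(y ~ (a + b + c)^2)   # "a" "b" "c" "a:b" "a:c" "b:c"
labels_of(y ~ a * b - a:b)     # "a" "b"
```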
Modeling Example: Regression
carReg <- lm(speed~dist, data=cars)
carReg becomes an object
To get a summary of this regression, we type
summary(carReg)
To get only the coefficients, we type
coef(carReg), or carReg$coef
Don’t want an intercept? Add 0, so
carReg <- lm(speed~0+dist, data=cars)
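Because each modeling procedure produces an object, the same generic functions can be applied to any fit. The sketch below reuses the regression from this slide; carReg0 and the new dist values passed to predict() are made up for illustration.

```r
carReg  <- lm(speed ~ dist, data = cars)       # with intercept
carReg0 <- lm(speed ~ 0 + dist, data = cars)   # without intercept

class(carReg)      # the class decides which methods are used
coef(carReg)       # intercept and slope
coef(carReg0)      # slope only

# Predicted speed for hypothetical stopping distances
predict(carReg, newdata = data.frame(dist = c(10, 50)))
```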
Multivariate Techniques
• Several Libraries available
– mva, Hmisc, glm
– MASS: discriminant analysis and multidimensional
scaling
• Econometrics packages
– dse: multivariate time series, state-space models
– ineq: measuring inequality, poverty estimation
– its: irregular time series
– sem: structural equation modeling
– and so on… [http://www.mayin.org/ajayshah/]
Summarizing…
• Effective data handling and storage
• Large, coherent set of tools for data analysis
• Good graphical facilities and display
– on screen
– on paper
• Well-developed, simple, effective programming language
For more resources, check out…
R home page
http://www.r-project.org
R discussion group
http://www.stat.math.ethz.ch/mailman/listinfo/r-help
Search Google for R and Statistics
For more information, contact
dataanalytics@rediffmail.com
