1. A BRIEF INTRO TO ‘R’ – APPLIED
STATS & TIME SERIES ANALYSIS
- Shanmukha Sreenivas P
2. THE R ENVIRONMENT
R is an integrated suite of software facilities for data
manipulation, calculation and graphical display. It includes:
• An effective data handling and storage facility
• A suite of operators for calculations on arrays, in particular
matrices
• A large, coherent, integrated collection of intermediate tools
for data analysis
• Graphical facilities for data analysis
• A well-developed, simple and effective programming
language (a dialect of ‘S’) which includes conditionals, loops,
user-defined recursive functions and I/O facilities.
3. “OPEN SOURCE”... THAT JUST
MEANS I DON’T HAVE TO PAY FOR
IT, RIGHT?
•No. Much more:
–Provides full access to algorithms and their implementation
–Ability to fix bugs and extend software
–Provides a forum allowing researchers to explore and
expand the methods used to analyze data
–Promotes reproducible research by providing open and
accessible tools
–Most of R is written in… R! This makes it quite easy to see
what functions are actually doing.
4. WHAT IS IT?
•R is an interpreted computer language.
–Most user-visible functions are written in R itself, calling upon a
smaller set of internal primitives.
–It is possible to interface procedures written in C, C++, or
Fortran for efficiency, and to write additional
primitives.
–System commands can be called from within R
•R is used for data manipulation, statistics, and graphics.
It is made up of:
– operators (+ - <- * %*% …) for calculations on arrays &
matrices
– large, coherent, integrated collection of functions
– facilities for making unlimited types of publication quality
graphics
– user written functions & sets of functions (packages); 800+
contributed packages so far & growing
5. R
ADVANTAGES
• Fast and free.
• State of the art: statistical researchers provide their
methods as R packages. SPSS and SAS are years behind R!
• 2nd only to MATLAB for graphics.
• Mx, WinBUGS, and other programs use or will use R.
• Active user community.
• Excellent for simulation, programming, computer-intensive
analyses, etc.
• Forces you to think about your analysis.
• Interfaces with database storage software (SQL).
DISADVANTAGES
• Not user friendly at the start: steep learning curve,
minimal GUI.
• No commercial support; figuring out correct methods or how
to use a function on your own can be frustrating.
• Easy to make mistakes and not know.
• Working with large datasets is limited by RAM.
• Data prep & cleaning can be messier & more mistake-prone
in R vs. SPSS or SAS.
6. TYPICAL R SESSION
Start R via the GUI or your favorite text editor.
Two windows:
• One or more new or existing scripts (text files); these will be saved
• Terminal (console): output & temporary input, usually unsaved
7. STATISTICAL METHODS
Statistics: “meaningful” quantities about a sample of
objects, things, persons, events, phenomena, etc.
Simple to complex issues. E.g.
Correlation
ANOVA
MANOVA
Regression – linear, multiple, logistic
LDA
PCA/ Factor Analysis
Frequency domain analysis
Econometric modelling (TSA)
Two main categories:
* Descriptive statistics
* Inferential statistics
8. DESCRIPTIVE STATISTICS
Use sample information to explain/make abstraction of
population “phenomena”.
Common “phenomena”:
* Association (e.g. σ_{1,2.3} = 0.75)
* Tendency (left-skew, right-skew)
* Causal relationship (e.g. if X, then, Y)
* Trend, pattern, dispersion, range
Used in non-parametric analysis
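As a rough illustration of the quantities listed above (tendency, dispersion, range, skew, association), here is a minimal sketch. It is written in Python with only the standard library rather than in R, the data values are invented for illustration, and the helper names `skewness` and `pearson_r` are hypothetical.

```python
import statistics

# Hypothetical sample data, invented purely for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 8.0, 10.0]


def skewness(values):
    """Moment skewness: > 0 suggests right-skew, < 0 suggests left-skew."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def pearson_r(a, b):
    """Pearson correlation coefficient, one measure of linear association."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


summary = {
    "mean": statistics.fmean(x),        # central tendency
    "median": statistics.median(x),     # central tendency, robust to outliers
    "range": max(x) - min(x),           # spread between extremes
    "stdev": statistics.stdev(x),       # sample standard deviation (dispersion)
    "skewness": skewness(x),            # direction of the skew
    "correlation_xy": pearson_r(x, y),  # association between x and y
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In R itself, built-in functions such as `mean()`, `median()`, `range()`, `sd()` and `cor()` cover most of these quantities.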
9. INFERENTIAL STATISTICS
Using sample statistics to infer some “phenomena” of
population parameters
Hypothesis Testing
Common “phenomena”: cause-and-effect
* One-way relationship - ANOVA
* Multi-directional relationship - MANOVA
Use parametric analysis
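To make the one-way ANOVA idea concrete, the sketch below computes the F statistic by hand. It uses Python's standard library rather than R; the three groups are invented data and `one_way_anova_f` is a hypothetical helper name. It stops at the F statistic, because turning it into a p-value needs the F distribution, which the standard library does not provide.

```python
import statistics

# Hypothetical measurements for three groups, invented for illustration.
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [6.0, 7.0, 8.0],
    "C": [9.0, 10.0, 11.0],
}


def one_way_anova_f(samples):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for s in samples for v in s]
    grand_mean = statistics.fmean(all_values)
    k = len(samples)      # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(s) * (statistics.fmean(s) - grand_mean) ** 2 for s in samples
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - statistics.fmean(s)) ** 2 for s in samples for v in s
    )

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova_f(list(groups.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F relative to the F distribution with these degrees of freedom suggests that at least one group mean differs. In R, `aov()` or `anova()` would normally be used for this instead of hand computation.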
10. COMMON MISTAKES (CONTD.) – “ABUSE OF
STATISTICS”
Data analysis techniques, by issue: example of abuse vs. correct technique

Issue: Measure the “influence” of a variable on another
  Example of abuse: Using partial correlation (e.g. Spearman coeff.)
  Correct technique: Using a regression parameter

Issue: Finding the “relationship” between one variable and another
  Example of abuse: Multi-dimensional scaling, Likert scaling
  Correct technique: Simple regression coefficient

Issue: To evaluate whether a model fits data better than the other
  Example of abuse: Using R²
  Correct technique: Many, among others Box-Cox χ² test for model equivalence

Issue: To evaluate accuracy of “prediction” of a model
  Example of abuse: Using R² and/or F-value
  Correct technique: Hold-out sample’s MAPE, MAD

Issue: “Compare” whether a group is different from another
  Example of abuse: Multi-dimensional scaling, Likert scaling
  Correct technique: Many, among others two-way ANOVA, χ², Z test

Issue: To determine whether a group of factors “significantly influence” the observed phenomenon
  Example of abuse: Multi-dimensional scaling, Likert scaling
  Correct technique: Many, among others MANOVA, regression
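The hold-out measures named above for judging prediction accuracy (MAPE and MAD) can be sketched as follows. This is an illustration in Python's standard library rather than R; the series is invented, and the "model" is deliberately trivial (it forecasts the training mean) so that the error measures are the focus. The function names `mad` and `mape` are hypothetical.

```python
def mad(actual, predicted):
    """Mean absolute deviation of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Hypothetical series, invented for illustration.
series = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 104.0, 96.0]

# Fit on the first six points, hold out the last two for evaluation.
train, holdout = series[:6], series[6:]

# Placeholder model: always forecast the training mean.
forecast = sum(train) / len(train)
predictions = [forecast] * len(holdout)

holdout_mad = mad(holdout, predictions)
holdout_mape = mape(holdout, predictions)
print(f"hold-out MAD:  {holdout_mad:.2f}")
print(f"hold-out MAPE: {holdout_mape:.2f}%")
```

The point of the hold-out split is that the errors are measured on observations the model never saw, which R² and the F-value computed on the fitting data do not capture.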
11. TIME SERIES ANALYSIS
A time series is a collection of observations made
sequentially in time.
12. STOCHASTIC PROCESSES USEFUL
IN MODELING TIME SERIES
(1) a purely random process,
(2) a random walk,
(3) a moving average (MA) process,
(4) an autoregressive (AR) process,
(5) an autoregressive moving average (ARMA)
process, and
(6) an autoregressive integrated moving
average (ARIMA)process.
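The six processes listed above can be simulated in a few lines each. The sketch below is in Python's standard library rather than R, restricted to first-order cases, and assumes Gaussian noise. The parameter values (`phi = 0.7`, `theta = 0.6`), the seed, and all function names are arbitrary choices for illustration.

```python
import random


def white_noise(n, rng, sigma=1.0):
    """(1) Purely random process: independent draws e_t ~ N(0, sigma^2)."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def cumulative_sum(series):
    """Integrate a series: x_t = x_{t-1} + s_t, starting from 0."""
    out, level = [], 0.0
    for s in series:
        level += s
        out.append(level)
    return out


def ma1(noise, theta):
    """(3) MA(1): x_t = e_t + theta * e_{t-1}, with e_{-1} taken as 0."""
    return [
        e + theta * (noise[t - 1] if t > 0 else 0.0)
        for t, e in enumerate(noise)
    ]


def ar1(noise, phi):
    """(4) AR(1): x_t = phi * x_{t-1} + e_t, with x_{-1} taken as 0."""
    out, prev = [], 0.0
    for e in noise:
        prev = phi * prev + e
        out.append(prev)
    return out


def arma11(noise, phi, theta):
    """(5) ARMA(1,1): x_t = phi * x_{t-1} + e_t + theta * e_{t-1}."""
    out, prev_x, prev_e = [], 0.0, 0.0
    for e in noise:
        prev_x = phi * prev_x + e + theta * prev_e
        prev_e = e
        out.append(prev_x)
    return out


def difference(series):
    """First difference: d_t = x_t - x_{t-1} (one element shorter)."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is repeatable
noise = white_noise(200, rng)

walk = cumulative_sum(noise)          # (2) random walk
ma = ma1(noise, theta=0.6)
ar = ar1(noise, phi=0.7)
arma = arma11(noise, phi=0.7, theta=0.6)
arima = cumulative_sum(arma)          # (6) ARIMA(1,1,1): integrated ARMA(1,1)

# Differencing an ARIMA(1,1,1) series once recovers the ARMA(1,1) series.
recovered = difference(arima)
print(max(abs(r - a) for r, a in zip(recovered, arma[1:])))
```

The "integrated" part of ARIMA is just the cumulative sum: a random walk is integrated white noise, and differencing undoes the integration. In R, `arima.sim()` generates such series and `arima()` fits them.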