Introduction to Data Analytics Techniques and their
Implementation in R
Dr Bruno Voisin
Irish Centre for High End Computing (ICHEC)
November 14, 2013
Introduction to analytics techniques 1
Outline
Preparing vs Processing
Preparing the Data
◮ Outliers
◮ Missing values
◮ R data types: numerical vs factors
◮ Reshaping data
Forecasting, Predicting, Classifying...
◮ Linear Regression
◮ K nearest neighbours
◮ Decision Trees
◮ Time Series
Going Further
◮ Ensembles of models
◮ Rattle
Introduction to analytics techniques 2
Preparing vs Processing
Before considering what mathematical models could fit your data, ask
yourself: "is my data ready for this?"
Pro-tip: the answer is no. Sorry. Chances are...
It’s "noisy".
It’s wrong.
It’s incomplete.
It’s not in shape.
Spending 90% of your time preparing data, 10% fitting models isn’t
necessarily a bad ratio!
Introduction to analytics techniques 3
Data preparation
Outliers
Missing values
R data types: numerical vs factors
R Reshaping
Introduction to analytics techniques 4
Outliers
Outliers are records with unusual values for an attribute or
combination of attributes. As a rule, we need to:
◮ detect them
◮ understand them (typo vs genuine but unusual value)
◮ decide what to do with them (remove them or not, correct them)
Introduction to analytics techniques 5
Detecting outliers: mean vs median
Both mean and median provide an expected ’typical’ value useful to
detect outliers.
Mean has some nice useful properties (standard deviation).
Median is more tolerant of outliers and asymmetrical data.
Rule of thumb:
◮ nicely symmetrical data with mean ≈ median: safe to use the mean.
◮ noisy, asymmetrical data where mean ≠ median: use the median.
Introduction to analytics techniques 6
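A quick illustration of the rule of thumb, as a sketch with made-up numbers:
> x <- c(10, 12, 11, 13, 12, 500)  # one gross outlier
> mean(x)
[1] 93
> median(x)
[1] 12
The single extreme value drags the mean far from the bulk of the data; the median barely moves.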
Detecting outliers: 2 standard deviations
> x <- iris$Sepal.Width
> sdx <- sd(x)
> m <- mean(x)
> iris[(m-2*sdx)>x | x>(m+2*sdx),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
61 5.0 2.0 3.5 1.0 versicolor
Introduction to analytics techniques 7
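The same check can be written with scale(), which standardises a vector to z-scores; an equivalent sketch, not from the original slides:
> z <- as.vector(scale(iris$Sepal.Width))  # (x - mean(x)) / sd(x)
> iris[abs(z) > 2,]
This returns the same five rows as above.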
Detecting outliers: the boxplot
Graphical representation of the median, the quartiles, and the most
extreme observations not considered outliers (the whiskers).
> data(iris)
> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)
Introduction to analytics techniques 8
Detecting outliers: the boxplot
Use identify to turn outliers into clickable dots and have R return
their indices:
> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)
> identify(array(2,length(iris[[2]])),iris$Sepal.Width)
[1] 16 33 34 61
> outliers <- identify(array(2,length(iris[[1]])),
iris$Sepal.Width)
> iris[outliers,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16 5.7 4.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
61 5.0 2.0 3.5 1.0 versicolor
Introduction to analytics techniques 9
Detecting outliers: the boxplot
For automated tasks, use the boxplot object itself:
> x <- iris$Sepal.Width
> bp <- boxplot( iris$Sepal.Width )
> iris[x %in% bp$out,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16 5.7 4.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
61 5.0 2.0 3.5 1.0 versicolor
Introduction to analytics techniques 10
Detecting outliers: Mk1 Eyeballs
Some weird cases will always show up that quick statistics won’t pick
up.
Visual approach: plot the data and spot the weird cases by eye, e.g.:
*******.........*...........*********
^
outlier
Introduction to analytics techniques 11
Understanding Outliers
No general rule, pretty much a domain-dependent task.
Data analyst/domain experts work together and identify genuine
record vs obvious errors (127 years old driver renting a car).
Class information is at the centre of automated classification: when
class labels are available, consider outliers relative to their own class.
Introduction to analytics techniques 12
Understanding Outliers
Iris example: for Setosa, of the three extreme Sepal.Width values,
only one is genuinely out of range. For Versicolor, the odd one out
disappears, while new outliers appear on other variables:
> par(mfrow = c(1,2))
> boxplot(iris[iris$Species=="setosa",c(1,2,3,4)], main="Setosa")
> boxplot(iris[iris$Species=="versicolor",c(1,2,3,4)], main="Versicolor")
Introduction to analytics techniques 13
Managing outliers
Incorrect data should be treated as missing (to be ignored or
simulated, see below).
Genuine but unusual data is processed according to context:
◮ it should generally be kept (sometimes the exceptions are precisely
the point of interest, ex: fraud detection)
◮ it may sometimes be removed (bad practice, but sometimes there’s
interest in modelling only the mainstream data)
Introduction to analytics techniques 14
Missing values
Missing values are represented in R by the special NA value.
Amusingly, ’NA’ in some data sets may mean ’North America’, ’North
American Airlines’, etc. Keep this in mind when importing/exporting
data.
Finding/counting them in a variable or data frame:
> library(mlbench)   # provides the BreastCancer data set
> data(BreastCancer)
> sum(is.na(BreastCancer[,7]))
[1] 16
> incomplete <- BreastCancer[!complete.cases(BreastCancer),]
> nrow(incomplete)
[1] 16
Introduction to analytics techniques 15
Strategies for missing values
removing NAs:
> nona <- BreastCancer[complete.cases(BreastCancer),]
replacing NAs:
◮ mean
> x <- iris$Sepal.Width
> x[sample(length(x),5)] <- NA
> x[is.na(x)] <- mean(x, na.rm=TRUE)
◮ median (can be grabbed from the boxplot), or the most common level
of a nominal variable
◮ value of the closest case in the other dimensions
◮ value decided by a domain expert (caveat: data mining aims at finding
what domain experts don’t already know)
◮ etc.
Introduction to analytics techniques 16
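Minimal sketches of two of the replacement strategies above (assuming Bare.nuclei is the BreastCancer column holding the 16 NAs seen earlier):
> # median imputation for a numeric variable:
> x[is.na(x)] <- median(x, na.rm=TRUE)
> # most common level for a nominal variable:
> f <- BreastCancer$Bare.nuclei
> f[is.na(f)] <- names(which.max(table(f)))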
R data types: numerical vs factors
Most techniques are number-crunching algorithms.
However, some techniques can manage discrete variables. R
modules generally require those to be stored as factors.
Discrete variables are a better fit for some techniques (decision trees):
◮ consider converting numerical variables into meaningful ranges (ex:
customer age ranges)
◮ integer variables can be used as either numerical or factor
Introduction to analytics techniques 17
Factor to numerical
as.numeric isn’t sufficient since it would simply return the factor
level indices of a variable. We need to ’translate’ each level into its value.
> library(mlbench)
> data(BreastCancer)
> f <- BreastCancer$Cell.shape[1:10]
> as.numeric(levels(f))[f]
[1] 1 4 1 8 1 10 1 2 1 1
Introduction to analytics techniques 18
Numerical to factor
Converting numerical to factor "as is" with as.factor:
> s <- c(21, 43, 55, 18, 21, 50, 20, 67, 36, 33, 36)
> as.factor(s)
[1] 21 43 55 18 21 50 20 67 36 33 36
Levels: 18 20 21 33 36 43 50 55 67
Converting numerical ranges to a factor with cut:
> cut(s, c(-Inf, 21, 26, 30, 34, 44, 54, 64, Inf), labels=
c("21 and Under", "22 to 26", "27 to 30", "31 to 34",
"35 to 44", "45 to 54", "55 to 64", "65 and Over"))
[1] 21 and Under 35 to 44     55 to 64     21 and Under 21 and Under
[6] 45 to 54     21 and Under 65 and Over  35 to 44     31 to 34
[11] 35 to 44
8 Levels: 21 and Under 22 to 26 27 to 30 31 to 34 35 to 44 ... 65 and Over
Introduction to analytics techniques 19
Reshaping
More often than not, the ’shape’ of the data as it comes won’t be
convenient.
Look at the following example:
> pop <- read.csv("http://2010.census.gov/2010census/data/pop
> pop <- pop[,1:12]
> colnames(pop)
[1] "STATE_OR_REGION" "X1910_POPULATION" "X1920_POPULATION"
[5] "X1940_POPULATION" "X1950_POPULATION" "X1960_POPULATION"
[9] "X1980_POPULATION" "X1990_POPULATION" "X2000_POPULATION"
> pop[1:10,]
STATE_OR_REGION X1910_POPULATION X1920_POPULATION
1 United States 92228531 106021568
2 Alabama 2138093 2348174
3 Alaska 64356 55036
4 Arizona 204354 334162
5 Arkansas 1574449 1752204
6 California 2377549 3426861
7 Colorado 799024 939629
8 Connecticut 1114756 1380631
9 Delaware 202322 223003
10 District of Columbia 331069 437571
Introduction to analytics techniques 20
Reshaping: melt
The reshape2 package provides convenient functions for reshaping
data:
> library(reshape2)
> colnames(pop) <- c("state", seq(1910, 2010, 10))
> mpop <- melt(pop, id.vars="state", variable.name="year",
value.name="population")
> mpop[1:10,]
state year population
1 United States 1910 92228531
2 Alabama 1910 2138093
3 Alaska 1910 64356
4 Arizona 1910 204354
5 Arkansas 1910 1574449
6 California 1910 2377549
7 Colorado 1910 799024
8 Connecticut 1910 1114756
9 Delaware 1910 202322
10 District of Columbia 1910 331069
This long format is also friendlier to a relational database table.
Introduction to analytics techniques 21
Reshaping: cast
acast and dcast reverse the melt, producing an array/matrix or a
data frame respectively:
> dcast(mpop, state~year, value.var="population")[1:10,]
state 1910 1920 1930 1940
1 Alabama 2138093 2348174 2646248 2832961
2 Alaska 64356 55036 59278 72524
3 Arizona 204354 334162 435573 499261
4 Arkansas 1574449 1752204 1854482 1949387
5 California 2377549 3426861 5677251 6907387
6 Colorado 799024 939629 1035791 1123296
7 Connecticut 1114756 1380631 1606903 1709242
8 Delaware 202322 223003 238380 266505
9 District of Columbia 331069 437571 486869 663091
10 Florida 752619 968470 1468211 1897414
Introduction to analytics techniques 22
Forecasting, Predicting, Classifying...
Ultimately, we’re trying to understand a behaviour from our data.
To this end, various mathematical models have been developed,
matching various known behaviours.
Each model will come with its own sweet/blind spots and its own
scaling issues when moving towards Big Data.
Today’s overview of models will cover: Linear Regression, kNN,
Decision Trees and basic Time Series, but there’s a lot more models
around...
Introduction to analytics techniques 23
Linear Regression
One of the simplest models.
Establish a linear relationship between variables, predicting one
variable’s value (the response) from the others (the predictors).
Intuitively, it’s all about drawing a line. But the right line.
Introduction to analytics techniques 24
Simple Linear Regression
> data(trees)
> plot(trees$Girth, trees$Volume)
Introduction to analytics techniques 25
Simple Linear Regression
> lm(formula=Volume~Girth, data=trees)
Call:
lm(formula = Volume ~ Girth, data = trees)
Coefficients:
(Intercept) Girth
-36.943 5.066
> abline(-36.943, 5.066)
Introduction to analytics techniques 26
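Rather than retyping the coefficients, the fitted model object can be passed straight to abline(), a small sketch:
> fit <- lm(formula=Volume~Girth, data=trees)
> plot(trees$Girth, trees$Volume)
> abline(fit)  # abline() reads the intercept and slope from the lm object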
Simple Linear Regression
For a response variable r and predictor variables p1, p2, . . ., pn the
lm() function generates a simple linear model based on a formula
object of the form:
r ∼ p1 + p2 + · · · + pn
Example: building a linear model using both Girth and Height as
predictors for a tree’s Volume:
> lm(formula=Volume~Girth+Height, data=trees)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Coefficients:
(Intercept) Girth Height
-57.9877 4.7082 0.3393
By default, lm() fits the model that minimizes the sum of squared
errors.
Introduction to analytics techniques 27
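Once fitted, a model can score new cases via predict(); a sketch where the Girth and Height values are made up:
> fit <- lm(formula=Volume~Girth+Height, data=trees)
> # prediction = -57.9877 + 4.7082*10 + 0.3393*80, about 16.2
> predict(fit, newdata=data.frame(Girth=10, Height=80))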
Linear Model Evaluation
> fit <- lm(formula=Volume~Girth+Height, data=trees)
> summary(fit)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Introduction to analytics techniques 28
Refining the model
Low-significance predictors add complexity to a model for little
gain.
The anova() function helps evaluate predictors.
The update() function allows us to remove predictors from the
model.
The step() function can repeat this task automatically, using a
different criterion (AIC).
Introduction to analytics techniques 29
Refining the model: anova()
> data(airquality)
> fit <- lm(formula = Ozone ~ . , data=airquality)
> anova(fit)
Analysis of Variance Table
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Solar.R 1 14780 14780 33.9704 6.216e-08 ***
Wind 1 39969 39969 91.8680 5.243e-16 ***
Temp 1 19050 19050 43.7854 1.584e-09 ***
Month 1 1701 1701 3.9101 0.05062 .
Day 1 619 619 1.4220 0.23576
Residuals 105 45683 435
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Sum Sq shows the reduction in the residual sum of squares as each
predictor is added. Small values contribute less.
Introduction to analytics techniques 30
Refining the model: update()
> fit2 <- update(fit, . ~ . - Day)
> summary(fit2)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
[...]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.05384 22.97114 -2.527 0.0130 *
Solar.R 0.04960 0.02346 2.114 0.0368 *
Wind -3.31651 0.64579 -5.136 1.29e-06 ***
Temp 1.87087 0.27363 6.837 5.34e-10 ***
Month -2.99163 1.51592 -1.973 0.0510 .
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 20.9 on 106 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
We removed the Day predictor from the model.
New model is slightly worse... but simpler...
Introduction to analytics techniques 31
Refining the model: step()
The step() function can automatically reduce the model:
> final <- step(fit)
Start: AIC=680.21
Ozone ~ Solar.R + Wind + Temp + Month + Day
Df Sum of Sq RSS AIC
- Day 1 618.7 46302 679.71
<none> 45683 680.21
- Month 1 1755.3 47438 682.40
- Solar.R 1 2005.1 47688 682.98
- Wind 1 11533.9 57217 703.20
- Temp 1 20845.0 66528 719.94
Step: AIC=679.71
Ozone ~ Solar.R + Wind + Temp + Month
Df Sum of Sq RSS AIC
<none> 46302 679.71
- Month 1 1701.2 48003 681.71
- Solar.R 1 1952.6 48254 682.29
- Wind 1 11520.5 57822 702.37
- Temp 1 20419.5 66721 718.26
Introduction to analytics techniques 32
K-nearest neighbours (KNN)
K-nearest neighbour classification is amongst the simplest
classification algorithms.
consists of classifying an element as the majority class of the k
elements of the learning set closest to it in the multidimensional
feature space.
no training needed.
classification can be compute-intensive for high k values (many
distances to evaluate) and requires access to learning data set.
very intuitive for end-user, but does not provide any insight into the
data.
Introduction to analytics techniques 33
An example
With k = 5, the central dot would be classified as red.
Introduction to analytics techniques 34
What value for k?
smaller k values are faster to process.
higher k values are more robust to noise.
n-fold cross validation can be used on incremental values of k to
select a k value that minimises error.
Introduction to analytics techniques 35
KNN with R
The knn function (package class) provides KNN classification for R.
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Arguments:
train: matrix or data frame of training set cases.
test: matrix or data frame of test set cases. A vector will be
      interpreted as a row vector for a single case.
cl: factor of true classifications of training set
k: number of neighbours considered.
l: minimum vote for definite decision, otherwise ’doubt’. (More
   precisely, less than ’k-l’ dissenting votes are allowed, even
   if ’k’ is increased by ties.)
prob: If this is true, the proportion of the votes for the winning
      class are returned as attribute ’prob’.
use.all: controls handling of ties. If true, all distances equal to
         the ’k’th largest are included. If false, a random selection
         of distances equal to the ’k’th is chosen to use exactly k
         neighbours.
Introduction to analytics techniques 36
Using knn()
> library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob=TRUE)
[1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c
[39] c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v
attr(,"prob")
[1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
[36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000
[57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000
[64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
Levels: c s v
Introduction to analytics techniques 37
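Since the test split follows the same 25/25/25 layout as the training split, cl doubles as the true test labels, so accuracy can be computed directly (a sketch):
> pred <- knn(train, test, cl, k = 3)
> mean(pred == cl)  # proportion of correctly classified test cases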
Counting errors with cross-validation for different k values
knn.cv performs leave-one-out cross-validation: it classifies each item
while leaving it out of the learning set.
> train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])
> cl <- factor(c(rep("s",50), rep("c",50), rep("v",50)))
> sum( (knn.cv(train, cl, k = 1) == cl) == FALSE )
[1] 6
> sum( (knn.cv(train, cl, k = 5) == cl) == FALSE )
[1] 5
> sum( (knn.cv(train, cl, k = 15) == cl) == FALSE )
[1] 4
> sum( (knn.cv(train, cl, k = 30) == cl) == FALSE )
[1] 8
Introduction to analytics techniques 38
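The four calls above generalise into a sweep over k (a sketch; knn.cv breaks ties at random, so repeated runs may differ slightly):
> errs <- sapply(1:30, function(k) sum(knn.cv(train, cl, k = k) != cl))
> which.min(errs)  # the k with the fewest leave-one-out errors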
Decision trees
A decision tree is a tree-structured representation of a dataset and its
class-relevant partitioning.
The root node ’contains’ the entire learning dataset.
Each non-terminal node is split according to a particular
attribute/value combination.
Class distribution in terminal nodes is used to assign a class
probability to new, unclassified data.
Human readable!
Introduction to analytics techniques 39
An example
Introduction to analytics techniques 40
Building the tree
A tree is built by successive partitioning.
Starting from the root, every attribute is considered for a potential
split of the data set.
For each attribute, every possible split is considered.
The "best split" is picked by comparing the resulting distribution of
classes in the generated child nodes.
Each child node is then considered for further partitioning, and so on
until:
◮ partitioning a node doesn’t improve the class distribution (ex: only 1
class represented in a node),
◮ a node’s "population" is too small (min split),
◮ a node’s potential partitioning would generate a child node with too
small a population (min bucket).
Introduction to analytics techniques 41
Decision trees with R: the rpart module
rpart is an R module providing functions for generating decision trees
(among other things).
rpart(formula, data, weights, subset, na.action = na.rpart,
method, model = FALSE, x = FALSE, y = TRUE,
parms, control, cost, ...)
formula: class ~ att1 + att2 + · · · + attn
data: name of dataframe whose columns include attributes used in
the formula.
weights: optional case weights.
subset: optional subsetting of the data set for use in the fit.
na.action: strategies for missing values.
method: defaults to "class", which applies to class-based decision
trees.
control: rpart control options (like min split/bucket, refer to
?rpart.control for details).
Introduction to analytics techniques 42
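For instance, the stopping rules can be tightened or relaxed through rpart.control; a sketch where the values 10 and 5 are arbitrary:
> library(rpart)
> model2 <- rpart(Species ~ ., data=iris, method="class",
control=rpart.control(minsplit=10, minbucket=5))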
Using rpart
> library(rpart)
> model <- rpart(Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width, data=iris)
> # textual representation of tree:
> model
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) Petal.Length< 2.45 50  0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
  6) Petal.Width< 1.75 54  5 versicolor (0.00000000 0.90740741 0.09259259) *
  7) Petal.Width>=1.75 46  1 virginica (0.00000000 0.02173913 0.97826087) *
Introduction to analytics techniques 43
Plotting the tree model
> # basic R graphics plot of tree:
> plot(model)
> text(model)
> # fancier postscript plot of tree:
> post(model, file="mytree.ps", title="Iris Classification")
Introduction to analytics techniques 44
rpart model classification
Use predict to apply a model to a data frame:
> unclassified <- iris[c(13,54,76,104,32,114,56),c(
"Sepal.Length","Sepal.Width", "Petal.Length", "Petal.Width")]
> predict(model, newdata=unclassified, type="prob")
setosa versicolor virginica
13 1 0.00000000 0.0000000
54 0 0.90740741 0.0925926
76 0 0.90740741 0.0925926
104 0 0.02173913 0.9782609
32 1 0.00000000 0.0000000
114 0 0.02173913 0.9782609
56 0 0.90740741 0.0925926
> predict(model, newdata=unclassified, type="vector")
13 54 76 104 32 114 56
1 2 2 3 1 3 2
> predict(model, newdata=unclassified, type="class")
13 54 76 104 32 114 56
setosa versicolor versicolor virginica setosa virginica versicolor
Levels: setosa versicolor virginica
Introduction to analytics techniques 45
rpart model evaluation
Use a confusion matrix to measure accuracy of predictions:
> pred <- predict(model, iris[,c(1,2,3,4)], type="class")
> conf <- table(pred, iris$Species)
> sum(diag(conf)) / sum(conf)
[1] 0.96
Introduction to analytics techniques 46
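Accuracy measured on the training data is optimistic. A holdout split gives a fairer estimate; a sketch where the 100/50 split and the seed are arbitrary:
> set.seed(1)
> idx <- sample(nrow(iris), 100)  # 100 rows for training
> m <- rpart(Species ~ ., data=iris[idx,])
> pred <- predict(m, iris[-idx,], type="class")
> conf <- table(pred, iris$Species[-idx])
> sum(diag(conf)) / sum(conf)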
Time Series
Another type of model, applying to time-related data.
Additive or multiplicative decomposition of signal into components.
Many models and parameters used to fit the series, to then be used
for forecasting.
Some automated fitting is available in R.
Introduction to analytics techniques 47
Time Series
R manages time series objects by default:
> data(AirPassengers)
> AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
Introduction to analytics techniques 48
Time Series
Simple things like changing the series frequency are handled natively
too:
> ts(AirPassengers, frequency=4, start=c(1949,1),
end=c(1960,4))
Qtr1 Qtr2 Qtr3 Qtr4
1949 112 118 132 129
1950 121 135 148 148
1951 136 119 104 118
1952 115 126 141 135
1953 125 149 170 170
1954 158 133 114 140
1955 145 150 178 163
1956 172 178 199 199
1957 184 162 146 166
1958 171 180 193 181
1959 183 218 230 242
1960 209 191 172 194
Introduction to analytics techniques 49
Time Series
As are plotting and decomposition:
> plot(AirPassengers)
> plot(decompose(AirPassengers))
Introduction to analytics techniques 50
Time Series Decomposition
In a simple seasonal time series, the signal can be decomposed into
three components that can then be analysed separately:
◮ the Trend component, that shows the progression of the series.
◮ the Seasonal component, that shows the periodic variation.
◮ the Irregular component, that shows the rest of the variations.
In an additive decomposition, our signal is
Trend + Seasonal + Irregular.
In a multiplicative decomposition, our signal is
Trend ∗ Seasonal ∗ Irregular.
Multiplicative decomposition makes sense when absolute differences in
values are of less interest than percentage changes.
A multiplicative signal can also be decomposed in an additive fashion
by working on log(data).
Introduction to analytics techniques 51
Additive/Multiplicative Decomposition
Our example shows typical multiplicative behaviour.
> plot(decompose(AirPassengers))
> plot(decompose(AirPassengers, type="multiplicative"))
Introduction to analytics techniques 52
Log of a multiplicative series
Using log() to decompose our series in an additive fashion:
> plot(log(AirPassengers))
> plot(decompose(log(AirPassengers)))
Introduction to analytics techniques 53
The ARIMA model
ARIMA stands for AutoRegressive Integrated Moving Average.
ARIMA is one of the most general classes of models for time series
forecasting.
An ARIMA model is characterized by three non-negative integer
parameters commonly called (p, d, q):
◮ p is the autoregressive order (AR).
◮ d is the integrated order (I).
◮ q is the moving average order (MA).
An ARIMA model with zero for some of those values is in fact a
simpler model, be it AR, MA or ARMA...
As with linear regression, an information criterion can be used to
evaluate which values of (p, d, q) provide the best fit.
A Seasonal ARIMA model (p, d, q) × (P, D, Q) has three additional
parameters modelling the seasonal behaviour of the series in the same
fashion.
Introduction to analytics techniques 54
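A specific (p, d, q) × (P, D, Q) can also be fitted by hand with the base arima() function; the order below is the classic Box-Jenkins "airline model" ARIMA(0,1,1)(0,1,1)[12], traditionally fitted on the log of this very series:
> fit <- arima(log(AirPassengers), order=c(0,1,1),
seasonal=list(order=c(0,1,1), period=12))
> AIC(fit)  # compare information criteria across candidate orders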
Automated ARIMA fitting and forecasting
The auto.arima() function will explore a range of values for
(p, d, q) × (P, D, Q) and return the best fitting model, which can
then be used for forecasting:
> library(forecast)
> fit <- auto.arima(AirPassengers)
> plot(forecast(fit, h=20))
Introduction to analytics techniques 55
Going further
Ensembles of models
Rattle
Introduction to analytics techniques 56
Ensembles of models
Models built with a specific set of parameters are limited in the data
relationships they can express.
The choice of model or initial parameters will create specific,
recurring misclassifications.
Solution: build several competing models and average their
classifications.
Some techniques are built around this idea, like random forests (see
the randomForest package in R).
Introduction to analytics techniques 57
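A minimal sketch with the randomForest package (assuming it is installed from CRAN):
> library(randomForest)
> rf <- randomForest(Species ~ ., data=iris, ntree=500)
> rf$confusion  # out-of-bag confusion matrix, one row per class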
Rattle
Rattle is a data mining framework for R. Installable as a CRAN
module, it features:
◮ Graphical user interface to common mining modules
◮ Full mining framework: data preprocessing, analysis, mining, validating
◮ Automatic generation of R code
In addition to fast hands-on data mining, the rattle log is a great R
learning resource.
Introduction paper at:
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
> install.packages("RGtk2")
> install.packages("rattle")
> library(rattle)
Rattle: Graphical interface for data mining using R.
Version 2.5.40 Copyright (c) 2006-2010 Togaware Pty Ltd.
Type ’rattle()’ to shake, rattle, and roll your data.
> rattle()
Introduction to analytics techniques 58
Rattle
Introduction to analytics techniques 59
The End
Thank you. :)
Introduction to analytics techniques 60

More Related Content

Similar to 2013.11.14 Big Data Workshop Bruno Voisin

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
Yanchang Zhao
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
062MayankSinghal
 
R_Proficiency.pptx
R_Proficiency.pptxR_Proficiency.pptx
R_Proficiency.pptx
Shivammittal880395
 
Methods of Unsupervised Learning (Article 10 - Practical Exercises)
Methods of Unsupervised Learning (Article 10 - Practical Exercises)Methods of Unsupervised Learning (Article 10 - Practical Exercises)
Methods of Unsupervised Learning (Article 10 - Practical Exercises)
Theodore Grammatikopoulos
 
software engineering modules iii & iv.pptx
software engineering  modules iii & iv.pptxsoftware engineering  modules iii & iv.pptx
software engineering modules iii & iv.pptx
rani marri
 
Cluto presentation
Cluto presentationCluto presentation
Cluto presentation
Roseline Antai
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
Mohammed El Rafie Tarabay
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
Parth Khare
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
Yanchang Zhao
 
Data mining
Data miningData mining
Data mining
Kani Selvam
 
R and data mining
R and data miningR and data mining
R and data mining
Chaozhong Yang
 
Slides ads ia
Slides ads iaSlides ads ia
Slides ads ia
Arthur Charpentier
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
AmanBhalla14
 
Descriptive Statistics, Numerical Description
Descriptive Statistics, Numerical DescriptionDescriptive Statistics, Numerical Description
Descriptive Statistics, Numerical Description
getyourcheaton
 
R Programming Homework Help
R Programming Homework HelpR Programming Homework Help
R Programming Homework Help
Statistics Homework Helper
 
IA-advanced-R
IA-advanced-RIA-advanced-R
IA-advanced-R
Arthur Charpentier
 
Human_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_ModelHuman_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_Model
David Ritchie
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
Chandrika Sweety
 
20100528
2010052820100528
20100528
byron zhao
 
20100528
2010052820100528
20100528
byron zhao
 

Similar to 2013.11.14 Big Data Workshop Bruno Voisin (20)

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
 
R_Proficiency.pptx
R_Proficiency.pptxR_Proficiency.pptx
R_Proficiency.pptx
 
Methods of Unsupervised Learning (Article 10 - Practical Exercises)
Methods of Unsupervised Learning (Article 10 - Practical Exercises)Methods of Unsupervised Learning (Article 10 - Practical Exercises)
Methods of Unsupervised Learning (Article 10 - Practical Exercises)
 
software engineering modules iii & iv.pptx
software engineering  modules iii & iv.pptxsoftware engineering  modules iii & iv.pptx
software engineering modules iii & iv.pptx
 
Cluto presentation
Cluto presentationCluto presentation
Cluto presentation
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
Data mining
Data miningData mining
Data mining
 
R and data mining
R and data miningR and data mining
R and data mining
 
Slides ads ia
Slides ads iaSlides ads ia
Slides ads ia
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Descriptive Statistics, Numerical Description
Descriptive Statistics, Numerical DescriptionDescriptive Statistics, Numerical Description
Descriptive Statistics, Numerical Description
 
R Programming Homework Help
R Programming Homework HelpR Programming Homework Help
R Programming Homework Help
 
IA-advanced-R
IA-advanced-RIA-advanced-R
IA-advanced-R
 
Human_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_ModelHuman_Activity_Recognition_Predictive_Model
Human_Activity_Recognition_Predictive_Model
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
20100528
2010052820100528
20100528
 
20100528
2010052820100528
20100528
 

More from NUI Galway

Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
NUI Galway
 
Tom Turner, Tipping the scales for labour in Ireland?
Tom Turner, Tipping the scales for labour in Ireland? Tom Turner, Tipping the scales for labour in Ireland?
Tom Turner, Tipping the scales for labour in Ireland?
NUI Galway
 
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
NUI Galway
 
Stephen Byrne, A non-employment index for Ireland
Stephen Byrne, A non-employment index for IrelandStephen Byrne, A non-employment index for Ireland
Stephen Byrne, A non-employment index for Ireland
NUI Galway
 
Sorcha Foster, The risk of automation of work in Ireland
Sorcha Foster, The risk of automation of work in IrelandSorcha Foster, The risk of automation of work in Ireland
Sorcha Foster, The risk of automation of work in Ireland
NUI Galway
 
Sinead Pembroke, Living with uncertainty: The social implications of precario...
Sinead Pembroke, Living with uncertainty: The social implications of precario...Sinead Pembroke, Living with uncertainty: The social implications of precario...
Sinead Pembroke, Living with uncertainty: The social implications of precario...
NUI Galway
 
Paul MacFlynn, A low skills equilibrium in Northern Ireland
Paul MacFlynn, A low skills equilibrium in Northern IrelandPaul MacFlynn, A low skills equilibrium in Northern Ireland
Paul MacFlynn, A low skills equilibrium in Northern Ireland
NUI Galway
 
Nuala Whelan, The role of labour market activation in building a healthy work...
Nuala Whelan, The role of labour market activation in building a healthy work...Nuala Whelan, The role of labour market activation in building a healthy work...
Nuala Whelan, The role of labour market activation in building a healthy work...
NUI Galway
 
Michéal Collins, and Dr Michelle Maher, Auto enrolment
 Michéal Collins, and Dr Michelle Maher, Auto enrolment Michéal Collins, and Dr Michelle Maher, Auto enrolment
Michéal Collins, and Dr Michelle Maher, Auto enrolment
NUI Galway
 
Michael Taft, A new enterprise model
Michael Taft, A new enterprise modelMichael Taft, A new enterprise model
Michael Taft, A new enterprise model
NUI Galway
 
Luke Rehill, Patterns of firm-level productivity in Ireland
Luke Rehill, Patterns of firm-level productivity in IrelandLuke Rehill, Patterns of firm-level productivity in Ireland
Luke Rehill, Patterns of firm-level productivity in Ireland
NUI Galway
 
Lucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
Lucy Pyne, Evidence from the Social Inclusion and Community Activation ProgrammeLucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
Lucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
NUI Galway
 
Lisa Wilson, The gendered nature of job quality and job insecurity
Lisa Wilson, The gendered nature of job quality and job insecurityLisa Wilson, The gendered nature of job quality and job insecurity
Lisa Wilson, The gendered nature of job quality and job insecurity
NUI Galway
 
Karina Doorley, axation, labour force participation and gender equality in Ir...
Karina Doorley, axation, labour force participation and gender equality in Ir...Karina Doorley, axation, labour force participation and gender equality in Ir...
Karina Doorley, axation, labour force participation and gender equality in Ir...
NUI Galway
 
Jason Loughrey, Household income volatility in Ireland
Jason Loughrey, Household income volatility in IrelandJason Loughrey, Household income volatility in Ireland
Jason Loughrey, Household income volatility in Ireland
NUI Galway
 
Ivan Privalko, What do Workers get from Mobility?
Ivan Privalko, What do Workers get from Mobility?Ivan Privalko, What do Workers get from Mobility?
Ivan Privalko, What do Workers get from Mobility?
NUI Galway
 
Helen Johnston, Labour market transitions: barriers and enablers
Helen Johnston, Labour market transitions: barriers and enablersHelen Johnston, Labour market transitions: barriers and enablers
Helen Johnston, Labour market transitions: barriers and enablers
NUI Galway
 
Gail Irvine, Fulfilling work in Ireland
Gail Irvine, Fulfilling work in IrelandGail Irvine, Fulfilling work in Ireland
Gail Irvine, Fulfilling work in Ireland
NUI Galway
 
Frank Walsh, Assessing competing explanations for the decline in trade union ...
Frank Walsh, Assessing competing explanations for the decline in trade union ...Frank Walsh, Assessing competing explanations for the decline in trade union ...
Frank Walsh, Assessing competing explanations for the decline in trade union ...
NUI Galway
 
Eamon Murphy, An overview of labour market participation in Ireland over the ...
Eamon Murphy, An overview of labour market participation in Ireland over the ...Eamon Murphy, An overview of labour market participation in Ireland over the ...
Eamon Murphy, An overview of labour market participation in Ireland over the ...
NUI Galway
 

More from NUI Galway (20)

Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
Vincenzo MacCarrone, Explaining the trajectory of collective bargaining in Ir...
 
Tom Turner, Tipping the scales for labour in Ireland?
Tom Turner, Tipping the scales for labour in Ireland? Tom Turner, Tipping the scales for labour in Ireland?
Tom Turner, Tipping the scales for labour in Ireland?
 
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
Tom McDonnell, Medium-term trends in the Irish labour market and possibilitie...
 
Stephen Byrne, A non-employment index for Ireland
Stephen Byrne, A non-employment index for IrelandStephen Byrne, A non-employment index for Ireland
Stephen Byrne, A non-employment index for Ireland
 
Sorcha Foster, The risk of automation of work in Ireland
Sorcha Foster, The risk of automation of work in IrelandSorcha Foster, The risk of automation of work in Ireland
Sorcha Foster, The risk of automation of work in Ireland
 
Sinead Pembroke, Living with uncertainty: The social implications of precario...
Sinead Pembroke, Living with uncertainty: The social implications of precario...Sinead Pembroke, Living with uncertainty: The social implications of precario...
Sinead Pembroke, Living with uncertainty: The social implications of precario...
 
Paul MacFlynn, A low skills equilibrium in Northern Ireland
Paul MacFlynn, A low skills equilibrium in Northern IrelandPaul MacFlynn, A low skills equilibrium in Northern Ireland
Paul MacFlynn, A low skills equilibrium in Northern Ireland
 
Nuala Whelan, The role of labour market activation in building a healthy work...
Nuala Whelan, The role of labour market activation in building a healthy work...Nuala Whelan, The role of labour market activation in building a healthy work...
Nuala Whelan, The role of labour market activation in building a healthy work...
 
Michéal Collins, and Dr Michelle Maher, Auto enrolment
 Michéal Collins, and Dr Michelle Maher, Auto enrolment Michéal Collins, and Dr Michelle Maher, Auto enrolment
Michéal Collins, and Dr Michelle Maher, Auto enrolment
 
Michael Taft, A new enterprise model
Michael Taft, A new enterprise modelMichael Taft, A new enterprise model
Michael Taft, A new enterprise model
 
Luke Rehill, Patterns of firm-level productivity in Ireland
Luke Rehill, Patterns of firm-level productivity in IrelandLuke Rehill, Patterns of firm-level productivity in Ireland
Luke Rehill, Patterns of firm-level productivity in Ireland
 
Lucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
Lucy Pyne, Evidence from the Social Inclusion and Community Activation ProgrammeLucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
Lucy Pyne, Evidence from the Social Inclusion and Community Activation Programme
 
Lisa Wilson, The gendered nature of job quality and job insecurity
Lisa Wilson, The gendered nature of job quality and job insecurityLisa Wilson, The gendered nature of job quality and job insecurity
Lisa Wilson, The gendered nature of job quality and job insecurity
 
Karina Doorley, axation, labour force participation and gender equality in Ir...
Karina Doorley, axation, labour force participation and gender equality in Ir...Karina Doorley, axation, labour force participation and gender equality in Ir...
Karina Doorley, axation, labour force participation and gender equality in Ir...
 
Jason Loughrey, Household income volatility in Ireland
Jason Loughrey, Household income volatility in IrelandJason Loughrey, Household income volatility in Ireland
Jason Loughrey, Household income volatility in Ireland
 
Ivan Privalko, What do Workers get from Mobility?
Ivan Privalko, What do Workers get from Mobility?Ivan Privalko, What do Workers get from Mobility?
Ivan Privalko, What do Workers get from Mobility?
 
Helen Johnston, Labour market transitions: barriers and enablers
Helen Johnston, Labour market transitions: barriers and enablersHelen Johnston, Labour market transitions: barriers and enablers
Helen Johnston, Labour market transitions: barriers and enablers
 
Gail Irvine, Fulfilling work in Ireland
Gail Irvine, Fulfilling work in IrelandGail Irvine, Fulfilling work in Ireland
Gail Irvine, Fulfilling work in Ireland
 
Frank Walsh, Assessing competing explanations for the decline in trade union ...
Frank Walsh, Assessing competing explanations for the decline in trade union ...Frank Walsh, Assessing competing explanations for the decline in trade union ...
Frank Walsh, Assessing competing explanations for the decline in trade union ...
 
Eamon Murphy, An overview of labour market participation in Ireland over the ...
Eamon Murphy, An overview of labour market participation in Ireland over the ...Eamon Murphy, An overview of labour market participation in Ireland over the ...
Eamon Murphy, An overview of labour market participation in Ireland over the ...
 

Recently uploaded

Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 

Recently uploaded (20)

Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 

2013.11.14 Big Data Workshop Bruno Voisin

  • 1. Introduction to Data Analytics Techniques and their Implementation in R Dr Bruno Voisin Irish Centre for High End Computing (ICHEC) November 14, 2013 Introduction to analytics techniques 1
  • 2. Outline Preparing vs Processing Preparing the Data ◮ Outliers ◮ Missing values ◮ R data types: numerical vs factors ◮ Reshaping data Forecasting, Predicting, Classifying... ◮ Linear Regression ◮ K nearest neighbours ◮ Decision Trees ◮ Time Series Going Further ◮ Ensembles of models ◮ Rattle Introduction to analytics techniques 2
  • 3. Preparing vs Processing Before considering what mathematical models could fit your data, ask yourself: ”is my data ready for this?” Pro-tip: the answer is no. Sorry. Chances are... It’s ”noisy”. It’s wrong. It’s incomplete. It’s not in shape. Spending 90% of your time preparing data, 10% fitting models isn’t necessarily a bad ratio! Introduction to analytics techniques 3
  • 4. Data preparation Outliers Missing values R data types: numerical vs factors R Reshaping Introduction to analytics techniques 4
  • 5. Outliers Outliers are records with unusual values for an attribute or combination of attributes. As a rule, we need to: ◮ detect them ◮ understand them (typo vs genuine but unusual value) ◮ decide what to do with them (remove them or not, correct them) Introduction to analytics techniques 5
  • 6. Detecting outliers: mean vs median Both mean and median provide an expected ’typical’ value useful to detect outlier. Mean has some nice useful properties (standard deviation). Median is more tolerant of outliers and asymetrical data. Rule of thumb: ◮ nicely symetrical data with mean ≈ median: safe to use mean. ◮ noisy, asymetrical data where mean = median: use median. Introduction to analytics techniques 6
  • 7. Detecting outliers: 2 standard deviations > x <- iris$Sepal.Width > sdx <- sd(x) > m <- mean(x) > iris[(m-2*sdx)>x | x>(m+2*sdx),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 15 5.8 4.0 1.2 0.2 setosa 16 5.7 4.4 1.5 0.4 setosa 33 5.2 4.1 1.5 0.1 setosa 34 5.5 4.2 1.4 0.2 setosa 61 5.0 2.0 3.5 1.0 versicolor Introduction to analytics techniques 7
  • 8. Detecting outliers: the boxplot Graphical representation of median, quartiles, and last observations not considered as outliers. > data(iris) > boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE) Introduction to analytics techniques 8
  • 9. Detecting outliers: the boxplot Use identify to turn outliers into clickable dots and have R return their indices: > boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE) > identify(array(2,length(iris[[2]])),iris$Sepal.Width) [1] 16 33 34 61 > outliers <- identify(array(2,length(iris[[1]])), iris$Sepal.Width) > iris[outliers,] Sepal.Length Sepal.Width Petal.Length Petal.Width Speci 16 5.7 4.4 1.5 0.4 seto 33 5.2 4.1 1.5 0.1 seto 34 5.5 4.2 1.4 0.2 seto 61 5.0 2.0 3.5 1.0 versicol Introduction to analytics techniques 9
  • 10. Detecting outliers: the boxplot For automated tasks, use the boxplot object itself: > x <- iris$Sepal.Width > bp <- boxplot( iris$Sepal.Width ) > iris[x %in% bp$out,] Sepal.Length Sepal.Width Petal.Length Petal.Width Speci 16 5.7 4.4 1.5 0.4 seto 33 5.2 4.1 1.5 0.1 seto 34 5.5 4.2 1.4 0.2 seto 61 5.0 2.0 3.5 1.0 versicol Introduction to analytics techniques 10
  • 11. Detecting outliers: Mk1 Eyeballs Some weird cases may always show up which quick stats won’t pick up. Visual approach: show visual identification of weird cases like: *******.........*...........********* ^ outlier Introduction to analytics techniques 11
  • 12. Understanding Outliers No general rule, pretty much a domain-dependent task. Data analyst/domain experts work together and identify genuine record vs obvious errors (127 years old driver renting a car). Class information is at the centre of automated classification. Consider outliers in regards to their own class if available. Introduction to analytics techniques 12
  • 13. Understanding Outliers Iris example: for Setosa, of the three extreme Sepal.Width values, only one genuinely out of range. For Versicolor, odd one out disappears. New outliers appear on other variables: > par(mfrow = c(1,2)) > boxplot(iris[iris$Species=="setosa",c(1,2,3,4)], main="Seto > boxplot(iris[iris$Species=="versicolor",c(1,2,3,4)], main=" Introduction to analytics techniques 13
  • 14. Managing outliers Incorrect data should be treated as missing (to be ignored or simulated, see below). Genuine but unusual data is processed according to context: ◮ should generally be kept (sometimes even, particular interest in the exceptions, ex: fraud detection) ◮ may eventually be removed (bad practice, but sometimes there’s interest in modelling only the mainstream data) Introduction to analytics techniques 14
  • 15. Missing values Missing values are represented in R by a special ‘NA‘ value. Amusingly, ’NA’ in some data sets may mean ’North America’, ’North American Airlines’, etc. Keep it in mind while importing/exporting data. Finding/counting them from a variable or data frame: > sum(is.na(BreastCancer[,7])) [1] 16 > incomplete <- BreastCancer[!complete.cases(BreastCancer),] > nrow(incomplete) [1] 16 Introduction to analytics techniques 15
  • 16. Strategies for missing values removing NAs: > nona <- BreastCancer[complete.cases(BreastCancer),] replacing NAs: ◮ mean > x <- iris$Sepal.Width > x[sample(length(x),5)] <- NA > x[is.na(x)] <- mean(x, na.rm=TRUE) ◮ median (can be grabbed from the boxplot), most common nominative variable ◮ value of closest case in other dimensions ◮ domain expert decided value (caveat: DM aims at finding unknowns from domain experts) ◮ etc. Introduction to analytics techniques 16
  • 17. R data types: numerical vs factors Mainly number crunching algorithms. However, discrete variables can be managed by some techniques. R modules generally require those to be stored as factors. Discrete variables better fit for some techniques (decision trees) ◮ consider conversion of numerical to meaningful ranges (ex: customer age range) ◮ integer variables can be used as either numerical or factor Introduction to analytics techniques 17
  • 18. Factor to numerical as.numeric isn’t sufficient since it would simply return the the factor levels of a variable. Need to ’translate’ the level into its value. > library(mlbench) > data(BreastCancer) > f <- BreastCancer$Cell.shape[1:10] > as.numeric(levels(f))[f] [1] 1 4 1 8 1 10 1 2 1 1 Introduction to analytics techniques 18
  • 19. Numerical to factor Converting numerical to factor ”as is” with as.factor: > s <- c(21, 43, 55, 18, 21, 50, 20, 67, 36, 33, 36) > as.factor(s) [1] 21 43 55 18 21 50 20 67 36 33 36 Levels: 18 20 21 33 36 43 50 55 67 Converting numerical ranges to a factor with cut: > cut(s, c(-Inf, 21, 26, 30, 34, 44, 54, 64, Inf), labels= c("21 and Under", "22 to 26", "27 to 30", "31 to 34", "35 to 44", "45 to 54", "55 to 64", "65 and Over")) [1] 21 and Under 35 to 44 55 to 64 21 and Under 21 a [6] 45 to 54 21 and Under 65 and Over 35 to 44 31 t [11] 35 to 44 8 Levels: 21 and Under 22 to 26 27 to 30 31 to 34 35 to 44 .. Introduction to analytics techniques 19
• 20. Reshaping More often than not, the ’shape’ of the data as it comes won’t be convenient. Look at the following example:
> pop <- read.csv("http://2010.census.gov/2010census/data/pop
> pop <- pop[,1:12]
> colnames(pop)
[1] "STATE_OR_REGION" "X1910_POPULATION" "X1920_POPULATION" "X1930_POPULATION"
[5] "X1940_POPULATION" "X1950_POPULATION" "X1960_POPULATION" "X1970_POPULATION"
[9] "X1980_POPULATION" "X1990_POPULATION" "X2000_POPULATION" "X2010_POPULATION"
> pop[1:10,]
STATE_OR_REGION X1910_POPULATION X1920_POPULATION ...
1 United States 92228531 106021568
2 Alabama 2138093 2348174
3 Alaska 64356 55036
4 Arizona 204354 334162
5 Arkansas 1574449 1752204
6 California 2377549 3426861
7 Colorado 799024 939629
8 Connecticut 1114756 1380631
9 Delaware 202322 223003
10 District of Columbia 331069 437571
• 21. Reshaping: melt The reshape2 package provides convenient functions for reshaping data:
> library(reshape2)
> colnames(pop) <- c("state", seq(1910, 2010, 10))
> mpop <- melt(pop, id.vars="state", variable.name="year", value.name="population")
> mpop[1:10,]
state year population
1 United States 1910 92228531
2 Alabama 1910 2138093
3 Alaska 1910 64356
4 Arizona 1910 204354
5 Arkansas 1910 1574449
6 California 1910 2377549
7 Colorado 1910 799024
8 Connecticut 1910 1114756
9 Delaware 1910 202322
10 District of Columbia 1910 331069
This ’long’ format is more friendly to a relational database table too.
• 22. Reshaping: cast acast and dcast reverse the melt and produce respectively an array/matrix or a data frame:
> dcast(mpop, state ~ year, value.var="population")[1:10,]
state 1910 1920 1930 1940 ...
1 Alabama 2138093 2348174 2646248 2832961
2 Alaska 64356 55036 59278 72524
3 Arizona 204354 334162 435573 499261
4 Arkansas 1574449 1752204 1854482 1949387
5 California 2377549 3426861 5677251 6907387
6 Colorado 799024 939629 1035791 1123296
7 Connecticut 1114756 1380631 1606903 1709242
8 Delaware 202322 223003 238380 266505
9 District of Columbia 331069 437571 486869 663091
10 Florida 752619 968470 1468211 1897414
• 23. Forecasting, Predicting, Classifying... Ultimately, we’re trying to understand a behaviour from our data. To this end, various mathematical models have been developed, matching various known behaviours. Each model comes with its own sweet/blind spots and its own scaling issues when moving towards Big Data. Today’s overview will cover Linear Regression, kNN, Decision Trees and basic Time Series, but there are many more models around...
• 24. Linear Regression One of the simplest models. Establish a linear relationship between variables, predicting one variable’s value (the response) from the others (the predictors). Intuitively, it’s all about drawing a line. But the right line.
• 25. Simple Linear Regression > data(trees) > plot(trees$Girth, trees$Volume)
• 26. Simple Linear Regression
> lm(formula=Volume~Girth, data=trees)
Call:
lm(formula = Volume ~ Girth, data = trees)
Coefficients:
(Intercept) Girth
-36.943 5.066
> abline(-36.943, 5.066)
• 27. Simple Linear Regression For a response variable r and predictor variables p1, p2, . . ., pn the lm() function generates a simple linear model based on a formula object of the form: r ∼ p1 + p2 + · · · + pn Example: building a linear model using both Girth and Height as predictors for a tree’s Volume:
> lm(formula=Volume~Girth+Height, data=trees)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Coefficients:
(Intercept) Girth Height
-57.9877 4.7082 0.3393
By default, lm() fits the model that minimizes the sum of squared errors.
• 28. Linear Model Evaluation
> fit <- lm(formula=Volume~Girth+Height, data=trees)
> summary(fit)
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
• 29. Refining the model Low-significance predictor attributes complicate a model for little gain. The anova() function helps evaluate predictors. The update() function allows us to remove predictors from the model. The step() function can automate such removals using an information criterion (AIC).
• 30. Refining the model: anova()
> data(airquality)
> fit <- lm(formula = Ozone ~ . , data=airquality)
> anova(fit)
Analysis of Variance Table
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Solar.R 1 14780 14780 33.9704 6.216e-08 ***
Wind 1 39969 39969 91.8680 5.243e-16 ***
Temp 1 19050 19050 43.7854 1.584e-09 ***
Month 1 1701 1701 3.9101 0.05062 .
Day 1 619 619 1.4220 0.23576
Residuals 105 45683 435
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Sum Sq shows the reduction in the residual sum of squares as each predictor is added. Small values contribute less.
• 31. Refining the model: update()
> fit2 <- update(fit, . ~ . - Day)
> summary(fit2)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
[...]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.05384 22.97114 -2.527 0.0130 *
Solar.R 0.04960 0.02346 2.114 0.0368 *
Wind -3.31651 0.64579 -5.136 1.29e-06 ***
Temp 1.87087 0.27363 6.837 5.34e-10 ***
Month -2.99163 1.51592 -1.973 0.0510 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.9 on 106 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
We removed the Day predictor from the model. The new model is slightly worse... but simpler...
• 32. Refining the model: step() The step() function can automatically reduce the model:
> final <- step(fit)
Start: AIC=680.21
Ozone ~ Solar.R + Wind + Temp + Month + Day
Df Sum of Sq RSS AIC
- Day 1 618.7 46302 679.71
<none> 45683 680.21
- Month 1 1755.3 47438 682.40
- Solar.R 1 2005.1 47688 682.98
- Wind 1 11533.9 57217 703.20
- Temp 1 20845.0 66528 719.94
Step: AIC=679.71
Ozone ~ Solar.R + Wind + Temp + Month
Df Sum of Sq RSS AIC
<none> 46302 679.71
- Month 1 1701.2 48003 681.71
- Solar.R 1 1952.6 48254 682.29
- Wind 1 11520.5 57822 702.37
- Temp 1 20419.5 66721 718.26
• 33. K-nearest neighbours (KNN) K-nearest neighbour classification is amongst the simplest classification algorithms. It classifies an element as the majority class of the k elements of the learning set closest to it in the multidimensional feature space. No training is needed. Classification can be compute-intensive for high k values (many distances to evaluate) and requires access to the learning data set. Very intuitive for end users, but does not provide any insight into the data.
• 34. An example With k = 5, the central dot would be classified as red.
• 35. What value for k? Smaller k values are faster to process. Higher k values are more robust to noise. n-fold cross-validation can be used on incremental values of k to select a k value that minimises error.
• 36. KNN with R The knn function (package class) provides KNN classification for R.
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Arguments:
train: matrix or data frame of training set cases.
test: matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
cl: factor of true classifications of training set.
k: number of neighbours considered.
l: minimum vote for definite decision, otherwise ’doubt’. (More precisely, less than ’k-l’ dissenting votes are allowed, even if ’k’ is increased by ties.)
prob: If this is true, the proportion of the votes for the winning class are returned as attribute ’prob’.
use.all: controls handling of ties. If true, all distances equal to the ’k’th largest are included. If false, a random selection of distances equal to the ’k’th is chosen to use exactly ’k’ neighbours.
• 37. Using knn()
> library(class)
> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
> knn(train, test, cl, k = 3, prob=TRUE)
[1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c ...
[39] c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v ...
attr(,"prob")
[1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 ...
[36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000 ...
[57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000 ...
[64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 ...
[71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
Levels: c s v
• 38. Counting errors with cross-validation for different k values knn.cv does leave-one-out cross-validation: it classifies each item while leaving it out of the learning set.
> train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])
> cl <- factor(c(rep("s",50), rep("c",50), rep("v",50)))
> sum( (knn.cv(train, cl, k = 1) == cl) == FALSE )
[1] 6
> sum( (knn.cv(train, cl, k = 5) == cl) == FALSE )
[1] 5
> sum( (knn.cv(train, cl, k = 15) == cl) == FALSE )
[1] 4
> sum( (knn.cv(train, cl, k = 30) == cl) == FALSE )
[1] 8
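Along the same lines, a compact sketch for scanning a whole range of k values in one go (exact error counts may vary slightly from run to run, since knn.cv breaks ties at random):
> errs <- sapply(1:30, function(k) sum(knn.cv(train, cl, k = k) != cl))
> which.min(errs)   # the k with the fewest leave-one-out errors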
• 39. Decision trees A decision tree is a tree-structured representation of a dataset and its class-relevant partitioning. The root node ’contains’ the entire learning dataset. Each non-terminal node is split according to a particular attribute/value combination. The class distribution in terminal nodes is used to assign a class probability to further unclassified data. Human readable!
• 40. An example
• 41. Building the tree A tree is built by successive partitioning. Starting from the root, every attribute is considered for a potential split of the data set. For each attribute, every possible split is considered. The ”best split” is picked by comparing the resulting distribution of classes in the generated child nodes. Each child node is then considered for further partitioning, and so on until: ◮ partitioning a node doesn’t improve the class distribution (ex: only 1 class represented in a node), ◮ a node’s ”population” is too small (min split), ◮ a node’s potential partitioning would generate a child node with too small a population (min bucket).
• 42. Decision trees with R: the rpart package rpart is an R package providing functions for generating decision trees (among other things).
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
formula: class ~ att1 + att2 + · · · + attn
data: name of the data frame whose columns include the attributes used in the formula.
weights: optional case weights.
subset: optional subsetting of the data set for use in the fit.
na.action: strategy for missing values.
method: ”class” for class-based decision trees (guessed from the response if not supplied).
control: rpart control options (like min split/bucket; refer to ?rpart.control for details, and see the sketch below).
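A minimal sketch of setting the min split/bucket stopping parameters from the previous slide explicitly (the values shown are rpart’s documented defaults, used here purely as an illustration):
> library(rpart)
> ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01)
> rpart(Species ~ ., data = iris, control = ctrl)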
• 43. Using rpart
> library(rpart)
> model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
> # textual representation of the tree:
> model
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
• 44. Plotting the tree model
> # basic R graphics plot of tree:
> plot(model)
> text(model)
> # fancier postscript plot of tree:
> post(model, file="mytree.ps", title="Iris Classification")
• 45. rpart model classification Use predict to apply a model to a data frame:
> unclassified <- iris[c(13,54,76,104,32,114,56), c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
> predict(model, newdata=unclassified, type="prob")
setosa versicolor virginica
13 1 0.00000000 0.0000000
54 0 0.90740741 0.0925926
76 0 0.90740741 0.0925926
104 0 0.02173913 0.9782609
32 1 0.00000000 0.0000000
114 0 0.02173913 0.9782609
56 0 0.90740741 0.0925926
> predict(model, newdata=unclassified, type="vector")
13 54 76 104 32 114 56
1 2 2 3 1 3 2
> predict(model, newdata=unclassified, type="class")
13 54 76 104 32 114 56
setosa versicolor versicolor virginica setosa virginica versicolor
Levels: setosa versicolor virginica
• 46. rpart model evaluation Use a confusion matrix to measure the accuracy of predictions (here measured on the training data itself, which is optimistic; see the holdout sketch below):
> pred <- predict(model, iris[,c(1,2,3,4)], type="class")
> conf <- table(pred, iris$Species)
> sum(diag(conf)) / sum(conf)
[1] 0.96
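A minimal holdout sketch for a less optimistic estimate (the 100/50 split and the seed are arbitrary choices, not from the slides):
> set.seed(42)                                    # arbitrary seed, for reproducibility
> idx <- sample(nrow(iris), 100)                  # 100 training cases, 50 held out
> m2 <- rpart(Species ~ ., data=iris[idx,])
> pred2 <- predict(m2, iris[-idx,], type="class")
> conf2 <- table(pred2, iris$Species[-idx])
> sum(diag(conf2)) / sum(conf2)                   # accuracy on unseen cases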
• 47. Time Series Another type of model, applying to time-related data. Additive or multiplicative decomposition of the signal into components. Many models and parameters are used to fit the series, which can then be used for forecasting. Some automated fitting is available in R.
• 48. Time Series R manages time series objects by default:
> data(AirPassengers)
> AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
• 49. Time Series Simple things like changing the series frequency are handled natively too:
> ts(AirPassengers, frequency=4, start=c(1949,1), end=c(1960,4))
Qtr1 Qtr2 Qtr3 Qtr4
1949 112 118 132 129
1950 121 135 148 148
1951 136 119 104 118
1952 115 126 141 135
1953 125 149 170 170
1954 158 133 114 140
1955 145 150 178 163
1956 172 178 199 199
1957 184 162 146 166
1958 171 180 193 181
1959 183 218 230 242
1960 209 191 172 194
• 50. Time Series As are plotting and decomposition: > plot(AirPassengers) > plot(decompose(AirPassengers))
• 51. Time Series Decomposition In a simple seasonal time series, the signal can be decomposed into three components that can then be analysed separately: ◮ the Trend component, which shows the progression of the series. ◮ the Seasonal component, which shows the periodic variation. ◮ the Irregular component, which shows the rest of the variation. In an additive decomposition, our signal is Trend + Seasonal + Irregular. In a multiplicative decomposition, our signal is Trend ∗ Seasonal ∗ Irregular. Multiplicative decomposition makes sense when absolute differences in values are of less interest than percentage changes. A multiplicative signal can also be decomposed in an additive fashion by working on log(data).
• 52. Additive/Multiplicative Decomposition Our example shows typical multiplicative behaviour. > plot(decompose(AirPassengers)) > plot(decompose(AirPassengers, type="multiplicative"))
• 53. Log of a multiplicative series Using log() to decompose our series in additive fashion: > plot(log(AirPassengers)) > plot(decompose(log(AirPassengers)))
• 54. The ARIMA model ARIMA stands for AutoRegressive Integrated Moving Average. ARIMA is one of the most general classes of models for time series forecasting. An ARIMA model is characterized by three non-negative integer parameters commonly called (p, d, q): ◮ p is the autoregressive order (AR). ◮ d is the integrated order (I). ◮ q is the moving average order (MA). An ARIMA model with zero for some of those values is in fact a simpler model, be it AR, MA or ARMA... As with linear regression, an information criterion can be used to evaluate which values of (p, d, q) provide a better fit. A Seasonal ARIMA model (p, d, q) × (P, D, Q) has three additional parameters modelling the seasonal behaviour of the series in the same fashion.
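As a sketch of comparing candidate orders by information criterion, using base R’s arima() on the log-transformed series (the two orders below, including the classic (0, 1, 1) × (0, 1, 1)12 ’airline’ model, are illustrative choices, not from the slides):
> fit1 <- arima(log(AirPassengers), order=c(0,1,1), seasonal=list(order=c(0,1,1), period=12))
> fit2 <- arima(log(AirPassengers), order=c(1,1,0), seasonal=list(order=c(1,1,0), period=12))
> AIC(fit1); AIC(fit2)   # the lower AIC suggests the better fit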
• 55. Automated ARIMA fitting and forecasting The auto.arima() function will explore a range of values for (p, d, q) × (P, D, Q) and return the best fitting model, which can then be used for forecasting: > library(forecast) > fit <- auto.arima(AirPassengers) > plot(forecast(fit, h=20))
• 56. Going further Ensembles of models Rattle
• 57. Ensembles of models Models built with a specific set of parameters have a limit to the data relationships they can express. The choice of model or initial parameters will create specific recurring misclassifications. Solution: build several competing models and average their classifications. Some techniques are built around this idea, like random forests (see the randomForest package in R, sketched below).
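A minimal random forest sketch, assuming the randomForest package is installed (the seed and tree count are illustrative):
> library(randomForest)
> set.seed(1)
> rf <- randomForest(Species ~ ., data=iris, ntree=500)
> rf$confusion   # out-of-bag confusion matrix and per-class error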
• 58. Rattle Rattle is a data mining framework for R. Installable as a CRAN package, it features: ◮ a graphical user interface to common mining methods ◮ a full mining framework: data preprocessing, analysis, mining, validating ◮ automatic generation of R code In addition to fast hands-on data mining, the Rattle log is a great R learning resource. Introduction paper at: http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
> install.packages("RGtk2")
> install.packages("rattle")
> library(rattle)
Rattle: Graphical interface for data mining using R.
Version 2.5.40 Copyright (c) 2006-2010 Togaware Pty Ltd.
Type ’rattle()’ to shake, rattle, and roll your data.
> rattle()
• 60. The End Thank you. :)